AISE-Bench

Introduction

Large language models (LLMs) augmented with tools are emerging as autonomous agents capable of using web search engines, APIs, and code to solve complex, long-horizon tasks. Current tool-use benchmarks for information seeking on academic graphs rely on synthetic templates, simplified solution spaces, or narrow (for example, paper-centric) tasks, leaving key challenges underexplored: realistic user intent, complex multi-step API planning, rich parameter filling for APIs, low-hallucination answers with references, and comprehensive evaluation of both the process and the outcome. We introduce AISE-Bench, a real, full-cycle annotated benchmark for information seeking on academic knowledge graphs. Each sample includes a validated query taxonomy, full API execution trajectories, and source-grounded answers with reference links. To support high-quality annotation, we design an agent-assisted interface that lets annotators plan, execute, and revise complex API workflows. We develop a comprehensive evaluation protocol measuring answer quality, reference grounding, API-planning correctness, hallucination, and execution success. Among the 14 evaluated methods, even the strongest (PLAY2PROMPT with Gemini-3-pro-preview) achieves only moderate performance and often struggles with API planning and execution. AISE-Bench establishes a challenging new testbed for advancing more accurate, trustworthy, and interpretable multi-step API-using LLM agents.

Leaderboard

Main evaluation results on the test set.

Category	Model	References and Formatting			API-based Judge			Answer Content			
Category	Model	Precision	Recall	Format	Edit Dist.	Para. Acc.	Success	Correct.	Complete.	Faithful.	F1-LM
CAW	DeepSeek-V3.2 (DeepSeek-AI)	0.3544	0.3461	0.78	1.56	0.4453	0.8267	0.4571	0.4729	0.8355	0.4649
CAW	GLM-4.7 (Z.ai)	0.1905	0.1659	0.4067	1.8467	0.3474	0.8533	0.3727	0.3510	0.6168	0.3615
CAW	Qwen3-235B-A22B (Alibaba)	0.4416	0.3524	0.8467	1.84	0.4131	0.9133	0.4936	0.4778	0.7607	0.4856
CAW	GPT-5.2 (OpenAI)	0.3008	0.3167	0.8467	5.4667	0.3432	0.62	0.4368	0.4562	0.787	0.4463
CAW	Gemini-3-Pro (Google)	0.4109	0.4342	0.74	1.2867	0.4242	0.7867	0.5721	0.5495	0.7907	0.5606
CAW	Claude-4.5 (Anthropic)	0.1564	0.1072	0.1467	1.7733	0.3632	0.7733	0.3666	0.3328	0.6705	0.3489
API-Using Agent	ReAct	0.343	0.3779	0.7333	4.8267	0.2705	0.9933	0.5923	0.6015	0.7402	0.5969
API-Using Agent	AvaTaR	0.4313	0.4639	0.79	1.3867	0.3522	0.9267	0.6046	0.5894	0.826	0.5969
API-Using Agent	DRAFT	0.4199	0.4545	0.7667	1.3333	0.412	0.92	0.5873	0.5819	0.8217	0.5846
API-Using Agent	PLAY2PROMPT	0.4308	0.4881	0.8333	1.5267	0.3968	0.9	0.6104	0.609	0.8542	0.610
Coding Agent	CodeAct	0.4022	0.4313	0.8	1.3467	0.4047	0.9467	0.5144	0.5123	0.9295	0.5130
Coding Agent	SoAy	0.4275	0.4306	0.8067	1.3067	0.3934	0.9667	0.541	0.5008	0.7225	0.5201
Deep Research Agent	Perplexity (Perplexity AI)	/	/	/	/	/	/	0.3692	0.4251	/	0.3952
Deep Research Agent	Metaso (Metaso)	/	/	/	/	/	/	0.2688	0.3025	/	0.2847
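
The F1-LM values in the table appear consistent with the harmonic mean of the Correct. and Complete. columns. This is an observation from the reported numbers, not a definition stated on this page. A short Python check on three rows, with the function name `harmonic_f1` chosen only for illustration:

```python
def harmonic_f1(correctness: float, completeness: float) -> float:
    """Harmonic mean of two scores; returns 0.0 when both are zero."""
    total = correctness + completeness
    if total == 0:
        return 0.0
    return 2 * correctness * completeness / total


# (Correct., Complete., reported F1-LM) copied from the leaderboard rows.
rows = {
    "DeepSeek-V3.2": (0.4571, 0.4729, 0.4649),
    "Qwen3-235B-A22B": (0.4936, 0.4778, 0.4856),
    "ReAct": (0.5923, 0.6015, 0.5969),
}

for name, (correct, complete, reported) in rows.items():
    computed = harmonic_f1(correct, complete)
    print(f"{name}: computed={computed:.4f} reported={reported:.4f}")
```

For these rows the computed and reported values agree to four decimal places. The reported scores may still be averaged per query rather than derived from the column means, so small deviations on other rows would not be surprising.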

Comparison of academic search benchmarks

The Entity column denotes the type of academic entity targeted by each QA task. In the Evaluation Metrics column, "LLM" indicates LLM-based evaluation of semantic correctness, "citations" indicates that the references or links used in the answer are evaluated, and "process" indicates that the reasoning process is evaluated. In the Annotation Modules column, "evidence" indicates that the supporting context for the answer must be located in the original text, "citations" indicates that the answer must provide its referenced sources, and "API paths" indicates that the API call traces, including API inputs and outputs, are annotated.

AISE-Bench construction pipeline

We first perform filtering and sampling based on real user queries from AMiner. Then, using a customized agent workflow (CAW), we generate initial API plans and answers for each query. Annotators refine the CAW-based plans and answers, and each annotated query is verified by at least one reviewer.

Distribution of retrieval questions across four dimensions

(a) User Intention: Search Org. = Search Organization; (b) Knowledge Level: Know. Mem. = Knowledge Memorization, and Knowledge Understanding is further categorized into Examples, Comparison, Comprehension, and Summarization; (c) Planning Steps: the number of API call steps; (d) Discipline: first-level discipline.

Overview of the API ecosystem

(a) API Library: Detailed specifications of available search and retrieval functions, including their types, input parameters, and return values. (b) API Graph: Schematic representation of the interaction flow between core entities (Paper, Author, Venue, and Org), where numbered labels correspond to the API IDs defined in the library.
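
One way to picture the API graph is as a set of directed edges between entity types, each labeled with the API that moves from one entity to the other. The Python sketch below is illustrative only: the four entity types come from the description above, but every API name and every edge is hypothetical and does not reflect the actual AISE-Bench library. It uses breadth-first search to find a shortest chain of API calls between two entity types.

```python
from collections import deque
from typing import List, Optional

# Hypothetical edges: (source entity, target entity) -> illustrative API name.
# Entity types follow the text; the API names are invented for this sketch.
API_EDGES = {
    ("Author", "Paper"): "get_papers_by_author",
    ("Paper", "Author"): "get_authors_of_paper",
    ("Paper", "Venue"): "get_venue_of_paper",
    ("Venue", "Paper"): "get_papers_in_venue",
    ("Author", "Org"): "get_org_of_author",
    ("Org", "Author"): "get_authors_in_org",
}


def shortest_api_chain(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search over entity types.

    Returns the API names along a shortest path, or None if unreachable.
    """
    if start == goal:
        return []
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        entity, chain = queue.popleft()
        for (src, dst), api in API_EDGES.items():
            if src != entity or dst in visited:
                continue
            if dst == goal:
                return chain + [api]
            visited.add(dst)
            queue.append((dst, chain + [api]))
    return None


print(shortest_api_chain("Org", "Venue"))
# ['get_authors_in_org', 'get_papers_by_author', 'get_venue_of_paper']
```

Under these assumed edges, a question that starts from an organization and asks about venues needs three API calls, which is the kind of multi-step plan the Planning Steps dimension counts.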

Customized Agent Workflow

Overview of the Customized Agent Workflow (CAW) framework, which consists of three main components: a planner, a task executor, and a synthesizer. The modular multi-agent design allows annotators to independently modify or customize individual modules.
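
The planner, executor, and synthesizer split can be sketched as three swappable callables. This is a minimal Python illustration under stated assumptions, not the CAW implementation: the class names `ApiCall` and `Workflow`, the toy modules, and the API name `search_author` are all hypothetical, and each call is executed independently here, whereas a real workflow would likely feed earlier outputs into later calls.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List


@dataclass
class ApiCall:
    name: str               # hypothetical API identifier
    params: Dict[str, Any]  # filled-in parameters for the call


@dataclass
class Workflow:
    planner: Callable[[str], List[ApiCall]]       # query -> API plan
    executor: Callable[[ApiCall], Any]            # API call -> raw output
    synthesizer: Callable[[str, List[Any]], str]  # query + outputs -> answer

    def run(self, query: str) -> str:
        plan = self.planner(query)
        outputs = [self.executor(call) for call in plan]
        return self.synthesizer(query, outputs)


# Toy modules standing in for LLM-backed components.
def toy_planner(query: str) -> List[ApiCall]:
    return [ApiCall("search_author", {"name": query})]


def toy_executor(call: ApiCall) -> Dict[str, Any]:
    return {"api": call.name, "params": call.params}


def toy_synthesizer(query: str, outputs: List[Any]) -> str:
    return f"{query}: {len(outputs)} API result(s)"


workflow = Workflow(toy_planner, toy_executor, toy_synthesizer)
print(workflow.run("example author"))

# Swap a single module without touching the other two.
custom = replace(
    workflow,
    synthesizer=lambda q, outs: f"{outs[0]['api']} answered '{q}'",
)
print(custom.run("example author"))
```

Keeping each module behind a plain function signature is what makes the independent customization described above cheap: replacing one module leaves the other two untouched.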

Evaluation Metrics

We propose a multi-dimensional evaluation framework covering reasoning, answer quality, and references. For API calling and planning, we measure graph edit distance and execution success rate. For references and formatting, we compare model-generated citations with gold annotations using precision and recall, and assess compliance with the required JSON output format. For answer content, we evaluate correctness and completeness, analogous to precision and recall against reference answers, and measure faithfulness by detecting inconsistencies with the API outputs.
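
The reference and formatting metrics can be sketched in Python as follows. This is a simplified illustration under assumptions that are not confirmed by this page: citations are compared by exact string match on their links, and the required JSON output is taken to be an object with the hypothetical keys `answer` and `references`.

```python
import json
from typing import Iterable, Sequence, Tuple


def citation_precision_recall(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float]:
    """Set-based precision and recall of predicted reference links (exact match)."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


def is_valid_format(
    raw_output: str, required_keys: Sequence[str] = ("answer", "references")
) -> bool:
    """True if the output parses as a JSON object containing the assumed keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


# Placeholder links, not real benchmark data.
predicted = ["https://example.org/p1", "https://example.org/p2", "https://example.org/p9"]
gold = ["https://example.org/p1", "https://example.org/p2",
        "https://example.org/p3", "https://example.org/p4"]

precision, recall = citation_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(is_valid_format('{"answer": "text", "references": []}'))
print(is_valid_format("not json"))
```

In this example two of three predicted links are in the gold set and two of four gold links are recovered. A production scorer would probably normalize links before comparing them, which this sketch omits.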

Results across Question Types

The F1-LM score of representative methods on different types of problems (user intent understanding and API planning difficulty).

The performance of representative methods on queries of knowledge memorization and understanding.

Case Studies

Representative case studies from the AISE-Bench test set.

AISE-Bench

A Full-Cycle Curated Benchmark for Information Seeking on Academic Knowledge Graphs
