AISE-Bench

A Full-Cycle Curated Benchmark for Information Seeking on Academic Knowledge Graphs

Introduction

Large language models (LLMs) augmented with tools are emerging as autonomous agents capable of using web search engines, APIs, and code to solve complex, long-horizon tasks. Current tool-using benchmarks for information seeking on academic graphs rely on synthetic templates, simplified solution spaces, or narrow tasks such as paper-centric tasks, leaving key challenges underexplored: realistic user intent, complex multi-step API planning, rich parameter filling for APIs, low-hallucination answers with references, and comprehensive evaluation of both the process and the outcome. We introduce AISE-Bench, a real, full-cycle annotated benchmark for information seeking on academic knowledge graphs. Each sample includes a validated query taxonomy, full API execution trajectories, and source-grounded answers with reference links. To support high-quality annotation, we design an agent-assisted interface that lets annotators plan, execute, and revise complex API workflows. We develop a comprehensive evaluation protocol measuring answer quality, reference grounding, API-planning correctness, hallucination, and execution success. Among the 14 evaluated methods, even the strongest (PLAY2PROMPT with Gemini-3-pro-preview) achieves only moderate performance and often struggles with API planning and execution. AISE-Bench establishes a challenging new testbed for advancing more accurate, trustworthy, and interpretable multi-step API-using LLM agents.

Leaderboard

Main evaluation results on the test set.

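Column groups, left to right: References and Formatting (Precision, Recall, Format); API-based Judge (Edit Dist., Para. Acc., Success); Answer Content (Correct., Complete., Faithful., F1-LM).

| # | Model | Precision | Recall | Format | Edit Dist. | Para. Acc. | Success | Correct. | Complete. | Faithful. | F1-LM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CAW | Deepseek-V3.2 (DeepSeek-AI) | 0.3544 | 0.3461 | 0.78 | 1.56 | 0.4453 | 0.8267 | 0.4571 | 0.4729 | 0.8355 | 0.4649 |
| CAW | GLM-4.7 (Z.ai) | 0.1905 | 0.1659 | 0.4067 | 1.8467 | 0.3474 | 0.8533 | 0.3727 | 0.3510 | 0.6168 | 0.3615 |
| CAW | Qwen3-235B-A22B (Alibaba) | 0.4416 | 0.3524 | 0.8467 | 1.84 | 0.4131 | 0.9133 | 0.4936 | 0.4778 | 0.7607 | 0.4856 |
| CAW | GPT-5.2 (OpenAI) | 0.3008 | 0.3167 | 0.8467 | 5.4667 | 0.3432 | 0.62 | 0.4368 | 0.4562 | 0.787 | 0.4463 |
| CAW | Gemini-3-Pro (Google) | 0.4109 | 0.4342 | 0.74 | 1.2867 | 0.4242 | 0.7867 | 0.5721 | 0.5495 | 0.7907 | 0.5606 |
| CAW | Claude-4.5 (Anthropic) | 0.1564 | 0.1072 | 0.1467 | 1.7733 | 0.3632 | 0.7733 | 0.3666 | 0.3328 | 0.6705 | 0.3489 |
| API-Using Agent | ReAct | 0.343 | 0.3779 | 0.7333 | 4.8267 | 0.2705 | 0.9933 | 0.5923 | 0.6015 | 0.7402 | 0.5969 |
| API-Using Agent | AvaTaR | 0.4313 | 0.4639 | 0.79 | 1.3867 | 0.3522 | 0.9267 | 0.6046 | 0.5894 | 0.826 | 0.5969 |
| API-Using Agent | DRAFT | 0.4199 | 0.4545 | 0.7667 | 1.3333 | 0.412 | 0.92 | 0.5873 | 0.5819 | 0.8217 | 0.5846 |
| API-Using Agent | PLAY2PROMPT | 0.4308 | 0.4881 | 0.8333 | 1.5267 | 0.3968 | 0.9 | 0.6104 | 0.609 | 0.8542 | 0.610 |
| Coding Agent | CodeAct | 0.4022 | 0.4313 | 0.8 | 1.3467 | 0.4047 | 0.9467 | 0.5144 | 0.5123 | 0.9295 | 0.5130 |
| Coding Agent | SoAy | 0.4275 | 0.4306 | 0.8067 | 1.3067 | 0.3934 | 0.9667 | 0.541 | 0.5008 | 0.7225 | 0.5201 |
| Deep Research Agent | Perplexity (Perplexity AI) | / | / | / | / | / | / | 0.3692 | 0.4251 | / | 0.3952 |
| Deep Research Agent | Metaso (Metaso) | / | / | / | / | / | / | 0.2688 | 0.3025 | / | 0.2847 |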


Comparison of academic search benchmarks

The Entity column denotes the type of academic entity targeted by each QA task. In the Evaluation Metrics column, LLM indicates LLM-based semantic correctness, citations indicates that the references or links used in the answer are evaluated, and process indicates that the reasoning process is evaluated. In the Annotation Modules column, evidence indicates that the supporting context for the answer must be located in the original text, citations indicates that the answer must provide its referenced sources, and API paths indicates that the API call traces, including the API inputs and outputs, must be annotated.

AISE-Bench construction pipeline

We first perform filtering and sampling based on real user queries from AMiner. Then, using a customized agent workflow (CAW), we generate initial API plans and answers for each query. Annotators refine the CAW-based plans and answers, and each annotated query is verified by at least one reviewer.

Distribution of retrieval questions across four dimensions

(a) User Intention: Search Org. = Search Organization; (b) Knowledge Level: Know. Mem. = Knowledge Memorization, and Knowledge Understanding is further categorized into Examples, Comparison, Comprehension, and Summarization; (c) Planning Steps: number of API-call steps; (d) Discipline: first-level discipline.

Overview of the API ecosystem

(a) API Library: Detailed specifications of available search and retrieval functions, including their types, input parameters, and return values. (b) API Graph: Schematic representation of the interaction flow between core entities (Paper, Author, Venue, and Org), where numbered labels correspond to the API IDs defined in the library.
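As a rough illustration of how an API graph of this kind can be used for planning, the sketch below stores entity-to-entity edges labeled with API IDs and finds the shortest chain of calls between two entity types with a breadth-first search. The edges and ID numbers are invented for the example and are not the ones defined in the AISE-Bench API library; only the four entity names come from the caption above.

```python
from collections import deque

# Hypothetical edges: (source entity, target entity, API ID).
# These IDs are placeholders, not the real AISE-Bench API IDs.
API_EDGES = [
    ("Author", "Paper", 1),
    ("Paper", "Venue", 2),
    ("Author", "Org", 3),
    ("Org", "Author", 4),
    ("Paper", "Author", 5),
]


def shortest_api_chain(start, goal, edges=API_EDGES):
    """Return the shortest list of API IDs leading from start to goal, or None."""
    adjacency = {}
    for src, dst, api_id in edges:
        adjacency.setdefault(src, []).append((dst, api_id))

    queue = deque([(start, [])])  # (current entity, API IDs used so far)
    visited = {start}
    while queue:
        entity, chain = queue.popleft()
        if entity == goal:
            return chain
        for nxt, api_id in adjacency.get(entity, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, chain + [api_id]))
    return None  # goal is unreachable from start


org_to_venue = shortest_api_chain("Org", "Venue")
```

With these made-up edges, a query that starts from an organization and asks about venues needs three calls (Org to Author, Author to Paper, Paper to Venue), which is the kind of multi-step plan the Planning Steps dimension counts.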

Customized Agent Workflow

Overview of the Customized Agent Workflow (CAW) framework, which consists of three main components: a planner, a task executor, and a synthesizer. The modular multi-agent design allows annotators to independently modify or customize individual modules.
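The modular structure described above can be sketched as three swappable callables. This is a minimal illustration only: the names `Step`, `Workflow`, and the toy modules are hypothetical, and the real CAW planner, executor, and synthesizer are LLM- and API-backed components whose interfaces are not specified here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass
class Step:
    """One planned API call (hypothetical structure)."""
    api_name: str
    params: Dict[str, str]


@dataclass
class Workflow:
    """Planner -> executor -> synthesizer, each independently replaceable."""
    planner: Callable[[str], List[Step]]
    executor: Callable[[List[Step]], List[dict]]
    synthesizer: Callable[[str, List[dict]], str]

    def run(self, query: str) -> str:
        plan = self.planner(query)               # decide which APIs to call
        results = self.executor(plan)            # call them and collect outputs
        return self.synthesizer(query, results)  # write the final answer


# Toy stand-ins so the sketch runs without any external service.
def toy_planner(query: str) -> List[Step]:
    return [Step("search_paper", {"title": query})]


def toy_executor(plan: List[Step]) -> List[dict]:
    return [{"api": step.api_name, "hits": 0} for step in plan]


def toy_synthesizer(query: str, results: List[dict]) -> str:
    return f"{len(results)} API call(s) executed for: {query}"


workflow = Workflow(toy_planner, toy_executor, toy_synthesizer)
answer = workflow.run("graph neural networks")

# Swapping a single module leaves the other two untouched.
terse = replace(workflow, synthesizer=lambda q, r: f"{len(r)} result(s)")
terse_answer = terse.run("graph neural networks")
```

Keeping the three stages behind plain function signatures is one way to let a single module be revised or replaced without touching the others.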

Evaluation Metrics

We propose a multi-dimensional evaluation framework covering reasoning, answer quality, and references. For API calling and planning, we measure graph edit distance and execution success rate. For references and formatting, we compare model-generated citations with gold annotations using precision and recall, and assess compliance with the required JSON output format. For answer content, we evaluate correctness and completeness, analogous to precision and recall against reference answers, and measure faithfulness by detecting inconsistencies with the API outputs.
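A minimal sketch of two of these metrics follows. It assumes citations are compared by exact match on a link or identifier string, which may differ from the benchmark's actual matching rule. It also treats F1-LM as the harmonic mean of correctness and completeness; the source does not state this formula, but it is consistent with the leaderboard rows (for example, 0.4571 and 0.4729 give 0.4649).

```python
def citation_precision_recall(predicted, gold):
    """Precision and recall of predicted citations against gold annotations.

    Assumes exact string match between citation identifiers.
    """
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


def harmonic_mean(correctness, completeness):
    """F1-style combination of two scores in [0, 1]."""
    total = correctness + completeness
    return 2 * correctness * completeness / total if total > 0 else 0.0


# Placeholder identifiers, not real benchmark data.
precision, recall = citation_precision_recall(
    ["paper/a", "paper/b", "paper/c"],
    ["paper/b", "paper/c", "paper/d", "paper/e"],
)

# Leaderboard values for the CAW Deepseek-V3.2 row.
f1_lm = round(harmonic_mean(0.4571, 0.4729), 4)
```

Here two of the three predicted citations are in the gold set and two of the four gold citations are recovered, so precision is 2/3 and recall is 1/2.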

Experiments

Results across Question Types

The F1-LM score of representative methods on different types of problems (user intent understanding and API planning difficulty).

The performance of representative methods on queries of knowledge memorization and understanding.

Case Studies

Representative case studies from the AISE-Bench test set.