The Amazing Agent Race: Strong Tool Users, Weak Navigators
A DAG-structured benchmark of 1,400 Wikipedia navigation tasks showing that the best current agents achieve only 37.2% accuracy, with navigation errors dominating failures (27–52%), exposing compositional reasoning as the primary bottleneck for frontier agents.
Abstract
The paper introduces a benchmark featuring directed acyclic graph (DAG) puzzles where agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results. The 1,400-instance dataset reveals that current agents achieve only 37.2% accuracy, with navigation errors dominating (27–52% of failures) while tool-use errors remain minimal (below 17%). The benchmark demonstrates that current agents excel at individual tool execution but fail at compositional multi-hop navigation requiring fork-merge reasoning patterns.
Key Contributions
- Compositionality gap analysis: Shows 55–100% linearity across six existing benchmarks, identifying a structural gap that prior benchmarks cannot expose
- Automated generation pipeline: Produces 1,400 DAG-structured instances with fork-merge diamond patterns and four difficulty levels
- Three decomposed metrics: Finish-line accuracy (FLA), pit-stop visit rate (PVR), and roadblock completion rate (RCR) isolate failures at distinct pipeline stages
- Comprehensive evaluation: Evaluates three agent frameworks, demonstrating that architecture matters as much as model scale; Claude Code matches Codex CLI at 37% accuracy with 6× fewer tokens
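The three decomposed metrics above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `LegTrace` record and its fields are invented here, assuming an evaluation log records which pages the agent visited, which required tool calls it completed, and whether its final answer was correct.

```python
# Hypothetical sketch of the three decomposed metrics. The LegTrace
# schema is an assumption, not the paper's actual data format.
from dataclasses import dataclass


@dataclass
class LegTrace:
    required_pages: set    # ground-truth Wikipedia pages on the DAG path
    visited_pages: set     # pages the agent actually visited
    required_tools: int    # tool calls the ground-truth chain needs
    completed_tools: int   # tool calls the agent executed successfully
    answer_correct: bool   # final answer matches ground truth


def finish_line_accuracy(traces):
    """FLA: fraction of legs whose final answer is correct."""
    return sum(t.answer_correct for t in traces) / len(traces)


def pit_stop_visit_rate(traces):
    """PVR: fraction of required intermediate pages actually visited."""
    hits = sum(len(t.required_pages & t.visited_pages) for t in traces)
    total = sum(len(t.required_pages) for t in traces)
    return hits / total


def roadblock_completion_rate(traces):
    """RCR: fraction of required tool calls completed successfully."""
    done = sum(t.completed_tools for t in traces)
    total = sum(t.required_tools for t in traces)
    return done / total
```

Separating the metrics this way is what lets the paper attribute failures to distinct pipeline stages: an agent can score high RCR (tools work) while PVR collapses (navigation fails).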
Methodology
The benchmark presents "legs" (problem instances) as riddle-style clue envelopes that reveal neither Wikipedia titles nor tool names. Agents receive a seed URL, 19 tools with schemas, and a step budget. Ground-truth execution traces validate solvability via live APIs, ensuring answers cannot be memorized. Diamond patterns (source → branches → merge) create non-linear dependencies absent from prior work. Four difficulty levels correspond to increasing DAG depth and branching complexity.
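The fork-merge dependency structure can be made concrete with a small sketch. The node names below are invented examples; the point is only that the merge clue depends on both branch answers, so a purely linear visit order cannot satisfy it.

```python
# Illustrative diamond (fork-merge) leg, assuming each node is a clue
# that depends on its parents' answers. Node names are hypothetical.
from graphlib import TopologicalSorter

# source -> two branches -> merge: the merge clue needs BOTH branch
# answers, which is exactly the non-linear dependency linear
# benchmarks lack.
diamond = {
    "source": set(),
    "branch_a": {"source"},
    "branch_b": {"source"},
    "merge": {"branch_a", "branch_b"},
}

# Any valid solution order must start at the source and end at the merge.
order = list(TopologicalSorter(diamond).static_order())
```

Deeper difficulty levels would correspond to nesting or chaining such diamonds, increasing both DAG depth and branching factor.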
Results
- Best accuracy: 37.2% across 1,400 instances
- Navigation bottleneck: PVR (pit-stop visit rate) drops 13–18 percentage points moving from linear to compositional structures
- Tool-use competence: RCR (roadblock completion rate) remains stable despite 3× longer chains, indicating navigation—not tool composition—drives failures
- Architecture efficiency: Claude Code (6× fewer tokens) matches Codex CLI accuracy at 37%, showing token efficiency decouples from task performance
- Navigation errors: 27–52% of failures attributed to navigation; tool-use errors remain below 17% across all models
Limitations
- Wikipedia is the sole navigation source; expansion to broader domains recommended
- DAG topologies limited to diamond patterns; shared sub-expressions and conditional branches suggested as future work
- Shortcut solutions occur on 14–21% of DAG trials, with agents inferring answers without visiting required pages, potentially inflating perceived competence
- High-difficulty legs show disproportionate shortcut rates (88% on extreme DAG), reducing benchmark discrimination when genuine navigation is required
- Clue envelope design leaks tool-chain structure, enabling some navigation bypass
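The shortcut phenomenon noted above admits a simple operational check. This is a hedged sketch, assuming (as in the metrics discussion) that traces record which pages were visited: a trial counts as a shortcut when the answer is correct but some required page on the DAG path was never visited.

```python
# Minimal shortcut check; the trace fields are assumptions, not the
# paper's actual logging format.
def is_shortcut(answer_correct: bool,
                required_pages: set,
                visited_pages: set) -> bool:
    """True if the agent answered correctly without visiting every
    required page, i.e. it inferred the answer rather than navigating."""
    return answer_correct and not required_pages <= visited_pages


# Example: correct answer, but one required branch page was skipped.
print(is_shortcut(True, {"PageA", "PageB"}, {"PageA"}))  # → True
```

Flagging such trials separately would keep shortcut rates from inflating the navigation metrics the benchmark is designed to isolate.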
Source: The Amazing Agent Race by Zae Myung Kim et al., University of Minnesota / Yonsei University / Google DeepMind