The Amazing Agent Race: Strong Tool Users, Weak Navigators
A DAG-structured benchmark of 1,400 Wikipedia navigation tasks showing that the best current agents achieve only 37.2% accuracy, with navigation errors dominating failures (27–52%), exposing compositional reasoning as the primary bottleneck for frontier agents.
Abstract
The paper introduces a benchmark featuring directed acyclic graph (DAG) puzzles where agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results. The 1,400-instance dataset reveals that current agents achieve only 37.2% accuracy, with navigation errors dominating (27–52% of failures) while tool-use errors remain minimal (below 17%). The benchmark demonstrates that current agents excel at individual tool execution but fail at compositional multi-hop navigation requiring fork-merge reasoning patterns.
Key Contributions
- Compositionality gap analysis: Shows 55–100% linearity across six existing benchmarks, identifying a structural gap that prior benchmarks cannot expose
- Automated generation pipeline: Produces 1,400 DAG-structured instances with fork-merge diamond patterns and four difficulty levels
- Three decomposed metrics: Finish-line accuracy (FLA), pit-stop visit rate (PVR), and roadblock completion rate (RCR) isolate failures at distinct pipeline stages
- Comprehensive evaluation: Evaluates three agent frameworks, demonstrating that architecture matters as much as model scale; Claude Code matches Codex CLI at 37% accuracy with 6× fewer tokens
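The three decomposed metrics above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `LegTrace` record and its fields are invented here, assuming an evaluation log records which pages the agent visited, which required tool calls it completed, and whether its final answer was correct.

```python
# Hypothetical sketch of the three decomposed metrics. The LegTrace
# schema is an assumption, not the paper's actual data format.
from dataclasses import dataclass


@dataclass
class LegTrace:
    required_pages: set    # ground-truth Wikipedia pages on the DAG path
    visited_pages: set     # pages the agent actually visited
    required_tools: int    # tool calls the ground-truth chain needs
    completed_tools: int   # tool calls the agent executed successfully
    answer_correct: bool   # final answer matches ground truth


def finish_line_accuracy(traces):
    """FLA: fraction of legs whose final answer is correct."""
    return sum(t.answer_correct for t in traces) / len(traces)


def pit_stop_visit_rate(traces):
    """PVR: fraction of required intermediate pages actually visited."""
    hits = sum(len(t.required_pages & t.visited_pages) for t in traces)
    total = sum(len(t.required_pages) for t in traces)
    return hits / total


def roadblock_completion_rate(traces):
    """RCR: fraction of required tool calls completed successfully."""
    done = sum(t.completed_tools for t in traces)
    total = sum(t.required_tools for t in traces)
    return done / total
```

Separating the metrics this way is what lets the paper attribute failures to distinct pipeline stages: an agent can score high RCR (tools work) while PVR collapses (navigation fails).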
Methodology
The benchmark presents "legs" (problem instances) as riddle-style clue envelopes that reveal neither Wikipedia titles nor tool names. Agents receive a seed URL, 19 tools with schemas, and a step budget. Ground-truth execution traces validate solvability via live APIs, ensuring answers cannot be memorized. Diamond patterns (source → branches → merge) create non-linear dependencies absent from prior work. Four difficulty levels correspond to increasing DAG depth and branching complexity.
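The fork-merge dependency structure can be made concrete with a small sketch. The node names below are invented examples; the point is only that the merge clue depends on both branch answers, so a purely linear visit order cannot satisfy it.

```python
# Illustrative diamond (fork-merge) leg, assuming each node is a clue
# that depends on its parents' answers. Node names are hypothetical.
from graphlib import TopologicalSorter

# source -> two branches -> merge: the merge clue needs BOTH branch
# answers, which is exactly the non-linear dependency linear
# benchmarks lack.
diamond = {
    "source": set(),
    "branch_a": {"source"},
    "branch_b": {"source"},
    "merge": {"branch_a", "branch_b"},
}

# Any valid solution order must start at the source and end at the merge.
order = list(TopologicalSorter(diamond).static_order())
```

Deeper difficulty levels would correspond to nesting or chaining such diamonds, increasing both DAG depth and branching factor.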
Results
- Best accuracy: 37.2% across 1,400 instances
- Navigation bottleneck: PVR (pit-stop visit rate) drops 13–18 percentage points moving from linear to compositional structures
- Tool-use competence: RCR (roadblock completion rate) remains stable despite 3× longer chains, indicating navigation—not tool composition—drives failures
- Architecture efficiency: Claude Code (6× fewer tokens) matches Codex CLI accuracy at 37%, showing token efficiency decouples from task performance
- Navigation errors: 27–52% of failures attributed to navigation; tool-use errors remain below 17% across all models
Limitations
- Wikipedia is the sole navigation source; expansion to broader domains recommended
- DAG topologies limited to diamond patterns; shared sub-expressions and conditional branches suggested as future work
- Shortcut solutions occur on 14–21% of DAG trials, with agents inferring answers without visiting required pages, potentially inflating perceived competence
- High-difficulty legs show disproportionate shortcut rates (88% on extreme DAG), reducing benchmark discrimination when genuine navigation is required
- Clue envelope design leaks tool-chain structure, enabling some navigation bypass
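The shortcut phenomenon noted above admits a simple operational check. This is a hedged sketch, assuming (as in the metrics discussion) that traces record which pages were visited: a trial counts as a shortcut when the answer is correct but some required page on the DAG path was never visited.

```python
# Minimal shortcut check; the trace fields are assumptions, not the
# paper's actual logging format.
def is_shortcut(answer_correct: bool,
                required_pages: set,
                visited_pages: set) -> bool:
    """True if the agent answered correctly without visiting every
    required page, i.e. it inferred the answer rather than navigating."""
    return answer_correct and not required_pages <= visited_pages


# Example: correct answer, but one required branch page was skipped.
print(is_shortcut(True, {"PageA", "PageB"}, {"PageA"}))  # → True
```

Flagging such trials separately would keep shortcut rates from inflating the navigation metrics the benchmark is designed to isolate.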
Source: The Amazing Agent Race by Zae Myung Kim et al., University of Minnesota / Yonsei University / Google DeepMind