LLM Tool Use

Active Frontier

tool-use, agents, paradigms


Tool use is the key mechanism that operationalizes agentic action: it transforms LLMs from text generators into systems that can interact with external APIs, databases, code interpreters, and the physical world. Hu et al. provide a unified evolutionary framework that organizes this rapidly growing field into three complementary paradigms.

Paradigm 1 — Prompting as Plug-and-Play: Frozen models guided via in-context learning through reasoning-action loops, decoupled planning-execution, or program-aided reasoning. Excels in flexibility but suffers from latency and unreliability in complex tasks.
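A reasoning-action loop of the kind described for Paradigm 1 can be sketched as follows. This is a minimal illustration, not the method of Hu et al.: the tool registry `TOOLS`, the stand-in `scripted_model`, and the `run_loop` helper are all hypothetical names, and the "model" is a hard-coded policy so the example runs without any LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; names and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split(","))),
    "upper": lambda arg: arg.upper(),
}

PREFIX = "Observation: "


def scripted_model(history: List[str]) -> Tuple[str, str]:
    """Stand-in for a frozen LLM: returns (action, argument).

    A real system would prompt the model with `history` and parse its reply.
    """
    observations = [h for h in history if h.startswith(PREFIX)]
    if not observations:
        return "add", "2,3"  # first step: call a tool
    return "finish", observations[-1][len(PREFIX):]  # then answer


def run_loop(model, tools, question: str, max_steps: int = 5) -> str:
    """Alternate model decisions and tool observations until 'finish'."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        action, arg = model(history)
        if action == "finish":
            return arg
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool {action!r}"
        history.append(f"Action: {action}[{arg}]")
        history.append(f"{PREFIX}{observation}")
    return "no answer within step budget"


print(run_loop(scripted_model, TOOLS, "What is 2 + 3?"))
```

Every step costs one model call, which is one way to see where the latency of this paradigm comes from; the step budget guards against loops that never terminate.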

Paradigm 2 — Supervised Tool Learning: Internalizing tool-use patterns through fine-tuning using self-supervised data generation, large-scale instruction tuning, and alignment-focused training. Improves efficiency and stability through parameter internalization.

Paradigm 3 — Reward-Driven Tool Policy Learning: Optimizing interaction through reinforcement signals covering strategic decisions, multi-turn reasoning, and multimodal frameworks. Enables autonomous adaptation in dynamic environments.

The field has evolved sequentially through these paradigms, but modern production systems increasingly combine all three rather than relying on any single approach.

The Navigation Gap

Research from April 2026 reveals a critical distinction that the three-paradigm framework alone cannot capture: agents are strong tool executors but weak navigators. The Amazing Agent Race benchmark (Kim et al.) introduces DAG-structured tasks that require fork-merge reasoning: agents must branch independently, gather information along multiple paths, and aggregate the results. This is compositionally harder than the linear tool chains tested by prior benchmarks.
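The fork-merge structure can be made concrete with a small sketch. The graph, node names, and step functions below are invented for illustration and are not taken from the benchmark; the only library call used is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Each node maps to the nodes it depends on (its predecessors).
# "start" forks into two independent branches that merge at "merge".
DEPS: Dict[str, List[str]] = {
    "start": [],
    "branch_a": ["start"],
    "branch_b": ["start"],
    "merge": ["branch_a", "branch_b"],
}

# Hypothetical step functions: each receives the results of its predecessors.
STEPS: Dict[str, Callable[[List[int]], int]] = {
    "start": lambda inputs: 10,
    "branch_a": lambda inputs: inputs[0] + 1,
    "branch_b": lambda inputs: inputs[0] * 2,
    "merge": lambda inputs: sum(inputs),
}


def run_dag(deps, steps) -> Dict[str, int]:
    """Run every node after its predecessors and collect the results."""
    results: Dict[str, int] = {}
    for node in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(node, [])]
        results[node] = steps[node](inputs)
    return results


print(run_dag(DEPS, STEPS)["merge"])
```

Here the execution order is given by the graph. In the benchmark setting the agent has to discover and track that structure itself, which is where the navigation difficulty lies.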

Findings: the best agents achieve only 37.2% accuracy on compositional navigation despite near-perfect tool-use reliability. Navigation errors account for 27–52% of failures, while tool-use errors stay below 17% even with 3× longer chains. Six prior benchmarks range from 55% to 100% linearity, which makes compositional failures structurally invisible. The implication is that the paradigm-3 RL approach improves tool-call quality but does not address the navigation planning layer; these are orthogonal capability gaps requiring different solutions.
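This page does not say how linearity is computed. As a rough illustration only, the sketch below assumes one plausible definition: the fraction of nodes in a task graph that have at most one predecessor and at most one successor. That definition, the `linearity` function, and the example graphs are assumptions, not the metric used by Kim et al.

```python
from typing import Dict, List


def linearity(deps: Dict[str, List[str]]) -> float:
    """Share of nodes with in-degree <= 1 and out-degree <= 1 (assumed metric).

    `deps` maps each node to the list of nodes it depends on.
    """
    nodes = set(deps)
    for preds in deps.values():
        nodes.update(preds)
    out_degree = {n: 0 for n in nodes}
    for preds in deps.values():
        for p in preds:
            out_degree[p] += 1
    linear = sum(
        1 for n in nodes
        if len(deps.get(n, [])) <= 1 and out_degree[n] <= 1
    )
    return linear / len(nodes) if nodes else 1.0


chain = {"a": [], "b": ["a"], "c": ["b"]}
diamond = {"s": [], "x": ["s"], "y": ["s"], "m": ["x", "y"]}
print(linearity(chain), linearity(diamond))
```

Under this assumed definition a pure chain scores 1.0, and a single fork-merge diamond scores 0.5 because the fork node and the merge node are both non-linear.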

Architecture matters as much as scale: Claude Code matches Codex CLI at 37% accuracy while using 6× fewer tokens, decoupling token efficiency from task performance.

Key Claims

Three paradigms structure all LLM tool use — Prompting, supervised fine-tuning, and RL form complementary approaches with explicit trade-offs. Evidence: strong (Agentic Tool Use in LLMs)
Prompting excels in flexibility but fails on complex tasks — Latency and unreliability increase with task complexity. Evidence: strong (Agentic Tool Use in LLMs)
Production combines all three paradigms — No single paradigm dominates; real systems use hybrid approaches. Evidence: strong (Agentic Tool Use in LLMs)
Tool use is the foundational layer of agentic reasoning — Planning and search capabilities depend on reliable tool access. Evidence: strong (Agentic Reasoning for LLMs)
Navigation is the primary frontier bottleneck, not tool execution — Agents achieve near-perfect tool-use reliability but only 37.2% on compositional navigation tasks. Evidence: strong (Amazing Agent Race)
Prior benchmarks cannot expose the compositionality gap — 55–100% linearity in six existing benchmarks hides the failure mode that compositional tasks expose. Evidence: strong (Amazing Agent Race)

Benchmarks & Data

Evaluation matured from isolated function-call metrics to holistic interactive benchmarks (WebArena, OSWorld) (Hu et al.)
Three-tier evaluation framework: tool-call validity → task completion → interactive performance (Hu et al.)
Best agent accuracy on compositional navigation: 37.2% across 1,400 DAG instances (Kim et al.)
Navigation errors: 27–52% of failures; tool-use errors: <17% across all models (Kim et al.)
Linearity of existing benchmarks: 55–100% — compositional failures are structurally invisible (Kim et al.)
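The three-tier evaluation framework can be sketched as three checks over one recorded episode. Everything here is an assumption made for illustration: the `ToolCall` and `Episode` records, the `SCHEMAS` table, exact-match answer checking, and reading "interactive performance" as "completed within a step budget" are not specified by Hu et al.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ToolCall:
    name: str
    args: Dict[str, object]


@dataclass
class Episode:
    calls: List[ToolCall]
    final_answer: str
    expected_answer: str
    steps_used: int
    step_budget: int


# Hypothetical tool schemas: tool name -> required argument names.
SCHEMAS: Dict[str, Set[str]] = {"search": {"query"}, "open_page": {"url"}}


def call_is_valid(call: ToolCall) -> bool:
    """Tier 1: the tool exists and all required arguments are present."""
    required = SCHEMAS.get(call.name)
    return required is not None and required <= set(call.args)


def evaluate(ep: Episode) -> Dict[str, object]:
    valid = sum(call_is_valid(c) for c in ep.calls)
    validity = valid / len(ep.calls) if ep.calls else 0.0
    # Tier 2: task completion, here a case-insensitive exact match.
    completed = (
        ep.final_answer.strip().lower() == ep.expected_answer.strip().lower()
    )
    # Tier 3 (assumed proxy): completed without exceeding the step budget.
    interactive = completed and ep.steps_used <= ep.step_budget
    return {
        "tool_call_validity": validity,
        "task_completed": completed,
        "interactive_success": interactive,
    }


episode = Episode(
    calls=[ToolCall("search", {"query": "capital of France"}),
           ToolCall("open_page", {})],  # missing "url": invalid call
    final_answer="Paris",
    expected_answer="paris",
    steps_used=3,
    step_budget=5,
)
print(evaluate(episode))
```

Reporting the tiers separately matters because an agent can score well on call validity and still fail the task, which is the pattern the navigation-gap figures describe.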

Open Questions

How to solve credit assignment in lengthy tool chains (which tool call was responsible for success/failure)?
Can models generalize tool use to completely unseen APIs without examples?
How to balance alignment and capability in tool-augmented agents?
How to extend tool use robustly into multimodal settings (vision tools, physical actuators)?
Can agents be trained specifically on fork-merge reasoning patterns to close the navigation gap?
Do compositional navigation failures generalize beyond Wikipedia to broader multi-hop retrieval tasks?

Related Concepts

Agentic Reasoning — Tool use is the action mechanism within the agentic framework
Reinforcement Learning for Agents — The third paradigm for learning tool policies
Agent Evaluation Benchmarks — How tool use capabilities are measured
Tool-Chain Navigation — The compositional reasoning layer that tool execution alone cannot address

Changelog

2026-04-14 — Added navigation gap section from Kim et al. 2026 (2604.10261); compositional bottleneck finding, benchmark linearity gap
2026-04-05 — Initial compilation from Hu et al., Wei et al., Ferrag et al. tool-use surveys


Theses that depend on this concept

These research positions cite this concept in their evidence. If the concept changes materially, these theses may need re-scoring.

T1 (never reviewed): Agentic reasoning will consolidate around a standard stack (VLM + tool use + memory + RL) by end of 2027. Score: 6.0/10. No history yet.


Sources

agentic-tool-use-in-llms
agentic-reasoning-for-llms
llm-reasoning-to-autonomous-agents
amazing-agent-race-tool-users-weak-navigators
