LLM Tool Use

Active Frontier
tool-use · agents · paradigms

Tool use is the key mechanism that operationalizes agentic action — it transforms LLMs from text generators into systems that can interact with external APIs, databases, code interpreters, and the physical world. Hu et al. provide a unified evolutionary framework organizing this rapidly growing field into three complementary paradigms.

Paradigm 1 — Prompting as Plug-and-Play: Frozen models are guided via in-context learning through reasoning-action loops, decoupled planning and execution, or program-aided reasoning. This paradigm excels in flexibility but suffers from latency and unreliability as task complexity grows.
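A minimal sketch of the reasoning-action loop this paradigm describes, in the ReAct style. The `call_llm` stand-in and the `TOOLS` registry are hypothetical; a real system would query a frozen LLM and parse its output more robustly.

```python
# Sketch of a prompting-based reasoning-action (ReAct-style) loop.
# call_llm and TOOLS are hypothetical stand-ins, not a real API.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool for illustration
}

def call_llm(prompt: str) -> str:
    # Stand-in for a frozen model; here one thought/action/answer
    # trace is hard-coded so the loop can be run end to end.
    if "Observation:" not in prompt:
        return "Thought: I need arithmetic.\nAction: calculator[2 + 3]"
    return "Final Answer: 5"

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        step = call_llm(prompt)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        # Parse "Action: tool[args]" and execute the named tool.
        action = step.split("Action:")[1].strip()
        name, args = action.split("[", 1)
        observation = TOOLS[name](args.rstrip("]"))
        prompt += f"\n{step}\nObservation: {observation}"
    return "no answer within budget"
```

The loop's interleaved thought/action/observation transcript is exactly where the latency cost comes from: every tool call requires another round trip to the model.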

Paradigm 2 — Supervised Tool Learning: Tool-use patterns are internalized through fine-tuning, using self-supervised data generation, large-scale instruction tuning, and alignment-focused training. This improves efficiency and stability by baking tool-use behavior into the model's parameters.
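To make "internalizing tool-use patterns" concrete, here is a hypothetical shape for one supervised fine-tuning example: the target completion teaches the model to emit a structured tool call, observe its result, and then answer. The `<tool_call>`/`<tool_result>` markup and field names are illustrative assumptions, not a standard format.

```python
import json

# Hypothetical SFT example builder for supervised tool learning: the
# completion demonstrates call -> result -> answer in one target string.
def make_sft_example(question: str, tool: str, args: dict,
                     result: str, answer: str) -> dict:
    return {
        "prompt": question,
        "completion": (
            f"<tool_call>{json.dumps({'name': tool, 'arguments': args})}</tool_call>\n"
            f"<tool_result>{result}</tool_result>\n"
            f"{answer}"
        ),
    }

example = make_sft_example(
    "What is the population of France?",
    tool="web_search",
    args={"query": "population of France"},
    result="about 68 million (2024)",
    answer="France has roughly 68 million people.",
)
```

Training on many such traces is what lets the fine-tuned model call tools without the long in-context scaffolding the prompting paradigm needs.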

Paradigm 3 — Reward-Driven Tool Policy Learning: Interaction policies are optimized with reinforcement signals covering strategic decisions (whether and which tool to call), multi-turn reasoning, and multimodal frameworks. This enables autonomous adaptation in dynamic environments.
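One way such a reinforcement signal might be shaped, as a minimal sketch: reward task success while penalizing gratuitous and invalid tool calls, so the policy learns *when* calling a tool is worth it. The specific weights are illustrative assumptions, not values from the survey.

```python
# Hypothetical shaped reward for one tool-use episode. Task success
# dominates; each tool call has a small cost and malformed or failed
# calls cost more, pushing the policy toward selective, valid calls.
def episode_reward(success: bool, tool_calls: int, invalid_calls: int) -> float:
    r = 1.0 if success else 0.0
    r -= 0.05 * tool_calls       # discourage gratuitous tool use
    r -= 0.25 * invalid_calls    # malformed/failed calls cost more
    return r
```

A policy trained against a signal like this must trade off call budget against success probability, which is precisely the "strategic decision" layer the paradigm targets.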

The field has evolved sequentially through these paradigms, but modern production systems increasingly combine all three rather than relying on any single approach.

Key Claims

  • Three paradigms structure all LLM tool use — Prompting, supervised fine-tuning, and RL form complementary approaches with explicit trade-offs. Evidence: strong (Agentic Tool Use in LLMs)
  • Prompting excels in flexibility but fails on complex tasks — Latency and unreliability increase with task complexity. Evidence: strong (Agentic Tool Use in LLMs)
  • Production combines all three paradigms — No single paradigm dominates; real systems use hybrid approaches. Evidence: strong (Agentic Tool Use in LLMs)
  • Tool use is the foundational layer of agentic reasoning — Planning and search capabilities depend on reliable tool access. Evidence: strong (Agentic Reasoning for LLMs)

Benchmarks & Data

  • Evaluation matured from isolated function-call metrics to holistic interactive benchmarks (WebArena, OSWorld) (Hu et al.)
  • Three-tier evaluation framework: tool-call validity → task completion → interactive performance (Hu et al.)
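The three-tier framework can be sketched as a cascade in which each tier only counts if the one below it passes. The trace fields (`schema_ok`, `task_done`, `env_goal_met`) are hypothetical names, not the survey's schema.

```python
# Sketch of the three-tier evaluation cascade: tool-call validity ->
# task completion -> interactive performance. Field names are assumed.
def evaluate(trace: dict) -> dict:
    valid = all(c.get("schema_ok", False) for c in trace["tool_calls"])  # tier 1
    completed = valid and trace.get("task_done", False)                  # tier 2
    interactive = completed and trace.get("env_goal_met", False)         # tier 3
    return {
        "tool_call_validity": valid,
        "task_completion": completed,
        "interactive_success": interactive,
    }
```

The cascade mirrors the field's evolution: an agent can emit perfectly valid calls (tier 1) yet fail the end-to-end goal that benchmarks like WebArena or OSWorld measure (tier 3).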

Open Questions

  • How to solve credit assignment in lengthy tool chains (which tool call was responsible for success/failure)?
  • Can models generalize tool use to completely unseen APIs without examples?
  • How to balance alignment and capability in tool-augmented agents?
  • How to extend tool use robustly into multimodal settings (vision tools, physical actuators)?
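On the credit-assignment question, the naive baseline is temporal discounting: propagate the episode's final outcome backwards so later calls receive more credit or blame. This sketch is an assumption about one simple approach, not a solution the survey proposes; its weakness (an early pivotal call gets little credit) is exactly why the question is open.

```python
# Naive credit assignment for a chain of tool calls: call i (0-indexed,
# out of n) receives outcome * gamma ** (n - 1 - i), so the last call
# gets full credit and earlier calls exponentially less.
def assign_credit(num_calls: int, outcome: float, gamma: float = 0.9) -> list[float]:
    return [outcome * gamma ** (num_calls - 1 - i) for i in range(num_calls)]
```

For a three-call chain that succeeds, the first call receives gamma squared of the credit, which badly underweights an early call that was actually decisive.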
