Agentic Tool Use in Large Language Models

Paper
Hu Jinchao et al., Harbin Institute of Technology Shenzhen and TikTok Inc. April 1, 2026
Key Contribution

Unified evolutionary framework for LLM tool use spanning prompting, supervised fine-tuning, and reinforcement learning paradigms

Abstract

Systematically organizes fragmented research on how LLMs leverage external tools into a unified evolutionary framework. Consolidates the literature into three complementary methodological paradigms — prompting-based approaches, supervised fine-tuning, and reinforcement learning — while analyzing their respective strengths and limitations. Examines evaluation methodologies ranging from function-call correctness to end-to-end interactive success.

Key Contributions

  • Evolutionary synthesis tracing progression from prompt-based control through supervised learning to reward-driven optimization
  • Unified taxonomy of three distinct paradigms with explicit boundaries
  • Three-tier evaluation framework: tool-call validity, task completion, interactive performance (see the evaluation sketch after this list)
  • Identification of unresolved challenges: credit assignment in tool chains, scalable generalization, alignment-aware deployment
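
The three evaluation tiers can be made concrete with a small sketch. The schema, tool name, and helper functions below are hypothetical illustrations, not benchmarks or APIs from the paper.

```python
# Illustrative sketch of the three evaluation tiers; everything named here
# (schema, tool, gold answers) is a hypothetical example, not from the survey.
import json

# Tier 1: tool-call validity -- does the emitted call match the API schema?
WEATHER_SCHEMA = {"name": "get_weather", "required": {"city": str, "unit": str}}

def call_is_valid(raw_call: str, schema: dict) -> bool:
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    if call.get("name") != schema["name"]:
        return False
    args = call.get("arguments", {})
    return all(isinstance(args.get(k), t) for k, t in schema["required"].items())

# Tier 2: task completion -- does the final answer match a gold reference?
def task_completed(final_answer: str, gold: str) -> bool:
    return final_answer.strip().lower() == gold.strip().lower()

# Tier 3: interactive performance -- requires rolling out the agent in a live
# environment (e.g. a WebArena- or OSWorld-style setup) and scoring end-to-end
# success, which cannot be reduced to a single string comparison.

if __name__ == "__main__":
    print(call_is_valid(
        '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}',
        WEATHER_SCHEMA))                      # True: well-formed call
    print(task_completed("18 degrees", "18 Degrees"))  # True: task solved
```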

Methodology

Three paradigms analyzed:

  1. Prompting as Plug-and-Play: Frozen models via in-context learning — reasoning-action loops, decoupled planning-execution, program-aided reasoning (loop sketch below)
  2. Supervised Tool Learning: Internalizing patterns via fine-tuning — self-supervised data generation, large-scale instruction tuning, alignment-focused training (training-data sketch below)
  3. Reward-Driven Tool Policy Learning: Optimizing through RL signals — strategic decisions, multi-turn reasoning, multimodal frameworks (reward sketch below)
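
A minimal sketch of the first paradigm: a frozen model driven through a ReAct-style reasoning-action loop. The `call_model` hook, tool registry, and prompt format are assumptions made for illustration, not an interface prescribed by the paper.

```python
# Prompting-based tool use: a frozen model alternates thoughts, actions, and
# observations; no parameters are updated. Tool and prompt names are illustrative.
import re

def search(query: str) -> str:
    return f"(stub) top result for {query!r}"          # placeholder tool

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}, {}))   # toy arithmetic tool

TOOLS = {"search": search, "calculator": calculator}

def call_model(prompt: str) -> str:
    # Plug in any frozen LLM (chat-completion API) here.
    raise NotImplementedError("connect a frozen LLM")

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = (
        "Answer the question. At each step emit either\n"
        "Action: <tool>[<input>]  or  Final Answer: <answer>\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        output = call_model(prompt)
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", output)
        if not match:
            prompt += output + "\nObservation: could not parse an action.\n"
            continue
        tool, arg = match.group(1), match.group(2)
        observation = TOOLS.get(tool, lambda x: f"unknown tool: {tool}")(arg)
        prompt += f"{output}\nObservation: {observation}\n"   # feed result back
    return "no answer within step budget"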
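
For the other two paradigms, the training signal rather than the prompt carries the tool-use behavior. The record layout and reward terms below are illustrative assumptions, not the paper's concrete recipes.

```python
# Supervised tool learning: fine-tune on traces that pair context with the
# target tool call, so the calling pattern is internalized in the weights.
# This record layout is a hypothetical example format.
sft_example = {
    "messages": [
        {"role": "user", "content": "What's 23 * 19?"},
        {"role": "assistant", "tool_call": {"name": "calculator",
                                            "arguments": {"expr": "23 * 19"}}},
        {"role": "tool", "content": "437"},
        {"role": "assistant", "content": "23 * 19 = 437."},
    ]
}

# Reward-driven tool policy learning: score whole trajectories so the policy
# learns when and how to call tools, not just how to imitate a single call.
# The weights and terms below are illustrative, not the paper's reward design.
def trajectory_reward(valid_calls: int, total_calls: int,
                      task_solved: bool, num_turns: int) -> float:
    format_reward = valid_calls / max(total_calls, 1)   # well-formed calls
    outcome_reward = 1.0 if task_solved else 0.0        # end-to-end success
    efficiency_penalty = 0.05 * num_turns               # discourage long chains
    return 0.3 * format_reward + outcome_reward - efficiency_penalty
```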

Results

  • Prompting: excels in flexibility, suffers from latency and unreliability in complex tasks
  • Supervised fine-tuning: improves efficiency and stability by internalizing tool-use patterns into model parameters
  • RL: enables autonomous adaptation in dynamic environments
  • Production systems increasingly combine all three paradigms
  • Evaluation has matured from isolated function-call metrics to holistic interactive benchmarks (WebArena, OSWorld)

Limitations

  • Primarily text-based; multimodal dimensions emerging but less developed
  • Long-horizon credit assignment underexplored in RL contexts
  • Tool generalization to unseen APIs faces persistent challenges
  • Safety and alignment considerations still underdeveloped
  • Benchmark diversity may not capture real-world deployment complexity

Source: Agentic Tool Use in Large Language Models by Hu Jinchao et al., HIT Shenzhen/TikTok

Tags

tool-use, llm-agents, reinforcement-learning, fine-tuning, evaluation
