Agentic Tool Use in Large Language Models
Hu Jinchao et al. (Harbin Institute of Technology Shenzhen, TikTok Inc), April 1, 2026
Key Contribution
Unified evolutionary framework for LLM tool use: prompting, supervised, RL paradigms
Abstract
This survey systematically organizes fragmented research on how LLMs leverage external tools into a unified evolutionary framework. It consolidates the literature into three complementary methodological paradigms — prompting-based approaches, supervised fine-tuning, and reinforcement learning — and analyzes their respective strengths and limitations. It also examines evaluation methodologies ranging from function-call correctness to end-to-end interactive success.
Key Contributions
- Evolutionary synthesis tracing progression from prompt-based control through supervised learning to reward-driven optimization
- Unified taxonomy of three distinct paradigms with explicit boundaries
- Three-tier evaluation framework: tool-call validity, task completion, interactive performance
- Identification of unresolved challenges: credit assignment in tool chains, scalable generalization, alignment-aware deployment
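The credit-assignment challenge named above can be illustrated with a toy example: when only the final step of a multi-call tool chain yields a reward, discounted returns are one common way to spread credit back over earlier calls. The trajectory and discount factor below are illustrative assumptions, not taken from the paper.

```python
# Toy illustration of credit assignment in a tool chain: a sparse terminal
# reward is propagated to earlier tool calls via discounted returns.

def discounted_returns(rewards, gamma=0.9):
    """Walk the reward sequence backwards, accumulating discounted credit."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Three tool calls, reward only on final success:
# discounted_returns([0.0, 0.0, 1.0]) -> approximately [0.81, 0.9, 1.0]
```

Each earlier tool call receives exponentially less credit, which is precisely why long chains make it hard to identify which intermediate call actually mattered.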
Methodology
Three paradigms analyzed:
- Prompting as Plug-and-Play: Frozen models via in-context learning — reasoning-action loops, decoupled planning-execution, program-aided reasoning
- Supervised Tool Learning: Internalizing patterns via fine-tuning — self-supervised data generation, large-scale instruction tuning, alignment-focused training
- Reward-Driven Tool Policy Learning: Optimizing through RL signals — strategic decisions, multi-turn reasoning, multimodal frameworks
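The reasoning-action loop behind the prompting paradigm can be sketched as follows. Everything here is a minimal illustrative stand-in: `call_llm` is a canned stub for a frozen-model completion call, and `TOOLS` is a hypothetical tool registry, not an API from the paper.

```python
# Minimal sketch of a prompting-style reasoning-action loop (ReAct-like):
# the frozen model alternates Thought/Action steps, and tool results are
# fed back into the prompt as Observations until a Final Answer appears.

def call_llm(prompt: str) -> str:
    # Stub for a frozen-model call; returns a canned two-step trace.
    if "Observation" not in prompt:
        return "Thought: need the weather.\nAction: weather[Paris]"
    return "Thought: I have the answer.\nFinal Answer: sunny"

TOOLS = {"weather": lambda city: "sunny"}  # toy tool registry

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(prompt)
        prompt += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action:" in step:
            action = step.split("Action:")[1].strip()   # e.g. "weather[Paris]"
            name, arg = action.split("[", 1)
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            prompt += f"Observation: {observation}\n"
    return "no answer"
```

Because the model stays frozen, all control lives in the prompt format and the loop itself, which is exactly the flexibility-versus-reliability trade-off the survey attributes to this paradigm.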
Results
- Prompting: excels in flexibility, suffers from latency and unreliability in complex tasks
- Supervised: improves efficiency/stability through parameter internalization
- RL: enables autonomous adaptation in dynamic environments
- Production systems increasingly combine all three paradigms
- Evaluation matured from isolated function-call metrics to holistic interactive benchmarks (WebArena, OSWorld)
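The first tier of that evaluation progression, tool-call validity, can be sketched as a schema check on a predicted function call. The schema layout and checker below are illustrative assumptions, not a format defined by the paper or any specific benchmark.

```python
# Sketch of tier-1 evaluation: does a predicted function call parse and
# match a tool schema (correct name, required args present, types valid)?

import json

SCHEMA = {
    "name": "get_weather",
    "required": {"city": str},   # arg name -> expected type
    "optional": {"unit": str},
}

def is_valid_call(raw: str, schema: dict) -> bool:
    """Return True iff `raw` is a well-formed call matching `schema`."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("name") != schema["name"]:
        return False
    args = call.get("arguments", {})
    allowed = {**schema["required"], **schema["optional"]}
    # Every required argument present; every given argument known and typed.
    return all(k in args for k in schema["required"]) and all(
        k in allowed and isinstance(v, allowed[k]) for k, v in args.items()
    )
```

Higher tiers (task completion, interactive performance) instead require executing the call in an environment such as WebArena or OSWorld and scoring the outcome.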
Limitations
- Primarily text-based; multimodal dimensions emerging but less developed
- Long-horizon credit assignment underexplored in RL contexts
- Tool generalization to unseen APIs faces persistent challenges
- Safety and alignment considerations still underdeveloped
- Benchmark diversity may not capture real-world deployment complexity
Source: Agentic Tool Use in Large Language Models by Hu Jinchao et al., HIT Shenzhen/TikTok
Tags
tool-use, llm-agents, reinforcement-learning, fine-tuning, evaluation