Agentic Tool Use in Large Language Models
Hu Jinchao et al. (Harbin Institute of Technology Shenzhen, TikTok Inc), April 1, 2026
Key Contribution
Unified evolutionary framework for LLM tool use: prompting, supervised, RL paradigms
Abstract
This survey systematically organizes fragmented research on how LLMs leverage external tools into a unified evolutionary framework. It consolidates the literature into three complementary methodological paradigms — prompting-based approaches, supervised fine-tuning, and reinforcement learning — and analyzes their respective strengths and limitations. It also examines evaluation methodologies ranging from function-call correctness to end-to-end interactive success.
Key Contributions
- Evolutionary synthesis tracing progression from prompt-based control through supervised learning to reward-driven optimization
- Unified taxonomy of three distinct paradigms with explicit boundaries
- Three-tier evaluation framework: tool-call validity, task completion, interactive performance
- Identification of unresolved challenges: credit assignment in tool chains, scalable generalization, alignment-aware deployment
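The credit-assignment challenge named above can be illustrated with a toy example: when only the final step of a multi-call tool chain yields a reward, discounted returns are one common way to spread credit back over earlier calls. The trajectory and discount factor below are illustrative assumptions, not taken from the paper.

```python
# Toy illustration of credit assignment in a tool chain: a sparse terminal
# reward is propagated to earlier tool calls via discounted returns.

def discounted_returns(rewards, gamma=0.9):
    """Walk the reward sequence backwards, accumulating discounted credit."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Three tool calls, reward only on final success:
# discounted_returns([0.0, 0.0, 1.0]) -> approximately [0.81, 0.9, 1.0]
```

Each earlier tool call receives exponentially less credit, which is precisely why long chains make it hard to identify which intermediate call actually mattered.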
Methodology
Three paradigms analyzed:
- Prompting as Plug-and-Play: Frozen models via in-context learning — reasoning-action loops, decoupled planning-execution, program-aided reasoning
- Supervised Tool Learning: Internalizing patterns via fine-tuning — self-supervised data generation, large-scale instruction tuning, alignment-focused training
- Reward-Driven Tool Policy Learning: Optimizing through RL signals — strategic decisions, multi-turn reasoning, multimodal frameworks
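The reasoning-action loop behind the prompting paradigm can be sketched as follows. Everything here is a minimal illustrative stand-in: `call_llm` is a canned stub for a frozen-model completion call, and `TOOLS` is a hypothetical tool registry, not an API from the paper.

```python
# Minimal sketch of a prompting-style reasoning-action loop (ReAct-like):
# the frozen model alternates Thought/Action steps, and tool results are
# fed back into the prompt as Observations until a Final Answer appears.

def call_llm(prompt: str) -> str:
    # Stub for a frozen-model call; returns a canned two-step trace.
    if "Observation" not in prompt:
        return "Thought: need the weather.\nAction: weather[Paris]"
    return "Thought: I have the answer.\nFinal Answer: sunny"

TOOLS = {"weather": lambda city: "sunny"}  # toy tool registry

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(prompt)
        prompt += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action:" in step:
            action = step.split("Action:")[1].strip()   # e.g. "weather[Paris]"
            name, arg = action.split("[", 1)
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            prompt += f"Observation: {observation}\n"
    return "no answer"
```

Because the model stays frozen, all control lives in the prompt format and the loop itself, which is exactly the flexibility-versus-reliability trade-off the survey attributes to this paradigm.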
Results
- Prompting: excels in flexibility, suffers from latency and unreliability in complex tasks
- Supervised: improves efficiency/stability through parameter internalization
- RL: enables autonomous adaptation in dynamic environments
- Production systems increasingly combine all three paradigms
- Evaluation matured from isolated function-call metrics to holistic interactive benchmarks (WebArena, OSWorld)
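The first tier of that evaluation progression, tool-call validity, can be sketched as a schema check on a predicted function call. The schema layout and checker below are illustrative assumptions, not a format defined by the paper or any specific benchmark.

```python
# Sketch of tier-1 evaluation: does a predicted function call parse and
# match a tool schema (correct name, required args present, types valid)?

import json

SCHEMA = {
    "name": "get_weather",
    "required": {"city": str},   # arg name -> expected type
    "optional": {"unit": str},
}

def is_valid_call(raw: str, schema: dict) -> bool:
    """Return True iff `raw` is a well-formed call matching `schema`."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("name") != schema["name"]:
        return False
    args = call.get("arguments", {})
    allowed = {**schema["required"], **schema["optional"]}
    # Every required argument present; every given argument known and typed.
    return all(k in args for k in schema["required"]) and all(
        k in allowed and isinstance(v, allowed[k]) for k, v in args.items()
    )
```

Higher tiers (task completion, interactive performance) instead require executing the call in an environment such as WebArena or OSWorld and scoring the outcome.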
Limitations
- Primarily text-based; multimodal dimensions emerging but less developed
- Long-horizon credit assignment underexplored in RL contexts
- Tool generalization to unseen APIs faces persistent challenges
- Safety and alignment considerations still underdeveloped
- Benchmark diversity may not capture real-world deployment complexity
Source: Agentic Tool Use in Large Language Models by Hu Jinchao et al., HIT Shenzhen/TikTok
Tags
tool-use, llm-agents, reinforcement-learning, fine-tuning, evaluation