From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
Abstract
This comprehensive review consolidates fragmented efforts in evaluation benchmarks, frameworks, and collaboration protocols into a unified framework. It presents a side-by-side comparison of benchmarks developed between 2019 and 2025, and proposes a taxonomy of approximately 60 benchmarks covering general knowledge reasoning, mathematical problem-solving, code generation, and domain-specific evaluations. It also reviews agent frameworks published between 2023 and 2025 and examines real-world applications across 11 sectors.
Key Contributions
- Unified taxonomy of ~60 benchmarks categorized across 8 domains
- Comparative analysis of benchmarks developed 2019-2025
- Review of AI-agent frameworks integrating LLMs with modular tools
- Survey of agent collaboration and interoperability protocols, including ACP, MCP, and A2A
- Documentation of real-world applications across 11 sectors including materials science, biomedical research, healthcare, and finance
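To make the protocol layer surveyed above concrete, the sketch below builds a minimal JSON-RPC 2.0 request envelope of the kind exchanged by MCP-style agent/tool protocols. The `tools/call` method name follows MCP's published convention, but the tool name and argument schema here are illustrative assumptions, not a definitive implementation of any of the three protocols.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request of the kind used by MCP-style
    protocols. Treat the exact params schema as illustrative."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # MCP convention; other protocols differ
        "params": {"name": tool_name, "arguments": arguments},
    })

# Example: an agent asking a tool server to run a hypothetical search tool.
msg = make_tool_call(1, "web_search", {"query": "LLM agent benchmarks"})
parsed = json.loads(msg)
```

The value of a shared envelope like this is that heterogeneous agents only need to agree on the transport and message shape, not on each other's internals, which is what the surveyed protocols aim to standardize.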
Methodology
A systematic literature consolidation that organizes fragmented evaluation efforts into a unified framework addressing multi-domain assessment needs.
Results
Coverage spans general reasoning, mathematics, code generation, factual grounding, multimodal tasks, and interactive assessments. The review identifies a critical gap between benchmark performance and real-world deployment robustness.
Limitations
- Failure modes and security vulnerabilities of agentic systems remain open research needs
- Automated scientific discovery remains an unsolved challenge
- A gap persists between benchmark performance and real-world performance
Source: From LLM Reasoning to Autonomous AI Agents by Ferrag et al.