From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
Abstract
This comprehensive review consolidates fragmented efforts in evaluation benchmarks, frameworks, and collaboration protocols into a unified framework. It presents a side-by-side comparison of benchmarks developed between 2019 and 2025, and proposes a taxonomy of approximately 60 benchmarks covering general knowledge reasoning, mathematical problem-solving, code generation, and domain-specific evaluations. It also reviews agent frameworks published between 2023 and 2025 and examines real-world applications across 11 sectors.
Key Contributions
- Unified taxonomy of ~60 benchmarks categorized across 8 domains
- Comparative analysis of benchmarks developed 2019-2025
- Review of AI-agent frameworks integrating LLMs with modular tools
- Survey of agent collaboration and interoperability protocols, including ACP, MCP, and A2A
- Documentation of real-world applications across 11 sectors including materials science, biomedical research, healthcare, and finance
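To make the protocol layer surveyed above concrete, the sketch below builds a minimal JSON-RPC 2.0 request envelope of the kind exchanged by MCP-style agent/tool protocols. The `tools/call` method name follows MCP's published convention, but the tool name and argument schema here are illustrative assumptions, not a definitive implementation of any of the three protocols.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request of the kind used by MCP-style
    protocols. Treat the exact params schema as illustrative."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # MCP convention; other protocols differ
        "params": {"name": tool_name, "arguments": arguments},
    })

# Example: an agent asking a tool server to run a hypothetical search tool.
msg = make_tool_call(1, "web_search", {"query": "LLM agent benchmarks"})
parsed = json.loads(msg)
```

The value of a shared envelope like this is that heterogeneous agents only need to agree on the transport and message shape, not on each other's internals, which is what the surveyed protocols aim to standardize.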
Methodology
A systematic literature consolidation that organizes fragmented evaluation efforts into a unified framework addressing multi-domain assessment needs.
Results
Coverage spans general reasoning, mathematics, code generation, factual grounding, multimodal tasks, and interactive assessments. The review identifies a critical gap between benchmark performance and real-world deployment robustness.
Limitations
- Failure modes and security vulnerabilities of agentic systems remain open research needs
- Automated scientific discovery remains an unsolved challenge
- A gap persists between benchmark performance and real-world performance
Source: From LLM Reasoning to Autonomous AI Agents by Ferrag et al.