Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Paper
Rui Shao, Wei Li, Lingsen Zhang et al. · August 18, 2025
Original Source
Key Contribution

Taxonomy of VLM-based VLA architectures (monolithic vs hierarchical) with RL, world model, and human video integration

Abstract

Traditional rule-based robotic methods struggle in unstructured environments that require adaptive behavior. Vision-Language-Action (VLA) models built on large pretrained VLMs offer a transformative paradigm for robotic manipulation, which demands both precise motor control and multimodal understanding. This survey systematically examines VLA architectures, training strategies, integration with complementary learning paradigms, and emerging capabilities for next-generation robotic systems.

Key Contributions

  • Architectural taxonomy distinguishing monolithic models (single/dual-system designs) from hierarchical models that explicitly decouple planning from execution via interpretable intermediate representations
  • Analysis of how VLA models integrate reinforcement learning, training-free optimization, human video learning, and world models
  • Systematic consolidation of architectural characteristics, operational strengths, and relevant datasets/benchmarks
  • Identification of emerging capabilities: memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation

Architecture Taxonomy

Monolithic Models

  • Single-system: End-to-end models mapping observations directly to actions
  • Dual-system: Fast reactive system + slow deliberative system, inspired by dual-process cognitive architectures (a minimal control loop is sketched after this list)
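
The dual-system idea can be made concrete with a small control loop: a large, slow VLM backbone refreshes a plan latent at low frequency, while a lightweight action head reacts at every control step. The sketch below is illustrative only; `SlowVLMBackbone`, `FastActionHead`, the `env` interface, and all dimensions are hypothetical stand-ins rather than any specific model from the survey.

```python
# Minimal sketch (not from the survey): a dual-system monolithic VLA loop.
# A slow "System 2" VLM backbone refreshes a latent plan at low frequency,
# while a fast "System 1" action head emits motor commands every control step.
# All classes, dimensions, and the env interface are hypothetical placeholders.
import numpy as np

class SlowVLMBackbone:
    """Stand-in for a large pretrained VLM; returns a plan latent."""
    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real system would run vision-language fusion here.
        return np.random.randn(512)

class FastActionHead:
    """Lightweight policy head conditioned on the latest plan latent."""
    def __init__(self, latent_dim=512, proprio_dim=14, action_dim=7):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((action_dim, latent_dim + proprio_dim)) * 0.01

    def act(self, plan_latent: np.ndarray, proprio: np.ndarray) -> np.ndarray:
        x = np.concatenate([plan_latent, proprio])
        return np.tanh(self.W @ x)  # bounded end-effector command

def control_loop(env, steps=200, replan_every=20):
    """Run the fast head every step, the slow backbone every `replan_every` steps."""
    vlm, head = SlowVLMBackbone(), FastActionHead()
    obs = env.reset()
    plan = None
    for t in range(steps):
        if t % replan_every == 0:                    # slow, deliberative update
            plan = vlm.encode(obs["image"], obs["instruction"])
        action = head.act(plan, obs["proprio"])      # fast, reactive update
        obs = env.step(action)
```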

Hierarchical Models

  • Explicit separation of high-level planning and low-level execution
  • Interpretable intermediate representations (language subgoals, waypoints, affordance maps)
  • Greater modularity and debuggability at the cost of integration complexity; a minimal planner/controller sketch follows this list
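
As a hedged illustration of the hierarchical pattern, the sketch below separates a planner that emits language subgoals from a controller that executes them one at a time; `HighLevelPlanner`, `LowLevelController`, and the fixed subgoal list are placeholders, not an interface defined by the survey.

```python
# Minimal sketch (assumed interface, not from the survey): a hierarchical VLA
# where a high-level planner emits interpretable language subgoals and a
# low-level policy executes each one. Names are illustrative placeholders.
from typing import List

class HighLevelPlanner:
    """Stand-in for a VLM planner that decomposes a task into subgoals."""
    def plan(self, instruction: str, scene_description: str) -> List[str]:
        # A real planner would query a VLM; here we return a fixed decomposition.
        return ["locate the mug", "grasp the mug handle", "place the mug on the shelf"]

class LowLevelController:
    """Stand-in for a skill policy that turns one subgoal into motor actions."""
    def execute(self, subgoal: str, env) -> bool:
        print(f"executing: {subgoal}")
        # A real controller would close the loop on camera + proprioception.
        return True  # report success so the planner can advance

def run_task(instruction: str, env=None):
    planner, controller = HighLevelPlanner(), LowLevelController()
    for subgoal in planner.plan(instruction, scene_description="tabletop with a mug"):
        if not controller.execute(subgoal, env):
            # The interpretable intermediate representation makes failures
            # debuggable: we know exactly which subgoal failed and can replan.
            break

run_task("put the mug on the shelf")
```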

Integration with Learning Paradigms

Reinforcement Learning

  • RL fine-tuning for contact-rich manipulation tasks
  • Reward shaping using VLM-generated feedback signals (an RL fine-tuning sketch follows this list)
  • Sim-to-real transfer with domain randomization
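
A common recipe combines a VLM-based reward signal with on-policy fine-tuning of the action head. The sketch below uses a plain REINFORCE update for brevity; `TinyPolicy`, `vlm_shaped_reward`, and the trajectory format are assumptions for illustration, not a specific method from the survey, which more often discusses PPO-style objectives.

```python
# Minimal sketch (illustrative, not the survey's method): RL fine-tuning of a
# pretrained VLA action head with a reward shaped by a VLM success scorer.
# TinyPolicy, vlm_shaped_reward, and the trajectory layout are hypothetical.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in for a VLA action head being fine-tuned with RL."""
    def __init__(self, obs_dim=32, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

def vlm_shaped_reward(frame, instruction) -> float:
    """Stand-in for a VLM judging task progress from the current frame."""
    return 0.0  # a real scorer would return progress toward the instruction

def reinforce_step(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update over a trajectory of (obs, action, reward) tuples."""
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):            # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = 0.0
    for (obs, action, _), g in zip(trajectory, returns):
        loss = loss - policy.dist(obs).log_prob(action).sum() * g
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```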

World Models

  • Predictive models enabling look-ahead planning (a minimal planning loop is sketched after this list)
  • Video prediction for action consequence estimation
  • Physics-informed world models for manipulation planning
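
Look-ahead planning with a world model can be sketched as model-predictive control by random shooting: sample candidate action sequences, roll the learned dynamics forward, and execute the first action of the lowest-cost sequence. Everything below (`learned_dynamics`, `predicted_cost`, the horizon and sample counts) is a hypothetical placeholder, assuming a latent-state world model.

```python
# Minimal sketch (assumptions, not from the survey): look-ahead planning with a
# learned world model via random shooting. Dynamics, cost, and dimensions are
# illustrative placeholders for a latent-state world model.
import numpy as np

def learned_dynamics(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for a world model predicting the next latent state."""
    return 0.95 * state + 0.05 * np.tanh(action).mean() * np.ones_like(state)

def predicted_cost(state: np.ndarray, goal: np.ndarray) -> float:
    """Cost of a predicted state, e.g. distance to a goal latent."""
    return float(np.linalg.norm(state - goal))

def plan_by_random_shooting(state, goal, horizon=10, n_candidates=256, action_dim=7):
    rng = np.random.default_rng(0)
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, cost = state.copy(), 0.0
        for a in seq:                      # roll the world model forward
            s = learned_dynamics(s, a)
            cost += predicted_cost(s, goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]   # execute the first action, then replan (MPC-style)
```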

Human Video Learning

  • Learning manipulation primitives from internet-scale human demonstrations
  • Cross-embodiment transfer from human to robot morphology (a keypoint-retargeting sketch follows this list)
  • Action segmentation and correspondence discovery
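
One simple way to use human video, sketched below under strong assumptions, is to track a wrist keypoint per frame and retarget it into the robot's base frame as end-effector waypoints; `track_wrist` and `human_to_robot` are placeholder stubs, and a real pipeline would also need grasp detection, segmentation, and temporal alignment.

```python
# Minimal sketch (hypothetical pipeline, not the survey's method): turning human
# video into robot supervision by retargeting wrist keypoints to end-effector
# waypoints. The pose tracker and calibration transform are placeholder stubs.
import numpy as np

def track_wrist(frame) -> np.ndarray:
    """Stand-in for a hand-pose tracker returning a 3D wrist position."""
    return np.zeros(3)

def human_to_robot(p_human: np.ndarray) -> np.ndarray:
    """Map a human wrist position into the robot base frame.
    A real pipeline would estimate this rigid transform via calibration."""
    R = np.eye(3)
    t = np.array([0.3, 0.0, 0.1])
    return R @ p_human + t

def extract_waypoints(frames, stride=5) -> np.ndarray:
    """Subsample a human demonstration into robot end-effector waypoints."""
    waypoints = [human_to_robot(track_wrist(f)) for f in frames[::stride]]
    return np.stack(waypoints)   # (N, 3) trajectory usable as imitation targets
```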

Open Challenges

  • Memory mechanisms for long-horizon task execution
  • 4D spatiotemporal perception for dynamic environments
  • Efficient adaptation to novel objects and tasks
  • Multi-agent cooperative manipulation

Tags

multimodal · vision-language-action · robotic-manipulation · world-models
