Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Paper · Rui Shao, Wei Li, Lingsen Zhang et al. · August 18, 2025
Key Contribution
Taxonomy of VLM-based VLA architectures (monolithic vs hierarchical) with RL, world model, and human video integration
Abstract
Traditional rule-based robotic methods struggle in unstructured environments that demand adaptive behavior. Vision-Language-Action (VLA) models built on large pretrained vision-language models (VLMs) represent a transformative paradigm for robotic manipulation, which demands both precise motor control and multimodal understanding. This survey systematically examines VLA architectures, training strategies, integration with complementary learning paradigms, and emerging capabilities for next-generation robotic systems.
Key Contributions
- Architectural taxonomy distinguishing monolithic models (single/dual-system designs) from hierarchical models that explicitly decouple planning from execution via interpretable intermediate representations
- Analysis of how VLA models integrate reinforcement learning, training-free optimization, human video learning, and world models
- Systematic consolidation of architectural characteristics, operational strengths, and relevant datasets/benchmarks
- Identification of emerging capabilities: memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation
Architecture Taxonomy
Monolithic Models
- Single-system: End-to-end models mapping observations directly to actions
- Dual-system: Fast reactive system + slow deliberative system, inspired by dual-process theories of cognition (sketched below)
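
A minimal sketch of the dual-system idea, assuming a slow "System 2" VLM backbone that refreshes a latent plan at low frequency while a fast "System 1" action head runs at control rate. All names here (`SlowVLM`, `FastActionHead`, `control_loop`) are illustrative stand-ins, not APIs from any surveyed system:

```python
# Illustrative dual-system VLA control loop: the slow VLM replans every
# `replan_every` steps; the fast head produces an action every step.
import numpy as np

class SlowVLM:
    """Stand-in for a large VLM backbone: (image, instruction) -> latent plan."""
    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real system runs a multi-billion-parameter VLM here.
        return np.tanh(np.random.default_rng(0).normal(size=64))

class FastActionHead:
    """Stand-in for a lightweight policy head running at control frequency."""
    def act(self, latent: np.ndarray) -> np.ndarray:
        # A real system uses a small transformer or diffusion head.
        return 0.01 * latent[:7]  # e.g. a 7-DoF arm action

def control_loop(steps: int = 30, replan_every: int = 10) -> np.ndarray:
    vlm, head = SlowVLM(), FastActionHead()
    instruction, latent = "pick up the red block", None
    for t in range(steps):
        obs = np.zeros((224, 224, 3))     # placeholder camera frame
        if t % replan_every == 0:         # slow deliberative update
            latent = vlm.encode(obs, instruction)
        action = head.act(latent)         # fast reactive step
        # send `action` to the robot controller here
    return action

control_loop()
```
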
Hierarchical Models
- Explicit separation of high-level planning and low-level execution
- Interpretable intermediate representations (language subgoals, waypoints, affordance maps)
- Greater modularity and debuggability, at the cost of integration complexity (see the pipeline sketch below)
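
A minimal sketch of the hierarchical pattern under the assumption of a planner that emits language subgoals paired with waypoints, executed one at a time by a low-level controller. Both components are stubs; in real systems the planner is a VLM and the executor is a learned policy or motion planner:

```python
# Illustrative hierarchical VLA pipeline: interpretable subgoals sit
# between high-level planning and low-level execution.
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str                       # language subgoal, human-readable
    waypoint: tuple[float, float, float]   # target end-effector position (m)

def high_level_plan(instruction: str) -> list[Subgoal]:
    # A VLM would decompose the instruction; this stub returns a fixed plan.
    return [
        Subgoal("move above the mug", (0.4, 0.0, 0.3)),
        Subgoal("grasp the mug handle", (0.4, 0.0, 0.12)),
        Subgoal("lift the mug", (0.4, 0.0, 0.35)),
    ]

def low_level_execute(goal: Subgoal) -> bool:
    # A learned policy or motion planner would track the waypoint here.
    print(f"executing: {goal.description} -> {goal.waypoint}")
    return True

for goal in high_level_plan("pick up the mug"):
    if not low_level_execute(goal):
        break  # intermediate representations make failures easy to localize
```

Because the plan is expressed in language and waypoints, a failed step can be inspected and re-planned in isolation, which is the debuggability benefit the taxonomy highlights.
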
Integration with Learning Paradigms
Reinforcement Learning
- RL fine-tuning for contact-rich manipulation tasks
- Reward shaping using VLM-generated feedback signals (see the sketch after this list)
- Sim-to-real transfer with domain randomization
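
A minimal sketch of VLM-based reward shaping, assuming a VLM can be queried for a scalar task-progress score per frame; `score_progress` and `shaped_reward` are hypothetical names, and the VLM call is stubbed:

```python
# Illustrative reward shaping: a dense term from VLM-estimated progress
# is added to the environment's sparse success signal.
import numpy as np

def score_progress(frame: np.ndarray, instruction: str) -> float:
    """Stand-in for asking a VLM "how close is this frame to success?".
    A real system would call a vision-language model and parse its answer."""
    return float(frame.mean())  # placeholder score in [0, 1]

def shaped_reward(prev_frame: np.ndarray, frame: np.ndarray,
                  instruction: str, sparse_done: bool) -> float:
    # Dense shaping term: change in estimated progress between frames.
    dense = (score_progress(frame, instruction)
             - score_progress(prev_frame, instruction))
    return dense + (10.0 if sparse_done else 0.0)  # plus sparse success bonus

prev, cur = np.zeros((64, 64)), 0.3 * np.ones((64, 64))
r = shaped_reward(prev, cur, "insert the plug", sparse_done=False)  # -> 0.3
```
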
World Models
- Predictive models enabling look-ahead planning (see the sketch after this list)
- Video prediction for action consequence estimation
- Physics-informed world models for manipulation planning
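
A minimal sketch of look-ahead planning with a learned world model, using random-shooting model-predictive control as one standard instantiation; `WorldModel.predict` stands in for a learned video or latent dynamics model, and all names are illustrative:

```python
# Illustrative look-ahead planning: sample action sequences, roll each out
# inside the world model, execute only the best first action, then replan.
import numpy as np

class WorldModel:
    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A real model predicts the next observation/latent; stub: linear drift.
        return state + 0.1 * action

def plan_with_lookahead(model: WorldModel, state: np.ndarray, goal: np.ndarray,
                        horizon: int = 5, n_samples: int = 64) -> np.ndarray:
    rng = np.random.default_rng(0)
    best_cost, best_first_action = np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        s = state
        for a in seq:
            s = model.predict(s, a)        # imagined rollout, no real execution
        cost = np.linalg.norm(s - goal)    # imagined end-state distance to goal
        if cost < best_cost:
            best_cost, best_first_action = cost, seq[0]
    return best_first_action               # execute one step, then replan

a0 = plan_with_lookahead(WorldModel(), np.zeros(3), np.array([0.5, 0.2, 0.1]))
```

Imagined rollouts let the policy estimate action consequences before touching the real scene, which is what makes video prediction useful for manipulation planning.
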
Human Video Learning
- Learning manipulation primitives from internet-scale human demonstrations
- Cross-embodiment transfer from human to robot morphology (see the sketch after this list)
- Action segmentation and correspondence discovery
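
A minimal sketch of one common cross-embodiment recipe: track the human hand in demonstration video, then geometrically retarget the wrist trajectory into the robot's workspace. The tracker and the trajectory data are stubbed, and real pipelines additionally solve inverse kinematics and align grasp frames:

```python
# Illustrative human-video-to-robot retargeting (hand tracking stubbed).
import numpy as np

def extract_hand_trajectory(video_frames: list) -> np.ndarray:
    """Stand-in for a hand/pose tracker run on human demonstration video.
    Returns per-frame 3D wrist positions; real systems use off-the-shelf
    hand-pose estimators."""
    t = np.linspace(0, 1, len(video_frames))
    return np.stack([0.3 + 0.2 * t,                 # x: reach forward
                     0.1 * np.sin(2 * np.pi * t),   # y: lateral sweep
                     0.2 - 0.1 * t], axis=1)        # z: lower toward object

def retarget_to_robot(hand_traj: np.ndarray, scale: float = 0.8,
                      offset=(0.1, 0.0, 0.05)) -> np.ndarray:
    """Map human wrist motion into the robot's workspace; here only the
    geometric mapping, omitting IK and gripper correspondence."""
    return scale * hand_traj + np.asarray(offset)

frames = [None] * 50   # placeholder for decoded video frames
robot_waypoints = retarget_to_robot(extract_hand_trajectory(frames))
```
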
Open Challenges
- Memory mechanisms for long-horizon task execution
- 4D spatiotemporal perception for dynamic environments
- Efficient adaptation to novel objects and tasks
- Multi-agent cooperative manipulation
Tags
multimodal · vision-language-action · robotic-manipulation · world-models