A Survey on Efficient Vision-Language-Action Models
Paper: Zhaoshu Yu, Bo Wang, Pengpeng Zeng et al., October 27, 2025
Key Contribution
First comprehensive taxonomy for VLA efficiency across model design, training, and data collection pillars
Abstract
Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action generation into unified systems for embodied AI tasks. However, foundational VLAs are hindered by the prohibitive computational and data demands inherent to their large-scale architectures. This survey addresses how to make VLA models more practical by reducing their computational and data requirements. The authors establish the first comprehensive taxonomy organizing efficiency techniques across the complete pipeline of model development, training, and data handling.
Key Contributions
- Unified framework organizing efficiency techniques across the full VLA pipeline: visual encoder, LLM backbone, action decoder
- Three core pillars: Efficient Model Design (architectures and compression), Efficient Training (reducing computational burdens), Efficient Data Collection (robotic data acquisition and utilization)
- Analysis of representative applications across embodied AI tasks
- Continuously updated project page tracking latest developments
- Identification of open challenges and future research directions for practical VLA deployment
VLA Pipeline Architecture
The standard VLA pipeline consists of three stages:
- Visual Encoder — processes visual observations into feature representations
- LLM Backbone — integrates visual features with language instructions for reasoning and planning
- Action Decoder — translates LLM outputs into executable robot actions
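The three-stage pipeline above can be sketched end to end. This is an illustrative toy, not code from the paper: all class and function names (`VisualEncoder`, `LLMBackbone`, `ActionDecoder`, `vla_step`) are hypothetical stand-ins, and the "models" are trivial NumPy placeholders that only show how data flows between stages.

```python
import numpy as np

class VisualEncoder:
    """Stand-in for a ViT/CNN: maps an image to a fixed-size feature vector."""
    def encode(self, image: np.ndarray) -> np.ndarray:
        return image.reshape(-1)[:64].astype(np.float32)

class LLMBackbone:
    """Stand-in for multimodal fusion: combines visual features with a toy
    text embedding derived from the instruction."""
    def reason(self, visual_feats: np.ndarray, instruction: str) -> np.ndarray:
        text_feats = np.full(16, float(len(instruction)), dtype=np.float32)
        return np.concatenate([visual_feats, text_feats])

class ActionDecoder:
    """Stand-in decoder: maps the fused representation to a 7-DoF action
    (e.g. end-effector pose plus gripper), bounded by tanh."""
    def decode(self, hidden: np.ndarray) -> np.ndarray:
        return np.tanh(hidden[:7])

def vla_step(image: np.ndarray, instruction: str) -> np.ndarray:
    feats = VisualEncoder().encode(image)
    hidden = LLMBackbone().reason(feats, instruction)
    return ActionDecoder().decode(hidden)

action = vla_step(np.zeros((8, 8, 3)), "pick up the red cube")
```

Each stage in a real VLA is a large learned model; the efficiency techniques surveyed below target exactly these three components.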
Efficiency Techniques
Model Design
- Architecture compression (pruning, quantization, distillation)
- Lightweight adapter modules for domain transfer
- Efficient attention mechanisms that reduce self-attention's quadratic cost in sequence length
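Of the compression techniques listed, quantization is the simplest to illustrate. Below is a minimal sketch of symmetric per-tensor int8 post-training quantization, one common variant (the paper surveys many schemes; this is not presented as its method):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map floats to int8 with one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; rounding error is at most scale/2.
max_err = np.abs(w - w_hat).max()
```

Per-channel scales and calibration on real activations typically recover most of the accuracy lost by this naive per-tensor scheme.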
Training Efficiency
- Parameter-efficient fine-tuning (LoRA, adapters)
- Curriculum learning strategies
- Multi-task training with shared representations
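The core idea behind LoRA, the most widely used parameter-efficient fine-tuning method mentioned above, is to freeze the pretrained weight and train only a low-rank additive update. A minimal NumPy sketch (forward pass only; the class name and initialization follow the standard LoRA recipe, not any specific VLA codebase):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (B @ A).

    Only r * (d_in + d_out) parameters are trainable instead of d_in * d_out.
    """
    def __init__(self, W: np.ndarray, r: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen pretrained weight
        self.A = np.random.randn(r, d_in) * 0.01      # trainable down-projection
        self.B = np.zeros((d_out, r))                 # trainable up-projection, zero init
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Base path plus scaled low-rank path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.random.randn(512, 512)
layer = LoRALinear(W, r=8)
x = np.random.randn(4, 512)
y = layer.forward(x)
# Because B is zero-initialized, the adapted layer initially matches the
# base layer exactly, so fine-tuning starts from the pretrained behavior.
```

Here the trainable parameter count is 2 × 8 × 512 ≈ 3% of the full 512 × 512 weight, which is what makes fine-tuning large VLA backbones tractable.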
Data Efficiency
- Simulation-to-real transfer reducing physical data needs
- Data augmentation strategies for robotic demonstrations
- Active learning for targeted data collection
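Active learning for targeted data collection can be sketched with uncertainty sampling: rank candidate observations by the entropy of the policy's predicted action distribution and request demonstrations only for the most uncertain ones. The policy outputs below are toy values and the function names are illustrative, not from the paper:

```python
import numpy as np

def entropy(p: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of a probability matrix."""
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def select_for_labeling(action_probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` most uncertain candidates."""
    scores = entropy(action_probs)
    return np.argsort(scores)[::-1][:budget]

# Toy predicted action distributions for three candidate observations.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident  -> low collection priority
    [0.34, 0.33, 0.33],   # uncertain  -> high collection priority
    [0.70, 0.20, 0.10],
])
picked = select_for_labeling(probs, budget=1)
```

Spending the demonstration budget on high-uncertainty states is what lets active collection reduce the number of costly physical robot demonstrations.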
Tags
multimodal, vision-language-action, efficiency, robotics-ai