A Survey on Efficient Vision-Language-Action Models

Paper
Zhaoshu Yu, Bo Wang, Pengpeng Zeng et al. (October 27, 2025)
Key Contribution

First comprehensive taxonomy of VLA efficiency techniques, organized around three pillars: model design, training, and data collection


Abstract

Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action generation into unified systems for embodied AI tasks. However, foundational VLAs are hindered by the prohibitive computational and data demands inherent to their large-scale architectures. This survey addresses how to make VLA models more practical by reducing their computational and data requirements. The authors establish the first comprehensive taxonomy organizing efficiency techniques across the complete pipeline of model development, training, and data handling.

Key Contributions

  • Unified framework organizing efficiency techniques across the full VLA pipeline: visual encoder, LLM backbone, action decoder
  • Three core pillars: Efficient Model Design (architectures and compression), Efficient Training (reducing computational burdens), Efficient Data Collection (robotic data acquisition and utilization)
  • Analysis of representative applications across embodied AI tasks
  • A continuously updated project page tracking the latest developments
  • Identification of open challenges and future research directions for practical VLA deployment

VLA Pipeline Architecture

The standard VLA pipeline consists of three stages:

  1. Visual Encoder — processes visual observations into feature representations
  2. LLM Backbone — integrates visual features with language instructions for reasoning and planning
  3. Action Decoder — translates LLM outputs into executable robot actions
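As a concrete illustration, the three stages above can be sketched as composed functions. This is a toy sketch only: the function names and the trivial logic inside each stage are illustrative placeholders, not drawn from any specific VLA implementation.

```python
# Toy sketch of the three-stage VLA pipeline; all names are illustrative.

def visual_encoder(image):
    """Stage 1: map raw pixel rows to a compact feature vector (toy: row means)."""
    return [sum(row) / len(row) for row in image]

def llm_backbone(visual_features, instruction):
    """Stage 2: fuse visual features with the language instruction (toy: bundle them)."""
    return {"features": visual_features, "instruction": instruction}

def action_decoder(plan):
    """Stage 3: turn the fused representation into an executable action (toy rule)."""
    return {"action": "grasp" if "pick" in plan["instruction"] else "idle"}

def vla_pipeline(image, instruction):
    """Compose the three stages: observation + instruction -> robot action."""
    return action_decoder(llm_backbone(visual_encoder(image), instruction))
```

In a real system each stage is a large neural module (e.g., a ViT, an LLM, and a diffusion or autoregressive action head); the composition structure, however, is exactly this.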

Efficiency Techniques

Model Design

  • Architecture compression (pruning, quantization, distillation)
  • Lightweight adapter modules for domain transfer
  • Efficient attention mechanisms that avoid the quadratic cost of standard self-attention
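Of these compression techniques, quantization is the simplest to show concretely: weights are mapped to low-bit integers plus a scale factor, shrinking memory and bandwidth. A minimal sketch of symmetric per-tensor int8 quantization (helper names are illustrative, not from any library):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: q = round(w / scale),
    with scale chosen so the largest magnitude maps to 127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensors
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights: w ~= q * scale."""
    return [q * scale for q in quantized]
```

Production pipelines (e.g., per-channel scales, activation quantization, calibration data) are more involved, but this captures the core idea.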

Training Efficiency

  • Parameter-efficient fine-tuning (LoRA, adapters)
  • Curriculum learning strategies
  • Multi-task training with shared representations
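LoRA, the most widely used of these, freezes the pretrained weight matrix W and learns only a low-rank update scaled by alpha/r, so the number of trainable parameters drops from d*k to r*(d+k). A pure-Python sketch with tiny matrices (names and shapes are illustrative):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """y = x @ (W + (alpha / r) * A @ B).
    W is frozen; only the low-rank factors A (d x r) and B (r x k) train."""
    delta = matmul(A, B)
    scale = alpha / r
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul([x], W_eff)[0]
```

With B initialized to zero (the standard LoRA initialization), the adapted model starts out exactly equal to the frozen base model, which is what makes fine-tuning stable.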

Data Efficiency

  • Simulation-to-real (sim-to-real) transfer, reducing the need for physical robot data
  • Data augmentation strategies for robotic demonstrations
  • Active learning for targeted data collection
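A simple, common form of demonstration augmentation is injecting small Gaussian noise into recorded actions to synthesize additional trajectories from one demonstration. A minimal sketch (function name and parameters are illustrative assumptions, not a standard API):

```python
import random

def augment_demo(actions, sigma=0.01, copies=3, seed=0):
    """Create `copies` noisy variants of one demonstration.
    `actions` is a list of timesteps, each a list of action dimensions;
    each value is perturbed with Gaussian noise of std `sigma`."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [[[a + rng.gauss(0.0, sigma) for a in step] for step in actions]
            for _ in range(copies)]
```

Real pipelines often add noise to observations as well, or replay perturbed trajectories in simulation to relabel actions, but the multiplier effect on dataset size is the same.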

Tags

multimodal · vision-language-action · efficiency · robotics-ai
