From Vision-Language Models to Robot Control — Without Forgetting (Princeton)
VLM2VLA pipeline converts VLMs into VLA policies without catastrophic forgetting, using a natural-language action representation
Abstract
Princeton's AI Lab published VLM2VLA, a pipeline and training methodology for converting general-purpose vision-language models (VLMs) into vision-language-action (VLA) policies for robot manipulation without catastrophic forgetting of the VLM's foundational perceptual and reasoning capabilities. Robot actions are expressed in natural language (e.g., "to grasp the object, move forward and slightly left, then move significantly downward before closing the gripper") and translated into low-level joint commands.
Key Contributions
- No catastrophic forgetting: a chronic problem in VLA fine-tuning is that adapting a VLM for robot control degrades its general perception and reasoning. VLM2VLA's pipeline preserves the base VLM's capabilities.
- Natural-language action representation: robot actions are output as natural-language descriptions before being mapped to low-level commands, which makes policies auditable, interpretable, and debuggable.
- Data pipeline and training methodology: not a single model, but a recipe for converting any VLM into a VLA.
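To make the natural-language action idea concrete, here is a minimal Python sketch of how a text plan could be turned into motion steps. It is an illustration under stated assumptions, not the VLM2VLA translator: the vocabulary, the step sizes in metres, the axis convention, and the names `parse_plan` and `Step` are all hypothetical, and the output is an end-effector displacement that a downstream controller would still have to convert into joint commands.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical vocabulary (assumption); axes are x = forward, y = left, z = up.
DIRECTIONS = {
    "forward": (1.0, 0.0, 0.0),
    "backward": (-1.0, 0.0, 0.0),
    "left": (0.0, 1.0, 0.0),
    "right": (0.0, -1.0, 0.0),
    "upward": (0.0, 0.0, 1.0),
    "downward": (0.0, 0.0, -1.0),
}
# Hypothetical step sizes in metres (assumption, not from the source).
MAGNITUDES = {"slightly": 0.01, "significantly": 0.10}
DEFAULT_STEP = 0.05
GRIPPER_WORDS = {"close": "close", "closing": "close",
                 "open": "open", "opening": "open"}


@dataclass
class Step:
    delta: Tuple[float, float, float]  # end-effector displacement in metres
    gripper: Optional[str] = None      # "open", "close", or None


def parse_plan(text: str) -> List[Step]:
    """Split a plan into clauses and turn each clause into one Step."""
    steps = []
    for clause in re.split(r"\b(?:then|before)\b", text.lower()):
        delta = [0.0, 0.0, 0.0]
        step_size = DEFAULT_STEP
        gripper = None
        moved = False
        for word in re.findall(r"[a-z]+", clause):
            if word in MAGNITUDES:
                # A magnitude word scales the next direction word only.
                step_size = MAGNITUDES[word]
            elif word in DIRECTIONS:
                for axis, component in enumerate(DIRECTIONS[word]):
                    delta[axis] += component * step_size
                step_size = DEFAULT_STEP
                moved = True
            elif word in GRIPPER_WORDS:
                gripper = GRIPPER_WORDS[word]
        # Skip clauses that contain neither motion nor a gripper command.
        if moved or gripper:
            steps.append(Step(tuple(delta), gripper))
    return steps


plan = parse_plan(
    "to grasp the object, move forward and slightly left, "
    "then move significantly downward before closing the gripper"
)
for step in plan:
    print(step)
```

For the example sentence from the abstract, this yields three steps: a forward-and-left move, a downward move, and a gripper close. Because the intermediate plan is plain text, each step can be logged and inspected before anything is sent to the robot, which is the auditability benefit the bullet above describes.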
Methodology
The paper describes a fine-tuning regime with three parts: it interleaves robot manipulation data with general VLM training data, applies regularization to keep the fine-tuned representations close to those of the base model, and structures the policy's output as natural-language action plans that are then converted to low-level commands.
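The first two ingredients can be sketched in a few lines of Python. This is a hedged illustration, not the paper's recipe: the summary does not state the mixing ratio or the form of the regularizer, so the `robot_fraction` parameter, the squared-distance penalty on features, and the function names `interleave`, `drift_penalty`, and `total_loss` are all assumptions made for the example.

```python
import random
from typing import List, Sequence


def interleave(robot_data: Sequence, general_data: Sequence,
               robot_fraction: float, num_samples: int,
               seed: int = 0) -> List:
    """Build a training stream that draws from both sources at random.

    robot_fraction is the probability that a sample comes from robot data
    (an assumed knob; the source does not give a mixing ratio).
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(num_samples):
        source = robot_data if rng.random() < robot_fraction else general_data
        stream.append(rng.choice(source))
    return stream


def drift_penalty(features: Sequence[float],
                  base_features: Sequence[float],
                  strength: float) -> float:
    """Scaled squared distance between fine-tuned and base-model features.

    A squared-distance penalty is one common choice; it is assumed here,
    not taken from the paper.
    """
    return strength * sum((f - b) ** 2 for f, b in zip(features, base_features))


def total_loss(task_loss: float, features: Sequence[float],
               base_features: Sequence[float], strength: float = 0.1) -> float:
    """Task loss plus the penalty for drifting away from the base model."""
    return task_loss + drift_penalty(features, base_features, strength)


# Placeholder sample names, used only to show the mixing behaviour.
robot = ["robot_episode_0", "robot_episode_1"]
general = ["vqa_sample_0", "caption_sample_0"]
batch = interleave(robot, general, robot_fraction=0.5, num_samples=8)

loss = total_loss(task_loss=1.0, features=[1.0, 2.0],
                  base_features=[1.0, 0.0], strength=0.5)
print(batch)
print(loss)
```

The design intent is that general-purpose samples keep exercising the base model's original skills during fine-tuning, while the penalty term grows as the fine-tuned features move away from the base model's features on the same input.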
Results
- Validated across multiple manipulation benchmarks (specifics in full paper).
- Preserves VLM-level performance on standard VL benchmarks while learning manipulation policies.
- Natural-language action representation supports downstream debugging and human oversight.
Limitations
- Adds latency from the language → action translation step.
- Sample-efficiency gains over OpenVLA-OFT and similar approaches are not yet established; a direct comparison is needed.
- Manipulation skill ceiling may be lower than that of highly specialized non-VLM-based policies (e.g., diffusion policy variants).
Full Content
VLM2VLA is one of several April 2026 papers attacking the same problem: how to convert powerful general-purpose foundation models into competent robot controllers without losing what made them general-purpose. The Princeton angle, natural-language action representation, prioritizes interpretability and debuggability, which is useful for safety-critical deployments.
This is part of the broader VLA wave: NVIDIA's GR00T N1.7 (a March 2026 open-source release with 20K hours of EgoScale human-video pretraining), the Humanoid-COA framework for zero-shot loco-manipulation, and OpenVLA-OFT for action-generation efficiency. The category is moving from "can a VLM control a robot" to "what's the most efficient and most capability-preserving way to do it." VLM2VLA is a strong entry on the capability-preservation axis.
Practical implication: as VLMs improve (Gemini 3, GPT-5, Claude Opus 4.7), VLA policies built on them inherit those gains via VLM2VLA-style pipelines without retraining from scratch. This makes VLA a category that moves at VLM speed, not robotics-data-collection speed.
Source: Princeton AI Lab Blog — From Vision-Language Models to Robot Control, Without Forgetting, April 23, 2026