Foundation Models for Robotics

Active Frontier
foundation-models, vision-language, zero-shot, gpt-4v


Foundation models — large-scale vision-language models like GPT-4V — are enabling a paradigm shift in robot control: from task-specific training to zero-shot behavior generation from natural language instructions. Rather than collecting thousands of demonstrations for each new task, robots can leverage the broad knowledge already encoded in these models to interpret scenes, decompose tasks, and generate executable actions.

Wen et al.'s Humanoid-COA is the first framework that integrates foundation model reasoning specifically for zero-shot humanoid loco-manipulation. The architecture follows a perception-reasoning-action paradigm: GPT-4V handles scene understanding and spatial reasoning, the Chain of Action (CoA) mechanism decomposes high-level instructions into a sequence of actionable subtasks, and pre-trained locomotion/manipulation controllers execute the resulting behaviors.

The key insight is separation of concerns — the foundation model handles semantic understanding and task planning (what to do and in what order), while specialized controllers handle the physics of execution (how to move). This avoids the need to train end-to-end policies that must simultaneously understand language and control actuators.
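This separation of concerns can be sketched as a thin dispatch layer: a reasoning step (stubbed below; in Humanoid-COA this would be a prompted GPT-4V call on the camera image) emits an ordered chain of subtasks, and each subtask is routed to a pre-trained low-level controller. All names here (`Subtask`, `reason`, `CONTROLLERS`) are illustrative, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    skill: str    # which pre-trained controller to invoke, e.g. "walk_to", "grasp"
    target: str   # semantic target resolved by the vision-language model

def reason(instruction: str) -> list[Subtask]:
    """Stub for the foundation-model step: decompose a natural-language
    instruction into an ordered chain of subtasks (Chain of Action)."""
    # A real system would prompt the VLM with the scene image here;
    # this toy parser only handles one instruction shape.
    if "pick up the " in instruction:
        obj = instruction.split("pick up the ")[1].split()[0]
        return [Subtask("walk_to", obj), Subtask("grasp", obj)]
    return []

# Pre-trained locomotion/manipulation controllers, keyed by skill name.
# Each stub just reports what it would execute.
CONTROLLERS: dict[str, Callable[[str], str]] = {
    "walk_to": lambda target: f"locomotion -> {target}",
    "grasp":   lambda target: f"manipulation -> {target}",
}

def execute(instruction: str) -> list[str]:
    """One pass of the perception-reasoning-action loop:
    plan with the (stubbed) foundation model, then dispatch each
    subtask to its specialized controller."""
    return [CONTROLLERS[s.skill](s.target) for s in reason(instruction)]

print(execute("pick up the cup"))
```

Note that the foundation model never outputs joint torques; it only selects *which* controller runs and *on what*, which is exactly why no end-to-end language-to-actuator training is required.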

Gu et al.'s survey contextualizes this within the broader landscape, noting that learning-based approaches are increasingly displacing model-based methods that required explicit task-specific engineering.

Key Claims

  • First framework for zero-shot humanoid loco-manipulation via foundation models — Humanoid-COA integrates GPT-4V reasoning with pre-trained controllers, achieving 96.6% grasping and 90% mobile pick without task-specific training. Evidence: strong (Humanoid-COA: Chain of Action)
  • CoA decomposes high-level instructions into executable whole-body behaviors — Chain of Action bridges the gap between language understanding and physical execution through structured task decomposition. Evidence: strong (Humanoid-COA: Chain of Action)
  • Learning-based methods are displacing model-based approaches — Three-decade survey confirms the trend toward data-driven rather than physics-engineered robot behaviors. Evidence: strong (Humanoid Locomotion & Manipulation Survey)

Open Questions

  • How does dependence on external APIs (GPT-4V) affect reliability and latency for real-time robot control?
  • Can foundation models generalize to truly novel environments they were not exposed to during pre-training?
  • What happens when the semantic understanding of the scene is incorrect — how do error recovery mechanisms work?
  • Is the perception-reasoning-action separation optimal, or will end-to-end models eventually outperform modular approaches?

Related Concepts

  • Humanoid Loco-Manipulation — The primary application domain for foundation model-driven robot control
  • World Models — Internal predictive models that complement foundation model reasoning

Related Entities

  • Unitree — H1-2 and G1 used as test platforms for Humanoid-COA
