Humanoid Agent via Embodied Chain-of-Action Reasoning for Zero-Shot Loco-Manipulation

Paper
Congcong Wen et al.; NYU, Harvard, UCL, University of Liverpool; April 13, 2025
Key Contribution

First zero-shot loco-manipulation framework using foundation models on physical humanoids

Abstract

The paper introduces Humanoid-COA, a framework for humanoid loco-manipulation that integrates whole-body movement with object manipulation, driven by natural language instructions. Its Embodied Chain-of-Action (CoA) mechanism decomposes high-level human instructions into structured sequences of locomotion and manipulation primitives through affordance analysis, spatial reasoning, and whole-body action planning.
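
As a rough illustration of what a "structured sequence of locomotion and manipulation primitives" could look like, the sketch below encodes a plan as a list of typed steps. The primitive names (`walk_to`, `grasp`, `place`), the `Step` class, and the example instruction are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    LOCOMOTION = "locomotion"
    MANIPULATION = "manipulation"


@dataclass(frozen=True)
class Step:
    kind: Kind
    primitive: str  # hypothetical primitive name, e.g. "walk_to"
    target: str     # object or region the primitive refers to


# Hand-written plan for the hypothetical instruction
# "bring the cup from the table to the shelf".
plan = [
    Step(Kind.LOCOMOTION, "walk_to", "table"),
    Step(Kind.MANIPULATION, "grasp", "cup"),
    Step(Kind.LOCOMOTION, "walk_to", "shelf"),
    Step(Kind.MANIPULATION, "place", "shelf"),
]


def summarize(steps):
    """Return a one-line, human-readable rendering of a plan."""
    return " -> ".join(f"{s.primitive}({s.target})" for s in steps)


print(summarize(plan))
```

Keeping each step tagged as locomotion or manipulation makes it easy to route it to the matching low-level controller later on.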

Key Contributions

  • First humanoid agent framework integrating foundation model reasoning for zero-shot loco-manipulation under natural language instructions
  • Novel Embodied CoA mechanism decomposing high-level intent into executable whole-body behaviors for long-horizon tasks
  • Real-world validation demonstrating robust zero-shot generalization across diverse loco-manipulation tasks on two physical platforms

Methodology

Perception-reasoning-action paradigm:

  • Perception: GPT-4V converts RGB-D observations into scene descriptions
  • Reasoning: CoA integrates object affordance analysis, region spatial reasoning, and whole-body movement inference
  • Execution: Grounds symbolic plans into motor commands via pre-trained controllers
  • Tested on Unitree H1-2 and G1 humanoid robots
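
The perception-reasoning-action stages above can be sketched as three functions chained together. This is a minimal offline sketch, not the authors' implementation: `perceive`, `reason`, `execute`, the controller names, and the hard-coded plan are all assumptions, and the foundation-model calls are replaced by stubs so the code runs without any external service.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, str]]  # (primitive name, target); hypothetical format


def perceive(rgbd_frame: dict) -> str:
    """Stand-in for the vision-language model: returns a text scene description."""
    objects = ", ".join(sorted(rgbd_frame["objects"]))
    return f"Visible objects: {objects}"


def reason(instruction: str, scene: str) -> Plan:
    """Stand-in for CoA reasoning; a real system would query a foundation model."""
    # Hard-coded plan so the sketch runs offline.
    return [("walk_to", "table"), ("grasp", "cup")]


def execute(plan: Plan, controllers: Dict[str, Callable[[str], bool]]) -> bool:
    """Dispatch each symbolic step to a controller; stop at the first failure."""
    for primitive, target in plan:
        controller = controllers.get(primitive)
        if controller is None or not controller(target):
            return False
    return True


log: List[str] = []


def make_controller(name: str) -> Callable[[str], bool]:
    """Build a dummy controller that records the call and reports success."""
    def controller(target: str) -> bool:
        log.append(f"{name}:{target}")
        return True
    return controller


controllers = {name: make_controller(name) for name in ("walk_to", "grasp")}
scene = perceive({"objects": ["table", "cup"]})
ok = execute(reason("pick up the cup", scene), controllers)
print(scene, ok, log)
```

In this sketch an unknown primitive or a failed controller call simply aborts the plan, which mirrors the lack of a failure-recovery mechanism noted under Limitations.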

Results

  • Manipulation: 96.6% grasping, 93.3% relocation, 73.3% rearrangement
  • Locomotion: 96.6% target approach, 63.3% navigation under occlusion
  • Loco-Manipulation: 90.0% mobile pick, 96.6% mobile place, 63.3% long-horizon combined
  • Without all three CoA components: only 50% executability

Limitations

  • Complex rearrangement tasks show lower reliability (73.3%)
  • Long-horizon combined tasks remain challenging (56-63%)
  • Dependence on pre-trained foundation models (GPT-4, GPT-4V)
  • No adaptation mechanism for failure recovery

Source: Humanoid-COA by Wen et al., NYU/Harvard/UCL

Tags

humanoid, foundation-models, loco-manipulation, zero-shot, unitree
