CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Abstract

Vision-language-action models (VLAs) have shown potential in leveraging pre-trained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they lack temporal planning and reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. We demonstrate that CoT-VLA achieves strong performance on manipulation tasks across both real-world and simulation benchmarks.
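The abstract describes a two-stage inference procedure: first autoregressively generate tokens of a future "goal" frame, then generate a short action chunk conditioned on that predicted goal. The sketch below illustrates that control flow only; the class and method names (`VLABackbone`, `visual_cot_step`), the token counts, and the 8-step chunk of 7-DoF actions are hypothetical placeholders under our assumptions, not the released CoT-VLA API.

```python
import random

# Hypothetical stand-in for the 7B backbone; a real model autoregressively
# emits discrete tokens from a shared visual/action vocabulary.
class VLABackbone:
    def __init__(self, vocab_size=1024):
        self.vocab_size = vocab_size

    def next_token(self, context):
        # Placeholder sampler: a real model would condition on the full context.
        return random.randrange(self.vocab_size)


def visual_cot_step(model, image_tokens, text_tokens,
                    num_goal_tokens=256, num_action_tokens=7 * 8):
    """One control step: predict a visual subgoal, then a short action chunk."""
    context = list(text_tokens) + list(image_tokens)

    # Stage 1: autoregressively generate the tokens of a future (goal) frame.
    goal_tokens = []
    for _ in range(num_goal_tokens):
        goal_tokens.append(model.next_token(context + goal_tokens))

    # Stage 2: generate a short action sequence conditioned on the observation,
    # the instruction, and the predicted visual goal.
    action_tokens = []
    for _ in range(num_action_tokens):
        action_tokens.append(model.next_token(context + goal_tokens + action_tokens))

    return goal_tokens, action_tokens


if __name__ == "__main__":
    model = VLABackbone()
    obs_tokens = [random.randrange(1024) for _ in range(256)]   # tokenized camera frame
    instruction = [random.randrange(1024) for _ in range(16)]   # tokenized language command
    goal, actions = visual_cot_step(model, obs_tokens, instruction)
    print(len(goal), "goal tokens,", len(actions), "action tokens")
```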

Publication
Proceedings of the Computer Vision and Pattern Recognition Conference
Yao (Jason) Lu
Senior Research Scientist

Senior Research Scientist at NVIDIA Research.

Song Han
Associate Professor

Song Han is an associate professor at MIT EECS.