Large Video Planner Enables Generalizable Robot Control

arXiv 2025
* Equal Contribution

  • ¹MIT      ²UC Berkeley      ³Harvard

TL;DR: LVP is a video foundation model for robotics. It generates a video plan, which is then deployed on robots. We ask independent evaluators to propose arbitrary tasks and scenes to demonstrate its generalization.

Abstract

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning.

Robot Experiments (Novel Scenes and Tasks)

Given the generated video plans, we first reconstruct and track the hand in 3D, then retarget the hand motion to the robot hand, and execute the actions on the real robot.
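The retargeting step above can be illustrated with a minimal sketch. It is not the released pipeline: it assumes a hypothetical hand tracker that returns wrist, thumb-tip, and index-tip positions per frame in the camera frame, a known camera-to-robot rigid transform, and a simple parallel-jaw gripper in place of a full robot hand. All names (`HandFrame`, `GripperTarget`, `retarget`, `max_opening`) are illustrative.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class HandFrame:
    """Hypothetical per-frame output of a 3D hand tracker (camera frame, meters)."""
    wrist: Vec3
    thumb_tip: Vec3
    index_tip: Vec3


@dataclass
class GripperTarget:
    """Simplified robot command: end-effector position and gripper opening."""
    position: Vec3   # in the robot base frame, meters
    opening: float   # distance between the gripper jaws, meters


def transform_point(rotation: Sequence[Sequence[float]],
                    translation: Vec3, p: Vec3) -> Vec3:
    """Apply a rigid transform (3x3 rotation given as rows, plus translation)."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def retarget(frames: Sequence[HandFrame],
             rotation: Sequence[Sequence[float]],
             translation: Vec3,
             max_opening: float = 0.08) -> List[GripperTarget]:
    """Map tracked hand frames to gripper targets, one per video frame.

    Assumption: the wrist stands in for the end-effector position, and the
    thumb-to-index distance stands in for the gripper opening.
    """
    targets = []
    for f in frames:
        position = transform_point(rotation, translation, f.wrist)
        # Clamp the pinch distance to what the (assumed) gripper can open.
        opening = min(dist(f.thumb_tip, f.index_tip), max_opening)
        targets.append(GripperTarget(position, opening))
    return targets
```

Calling `retarget(frames, rotation, translation)` on the tracked frames of a generated video would yield one target per frame, which a robot controller could then follow in sequence. A dexterous hand would need per-finger joint retargeting instead of the single opening value used here.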

Multi-stage Video Planning
(Generated videos on third-party collected scenes)

Large Video Planner generates multi-stage video plans on scenes collected by third parties.

Zero-shot Prompt Following
(All videos are generated)

Large Video Planner enables zero-shot prompt following for novel scenes and tasks.

Place bottle on paper

Place orange in red bowl

Press stapler

Grab the silver metal cup on the left side of the table

Pour water into the small gray metal cup on the left

Pick up the dark green cup on the right



Results Gallery
(All videos are generated)

Additional examples of zero-shot prompt following on novel scenes and tasks.

Place screwdriver on the case

Place silver mug

Grab the black gas nozzle

Pull out a tissue

Pull the straw out of the lid

Reaches for the silver doorknob to open the gate

Qualitative Comparisons

Ours

Press the button to flush the toilet

Move the mouse leftwards on the keyboard

Turn the book to the next page

Wan I2V 14B

Hunyuan 14B

Cosmos Predict 2

BibTeX


@misc{chen2025largevideoplanner,
  title={Large Video Planner Enables Generalizable Robot Control},
  author={Boyuan Chen and Tianyuan Zhang and Haoran Geng and Kiwhan Song and William T. Freeman and Jitendra Malik and Russ Tedrake and Vincent Sitzmann and Yilun Du},
  year={2025},
  eprint={2512.15840},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={http://arxiv.org/abs/2512.15840}, 
}