NVIDIA Spatial Intelligence Lab

ChronoEdit:
Towards Temporal Reasoning for Image Editing and World Simulation

Huan Ling1,*,†
1 NVIDIA    2 University of Toronto
* equal contribution    † corresponding author

TL;DR: ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that denoises with “temporal reasoning tokens” to “reason” about physically plausible edits.

Method


Figure: ChronoEdit method overview.

Overview of the ChronoEdit pipeline. From right to left, the denoising process begins in the temporal reasoning stage, where the model imagines and denoises a short trajectory of intermediate frames. These intermediate frames act as reasoning tokens, guiding how the edit should unfold in a physically consistent manner. For efficiency, the reasoning tokens are discarded in the subsequent editing frame generation stage, where the target frame is further refined into the final edited image.
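
The two-stage schedule described above can be illustrated with a minimal sketch. This is not the authors' implementation: `two_stage_denoise`, `toy_denoise_step`, the step counts, and the flat-list latents are all hypothetical stand-ins chosen for illustration, and the toy denoiser simply pulls each noisy frame toward the input frame. The only point shown is the control flow: the reasoning frames are denoised together with the target frame for the first few steps, then dropped so that the remaining steps run on a shorter sequence.

```python
import random

def two_stage_denoise(denoise_step, input_latent, num_reasoning_frames,
                      num_steps, reasoning_steps, rng):
    """Toy two-stage schedule: reasoning frames are kept only for early steps."""
    dim = len(input_latent)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Start from pure noise for the intermediate (reasoning) frames and the target.
    reasoning = [noise() for _ in range(num_reasoning_frames)]
    target = noise()

    # Stage 1 (temporal reasoning): denoise input, reasoning, and target frames jointly.
    for t in range(reasoning_steps):
        frames = denoise_step([input_latent, *reasoning, target], t)
        reasoning, target = frames[1:-1], frames[-1]

    # Stage 2 (editing frame generation): discard the reasoning frames and
    # keep refining only the target frame, conditioned on the input frame.
    for t in range(reasoning_steps, num_steps):
        frames = denoise_step([input_latent, target], t)
        target = frames[-1]

    return target

frame_counts = []  # how many frames the denoiser sees at each step

def toy_denoise_step(frames, t):
    """Hypothetical stand-in for a video diffusion model step."""
    frame_counts.append(len(frames))
    ref = frames[0]  # the clean input frame is passed through unchanged
    return [ref] + [[0.5 * (x + r) for x, r in zip(f, ref)] for f in frames[1:]]

rng = random.Random(0)
input_latent = [0.0] * 8
edited = two_stage_denoise(toy_denoise_step, input_latent,
                           num_reasoning_frames=3, num_steps=6,
                           reasoning_steps=2, rng=rng)
```

In this sketch, `frame_counts` records five frames per call during the two reasoning steps and two frames per call afterwards, which is where the efficiency gain from discarding the reasoning tokens would come from.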


Acknowledgments

The authors would like to thank Product Managers Aditya Mahajan and Matt Cragun for their valuable guidance and support. We further acknowledge the Cosmos Team at NVIDIA, especially Qinsheng Zhang and Hanzi Mao, for their consultation on Cosmos-Pred2.5-2B. We also thank Yuyang Zhao, Junsong Chen, and Jincheng Yu for their insightful discussions. Finally, we are grateful to Ben Cashman, Yuting Yang, and Amanda Moran for their infrastructure support.

Citation

@article{wu2025chronoedit,
    title={ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation},
    author={Wu, Jay Zhangjie and Ren, Xuanchi and Shen, Tianchang and Cao, Tianshi and He, Kai and Lu, Yifan and Gao, Ruiyuan and Xie, Enze and Lan, Shiyi and Alvarez, Jose M. and Gao, Jun and Fidler, Sanja and Wang, Zian and Ling, Huan},
    journal={arXiv preprint arXiv:2510.04290},
    year={2025}
}