ChronoEdit:
Towards Temporal Reasoning for Image Editing and World Simulation
TL;DR: ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that jointly denoises “temporal reasoning tokens,” letting the model “reason” about physically plausible edits.
Gallery
Image Editing Results
Temporal Reasoning Visualization
Built on a video model, ChronoEdit can visualize how it “reasons” through an edit by denoising the temporal reasoning tokens, revealing the editing trajectory behind its final output.
We introduce temporal reasoning tokens between the reference and edited image latents, serving as intermediate guidance that helps the model “think” through plausible editing trajectories. At inference, for efficiency, these tokens need not be fully denoised; in the results shown below, however, we optionally denoise them into a clean video to visualize how the model reasons about and interprets an editing task. Note that, in the context of image editing, the final frame of each video is the output edited image.
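To make the token layout and the optional visualization pass concrete, here is a minimal PyTorch-style sketch. Everything in it is illustrative: the latent shapes, the number of reasoning tokens, and the `denoiser` stub are assumptions for exposition, not ChronoEdit's actual code.

```python
import torch

C, H, W = 16, 32, 32   # latent channels / spatial size (illustrative values)
NUM_REASONING = 4      # number of temporal reasoning tokens (assumed)

def denoiser(latents: torch.Tensor, t: float) -> torch.Tensor:
    """Stand-in for the video diffusion denoiser; not the real model."""
    return latents * (1.0 - 0.5 * t)  # placeholder dynamics

# Latent sequence: [reference image | reasoning tokens | edited image].
reference = torch.randn(1, C, H, W)              # clean latent of the input image
reasoning = torch.randn(NUM_REASONING, C, H, W)  # noisy intermediate frames
target = torch.randn(1, C, H, W)                 # noisy latent of the edit target
sequence = torch.cat([reference, reasoning, target], dim=0)

# Visualization mode: denoise *all* non-reference frames to a clean video,
# so the reasoning trajectory itself can be inspected.
for t in (1.0, 0.75, 0.5, 0.25, 0.0):
    sequence[1:] = denoiser(sequence, t)[1:]  # the reference frame stays fixed

trajectory_video = sequence         # the "how the model reasoned" video
edited_image_latent = sequence[-1]  # final frame = output edited image
```

The key point is only the layout: the reasoning tokens live between the reference and target latents in a single video latent sequence, and the last frame is always the edited image.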
Physical-AI-Related Tasks
ChronoEdit produces edits that faithfully follow the given instructions while preserving scene structure and fine details in Physical-AI-related scenes (such as autonomous driving or humanoid robotics), where maintaining physical consistency is especially critical.
Method

Overview of the ChronoEdit pipeline. From right to left, the denoising process begins in the temporal reasoning stage, where the model imagines and denoises a short trajectory of intermediate frames. These intermediate frames act as reasoning tokens, guiding how the edit should unfold in a physically consistent manner. For efficiency, the reasoning tokens are discarded in the subsequent editing frame generation stage, where the target frame is further refined into the final edited image.
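The two-stage schedule described in this caption can be sketched as follows. The `two_stage_edit` helper and the step values are hypothetical; only the overall structure (joint denoising, then dropping the reasoning tokens before refining the target frame) reflects the pipeline above.

```python
import torch

def two_stage_edit(sequence: torch.Tensor,
                   denoiser,
                   reasoning_steps=(1.0, 0.8, 0.6),
                   refine_steps=(0.4, 0.2, 0.0)) -> torch.Tensor:
    """Sketch of the two-stage schedule; step values are assumptions.

    `sequence` is laid out as [reference | reasoning tokens | target];
    the reference frame (index 0) is kept fixed throughout.
    """
    # Stage 1 (temporal reasoning): jointly denoise the reasoning tokens and
    # the target frame, so the edit unfolds along a plausible trajectory.
    for t in reasoning_steps:
        sequence[1:] = denoiser(sequence, t)[1:]

    # Stage 2 (editing frame generation): discard the partially denoised
    # reasoning tokens for efficiency and refine only the target frame.
    pair = torch.cat([sequence[:1], sequence[-1:]], dim=0)
    for t in refine_steps:
        pair[1:] = denoiser(pair, t)[1:]

    return pair[-1]  # latent of the final edited image
```

Because the reasoning tokens are dropped before the refinement loop, the second stage runs on a two-frame sequence (reference plus target), which is what makes skipping their full denoising an efficiency win at inference.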