WorldTrace: Addressable Memory for Video World Models

TL;DR

WorldTrace keeps compressed memory addressable with fixed in-distribution slot positions, then uses canonical-key writers for two goals: WorldTrace-Field for smoother long rollouts and WorldTrace-Landmark for recalling previously visited scenes, all without retraining the generator.

Abstract

We study visual persistence in autoregressive video world models, where Key–Value (KV) caches store growing visual memory but become hard to retrieve from beyond the training horizon. We identify out-of-distribution temporal RoPE offsets as the root cause: past observations may remain cached, yet become unaddressable to attention. WorldTrace is a training-free framework that keeps compressed memory addressable by assigning each slot a fixed, in-distribution position relative to the current frame. Built on this addressable cache, WorldTrace-Field improves coherent long rollouts with rotation-invariant history aggregation, while WorldTrace-Landmark preserves verbatim scene traces for long-range recall.

Motivation

Autoregressive video world models promise interactive worlds, but visual persistence collapses once generation exceeds the training horizon.

Can an autoregressive video world model reliably remember where it has been, at any generation length?

Why memory breaks beyond the training horizon

Two coupled bottlenecks arise once generation crosses the training context window:

1. Addressability

Temporal RoPE offsets exceed the training range, so cached memories become unreadable even when they are physically present. Past the trained window, attention queries see phases the model never learned to address.

2. Content fidelity

Naive key averaging in RoPE-rotated space mixes incompatible phases. The resulting phase cancellation destroys the signal that compressed summaries are supposed to carry.
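
A two-key worked example makes the cancellation concrete. Here $R(\phi)$ denotes the 2D rotation that RoPE applies to one frequency band with frequency $\theta$, and the two cached keys are assumed, purely for illustration, to carry the same content $K$ at positions $t_1$ and $t_2$ whose phases differ by $\pi$:

$$\frac{1}{2}\Bigl(R(\theta t_1)\,K + R(\theta t_1 + \pi)\,K\Bigr) = \frac{1}{2}\Bigl(R(\theta t_1)\,K - R(\theta t_1)\,K\Bigr) = 0$$

The averaged key vanishes even though both inputs carry identical content. For a smaller phase gap $\Delta$ the mean does not vanish but shrinks by a factor of $\lvert\cos(\Delta/2)\rvert$.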

Method: WorldTrace

WorldTrace uses a two-tier KV cache: a verbatim recent window plus $N_s$ summary slots. Summary positions are assigned by slot rank alone, independent of the rollout horizon, so every summary stays in-distribution at any generation length. Two complementary writers fill the slots: WT-Field and WT-Landmark.

Figure: WorldTrace cache pipeline. A recent window plus summary slots with slot-rank positions, filled by two writers, WT-Field and WT-Landmark.
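
The two-tier layout can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TwoTierCache`, the `writer` callback, and the placeholder policy `overwrite_oldest` are all hypothetical, and the placeholder is neither WT-Field nor WT-Landmark. It only shows that the cache size stays bounded by the window length plus $N_s$.

```python
from collections import deque


class TwoTierCache:
    """Hypothetical sketch: verbatim recent window plus a fixed number of summary slots."""

    def __init__(self, window, n_slots, writer):
        self.window = window              # number of recent entries kept verbatim
        self.recent = deque()             # tier 1: recent window
        self.summary = [None] * n_slots   # tier 2: fixed-size summary slots
        self.writer = writer              # assumed callback: writer(summary, evicted)

    def append(self, frame_kv):
        self.recent.append(frame_kv)
        if len(self.recent) > self.window:
            evicted = self.recent.popleft()
            # The writer decides how the evicted entry updates the summary slots.
            self.writer(self.summary, evicted)

    def size(self):
        return len(self.recent) + len(self.summary)


def overwrite_oldest(summary, evicted):
    """Placeholder writer (an assumption): drop the oldest slot, append the new entry."""
    summary.pop(0)
    summary.append(evicted)


cache = TwoTierCache(window=4, n_slots=3, writer=overwrite_oldest)
for frame_index in range(100):
    cache.append(frame_index)  # integers stand in for per-frame KV tensors

print(list(cache.recent), cache.summary, cache.size())
```

However long the rollout runs, `cache.size()` stays at `window + n_slots`, which is the sense in which the summary tier is constant-size.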

Slot-rank position assignment

Virtual position for summary slot $s$:

$$v_s = q - \bigl(L_{\mathrm{train}} - 1 - s\bigr)\cdot F$$

$q$ is the current query position, $L_{\mathrm{train}}$ is the training context length, and $F$ is the number of frames per autoregressive block. Slot positions depend on rank, not rollout length.
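
The formula can be checked with a few lines of Python. This is a sketch under assumptions the text does not state: slots are indexed $s = 0, \dots, N_s - 1$, $N_s \le L_{\mathrm{train}}$ so that no slot lands ahead of the query, and the helper name `slot_positions` and the numeric values are illustrative only.

```python
def slot_positions(q, l_train, frames_per_block, n_slots):
    """Return v_s = q - (L_train - 1 - s) * F for s = 0 .. n_slots - 1."""
    if n_slots > l_train:
        # Assumption: more slots than L_train would place a slot after the query.
        raise ValueError("n_slots must not exceed l_train")
    return [q - (l_train - 1 - s) * frames_per_block for s in range(n_slots)]


# Same slot layout queried early and very late in a rollout (illustrative values).
early = slot_positions(q=100, l_train=8, frames_per_block=3, n_slots=4)
late = slot_positions(q=10_000, l_train=8, frames_per_block=3, n_slots=4)

early_offsets = [100 - v for v in early]
late_offsets = [10_000 - v for v in late]
print(early, early_offsets)
print(late, late_offsets)
```

The absolute positions move with $q$, but the offsets between the query and each slot are identical at both rollout lengths and never exceed $(L_{\mathrm{train}} - 1)\cdot F$.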

Canonical WT-Field writer

$$K_{\mathrm{field}}^{(k)}(t_v) = R(\theta_k t_v)\,\frac{1}{M}\sum_{m=1}^{M} R(-\theta_k t_m)\, K_{t_m}^{(k)}$$

Keys are first aligned into a shared canonical phase, averaged, then re-rotated to the summary slot position. This avoids phase cancellation and preserves mean attention logits. WT-Field targets temporal coherence under compression; it is not a recall mechanism.
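
The difference between naive and canonical averaging can be sketched for a single frequency band. This is an illustrative toy rather than the authors' code: each 2D key pair is stored as a complex number so that $R(\theta t)$ becomes multiplication by $e^{i\theta t}$, the function names are hypothetical, and the cached keys are assumed to share identical content so the effect is easy to see.

```python
import cmath
import math


def rotate(key, theta, t):
    """Apply the RoPE rotation R(theta * t) to one 2D key pair stored as a complex number."""
    return cmath.exp(1j * theta * t) * key


def naive_mean(rotated_keys):
    """Average keys directly in RoPE-rotated space."""
    return sum(rotated_keys) / len(rotated_keys)


def field_key(rotated_keys, times, theta, t_v):
    """Unrotate each key to the canonical phase, average, then rotate to slot position t_v."""
    canonical = [rotate(k, theta, -t) for k, t in zip(rotated_keys, times)]
    mean = sum(canonical) / len(canonical)
    return rotate(mean, theta, t_v)


theta = math.pi / 2           # illustrative frequency for one band
content = 1.0 + 0.5j          # assumed shared key content
times = [0, 1, 2, 3]          # positions of the cached frames
cached = [rotate(content, theta, t) for t in times]

naive = naive_mean(cached)
field = field_key(cached, times, theta, t_v=5)
print(abs(naive), abs(field), abs(content))
```

With these values the four phases are evenly spread around the circle, so the naive mean collapses to (numerically) zero, while the canonical average keeps the full magnitude of the content and sits at the slot position. Because the attention logit is linear in the key, the logit against the averaged key equals the mean of the logits against each key re-placed at $t_v$.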

Frozen WT-Landmark writer

$$K_{\mathrm{land}}^{(k)}(t_v) = R(\theta_k t_v)\, R(-\theta_k t_{\ell^*})\, K_{t_{\ell^*}}^{(k)}$$

Scene-entry frames are detected from the canonical-key signal, stored verbatim into summary slots, and frozen on insertion to avoid bfloat16 drift from repeated unrotate→rerotate shifts. WT-Landmark keeps slot-rank positions unchanged and targets episodic recall over long rollouts.
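
The freeze-on-insertion idea can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it uses one frequency band with the key pair stored as a complex number, the class name `LandmarkSlot` and its methods are invented for the example, scene-entry detection is omitted, and double precision is used throughout, so it does not model bfloat16 rounding.

```python
import cmath
from dataclasses import dataclass


def rotate(key, theta, t):
    """Apply the RoPE rotation R(theta * t) to one 2D key pair stored as a complex number."""
    return cmath.exp(1j * theta * t) * key


@dataclass(frozen=True)
class LandmarkSlot:
    """Holds the canonical key R(-theta * t_l) K, computed once and never rewritten."""

    canonical_key: complex
    theta: float

    @classmethod
    def insert(cls, cached_key, t_landmark, theta):
        # Unrotate exactly once, at insertion time.
        return cls(canonical_key=rotate(cached_key, theta, -t_landmark), theta=theta)

    def key_at(self, t_v):
        # Every read is a single rotation of the frozen copy; no shifts are chained.
        return rotate(self.canonical_key, self.theta, t_v)


theta = 0.3                      # illustrative frequency
content = 0.8 - 0.2j             # assumed key content before RoPE
t_landmark = 7                   # position at which the landmark frame was cached
cached = rotate(content, theta, t_landmark)

slot = LandmarkSlot.insert(cached, t_landmark, theta)
for t_v in (10, 500, 20_000):
    print(t_v, slot.key_at(t_v))
```

Each read applies one rotation to the stored canonical key, so the number of arithmetic steps between the original key and the key seen by attention does not grow with rollout length.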

Results

WorldTrace-Landmark: episodic recall

LoopMem tests episodic recall by asking the model to return to previously visited scenes and scoring the regenerated view with Position-Aligned CLIP (PAC). Across topology, path length, camera orientation, and multi-revisit settings, WT-Landmark consistently improves over sliding-window recall: 0.825 vs. 0.627 PAC on the long ABA path, 0.864 vs. 0.723 on standard ABA, and 0.941 vs. 0.892 on ABABA. The hardest $360^\circ$ pan shows the smallest gain (0.577 vs. 0.559), making the limitation visible rather than hidden by the aggregate.
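
The absolute gains implied by these figures can be tabulated directly. The snippet below only restates the four PAC pairs quoted above; the dictionary keys are shorthand labels chosen for this example.

```python
# (sliding window, WT-Landmark) PAC pairs as quoted in the text.
pac = {
    "long ABA": (0.627, 0.825),
    "standard ABA": (0.723, 0.864),
    "ABABA": (0.892, 0.941),
    "360-degree pan": (0.559, 0.577),
}

# Absolute PAC gain of WT-Landmark over the sliding-window baseline.
gains = {name: round(ours - base, 3) for name, (base, ours) in pac.items()}
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f} PAC")
print("largest:", largest, "| smallest:", smallest)
```

The gains are +0.198, +0.141, +0.049, and +0.018 PAC respectively, so the long ABA path benefits most and the $360^\circ$ pan least.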

Topology

Vary the number of intermediate waypoints before returning to the starting scene.

Edge length

Increase the number of generated chunks per leg to stretch context distance.

Orientation

Stress recall under camera-orientation changes, including wide pans.

Multi-revisit

Revisit the same place multiple times to test repeated episodic recall.

Figure: LoopMem benchmark scenarios, varying topology, edge length, camera orientation, and multi-revisit patterns.
Figure: PAC across the full LoopMem suite, comparing sliding window and WorldTrace-Landmark. WT-Landmark improves recall in every evaluated scenario, with the largest gains on longer paths and the smallest gain on the $360^\circ$ camera pan.

Pan recall videos

Pan path: three examples (Pan 1, Pan 2, Pan 3), each with an initial frame taken from the sliding-window video, the sliding-window baseline video, and our video.

ABA recall videos

ABA path: two examples (ABA 1 and ABA 3), each with an initial frame taken from the sliding-window video, the sliding-window baseline video, and our video.

WorldTrace-Field: coherence rollouts

Holding the content operator fixed (canonical averaging) and varying only the position assignment, slot-rank positions lead Block-Rel by +5.9% TempSSIM at 8× the training horizon and +2.8% at 16×. At 24× ($N = 48$), WT-Field improves TempSSIM by +15.5% over sliding window while also lowering Local Scene Drift, whereas every $N$-dependent position formula degrades non-monotonically.

Same conditioning, four position schemes. Sliding window and Block-Rel diverge by $t = 18$–$24$ (red borders); WT-Field stays coherent through $t = 48$ (24× the training horizon).

Coherence rollout videos

Sliding window
Block-Rel
Centroid
WT-Field

Takeaways

Diagnosis

  • Long-horizon failure is a position problem, not a content problem.
  • Naive averaging in RoPE-rotated space causes phase cancellation.
  • Compression-only summaries collapse to a sliding window once their slots fall outside the trained range.

Method

  • Slot-rank virtual positions: every summary stays in-distribution at any horizon.
  • WorldTrace-Field: canonical-key averaging for coherence.
  • WorldTrace-Landmark: frozen verbatim traces for recall.

Impact

  • +15.5% TempSSIM at 24× training horizon (WT-Field).
  • Higher PAC across LoopMem: WT-Landmark improves over sliding window in topology, edge-length, orientation, and multi-revisit scenarios.
  • Training-free, $O(1)$ summary cache: drop-in for AR video world models.

Citation

@inproceedings{wu2026worldtrace,
  title={Addressable Memory for Video World Models},
  author={Xindi Wu and Sven Elflein and James Lucas and Olga Russakovsky and Laura Leal-Taix\'{e} and Despoina Paschalidou and Jonathan Lorraine and Aljo\v{s}a O\v{s}ep},
  booktitle={ICML 2026 Workshop: From Frames to Stories (F2S)},
  note={Oral presentation},
  year={2026},
}