Toronto AI Lab

Trajeglish: Traffic Modeling as Next-Token Prediction

Jonah Philion1,2,3, Xue Bin Peng1,4, Sanja Fidler1,2,3

1NVIDIA, 2University of Toronto, 3Vector Institute, 4Simon Fraser University

Visualization of 20-second closed-loop rollouts with all agents controlled by our model

A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. In pursuit of this functionality, we apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Using a simple data-driven tokenization scheme, we discretize trajectories to centimeter-level resolution using a small vocabulary. We then model the multi-agent sequence of discrete motion tokens with a GPT-like encoder-decoder that is autoregressive in time and takes into account intra-timestep interaction between agents. Scenarios sampled from our model exhibit state-of-the-art realism; our model tops the Waymo Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%. We ablate our modeling choices in full autonomy and partial autonomy settings, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes. We additionally evaluate the scalability of our model with respect to parameter count and dataset size, and use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.


Paper

Jonah Philion, Xue Bin Peng, Sanja Fidler

Trajeglish: Traffic Modeling as Next-Token Prediction

ICLR 2024 (poster)

[preprint] [bibtex]


Main Idea

Trajeglish We use discrete sequence modeling to model the interaction between the agents found in driving scenarios, such as vehicles, pedestrians, and cyclists. To tokenize a trajectory, we iteratively select, from a finite set of candidate actions, the action whose resulting bounding-box corners are closest to the corners of the original trajectory, as shown below.

Our tokenization strategy involves iteratively finding the action closest to ground-truth.
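Below is a minimal sketch of this greedy tokenization in NumPy, assuming each state is an (x, y, heading) triple, each template action is a local-frame (dx, dy, dheading) displacement, and agents are rectangles of fixed length and width. The helper names are illustrative and not taken from our released code.

```python
import numpy as np

def box_corners(x, y, heading, length, width):
    """Corners of an oriented box of the given size, centered at (x, y)."""
    dx, dy = length / 2.0, width / 2.0
    local = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T + np.array([x, y])

def apply_action(state, action):
    """Advance a state (x, y, heading) by a local-frame action (dx, dy, dheading)."""
    x, y, h = state
    dx, dy, dh = action
    c, s = np.cos(h), np.sin(h)
    return np.array([x + c * dx - s * dy, y + s * dx + c * dy, h + dh])

def tokenize(traj, templates, length, width):
    """Greedily pick, at each timestep, the template action whose rolled-out
    corners are closest on average to the ground-truth corners."""
    state = traj[0]
    tokens = []
    for target in traj[1:]:
        target_corners = box_corners(*target, length, width)
        candidates = np.array([apply_action(state, a) for a in templates])
        errs = [np.linalg.norm(box_corners(*c, length, width) - target_corners, axis=1).mean()
                for c in candidates]
        best = int(np.argmin(errs))
        tokens.append(best)
        state = candidates[best]  # roll forward from the snapped state, not ground truth
    return tokens
```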

Motion Tokens We use an approach we call "k-disks" to find an action set that results in the lowest possible discretization error. We collect a large number of state-to-state transitions, iteratively sample an action, filter out any remaining actions within corner distance ε of it, and continue until a specified number of actions has been chosen (a sketch of this procedure follows the figures below). Using this procedure, we find template sets with a vocabulary size of 384 that achieve a discretization error of only 1 cm across the Waymo Open Dataset.

K-disks consistently results in higher-quality template sets than k-means.


Visualization of the template sets found with k-disks for vocabulary sizes 128, 256, 384, and 512.
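For completeness, here is a minimal sketch of the k-disks selection loop referenced above, reusing box_corners and apply_action from the tokenization sketch. The function signature, the random seed, and the way transitions are represented are illustrative assumptions; the key idea is simply to keep sampled actions that are at least ε apart in corner distance until the vocabulary is full.

```python
import numpy as np

def corner_distance(a, b, length, width):
    """Mean corner distance between the boxes reached by two actions from the same state."""
    ca = box_corners(*apply_action(np.zeros(3), a), length, width)
    cb = box_corners(*apply_action(np.zeros(3), b), length, width)
    return np.linalg.norm(ca - cb, axis=1).mean()

def k_disks(transitions, vocab_size, eps, length, width, seed=0):
    """Greedily build a template set: sample an observed action, discard all
    remaining actions within eps corner distance of it, and repeat until
    vocab_size actions have been chosen (or the pool is exhausted)."""
    rng = np.random.default_rng(seed)
    pool = list(transitions)  # observed state-to-state transitions as (dx, dy, dheading)
    templates = []
    while pool and len(templates) < vocab_size:
        center = pool[rng.integers(len(pool))]
        templates.append(center)
        pool = [a for a in pool if corner_distance(a, center, length, width) > eps]
    return np.array(templates)
```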

Trajeglish Modeling We train an encoder-decoder transformer in the style of GPT to model sequences of multi-agent trajectories. Given the tokenized trajectories for all agents in a scene, we flatten them into a single sequence such that the actions for all agents at the first timestep come first, followed by the actions for all agents at the second timestep, and so on. This ordering guarantees that the model can be used as a policy at test time. For the encoder, we use VectorNet to encode the map and Latent Query Attention to encode the scene initialization.

We model Trajeglish using an encoder-decoder GPT-like transformer.
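The timestep-major flattening described above is straightforward to implement. The sketch below shows the ordering for a token array of shape (num_agents, num_timesteps); the function name is illustrative.

```python
import numpy as np

def flatten_tokens(agent_tokens):
    """Interleave per-agent token sequences into a single sequence:
    all agents at timestep 0 first, then all agents at timestep 1, and so on.

    agent_tokens: int array of shape (num_agents, num_timesteps)
    returns: 1D array of length num_agents * num_timesteps
    """
    # Transposing to (num_timesteps, num_agents) and flattening row-major
    # yields exactly the timestep-major ordering described above.
    return np.asarray(agent_tokens).T.reshape(-1)

# Example with 3 agents and 2 timesteps:
# flatten_tokens([[10, 11], [20, 21], [30, 31]]) -> [10, 20, 30, 11, 21, 31]
```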

2023 WOMD Sim Agents Benchmark Results We test the sampling performance of our model using the WOMD Sim Agents Benchmark. The benchmark requires 32 rollouts per scenario, each 8 seconds long at 10 Hz and containing up to 128 agents. Trajeglish is the top submission along the leaderboard meta metric, outperforming several well-established motion prediction models including Wayformer, MultiPath++, and MTR, while being the first submission to use discrete sequence modeling. Most of the improvement comes from the fact that Trajeglish models interaction between agents significantly better than prior work, increasing the state-of-the-art along interaction metrics by 9.9%.

Trajeglish sets a new state-of-the-art on the Waymo Sim Agents Benchmark. An asterisk indicates the submission uses ensembling.
Note these results are from before Waymo announced an adjustment to their metrics for the Sim Agents challenge on December 28, 2023. Updated results after the adjustment to the metrics can be found in the appendix of our paper.

Multiple Rollouts, Same Initialization We visualize multiple rollouts from the model given the same initialization. Trajeglish controls all agents in the scene. Trajeglish models the bi-modal nature of the grey car's future trajectory distribution well, in addition to the dependence of the yellow car's trajectory on the grey car's trajectory. The model is prompted using only 1 timestep of initial context.

Visualization of multiple rollouts prompted from the same initialization. Trajeglish controls all agents.

Reacting to Replay Agents Importantly, Trajeglish is fully reactive to the motion of other agents, replanning at 10 Hz. On the left, we sample scenarios in which all agents are controlled by Trajeglish. On the right, we sample scenarios in which only the blue agent is controlled by Trajeglish and the other agents follow the trajectories recorded in the data. The Trajeglish agent on the right reacts appropriately to the replay agents, demonstrating that Trajeglish agents do not rely on privileged information about the future motion of other agents when selecting actions at each timestep. This property is crucial for using traffic models to control NPCs when testing black-box AV systems. A minimal sketch of this mixed policy-and-replay rollout loop is provided after the figure below.

On the left, Trajeglish controls all agents. On the right, Trajeglish controls only the blue agent. The Trajeglish agents react appropriately to the motion of other vehicles in both cases, independent of the method used to control each of the other agents.
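To make the partial-autonomy setup concrete, below is a minimal sketch of a token-level closed-loop rollout in which a chosen subset of agents samples actions from the model while the remaining agents replay their logged tokens. The model.sample_next interface is a hypothetical stand-in for whatever sampling routine a trained Trajeglish model exposes, and the fixed intra-timestep agent order is a simplification.

```python
import numpy as np

def rollout(model, init_tokens, log_tokens, policy_agents, num_steps):
    """Closed-loop rollout at the token level.

    init_tokens:   (num_agents, t0) tokens used to prompt the model
    log_tokens:    (num_agents, t0 + num_steps) recorded tokens for replay agents
    policy_agents: set of agent indices controlled by the model
    model:         assumed to expose sample_next(history, agent_idx) -> token
    """
    t0 = init_tokens.shape[1]
    history = [list(row) for row in init_tokens]
    for t in range(t0, t0 + num_steps):
        for i in range(len(history)):  # fixed intra-timestep agent order
            if i in policy_agents:
                # Reactive: the model conditions on all tokens emitted so far,
                # including the other agents' tokens at the current timestep.
                token = model.sample_next(history, agent_idx=i)
            else:
                token = log_tokens[i][t]  # replay the recorded action
            history[i].append(token)
    return np.array(history)
```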