Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving

Abstract

The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, struggle with long-tail events, since the corresponding inputs are rare in, or absent from, their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better use of an LLM's reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. Instead of training the driving scene tokenizer from scratch, we propose an efficient approach that obtains object-centric tokens from an off-the-shelf foundation model, leveraging its abstraction capacity and deep architectural priors. We then employ a multi-stage training strategy to fine-tune TOKEN's planner for ego-vehicle planning across a variety of planning tasks. In our experiments, we demonstrate that TOKEN excels in grounding, reasoning, and planning, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rate in long-tail scenarios. Additionally, our work provides valuable infrastructure for the community, including a curated set of long-tail scenarios extracted from nuPlan and question-answer pairs for transformer-based planners.
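To make the "tokenize the world into object-level knowledge" idea concrete, the following is a minimal, purely illustrative sketch (not the paper's actual code; all names, dimensions, and the linear projector are assumptions): each detected object in the scene contributes one token, taken from a frozen perception backbone and projected into the language-model planner's embedding space, where it is concatenated with text-prompt embeddings.

```python
import numpy as np

# Hypothetical illustration of object-level tokenization for an LLM planner.
# Every detected object becomes one token; a small learned projector maps
# frozen backbone features into the planner's embedding space.

rng = np.random.default_rng(0)

N_OBJECTS = 5        # e.g. vehicles / pedestrians detected in the scene
OBJ_FEAT_DIM = 256   # feature size from an off-the-shelf foundation model
LLM_DIM = 512        # embedding width of the language-model planner

# Stand-ins for real frozen object-level features from the backbone.
object_features = rng.standard_normal((N_OBJECTS, OBJ_FEAT_DIM))

# Learned linear projector (assumed; the real adapter may differ).
W_proj = rng.standard_normal((OBJ_FEAT_DIM, LLM_DIM)) * 0.02
object_tokens = object_features @ W_proj          # (N_OBJECTS, LLM_DIM)

# Embedded text-prompt tokens, e.g. a planning instruction.
prompt_tokens = rng.standard_normal((8, LLM_DIM))

# The planner consumes object tokens and text tokens as one sequence.
input_sequence = np.concatenate([object_tokens, prompt_tokens], axis=0)
print(input_sequence.shape)  # (13, 512)
```

The object-per-token design is what lets the LLM reason compositionally about individual scene participants, rather than over a monolithic scene embedding.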

Publication
CoRL 2024
