The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single “bird’s-eye-view” coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird’s-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird’s-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird’s-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by “shooting” template trajectories into a bird’s-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar.
Lift, Splat, Shoot Our goal is to design a model that takes as input multi-view image data from any camera rig and outputs a semantic representation of the scene in the reference frame of the camera rig, as determined by the extrinsics and intrinsics of the cameras. The tasks we consider in this paper are bird's-eye-view vehicle segmentation, bird's-eye-view lane segmentation, drivable-area segmentation, and motion planning. Our network is composed of an initial per-image CNN followed by a bird's-eye-view CNN, connected by a "Lift-Splat" pooling layer (left). To "lift" images into 3D, the per-image CNN performs an attention-style operation at each pixel over a set of discrete depths (right).
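As a rough illustration of the lift step, the minimal PyTorch sketch below assumes the per-image CNN head outputs, for every pixel, logits over D discrete depth bins plus a C-dimensional context vector; the lifted frustum feature is their outer product. Function name, tensor layout, and the default values of D and C are illustrative assumptions, not the exact implementation.

```python
import torch

def lift_features(img_feats, D=41, C=64):
    """img_feats: (B, D + C, H, W) output of the per-image CNN head (assumed layout).

    The first D channels are treated as logits over discrete depth bins and the
    remaining C channels as a per-pixel context vector. The lifted frustum feature
    is their outer product: each pixel places its context vector at every candidate
    depth, weighted by the predicted probability of that depth.
    """
    depth_logits = img_feats[:, :D]            # (B, D, H, W)
    context = img_feats[:, D:D + C]            # (B, C, H, W)
    depth_probs = depth_logits.softmax(dim=1)  # attention over depth bins
    frustum = depth_probs.unsqueeze(2) * context.unsqueeze(1)
    return frustum                             # (B, D, C, H, W)
```

Each frustum feature is then assigned, via the camera intrinsics and extrinsics, to the bird's-eye-view cell containing its 3D point and pooled there; this is the "splat" step.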
Learning Cost Maps for Planning We frame end-to-end motion planning ("shooting") as classification over a set of fixed template trajectories (left). We define the logit for each template to be the sum of the values it traverses in the bird's-eye-view cost map output by our model (right). We then train the model to maximize the likelihood of expert trajectories.
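The sketch below illustrates this formulation under assumed shapes: each fixed template is represented by the BEV grid cells it passes through, a template's logit is the sum of cost-map values over those cells, and training reduces to cross-entropy against the index of the template closest to the expert trajectory. Names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def template_logits(cost_map, template_cells):
    """cost_map: (B, X, Y) bird's-eye-view costs output by the network.
    template_cells: (K, T, 2) integer (x, y) grid indices of K fixed template
    trajectories, each discretized into T points (assumed precomputed).

    The logit of template k is the sum of cost-map values along its points.
    """
    x, y = template_cells[..., 0], template_cells[..., 1]   # (K, T) each
    vals = cost_map[:, x, y]                                 # (B, K, T)
    return vals.sum(dim=-1)                                  # (B, K)

def planning_loss(cost_map, template_cells, expert_idx):
    """Cross-entropy over templates, i.e. maximize the likelihood of the
    template matching the expert trajectory (expert_idx: (B,) long tensor)."""
    return F.cross_entropy(template_logits(cost_map, template_cells), expert_idx)
```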
Equivariance To be maximally useful, models that perform inference in the bird's-eye-view frame need to generalize to any choice of bird's-eye-view coordinates. Our model is designed such that it roughly respects equivariance under translations (top left) and rotations (top right) of the camera extrinsics. Lift-Splat Pooling is also exactly permutation invariant (bottom left) and roughly invariant to image translation (bottom right).
Figure panels: Extrinsic Translation (top left), Extrinsic Rotation (top right), Image Permutation (bottom left), Image Translation (bottom right).
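The permutation invariance is easiest to see in the splat step: if frustum features from all cameras are sum-pooled into BEV cells, reordering the cameras (or their points) cannot change the result. The toy check below uses an assumed sum-pooling implementation in PyTorch; shapes and names are illustrative.

```python
import torch

def splat(points_xy, feats, nx=200, ny=200):
    """points_xy: (N, 2) integer BEV cell indices of all frustum points
    (every camera, concatenated); feats: (N, C) lifted features.
    Sum-pools the features into a (C, nx, ny) bird's-eye-view grid."""
    C = feats.shape[1]
    bev = feats.new_zeros(C, nx, ny)
    flat_idx = points_xy[:, 0] * ny + points_xy[:, 1]   # flatten (x, y) to one index
    bev.view(C, -1).index_add_(1, flat_idx, feats.t())  # scatter-add per cell
    return bev

# Shuffling the points (e.g. presenting the cameras in a different order)
# leaves the pooled output unchanged up to floating-point summation order.
N, C = 1000, 8
pts, feats = torch.randint(0, 200, (N, 2)), torch.randn(N, C)
perm = torch.randperm(N)
assert torch.allclose(splat(pts, feats), splat(pts[perm], feats[perm]), atol=1e-5)
```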
Results We outperform baselines on bird's-eye-view segmentation. We also demonstrate transfer across camera rigs in two scenarios of increasing difficulty. In the first, we drop cameras at test time from the same camera rig that was used during training. In the second, we test on an entirely different camera rig (Lyft dataset) from the one used during training (nuScenes dataset).
Bird's-Eye-View Segmentation IoU (nuScenes and Lyft)
nuScenes validation set Input images are shown on the left; the BEV inference output by our model is shown on the right. The BEV semantics are additionally projected back onto the input images for ease of visualization.
Camera Dropout At test time, we remove different cameras from the camera rig. When a camera is removed, the network imputes semantics in the resulting blind spot using information from the remaining cameras as well as priors about object shapes and road structure.
Train on nuScenes -> Test on Lyft We evaluate a model trained on the nuScenes dataset directly on the Lyft dataset. The segmentations output by the model are fuzzy but still meaningful. Quantitative transfer results against baselines can be found in our paper.