
Better Call SAL:
Towards Learning to Segment Anything in Lidar

1 NVIDIA
2 Carnegie Mellon University

* Equal Contribution

arXiv:2403.13129

Segment Anything in Lidar (SAL): The SAL model performs class-agnostic instance segmentation (i) and zero-shot classification via text prompting. This allows us not only to predict semantic/panoptic segmentation (ii) for fixed class vocabularies, but also to segment any object (iii and iv) in a given Lidar scan.


Abstract


We propose the SAL (Segment Anything in Lidar) method consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision "for free". Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. By training our model on these labels, we distill the 2D foundation models into our Lidar SAL model. Even without manual labels, our model achieves 91% in terms of class-agnostic segmentation and 44% in terms of zero-shot LPS of the fully supervised state-of-the-art. Furthermore, we outperform several baselines that do not distill but only lift image features to 3D. More importantly, we demonstrate that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with increasing amounts of self-labeled data.

SAL overview: Given a Lidar scan and a class vocabulary prompt, specified as a list of per-class free-form text descriptions (left), SAL segments and classifies objects (thing and stuff classes). As labeled data for training such a model does not exist, we supervise SAL by distilling off-the-shelf vision foundation models to Lidar (right).

SAL Pseudo-Labeling Engine


The key SAL component, our pseudo-label engine, transfers image segmentation and vision-language models into a (noisy) Lidar segmentation supervisory signal. We utilize SAM (Kirillov et al., ICCV'23) to generate class-agnostic masks in images, CLIP (Radford et al., ICML'21) to generate per-mask CLIP tokens that connect visual features to language, and a calibrated sensor setup to transfer both to the Lidar domain. We distill these pseudo-labels into our zero-shot model, which segments and classifies Lidar point clouds.
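For illustration, below is a minimal sketch of the lifting step in Python/NumPy, assuming SAM masks and per-mask CLIP tokens are already available for one calibrated camera. The helper names, the camera extrinsics T_cam_lidar, and the intrinsics K are assumptions for illustration, not the authors' actual implementation.

    import numpy as np

    def project_to_image(points_lidar, T_cam_lidar, K):
        """Project Lidar points (N, 3) into the image plane of a calibrated camera."""
        pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])  # homogeneous coords
        pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                          # Lidar frame -> camera frame
        in_front = pts_cam[:, 2] > 0.1                                      # keep points in front of the camera
        uv = (K @ pts_cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]                                         # perspective division
        return uv, in_front

    def lift_masks_to_lidar(points_lidar, sam_masks, clip_tokens, T_cam_lidar, K):
        """Assign each Lidar point a (mask id, CLIP token) pseudo-label, or -1 if unlabeled.

        sam_masks:   (M, H, W) boolean masks from SAM.
        clip_tokens: (M, D) per-mask CLIP image embeddings.
        """
        M, H, W = sam_masks.shape
        uv, valid = project_to_image(points_lidar, T_cam_lidar, K)
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)

        point_mask_id = np.full(len(points_lidar), -1, dtype=int)
        for m in range(M):  # on overlap, later masks simply overwrite earlier ones
            hit = valid & sam_masks[m, np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)]
            point_mask_id[hit] = m

        point_clip = np.zeros((len(points_lidar), clip_tokens.shape[1]))
        labeled = point_mask_id >= 0
        point_clip[labeled] = clip_tokens[point_mask_id[labeled]]
        return point_mask_id, point_clip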

SAL Zero-Shot Model


The SAL model employs a sparse-convolutional 3D backbone, followed by a Transformer decoder that predicts an objectness score, a segmentation mask, and a CLIP token for each query. To (optionally) perform zero-shot classification, we forward the dataset class vocabulary through the CLIP text encoder and match the encoded vocabulary with the predicted CLIP tokens. Our model requires no retraining for different vocabularies and no image features at inference.
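As an illustration of the vocabulary matching, here is a minimal sketch in Python/PyTorch, assuming the decoder has already produced one CLIP token per query. The use of the open_clip library, the ViT-B-16 backbone, and the prompt template are assumptions for illustration, not necessarily the configuration used in the paper.

    import torch
    import torch.nn.functional as F
    import open_clip

    def classify_queries(query_tokens, class_names, device="cuda"):
        """Match per-query CLIP tokens (Q, D) against a text-encoded class vocabulary."""
        model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
        tokenizer = open_clip.get_tokenizer("ViT-B-16")
        model = model.to(device).eval()

        # Encode the class vocabulary once; D must match the dimension of the distilled tokens.
        prompts = [f"a photo of a {name}" for name in class_names]
        with torch.no_grad():
            text_feat = model.encode_text(tokenizer(prompts).to(device))
        text_feat = F.normalize(text_feat, dim=-1)                 # (C, D)

        query_feat = F.normalize(query_tokens.to(device), dim=-1)  # (Q, D)
        logits = query_feat @ text_feat.T                          # cosine similarities (Q, C)
        return logits.argmax(dim=-1)                               # class index per query

Because the vocabulary only passes through the CLIP text encoder, swapping in a different class list requires no retraining.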

Key Results


Class-Agnostic Lidar Segmentation

Class-Agnostic Segmentation on Waymo Open. We visually compare class-agnostic segmentation results. Colors encode object instance IDs. Left: the baseline (SAM masks lifted to Lidar); right: SAL. As can be seen, the baseline that directly lifts SAM masks to Lidar is limited to the 270° field of view covered by the camera ring. By contrast, SAL segments the full Lidar point cloud and is not limited by the camera coverage. Zoomed-in regions show that the baseline is sensitive to edge bleeding (e.g., the pedestrian and traffic sign masks are partially projected onto the blue wall). SAL, by contrast, distills noisy SAM masks into crisp segmentation masks.

Class-agnostic segmentation on Waymo Open from a first-person perspective. We visualize the Lidar point cloud, with points colored according to instance IDs estimated by SAL. We show corresponding camera views (not used for inference) for reference. As can be seen, SAL accurately segments a large variety of objects, including parking meters, potted trees (pots as well as trees), a rooftop ladder, a fire hydrant, a post box, traffic cones, traffic barriers, and more. Canonical objects, such as cars, vans, buses, and pedestrians, are segmented as well. This class-agnostic segmentation is the basis for zero-shot classification.

Zero-Shot Lidar Panoptic Segmentation

Zero-shot panoptic segmentation. We build on prior efforts in the image and Lidar domains to craft multiple baselines that only unproject segmentation masks and lift image features to Lidar. By contrast, SAL distills the outputs of such baselines (pseudo-labels) into a stronger Lidar segmentation model. Image Feat. denotes methods that require image features at inference time, and Frust. Eval. denotes evaluation only on the subset of points visible in the camera frustum.

Zero-shot per-class prompting on Waymo Open. SAL predicts a set of object instances (left), along with their objectness scores and distilled CLIP features. We can then query these instances with free-form text prompts for specific classes. On the right, we highlight several such examples that lie outside the class vocabularies of the SemanticKITTI, nuScenes, and Waymo Open datasets. As can be seen on the left, the basis for such zero-shot prompting is accurate and, importantly, diverse class-agnostic segmentation.
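A hedged sketch of this retrieval step, again in Python/PyTorch: given per-instance distilled CLIP tokens and objectness scores, it keeps the instances whose similarity to a single encoded text prompt exceeds a threshold. The score combination and the threshold value are illustrative choices, not the exact procedure from the paper.

    import torch
    import torch.nn.functional as F

    def prompt_instances(instance_tokens, objectness, text_feat, threshold=0.25):
        """instance_tokens: (N, D) distilled CLIP tokens, objectness: (N,), text_feat: (D,)."""
        inst = F.normalize(instance_tokens, dim=-1)
        text = F.normalize(text_feat, dim=0)
        similarity = inst @ text              # cosine similarity per instance, (N,)
        score = similarity * objectness       # one possible way to fold in objectness
        keep = score > threshold
        return keep.nonzero(as_tuple=True)[0], score[keep]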

Lidar Panoptic Segmentation Benchmarks

Lidar Panoptic Segmentation (LPS) on SemanticKITTI and nuScenes validation sets: We prompt our zero-shot SAL model with the respective class vocabularies and compare its performance to fully-supervised baselines. On SemanticKITTI and nuScenes, we reach $40\%$ and $44\%$ of the fully-supervised equivalent of our model, respectively. This gap closes significantly when we evaluate super classes or with a semantics oracle, demonstrating the effective distillation of the 2D segmentation foundation model to 3D.

Qualitative results on SemanticKITTI (top) and nuScenes (bottom). We visualize ground truth (GT) (left), pseudo-labels (middle), and our model output (right). We display semantics for GT labels and SAL outputs (colors encode semantic classes). For the pseudo-labels (middle), we visualize instances (colors encode instance IDs). Note that each instance is supplemented with a CLIP token (not visualized). Black points have no associated labels (as can be seen, in SemanticKITTI we have pseudo-labels for only $14\%$ of points).

Qualitative results on the Waymo Open dataset. We visualize pseudo-labels (left) and our model output (right). Waymo Open does not provide panoptic ground truth labels. While our output displays semantics (colors encode semantic classes), the class-agnostic pseudo-labels show instances (colors encode instance IDs).

Media Coverage


SAL was recently highlighted at NVIDIA GTC, where NVIDIA showcased how AI and generative AI tools empower autonomous vehicle developers. Enjoy the video!


Citation



    @article{sal2024arxiv,
        title={Better Call SAL: Towards Learning to Segment Anything in Lidar},
        author={Osep, Aljosa and Meinhardt, Tim and Ferroni, Francesco and Peri, Neehar and Ramanan, Deva and Leal-Taixé, Laura},
        journal={arXiv preprint arXiv:2403.13129},
        year={2024}
    }

Paper


Better Call SAL: Towards Learning to Segment Anything in Lidar

Aljosa Osep*, Tim Meinhardt*, Francesco Ferroni, Neehar Peri, Deva Ramanan, Laura Leal-Taixé

Paper
BibTeX

Acknowledgment


This project was funded, in part, by ERC Starting Grant DynAI (ERC-101043189). We are grateful to Zan Gojcic, Guillem Brasó, Cristiano Saltori, Sérgio Agostinho, and Jonas Schult for their feedback on the paper and their insightful comments. We are grateful for the MaskPLS codebase provided by the Photogrammetry and Robotics Lab at the University of Bonn. Special thanks to Maxim Maximov for his help with the figures!