
Zero-Shot 4D Lidar Panoptic Segmentation

Yushan Zhang1,2*
Aljoša Ošep1
Laura Leal-Taixé1
Tim Meinhardt1
1 NVIDIA
2 Linköping University

* work done during an internship at NVIDIA

Learning to Segment Anything in Lidar-4D: Prior methods (left) for zero-shot Lidar panoptic segmentation (Osep et al., ECCV'24) process individual (3D) point clouds in isolation. In contrast, our data-driven approach (right) operates directly on sequences of point clouds, jointly performing object segmentation, tracking, and zero-shot recognition based on text prompts specified at test time. Our method localizes and tracks any object and provides a temporally coherent semantic interpretation of dynamic scenes. We can correctly segment canonical objects, such as car, and objects beyond the vocabularies of standard Lidar datasets, such as advertising stand.

Abstract


We propose SAL-4D (Segment Anything in Lidar-4D), a method that utilizes multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS), in conjunction with off-the-shelf Vision-Language foundation models, to Lidar. We use VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensor setups to distill them into our SAL-4D model. Thanks to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 points in terms of PQ, and unlock Zero-Shot 4D-LPS.

SAL-4D Pseudo-Labeling Engine


Given a calibrated RGB camera and Lidar sensor, (1) we first independently pseudo-label overlapping sliding windows (left). We track and segment objects in the video using SAM 2 (Ravi et al., arXiv:2408.00714), generate their semantic features using CLIP, and lift labels from images to 4D Lidar space. Finally, we "flatten" masklets to obtain a unique, non-overlapping set of masklets in Lidar for each temporal window. We then (2) associate masklets across windows via linear assignment (LA) to obtain pseudo-labels for full sequences and average their semantic features (right).
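The cross-window association step can be sketched as follows. This is a minimal illustration, assuming masklets are stored as sets of Lidar point indices in the window overlap together with a CLIP feature; the function names and the IoU-based matching cost are illustrative and not taken from the SAL-4D code.

```python
# Minimal sketch of cross-window masklet association via linear assignment (LA).
# Assumption: each masklet is a dict with 'points' (set of global Lidar point
# indices in the shared overlap region) and 'clip' (np.ndarray CLIP feature).
import numpy as np
from scipy.optimize import linear_sum_assignment


def masklet_iou(a: set, b: set) -> float:
    """IoU between two masklets given as sets of Lidar point indices."""
    union = len(a | b)
    return len(a & b) / union if union > 0 else 0.0


def associate_windows(prev_masklets, curr_masklets, iou_thresh=0.25):
    """Match masklets of two overlapping windows and merge their CLIP features."""
    cost = np.zeros((len(prev_masklets), len(curr_masklets)))
    for i, p in enumerate(prev_masklets):
        for j, c in enumerate(curr_masklets):
            cost[i, j] = -masklet_iou(p["points"], c["points"])  # maximize IoU

    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]

    # Average CLIP features of matched masklets to build a sequence-level token.
    for i, j in matches:
        merged = (prev_masklets[i]["clip"] + curr_masklets[j]["clip"]) / 2.0
        curr_masklets[j]["clip"] = merged / np.linalg.norm(merged)
    return matches
```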

SAL-4D Zero-Shot Model


We follow a tracking-before-detection design: we first segment and track objects in a class-agnostic fashion, and only once localized and tracked are objects recognized. To operationalize this, we employ a Transformer decoder-based architecture. In a nutshell, our network consists of a point cloud encoder-decoder that encodes sequences of point clouds, followed by a Transformer-based object instance decoder that localizes objects in the 4D Lidar space. This design follows our prior work on 4D Lidar Panoptic Segmentation (Aygün et al., CVPR'21).
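The description above can be summarized with the following PyTorch skeleton. It is a rough sketch under our own assumptions (a placeholder per-point MLP in place of a sparse-conv encoder-decoder, illustrative dimensions and head names), not the actual SAL-4D implementation.

```python
# Sketch of a query-based 4D Lidar instance decoder over superimposed point clouds.
import torch
import torch.nn as nn


class ZeroShot4DLidarModel(nn.Module):
    def __init__(self, feat_dim=256, num_queries=100, clip_dim=512):
        super().__init__()
        # Placeholder point encoder over (x, y, z, time-offset) inputs; a sparse
        # convolutional encoder-decoder would be used in practice.
        self.point_encoder = nn.Sequential(
            nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Transformer-based instance decoder: learned object queries attend to
        # per-point features to localize objects in the 4D Lidar volume.
        self.queries = nn.Embedding(num_queries, feat_dim)
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.instance_decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Per-query heads: objectness score and a CLIP-space feature for
        # zero-shot prompting; masklet logits come from a query-point dot product.
        self.objectness_head = nn.Linear(feat_dim, 1)
        self.clip_head = nn.Linear(feat_dim, clip_dim)

    def forward(self, points):  # points: (B, N, 4)
        point_feats = self.point_encoder(points)                  # (B, N, D)
        q = self.queries.weight.unsqueeze(0).expand(points.size(0), -1, -1)
        q = self.instance_decoder(q, point_feats)                 # (B, Q, D)
        masks = torch.einsum("bqd,bnd->bqn", q, point_feats)      # masklet logits
        return {
            "masks": masks,
            "objectness": self.objectness_head(q).squeeze(-1),
            "clip_feats": self.clip_head(q),
        }
```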

Inference: As our model directly processes superimposed point clouds within windows of size K, we perform near-online inference by associating Lidar masklets across time via bipartite matching based on 3D-IoU overlap. For zero-shot prompting, we first encode the prompts specified in the semantic class vocabulary using a CLIP language encoder. Then, we take the argmax over scores, computed as the dot product between the encoded prompts and the predicted per-object CLIP features.
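As a concrete illustration of the prompting step, the snippet below encodes a text vocabulary with CLIP and assigns each predicted masklet the class with the highest dot-product score. The vocabulary, prompt template, and the choice of the ViT-B/32 CLIP model are assumptions made for the example; the masklet features are assumed to live in the same embedding space as the text encoder.

```python
# Minimal sketch of zero-shot prompting with the openai/CLIP package.
import clip
import torch

vocabulary = ["car", "bicycle rider", "advertising stand", "electric street box"]

model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in vocabulary])
    text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)


def classify_masklets(masklet_clip_feats: torch.Tensor) -> list:
    """Assign each masklet the vocabulary class with the highest score.

    masklet_clip_feats: (M, D) predicted CLIP features, one per Lidar masklet,
    assumed to match the text encoder's embedding dimension (512 for ViT-B/32).
    """
    feats = masklet_clip_feats / masklet_clip_feats.norm(dim=-1, keepdim=True)
    scores = feats @ text_feats.T                   # (M, num_classes)
    return [vocabulary[i] for i in scores.argmax(dim=-1).tolist()]
```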

Key Results


Prompting with canonical classes on SemanticKITTI. We show ground-truth (GT) labels (left column), our pseudo-labels (middle column), and SAL-4D results (right column), with semantic predictions in the first row and instances in the second row. As can be seen, our pseudo-labels cover only the camera-visible portion of the sequence (middle column). In contrast to the GT labels, our pseudo-label instances are not limited to a subset of thing classes (left column). Our trained SAL-4D model thus learns to densely segment all classes in space and time (right column). Importantly, pseudo-labels do not provide semantic labels, only CLIP tokens. For visualization, we prompt individual instances with prompts that conform to the SemanticKITTI class vocabulary.


Prompt examples. We visualize the output of our model (objects highlighted in orange) for four different prompts: two canonical classes, car and bicycle rider, and two "arbitrary" objects, advertising stand and electric street box. As can be seen, all are segmented correctly, including stationary and moving instances. Remarkably, all three different types of advertising stands and both instances of electric street boxes are segmented correctly. We provide images for reference; images are not used as input to our model.

Zero-Shot 4D Lidar Panoptic Segmentation Benchmark

Benchmark Results. We compare SAL-4D to several supervised and zero-shot baselines for 4D Lidar Panoptic Segmentation. While there is still a gap between supervised and zero-shot approaches, SAL-4D significantly narrows it: on SemanticKITTI, our model reaches 59% of the performance of the top-performing supervised model, and on nuScenes 72%, even though it is not trained on any labeled data.

Zero-Shot Baselines. We construct several baselines that associate single-scan 3D SAL (Osep et al., ECCV'24) predictions over time and require no temporal GT supervision. As SemanticKITTI is dominated by static objects, we propose a minimal viable Stationary World (SW) baseline that propagates single-scan masks solely via ego-motion. Furthermore, we adopt a strong Lidar Multi-Object Tracking (MOT) approach, which utilizes Kalman filters in conjunction with linear-assignment-based association. Finally, as a data-driven, model-centric Video Instance Segmentation (VIS) baseline, we directly associate objects by matching decoder object queries of the 3D SAL model in the embedding space.
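For intuition, the Stationary World baseline can be approximated in a few lines: single-scan masks are expressed in world coordinates using the known ego pose and matched across scans under a static-world assumption. The centroid-distance matching below is a simplification chosen for this sketch; all names and thresholds are illustrative, not taken from the benchmark code.

```python
# Illustrative sketch of a Stationary World (SW) style association:
# propagate single-scan masks via ego-motion and match them across scans.
import numpy as np
from scipy.optimize import linear_sum_assignment


def to_world(points: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Apply a 4x4 sensor-to-world pose to (N, 3) points."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ pose.T)[:, :3]


def sw_associate(prev_masks, prev_pose, curr_masks, curr_pose, dist_thresh=1.0):
    """Propagate instance IDs assuming a static world.

    prev_masks / curr_masks: lists of (N_i, 3) point arrays in sensor coordinates.
    prev_pose / curr_pose: 4x4 sensor-to-world poses of the two scans.
    Returns (prev_idx, curr_idx) matches whose centroids stay within
    dist_thresh meters once both masks are expressed in world coordinates.
    """
    prev_c = np.array([to_world(m, prev_pose).mean(0) for m in prev_masks])
    curr_c = np.array([to_world(m, curr_pose).mean(0) for m in curr_masks])
    cost = np.linalg.norm(prev_c[:, None] - curr_c[None], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < dist_thresh]
```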


Citation



    @inproceedings{zhang2025sal4d,
        title={{Zero-Shot 4D Lidar Panoptic Segmentation}},
        author={Zhang, Yushan and O\v{s}ep, Aljo\v{s}a and Leal-Taix\'{e}, Laura and Meinhardt, Tim},
        booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2025},
    }

Paper


Zero-Shot 4D Lidar Panoptic Segmentation

Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Yushan Zhang, Aljoša Ošep, Laura Leal-Taixé, Tim Meinhardt


Acknowledgment


We are grateful to Ayça Takmaz for her help with the paper figures, to Neehar Peri for his feedback on the paper, and to the anonymous reviewers for their tips and insightful comments.