Learning to Segment Anything in Lidar-4D:
Prior methods (left) for zero-shot Lidar panoptic segmentation (Osep et al., ECCV'24) process individual (3D) point clouds in isolation. In contrast, our data-driven approach (right) operates directly on sequences of point clouds, jointly performing object segmentation, tracking, and zero-shot recognition based on text prompts specified at test time.
Our method localizes and tracks any object and provides a temporally coherent semantic interpretation of dynamic scenes. We can correctly segment canonical objects, such as "car", and objects beyond the vocabularies of standard Lidar datasets, such as "advertising stand".
Abstract
We propose SAL-4D (Segment Anything in Lidar-4D), a method that uses multi-modal robotic sensor setups as a bridge to distill recent advances in Video Object Segmentation (VOS), together with off-the-shelf Vision-Language foundation models, to Lidar.
We use VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, lift them to 4D Lidar space using calibrated multi-modal sensor setups, and distill them into our SAL-4D model.
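To illustrate the lifting step, the sketch below shows one plausible way to transfer image-space VOS tracklet masks to Lidar points via a calibrated camera: points are projected into the image with Lidar-to-camera extrinsics and camera intrinsics, and each point inherits the tracklet id, and hence that tracklet's CLIP token, of the pixel it lands on. This is a minimal, hypothetical sketch, not the released pipeline; the function name, argument layout, and calibration conventions are our assumptions.

```python
# Minimal sketch (assumed, not the authors' released code) of lifting
# image-space VOS tracklet masks to a Lidar point cloud via calibration.
import numpy as np

def lift_masks_to_lidar(points: np.ndarray,      # (N, 3) Lidar points
                        mask: np.ndarray,        # (H, W) VOS tracklet ids, 0 = background
                        T_cam_lidar: np.ndarray, # (4, 4) Lidar-to-camera extrinsics
                        K: np.ndarray            # (3, 3) camera intrinsics
                        ) -> np.ndarray:
    """Return a per-point tracklet id (0 for points that project outside
    the image, behind the camera, or onto background)."""
    H, W = mask.shape
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]              # transform into camera frame
    in_front = pts_cam[:, 2] > 0.1                          # keep points ahead of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                             # perspective division
    u = uv[:, 0].round().astype(int)
    v = uv[:, 1].round().astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.zeros(len(points), dtype=mask.dtype)
    labels[valid] = mask[v[valid], u[valid]]                # inherit tracklet id from the pixel
    return labels
```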
Thanks to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by more than 5 points in PQ, and unlock Zero-Shot 4D-LPS.
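To make the test-time, prompt-based recognition concrete, here is a minimal sketch (our illustration, not the authors' code), assuming the model predicts one CLIP-space token per Lidar segment: class prompts supplied at test time are embedded with a CLIP text encoder, and each segment is assigned the prompt with the highest cosine similarity.

```python
# Minimal sketch: zero-shot recognition of predicted segments from text
# prompts given at test time. segment_feats are assumed to be the per-segment
# CLIP tokens predicted by the model; prompt_feats are CLIP text embeddings
# of the test-time class prompts (computed with any CLIP text encoder).
import torch
import torch.nn.functional as F

def classify_segments(segment_feats: torch.Tensor,   # (N, D) per-segment CLIP tokens
                      prompt_feats: torch.Tensor     # (C, D) CLIP text embeddings of prompts
                      ) -> torch.Tensor:
    """Return the index of the best-matching prompt for each segment."""
    seg = F.normalize(segment_feats, dim=-1)
    txt = F.normalize(prompt_feats, dim=-1)
    similarity = seg @ txt.T           # (N, C) cosine similarities
    return similarity.argmax(dim=-1)   # hard assignment per segment

# Example: prompts may include classes outside standard Lidar vocabularies,
# e.g. ["car", "pedestrian", "advertising stand"].
```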