Dynamic Vision and Learning Group NVIDIA Research
Towards Learning to Complete Anything in Lidar


1 NVIDIA
2 ETH Zurich
3 Carnegie Mellon University

* work done during an internship at NVIDIA

Complete Anything in Lidar (CAL): Given a sparse Lidar point cloud, CAL localizes, reconstructs, and, optionally, recognizes objects in a zero-shot fashion. By providing a semantic class vocabulary of specific object classes at test time, CAL can be prompted to perform Semantic Scene Completion (SSC), Panoptic Scene Completion (PSC), or (amodal) 3D Object Detection. Please note that CAL only takes a single Lidar scan as input; RGB images are shown for visualization purposes only.

Abstract


We propose CAL (Complete Anything in Lidar) for Lidar-based shape completion in the wild. This task is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. In contrast, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies.

CAL Pseudo-Labeling Engine


Given a calibrated RGB camera and Lidar sensor, (1) we use a video-object segmentation model (SAM2) to localize object instances in video, (2) pseudo-label the Lidar point clouds over time, and (3) generate completed voxelized object representations, each enriched with a per-instance CLIP feature extracted from RGB images. In (4), we accumulate 360° Lidar scans to obtain full-scene binary occupancy, which is used to refine the aggregated pseudo-labels from (3) via a CRF-guided label refinement process (5). As output (6), our method pairs each sparse and incomplete Lidar scan with pseudo-labels for object-level scene completion (top-right) and CLIP features, which are temporally aggregated by averaging per-instance features across the sequence. These CLIP features enable zero-shot recognition via text queries (bottom-right). The mined pseudo-label pairs are then used to train the CAL model.
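Steps (2) and (6) can be illustrated with a minimal sketch. It assumes a simple pinhole camera, Lidar points already transformed into camera coordinates, and a per-pixel instance mask such as a video segmentation model might produce. The function names, the intrinsics `fx, fy, cx, cy`, and the toy feature vectors are hypothetical and not taken from the CAL code.

```python
from collections import defaultdict

def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera coordinates) to a pixel (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        return None  # point lies behind the camera
    return (int(round(fx * x / z + cx)), int(round(fy * y / z + cy)))

def label_points(points_cam, instance_mask, fx, fy, cx, cy):
    """Give each point the instance ID of the pixel it projects to (0 = unlabeled)."""
    height, width = len(instance_mask), len(instance_mask[0])
    labels = []
    for point in points_cam:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            labels.append(0)  # outside the camera frustum
        else:
            labels.append(instance_mask[uv[1]][uv[0]])
    return labels

def average_instance_features(per_frame_features):
    """Average each instance's feature vector over the frames in which it appears."""
    sums, counts = {}, defaultdict(int)
    for frame in per_frame_features:
        for inst_id, feat in frame.items():
            if inst_id not in sums:
                sums[inst_id] = [0.0] * len(feat)
            sums[inst_id] = [s + f for s, f in zip(sums[inst_id], feat)]
            counts[inst_id] += 1
    return {i: [s / counts[i] for s in sums[i]] for i in sums}
```

In this sketch, points that fall outside the image or behind the camera stay unlabeled; the page does not describe how such points are handled in the actual pipeline.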

CAL Zero-Shot Model


The CAL model architecture and training pipeline are shown below. The CAL backbone consists of a sparse encoder and a dense 3D convolutional block. We estimate scene-level occupancy using a multi-scale sparse generative decoder that consists of decoder blocks (D), two occupancy heads (Bo and Bs), and a pseudo-semantic head (S) at each scale L. The Transformer decoder then predicts segmentation masks over the completed scene and regresses a CLIP feature for each mask.
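To give an intuition for a multi-scale sparse generative decoder, the sketch below shows one generic upsample-and-prune step: each coarse voxel is split into eight children, and only children that an occupancy head scores above a threshold are kept for the next scale. This is a common pattern for sparse generative decoders in general and an assumption here, not a description of CAL's exact layers. `occupancy_prob` is a hypothetical callable standing in for a learned occupancy head.

```python
def upsample_and_prune(coarse_voxels, occupancy_prob, threshold=0.5):
    """Split each coarse voxel into 8 children and keep the likely-occupied ones.

    coarse_voxels: iterable of integer (x, y, z) indices at the coarse scale.
    occupancy_prob: hypothetical stand-in for an occupancy head; maps a fine
        voxel index to a probability in [0, 1].
    """
    kept = []
    for x, y, z in coarse_voxels:
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
                    if occupancy_prob(child) >= threshold:
                        kept.append(child)  # survives pruning at this scale
    return kept
```

Applying this step once per scale grows the occupied region from coarse to fine while keeping the voxel set sparse.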

Key Results


Zero-Shot Panoptic Scene Completion (ZS-PSC)

[Figure columns, left to right: Input Scan, Completion + Masks, Completion + Prompting, Ground Truth]

Zero-Shot Panoptic Scene Completion (ZS-PSC) results on SemanticKITTI. Given a single Lidar scan (1st column), CAL completes object-level observations as a set of masks over the voxel grid (2nd column) and predicts a CLIP feature for each mask. We can prompt with any semantic class vocabulary and obtain panoptic and semantic scene completion results (3rd column). Our model predicts shape priors for both thing classes (e.g., "car", "cyclist") and stuff classes (e.g., "vegetation", "road"), and can correctly predict the intersection geometry in the 4th row despite limited direct evidence.
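Prompting with a class vocabulary can be sketched as a nearest-neighbor lookup in feature space: each mask's predicted feature is compared to one text embedding per class name, and the most similar class wins. The example below uses cosine similarity and tiny hand-made vectors. In practice the text embeddings would come from a CLIP text encoder, and the exact scoring rule used by CAL is not stated on this page, so treat this as an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prompt_masks(mask_features, text_embeddings):
    """Assign each mask the class whose text embedding is most similar.

    mask_features: list of per-mask feature vectors (toy stand-ins here).
    text_embeddings: dict mapping class name -> embedding vector.
    """
    labels = []
    for feat in mask_features:
        best = max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))
        labels.append(best)
    return labels

# Toy vocabulary; real embeddings would be high-dimensional.
vocabulary = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0], "vegetation": [0.0, 0.0, 1.0]}
predicted = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]
labels = prompt_masks(predicted, vocabulary)
```

Because the vocabulary is just a dictionary of embeddings, swapping in a different set of class names changes the output labels without retraining anything.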


[Figure columns, left to right: Input Scan, Completion + Masks, Zero-Shot Prompting]

Completion and amodal detection on KITTI-360. Given an input Lidar scan (left), CAL outputs a set of completed object shapes (middle). We visualize recognized objects (right) for the queries "vehicle" (top) and "tree" (bottom), and fit 3D bounding boxes to the identified object instances, demonstrating the zero-shot amodal 3D object detection ability of our method.
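One simple way to turn a completed instance into a 3D box is to take the axis-aligned extent of its occupied voxels, as sketched below. The page does not say how the boxes are fitted (an oriented box is equally plausible), so the axis-aligned choice, the function name, and the `voxel_size` and `grid_origin` parameters are all assumptions made for illustration.

```python
def fit_axis_aligned_box(voxel_indices, voxel_size, grid_origin=(0.0, 0.0, 0.0)):
    """Return (center, size) of the axis-aligned box enclosing the given voxels.

    voxel_indices: list of integer (x, y, z) indices occupied by one instance.
    voxel_size: edge length of a voxel in meters (hypothetical value in usage).
    grid_origin: metric position of the corner of voxel (0, 0, 0).
    """
    if not voxel_indices:
        raise ValueError("cannot fit a box to an empty instance")
    lo, hi = [], []
    for axis in range(3):
        min_idx = min(v[axis] for v in voxel_indices)
        max_idx = max(v[axis] for v in voxel_indices) + 1  # include far face of last voxel
        lo.append(grid_origin[axis] + min_idx * voxel_size)
        hi.append(grid_origin[axis] + max_idx * voxel_size)
    center = tuple((l + h) / 2 for l, h in zip(lo, hi))
    size = tuple(h - l for l, h in zip(lo, hi))
    return center, size

center, size = fit_axis_aligned_box([(0, 0, 0), (1, 0, 0), (1, 2, 0)], voxel_size=0.5)
```

Because the box is fitted to the completed shape rather than to the visible points only, it covers the full (amodal) extent of the object.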

Panoptic Scene Completion Benchmark Results

Panoptic Scene Completion Benchmark Results. We compare CAL against LMSCNet (Roldao et al., 2020) + MaskPLS (Marcuzzi et al., 2023), JS3CNet (Yan et al., 2021) + MaskPLS, SCPNet (Xia et al., 2023) + MaskPLS, and PaSCo (Cao et al., 2024) (M=1 and Ensemble).


Citation



    @article{takmaz2025cal,
        title={{Towards Learning to Complete Anything in Lidar}},
        author={Ayca Takmaz and Cristiano Saltori and Neehar Peri and Tim Meinhardt and
                Riccardo de Lutio and Laura Leal-Taixé and Aljosa Osep},
        journal={ArXiv},
        year={2025},
        volume={abs/2504.12264},
    }

Paper


Towards Learning to Complete Anything in Lidar

Pre-print, 2025

Ayca Takmaz, Cristiano Saltori, Neehar Peri, Tim Meinhardt, Riccardo de Lutio, Laura Leal-Taixé, Aljosa Osep


Acknowledgment


We are grateful to Zan Gojcic and Dávid Rozenberszki for their feedback on the paper and their insightful comments.