Toronto AI Lab

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

Chenfeng Xu 1,2
Huan Ling 1,3,4
Sanja Fidler 1,3,4
Or Litany 1,5

1NVIDIA
2UC Berkeley
3Vector Institute
4University of Toronto
5Technion

Arxiv | BibTex | Code (coming soon)


Abstract

We present 3DiffTection, a cutting-edge method for 3D detection from single images, grounded in features from a 3D-aware diffusion model. Annotating large-scale image data for 3D object detection is both resource-intensive and time-consuming. Recently, large image diffusion models have gained traction as potent feature extractors for 2D perception tasks. However, these features, originally trained on paired text and image data, are not directly adaptable to 3D tasks and often misalign with target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we refine a diffusion model on a view synthesis task, introducing a novel epipolar warp operator. This task meets two pivotal criteria: the necessity for 3D awareness and reliance solely on posed image data, which are readily available (e.g., from videos). For semantic refinement, we further train the model on target data with box supervision. Both tuning phases employ a ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these capabilities to conduct a test-time prediction ensemble across multiple virtual viewpoints. Through this methodology, we obtain 3D-aware features tailored for 3D detection that excel at identifying cross-view point correspondences.
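To make the test-time ensembling step concrete, here is a minimal PyTorch sketch of averaging detection confidences over features extracted at several virtual viewpoints. The helpers `extract_features` and `detection_head` are hypothetical placeholders for the geometry-aware feature extractor and the 3D detection head, not the released API.

```python
import torch

def ensemble_detect(image, virtual_poses, extract_features, detection_head):
    """Hypothetical sketch: average per-box confidences over diffusion
    features extracted under several virtual camera viewpoints."""
    scores_per_view = []
    boxes = None
    for pose in virtual_poses:
        # Geometry-aware features conditioned on a virtual view
        # (assumed to be resampled back into the input frame, so the
        # predicted boxes stay comparable across views).
        feats = extract_features(image, pose)
        boxes, scores = detection_head(feats)
        scores_per_view.append(scores)
    # Prediction ensemble: mean confidence across virtual views.
    return boxes, torch.stack(scores_per_view).mean(dim=0)
```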




3D Object Detection

3D object detection results. Left: 3DiffTection detections on an unseen video without camera poses, where our method outperforms the baseline. Right: detections in challenging cases, e.g., chairs occluded by tables and a rarely observed sink.



3D Correspondences



Correspondences demo: The video demonstrates that off-the-shelf Stable Diffusion features fall short in accurately capturing correspondences within a 3D environment. In contrast, our 3DiffTection excels at identifying precise 3D correspondences.
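A simple way to probe such correspondences is nearest-neighbor matching in feature space. The sketch below is a hypothetical illustration rather than the paper's evaluation code: it matches a query pixel's feature against every location of a second view's feature map via cosine similarity, assuming a feature extractor that returns a (C, H, W) map per image.

```python
import torch
import torch.nn.functional as F

def match_point(feat_src, feat_tgt, uv):
    """Find the target-view pixel whose feature best matches the source
    feature at pixel `uv`, via cosine similarity.
    feat_src, feat_tgt: (C, H, W) feature maps; uv: (u, v) pixel coords."""
    C, H, W = feat_src.shape
    q = F.normalize(feat_src[:, uv[1], uv[0]], dim=0)   # (C,) query feature
    k = F.normalize(feat_tgt.reshape(C, -1), dim=0)     # (C, H*W) candidates
    sim = q @ k                                         # cosine similarities
    idx = sim.argmax().item()
    return idx % W, idx // W                            # best-matching (u, v)
```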


Geometric ControlNet makes Stable Diffusion 3D-aware



Geometric ControlNet: Left: the original Stable Diffusion UNet encoder block. Right: we train for novel-view image synthesis by adding a geometric ControlNet to the original Stable Diffusion encoder blocks. The geometric ControlNet receives the conditional-view image as an additional input. Using the camera pose, we introduce an epipolar warp operator that warps intermediate features into the target view. With the geometric ControlNet, we significantly improve the 3D awareness of the pre-trained diffusion features.
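As a rough PyTorch sketch of what such an epipolar warp operator can look like: for each target-view pixel, compute its epipolar line in the source view, sample source features along that line, and aggregate. The assumptions here (shared intrinsics `K` across views, a relative pose `(R, t)` from source to target, uniform line sampling, and mean aggregation) are ours for illustration; the paper's operator may differ in these details.

```python
import torch
import torch.nn.functional as F

def skew(t):
    """Cross-product matrix [t]_x for a 3-vector t."""
    tx, ty, tz = t.tolist()
    return torch.tensor([[0., -tz, ty],
                         [tz, 0., -tx],
                         [-ty, tx, 0.]])

def epipolar_warp(feat_src, K, R, t, n_samples=32):
    """Sketch of an epipolar warp: for every target-view pixel, average
    source-view features sampled along its epipolar line.
    feat_src: (1, C, H, W) source-view features; K: (3, 3) intrinsics;
    (R, t): relative pose source -> target."""
    _, _, H, W = feat_src.shape
    Kinv = torch.linalg.inv(K)
    # Fundamental matrix satisfying x_tgt^T F x_src = 0.
    Fmat = Kinv.T @ skew(t) @ R @ Kinv
    # Homogeneous coordinates of all target pixels: (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    x_tgt = torch.stack([u.flatten(), v.flatten(), torch.ones(H * W)], 0)
    # Epipolar line (a, b, c) in the source image per target pixel.
    a, b, c = Fmat.T @ x_tgt                            # each (H*W,)
    # Parameterize each line by u and solve a*u + b*v + c = 0.
    # (Near-vertical lines with b ~ 0 are ignored here for simplicity;
    # their samples fall outside [-1, 1] and are zero-padded.)
    us = torch.linspace(0, W - 1, n_samples)[:, None]   # (S, 1)
    vs = -(a * us + c) / (b + 1e-8)                     # (S, H*W)
    grid = torch.stack([us.expand_as(vs) / (W - 1) * 2 - 1,
                        vs / (H - 1) * 2 - 1], dim=-1)  # (S, H*W, 2)
    samples = F.grid_sample(feat_src, grid[None], align_corners=True)
    # Aggregate along the line (mean here; attention is another choice).
    return samples.mean(dim=2).reshape(1, -1, H, W)
```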


Single-view novel-view synthesis at the scene level



Single-view novel-view synthesis results: We present our single-view novel-view synthesis results at the scene level. Our proposed geometric ControlNet enables Stable Diffusion to perform novel-view synthesis at the scene level from a single-view image while preserving the original semantic representations.
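The preservation of the original semantic representations follows from the ControlNet design: the pre-trained branch stays frozen and the trainable copy feeds back through zero-initialized convolutions, so training starts as an identity over the pre-trained features. Below is a minimal sketch of this pattern, with `warp` standing in for the epipolar warp operator above; the names and the exact placement of the warp are illustrative assumptions, not the released code.

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    """Sketch of a ControlNet-style block: a frozen Stable Diffusion
    encoder block plus a trainable copy whose output is added back
    through a zero-initialized 1x1 conv (a no-op at initialization)."""
    def __init__(self, sd_block, feat_dim):
        super().__init__()
        self.control = copy.deepcopy(sd_block)        # trainable copy
        self.frozen = sd_block.requires_grad_(False)  # original SD weights
        self.zero_conv = nn.Conv2d(feat_dim, feat_dim, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, cond_feat, warp):
        # `warp` moves conditional-view features into the target view
        # (e.g., the epipolar warp sketched earlier).
        h = self.control(x + cond_feat)
        return self.frozen(x) + self.zero_conv(warp(h))
```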


Citation

@misc{xu20233difftection,
      title={3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features},
      author={Chenfeng Xu and Huan Ling and Sanja Fidler and Or Litany},
      year={2023},
      eprint={2311.04391},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Related works