Abstract

We present 3DiffTection, a method for 3D object detection from single images that builds on features from a 3D-aware diffusion model. Annotating large-scale image data for 3D object detection is both resource-intensive and time-consuming. Recently, large image diffusion models have gained traction as potent feature extractors for 2D perception tasks. However, these models are trained on paired text and image data, so their features are not directly adaptable to 3D tasks and often misalign with the target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we fine-tune a diffusion model on a view synthesis task, introducing a novel epipolar warp operator. This task meets two pivotal criteria: it requires 3D awareness, and it relies solely on posed image data, which are readily available (e.g., from videos). For semantic tuning, we further train the model on target data using box supervision. Both tuning phases employ a ControlNet to preserve the original feature capabilities. In the final step, we use these capabilities to run a test-time prediction ensemble across multiple virtual viewpoints. The resulting 3D-aware features are tailored for 3D detection and excel at identifying cross-view point correspondences.

3D Object Detection


3D object detection results. Left: detection results of 3DiffTection on an unseen video without camera poses; our method outperforms our baseline method. Right: detection results on challenging cases, e.g., chairs occluded by tables and a sink that is rarely observed.

3D Correspondences

Correspondences demo: The video demonstrates that off-the-shelf Stable Diffusion features fall short in accurately capturing correspondences within a 3D environment. In contrast, 3DiffTection identifies precise 3D correspondences.
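A common way to read point correspondences out of dense features is a nearest-neighbour search under cosine similarity. The sketch below illustrates that general idea only; the function name `match_point` and the `(C, H, W)` feature layout are assumptions made for this example, and the page does not state that this is the exact protocol used in the demo.

```python
import numpy as np

def match_point(feat_a, feat_b, uv):
    """Find the location in feat_b most similar to pixel uv of feat_a.

    feat_a, feat_b: dense feature maps of shape (C, H, W) (assumed layout).
    uv: (col, row) query location in feat_a.
    Returns the (col, row) of the best cosine-similarity match in feat_b.
    """
    c, h, w = feat_b.shape
    # Query descriptor, L2-normalised so the dot product is a cosine similarity.
    q = feat_a[:, uv[1], uv[0]]
    q = q / (np.linalg.norm(q) + 1e-8)
    # Flatten and normalise every descriptor of the second feature map.
    fb = feat_b.reshape(c, -1)
    fb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + 1e-8)
    # Similarity of the query against all H*W candidates.
    sim = q @ fb
    idx = int(np.argmax(sim))
    return idx % w, idx // w
```

With features that encode 3D structure consistently across views, the best match should land on the same physical point in the second view; with features that do not, the match can drift to a visually similar but geometrically wrong location.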

Geometric ControlNet makes Stable Diffusion 3D-Aware

Geometric ControlNet: Left: Original Stable Diffusion UNet encoder block. Right: We train novel view image synthesis by adding a geometric ControlNet to the original Stable Diffusion encoder blocks. The geometric ControlNet receives the conditional view image as an additional input. Using the camera pose, we introduce an epipolar warp operator, which warps intermediate features into the target view. With the geometric ControlNet, we significantly improve the 3D awareness of pre-trained diffusion features.
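The caption does not spell out the epipolar warp operator, so the following is only a sketch of the two-view geometry such an operator rests on: given camera intrinsics and a relative pose, each target-view pixel constrains its match in the conditional (source) view to a line. The function names, the pose convention X_tgt = R X_src + t, and the pinhole model are assumptions made for illustration; how features are sampled and aggregated along the line is not shown here.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_matrix(K_src, K_tgt, R, t):
    """Fundamental matrix F with x_tgt^T F x_src = 0.

    Assumes points map between camera frames as X_tgt = R @ X_src + t.
    """
    E = skew(t) @ R  # essential matrix for this pose convention
    return np.linalg.inv(K_tgt).T @ E @ np.linalg.inv(K_src)

def epipolar_line(F, uv_tgt):
    """Line (a, b, c) in the source image with a*u + b*v + c = 0.

    Every source pixel that can correspond to target pixel uv_tgt lies on
    this line. The line is scaled so (a, b) has unit length, which assumes
    it is not the degenerate line at infinity.
    """
    x = np.array([uv_tgt[0], uv_tgt[1], 1.0])
    line = F.T @ x
    return line / np.linalg.norm(line[:2])
```

A warp built on this geometry would, for each target location, look up source features at points along the returned line rather than searching the whole source feature map.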

Single-view novel-view synthesis at scene level

Single-view novel-view synthesis results: We present our single-view novel-view synthesis results at the scene level. Our proposed geometric ControlNet enables Stable Diffusion to achieve novel-view synthesis at the scene level from a single-view image while still preserving the original semantic representations.
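One reason a ControlNet-style branch can add new behaviour without disturbing the pretrained representation is that, in the original ControlNet design, the branch is connected to the frozen network through zero-initialized layers, so the combined output initially equals the frozen output. The sketch below illustrates that mechanism with 1x1 channel projections in NumPy; the class name `ZeroInitAdapter`, the `tanh` nonlinearity, and the shapes are assumptions made for this example, not details taken from 3DiffTection.

```python
import numpy as np

class ZeroInitAdapter:
    """Toy control branch added to a frozen block through a zero-initialized projection."""

    def __init__(self, channels, rng):
        # Stand-in for the trainable control branch (randomly initialized).
        self.w_ctrl = rng.normal(scale=0.02, size=(channels, channels))
        # Zero-initialized 1x1 projection: contributes nothing before training.
        self.w_zero = np.zeros((channels, channels))

    def __call__(self, frozen_out, cond_feat):
        # frozen_out, cond_feat: arrays of shape (C, H, W).
        ctrl = np.tanh(np.einsum("oc,chw->ohw", self.w_ctrl, cond_feat))
        residual = np.einsum("oc,chw->ohw", self.w_zero, ctrl)
        # The frozen output is kept and only a residual is added.
        return frozen_out + residual
```

At initialization the adapter returns `frozen_out` unchanged; the residual only becomes non-zero once training moves `w_zero` away from zero, which is what lets the pretrained semantic features survive the start of fine-tuning.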

Citation