
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

1 NVIDIA
2 University of Toronto
* Equal Contribution

We extend 3D Gaussian Splatting (3DGS) to support nonlinear camera projections and secondary rays for simulating phenomena such as reflections and refractions. By replacing EWA splatting rasterization with the Unscented Transform, our approach retains real-time efficiency while accommodating complex camera effects such as rolling shutter. (Left) A comparison of our model trained on undistorted (rectified) views versus trained directly on the original distorted fisheye views, showing that training on the full set of pixels improves visual quality. (Right) Two synthetic objects, a reflective sphere and a refractive statue, inserted into a scene reconstructed with our model.

Abstract


3D Gaussian Splatting (3DGS) has shown great potential for efficient reconstruction and high-fidelity real-time rendering of complex scenes on consumer hardware. However, due to its rasterization-based formulation, 3DGS is constrained to ideal pinhole cameras and lacks support for secondary lighting effects. Recent methods address these limitations by tracing volumetric particles instead; however, this comes at the cost of significantly slower rendering speeds. In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation in 3DGS with the Unscented Transform, which approximates the particles through sigma points that can be projected exactly under any nonlinear projection function. This modification enables trivial support of distorted cameras with time-dependent effects such as rolling shutter, while retaining the efficiency of rasterization. Additionally, we align our rendering formulation with that of tracing-based methods, enabling the secondary ray tracing required to represent phenomena such as reflections and refractions within the same 3D representation.

When projecting a Gaussian particle from 3D space onto the camera image plane, Monte Carlo sampling (left) provides the most accurate estimate but is costly to compute. The EWA splatting formulation used in 3DGS approximates the projection function via linearization, which requires a dedicated Jacobian J for each camera model and leads to approximation errors with increasing distortion. The Unscented Transform instead approximates the particle with sigma points that can be projected exactly, and from which the 2D conic can then be estimated.
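To make the idea concrete, below is a minimal sketch (not our exact implementation) of how an Unscented Transform recovers a 2D mean and conic: it draws sigma points from the 3D particle, pushes each one through an arbitrary, possibly distorted projection function, and refits a Gaussian to the projected points. The names `unscented_project` and `project`, and the spread parameter `kappa`, are illustrative assumptions.

```python
import numpy as np

def unscented_project(mu, cov, project, kappa=1.0):
    """Illustrative sketch: project a 3D Gaussian through an arbitrary camera model.

    mu:      (3,) particle mean
    cov:     (3, 3) particle covariance
    project: nonlinear projection R^3 -> R^2 (pinhole, fisheye, distorted, ...)
    """
    n = mu.shape[0]  # n = 3
    # Columns of the scaled Cholesky factor give the symmetric sigma-point offsets.
    L = np.linalg.cholesky((n + kappa) * cov)
    sigma_pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]

    # Standard unscented-transform weights (they sum to one).
    weights = np.array([kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n))

    # Each sigma point is projected exactly; no Jacobian of the camera model is needed.
    pts_2d = np.stack([np.asarray(project(p)) for p in sigma_pts])  # (2n+1, 2)

    # Refit a 2D Gaussian (mean and conic/covariance) to the projected points.
    mean_2d = weights @ pts_2d
    diff = pts_2d - mean_2d
    cov_2d = (weights[:, None] * diff).T @ diff
    return mean_2d, cov_2d
```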

For a given ray, 3DGS evaluates the response of the Gaussian particle in 2D after projection onto the camera image plane, which requires backpropagation through the (approximated) projection function. Instead, we follow 3DGRT and evaluate particles in 3D at the point of maximum response along the ray.
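For intuition, the point of maximum response along a ray has a simple closed form. The sketch below (our naming; opacity weighting omitted) evaluates a particle the way ray-tracing formulations such as 3DGRT do, under those simplifying assumptions.

```python
import numpy as np

def eval_max_response(o, d, mu, cov_inv):
    """Evaluate a 3D Gaussian at its point of maximum response along the ray o + t * d.

    o, d:    (3,) ray origin and direction
    mu:      (3,) Gaussian mean
    cov_inv: (3, 3) inverse covariance of the Gaussian
    """
    # Maximizing exp(-0.5 * (o + t*d - mu)^T S^{-1} (o + t*d - mu)) over t
    # gives t* = ((mu - o)^T S^{-1} d) / (d^T S^{-1} d).
    delta = mu - o
    t_star = (delta @ cov_inv @ d) / (d @ cov_inv @ d)

    diff = o + t_star * d - mu
    response = np.exp(-0.5 * diff @ cov_inv @ diff)
    return t_star, response
```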

Comparison to 3D Gaussian Splatting (3DGS)


Qualitative comparison of our novel-view synthesis results against 3DGS on the MipNERF360 dataset. 3DGUT achieves comparable perceptual quality.

3DGS vs. Ours

Comparison to FisheyeGS


The unscented transform enables our method to support complex camera models, such as fisheye cameras, without requiring a true ray-tracing formulation. We compare our approach against FisheyeGS, demonstrating through both quantitative and qualitative evaluations that 3DGUT significantly outperforms FisheyeGS across all perceptual metrics. Notably, 3DGUT achieves this with fewer than half the particles (0.38M vs. 1.07M). While FisheyeGS relies on deriving a Jacobian specific to this particular fisheye camera model—restricting its generalizability even to closely related models (e.g., fisheye cameras with distortions)—our simple yet robust formulation delivers superior performance and can be effortlessly adapted to any camera model.

FisheyeGS vs. Ours

Rolling Shutter Cameras


Apart from modeling distorted cameras, 3DGUT can also faithfully incorporate camera motion into the projection formulation, thereby supporting time-dependent camera effects such as rolling shutter, which are commonly encountered in autonomous driving and robotics. Although optical distortion can be addressed with image rectification [1], incorporating the time-dependency of the projection function into the linearization framework is highly non-trivial; a sketch of the idea follows the footnote below.

Global Shutter vs. Rolling Shutter

[1] Image rectification is generally effective only for low-FoV cameras and results in information loss.
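As a rough illustration of how a time-dependent projection can be handled (a simplified sketch, not the method as implemented): each image row is exposed at a different time, so the camera pose used for projection depends on the row a point lands on, and the coupled problem can be resolved with a few fixed-point iterations. The helpers `poses`, `t_readout`, and `project` are hypothetical.

```python
import numpy as np

def rolling_shutter_project(p_world, poses, t_readout, project, image_height, n_iters=3):
    """Sketch: project a 3D point under a rolling-shutter camera.

    p_world:      (3,) point in world space
    poses:        callable t -> (R, t_cam), interpolated camera pose at time t
    t_readout:    callable row -> capture time of that image row
    project:      camera-space -> pixel (u, v) projection (any camera model)
    image_height: number of image rows
    """
    # The landing row depends on the pose, and the pose depends on the row's
    # readout time; iterate from the mid-frame row until it stabilizes.
    row = image_height / 2.0
    u = v = None
    for _ in range(n_iters):
        R, t_cam = poses(t_readout(row))
        p_cam = R @ p_world + t_cam
        u, v = project(p_cam)
        row = float(np.clip(v, 0.0, image_height - 1.0))
    return u, v
```

In our setting, each sigma point of a particle would be projected with such a time-dependent pose, so the rolling-shutter motion enters the estimated 2D conic directly.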

Reflections and Refractions


Our method enables the simulation of reflections and refractions—effects traditionally achievable only through ray tracing—using a hybrid rendering scheme. Specifically, we begin by computing all primary ray intersections with the scene. These primary rays are then rendered using our splatting method by discarding Gaussian hits that fall behind a ray's closest intersection. Finally, we compute secondary rays and trace them using 3DGRT. This capability is made possible by our method's ability to generate a 3D representation fully consistent with 3DGRT.
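The overall flow can be summarized by the hedged sketch below; `intersect_surfaces`, `splat_primary`, `trace_secondary`, and `composite` are hypothetical stand-ins for the corresponding pipeline stages, not an actual API.

```python
def render_hybrid(pixels, intersect_surfaces, splat_primary, trace_secondary, composite):
    """Sketch of the hybrid scheme: splat primary rays, ray trace secondary rays."""
    # 1. Closest intersection of each primary ray with inserted surfaces (if any).
    hits = intersect_surfaces(pixels)

    # 2. Render primary rays with splatting, discarding Gaussian hits behind the intersection.
    primary = splat_primary(pixels, max_depth=hits)

    # 3. Spawn reflected/refracted rays at the hits and trace them through the
    #    same Gaussian representation (as in 3DGRT).
    secondary = trace_secondary(hits)

    # 4. Combine the two contributions into the final image.
    return composite(primary, secondary, hits)
```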

AV Scene Reconstruction


Real-world AV and robotics applications often need to account for distorted intrinsic camera models and time-dependent effects like rolling shutter distortion caused by fast sensor motion during row-wise readout. 3DGUT (sorted) handles these effects naturally and reaches performance comparable to ray-tracing-based reconstruction methods. Below, we show qualitative results on the Waymo dataset against 3DGRT.

Qualitative comparison of our novel-view synthesis results against 3DGRT on the Waymo dataset.

Gaussian Projection Quality


While Monte Carlo sampling is expensive to compute, it provides accurate reference distributions for assessing the quality of both the EWA and our UT-based projections. This assessment can be quantified using the Kullback-Leibler (KL) divergence between the two 2D distributions, where lower KL values indicate that the projected Gaussians better approximate the reference projections. In the figure below, we evaluate the KL divergence for a fixed reconstruction. Specifically, for each visible Gaussian, we compare the projections obtained with either method under different camera and pose configurations against MC-based references (using 500 samples per reference). The resulting KL divergence distributions are visualized in the histograms at the bottom.
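For reference, the KL divergence between two Gaussians has a closed form. The sketch below (function names are ours, and the direction of the divergence is an assumption) fits a Gaussian to MC-projected samples and scores a projected 2D Gaussian against it.

```python
import numpy as np

def kl_gaussian_2d(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence D_KL(N(mu0, cov0) || N(mu1, cov1)) in 2D."""
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - 2.0                                    # dimensionality k = 2
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def kl_against_mc_reference(samples_2d, mu_proj, cov_proj):
    """Fit a 2D Gaussian to MC-projected samples, then score a projected Gaussian."""
    mu_ref = samples_2d.mean(axis=0)            # (2,)
    cov_ref = np.cov(samples_2d, rowvar=False)  # (2, 2)
    return kl_gaussian_2d(mu_ref, cov_ref, mu_proj, cov_proj)
```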

While both distributions of divergences are consistent for the static pinhole camera case (first column), UT-based projections are more accurate than EWA-based estimates for the static fisheye camera case (third column), indicating that UT yields a better approximation as the non-linearity of the projection increases. For rolling-shutter camera poses (second and fourth columns), RS-aware UT-based projections still approximate the RS-aware MC references well. In contrast, RS-unaware EWA linearizations break down and fail to approximate this case (histogram domains are capped at 0.04 for clearer visualization; the EWA-based projections still exhibit a long tail of larger KL values). The tearing artifacts observed in EWA-based RS renderings arise from these inaccurate projections, which lead to incorrect pixel-to-Gaussian associations during the volume rendering step.