QUEEN:
QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos

NeurIPS 2024

We develop efficient representations for streamable free-viewpoint videos with dynamic Gaussians. Our method, QUEEN, captures dynamic scenes at high visual quality while reducing the model size to just 0.7 MB per frame, training in under 5 seconds per frame, and rendering at ∼350 FPS.

Abstract

Online free-viewpoint video (FVV) streaming is a challenging and relatively under-explored problem. It requires incremental on-the-fly updates to a volumetric representation, fast training and rendering to satisfy real-time constraints, and a small memory footprint for efficient transmission. If achieved, it can enhance user experience by enabling novel applications such as 3D video conferencing and live volumetric video broadcast. In this work, we propose a novel framework for QUantized and Efficient ENcoding (QUEEN) for streaming FVV using 3D Gaussian Splatting (3D-GS). QUEEN directly learns Gaussian attribute residuals between consecutive frames at each time-step without imposing any structural constraints on them, allowing for high-quality reconstruction and generalizability. To efficiently store the residuals, we further propose a quantization-sparsity framework, which contains a learned latent decoder for effectively quantizing attribute residuals other than Gaussian positions and a learned gating module to sparsify position residuals. We propose to use the Gaussian viewspace gradient difference vector as a signal to separate the static and dynamic content of the scene. It acts as a guide for effective sparsity learning and speeds up training. On diverse FVV benchmarks, QUEEN outperforms the state-of-the-art online FVV methods on all metrics. Notably, for several highly dynamic scenes, it reduces the model size to just 0.7 MB per frame while training in under 5 seconds and rendering at ∼350 FPS.


Approach

Quantization-Sparsity Framework

Our approach learns streamable 3D Gaussian attribute residuals at each time-step. We develop a quantization-sparsity framework that compresses position residuals via sparsity and all other attribute residuals via quantization. The compressed latents are learned in an end-to-end differentiable manner. We also develop an adaptive masking technique that separates static and dynamic Gaussians, along with the corresponding image regions, to speed up per-frame training.
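The sketch below illustrates the core quantization-sparsity idea in PyTorch. It is a minimal illustration under our own assumptions, not the authors' implementation: all module and tensor names (e.g., `ResidualCodec`, `gate_logits`, `attr_dim`) are hypothetical. Per-Gaussian latents are rounded with a straight-through estimator and decoded into non-position attribute residuals, while a learned gate zeroes out most position residuals.

```python
# Minimal sketch of quantized attribute residuals plus gated (sparse) position
# residuals. Hypothetical names and dimensions; not the official QUEEN code.
import torch
import torch.nn as nn

class ResidualCodec(nn.Module):
    def __init__(self, num_gaussians, latent_dim=8, attr_dim=11):
        super().__init__()
        # Per-Gaussian latent codes for non-position attribute residuals
        # (e.g. rotation, scale, opacity, color), learned at each frame.
        self.latents = nn.Parameter(torch.zeros(num_gaussians, latent_dim))
        # Small learned decoder mapping quantized latents to attribute residuals.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, attr_dim))
        # Per-Gaussian gate logits deciding which position residuals survive.
        self.gate_logits = nn.Parameter(torch.zeros(num_gaussians, 1))
        self.pos_residual = nn.Parameter(torch.zeros(num_gaussians, 3))

    def forward(self, hard_gate_threshold=0.5):
        # Quantize latents by rounding; the straight-through estimator keeps
        # gradients flowing to the continuous latents.
        q = self.latents + (torch.round(self.latents) - self.latents).detach()
        attr_res = self.decoder(q)
        # Soft gate in [0, 1]; hard threshold in the forward pass, soft
        # gradient in the backward pass (straight-through again).
        soft = torch.sigmoid(self.gate_logits)
        hard = (soft > hard_gate_threshold).float()
        gate = soft + (hard - soft).detach()
        pos_res = gate * self.pos_residual  # sparse position residuals
        return attr_res, pos_res, soft
```

At each time-step, the decoded residuals would be added to the previous frame's Gaussian attributes; only the rounded latents, the gate mask, and the surviving position residuals need to be stored or transmitted. A sparsity penalty on the soft gate (e.g., its mean) would encourage most position residuals to be exactly zero.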

Adaptive Masked Training and Improved Initialization

In addition to the quantization-sparsity framework, (a) we use the difference of Gaussian viewspace gradients as a signal to separate static and dynamic Gaussians and their corresponding image regions; by performing selective rendering and backpropagation, we speed up per-frame training. (b) We use an off-the-shelf depth prediction network to complete the scene on top of the Gaussian initialization obtained from COLMAP.
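A minimal sketch of the masking step, assuming per-Gaussian accumulated 2D viewspace positional gradients are available for consecutive frames; the function and variable names (`dynamic_gaussian_mask`, `grad_prev`, `grad_curr`) and the thresholding rule are assumptions for illustration, not the paper's exact procedure.

```python
# Flag Gaussians whose viewspace gradients change noticeably between frames.
# Hypothetical helper; threshold choice is an assumption.
import torch

def dynamic_gaussian_mask(grad_prev: torch.Tensor,
                          grad_curr: torch.Tensor,
                          threshold: float = 1e-4) -> torch.Tensor:
    """Boolean mask of likely-dynamic Gaussians, from the per-Gaussian
    difference of accumulated viewspace gradients at consecutive frames."""
    diff = (grad_curr - grad_prev).norm(dim=-1)  # per-Gaussian change magnitude
    return diff > threshold

# During per-frame training, updates can then be restricted to the dynamic
# subset, e.g. by zeroing gradients of static Gaussians:
#   mask = dynamic_gaussian_mask(grad_prev, grad_curr)
#   positions.grad[~mask] = 0.0
```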

Citation

@inproceedings{girish2024queen,
    title={{QUEEN}: {QU}antized Efficient {EN}coding for Streaming Free-viewpoint Videos},
    author={Sharath Girish and Tianye Li and Amrita Mazumdar and Abhinav Shrivastava and David Luebke and Shalini De Mello},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=7xhwE7VH4S}
}