Toronto AI Lab

ReMatching Dynamic Reconstruction Flow

Sara Oblak1, Despoina Paschalidou1, Sanja Fidler1,2,3, Matan Atzmon1

1NVIDIA, 2University of Toronto, 3Vector Institute

Reconstructing dynamic scenes from video inputs using models like Gaussian Splatting or NeRF is challenging due to the sparse nature of the inputs, in both time and space. To address this issue, this work introduces the ReMatching framework, a novel approach for designing and integrating deformation priors into dynamic reconstruction models. The ReMatching framework has three core goals: i) suggest an optimization objective that improves generalization to unseen views and timestamps without compromising solutions' fidelity; ii) ensure applicability to various model functions, including time-dependent rendered pixels, particles representing scene geometry (as in Gaussian Splatting), or a NeRF-based density field; and iii) provide a flexible design of deformation prior classes, allowing more complex classes to be built from simpler ones.
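
To make the matching idea concrete, below is a minimal, hypothetical sketch of how a deformation prior could be turned into a training loss, using the rigid prior class as an example: the model's velocity field, sampled at the Gaussian centers, is projected onto the closest rigid field, and the residual is penalized. The function names (`skew`, `rematching_rigid_loss`) and all implementation details are our illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch (not the paper's code): distance of a sampled velocity
# field to the rigid prior class v(x) = w × x + t, used as a training loss.
import torch


def skew(x: torch.Tensor) -> torch.Tensor:
    """Map points (N, 3) to skew-symmetric matrices (N, 3, 3), so that
    skew(x) @ w equals the cross product x × w."""
    zero = torch.zeros_like(x[:, 0])
    return torch.stack([
        torch.stack([zero, -x[:, 2], x[:, 1]], dim=-1),
        torch.stack([x[:, 2], zero, -x[:, 0]], dim=-1),
        torch.stack([-x[:, 1], x[:, 0], zero], dim=-1),
    ], dim=-2)


def rematching_rigid_loss(points: torch.Tensor,
                          velocities: torch.Tensor) -> torch.Tensor:
    """Fit the closest rigid velocity field v(x) = w × x + t to `velocities`
    sampled at `points` (both (N, 3)), and penalize the residual."""
    n = points.shape[0]
    # v(x) = w × x + t = -skew(x) @ w + t is linear in theta = (w, t).
    A = torch.cat([
        -skew(points),                                          # acts on w
        torch.eye(3, dtype=points.dtype,
                  device=points.device).expand(n, 3, 3),        # acts on t
    ], dim=-1).reshape(3 * n, 6)
    b = velocities.reshape(3 * n)
    # Least-squares fit of the prior-class field; detached so the loss only
    # measures the distance of the model's field to the prior class.
    theta = torch.linalg.lstsq(
        A.detach(), b.detach().unsqueeze(-1)).solution.squeeze(-1)
    return ((A @ theta - b) ** 2).mean()
```

In the spirit of goal iii), richer classes can be composed from such building blocks, e.g., a piecewise-rigid prior obtained by partitioning the points into parts and summing the per-part losses.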


Paper

Sara Oblak, Despoina Paschalidou, Sanja Fidler, Matan Atzmon

ReMatching Dynamic Reconstruction Flow

[preprint] [bibtex] [code: coming soon]


Key Results

We compared our framework against recent state-of-the-art dynamic models, including Deformable 3D Gaussians (D3G) (Yang et al., 2023), 3D Geometry-aware Deformable Gaussians (DAG) (Lu et al., 2024), Neural Parametric Gaussians (NPG) (Das et al., 2024), and K-Planes (Fridovich-Keil et al., 2023). Notably, some of these baselines incorporate prior regularization losses, such as local rigidity and smoothness, into their optimization procedure. Our results demonstrate two key improvements: i) generating plausible reconstructions that avoid unrealistic distortions; and ii) reducing rendering artifacts of extraneous fragments, especially in moving parts.
Our reconstruction results in comparison to D3G (Yang et al., 2023) on scenes from the D-NeRF synthetic dataset.



Comparison of the movement of Gaussian centers on a scene from the D-NeRF synthetic dataset. We color the centers in the shovel of the Lego digger based on their distance from the base plate in the starting frame.


Comparison of reconstruction results on a scene from the HyperNeRF real-world dataset. Note that the video input was captured with a phone camera.


Quantitative comparison of multiple baselines and two variants of our model (with divergence-free and rigid priors) on the D-NeRF dataset.
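
For intuition on the divergence-free variant referenced above, here is a minimal, hypothetical sketch of one way such a prior could be enforced, assuming a differentiable velocity-field function `v_fn` mapping points (N, 3) to velocities (N, 3); the name `divergence_free_loss` and all details are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): penalize the divergence of a
# velocity field, pulling it toward the divergence-free prior class.
import torch


def divergence_free_loss(v_fn, points: torch.Tensor) -> torch.Tensor:
    """Penalize div v = dv_x/dx + dv_y/dy + dv_z/dz at sample points (N, 3).

    `v_fn` is any differentiable map from (N, 3) points to (N, 3) velocities,
    e.g. a small MLP; gradients reach its parameters via `create_graph`.
    """
    points = points.detach().requires_grad_(True)
    v = v_fn(points)
    div = torch.zeros_like(points[:, 0])
    for i in range(3):
        # d v_i / d x_i via autograd; keep the graph so the penalty is trainable.
        grad_i = torch.autograd.grad(v[:, i].sum(), points,
                                     create_graph=True)[0]
        div = div + grad_i[:, i]
    return (div ** 2).mean()
```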


Comparison of our adaptive-prior learned decompositions with the Segment Anything Model (SAM) (Kirillov et al., 2023) reveals significant differences. SAM frequently over-segments scenes, relying heavily on color variations, and often fails to accurately capture the underlying geometric structure of scene components. Additionally, SAM tends to merge parts that should move independently, such as the arms and body in human scenes. Accurately separating independently moving parts is particularly critical for downstream applications like editing and animation.