AI 3D Selfie: Real-Time Single-Image 3D Face Reconstruction for Light-Field Displays

We present AI 3D Selfie, a system that enables users to capture their facial images using a single 2D camera and visualize them in 3D in real time. Our method performs real-time single-shot 3D reconstruction by employing a triplane-based NeRF encoder and a fast volumetric rendering algorithm to display the results on a light field display.
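
To make the triplane idea concrete, the sketch below shows the standard triplane lookup used by EG3D-style encoders: a 3D query point is projected onto three axis-aligned feature planes, the bilinearly sampled features are aggregated, and a small MLP decodes color and density. This is a minimal illustration of the general technique under assumed shapes, not the AI 3D Selfie implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Bilinearly sample features for 3D points from three axis-aligned planes.

    planes: (3, C, H, W) feature planes for the XY, XZ, and YZ planes (assumed layout).
    points: (N, 3) query coordinates, assumed normalized to [-1, 1].
    Returns (N, C) features summed over the three planes.
    """
    # Project each 3D point onto the three canonical planes.
    coords = torch.stack([points[:, [0, 1]],   # XY plane
                          points[:, [0, 2]],   # XZ plane
                          points[:, [1, 2]]])  # YZ plane -> (3, N, 2)
    grid = coords.unsqueeze(1)                       # (3, 1, N, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode='bilinear',
                          align_corners=False)       # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).t()           # (N, C)

# Tiny decoder mapping aggregated plane features to RGB + density (illustrative).
decoder = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 4))  # (r, g, b, sigma)

planes = torch.randn(3, 32, 128, 128)   # stand-in for the encoder's output
pts = torch.rand(1024, 3) * 2 - 1       # random query points in [-1, 1]^3
rgb_sigma = decoder(sample_triplane(planes, pts))
print(rgb_sigma.shape)  # torch.Size([1024, 4])
```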

Play4D: Accelerated and Interactive Free-viewpoint Video Streaming for Virtual Reality and Light Field Displays

We present Play4D, an accelerated and interactive free-viewpoint video (FVV) streaming pipeline for next-generation light field (LF) and virtual reality (VR) displays. Play4D integrates 4D Gaussian Splatting (4DGS) reconstruction, compression and streaming with an efficient radiance field rendering algorithm to enable live 6-DoF user interaction with photorealistic dynamic scenes.
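
As a rough illustration of how a 4DGS-style representation exposes time as a query parameter, the sketch below stores per-Gaussian polynomial motion coefficients and evaluates centers at an arbitrary timestamp before splatting. The polynomial motion model and all names are illustrative assumptions; Play4D's actual deformation, compression, and streaming scheme is not reproduced here.

```python
import torch

class TimeVaryingGaussians(torch.nn.Module):
    """Toy dynamic scene: each Gaussian's center moves along a cubic polynomial
    in time. Rotation, scale, and opacity are kept static for brevity."""

    def __init__(self, num_gaussians, poly_degree=3):
        super().__init__()
        self.base_xyz = torch.nn.Parameter(torch.randn(num_gaussians, 3))
        # Motion coefficients: one 3-vector per polynomial degree per Gaussian.
        self.motion = torch.nn.Parameter(
            torch.zeros(num_gaussians, poly_degree, 3))

    def positions_at(self, t):
        """Evaluate Gaussian centers at normalized time t in [0, 1]."""
        degrees = torch.arange(1, self.motion.shape[1] + 1,
                               dtype=torch.float32)
        basis = t ** degrees                        # (poly_degree,)
        offset = (self.motion * basis[None, :, None]).sum(dim=1)
        return self.base_xyz + offset               # (N, 3)

scene = TimeVaryingGaussians(num_gaussians=10000)
xyz_t = scene.positions_at(t=0.42)  # centers to hand to a splatting rasterizer
print(xyz_t.shape)  # torch.Size([10000, 3])
```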

Real-time 3D Visualization of Radiance Fields on Light Field Displays

Radiance fields have revolutionized photo-realistic 3D scene visualization by enabling high-fidelity reconstruction of complex environments, making them an ideal match for light field displays. However, integrating these technologies presents significant computational challenges, as light field displays require multiple high-resolution renderings from slightly shifted viewpoints, while radiance fields rely on computationally intensive volume rendering. In this paper, we propose a unified and efficient framework for real-time radiance field rendering on light field displays.
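
For intuition about the view requirement, the snippet below generates the grid of slightly shifted view matrices a light field display typically consumes, using a simple camera sweep along a horizontal baseline that converges on the scene. The 45-view count, baseline, and look-at convention are assumptions chosen for illustration (loosely following lenticular displays), not the paper's configuration.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Right-handed look-at view matrix (world -> camera)."""
    f = target - eye; f /= np.linalg.norm(f)   # forward
    r = np.cross(f, up); r /= np.linalg.norm(r)  # right
    u = np.cross(r, f)                           # recomputed up
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = r, u, -f
    view[:3, 3] = -view[:3, :3] @ eye
    return view

def light_field_views(num_views=45, baseline=0.3, distance=2.0):
    """Cameras swept along a horizontal baseline, all converging on the origin."""
    xs = np.linspace(-baseline / 2, baseline / 2, num_views)
    target = np.zeros(3)
    return [look_at(np.array([x, 0.0, distance]), target) for x in xs]

views = light_field_views()
print(len(views), views[0].shape)  # 45 (4, 4)
```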

Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation

Synthetic video generation is progressing rapidly: the latest models can produce highly realistic, high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have recently been proposed, they often exhibit poor generalization, which limits their applicability in real-world scenarios. Our key insight to overcome this issue is to guide the detector towards seeing what really matters.

GAIA: Generative Animatable Interactive Avatars with Expression-conditioned Gaussians

3D generative models of faces trained on in-the-wild image collections have improved greatly in recent years, offering better visual fidelity and view consistency. Making such generative models animatable is a hard yet rewarding task, with applications in virtual AI agents, character animation, and telepresence.
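
A minimal sketch of what "expression-conditioned Gaussians" can mean in practice: a small MLP maps a per-frame expression code to offsets on the attributes of a neutral set of Gaussians. All dimensions, the offset parameterization, and the module itself are assumptions for illustration; GAIA's actual conditioning architecture is not shown.

```python
import torch

class ExpressionConditionedOffsets(torch.nn.Module):
    """Map an expression code to per-Gaussian position/opacity offsets
    applied on top of a neutral (canonical) set of Gaussians."""

    def __init__(self, num_gaussians, expr_dim=64, hidden=256):
        super().__init__()
        self.num_gaussians = num_gaussians
        self.net = torch.nn.Sequential(
            torch.nn.Linear(expr_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, num_gaussians * 4))  # dx, dy, dz, d_opacity

    def forward(self, neutral_xyz, neutral_opacity, expr_code):
        out = self.net(expr_code).view(self.num_gaussians, 4)
        xyz = neutral_xyz + 0.01 * out[:, :3]  # small positional offsets
        opacity = torch.sigmoid(
            torch.logit(neutral_opacity.clamp(1e-4, 1 - 1e-4)) + out[:, 3:])
        return xyz, opacity

model = ExpressionConditionedOffsets(num_gaussians=5000)
xyz, opacity = model(torch.randn(5000, 3),   # neutral centers
                     torch.rand(5000, 1),    # neutral opacities
                     torch.randn(64))        # per-frame expression code
print(xyz.shape, opacity.shape)  # torch.Size([5000, 3]) torch.Size([5000, 1])
```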

Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user’s appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but they fail to faithfully preserve the user’s per-frame appearance (e.g., instantaneous facial expressions and lighting).
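
As a sketch of the fusion idea, the snippet below blends a temporally stable reference triplane with the current frame's triplane using a predicted per-location weight map, so that stable identity comes from the reference while instantaneous expression and lighting come from the live frame. The weight predictor and all shapes are illustrative assumptions, not the paper's fusion network.

```python
import torch

class TriplaneFuser(torch.nn.Module):
    """Predict a per-pixel blend weight from the concatenated planes and
    fuse the reference and per-frame triplanes with it."""

    def __init__(self, channels=32):
        super().__init__()
        self.weight_net = torch.nn.Sequential(
            torch.nn.Conv2d(2 * channels, 64, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(64, 1, 3, padding=1), torch.nn.Sigmoid())

    def forward(self, ref_planes, frame_planes):
        # planes: (3, C, H, W); the 3 planes are treated as a batch dimension.
        w = self.weight_net(torch.cat([ref_planes, frame_planes], dim=1))
        return w * frame_planes + (1 - w) * ref_planes  # (3, C, H, W)

fuser = TriplaneFuser()
ref = torch.randn(3, 32, 128, 128)   # triplane built from a reference image
cur = torch.randn(3, 32, 128, 128)   # triplane from the current video frame
fused = fuser(ref, cur)
print(fused.shape)  # torch.Size([3, 32, 128, 128])
```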

Identity-Motion Trade-offs in Text-to-Video Generation

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity, and we examine the challenges that arise when these representations interact.
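
To ground the analysis, the snippet below shows one common way to capture self-attention query features during a denoising pass: registering forward hooks on the query projections. The module-name filter (`attn1.to_q`) follows diffusers-style UNet naming, but exact names vary by model; treat this as an assumed convention rather than the paper's instrumentation.

```python
import torch

def register_q_hooks(model, captured):
    """Capture outputs of self-attention query projections.

    Assumes diffusers-style naming where self-attention blocks are called
    'attn1' and their query projection 'to_q'; adjust for other models.
    """
    handles = []
    for name, module in model.named_modules():
        if name.endswith('attn1.to_q'):
            def hook(mod, inputs, output, layer=name):
                captured.setdefault(layer, []).append(output.detach())
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done

# Usage sketch (model loading omitted):
#   captured = {}
#   handles = register_q_hooks(unet, captured)
#   unet(latents, timestep, encoder_hidden_states=text_emb)
#   # captured now maps layer names to the Q tensors seen per forward pass
```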

BLADE: Single-view Body Mesh Estimation through Accurate Depth Estimation

Single-image human mesh recovery is a challenging task due to the ill-posed nature of simultaneously estimating body shape, pose, and camera parameters. Existing estimators work well on images taken from afar, but they break down as the person moves close to the camera. Moreover, current methods fail to achieve accurate 3D pose and 2D alignment at the same time. The error mainly stems from inaccurate perspective projection that is heuristically derived from orthographic parameters.
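
The projection issue is easy to quantify: weak perspective (orthographic projection plus scale) assumes every body point sits at the same depth, which breaks down at close range. The NumPy comparison below, with made-up numbers, shows the 2D error growing as the subject approaches the camera; it illustrates the failure mode the paper targets, not BLADE's method.

```python
import numpy as np

def perspective(points, f):
    """Full perspective projection: x' = f * X / Z."""
    return f * points[:, :2] / points[:, 2:3]

def weak_perspective(points, f):
    """Weak perspective: one shared depth (the centroid's) for all points."""
    z_mean = points[:, 2].mean()
    return f * points[:, :2] / z_mean

f = 1000.0  # focal length in pixels (assumed)
# A 30 cm deep "torso": two points offset 15 cm toward/away from the camera.
for dist in [5.0, 1.0, 0.5]:  # camera-to-subject distance in meters
    pts = np.array([[0.2, 0.0, dist - 0.15],
                    [0.2, 0.0, dist + 0.15]])
    err = np.abs(perspective(pts, f) - weak_perspective(pts, f)).max()
    print(f"distance {dist:4.1f} m -> max 2D error {err:6.1f} px")
# The error grows sharply at close range, where orthographic assumptions fail.
```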

SimAvatar: Simulation-Ready Clothed Gaussian Avatars from Text

We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing, and the human body with a single unified geometry or produce hair and garments that are not easily adaptable to existing simulation pipelines.

QUEEN: QUantized Efficient ENcoding for Streaming Free-viewpoint Videos

Online free-viewpoint video (FVV) streaming is a challenging and relatively under-explored problem. It requires incremental on-the-fly updates to a volumetric representation, fast training and rendering to satisfy real-time constraints, and a small memory footprint for efficient transmission. If achieved, it can enhance user experience by enabling novel applications, e.g., 3D video conferencing and live volumetric video broadcast, among others. In this work, we propose a novel framework for QUantized and Efficient ENcoding (QUEEN) for streaming FVV using 3D Gaussian Splatting (3D-GS).
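
As a rough sketch of the quantized-residual idea behind streaming Gaussian attributes, the snippet below encodes each frame as a uniformly quantized residual against the previous frame's attributes, keeping per-frame updates small and cheap to transmit. The uniform 8-bit quantizer and step size are assumptions for illustration; QUEEN's learned quantization is more sophisticated.

```python
import numpy as np

def encode_residual(prev_attr, curr_attr, step=1e-3):
    """Quantize the per-frame attribute residual to int8 symbols."""
    residual = curr_attr - prev_attr
    symbols = np.clip(np.round(residual / step), -127, 127).astype(np.int8)
    return symbols  # entropy-code these for transmission

def decode_residual(prev_attr, symbols, step=1e-3):
    """Reconstruct current attributes from the previous frame plus symbols."""
    return prev_attr + symbols.astype(np.float32) * step

prev_xyz = np.random.randn(10000, 3).astype(np.float32)   # frame t-1 centers
curr_xyz = prev_xyz + 0.002 * np.random.randn(10000, 3).astype(np.float32)

syms = encode_residual(prev_xyz, curr_xyz)
recon = decode_residual(prev_xyz, syms)
print(syms.nbytes, "bytes vs", curr_xyz.nbytes, "raw;",
      "max err", np.abs(recon - curr_xyz).max())  # error bounded by step/2 unless clipped
```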