Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
Method overview. Given an input image or video, LuxDiT predicts an environment map as two tone-mapped representations, guided by a light directional map. Environment maps are encoded with a VAE, and the resulting latents are concatenated and jointly processed with visual input by a DiT. The outputs are decoded and fused by a lightweight MLP to reconstruct the final HDR panorama.
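To make this pipeline concrete, the sketch below traces the inference path in PyTorch-style pseudocode. The module names (`vae`, `dit`, `fusion_mlp`), the exposure-style interpretation of the two tone mappings, and the tensor layout are illustrative assumptions, not the released implementation.

```python
import torch

def predict_hdr_env_map(visual_latents, dir_map_latents, vae, dit, fusion_mlp):
    """Illustrative sketch of the inference path described above.

    `vae`, `dit`, and `fusion_mlp` are placeholders for the pretrained video
    VAE, the diffusion transformer, and the lightweight fusion MLP.
    """
    # The DiT jointly processes the visual-input latents, the light
    # directional map, and the latents of the two tone-mapped
    # environment-map representations produced by the diffusion process.
    tone_a_latent, tone_b_latent = dit.sample(
        condition=torch.cat([visual_latents, dir_map_latents], dim=1)
    )

    # Decode each tone-mapped panorama back to pixel space.
    tone_a = vae.decode(tone_a_latent)  # e.g. a low-exposure tone mapping (assumption)
    tone_b = vae.decode(tone_b_latent)  # e.g. a high-exposure tone mapping (assumption)

    # A lightweight MLP fuses the two decoded panoramas into the final
    # HDR environment map.
    return fusion_mlp(torch.cat([tone_a, tone_b], dim=1))
```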
Stage I: Synthetic supervised training. Unlike prior work, we train our model on a large-scale synthetic rendering dataset in which randomly created 3D scenes are rendered under varying lighting conditions. This focuses the model on the rendering cues in the input images rather than on harmonized outpainting.
Synthetic data samples.
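As a minimal illustration of what one Stage I training pair could look like (the renderer interface and field names below are hypothetical), the supervision target is the exact HDR environment map used to light the rendered scene:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticSample:
    frames: np.ndarray   # rendered image or video frames, shape (T, H, W, 3)
    env_map: np.ndarray  # ground-truth HDR panorama that lit the scene, shape (He, We, 3)

def make_sample(render_fn, scene, hdr_env: np.ndarray) -> SyntheticSample:
    """Render a randomly assembled 3D scene under a sampled HDR environment map.

    `render_fn` and `scene` stand in for whatever renderer and assets are used;
    the key point is that the target is the exact lighting that produced the
    frames, not a harmonized or outpainted panorama.
    """
    frames = render_fn(scene, hdr_env)
    return SyntheticSample(frames=frames, env_map=hdr_env)
```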
Stage II: Semantic adaptation. After base training, we fine-tune the model to improve the semantic alignment between the input and the predicted environment map. This stage applies LoRA fine-tuning on perspective projections of real-world HDR panoramas and panoramic videos.
Real data samples.
With this training strategy, LuxDiT after Stage I accurately reconstructs highlights from the input (LoRA scale → 0.0);
Stage II fine-tuning then allows LuxDiT to generate environment maps that are semantically aligned with the input (LoRA scale → 1.0).
Without using synthetic data, the model fails to reconstruct highlights and generates less realistic environment maps.
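The LoRA scale used above can be read as an interpolation between the Stage I base model and the Stage II adaptation. A minimal sketch of this mechanism is shown below; the rank, initialization, and choice of wrapped layers are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper (illustrative only).

    With scale = 0 the layer behaves exactly like the frozen Stage I weights;
    with scale = 1 the full Stage II low-rank update is applied. Intermediate
    values interpolate between the two behaviors.
    """
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base DiT weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrapping one projection layer of a transformer block.
layer = LoRALinear(nn.Linear(1024, 1024), rank=16, scale=1.0)
y = layer(torch.randn(2, 1024))
```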
We extensively evaluate LuxDiT on a variety of test datasets. Predicted panoramas and three-sphere renderings are shown for visual evaluation.
@misc{liang2025luxdit,
  title={LuxDiT: Lighting Estimation with Video Diffusion Transformer},
  author={Ruofan Liang and Kai He and Zan Gojcic and Igor Gilitschenski and Sanja Fidler and Nandita Vijaykumar and Zian Wang},
  year={2025},
  eprint={2509.03680},
  archivePrefix={arXiv},
  primaryClass={cs.GR}
}