
LuxDiT: Lighting Estimation with Video Diffusion Transformer

Ruofan Liang1,2,3       Kai He1,2,3       Zan Gojcic1       Igor Gilitschenski2,3      
Sanja Fidler1,2,3       Nandita Vijaykumar2,3 †       Zian Wang1,2,3 †
1NVIDIA       2University of Toronto       3Vector Institute       †Joint advising

LuxDiT is a generative lighting estimation model that predicts high-quality HDR environment maps from visual input. It produces accurate lighting while preserving scene semantics, enabling realistic virtual object insertion under diverse conditions.

Abstract


Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.


Method overview. Given an input image or video, LuxDiT predicts an environment map as two tone-mapped representations, guided by a light directional map. Environment maps are encoded with a VAE, and the resulting latents are concatenated and jointly processed with visual input by a DiT. The outputs are decoded and fused by a lightweight MLP to reconstruct the final HDR panorama.
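The page does not spell out which two tone-mapped representations are used or how the fusion network is built, so the following is only a minimal PyTorch sketch of the final reconstruction step, assuming the two decoded maps are a standard LDR tonemap and a log-scaled encoding, and using a hypothetical `FusionMLP` as a stand-in for the lightweight fusion MLP.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Hypothetical per-pixel MLP that fuses two decoded tone-mapped panoramas into HDR radiance."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, ldr: torch.Tensor, log_map: torch.Tensor) -> torch.Tensor:
        # ldr, log_map: (B, 3, H, W) decoded tone-mapped panoramas, assumed to be in [0, 1]
        x = torch.cat([ldr, log_map], dim=1).permute(0, 2, 3, 1)   # (B, H, W, 6)
        log_radiance = self.net(x).permute(0, 3, 1, 2)             # (B, 3, H, W)
        # Map back from log space and clamp to valid (non-negative) radiance.
        return torch.expm1(log_radiance).clamp(min=0.0)

# Usage (hypothetical): decode the two latents with the VAE, then fuse per pixel.
# hdr_pano = FusionMLP()(vae.decode(latent_ldr), vae.decode(latent_log))
```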

Training Strategy


  • Stage I: Synthetic supervised training. Unlike prior work, we train our model on a large-scale synthetic rendering dataset, rendering randomly generated 3D scenes under a wide range of lighting conditions. This focuses the model on the rendering cues in the input images rather than on harmonized outpainting.


    Synthetic data samples.

  • Stage II: Semantic adaptation. After base training, we fine-tune the model to improve semantic alignment between the input and the predicted environment map. This stage applies LoRA fine-tuning on perspective projections extracted from real-world HDR panoramas and panoramic videos (a sketch of this projection step follows the list below).

    Real data samples.

  • With this training strategy, the Stage I model accurately reconstructs highlights from the input (LoRA scale → 0.0), and Stage II fine-tuning then allows LuxDiT to generate environment maps that are semantically aligned with the input (LoRA scale → 1.0); a sketch of this LoRA-scale blending also follows the list.

    Without using synthetic data, the model fails to reconstruct highlights and generates less realistic environment maps.

    Columns: Input | W/o Syn. Data | Ours with LoRA
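For Stage II, perspective crops must be sampled from equirectangular HDR panoramas. The exact data pipeline is not described on this page, so below is a minimal NumPy sketch of the standard projection, assuming a pinhole camera oriented by yaw/pitch and using nearest-neighbor lookup for brevity.

```python
import numpy as np

def perspective_from_pano(pano: np.ndarray, fov_deg: float = 60.0,
                          yaw: float = 0.0, pitch: float = 0.0,
                          out_hw: tuple = (512, 512)) -> np.ndarray:
    """Sample a pinhole-camera crop from an equirectangular panorama of shape (H_pano, W_pano, 3)."""
    H, W = out_hw
    f = 0.5 * W / np.tan(0.5 * np.radians(fov_deg))          # focal length in pixels
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Camera-space ray directions (x right, y down, z forward).
    dirs = np.stack([xs - 0.5 * W, ys - 0.5 * H,
                     np.full_like(xs, f, dtype=np.float64)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays by pitch (around x), then yaw (around y).
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T
    # Convert ray directions to equirectangular (longitude, latitude) pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    u = ((lon / (2 * np.pi) + 0.5) * pw).astype(int) % pw
    v = np.clip(((lat / np.pi + 0.5) * ph).astype(int), 0, ph - 1)
    return pano[v, u]
```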
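The LoRA-scale behavior described above amounts to scaling the low-rank update added to each adapted weight. A minimal sketch for a single linear layer, with hypothetical base weight `W` and LoRA factors `A`, `B`:

```python
import torch

def lora_forward(x, W, A, B, scale: float):
    """Linear layer with a scaled low-rank update: y = x (W + scale * B A)^T.

    W: (out, in) frozen base weight; A: (r, in) and B: (out, r) LoRA factors.
    scale -> 0.0 recovers the Stage I (synthetic-trained) behavior;
    scale -> 1.0 applies the full Stage II semantic adaptation.
    """
    return x @ (W + scale * (B @ A)).T

# Example with hypothetical shapes.
x = torch.randn(4, 128)
W = torch.randn(256, 128)
A, B = torch.randn(8, 128), torch.randn(256, 8)
y_stage1 = lora_forward(x, W, A, B, scale=0.0)   # highlight-accurate base model
y_stage2 = lora_forward(x, W, A, B, scale=1.0)   # semantically adapted model
```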

Results


We extensively evaluate LuxDiT on a variety of test datasets. Predicted panoramas and three-sphere renderings are shown for visual evaluation.

Image Lighting Estimation



Columns: Input | StyleLight | DiffusionLight | LuxDiT | Reference

Video Lighting Estimation



Columns: Input Video | DiffusionLight | Ours (image) | Ours (video) | Reference

Application: Virtual Object Insertion

  • Columns: StyleLight | DiffusionLight | DiPIR | LuxDiT | Reference
  • Columns: H-G et al. | NLFE | DiffusionLight | DiPIR | LuxDiT

Paper



LuxDiT: Lighting Estimation with Video Diffusion Transformer

Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang

arXiv
Paper

BibTeX


@misc{liang2025luxdit,
    title={LuxDiT: Lighting Estimation with Video Diffusion Transformer},
    author={Ruofan Liang and Kai He and Zan Gojcic and Igor Gilitschenski and Sanja Fidler 
        and Nandita Vijaykumar and Zian Wang},
    year={2025},
    eprint={2509.03680},
    archivePrefix={arXiv},
    primaryClass={cs.GR}
}