Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
Method overview. Given an input image or video, LuxDiT predicts an environment map as two tone-mapped representations, guided by a light directional map. Environment maps are encoded with a VAE, and the resulting latents are concatenated and jointly processed with visual input by a DiT. The outputs are decoded and fused by a lightweight MLP to reconstruct the final HDR panorama.
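To make this pipeline concrete, the sketch below traces the inference path in PyTorch-style pseudocode. The module names (`vae`, `dit`, `fusion_mlp`), the exposure-style interpretation of the two tone mappings, and the tensor layout are illustrative assumptions, not the released implementation.

```python
import torch

def predict_hdr_env_map(visual_latents, dir_map_latents, vae, dit, fusion_mlp):
    """Illustrative sketch of the inference path described above.

    `vae`, `dit`, and `fusion_mlp` are placeholders for the pretrained video
    VAE, the diffusion transformer, and the lightweight fusion MLP.
    """
    # The DiT jointly processes the visual-input latents, the light
    # directional map, and the latents of the two tone-mapped
    # environment-map representations produced by the diffusion process.
    tone_a_latent, tone_b_latent = dit.sample(
        condition=torch.cat([visual_latents, dir_map_latents], dim=1)
    )

    # Decode each tone-mapped panorama back to pixel space.
    tone_a = vae.decode(tone_a_latent)  # e.g. a low-exposure tone mapping (assumption)
    tone_b = vae.decode(tone_b_latent)  # e.g. a high-exposure tone mapping (assumption)

    # A lightweight MLP fuses the two decoded panoramas into the final
    # HDR environment map.
    return fusion_mlp(torch.cat([tone_a, tone_b], dim=1))
```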
Stage I: Synthetic supervised training. Unlike prior work, we train our model on a large-scale synthetic rendering dataset in which randomly created 3D scenes are rendered under varying lighting conditions. This focuses the model on the rendering cues in the input images rather than on harmonized outpainting.
Synthetic data samples.
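As a minimal illustration of what one Stage I training pair could look like (the renderer interface and field names below are hypothetical), the supervision target is the exact HDR environment map used to light the rendered scene:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticSample:
    frames: np.ndarray   # rendered image or video frames, shape (T, H, W, 3)
    env_map: np.ndarray  # ground-truth HDR panorama that lit the scene, shape (He, We, 3)

def make_sample(render_fn, scene, hdr_env: np.ndarray) -> SyntheticSample:
    """Render a randomly assembled 3D scene under a sampled HDR environment map.

    `render_fn` and `scene` stand in for whatever renderer and assets are used;
    the key point is that the target is the exact lighting that produced the
    frames, not a harmonized or outpainted panorama.
    """
    frames = render_fn(scene, hdr_env)
    return SyntheticSample(frames=frames, env_map=hdr_env)
```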
Stage II: Semantic adaptation. After base training, we fine-tune the model to improve the semantic alignment between the input and the predicted environment map. This stage applies LoRA fine-tuning on perspective projections of real-world HDR panoramas and panoramic videos.
Real data samples.
With this training strategy, LuxDiT after Stage I accurately reconstructs highlights from the input (LoRA scale → 0.0);
Stage II fine-tuning then allows LuxDiT to generate environment maps that are semantically aligned with the input (LoRA scale → 1.0).
Without using synthetic data, the model fails to reconstruct highlights and generates less realistic environment maps.
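The LoRA scale used above can be read as an interpolation between the Stage I base model and the Stage II adaptation. A minimal sketch of this mechanism is shown below; the rank, initialization, and choice of wrapped layers are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper (illustrative only).

    With scale = 0 the layer behaves exactly like the frozen Stage I weights;
    with scale = 1 the full Stage II low-rank update is applied. Intermediate
    values interpolate between the two behaviors.
    """
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base DiT weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrapping one projection layer of a transformer block.
layer = LoRALinear(nn.Linear(1024, 1024), rank=16, scale=1.0)
y = layer(torch.randn(2, 1024))
```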
We extensively evaluate LuxDiT on a variety of test datasets. Predicted panoramas and three-sphere renderings are shown for visual evaluation.
@misc{liang2025luxdit,
  title={LuxDiT: Lighting Estimation with Video Diffusion Transformer},
  author={Ruofan Liang and Kai He and Zan Gojcic and Igor Gilitschenski and Sanja Fidler and Nandita Vijaykumar and Zian Wang},
  year={2025},
  eprint={2509.03680},
  archivePrefix={arXiv},
  primaryClass={cs.GR}
}