Abstract

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.

More Results

Scenethesis generates high-fidelity 3D scenes from user prompts, encompassing both indoor and outdoor environments.

A private billiard room for relaxation

A reading corner

A beautiful surf-fishing beach

A warehouse space

Different results given the same text prompt

A children's playroom

A car in a garage

A home gym

Pipeline

Given a user-provided text prompt, Scenethesis begins with a coarse scene planning stage, in which a large language model (LLM) generates a list of objects commonly found in the specified scene and selects one object as the primary object. Scenethesis then designs a coarse layout and produces a detailed scene prompt.

In the layout visual refinement stage, a vision module first generates a detailed image, which serves as guidance for layout optimization. The vision module then leverages off-the-shelf vision models to extract a scene graph with predicted 5DoF poses for the initial scene setup and retrieves relevant 3D assets and an environment map.

Next, a novel optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability.

Finally, a scene judge module verifies spatial coherence. If the scene is unsatisfactory, the process repeats; once the scene passes, it is returned to the user.
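The control flow above can be sketched as a simple agentic loop. All function names below (`plan_scene`, `refine_layout`, `optimize_layout`, `judge_scene`) are hypothetical stand-ins for the four modules, stubbed so the loop is runnable; they do not reflect the actual implementation.

```python
def plan_scene(prompt):
    # LLM module (stub): object list, primary object, coarse layout.
    return {"objects": ["pool table", "cue rack"], "primary": "pool table"}

def refine_layout(plan):
    # Vision module (stub): image guidance and a scene graph with 5DoF poses.
    return {"graph": plan["objects"],
            "poses": {o: (0.0, 0.0, 0.0, 0.0, 1.0) for o in plan["objects"]}}

def optimize_layout(layout):
    # Optimization module (stub): enforce pose alignment and plausibility.
    layout["optimized"] = True
    return layout

def judge_scene(layout):
    # Judge module (stub): accept once the layout has been optimized.
    return layout.get("optimized", False)

def scenethesis(prompt, max_iters=3):
    plan = plan_scene(prompt)
    layout = None
    for _ in range(max_iters):
        layout = optimize_layout(refine_layout(plan))
        if judge_scene(layout):  # spatial coherence check
            break                # satisfactory: return to the user
    return layout

scene = scenethesis("a private billiard room")
```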

Scenethesis Pipeline

LLM Module: Coarse Scene Planning

Based on the user prompt, the LLM first generates a list of common object categories and an upsampled prompt describing a coarse spatial hierarchy.
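As a concrete illustration, the coarse plan might be represented as structured output like the JSON below. The field names and values are illustrative assumptions, not the paper's actual schema.

```python
import json

# Hypothetical shape of the LLM module's coarse plan.
plan = json.loads("""
{
  "scene": "a home gym",
  "primary_object": "treadmill",
  "objects": ["treadmill", "dumbbell rack", "yoga mat", "wall mirror"],
  "upsampled_prompt": "A bright home gym with a treadmill facing a wall mirror and a dumbbell rack along the side wall."
}
""")
```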

Vision Module: Layout Visual Refinement

Given the upsampled prompt and object categories, the vision module leverages vision foundation models to generate a detailed image as guidance, a scene graph with object poses, and assets including 3D objects and an environment map.
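A minimal sketch of how such a scene graph might be represented, assuming the 5DoF pose is parameterized as position (x, y, z), yaw, and scale; this parameterization and the `SceneNode` structure are assumptions for illustration, and inter-object relations are encoded here through a support hierarchy.

```python
from dataclasses import dataclass

@dataclass
class SceneNode:
    category: str
    pose: tuple                  # (x, y, z, yaw, scale) — assumed parameterization
    supported_by: str = "floor"  # parent surface in the support hierarchy

scene_graph = [
    SceneNode("desk", (1.0, 0.0, 0.0, 0.0, 1.0)),
    SceneNode("lamp", (1.1, 0.75, 0.1, 0.0, 0.4), supported_by="desk"),
]

# Inter-object relations fall out of the support hierarchy:
children_of_desk = [n.category for n in scene_graph if n.supported_by == "desk"]
```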

Optimization Module: Physics-aware Optimization

Directly using the planned scene graph layout often leads to physically implausible scenes, with estimated 5DoF poses misaligned with the image guidance due to discrepancies between retrieved objects and their visual counterparts. To resolve this, Scenethesis employs a layout optimization process that adjusts object dimensions, positions, and orientations, ensuring spatial alignment and physical plausibility.
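The tension between the two objectives can be shown in a toy 1D setting: each object is pulled toward its image-guided position (alignment) while overlapping pairs are pushed apart (penetration). The update rule, step sizes, and term weights below are illustrative assumptions, not the paper's actual optimizer, which operates on full 5DoF poses.

```python
def overlap(a, b):
    # Penetration depth between two 1D boxes given as (center, half_width).
    return max(0.0, (a[1] + b[1]) - abs(a[0] - b[0]))

def resolve(a, b, guidance_a, guidance_b, steps=100, lr=0.1):
    a, b = list(a), list(b)
    for _ in range(steps):
        pen = overlap(a, b)
        if pen > 0:
            # Push the objects apart along their separating direction.
            direction = 1.0 if a[0] >= b[0] else -1.0
            a[0] += direction * lr * pen / 2
            b[0] -= direction * lr * pen / 2
        # Pull each object back toward its image-guided position.
        a[0] += lr * 0.1 * (guidance_a - a[0])
        b[0] += lr * 0.1 * (guidance_b - b[0])
    return a, b

# Two unit-width boxes initialized in deep overlap:
a, b = resolve((0.0, 0.5), (0.2, 0.5), guidance_a=0.0, guidance_b=0.2)
```

After the loop, the residual penetration is much smaller than the initial 0.8, while both boxes stay near their guided positions.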

Judge Module: Spatial Coherence Judgment

The scene judge module verifies the spatial coherence of the optimized layout. If the check fails, the pipeline repeats the refinement and optimization stages; once the scene passes, the final result is returned to the user.

Citation

@inproceedings{ling2024scenethesis,
  title={Scenethesis: Combining Language and Visual Priors for 3D Scene Generation},
  author={Ling, Lu and Lin, Chen-Hsuan and Lin, Tsung-Yi and Ding, Yifan and Zeng, Yu and Sheng, Yichen and Ge, Yunhao and Liu, Ming-Yu and Bera, Aniket and Li, Zhaoshuo},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}