Abstract
Designing 3D scenes is traditionally a challenging and laborious task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes from simple text descriptions. However, because these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. We generate the 2D intermediary image from a scene description, extract object shapes and appearances, create 3D models, and assemble them into the final scene with geometry, position, and pose extracted from the same image. Generalizing to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art methods by a large margin in layout and aesthetic quality, as measured by quantitative metrics. It also achieves an average winning rate of 74.89% in extensive user studies and 95.07% in GPT evaluation.
Pipeline
Taking a text prompt as input, ArtiScene first prompts a text-to-image model for an intermediary image (yellow line). Then, through object detection, inpainting, and prompting ChatGPT to describe each detected object's appearance and geometry, we acquire a 3D model for every object (blue line). In parallel, we combine monocular depth estimation with the previously detected 2D bounding boxes to estimate a 3D bounding box for each object (red line). We also synthesize floor and wall textures for indoor scenes (green line). Finally, we assemble the acquired models according to the extracted layout information to arrive at the final 3D scene.
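The red-line step — lifting a 2D detection to a 3D bounding box using monocular depth — can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions we introduce here (a pinhole camera with known intrinsics, an axis-aligned box taken as the per-axis extent of the unprojected points); the function name is hypothetical, and the actual pipeline additionally estimates object pose.

```python
import numpy as np

def lift_bbox_to_3d(bbox_2d, depth, fx, fy, cx, cy):
    """Estimate an axis-aligned 3D bounding box for one detected object
    by unprojecting the depth pixels inside its 2D bounding box with a
    pinhole camera model. `bbox_2d` is (u1, v1, u2, v2) in pixels;
    `depth` is an HxW metric depth map, e.g. from a monocular estimator."""
    u1, v1, u2, v2 = bbox_2d
    # Pixel grid covering the 2D box (meshgrid's default 'xy' indexing
    # yields arrays shaped like the depth crop below).
    us, vs = np.meshgrid(np.arange(u1, u2), np.arange(v1, v2))
    zs = depth[v1:v2, u1:u2]
    # Back-project every pixel to a camera-space 3D point.
    xs = (us - cx) * zs / fx
    ys = (vs - cy) * zs / fy
    pts = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    # Axis-aligned 3D box: per-axis min/max corners of the point cloud.
    return pts.min(axis=0), pts.max(axis=0)
```

In practice one would mask out background pixels inside the box (e.g. with the instance segmentation from the detection step) before taking the min/max, so that depth outliers behind the object do not inflate the box.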
Results
Benefiting from the high-quality guidance of the image intermediaries, our results are more visually appealing than those of previous state-of-the-art methods, which all assemble the final scene by asset retrieval. As we successfully detect and faithfully generate most of the objects in the image, our scenes have much richer content. For walls and floors, retrieval-based methods use a fixed texture library, whereas we generate texture images with diffusion models at runtime, which adds great variation and elevates the overall atmosphere of the scene. All assets in our results are generated with Edify-3D.
Diverse Scene Categories
a dental office
a game room
a meeting room
a nursery
a prison cell
a waiting room
a music studio
an operating room
a deli
a video store
a cute living room
a bakery
Diverse Styles
In each of the rows below, we fix the scene category (clinic room, bedroom, and bathroom) and vary the style. As these renderings show, ArtiScene generalizes not only across scene categories but also across styles.
a pink-themed clinic room
a Barbie-themed clinic room
a Barbie-themed clinic room
a purple-themed bedroom
a space-themed bedroom
a bohemian-styled bedroom
a space-themed bathroom
a teenager-styled bathroom
a Victorian-styled bathroom
Application: Local Editing
An important advantage of our method is its composability and editability. Because all objects are generated separately, we can easily change the appearance of one object without affecting the others. Here we take the segmented car image that was produced when generating the original scene on the left, and use Instruct-Pix2Pix to turn the car into a yellow Porsche. We then rerun the subsequent steps to acquire a scene with the new car.
a garage (before editing)
a garage (after editing)
Generalization: Outdoor Scenes
As we make minimal assumptions about the target scene category, ArtiScene also generalizes to outdoor scenes with small modifications.
a Chinese garden
an Indian temple
Citation
@inproceedings{artiscene2025,
title={ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary},
author={Gu, Zeqi and Cui, Yin and Li, Max and Wei, Fangyin and Ge, Yunhao and Gu, Jinwei and Liu, Ming-Yu and Davis, Abe and Ding, Yifan},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}