Overview
Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it refines the scenes until they satisfy the user's intent and are physically valid. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI.
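The generate, critique, and refine cycle described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not SAGE's implementation: the `Scene` class, both critics, and the `refine` rules are hypothetical stand-ins, and a scene is reduced to a mapping from object name to its height above a support surface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scene:
    # Hypothetical toy scene: object name -> gap above its support surface (m).
    objects: dict[str, float] = field(default_factory=dict)


# A critic returns a list of issues; an empty list means the scene passes.
Critic = Callable[[Scene], list[str]]


def semantic_critic(task_objects: list[str]) -> Critic:
    """Toy semantic check: every object the task mentions must exist."""
    def check(scene: Scene) -> list[str]:
        return [f"missing:{o}" for o in task_objects if o not in scene.objects]
    return check


def physics_critic(scene: Scene) -> list[str]:
    """Toy physical check: nothing may float above its support."""
    return [f"floating:{n}" for n, gap in scene.objects.items() if gap > 0.0]


def refine(scene: Scene, issue: str) -> None:
    """Pick a (hypothetical) repair based on the kind of issue reported."""
    kind, name = issue.split(":", 1)
    if kind == "missing":
        scene.objects[name] = 0.1   # toy generator: spawn slightly above the surface
    elif kind == "floating":
        scene.objects[name] = 0.0   # toy fix: snap the object onto its support


def generate(critics: list[Critic], max_rounds: int = 5) -> tuple[Scene, int]:
    """Refine until every critic passes or the round budget is exhausted."""
    scene = Scene()
    for round_idx in range(max_rounds):
        issues = [issue for critic in critics for issue in critic(scene)]
        if not issues:
            return scene, round_idx
        for issue in issues:
            refine(scene, issue)
    return scene, max_rounds  # budget exhausted; the scene was not re-checked


scene, rounds = generate([semantic_critic(["bowl", "table"]), physics_critic])
```

In this toy run the first round adds the missing objects, the second snaps them onto their supports, and the third finds no issues, so the loop stops after two refinement rounds.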
Generated Scenes
Single-Room Scenes
SAGE generates realistic, diverse, and semantically coherent scenes spanning various styles and functionalities, from Bedroom and Office spaces to creative themes like "Cyberpunk game den" and "Starry-night bedroom".
Bedroom
Living room
Fairy-tale princess room
Rusty and dusty restroom
Gym
Office
Cyberpunk game den
Starry-night bedroom
Meeting room
Children room
Golden and luxury bedroom
Muddy and dirty dining room
Multi-Room Scenes
SAGE extends easily to multi-room scenes at scale: it first generates the floor plan, then calls the generator MCP tools on multiple rooms in parallel.
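The floor-plan-then-parallel-rooms pattern can be sketched with `asyncio`. The example is illustrative only: `plan_floor` and `generate_room` are hypothetical placeholders (a hard-coded plan and a simulated delay), not SAGE's MCP tools, and the room names and sizes are invented for the demonstration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Room:
    name: str
    width_m: float
    depth_m: float


def plan_floor(prompt: str) -> list[Room]:
    """Hypothetical floor-plan step; a real planner would depend on the prompt."""
    return [
        Room("bedroom", 3.5, 4.0),
        Room("kitchen", 3.0, 3.0),
        Room("bathroom", 2.0, 2.5),
    ]


async def generate_room(room: Room) -> dict:
    """Hypothetical stand-in for one per-room generator tool call."""
    await asyncio.sleep(0.01)  # simulates the latency of a tool call
    return {"room": room.name, "area_m2": room.width_m * room.depth_m, "objects": []}


async def generate_scene(prompt: str) -> list[dict]:
    rooms = plan_floor(prompt)
    # gather() runs the per-room calls concurrently and keeps the input order.
    return await asyncio.gather(*(generate_room(r) for r in rooms))


scene = asyncio.run(generate_scene("The student apartment with one bedroom"))
```

Because the rooms are independent once the floor plan is fixed, the total wall-clock time in this sketch is close to that of the slowest single room rather than the sum over all rooms.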
The student apartment with one bedroom
The student apartment with two bedrooms
Mid-century modern family home
Craft supply hoarder's bungalow
Multilingual teacher's apartment
Naturalist's cabin
Image-Conditioned Scenes
Thanks to the agentic VLM Qwen3-VL, SAGE can be conditioned on a reference image. The agent cannot generate pixel-aligned scenes, but it can generate scenes that are semantically coherent with the reference image.
[Figure: three reference images (Ref Image), each shown with a generated scene (Generated Scene)]
Evaluation & Validation
Physical Stability
The critic agent evaluates physical stability and refines placement until scenes pass physical validity checks.
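The page does not say how the stability check is implemented, so the following is only a static heuristic offered as an illustration, not SAGE's critic. It assumes axis-aligned boxes, treats an object as stable when it rests on the support's top face with its centre over the support's footprint, and repairs a failing placement by clamping and dropping the object. The `Box` type, `is_stable`, and `settle` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cx: float     # centre x in the horizontal plane (m)
    cy: float     # centre y in the horizontal plane (m)
    z_min: float  # height of the bottom face (m)
    sx: float     # extent along x (m)
    sy: float     # extent along y (m)
    sz: float     # extent along z (m)

    @property
    def z_max(self) -> float:
        return self.z_min + self.sz


def is_stable(obj: Box, support: Box, tol: float = 1e-3) -> bool:
    """Toy check: the object touches the support's top face and its centre
    lies over the support's footprint."""
    resting = abs(obj.z_min - support.z_max) <= tol
    over_x = abs(obj.cx - support.cx) <= support.sx / 2
    over_y = abs(obj.cy - support.cy) <= support.sy / 2
    return resting and over_x and over_y


def settle(obj: Box, support: Box) -> Box:
    """Toy repair: clamp the centre into the footprint and drop onto the surface."""
    half_x, half_y = support.sx / 2, support.sy / 2
    cx = min(max(obj.cx, support.cx - half_x), support.cx + half_x)
    cy = min(max(obj.cy, support.cy - half_y), support.cy + half_y)
    return Box(cx, cy, support.z_max, obj.sx, obj.sy, obj.sz)


table = Box(0.0, 0.0, 0.0, 1.0, 0.6, 0.75)
bowl = Box(0.8, 0.0, 0.9, 0.2, 0.2, 0.1)  # hovering beyond the table edge

if not is_stable(bowl, table):
    bowl = settle(bowl, table)
```

A simulator-based check would also account for mass, friction, and collisions between neighbouring objects, which this sketch ignores.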
Embodied Policy Training
Policies trained purely on SAGE-generated data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI.
SAGE-10k Dataset
SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agent-driven pipeline introduced in SAGE. The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects.
Citation
@inproceedings{xia2026sage,
title={SAGE: Scalable Agentic 3D Scene Generation for Embodied AI},
author={Xia, Hongchi and Li, Xuan and Li, Zhaoshuo and Ma, Qianli and Xu, Jiashu and Liu, Ming-Yu and Cui, Yin and Lin, Tsung-Yi and Ma, Wei-Chiu and Wang, Shenlong and Song, Shuran and Wei, Fangyin},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}