Overview

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they meet the user's intent and pass physical validity checks. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI.
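The generate-critique-refine loop described above can be sketched in a few lines. This is a minimal illustration, not the actual SAGE implementation: the `generate`, `critique`, and `refine` helpers are hypothetical stand-ins for the framework's generator and critic agents.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    objects: list = field(default_factory=list)
    score: float = 0.0  # aggregate critic score in [0, 1]

def generate(task: str) -> Scene:
    # Hypothetical generator: propose an initial layout for the task.
    return Scene(objects=[task], score=0.4)

def critique(scene: Scene) -> tuple[float, str]:
    # Hypothetical critic: score semantics, realism, and stability,
    # and return feedback for the next refinement step.
    return scene.score, "ok" if scene.score >= 1.0 else "adjust placement"

def refine(scene: Scene, feedback: str) -> Scene:
    # Hypothetical refiner: revise the scene based on critic feedback.
    return Scene(objects=scene.objects, score=min(1.0, scene.score + 0.3))

def sage_loop(task: str, threshold: float = 0.9, max_iters: int = 5) -> Scene:
    """Iterate generator and critics until the scene passes validity checks."""
    scene = generate(task)
    for _ in range(max_iters):
        score, feedback = critique(scene)
        if score >= threshold:
            break
        scene = refine(scene, feedback)
    return scene

result = sage_loop("pick up a bowl and place it on the table")
print(result.score)  # -> 1.0 after two refinement rounds
```

The key design point is that the loop terminates either when the critics are satisfied or after a bounded number of iterations, so generation always produces a usable scene.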

SAGE system diagram

Generated Scenes

Single-Room Scenes

SAGE generates realistic, diverse, and semantically coherent scenes spanning various styles and functionalities, from Bedroom and Office spaces to creative themes like "Cyberpunk game den" and "Starry-night bedroom".

Bedroom

Living room

Fairy-tale princess room

Rusty and dusty restroom

Gym

Office

Cyberpunk game den

Starry-night bedroom

Meeting room

Children's room

Golden and luxury bedroom

Muddy and dirty dining room

Multi-Room Scenes

SAGE extends easily to multi-room scenes at scale: it first generates the floor plan and then calls generator MCP tools for multiple rooms in parallel.
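The fan-out pattern above can be sketched with `asyncio`. This is a simplified illustration under assumed interfaces: `generate_floor_plan` and `generate_room` are hypothetical stand-ins for the floor-plan step and the generator MCP tool calls.

```python
import asyncio

async def generate_floor_plan(prompt: str) -> list:
    # Hypothetical floor-plan step: split the prompt into rooms with bounds.
    return [("bedroom", (0, 0, 4, 4)), ("kitchen", (4, 0, 8, 4))]

async def generate_room(room_type: str, bounds: tuple) -> dict:
    # Hypothetical stand-in for a generator MCP tool call; the real
    # pipeline would dispatch this to a scene-generation server.
    await asyncio.sleep(0)  # simulate an async tool round-trip
    return {"room": room_type, "bounds": bounds, "objects": []}

async def build_scene(prompt: str) -> list:
    rooms = await generate_floor_plan(prompt)
    # Fan out one generator call per room and gather results concurrently.
    return await asyncio.gather(*(generate_room(r, b) for r, b in rooms))

scene = asyncio.run(build_scene("student apartment with one bedroom"))
print([r["room"] for r in scene])  # -> ['bedroom', 'kitchen']
```

Because each room's contents depend only on its bounds from the floor plan, the per-room calls are independent and can run fully in parallel.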

The student apartment with one bedroom

The student apartment with two bedrooms

Mid-century modern family home

Craft supply hoarder's bungalow

Multilingual teacher's apartment

Naturalist's cabin

Image-Conditioned Scenes

Thanks to the agentic VLM Qwen3-VL, SAGE can also be conditioned on a reference image. Although the agent cannot reproduce the scene pixel-for-pixel, it generates scenes that are semantically coherent with the reference image.

Reference image: green room

Ref Image

Reference image: village

Ref Image

Reference image: umbridge

Ref Image

Generated Scene

Generated Scene

Generated Scene

Evaluation & Validation

Physical Stability

The critic agent evaluates physical stability and refines placement until scenes pass physical validity checks.
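A toy version of such a stability check can be written with axis-aligned boxes: an object is considered stable if its base rests on the floor or on another object's top surface, and floating objects are snapped down onto their nearest support. This is only an illustrative sketch; the names and the box-based support test are assumptions, not SAGE's actual physics validation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x: float; y: float; z: float   # min corner
    w: float; d: float; h: float   # extents

def supported(box: Box, others: list, eps: float = 1e-6) -> bool:
    """Stable if the base sits on the floor or on another box's top face."""
    if abs(box.z) < eps:
        return True
    for o in others:
        overlaps = (box.x < o.x + o.w and o.x < box.x + box.w and
                    box.y < o.y + o.d and o.y < box.y + box.d)
        if overlaps and abs(box.z - (o.z + o.h)) < eps:
            return True
    return False

def refine_placement(boxes: list) -> list:
    """Snap each unsupported box down to the highest surface below it."""
    fixed = []
    for b in sorted(boxes, key=lambda b: b.z):   # settle bottom-up
        if not supported(b, fixed):
            tops = [o.z + o.h for o in fixed
                    if b.x < o.x + o.w and o.x < b.x + b.w and
                       b.y < o.y + o.d and o.y < b.y + b.d and
                       o.z + o.h <= b.z]
            b.z = max(tops, default=0.0)         # fall to support or floor
        fixed.append(b)
    return fixed

table = Box("table", 0, 0, 0, 1.0, 1.0, 0.75)
bowl = Box("bowl", 0.4, 0.4, 1.2, 0.2, 0.2, 0.1)  # floating above the table
refine_placement([table, bowl])
print(bowl.z)  # -> 0.75 (snapped onto the table top)
```

A production critic would instead run the scene in a physics simulator and measure object displacement after settling, but the accept-or-refine structure is the same.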

SAGE method overview

Embodied Policy Training

Policies trained purely on SAGE-generated data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI.

Policy training overview

SAGE-10k Dataset

SAGE-10k is a large-scale interactive indoor scene dataset featuring realistic layouts, generated by the agent-driven pipeline introduced in SAGE. The dataset contains 10,000 diverse scenes spanning 50 room types and styles, along with 565K uniquely generated 3D objects.

SAGE-10k dataset overview

Citation

@inproceedings{xia2026sage,
  title={SAGE: Scalable Agentic 3D Scene Generation for Embodied AI},
  author={Xia, Hongchi and Li, Xuan and Li, Zhaoshuo and Ma, Qianli and Xu, Jiashu and Liu, Ming-Yu and Cui, Yin and Lin, Tsung-Yi and Ma, Wei-Chiu and Wang, Shenlong and Song, Shuran and Wei, Fangyin},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}