AI Now Builds Realistic 3D Worlds from Simple Instructions for Robots to Learn in

Researchers are tackling the challenge of creating realistic, scalable 3D environments for training embodied artificial intelligence agents. Hongchi Xia of NVIDIA and the University of Illinois Urbana-Champaign, together with Xuan Li, Zhaoshuo Li, and collaborators including Ma, Xu, and Liu, presents SAGE, a novel agentic framework designed to automatically generate simulation-ready scenes from user-defined tasks such as “pick up a bowl and place it on the table”. This work represents a significant advance because SAGE moves beyond rule-based systems, instead employing iterative reasoning and adaptive tool selection to produce diverse, semantically plausible, and visually realistic environments. The resulting SAGE-10k dataset, and the demonstration that policies can be trained solely within these generated environments, point towards simulation-driven scaling and improved generalisation for embodied AI systems.

Addressing the challenges of costly and unsafe real-world data collection for embodied agents, this work introduces a system that understands user intent, such as “pick up a bowl and place it on the table”, and autonomously constructs detailed, interactive scenes.

The core of SAGE lies in its ability to couple multiple generative models for both layout and object composition with sophisticated critics that rigorously evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, the system refines scenes until they meet both user expectations and the demands of physics-based simulation.
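To make the generator–critic coupling concrete, here is a minimal Python sketch of such a closed refinement loop. The generator, critics, and refinement step below are stand-in placeholders written for illustration; they are not SAGE's actual tools or interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    task: str
    objects: list = field(default_factory=list)
    stable: bool = False

def propose_layout(task: str) -> Scene:
    # Stand-in generator: compose a scene containing the task-relevant objects.
    return Scene(task=task, objects=["bowl", "table"])

def visual_critic(scene: Scene) -> list:
    # Stand-in semantic check: every object the task needs should be present.
    return [f"missing {o}" for o in ("bowl", "table") if o not in scene.objects]

def physics_critic(scene: Scene) -> list:
    # Stand-in stability check: flag the scene until refinement resolves it.
    return [] if scene.stable else ["unstable placement"]

def refine(scene: Scene, issues: list) -> Scene:
    # Stand-in repair step: pretend the adjustment fixes the physics issue.
    scene.stable = True
    return scene

def generate_scene(task: str, max_iters: int = 5) -> Scene:
    scene = propose_layout(task)
    for _ in range(max_iters):
        issues = visual_critic(scene) + physics_critic(scene)
        if not issues:
            return scene  # meets both user intent and validity criteria
        scene = refine(scene, issues)
    return scene
```

The key property this loop illustrates is that critics, not a fixed schedule, decide when generation stops: the scene is only returned once no critic raises an issue.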

The resulting environments are not merely visually appealing but are directly deployable within modern simulators for training robot policies. This innovative approach overcomes limitations of existing scene-generation systems, which often rely on rigid, rule-based pipelines or struggle to produce physically valid scenes.

By integrating a visual critic and a physics critic, validated within the Isaac Sim simulator, SAGE ensures that generated environments are both semantically coherent and physically stable under gravity and collisions. This closed-loop system allows for self-correction and adaptive refinement, eliminating the need for pre-defined tool orderings.

Demonstrating the practical impact of this technology, policies trained exclusively on data generated by SAGE exhibit clear scaling trends and demonstrate improved generalisation to previously unseen objects and layouts. The research culminates in the creation of the SAGE-10k dataset, a comprehensive resource for advancing embodied AI research. Multi-level augmentation techniques, encompassing object configuration, category, and layout variations, further enhance the diversity and robustness of the generated data, paving the way for more adaptable and intelligent robotic systems.

Iterative 3D environment generation via semantic and physical validation

SAGE, an agentic framework, generates simulation-ready 3D environments directly from user-specified embodied tasks such as “pick up a bowl and place it on the table”. The system couples multiple generators responsible for layout and object composition with two distinct critics evaluating semantic plausibility, visual realism, and physical stability.

An initial floor plan generator establishes the basic spatial arrangement, followed by a structured layout generator refining the scene’s organization. Text-to-3D asset generation then populates the layout with appropriate objects, responding to the user’s task description. A visual critic continuously assesses the semantic coherence and spatial arrangement of the generated scenes.

This critic provides feedback on aspects like object relationships and overall scene plausibility, guiding the iterative refinement process. Complementing this is a physics critic, integrated with the Isaac Sim simulator, which validates the physical stability of the scene under gravitational forces and potential collisions.

This simulator-in-the-loop verification ensures that generated environments are physically plausible and suitable for robotic interaction. The framework operates under the Model Context Protocol, enabling adaptive orchestration of these generators and critics. Through iterative reasoning, SAGE self-refines the scenes, selecting tools and adjusting parameters until both user intent and validity criteria are met.
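One way to picture orchestration without a pre-defined tool ordering is a registry that maps each critic finding to a repair tool at run time. The issue tags and tool bodies below are invented for illustration; SAGE's real tools are exposed via the Model Context Protocol and selected by the agent's reasoning, not a lookup table.

```python
# Hypothetical registry of repair tools, keyed by the kind of issue a critic raises.
REPAIR_TOOLS = {}

def repair_tool(issue_tag: str):
    """Decorator registering a repair function for a given critic finding."""
    def register(fn):
        REPAIR_TOOLS[issue_tag] = fn
        return fn
    return register

@repair_tool("collision")
def resolve_collision(scene: dict) -> dict:
    scene["collisions"] = 0  # stand-in: re-place objects until no overlap
    return scene

@repair_tool("missing_object")
def regenerate_asset(scene: dict) -> dict:
    scene["objects"].append("bowl")  # stand-in: call text-to-3D generation
    return scene

def dispatch(scene: dict, issues: list) -> dict:
    # The order of repairs comes from the critics' findings on this
    # particular scene, not from a fixed pipeline.
    for tag in issues:
        scene = REPAIR_TOOLS[tag](scene)
    return scene
```

Because tools are invoked in response to observed issues, adding a new generator or critic only requires registering it, which mirrors the modularity the framework claims.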

Multi-level augmentation, encompassing object configuration, category, and layout variations, expands the dataset’s diversity. This process yields realistic, diverse environments directly deployable in modern simulators for training embodied AI policies, and the resulting SAGE-10k dataset facilitates scalable data generation.
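The three augmentation levels can be sketched as independent sampling axes applied to a base scene. The category and layout pools below are illustrative stand-ins, not SAGE's actual asset lists or layout vocabulary.

```python
import random

def augment(base_scene: dict, n: int = 4, seed: int = 0) -> list:
    """Produce n variants of a scene by varying object configuration,
    object category, and layout (illustrative pools, not SAGE's)."""
    rng = random.Random(seed)
    category_swaps = {"bowl": ["bowl", "mug", "plate"]}      # category level
    layout_variants = ["table_center", "table_edge", "counter"]  # layout level
    variants = []
    for _ in range(n):
        scene = dict(base_scene)
        scene["object"] = rng.choice(category_swaps[base_scene["object"]])
        scene["layout"] = rng.choice(layout_variants)
        # Configuration level: small pose perturbation of the object.
        scene["pose_jitter_m"] = round(rng.uniform(-0.05, 0.05), 3)
        variants.append(scene)
    return variants
```

Sampling the three axes independently is what multiplies diversity: a handful of options per level compounds into many distinct training scenes from one validated base.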

Enhanced Physical Stability and Realism in Agentic 3D Environment Generation

SAGE achieves 99.6% stability in generated scenes, as measured by relative translation under 0.2 metres and rotation under 8 degrees after 120 simulation steps within Isaac Sim. This metric assesses the physical realism of the environments created by the agentic framework. Collision rates were reduced to 1.9% through the implementation of a physics critic, a substantial improvement from an initial 7.8%.
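The stability criterion described above is straightforward to express in code. The sketch below checks an object's pose drift against the stated thresholds after the simulation rollout; the pose representation (position plus a single yaw angle in degrees) is a simplification for brevity, not the paper's exact parameterisation.

```python
import math

def is_stable(initial_pose, final_pose, max_trans_m=0.2, max_rot_deg=8.0):
    """An object counts as stable if, after the rollout (120 steps in the
    paper), it has translated less than max_trans_m metres and rotated
    less than max_rot_deg degrees from its initial pose."""
    (x0, y0, z0, yaw0), (x1, y1, z1, yaw1) = initial_pose, final_pose
    translation = math.dist((x0, y0, z0), (x1, y1, z1))
    # Wrap the angular difference into [-180, 180] before comparing.
    rotation = abs((yaw1 - yaw0 + 180.0) % 360.0 - 180.0)
    return translation < max_trans_m and rotation < max_rot_deg
```

A scene then passes the stability check only if every object in it satisfies this per-object test after the rollout.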

The visual critic demonstrably improved visual quality, enhancing the overall realism of the generated 3D environments. Quantitative results, detailed in tabular data, consistently demonstrate SAGE’s superior performance across visual quality and physical stability when compared to baseline methods. Holodeck generated scenes with lower realism and functionality, while SceneWeaver exhibited high collision rates and low stability due to a lack of simulator validation.

SAGE successfully generated highly diverse and stylized spaces, including Gym, Office, Cyberpunk game den, and Starry-night bedroom environments. The SAGE-10k dataset comprises 10,000 scenes across 50 room types and 50 styles, containing 565,000 uniquely generated 3D objects. This dataset supports large-scale community research and provides a rich resource for training embodied agents.

Multi-room layouts were successfully generated, demonstrating the framework’s ability to create connected floor plans and update generators for multiple room IDs. Image-conditioned scene generation, enabled by Qwen3-VL, allows SAGE to extract style and object attributes from reference images, producing semantically consistent scenes without architectural modifications. Furthermore, the modular design facilitates the integration of articulated assets from PartNet-Mobility into generated scenes, expanding the range of possible environments.

Agentic environment generation enhances embodied AI training and generalisation

SAGE, an agentic framework, automatically generates simulation-ready indoor environments from open-vocabulary text prompts, addressing the need for scalable and realistic data for embodied agents. This system couples generators for scene layout and object composition with critics assessing semantic plausibility, visual realism, and stability.

Through iterative refinement and adaptive tool selection, SAGE produces diverse and deployable environments suitable for training artificial intelligence policies. Policies trained exclusively on these generated environments demonstrate clear scaling trends and improved generalisation to novel objects and layouts.

Evaluation reveals that SAGE-trained policies outperform baseline approaches on both generated and existing scenes, achieving higher success rates in tasks such as pick-and-place and mobile manipulation. The framework’s success suggests a viable pathway towards scalable, simulation-driven learning for embodied artificial intelligence.

Current work focuses on indoor scenes with rigid-body physics, and the authors acknowledge limitations in extending the system to outdoor environments or incorporating articulated and deformable objects. Future research directions include expanding task capabilities beyond pick, place, and navigation, and integrating online reinforcement learning with real-world robotic validation to further enhance performance.

👉 More information
🗞 SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
🧠 ArXiv: https://arxiv.org/abs/2602.10116

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
