MomaGraph Advances Embodied AI with State-Aware Unified Scene Graphs and an 11.4% Improvement

Embodied artificial intelligence requires robots to understand and interact with complex environments, demanding a robust method for representing the world around them. Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, and colleagues introduce MomaGraph, a novel unified scene representation that integrates spatial relationships, object functionality, and interactive elements, offering a more complete picture of household environments. This research addresses key limitations of existing scene graph approaches, which often treat scenes as static or separate spatial from functional information, by creating a dynamic, task-aware understanding of surroundings. The team also presents MomaGraph-Scenes, a large-scale dataset of richly annotated scene graphs, and MomaGraph-Bench, a comprehensive evaluation suite, to rigorously test and advance this technology. The work culminates in MomaGraph-R1, a 7B vision-language model that achieves state-of-the-art results in task planning, demonstrates strong generalisation, and transfers successfully to real-world robotic experiments.

MomaGraph-R1: Scene Graphs for Robot Control

Scientists have developed MomaGraph-R1, a new artificial intelligence system designed to help robots understand and interact with indoor environments. The system generates detailed scene graphs: structured representations of objects, their spatial relationships, and how they can be used, allowing robots to plan and execute tasks effectively. The research team created a comprehensive dataset and employed reinforcement learning to train the system, demonstrating significant improvements over existing approaches. A core achievement is the MomaGraph-Scenes dataset, which combines real and simulated data with detailed annotations of task-relevant scene graphs.
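
To make the unified representation concrete, the sketch below shows one plausible way such a scene graph could be encoded. This is an illustrative assumption, not a schema published with the paper: the class names, field names, and example states (`SceneObject`, `Relation`, `state`, `parts`) are hypothetical.

```python
# Hypothetical sketch of a state-aware unified scene graph, in the spirit of
# MomaGraph. The paper's actual schema may differ; names here are assumptions.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                                        # e.g. "microwave"
    state: str = "unknown"                           # state-aware node, e.g. "door_closed"
    parts: list[str] = field(default_factory=list)   # interactive parts, e.g. ["door"]

@dataclass
class Relation:
    subject: str
    predicate: str   # spatial ("on", "inside") or functional ("opens", "heats")
    obj: str

@dataclass
class SceneGraph:
    objects: dict[str, SceneObject] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

# A tiny example scene: a mug on a counter next to a microwave with an openable door.
graph = SceneGraph()
graph.objects["mug"] = SceneObject("mug", state="empty")
graph.objects["microwave"] = SceneObject("microwave", state="door_closed",
                                         parts=["door", "start_button"])
graph.relations.append(Relation("mug", "on", "counter"))
graph.relations.append(Relation("microwave.door", "gives_access_to", "microwave.interior"))
```

The point this illustrates is the paper's central framing: spatial relations, functional relations, and per-object state live in a single graph rather than in separate representations.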

This dataset provides the foundation for training AI models to understand complex indoor environments. The team then developed MomaGraph-R1, a 7B vision-language model trained using reinforcement learning, guided by a reward function that prioritises accurate and task-oriented scene graph construction. The system excels at generating structured scene graphs that accurately capture the relationships between objects and their potential uses, allowing robots to plan sequences of actions for everyday tasks. Demonstrations with a physical robot in realistic household environments confirm the system’s ability to operate effectively in the real world.

The Scientists' Method

Scientists developed MomaGraph, a novel scene representation that unifies spatial and functional relationships within household environments. This representation incorporates detailed information about object parts and how they interact, creating a more compact, dynamic, and task-relevant understanding of the environment. To support this work, the team constructed MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs that includes multi-view observations and executed actions. To learn and evaluate this representation, the researchers developed MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on the MomaGraph-Scenes dataset.
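
A single training example in such a dataset might take the following shape. The field names and action strings below are assumptions for illustration; the summary only states that each example pairs multi-view observations and executed actions with a task-aligned scene graph.

```python
# Hypothetical shape of one MomaGraph-Scenes record. The released dataset's
# actual schema may differ; this only mirrors what the summary describes.
example_record = {
    "task": "heat the mug of water",
    "observations": ["view_front.png", "view_side.png"],      # multi-view images
    "actions": ["open(microwave.door)",                       # executed actions
                "place(mug, microwave.interior)"],
    "scene_graph": {                                          # task-aligned annotation
        "objects": [{"name": "mug", "state": "filled"},
                    {"name": "microwave", "state": "door_closed"}],
        "relations": [["mug", "on", "counter"]],
    },
}
```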

A specifically designed reward function guided the model towards constructing accurate, task-oriented scene graphs, optimising its spatial and functional reasoning capabilities. The model not only predicts scene graphs but also functions as a zero-shot task planner, generating structured scene graphs as an intermediate step before planning actions, which significantly improves reasoning effectiveness. Experiments demonstrate that MomaGraph-R1 achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the new MomaGraph-Bench benchmark, a substantial 11.4% improvement over previous methods. These gains translate into strong generalisation across public benchmarks and effective performance in real-world robotic experiments, demonstrating the practical impact of the approach.
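
The summary does not spell out the reward function, but a natural reading is that predicted graphs are scored against the annotated ones. The sketch below shows one plausible instantiation, an F1-style overlap over relation triples; treat it as an assumption rather than the paper's actual formula.

```python
# Minimal sketch of a graph-matching reward for RL fine-tuning. An F1-style
# overlap over (subject, predicate, object) triples is one plausible choice;
# the paper's actual reward is not reproduced here.
def graph_reward(predicted: set, reference: set) -> float:
    """Score predicted relation triples against annotated ones, in [0, 1]."""
    if not predicted or not reference:
        return 0.0
    correct = len(predicted & reference)
    precision = correct / len(predicted)
    recall = correct / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model recovers one of two annotated relations.
ref = {("mug", "on", "counter"), ("mug", "inside", "microwave")}
pred = {("mug", "on", "counter"), ("mug", "near", "sink")}
print(graph_reward(pred, ref))  # 0.5
```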

MomaGraph Unifies Spatial and Functional Scene Understanding

Scientists have developed MomaGraph, a novel scene representation that unifies spatial and functional relationships within household environments and incorporates detailed information about object parts and their interactions. This design addresses limitations of existing scene graphs, which typically treat spatial and functional relationships separately and struggle with dynamic scenes. The team constructed MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs, providing detailed multi-view observations, executed actions, and task-aligned annotations. Building on this foundation, researchers developed MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on the MomaGraph-Scenes dataset, guided by a reward function designed to optimise the construction of accurate, task-oriented scene graphs.

This model not only predicts scene graphs but also functions as a zero-shot task planner, generating structured scene graphs as intermediate representations that improve both reasoning effectiveness and interpretability. Extensive experiments show that MomaGraph-R1 achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on MomaGraph-Bench, a substantial 11.4% improvement over the best baseline. Results confirm that MomaGraph-R1 generalises effectively across public benchmarks and translates these gains into strong performance in real-world robotic experiments, marking a significant step towards intelligent, embodied agents.
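
The "graph first, plan second" pattern described above reduces to a simple two-stage loop. In the sketch below, `query_vlm` is a hypothetical stand-in for whatever inference API serves the model, and the prompts are illustrative rather than the paper's.

```python
# Sketch of using a structured scene graph as an intermediate step before
# planning. `query_vlm(image, prompt)` is a hypothetical VLM call, not an
# API from the paper or any specific library.
def plan_with_scene_graph(image, task, query_vlm):
    # Stage 1: elicit an explicit scene graph from the observation.
    graph = query_vlm(
        image,
        prompt="List objects, their states, and spatial/functional relations as triples.",
    )
    # Stage 2: plan actions conditioned on the graph, not on raw pixels alone.
    plan = query_vlm(
        image,
        prompt=f"Scene graph:\n{graph}\n\nTask: {task}\nGive a step-by-step action plan.",
    )
    return graph, plan
```

Making the graph explicit in this way is what lets the plan be inspected and checked against the scene, which is the interpretability benefit the authors highlight.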

MomaGraph Enables Dynamic, Task-Relevant Scene Understanding

This work addresses fundamental limitations of existing scene graph representations for embodied agents: their reliance on a single relationship type, their inability to adapt to dynamic environments, and their lack of task relevance. Researchers introduce MomaGraph, a novel scene representation that unifies spatial and functional scene graphs with interactive elements, enabling a more comprehensive understanding of household environments. To facilitate learning this representation, the team constructed MomaGraph-Scenes, a large-scale dataset of richly annotated scene graphs, and developed MomaGraph-R1, a 7B vision-language model trained with reinforcement learning. MomaGraph-R1 predicts task-oriented scene graphs and functions as a zero-shot task planner, demonstrating state-of-the-art performance among open-source models and competitive results against closed-source systems.

Extensive experiments confirm its ability to generalise across public benchmarks and transfer effectively to real-robot experiments, achieving 71.6% accuracy on MomaGraph-Bench and a 70% overall success rate on complex multi-step tasks. The evaluation relies on MomaGraph-Bench, a comprehensive benchmark the researchers introduced to rigorously evaluate both fine-grained reasoning and high-level planning capabilities. This work establishes a foundation for advancing scene representations, fostering stronger connections between the spatial vision-language modelling and robotics communities, and ultimately enabling more intelligent and adaptive embodied agents that operate effectively in real-world environments.

👉 More information
🗞 MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
🧠 ArXiv: https://arxiv.org/abs/2512.16909

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
