Multi-Agent Robotic System Challenge Advances Embodied AI Planning and Control

Scientists are tackling the growing complexity of Embodied AI with a new multi-agent robotic system challenge. Li Kang, alongside Heng Zhou from USTC, Xiufeng Song from SJTU and colleagues, designed the Multi-Agent Robotic System (MARS) Challenge, hosted at the NeurIPS 2025 Workshop on SpaVLE, to push the boundaries of collaborative artificial intelligence. The competition is distinctive in covering both planning and control: participants use vision-language models (VLMs) to coordinate teams of robots in dynamic environments. By evaluating submitted solutions, MARS promises valuable insights into scalable, efficient and advanced human-agent interaction, representing a significant step towards truly collaborative AI systems.

Overall, planning scores concentrate in the 0.4 to 0.6 range, indicating substantial headroom. Fig. 0.1 provides a more fine-grained view: for many tasks, a large fraction of teams achieve non-zero accuracy, suggesting that generating feasible action sequences is often within reach. However, high-accuracy tasks are comparatively sparse, and performance drops markedly on a non-trivial subset of tasks. These patterns imply that the primary bottleneck is not merely producing valid plans but producing high-quality plans that remain feasible while exploiting multi-agent parallelism and minimising redundancy; efficiency and coordination remain challenging even when task completion is possible.

The researchers also identified the 20 hardest tasks (Fig. 0.2), highlighting scenarios where even top-performing teams struggle. Analysis of failure cases reveals several recurring difficulties. Ambiguous instructions such as “carry the food” require comprehensive scene understanding, where missing even a single relevant object can significantly degrade performance. Effective solutions demand parallel action assignment across agents, whereas sequential planning underutilises the available robots. Finally, long-horizon tasks amplify the impact of early planning errors because evaluation is prefix-based, which makes initial agent selection and action ordering particularly critical. These findings indicate that the Planning Track evaluates not only action prediction accuracy but also collaborative efficiency and holistic multi-agent reasoning.

For the Control Track, shown in Fig. 0.3, the researchers designed four tasks that require collaboration between robotic arms, including Place Cube in Cup, Strike Cube (Hard) and Three Robots. All final inference models had to run on a single RTX 4090 GPU, a standardised computational constraint that allows a fair comparison of algorithmic efficiency.
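To see why prefix-based evaluation punishes early mistakes so heavily, consider a minimal sketch. It assumes, purely for illustration, that a plan is a list of (agent, action, target) steps and that credit stops at the first step that differs from the reference. The function `prefix_score`, the step format and the robot names are hypothetical and are not the challenge's official metric.

```python
from typing import Sequence

# Hypothetical step format: (agent, action, target). Not the official MARS schema.
Step = tuple[str, str, str]


def prefix_score(predicted: Sequence[Step], reference: Sequence[Step]) -> float:
    """Fraction of the reference plan matched before the first divergence.

    Assumption for illustration: credit stops at the first mismatching step,
    so everything after an early error earns nothing.
    """
    if not reference:
        return 0.0
    matched = 0
    for pred_step, ref_step in zip(predicted, reference):
        if pred_step != ref_step:
            break
        matched += 1
    return matched / len(reference)


reference = [
    ("robot_1", "open", "fridge"),
    ("robot_2", "pick", "apple"),
    ("robot_2", "place", "fridge"),
    ("robot_1", "close", "fridge"),
]
# Wrong agent chosen at step 2; every later step is still correct.
early_error = [reference[0], ("robot_1", "pick", "apple"), *reference[2:]]
# The same kind of mistake, but only at the final step.
late_error = [*reference[:3], ("robot_2", "close", "fridge")]
```

With four reference steps, picking the wrong agent at step 2 caps the score at 0.25 even though the remaining steps are right, while the same mistake at step 4 still earns 0.75. This is why initial agent selection and action ordering matter so much on long-horizon tasks.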
While some progress occurred in dual-arm scenarios, the final two tasks, which require coordination among three or more robotic arms, saw near-total failure, highlighting the exponential increase in complexity as the action space expands. The champion solution, “Scaling Embodied Planning via Self-Correction”, pioneered a framework that leverages the iterative refinement capabilities of VLMs, treating planning as an evolving process rather than a single deterministic pass. The approach generates multiple stochastic plans and uses a voting mechanism to select a consensus solution, which helps it escape suboptimal local optima and improves robustness in scenarios with heterogeneous agents. The team bootstrapped this process with manually annotated examples, then used VLMs to generate seed training data from task instructions and scene observations, and subsequently refined the plans through supervised fine-tuning and iterative data expansion. The runner-up solution, “Modular Closed-Loop Framework for Multi-Agent Coordination”, adopted a structural decomposition strategy in which three collaborating components (an Activate Agent, a Planning Agent and a Monitor Agent) process user instructions and scene images. The Activate and Planning Agents were trained with supervised fine-tuning on datasets derived from an adjusted version of the VIKI benchmark, while the self-correcting data generation pipeline addressed data scarcity and task heterogeneity, demonstrating a complementary approach to enhancing multi-agent coordination.
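The voting step of the champion solution can be illustrated with a simple majority vote over sampled plans. The sketch below is an assumption-laden toy, not the team's code: `noisy_planner` is a hypothetical stand-in for a stochastic VLM planner, plans are tuples of made-up "agent:action" strings, and the evaluation and self-correction stages are omitted.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical representation: a plan is an ordered tuple of "agent:action" strings.
Plan = tuple[str, ...]


def consensus_plan(sample_plan: Callable[[random.Random], Plan],
                   num_samples: int = 9, seed: int = 0) -> Plan:
    """Draw several stochastic plans and return the most frequent one."""
    rng = random.Random(seed)
    votes = Counter(sample_plan(rng) for _ in range(num_samples))
    # For equal counts, most_common() keeps first-encountered order,
    # so ties go to the plan that was sampled first.
    best_plan, _ = votes.most_common(1)[0]
    return best_plan


# Toy stand-in for a VLM planner: returns the intended plan about 70% of the
# time, otherwise a plan with the first two steps swapped. Purely illustrative.
GOOD: Plan = ("arm_1:pick_cube", "arm_2:hold_cup", "arm_1:place_cube")
BAD: Plan = ("arm_2:hold_cup", "arm_1:pick_cube", "arm_1:place_cube")


def noisy_planner(rng: random.Random) -> Plan:
    return GOOD if rng.random() < 0.7 else BAD


chosen = consensus_plan(noisy_planner, num_samples=25)
```

Sampling several plans and keeping the most frequent one trades extra inference calls for robustness: an occasional misordered plan is outvoted as long as the planner produces the intended plan more often than any single alternative.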

MARS Challenge Reveals VLM Planning Limitations

Experiments revealed that overall planning scores concentrated in the 0.4 to 0.6 range, indicating considerable room for improvement under the track’s evaluation criteria. While many teams achieved non-zero accuracy on several tasks, high-accuracy solutions remained sparse, and performance decreased markedly on a subset of more challenging tasks. The team also characterised task complexity: simple tasks, such as opening an appliance, require short action sequences, whereas complex tasks demand coordinated execution by multiple robots over extended horizons. For example, task_147, which involves transporting multiple food items into a refrigerator, is a long-horizon multi-agent task requiring up to ten planning steps.
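The cost of planning sequentially when several robots are available can be made concrete with a small data structure. This sketch assumes a plan is a list of time steps, each mapping the acting robots to their actions. The two-robot fridge-loading plan is hypothetical (loosely inspired by the task_147 description, not taken from the benchmark), and `to_sequential` and `utilisation` are illustrative helpers, not part of any MARS tooling.

```python
# Hypothetical plan format: each time step maps an acting agent to an action.
ParallelPlan = list[dict[str, str]]


def to_sequential(plan: ParallelPlan) -> ParallelPlan:
    """Serialise a parallel plan so that only one agent acts per step."""
    return [{agent: action} for step in plan for agent, action in step.items()]


def utilisation(plan: ParallelPlan, agents: list[str]) -> float:
    """Fraction of (agent, step) slots in which that agent is acting."""
    if not plan or not agents:
        return 0.0
    busy = sum(len(step) for step in plan)
    return busy / (len(plan) * len(agents))


agents = ["robot_a", "robot_b"]
parallel: ParallelPlan = [
    {"robot_a": "open(fridge)", "robot_b": "pick(apple)"},
    {"robot_b": "place(apple, fridge)", "robot_a": "pick(banana)"},
    {"robot_a": "place(banana, fridge)"},
    {"robot_a": "close(fridge)"},
]
sequential = to_sequential(parallel)
```

Here the parallel plan finishes in 4 steps with 75% of robot-step slots in use, while the serialised version needs 6 steps at 50% utilisation. The same six actions get done either way, which is why feasibility alone does not capture plan quality.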
The results show that the primary bottleneck was not generating valid plans but producing high-quality plans that were both feasible and efficient, even when task completion was possible. The 20 hardest tasks, where even top-performing teams struggled, point to the need for better collaborative efficiency and holistic multi-agent reasoning. In the Control Track, each task demanded both perception and decision-making from the agents; all robots were operated via joint position control, and solutions were evaluated using a unified model architecture. Each task was run for 100 trials, with the average success rate serving as the participant’s score, and participants were allowed unlimited data collection with flexible camera positioning and choice of data modalities. These results indicate that current control frameworks lack the robustness and generalisation needed for high-dimensional multi-agent collaborative tasks, particularly those requiring precise cooperation among three or more robotic arms. The champion solution, “Scaling Embodied Planning via Self-Correction”, introduced a framework that leverages the iterative refinement capabilities of VLMs, treating planning as an evolving process of generation, evaluation and consensus.
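The Control Track scoring rule can be sketched as follows. The 100-trial count comes from the challenge description. Aggregating across tasks with an unweighted mean, the function names and the task labels are assumptions for illustration, and the outcome counts are made-up toy inputs rather than reported results.

```python
from statistics import mean

TRIALS_PER_TASK = 100  # trials per task, as stated for the challenge


def task_success_rate(outcomes: list[bool]) -> float:
    """Average success over one task's evaluation trials."""
    if len(outcomes) != TRIALS_PER_TASK:
        raise ValueError(f"expected {TRIALS_PER_TASK} trials, got {len(outcomes)}")
    return sum(outcomes) / len(outcomes)


def participant_score(results: dict[str, list[bool]]) -> float:
    """Assumption: overall score is the unweighted mean of per-task success rates."""
    return mean(task_success_rate(outcomes) for outcomes in results.values())


# Toy inputs only: hypothetical task names and invented success counts.
results = {
    "task_a": [True] * 50 + [False] * 50,
    "task_b": [True] * 25 + [False] * 75,
}
score = participant_score(results)  # (0.5 + 0.25) / 2
```

Under this assumed aggregation, a task where every trial fails (as happened on most three-arm attempts) pulls the overall score down just as much as a perfect task pushes it up, so near-total failure on the hardest tasks weighs heavily on the final ranking.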

MARS Challenge Highlights Optimisation and Specialisation in Robotics

Scientists are increasingly focused on multi-agent systems as a crucial step towards scalable and efficient Embodied AI. Findings from the MARS Challenge emphasise the significance of iterative optimisation and agent specialisation for improved multi-agent collaboration. RoboFactory, a structured benchmark, enabled systematic evaluation of collaborative manipulation, while MARS itself provided a benchmark for embodied multi-agent manipulation on common tasks. The authors acknowledge limitations inherent in benchmarking complex systems, noting the difficulty of fully capturing real-world variability within a simulated environment.

Future research directions could explore more robust generalisation to unseen scenarios and the integration of human feedback to refine agent behaviour. Ultimately, this work contributes to the development of scalable, flexible, and efficient multi-agent systems, offering valuable insights for future research and real-world applications of advanced collaborative AI. The challenge’s results demonstrate the potential for coordinated robotic systems to tackle complex tasks, paving the way for more sophisticated and adaptable AI solutions.

👉 More information
🗞 Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge
🧠 ArXiv: https://arxiv.org/abs/2601.18733

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging and relevant to the modern world.