MAPPA Improves Multiagent Systems by up to 17.5pp on Unseen Competition Math Problems

Researchers are tackling two significant hurdles that prevent wider adoption of multiagent systems: assigning credit for success and maximising learning from limited interactions. Ed Li, Junyu Ren, and Cat Yan, all from Yale University, present a novel approach, finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA), to address these issues. Their work demonstrates that providing feedback on individual actions, rather than solely at the end of a task, enables more granular supervision and extracts greater learning potential from each trial. Results on challenging competition math problems and tool-augmented data analysis show substantial improvements: up to 17.5 percentage points on AIME and AMC math problems, and a 12.5 percentage point increase in data analysis success rates, suggesting a crucial step towards scaling these systems for complex tasks with minimal human input.

The research team addressed key challenges in finetuning multiple agents simultaneously: credit assignment and sample efficiency of multiagent rollouts.

Their work introduces a method called finetuning multiagent systems with per-action process rewards from AI feedback, or MAPPA, which assigns credit to individual agent actions rather than solely at task completion. This allows for fine-grained supervision without requiring ground truth labels, maximising the training signal extracted from each rollout.
Experiments showcased MAPPA’s effectiveness on both competition math problems and tool-augmented data analysis tasks. On unseen math problems, the system achieved improvements ranging from +5.0 to +17.5 percentage points on the AIME benchmark and +7.8 to +17.2 percentage points on the AMC benchmark. For data analysis tasks, the method improved success rates by +12.5 percentage points, with quality metrics increasing by up to 30 percent.

These results validate that per-action supervision can drive improvements across diverse multiagent systems and various domains. The core innovation lies in leveraging language models as coaches to assess the quality of each agent’s action, considering its role, inputs, and environment feedback. This yields dense learning signals throughout the trajectory, even when tasks fail, and enables implicit credit assignment, identifying the responsible agent when errors occur.
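To make the coaching idea concrete, the sketch below shows one way a language-model coach could be prompted to score a single agent action given its role, inputs, output, and environment feedback. The prompt wording, data structure, and `score_action` helper are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of coach-based per-action scoring; the prompt wording
# and function names are illustrative, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class AgentAction:
    agent_role: str     # e.g. "planner", "solver", "verifier"
    inputs: str         # what the agent received (task, upstream messages)
    output: str         # the agent's response or tool call
    env_feedback: str   # e.g. tool execution result or error traceback

COACH_PROMPT = """You are a coach evaluating one action inside a multiagent rollout.
Agent role: {role}
Agent inputs: {inputs}
Agent output: {output}
Environment feedback: {feedback}
Rate the action from 0 (harmful) to 10 (excellent), considering whether it
advances the overall task. Answer with a line of the form: SCORE: <number>"""

def score_action(coach_llm, action: AgentAction) -> float:
    """Ask the coach model to rate a single action; returns a reward in [0, 1]."""
    reply = coach_llm(COACH_PROMPT.format(
        role=action.agent_role,
        inputs=action.inputs,
        output=action.output,
        feedback=action.env_feedback,
    ))
    # Parse "SCORE: 7" style answers; fall back to a neutral score on failure.
    for line in reply.splitlines():
        if line.strip().upper().startswith("SCORE:"):
            try:
                return float(line.split(":", 1)[1].strip()) / 10.0
            except ValueError:
                break
    return 0.5
```

Because the score depends only on the action's role, inputs, output, and feedback, a trajectory yields one reward per action even when the final answer is wrong.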

By scaling the number of training signals with the number of actions, MAPPA dramatically improves sample efficiency compared to outcome-based methods. This research establishes a general framework for training multiagent systems on complex, tool-augmented tasks via coach-guided reinforcement learning.
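As a toy illustration of that scaling (with made-up numbers, not figures from the paper), consider how many training signals a small batch of rollouts yields under outcome-only versus per-action rewards:

```python
# Toy illustration of the sample-efficiency argument; the numbers are made up
# and the rollout format is an assumption for this example only.
rollouts = [
    ["plan", "solve", "verify"],         # rollout 1: 3 agent actions
    ["plan", "code", "run", "report"],   # rollout 2: 4 agent actions
]

outcome_signals = len(rollouts)                      # outcome-only reward: 2 signals
per_action_signals = sum(len(r) for r in rollouts)   # per-action rewards: 7 signals
print(outcome_signals, per_action_signals)           # -> 2 7
```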

The agent topology, reward structure, and training pipeline are domain-agnostic, supporting diverse applications. The team’s source code is publicly available, and their work represents a crucial first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human intervention.

Finetuning multiagent systems via per-action rewards from language model coaches enables more effective collaboration

Scientists developed a novel methodology, MAPPA (finetuning MultiAgent systems with Per-action Process rewards from AI feedback), to address challenges in training multiagent systems for complex tasks. The study tackled issues of credit assignment and sample efficiency in multiagent rollouts by assigning rewards to individual agent actions rather than solely at task completion.

This approach enables fine-grained supervision without requiring ground truth labels, maximising the training signal extracted from each rollout. Researchers engineered a system where language models function as coaches, assessing the quality of each agent’s action based on its role, inputs, and environmental feedback, such as tool execution results.

This yielded dense learning signals throughout the trajectory, facilitating effective training even when tasks failed. Crucially, the coach implicitly assigns credit, penalising upstream agents for faulty actions that cause downstream errors, rather than punishing the downstream agents themselves. Experiments employed two distinct domains: mathematical reasoning using the MathChat environment and end-to-end data science pipelines within DSBench.

On MathChat, a three-agent system achieved performance improvements of +5.0 to +17.5 percentage points on AIME and +7.8 to +17.2 percentage points on AMC, utilising two different model configurations. For DSBench, the method improved success rate by +12.5 percentage points, with quality metrics increasing by up to 30%, demonstrating the effectiveness of per-action supervision across diverse multiagent systems. The team harnessed this approach to validate that scaling the number of specialised agents represents a promising new dimension for improving performance on complex, long-horizon tasks.

MAPPA achieves substantial gains in multiagent math reasoning and data analysis through per-action process rewards

Scientists have developed a new method, MAPPA, for finetuning multiagent systems that addresses key challenges in credit assignment and sample efficiency. The research demonstrates that assigning credit to individual actions, rather than solely at task completion, enables fine-grained supervision without requiring ground truth labels.

Experiments revealed that this approach extracts maximal training signal from each rollout, leading to significant performance improvements across diverse domains. On unseen mathematical problems, the team measured a +5.0 to +17.5 percentage point (pp) increase on the AIME benchmark and a +7.8 to +17.2pp improvement on the AMC benchmark.

These gains were achieved across two distinct model configurations, validating the robustness of the MAPPA framework. For complex data analysis tasks using the DSBench platform, the method improved the success rate by +12.5pp, while simultaneously enhancing quality metrics by up to 30%. The breakthrough delivers dense learning signals throughout task trajectories, even when tasks fail, by leveraging language models as coaches to assess the quality of each agent’s action.

Crucially, the coach performs implicit credit assignment; for example, when a downstream agent encounters a file-not-found error, the coach assigns low scores to the upstream agent responsible for producing the missing file. Measurements confirm that the number of training signals scales with the number of actions taken, dramatically improving sample efficiency compared to outcome-based methods.
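The snippet below mimics that file-not-found case with a hard-coded heuristic purely for illustration; in MAPPA the coach model reaches this judgement from the trajectory context rather than from fixed rules, and the field and agent names here are invented.

```python
# Hypothetical illustration of the implicit credit assignment described above.
# MAPPA's coach judges this from context; the hard-coded rule below only
# reproduces the file-not-found example, and the field names are invented.
def assign_blame(trajectory):
    """trajectory: list of steps with 'agent', 'expected_outputs', 'error' keys."""
    scores = {step["agent"]: 1.0 for step in trajectory}
    for i, step in enumerate(trajectory):
        error = step.get("error") or ""
        if "FileNotFoundError" in error:
            missing = step.get("missing_file")
            # Penalise the upstream agent that should have produced the file,
            # not the downstream agent that merely tried to read it.
            for upstream in trajectory[:i]:
                if missing in upstream.get("expected_outputs", []):
                    scores[upstream["agent"]] = 0.1
                    break
    return scores

trajectory = [
    {"agent": "data_engineer", "expected_outputs": ["results.csv"], "error": None},
    {"agent": "analyst", "expected_outputs": [],
     "error": "FileNotFoundError: results.csv", "missing_file": "results.csv"},
]
print(assign_blame(trajectory))   # -> {'data_engineer': 0.1, 'analyst': 1.0}
```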

Tests prove that MAPPA can function with or without ground truth verification, offering flexibility in various application scenarios. The research presents a general framework whose components, the agent topology, reward structure, and training pipeline, are designed to be domain-agnostic and adaptable to diverse applications. Overall, the results suggest that scaling the number of specialised agents represents a promising new dimension for improving performance on complex, long-horizon tasks with minimal human supervision.
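One way to picture that domain-agnostic design is as a configuration in which only the environment and agent roles change between domains, while the reward structure and training pipeline stay fixed. The field names and values below are assumptions for illustration and are not taken from the released code.

```python
# Hypothetical configuration sketch of the domain-agnostic setup; the field
# names and values are illustrative, not drawn from the released code.
mathchat_config = {
    "environment": "MathChat",
    "agents": ["planner", "solver", "verifier"],     # agent topology varies by domain
    "reward": "per_action_coach_score",              # reward structure stays the same
    "trainer": "rl_finetune",                        # training pipeline stays the same
}

dsbench_config = {
    "environment": "DSBench",
    "agents": ["data_engineer", "analyst", "reporter"],
    "reward": "per_action_coach_score",
    "trainer": "rl_finetune",
}
```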

Per-action rewards enhance multiagent system finetuning and performance significantly

Scientists have developed a new method, termed MAPPA (finetuning multiagent systems with per-action process rewards from AI feedback), to improve the training of multiagent systems. This approach addresses key challenges in finetuning multiple agents simultaneously: effectively assigning credit for actions and maximising sample efficiency during expensive multiagent rollouts.

By providing feedback on individual actions rather than solely at task completion, MAPPA enables detailed supervision without requiring ground truth labels and extracts more training signal from each rollout. The research demonstrates significant performance gains on both competition math problems and tool-augmented data analysis tasks.

On unseen math problems, MAPPA achieved improvements of between 5.0 and 17.5 percentage points on the AIME benchmark and 7.8 to 17.2 percentage points on the AMC benchmark. Furthermore, success rates in data analysis tasks increased by 12.5 percentage points, with quality metrics improving by up to 30 percent.

These results validate the effectiveness of per-action supervision across diverse multiagent system applications. The authors acknowledge that reward hacking, where agents optimise for assigned scores rather than overall system success, remains a potential concern. They highlight the importance of monitoring behavioural metrics, such as response length and tool call rate, to diagnose and mitigate such issues.
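A minimal sketch of such monitoring, assuming rollouts are stored as lists of actions with a response text and a tool-call flag (a format chosen here for illustration, not the authors' logging schema):

```python
# Sketch of the behavioural monitoring the authors recommend; the rollout
# format and metric definitions here are assumptions, not their tooling.
from statistics import mean

def behaviour_metrics(rollouts):
    """rollouts: list of rollouts, each a list of actions with
    'text' (agent response) and 'is_tool_call' (bool) fields."""
    lengths, tool_rates = [], []
    for actions in rollouts:
        lengths.append(mean(len(a["text"]) for a in actions))
        tool_rates.append(sum(a["is_tool_call"] for a in actions) / len(actions))
    return {
        "mean_response_length": mean(lengths),
        "mean_tool_call_rate": mean(tool_rates),
    }

# A sudden jump in response length or a collapse in tool usage between training
# checkpoints can flag reward hacking before task success metrics degrade.
```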

Future research should focus on auditing coach models for biases and continuously monitoring agent behaviour to ensure alignment with desired outcomes. This work represents a step towards scaling multiagent systems for complex, long-horizon tasks with minimal human oversight, potentially enabling advancements in areas like enterprise workflows and scientific discovery.

👉 More information
🗞 Scaling Multiagent Systems with Process Rewards
🧠 ArXiv: https://arxiv.org/abs/2601.23228

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
