Training large language models to reason effectively typically requires reinforcement learning with task-specific verifiers that check answers, but many real-world problems lack such verifiers, leaving valuable examples from human experts untapped. Locke Cai from the Massachusetts Institute of Technology and Ivan Provilkov from Together AI address this challenge with a new approach, demonstrating how to learn strong reasoning skills directly from expert demonstrations. Their method, called RARO, establishes a dynamic interplay between a system that generates answers and a system that evaluates them, encouraging the generator to closely mimic expert solutions. RARO significantly outperforms existing verifier-free methods across a range of tasks, including mathematical problem solving and creative writing, and reveals a pathway to robust reasoning even when automated checking is unavailable.
Model Performance Across Math and Poetry Tasks
This research explores the capabilities of large language models across diverse tasks, including mathematical problem solving and creative writing. The study examines how these models perform when presented with a series of prompts and challenges, analyzing their ability to reason, generate text, and critique existing content. The experiments involve multi-turn interactions, in which the model responds to prompts and receives feedback, demonstrating its capacity for iterative reasoning. Observations reveal that the model exhibits task-specific expertise, performing differently on mathematical problems than on poetry writing.
The model also applies chain-of-thought reasoning, explaining its steps and justifying its answers, which makes its decision-making process easier to follow. Qualitative evaluations of its responses highlight strengths and weaknesses, providing insights for further improvement, and the model successfully generates free-verse poetry, showcasing its creative potential and linguistic understanding.
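For concreteness, the snippet below sketches the kind of multi-turn, chain-of-thought exchange described here. The prompt, feedback, and message format are invented for illustration and are not drawn from the paper's experiments.

```python
# Invented example of a multi-turn, chain-of-thought exchange (not from the paper).
# Roles follow the common chat-message convention.
conversation = [
    {"role": "user",
     "content": "Solve: a train covers 180 km in 2.5 hours. What is its average speed?"},
    {"role": "assistant",
     "content": "Step 1: speed = distance / time. Step 2: 180 / 2.5 = 72. Answer: 72 km/h."},
    {"role": "user",
     "content": "Feedback: correct, but also give the speed in metres per second."},
    {"role": "assistant",
     "content": "Step 3: 72 km/h = 72 * 1000 / 3600 = 20 m/s. Answer: 72 km/h (20 m/s)."},
]

for turn in conversation:
    print(f"{turn['role']}: {turn['content']}")
```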
Robust Reasoning via Adversarial Inverse Reinforcement Learning
Scientists have developed RARO, a novel reinforcement learning algorithm that equips large language models with robust reasoning skills using only expert demonstrations. This approach circumvents the need for task-specific verifiers or human preference data, offering a streamlined training process. RARO establishes a dynamic interaction between a policy, which generates answers, and a relativistic critic, which evaluates those answers against expert responses. The policy and the critic are trained simultaneously through reinforcement learning, and the authors identify stabilization techniques that are crucial for keeping this adversarial training robust.
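The description above implies a GAIL-style adversarial loop: the policy samples answers, the relativistic critic compares each sample with the expert answer for the same prompt, and both sides are updated in turn. The toy sketch below illustrates that structure with a tiny discrete answer space standing in for a language model; every name, loss, and hyperparameter is illustrative, not the paper's implementation.

```python
# Toy, self-contained sketch of a policy-vs-relativistic-critic loop in the
# spirit of the description above. A discrete "answer space" stands in for a
# language model; none of this is the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ANSWERS = 8                    # toy discrete answer vocabulary
EXPERT_ANSWER = 3                # the expert demonstration always picks answer 3

policy_logits = nn.Parameter(torch.zeros(N_ANSWERS))        # "policy"
critic = nn.Sequential(nn.Linear(2 * N_ANSWERS, 32),        # relativistic critic:
                       nn.Tanh(), nn.Linear(32, 1))         # scores (answer, reference) pairs

opt_p = torch.optim.Adam([policy_logits], lr=0.05)
opt_c = torch.optim.Adam(critic.parameters(), lr=0.05)

def pair_features(a, b):
    """Concatenate one-hot encodings of an answer and a reference answer."""
    return torch.cat([F.one_hot(a, N_ANSWERS).float(),
                      F.one_hot(b, N_ANSWERS).float()], dim=-1)

for step in range(500):
    # Policy samples a batch of answers.
    dist = torch.distributions.Categorical(logits=policy_logits)
    answers = dist.sample((64,))
    expert = torch.full_like(answers, EXPERT_ANSWER)

    # Critic update: the expert answer should outscore the policy answer
    # when each is scored relative to the other.
    score_pol = critic(pair_features(answers, expert)).squeeze(-1)
    score_exp = critic(pair_features(expert, answers)).squeeze(-1)
    c_loss = F.softplus(score_pol - score_exp).mean()
    opt_c.zero_grad()
    c_loss.backward()
    opt_c.step()

    # Policy update: REINFORCE with the critic's relative score as the reward
    # (high reward = the critic thinks the sample looks like the expert's).
    reward = torch.sigmoid(critic(pair_features(answers, expert)).squeeze(-1)).detach()
    p_loss = -(dist.log_prob(answers) * (reward - reward.mean())).mean()
    opt_p.zero_grad()
    p_loss.backward()
    opt_p.step()

print("policy distribution:", torch.softmax(policy_logits.detach(), dim=-1))
```

In this toy setting the policy distribution should concentrate on the expert answer as the critic learns to score non-expert answers below it; the stabilization techniques mentioned above address the fact that, at language-model scale, such adversarial training is far less forgiving.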
Researchers evaluated RARO on a controlled reasoning task, where it significantly outperformed existing methods that lack verification mechanisms and nearly matched the performance of reinforcement learning with verifiable rewards. Scaling the method to the DeepMath dataset, a collection of mathematical problems, further confirmed RARO’s effectiveness and showed scaling trends comparable to those of reinforcement learning with verifiable rewards. Finally, the team demonstrated RARO’s generalizability on Poetry Writing, where it substantially outperformed all baselines, highlighting its effectiveness on open-ended tasks that lack explicit verification criteria. Together, these results show that RARO elicits strong reasoning performance from expert demonstrations alone, enabling robust learning of reasoning even when task-specific verifiers are unavailable.
RARO Matches Verifier-Based Reasoning Without Verification
Scientists have developed RARO (Relativistic Adversarial Reasoning Optimization), a new method that trains large language models to reason effectively using only expert demonstrations, without requiring task-specific verifiers. This work addresses a significant challenge in artificial intelligence, enabling robust reasoning even when direct verification of answers is impossible. The research team framed the problem as inverse reinforcement learning, developing a system where a model learns to mimic expert answers and a critic distinguishes between the model’s responses and those of the expert. Experiments on a controlled reasoning task demonstrate that RARO significantly outperforms existing methods that do not use verification, achieving performance nearly equivalent to systems trained with verifiable rewards.
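Read as adversarial imitation, this framing admits a natural min-max objective. The expression below is one plausible way to write it down from the description above, with prompts x, expert answers y*, and policy answers y; it is an assumed formalization, not necessarily the paper's exact loss.

```latex
% Assumed relativistic adversarial objective (notation illustrative, not the paper's).
% D scores an (answer, reference) pair for a prompt x; \sigma is the logistic function.
\max_{D}\;\min_{\pi}\;
\mathbb{E}_{(x,\,y^{*})\sim\mathcal{D},\; y\sim\pi(\cdot\mid x)}
\Big[
  \log \sigma\!\big(D(x,\, y^{*},\, y)\big)
  \;+\;
  \log\!\big(1-\sigma\big(D(x,\, y,\, y^{*})\big)\big)
\Big]
```

Because sampled answers are discrete text, the inner minimization cannot be differentiated through directly; the policy is instead trained with reinforcement learning, using a monotone transform of the critic's relative score as its reward, while the critic is trained to rank expert answers above policy answers.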
Scaling the method to the DeepMath dataset, a collection of mathematical problems, further confirms RARO’s effectiveness; the system not only surpasses baseline models without verification but also exhibits similar scaling trends to reinforcement learning methods that do utilize verification. This indicates that RARO can effectively elicit strong reasoning capabilities from demonstration data alone. The team then tested RARO’s generalizability by applying it to the non-verifiable domain of poetry writing, demonstrating substantial performance gains over all baseline models.
Relativistic Reasoning Learns From Expert Demonstrations
Researchers have developed RARO (Relativistic Adversarial Reasoning Optimization), a new method for training large language models to perform complex reasoning tasks. This approach bypasses the need for task-specific verifiers, which are often expensive or unavailable, by learning directly from expert demonstrations. RARO establishes an interaction between a policy, which generates answers, and a relativistic critic, which evaluates the quality of those answers in comparison to the expert demonstrations. Through this adversarial process, the model learns to mimic expert reasoning without requiring explicit feedback on the correctness of each step. Experiments across a range of tasks, including logical puzzles, mathematical problem solving, and creative writing, show that RARO significantly outperforms existing verifier-free methods. Notably, it achieves performance comparable to systems trained with verifiers on certain tasks and exhibits scaling trends similar to those of such systems.
👉 More information
🗞 Escaping the Verifier: Learning to Reason via Demonstrations
🧠 ArXiv: https://arxiv.org/abs/2511.21667
