Artificial Intelligence Training Avoids Repeating Patterns to Sustain Reasoning Skills

Researchers are addressing a critical limitation in the self-play training of large language models, where initial performance improvements often diminish over time. Gengsheng Li, Jinghan He, Shijie Wang, Ruiqi Liu, and Renrui Zhang from the Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, together with Dan Zhang and Junfeng Fang from the National University of Singapore, Zijun Yao from Tsinghua University, and Haiyun Guo and Jinqiao Wang from the Institute of Automation, Chinese Academy of Sciences, demonstrate that this degradation stems from a phenomenon they term ‘Diversity Illusion’: despite apparent variation in the training data, the models are repeatedly exposed to fundamentally similar reasoning challenges. To combat this, they propose R-Diverse, which incorporates a Memory-Augmented Penalty and Skill-Aware Measurement, demonstrably sustains performance gains across ten diverse reasoning benchmarks, and surpasses existing self-play methodologies.

Scientists have developed a new framework, R-Diverse, to address a critical limitation in the self-improvement of large language models (LLMs) through self-play, tackling a phenomenon termed “Diversity Illusion”. Current self-play methods, such as R-Zero, often demonstrate initial gains that subsequently plateau or diminish, hindering sustained progress in reasoning capabilities.

The research identifies that existing diversity constraints focus on superficial variations in questions, failing to prevent the Solver, the LLM being trained, from encountering recurring reasoning challenges disguised as novel problems. R-Diverse introduces two key innovations: Memory-Augmented Penalty (MAP) and Skill-Aware Measurement (SAM). MAP employs a persistent memory bank to actively discourage the re-introduction of previously seen questions, extending the scope of diversity beyond individual training batches.
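To make the idea concrete, here is a minimal sketch of how such a memory-augmented penalty could work, assuming question embeddings are already available; the class and function names below are illustrative and not taken from the paper's code.

```python
# Illustrative sketch of a Memory-Augmented Penalty (MAP); names are hypothetical.
import numpy as np

class MemoryBank:
    """Persistent store of embeddings for questions seen in past iterations."""
    def __init__(self):
        self.embeddings = []

    def add(self, embedding):
        self.embeddings.append(np.asarray(embedding, dtype=float))

    def max_similarity(self, embedding):
        """Cosine similarity between a candidate and its nearest stored question."""
        if not self.embeddings:
            return 0.0
        e = np.asarray(embedding, dtype=float)
        e = e / (np.linalg.norm(e) + 1e-8)
        return max(float(e @ (m / (np.linalg.norm(m) + 1e-8))) for m in self.embeddings)

def map_penalty(candidate_embedding, bank, weight=1.0):
    """Penalty grows as the candidate approaches anything already in memory."""
    return weight * bank.max_similarity(candidate_embedding)
```

In this sketch the penalty is driven by the single nearest stored question, so one close match is enough to discourage a recycled item, which is the behaviour a cross-iteration constraint needs.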

SAM, in turn, shifts the focus from question variation to the actual reasoning skills being exercised, ensuring that the Solver is challenged with genuinely diverse cognitive demands. By evaluating diversity based on the underlying skills required to solve problems, rather than simply the phrasing of the questions, R-Diverse aims to prevent the Solver from becoming trapped in cycles of superficially different but fundamentally similar tasks.

The methodology for investigating sustained gains in LLM self-play addresses the challenge of non-sustained improvement, where initial performance gains diminish with continued self-play iterations. An iterative Challenger-Solver loop was implemented, in which the Challenger generates questions designed to test the Solver’s reasoning abilities, and the Solver is then trained on these questions to enhance its skills.

To pinpoint the cause of performance plateaus, the team meticulously tracked both cross-iteration and intra-iteration repetition, assessing how often newly generated questions resembled those from previous iterations and how similar the underlying reasoning requirements were within a single iteration. The identification of Local Diversity Illusion, arising from enforcing diversity only within a single batch of questions, and Surface Diversity Illusion, where questions appear varied but demand nearly identical reasoning skills, led to the development of R-Diverse.
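A hedged sketch of how these two repetition measures could be tracked, assuming each question has already been embedded; the similarity threshold and function names are assumptions for illustration.

```python
# Hypothetical diagnostic for cross-iteration and intra-iteration repetition.
import numpy as np

def _cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def repetition_rates(current, previous, threshold=0.9):
    """Fraction of current questions that near-duplicate (i) questions from
    earlier iterations (cross-iteration) and (ii) other questions in the same
    batch (intra-iteration)."""
    cross = sum(
        any(_cosine(q, p) >= threshold for p in previous) for q in current
    ) / max(len(current), 1)
    intra = sum(
        any(_cosine(q, r) >= threshold for j, r in enumerate(current) if j != i)
        for i, q in enumerate(current)
    ) / max(len(current), 1)
    return cross, intra
```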

MAP uses a persistent memory bank to discourage the recycling of questions across iterations, preventing the Solver from repeatedly encountering the same problem types. SAM moves beyond superficial question variation to evaluate diversity in terms of the reasoning skills the Solver is actually exercising: each question is mapped to a canonical solver-level program representing its solution procedure, so that diversity can be assessed at the level of underlying reasoning demands.

The Memory-Augmented Penalty (MAP) successfully discouraged the recycling of questions across iterations, preventing the Diversity Illusion. Skill-Aware Measurement (SAM) proved crucial in evaluating diversity based on the reasoning skills required, rather than merely the surface-level variation of questions. This was achieved by mapping natural language questions to solver-level code using the Qwen2.5-Coder-7B model, followed by embedding-based similarity computation using the Jina-Code-Embeddings-1.5B encoder.
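The pipeline can be sketched roughly as follows, assuming both models are accessible through the transformers and sentence-transformers libraries; the repository identifiers, prompt, and generation settings are assumptions based on the article, not the authors' exact setup.

```python
# Hedged sketch of the SAM similarity: map questions to solver-level code,
# then compare the programs in a code-embedding space.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

coder = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-7B-Instruct")  # assumed repo id
embedder = SentenceTransformer("jinaai/jina-code-embeddings-1.5b")            # assumed repo id

def question_to_program(question: str) -> str:
    """Ask the coder model for a canonical solution procedure as Python code."""
    prompt = (
        "Write a short Python function that solves the following problem. "
        "Return only code.\n\nProblem: " + question
    )
    out = coder(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"][len(prompt):]

def phi_sam(question_a: str, question_b: str) -> float:
    """Cosine similarity between the solver-level programs of two questions."""
    emb = embedder.encode([question_to_program(question_a),
                           question_to_program(question_b)])
    return float(util.cos_sim(emb[0], emb[1]))
```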

The resulting similarity metric, φSAM, was integrated into both the repetition penalty and the memory-augmented penalty, ensuring consistent enforcement of reasoning skill diversity. The Challenger training objective was modified to maximise a composite reward balancing difficulty with both within-iteration and cross-iteration novelty, as measured by φSAM.
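In code, that composite reward could look something like the sketch below; the weights and the form of the difficulty term are illustrative assumptions, since the article does not spell out the exact formula.

```python
# Illustrative composite Challenger reward; weights and terms are assumptions.
def challenger_reward(difficulty, batch_sims, memory_sims,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Reward rises with difficulty and falls with phi_SAM similarity to
    questions in the same batch (within-iteration novelty) and to questions
    in the memory bank (cross-iteration novelty)."""
    within_penalty = max(batch_sims) if batch_sims else 0.0
    cross_penalty = max(memory_sims) if memory_sims else 0.0
    return alpha * difficulty - beta * within_penalty - gamma * cross_penalty
```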

Solver training utilised a standard optimisation formulation, with optional memory replay incorporating historical high-quality question-answer pairs at a target ratio of 0.3, ensuring the Solver maintained competence on diverse problem types while adapting to the evolving curriculum generated by the Challenger. On the AIME24 benchmark, R-Diverse achieved 19.17% accuracy with Qwen3-4B-Base, a significant increase from the base model’s 10.94%, and 16.35% with Qwen3-8B-Base.
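Returning to the memory replay mentioned above, a minimal sketch of applying a 0.3 replay ratio when assembling a Solver training batch; the batch size and sampling scheme are assumptions.

```python
# Sketch of mixing historical high-quality pairs into the Solver's batch.
import random

def build_solver_batch(new_pairs, replay_buffer, batch_size=64, replay_ratio=0.3):
    """Roughly 30% of the batch is replayed history; the rest is newly generated."""
    n_replay = min(int(round(batch_size * replay_ratio)), len(replay_buffer))
    n_new = batch_size - n_replay
    batch = (random.sample(replay_buffer, n_replay)
             + random.sample(new_pairs, min(n_new, len(new_pairs))))
    random.shuffle(batch)
    return batch
```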

Experiments, conducted with 5 training steps for the Challenger and 15 for the Solver per iteration, demonstrate the robustness of R-Diverse across varying model scales and benchmark complexities. The central finding is that the problem is not a lack of variation in the questions, but a lack of variation in the types of reasoning required to solve them.

Cleverly disguised repetition can fool traditional diversity metrics, as demonstrated by examples showing how questions that appear distinct can demand nearly identical solution pathways. Their solution, R-Diverse, tackles this head-on with innovations that focus on the skill being tested, not just the surface form of the question. By maintaining a memory of previously encountered reasoning patterns and actively penalising repetition, they’ve demonstrably sustained improvements across a range of benchmarks. While the current work focuses on mathematical and general reasoning tasks, the implications extend far beyond, and the true test will be whether these techniques can be scaled to more complex domains and integrated with other learning paradigms.

👉 More information
🗞 R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training
🧠 ArXiv: https://arxiv.org/abs/2602.13103

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Ultrafast Tracking Reveals How Energy Flows Within Magnetic Materials in Picoseconds

February 17, 2026
Chaotic Sensors Boost Measurement Precision Even with Limited Access

February 17, 2026
Artificial Intelligence Tracks Arm Movements with a Single Camera for Clinical Assessments

February 17, 2026