Researchers Enhance Language Models with Diversity-Aware Reinforcement Learning for High-Quality, Diverse Outputs

Language models frequently sacrifice the breadth of their responses in pursuit of accuracy and helpfulness, creating a significant limitation for tasks demanding creativity and exploration. Tianjian Li, Yiming Zhang from Carnegie Mellon University, and Ping Yu, alongside Swarnadeep Saha, Daniel Khashabi, and Jason Weston from Meta, address this challenge with a new framework called Diversity-Aware Reinforcement Learning, or DARLING. This method moves beyond simply assessing surface-level differences in text, instead employing a learned function to measure semantic diversity, and integrates this with a quality assessment during the learning process. The results demonstrate that DARLING consistently improves both the quality and novelty of generated text across a range of tasks, from creative writing to solving mathematical problems, and importantly, reveals that actively encouraging diversity actually enhances the learning process itself, leading to even better responses.

Language Model Diversity and Exploration Challenges

Large language models often lose diversity during training, becoming overly focused on generating high-reward responses and producing repetitive outputs. This limits their creativity, hinders exploration, and harms the training process itself, since diversity is crucial for effective learning. Existing methods for encouraging diversity are often computationally expensive or don’t generalize well across different models and training setups. This research introduces DARLING, a simple reward-weighting mechanism built on top of Group Relative Policy Optimization (GRPO) that scales each response’s reward according to how semantically distinct it is from the other responses sampled for the same prompt.

The core idea is to incentivize larger gradient updates for responses that are both high-quality and diverse, encouraging the model to move more decisively toward varied solutions. DARLING doesn’t require changes to the data generation process or complex hyperparameter tuning; it’s a straightforward modification to the reward function. Results demonstrate that it improves both the quality and diversity of generated text, particularly in tasks like math problem-solving. Analysis also revealed that models sometimes attempt to “hack” the diversity reward by generating irrelevant text after a correct answer, simply to increase their diversity score, highlighting the importance of careful reward design.
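A minimal sketch of that reward modification in Python; the function and variable names are illustrative assumptions, not the authors’ code:

```python
# Minimal sketch: combine a quality reward and a diversity score
# multiplicatively, so only responses that score well on both receive a
# large reward (and hence a large policy-gradient update).
def shaped_reward(quality: float, diversity: float) -> float:
    """Quality * diversity: a correct-but-repetitive response earns less
    credit than a correct and semantically novel one."""
    return quality * diversity

# Example: both answers are correct (quality = 1.0), but the more
# diverse one drives a larger update.
print(shaped_reward(quality=1.0, diversity=0.2))  # 0.2
print(shaped_reward(quality=1.0, diversity=0.9))  # 0.9
```

Within a GRPO-style update, a larger combined reward translates into a larger advantage and therefore a bigger gradient step toward that response.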

This work offers a simple yet effective way to address the critical problem of diversity loss in language models, leading to more creative, robust, and efficient AI systems. Imagine teaching a student to solve math problems: you don’t just want them to get the right answer; you also want them to understand the concepts and approach problems from different angles. DARLING is like giving the language model extra encouragement to explore different solution paths, even when they don’t immediately lead to the correct answer, helping it develop a deeper understanding and become a more creative and robust problem-solver.

Diversity Metric for Language Model Outputs

Researchers developed Diversity-Aware Reinforcement Learning (DARLING) to address a common challenge in large language models: the tendency to prioritize accuracy and helpfulness at the expense of generating diverse outputs. This method aims to balance response quality with semantic diversity, enabling more creative and exploratory applications like brainstorming and storytelling. The core of DARLING lies in a learned partition function that measures diversity beyond simple lexical variations, allowing the system to identify genuinely distinct ideas. To quantify diversity, scientists trained a binary classifier to determine semantic equivalence between responses, grouping similar outputs into clusters.

This classifier forms the basis of a diversity metric, calculating the average pairwise distance between a given response and all others within a set. For example, when presented with prompts requesting programming jokes, the system accurately grouped responses utilizing puns on the word “bug” as semantically equivalent, assigning a lower diversity score than to jokes employing entirely different concepts. This nuanced approach moves beyond surface-level comparisons, capturing deeper differences in meaning. DARLING then integrates this diversity signal into a reinforcement learning framework, combining it with a quality reward during training.
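That diversity score can be sketched in a few lines: a pairwise equivalence judgment (here a toy stand-in for the learned classifier) supplies the distances, and each response is scored by its average distance to the rest of the sampled set. This is an illustrative reconstruction under those assumptions, not the authors’ implementation:

```python
from typing import Callable, List

def diversity_scores(
    responses: List[str],
    is_equivalent: Callable[[str, str], bool],
) -> List[float]:
    """Average pairwise distance of each response to all others, where the
    distance of a pair is 0 if the classifier calls them semantically
    equivalent and 1 otherwise."""
    scores = []
    for i, r_i in enumerate(responses):
        others = [r for j, r in enumerate(responses) if j != i]
        distances = [0.0 if is_equivalent(r_i, r_j) else 1.0 for r_j in others]
        scores.append(sum(distances) / len(distances) if distances else 1.0)
    return scores

def toy_equivalent(a: str, b: str) -> bool:
    """Toy stand-in for the learned classifier: treat two jokes as
    equivalent iff both pun on the word 'bug'."""
    return "bug" in a.lower() and "bug" in b.lower()

jokes = [
    "Why do programmers hate nature? Too many bugs.",
    "My code has no bugs, only undocumented features... and bugs.",
    "Why do Java developers wear glasses? Because they don't C#.",
]
print(diversity_scores(jokes, toy_equivalent))  # [0.5, 0.5, 1.0]
```

The two “bug” puns are grouped together and receive lower diversity scores, while the conceptually different joke scores highest, mirroring the behaviour described above.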

Researchers multiplied the quality and diversity rewards, rather than simply adding them, to avoid one signal overshadowing the other. Furthermore, the team refined standard reinforcement learning techniques by switching from sequence-level to token-level loss averaging and removing standard deviation normalization, improving training stability and reducing noise. This innovative approach enables the generation of more varied and creative outputs from large language models.
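A hedged sketch of those two loss-level changes on top of a GRPO-style policy-gradient objective, omitting the clipping and importance-ratio machinery a full implementation would include; tensor names and shapes are assumptions for illustration:

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) combined quality*diversity rewards for one
    prompt. Advantages are mean-centred only; note the absence of the
    usual division by the group standard deviation."""
    return rewards - rewards.mean()

def token_level_pg_loss(
    logprobs: torch.Tensor,    # (group_size, seq_len) per-token log-probs
    advantages: torch.Tensor,  # (group_size,) one advantage per response
    mask: torch.Tensor,        # (group_size, seq_len) 1 for real tokens, 0 for padding
) -> torch.Tensor:
    per_token = -logprobs * advantages.unsqueeze(1) * mask
    # Token-level averaging: divide by the total token count across the
    # whole group rather than averaging per sequence first.
    return per_token.sum() / mask.sum()
```

Dropping the standard-deviation term keeps the scale of the combined reward intact, and averaging over tokens rather than sequences prevents long responses from being down-weighted, which matches the stability and noise-reduction rationale given above.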

Diversity and Quality in Language Models

Researchers have developed a new framework, Diversity-Aware Reinforcement Learning (DARLING), that simultaneously optimizes both the quality and diversity of responses generated by large language models. This addresses a common problem where improving accuracy often comes at the expense of generating varied and creative outputs, limiting usefulness in tasks like brainstorming or storytelling. The team discovered that by explicitly encouraging diversity, they could not only maintain high-quality responses but also unlock greater exploration during the learning process. At the heart of DARLING is a method for measuring diversity that goes beyond simple lexical differences, utilizing a learned classifier to assess semantic similarity between responses.

This allows the framework to group responses with equivalent meanings, effectively identifying and rewarding genuinely novel outputs. The researchers then combine this diversity signal with a traditional quality reward during reinforcement learning, amplifying the benefits of responses that are both accurate and distinct. Experiments demonstrate that DARLING consistently outperforms standard quality-focused reinforcement learning baselines across a range of tasks. Specifically, on five benchmarks involving creative writing and instruction following, DARLING achieved higher quality and novelty in generated responses.

Importantly, the framework also excels in verifiable tasks, such as solving competition math problems, achieving improvements in both solution accuracy and the variety of solutions generated. The results show that explicitly optimizing for diversity doesn’t just broaden the range of outputs; it actually leads to higher-quality responses overall, suggesting that exploration and quality are intrinsically linked. This work delivers a powerful new approach to language model training, paving the way for more creative, versatile, and effective AI systems.

Diversity Boosts Quality in Language Models

This work introduces DARLING, a new online reinforcement learning method that simultaneously optimizes for both the quality and diversity of language model outputs. Unlike previous approaches, which often struggle to maintain diversity during training, DARLING preserves a broad range of responses while also improving overall quality. Experiments across various model sizes and tasks, covering both problems with verifiable solutions and more open-ended creative tasks, show that DARLING consistently outperforms methods focused solely on quality. The research shows that explicitly encouraging diversity during training not only expands the range of generated ideas but also, somewhat surprisingly, enhances the quality of responses, likely by promoting more thorough exploration of the solution space. While the authors acknowledge that prompting methods can also improve diversity, they note that these often come at the cost of reduced quality, a trade-off avoided by directly modifying the training objective. Future work could apply DARLING to even more complex tasks and investigate how the learned diversity signal interacts with different model architectures and training datasets.

👉 More information
🗞 Jointly Reinforcing Diversity and Quality in Language Model Generations
🧠 ArXiv: https://arxiv.org/abs/2509.02534
