Language models frequently sacrifice the breadth of their responses in pursuit of accuracy and helpfulness, a significant limitation for tasks that demand creativity and exploration. Tianjian Li, Yiming Zhang from Carnegie Mellon University, and Ping Yu, alongside Swarnadeep Saha, Daniel Khashabi, and Jason Weston from Meta, address this challenge with a new framework called Diversity-Aware Reinforcement Learning, or DARLING. Rather than assessing surface-level differences in text, the method uses a learned function to measure semantic diversity and combines that signal with a quality reward during training. The results demonstrate that DARLING consistently improves both the quality and the novelty of generated text across a range of tasks, from creative writing to solving mathematical problems, and, importantly, reveal that actively encouraging diversity enhances the learning process itself, leading to even better responses.
Language Model Diversity and Exploration Challenges
Large language models often lose diversity during training, becoming overly focused on generating high-reward responses and producing repetitive outputs. This limits their creativity, hinders exploration, and impairs training itself, since diversity is crucial for effective learning. Existing methods for encouraging diversity are often computationally expensive or fail to generalize across different models and training setups. This research introduces Diversity-Aware Reinforcement Learning (DARLING), a simple reward-shaping mechanism for online reinforcement learning that scales each response's reward by a learned measure of how semantically distinct it is from the other responses sampled for the same prompt.
The core idea is to incentivize larger gradient updates for responses that are both high quality and diverse, encouraging the model to move more decisively toward varied solutions. DARLING does not require changes to the data generation process or elaborate hyperparameter tuning; it is a straightforward modification to the reward function. Results demonstrate that it improves both the quality and the diversity of generated text, particularly in tasks such as math problem-solving. Analysis also revealed that models sometimes attempt to “hack” the diversity reward by appending irrelevant text after a correct answer simply to raise their diversity score, highlighting the importance of careful reward design.
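To make the reward shaping concrete, here is a minimal sketch of the multiplicative combination described above, under the assumption that a quality scorer and a semantic diversity measure are available; `quality_reward` and `semantic_diversity` are hypothetical placeholders, not the paper's actual API.

```python
# Minimal sketch of multiplicative reward shaping (assumed interface).
# `quality_reward` and `semantic_diversity` are hypothetical placeholders
# for whatever quality scorer and learned diversity measure are used.

def shaped_reward(response, group, quality_reward, semantic_diversity):
    """Return a reward that is large only when a response is both good and novel.

    Multiplying the two signals (rather than adding them) means neither one
    can dominate: a low-quality or highly redundant response receives a small
    reward even if the other signal is high.
    """
    quality = quality_reward(response)               # e.g. verifier score in [0, 1]
    diversity = semantic_diversity(response, group)  # share of the group it differs from, in [0, 1]
    return quality * diversity
```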
This work offers a simple yet effective way to address the critical problem of diversity loss in language models, leading to more creative, robust, and efficient AI systems. Imagine teaching a student to solve math problems: you don’t just want them to get the right answer; you also want them to understand the concepts and approach problems from different angles. DARLING is like giving the language model extra encouragement to explore different solution paths, even when they don’t immediately lead to the correct answer, helping it develop a deeper understanding and become a more creative and robust problem-solver.
Diversity Metric for Language Model Outputs
Researchers developed Diversity-Aware Reinforcement Learning (DARLING) to address a common challenge in large language models: the tendency to prioritize accuracy and helpfulness at the expense of generating diverse outputs. This method aims to balance response quality with semantic diversity, enabling more creative and exploratory applications like brainstorming and storytelling. The core of DARLING lies in a learned partition function that measures diversity beyond simple lexical variations, allowing the system to identify genuinely distinct ideas. To quantify diversity, scientists trained a binary classifier to determine semantic equivalence between responses, grouping similar outputs into clusters.
This classifier forms the basis of a diversity metric that scores each response by its average pairwise distance to all other responses in the same set. For example, when presented with prompts requesting programming jokes, the system grouped responses built on puns about the word “bug” as semantically equivalent, assigning them a lower diversity score than jokes based on entirely different concepts. This nuanced approach moves beyond surface-level comparisons, capturing deeper differences in meaning. DARLING then integrates this diversity signal into a reinforcement learning framework, combining it with a quality reward during training.
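The pairwise construction is simple enough to sketch. The snippet below shows one plausible way to turn a binary equivalence judgment into a per-response diversity score, where `is_equivalent` stands in for the trained classifier; the exact scoring used in the paper may differ in detail.

```python
from typing import Callable, List

def diversity_scores(responses: List[str],
                     is_equivalent: Callable[[str, str], bool]) -> List[float]:
    """Score each response by the fraction of its peers it is NOT equivalent to.

    With a binary equivalence judgment, the "average pairwise distance" of a
    response is simply the share of other responses that express a different
    idea: a joke reusing the same "bug" pun as most of the group scores low,
    while a joke built on a genuinely different concept scores high.
    """
    n = len(responses)
    scores = []
    for i in range(n):
        if n <= 1:
            scores.append(1.0)
            continue
        distinct = sum(
            1 for j in range(n)
            if j != i and not is_equivalent(responses[i], responses[j])
        )
        scores.append(distinct / (n - 1))
    return scores
```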
Researchers multiplied the quality and diversity rewards, rather than simply adding them, to avoid one signal overshadowing the other. Furthermore, the team refined standard reinforcement learning techniques by switching from sequence-level to token-level loss averaging and removing standard deviation normalization, improving training stability and reducing noise. This innovative approach enables the generation of more varied and creative outputs from large language models.
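The training-side changes can also be sketched. The toy loss below combines the two rewards multiplicatively, mean-centres them within the group without dividing by the standard deviation, and averages the policy-gradient loss over all tokens rather than per sequence. It is a simplified REINFORCE-style sketch that omits the clipping, importance ratios, and KL terms of a full GRPO-style objective, and the tensor names are assumptions rather than the paper's code.

```python
import torch

def token_level_policy_loss(logprobs: torch.Tensor,
                            response_mask: torch.Tensor,
                            quality: torch.Tensor,
                            diversity: torch.Tensor) -> torch.Tensor:
    """Toy policy loss for one prompt's group of sampled responses.

    logprobs:      (num_responses, max_len) token log-probabilities
    response_mask: (num_responses, max_len) 1 for real tokens, 0 for padding
    quality:       (num_responses,) quality rewards
    diversity:     (num_responses,) diversity rewards
    """
    rewards = quality * diversity              # multiplicative combination
    advantages = rewards - rewards.mean()      # mean-centred, no std division
    # Weight every token of a response by that response's advantage.
    per_token = -(advantages[:, None] * logprobs) * response_mask
    # Token-level averaging: divide by the total token count of the group,
    # not by the number of sequences.
    return per_token.sum() / response_mask.sum()
```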
Diversity and Quality in Language Models
Researchers have developed a new framework, Diversity-Aware Reinforcement Learning (DARLING), that simultaneously optimizes both the quality and diversity of responses generated by large language models. This addresses a common problem where improving accuracy often comes at the expense of generating varied and creative outputs, limiting usefulness in tasks like brainstorming or storytelling. The team discovered that by explicitly encouraging diversity, they could not only maintain high-quality responses but also unlock greater exploration during the learning process. At the heart of DARLING is a method for measuring diversity that goes beyond simple lexical differences, utilizing a learned classifier to assess semantic similarity between responses.
This allows the framework to group responses with equivalent meanings, effectively identifying and rewarding genuinely novel outputs. The researchers then combine this diversity signal with a traditional quality reward during reinforcement learning, amplifying the benefits of responses that are both accurate and distinct. Experiments demonstrate that DARLING consistently outperforms standard quality-focused reinforcement learning baselines across a range of tasks. Specifically, on five benchmarks involving creative writing and instruction following, DARLING achieved higher quality and novelty in generated responses.
Importantly, the framework also excels in verifiable tasks, such as solving competition math problems, achieving improvements in both solution accuracy and the variety of solutions generated. The results show that explicitly optimizing for diversity doesn’t just broaden the range of outputs, it actually leads to higher-quality responses overall, suggesting that exploration and quality are intrinsically linked. This breakthrough delivers a powerful new approach to language model training, paving the way for more creative, versatile, and effective AI systems.
Diversity Boosts Quality in Language Models
This work introduces DARLING, a new online reinforcement learning method that simultaneously optimizes for both the quality and the diversity of language model outputs. Unlike previous approaches, which often struggle to maintain diversity during training, DARLING preserves a broad range of responses while also improving overall quality. Experiments across various model sizes and tasks, covering both problems with verifiable solutions and more open-ended creative work, show that DARLING consistently outperforms methods focused solely on quality. The research finds that explicitly encouraging diversity during training not only expands the range of generated ideas but also, somewhat surprisingly, improves response quality, likely by promoting more thorough exploration of the solution space. While the authors acknowledge that prompting methods can also improve diversity, they note that these often come at the cost of reduced quality, a trade-off avoided by directly modifying the training objective. Future work could apply DARLING to even more complex tasks and investigate how the learned diversity signal interacts with different model architectures and training datasets.
👉 More information
🗞 Jointly Reinforcing Diversity and Quality in Language Model Generations
🧠 ArXiv: https://arxiv.org/abs/2509.02534
