Masked Diffusion Language Models (MDLMs) are a powerful new approach to text generation, offering faster generation and richer use of context than traditional autoregressive methods. A critical mismatch between how these models learn and how they generate text, however, has remained largely unaddressed. Haoyu He, Katrin Renz, Yong Cao, and Andreas Geiger from the University of Tübingen now demonstrate that this discrepancy, where models are trained on randomly masked tokens but generate text by progressively revealing them, significantly impacts performance. Their research introduces Masked Diffusion Policy Optimization (MDPO), a technique that frames the generation process as a sequential decision-making problem and trains the model to refine text in the same way it generates it. MDPO matches existing state-of-the-art methods with a dramatic reduction in training time, delivers substantial improvements on challenging mathematical and reasoning tasks, and establishes a new direction for bridging the gap between pre-training and inference in masked diffusion models.
These models combine the strengths of diffusion models and language modeling techniques, functioning by masking parts of the input and learning to reconstruct the hidden information. The research explores how different approaches to processing the sequence, such as working with blocks of text, impact performance. The team discovered that using medium-sized blocks generally yields the best results, balancing parallel processing with sequential understanding. Very large blocks, however, can hinder performance, and the number of reconstruction steps also plays a crucial role; more steps improve accuracy but increase computational demands.
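To make this block-wise, iterative decoding concrete, the sketch below decodes a fully masked sequence left to right in blocks, committing the most confident predictions at each refinement step. The model call (`predict_logits`), vocabulary size, and block/step parameters are stand-ins chosen for illustration, not the configuration used in the paper.

```python
# Minimal sketch of block-wise, confidence-based unmasking for an MDLM.
# The forward pass is a random stand-in; only the decoding loop matters here.
import numpy as np

MASK_ID, VOCAB = -1, 1000
rng = np.random.default_rng(0)

def predict_logits(tokens):
    # Stand-in for an MDLM forward pass: one row of logits per position.
    return rng.normal(size=(len(tokens), VOCAB))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode(seq_len=64, block_size=16, steps_per_block=8):
    tokens = np.full(seq_len, MASK_ID)
    per_step = -(-block_size // steps_per_block)            # tokens committed per step (ceil)
    for start in range(0, seq_len, block_size):             # left-to-right over blocks
        for _ in range(steps_per_block):                    # iterative refinement within a block
            masked = np.where(tokens[start:start + block_size] == MASK_ID)[0] + start
            if masked.size == 0:
                break
            probs = softmax(predict_logits(tokens))
            conf = probs[masked].max(axis=-1)               # confidence of each masked position's top guess
            chosen = masked[np.argsort(conf)[-per_step:]]   # unmask the most confident positions
            tokens[chosen] = probs[chosen].argmax(axis=-1)
    return tokens

print(decode()[:10])
```

Smaller blocks behave more autoregressively (more sequential context, less parallelism), while larger blocks and fewer refinement steps trade accuracy for speed, which is the balance the experiments above probe.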
When re-masking portions of the sequence, strategies that target positions where the model is least confident prove most effective: experiments on the MATH500 dataset show that confidence-based remasking consistently outperforms random selection. The central issue, however, is the mismatch between how MDLMs are trained and how they operate at inference time; they see randomly masked tokens during training, but during generation they progressively reveal structure by unmasking the tokens they are most confident about. To bridge this gap, the team framed the unmasking process as a sequential decision-making problem and applied reinforcement learning, leveraging the properties of diffusion models to optimize the model with significantly fewer training updates than existing methods. Remarkably, MDPO achieves results comparable to state-of-the-art methods with sixty times fewer training steps, and on challenging benchmarks it delivers average improvements of 9.6% on MATH500 and 54.2% on Countdown.
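The sequential decision-making framing can be illustrated with a toy policy-gradient loop: each unmasking step is treated as an action, the finished sequence is scored with a verifiable reward, and the log-probabilities of the committed tokens are reinforced. Everything below (`TinyMDLM`, `reward_fn`, the commit schedule, the REINFORCE-style loss) is an assumed, simplified stand-in; the actual MDPO objective and how it exploits the diffusion structure are specified in the paper, not by this sketch.

```python
# Schematic REINFORCE-style update over an MDLM unmasking trajectory (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ, STEPS = 50, 0, 16, 4   # token id 0 doubles as the mask id in this toy

class TinyMDLM(nn.Module):
    """Stand-in masked LM: embeds tokens and predicts per-position logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)
    def forward(self, tokens):               # (seq,) -> (seq, vocab)
        return self.head(self.emb(tokens))

def reward_fn(tokens):
    # Stand-in verifiable reward, e.g. 1.0 when a generated answer checks out.
    return (tokens % 7 == 0).float().mean().item()

model = TinyMDLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def rollout_and_update():
    tokens = torch.full((SEQ,), MASK_ID)
    step_log_probs = []                       # log-prob of the tokens committed at each step (the "actions")
    for _ in range(STEPS):
        masked = (tokens == MASK_ID).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        probs = F.softmax(model(tokens), dim=-1)
        conf, preds = probs[masked].max(dim=-1)
        k = min(-(-SEQ // STEPS), masked.numel())     # commit a fixed budget of tokens per step
        top = conf.topk(k).indices                    # most confident masked positions
        tokens[masked[top]] = preds[top]
        step_log_probs.append(torch.log(conf[top] + 1e-9).sum())
    reward = reward_fn(tokens)
    loss = -reward * torch.stack(step_log_probs).sum()   # REINFORCE over the whole unmasking trajectory
    opt.zero_grad()
    loss.backward()
    opt.step()
    return reward

print(rollout_and_update())
```

The key point the sketch captures is that training now follows the same confidence-ordered unmasking trajectory used at inference, so the reward signal shapes exactly the behavior that generation relies on.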
While MDLMs progressively refine their predictions during generation, standard training ignores this iterative process, and prior decoding schemes freeze each token once it has been predicted. To address the latter, the researchers introduce Running Confidence Remasking (RCR), a training-free strategy that continuously tracks per-token confidence and re-masks low-confidence positions so that earlier predictions can be revisited. RCR consistently improves performance on its own and provides additional gains when combined with MDPO, as the sketch below illustrates.
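A minimal sketch of the running-confidence idea follows: at every step, each committed token's confidence is re-evaluated under the current context, and positions whose confidence drops below a threshold are re-masked and predicted again. The model call, the threshold, and the commit schedule are illustrative assumptions; the exact RCR rule is described in the paper.

```python
# Running-confidence remasking (RCR-style) decoding loop, as an illustrative sketch.
import numpy as np

MASK_ID, VOCAB, SEQ, STEPS = -1, 1000, 32, 12
rng = np.random.default_rng(0)

def predict_probs(tokens):
    # Stand-in forward pass returning per-position probabilities.
    logits = rng.normal(size=(len(tokens), VOCAB))
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

tokens = np.full(SEQ, MASK_ID)
running_conf = np.zeros(SEQ)                       # latest confidence of each committed token

for _ in range(STEPS):
    probs = predict_probs(tokens)
    conf, preds = probs.max(axis=-1), probs.argmax(axis=-1)

    # Refresh the confidence of already-committed tokens under the current context.
    committed = np.where(tokens != MASK_ID)[0]
    running_conf[committed] = probs[committed, tokens[committed]]

    # Re-mask committed tokens whose running confidence has dropped too low
    # (the 0.002 threshold is purely illustrative).
    low = committed[running_conf[committed] < 0.002]
    tokens[low], running_conf[low] = MASK_ID, 0.0

    # Commit the most confident masked positions this step.
    masked_idx = np.where(tokens == MASK_ID)[0]
    if masked_idx.size == 0:
        break
    k = min(-(-SEQ // STEPS), masked_idx.size)
    chosen = masked_idx[np.argsort(conf[masked_idx])[-k:]]
    tokens[chosen], running_conf[chosen] = preds[chosen], conf[chosen]

print(int((tokens != MASK_ID).sum()), "of", SEQ, "positions committed")
```

Because nothing is ever permanently frozen, a token that looked plausible early on can be revised once more of the surrounding context has been filled in, which is what makes the strategy usable at inference time without any retraining.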
These findings highlight the value of explicitly addressing the training-inference gap in MDLMs and offer a pathway to more efficient and effective language model training. The authors acknowledge that their work focuses on tasks with clearly defined, verifiable reward functions, and future research will explore applying MDPO to more general tasks using modern validation techniques.
👉 More information
🗞 MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
🧠 ArXiv: https://arxiv.org/abs/2508.13148
