Direct Preference Optimization has emerged as a leading technique for refining large language models, prized for its stability and ease of use, but it can struggle with noisy or unreliable data. Cheol Woo Kim, Shresth Verma, and Mauricio Tec, all from Harvard University, along with Milind Tambe, address this challenge with a new algorithm, DPO-PRO, which enhances robustness without sacrificing efficiency. Their approach leverages distributionally robust optimization, but unlike previous methods, DPO-PRO specifically targets uncertainty in preference data, avoiding excessive caution and keeping computational demands low. The team demonstrates that DPO-PRO not only improves performance on standard benchmarks, but also delivers more reliable results when faced with imperfect preference signals, representing a significant step towards more trustworthy and adaptable language models.
We propose using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. Existing methods often suffer from excessive conservatism and high computational cost. DPO-PRO accounts for uncertainty in the preference distribution through a lightweight DRO formulation, offering a computationally efficient approach to robust training. Importantly, unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, thereby avoiding unnecessary conservatism and incurring negligible computational overhead. We further demonstrate that DPO-PRO is mathematically equivalent to a regularized DPO objective that penalizes model overconfidence.
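For context, the standard DPO objective that DPO-PRO builds on is written below, followed by a purely schematic way to read the "regularized DPO" interpretation; the symbols λ and Ω are placeholders introduced here for illustration, not notation from the paper.

```latex
% Standard DPO objective over a dataset D of (prompt x, preferred
% response y_w, dispreferred response y_l); sigma is the logistic
% function, beta the DPO temperature, pi_ref the frozen reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]

% Schematic reading of the regularized-DPO equivalence (lambda and Omega
% are illustrative placeholders, not the paper's exact formulation):
\mathcal{L}_{\mathrm{DPO\text{-}PRO}}(\theta) \;=\;
  \mathcal{L}_{\mathrm{DPO}}(\theta) \;+\; \lambda\,\Omega(\theta)
```

Here Ω stands for a term that grows when the model's implied preference probabilities become overconfident, that is, when they saturate toward 0 or 1.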
LLMs Automate Rewards for Call Optimization
Scientists have developed a novel system that uses large language models (LLMs) to automatically design reward functions for reinforcement learning agents. Traditionally, creating effective reward functions, the rules that guide an AI’s learning, requires significant manual effort. The researchers tackle this challenge by leveraging LLMs to generate and refine these reward functions in a real-world scenario: allocating phone calls to mothers in India. The goal is to prioritize calls to those who would benefit most, considering factors such as education level, income, and age. The system employs a pipeline in which an LLM generates Python code representing candidate reward functions for a given goal.
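To make this concrete, here is a hypothetical example of the kind of reward function such a pipeline might produce for the call-allocation task; the feature names, scales, and weights are illustrative assumptions, not values from the paper.

```python
def reward(education_level: int, income: float, age: int) -> float:
    """Hypothetical LLM-generated reward: a higher value means the
    beneficiary is prioritized for a call. All scales and weights below
    are illustrative assumptions, not taken from the paper."""
    # Favor beneficiaries with less formal education (level assumed on a 1-7 scale).
    education_term = (7 - education_level) / 6.0
    # Favor lower-income beneficiaries (income assumed as monthly income).
    income_term = 1.0 / (1.0 + income / 10000.0)
    # Mildly favor younger mothers (age in years), floored at zero.
    age_term = max(0.0, (35 - age) / 35.0)
    # Keep the reward simple and strictly positive, in the spirit of the
    # prompt guidelines described below.
    return 1.0 + 2.0 * education_term + 1.5 * income_term + 0.5 * age_term
```

An LLM judge then compares candidate functions like this one in pairs, as described next.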
Another instance of the LLM then acts as a judge, evaluating candidate reward functions on their performance and their alignment with the desired objective. This judging step produces a dataset of preferences, in which the LLM compares pairs of reward functions and indicates which is better. The preference data is then used to fine-tune the LLM, improving its ability to generate effective reward functions, and the generated functions are tested in a simulated environment to assess their performance. The team provides detailed examples of the prompts used to interact with the LLM, including instructions for generating reward-function code and for evaluating the resulting functions.
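As a rough sketch of what one record in that preference dataset might look like, consider the following; the field names and the (prompt, chosen, rejected) layout are assumptions for exposition, not the paper's exact schema.

```python
# Hypothetical schema for one preference example produced by the LLM judge.
# "chosen" and "rejected" hold two candidate reward-function implementations,
# with "chosen" the one the judge rated better for the stated goal.
preference_example = {
    "prompt": (
        "Write a Python reward function that prioritizes calls to the "
        "beneficiaries most likely to benefit, using education level, "
        "income, and age."
    ),
    "chosen": "def reward(education_level, income, age): ...",
    "rejected": "def reward(education_level, income, age): ...",
}
# A dataset of such records can then drive preference fine-tuning of the
# reward-designing LLM (for example, with a DPO-style objective).
```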
These prompts emphasize the need for simple, positive, and increasing rewards. Experiments demonstrate the feasibility of this approach, showcasing how LLMs can automate the design of reward functions and improve the performance of reinforcement learning agents in a practical application. This research delivers a significant advancement in artificial intelligence, enabling the creation of more adaptable and efficient learning systems.
Robust LLMs with Distributionally Robust Optimization
Scientists have developed DPO-PRO, a new method for refining large language models (LLMs) that significantly improves robustness to noisy data. Researchers achieved this by incorporating distributionally robust optimization (DRO) into the DPO framework, creating a system that accounts for uncertainty in how preferences are expressed. The core of DPO-PRO lies in its lightweight DRO formulation, which focuses specifically on uncertainty within the preference distribution itself, avoiding unnecessary conservatism seen in previous approaches.
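In broad strokes, a DRO objective of this kind trains against the worst case over a small neighbourhood of the observed preference distribution, as sketched below; the ambiguity set U_ε and its radius ε are generic placeholders rather than the paper's specific construction.

```latex
% Generic DRO template: train against the worst-case preference
% distribution q in an ambiguity set U_epsilon around the empirical
% preference distribution \hat{p}; l_theta is the per-example loss.
\min_{\theta}\;\;
  \sup_{q\,\in\,\mathcal{U}_{\varepsilon}(\hat{p})}\;
  \mathbb{E}_{q}\!\left[\ell_\theta(x, y_w, y_l)\right]
```

According to the authors, restricting this uncertainty to the preference distribution itself, rather than to the full data distribution, is what keeps the formulation lightweight and avoids the over-conservatism of earlier DRO-based variants.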
This targeted focus keeps computational demands low while improving robustness, allowing efficient training without sacrificing performance. Importantly, the team demonstrated that DPO-PRO is mathematically equivalent to a regularized DPO objective that penalizes the model when it exhibits overconfidence under weak preference signals.
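As a rough illustration of what a regularized DPO loss that discourages overconfidence could look like, here is a minimal PyTorch-style sketch; the specific penalty (pushing implied preference probabilities away from 0 and 1) and the weight `lam` are assumptions for exposition, not DPO-PRO's actual formulation.

```python
import torch
import torch.nn.functional as F


def regularized_dpo_loss(policy_logp_w, policy_logp_l,
                         ref_logp_w, ref_logp_l,
                         beta=0.1, lam=0.1):
    """Standard DPO loss plus an illustrative overconfidence penalty.

    Each argument is a tensor of summed log-probabilities of the chosen (w)
    or rejected (l) response under the trainable policy or the frozen
    reference model. `lam` and the penalty form are illustrative assumptions.
    """
    # Implied preference margin, exactly as in standard DPO.
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    dpo_loss = -F.logsigmoid(margin).mean()

    # Illustrative regularizer: the implied preference probability p should
    # not saturate toward 0 or 1; -(p * (1 - p)) is smallest when p is 0.5.
    p = torch.sigmoid(margin)
    overconfidence_penalty = -(p * (1.0 - p)).mean()

    return dpo_loss + lam * overconfidence_penalty
```

In a sketch like this, increasing `lam` trades a tighter fit to the observed preferences for more caution when the preference signal is weak.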
Experiments on standard alignment benchmarks and a real-world public health task confirm that DPO-PRO consistently improves robustness to noisy preference signals compared to standard DPO and existing DRO-based variants, demonstrating its practical applicability. Future work could explore applying DPO-PRO to more complex scenarios and investigate methods for automatically tuning its parameters. This research delivers a significant advancement in machine learning, enabling the creation of more reliable and adaptable language models capable of aligning with human preferences even in the presence of imperfect or uncertain data.
👉 More information
🗞 Lightweight Robust Direct Preference Optimization
🧠 ArXiv: https://arxiv.org/abs/2510.23590
