Researchers are tackling the challenge of delivering truly relevant news to users across diverse online platforms. Mengdan Zhu from Emory University, together with Yufan Zhao and Tao Di from Microsoft and colleagues Yulan Yan and Liang Zhao, presents a novel reinforcement learning framework designed to infer deeper user interests from varied online signals. The work addresses a critical limitation of current news recommendation systems: the inability to move beyond simple behavioural tracking to understand underlying information needs. By generating high-quality search queries that represent user interests, and by employing a distillation technique for scalable deployment, the team demonstrates consistent improvements in both interest quality and recommendation performance through extensive offline experiments and large-scale online A/B testing in a production news system.
Delivering personalised content requires understanding what truly interests each user, not just what they’ve clicked on before. This work offers a novel approach to inferring those deeper interests and deploying it within real-world news platforms.
Researchers have developed a new reinforcement learning framework that leverages large language models to discern user interests from diverse online signals, significantly improving news recommendation systems. This work addresses a critical challenge in cross-domain recommendation: moving beyond simple user behaviours to capture nuanced, reusable interests at scale.
The team trained a language model to generate lists of relevant news search queries, effectively translating complex user interactions into actionable information needs. By framing query-list generation as a policy optimisation problem and employing a technique called Group Relative Policy Optimisation with multiple reward signals, the system learns to anticipate what information a user seeks, even when that need isn’t directly expressed through clicks or searches.
Systematic investigation into computational resources revealed consistent performance gains with increased model capacity and inference-time sampling, a scaling-like behaviour that suggests further improvements are possible with additional compute. To facilitate real-world deployment, the researchers successfully transferred the learned capability from a large, computationally intensive model to a compact “student” model via on-policy distillation.
This distillation process preserves the core benefits of the advanced model while enabling fast and efficient online serving. Extensive offline evaluations, detailed ablation studies, and large-scale A/B tests within a production news platform confirm that this approach consistently enhances both the quality of user interest modelling and the overall performance of the news recommendation system.
The core innovation lies in reformulating user interest discovery as a query-list generation task. Given a user’s history across various online activities, the language model constructs a series of search queries that represent their underlying interests. This approach moves beyond traditional embedding-based methods, which struggle to capture complex sequential signals and often rely on superficial similarities between users.
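To make the reformulation concrete, the sketch below shows one way heterogeneous user signals might be packed into a single instruction that asks a language model for a query list. The prompt template, signal fields, and wording are illustrative assumptions, not the paper’s actual prompt:

```python
def build_prompt(signals):
    """Format heterogeneous user signals (clicks, searches, browsing) into
    one instruction asking the model for a list of news search queries.
    The template here is an illustrative assumption, not the paper's prompt."""
    lines = [f"- [{s['type']}] {s['text']}" for s in signals]
    return (
        "Given the user's recent online activity:\n"
        + "\n".join(lines)
        + "\nGenerate a list of news search queries capturing the user's "
          "underlying information needs, one query per line."
    )

# Hypothetical cross-domain history: a news click plus a web search.
history = [
    {"type": "click", "text": "Fed signals possible rate cut"},
    {"type": "search", "text": "mortgage rates forecast"},
]
print(build_prompt(history))
```

The point of the reformulation is that the model’s output is itself a set of executable search queries, so downstream retrieval can consume it directly instead of comparing opaque embeddings.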
The model is trained using five distinct reward signals: retrieval alignment, interest coverage, query specificity, intra-list diversity, and structural validity. Together these ensure that the generated queries are not only relevant but also comprehensive, focused, varied, and well-structured. Furthermore, the study demonstrates the scaling properties of large language models for interest modelling, showing that increasing either model size or the amount of inference-time sampling consistently improves performance.
The on-policy distillation technique is a key enabler for practical deployment, allowing the benefits of a powerful, reasoning-driven model to be realised within the constraints of a large-scale production environment. Validation through both offline experiments and live A/B testing confirms the effectiveness of this approach, representing a significant step towards more intelligent and personalised news recommendation.
Performance of Qwen2.5-32B and scaling benefits of model capacity on news recommendation
Optimising news query lists via reinforcement learning and multi-faceted reward signals
A reinforcement learning framework underpinned this work, training a model to generate lists of high-quality news search queries representing user interests from diverse signals. The core of the methodology involves formulating query-list generation as a policy optimisation problem, employing Group Relative Policy Optimisation (GRPO) to refine the model’s ability to generate relevant query lists.
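The group-relative step that gives GRPO its name can be sketched in a few lines: several query lists are sampled for the same user prompt, and each one’s advantage is its reward normalised against the group’s mean and standard deviation, removing the need for a separate critic model. This is a minimal illustration of that computation, not the production training code:

```python
import statistics

def grpo_advantages(group_rewards):
    """GRPO-style advantages: normalise each sampled output's reward
    against the mean and standard deviation of its own sampling group,
    so no learned value/critic model is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

# Four query lists sampled for one user prompt, scored by the reward suite.
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = grpo_advantages(rewards)
print(advantages)  # above-average samples get positive advantages
```

Samples that beat their group’s average are reinforced and below-average ones are penalised, which is what lets a purely generative model improve against scalar reward signals.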
GRPO was implemented with a carefully constructed suite of five reward signals, each designed to assess a specific facet of query-list quality: retrieval alignment, interest coverage, query specificity, intra-list diversity, and structural validity. This multi-faceted reward system encourages the generation of query lists that are not only relevant to the user’s inferred needs but also comprehensive, precise, varied, and logically coherent.
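One plausible way to combine the five facet scores into a single scalar reward is sketched below. The facet names follow the paper, but the equal weighting, the linear combination, and the use of structural validity as a hard gate are all illustrative assumptions, not the published formulation:

```python
FACETS = ["retrieval_alignment", "interest_coverage",
          "query_specificity", "intra_list_diversity",
          "structural_validity"]

def composite_reward(scores, weights=None):
    """Combine the five per-facet scores (assumed to lie in [0, 1]) into
    one scalar. Weights and the validity gate are assumptions for
    illustration only."""
    weights = weights or {f: 1.0 / len(FACETS) for f in FACETS}
    # Treat structural validity as a gate: a malformed list earns nothing.
    if scores["structural_validity"] == 0.0:
        return 0.0
    return sum(weights[f] * scores[f] for f in FACETS)
```

A gated design like this would explain the ablation finding that removing any single component degrades results: each facet constrains a failure mode the others cannot see.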
To understand the impact of computational resources, the research systematically varied inference-time sampling and model capacity. Inference-time sampling refers to the number of candidates drawn during query generation, while model capacity denotes the size and complexity of the language model itself. Both parameters were increased incrementally to observe their effect on interest quality and subsequent retrieval performance, revealing consistent, scaling-like improvements.
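The inference-time-sampling axis can be illustrated with a simple best-of-n selection: draw several candidate query lists from the policy and keep the one a scoring function rates highest. The selection rule and the toy scorer below are assumptions for illustration; the paper only reports that more sampling helps:

```python
def best_of_n(generate, score, n):
    """Inference-time scaling sketch: sample n candidate query lists
    and keep the highest-scoring one. Larger n raises the chance that
    some candidate covers the user's true interests."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy demonstration with stand-in candidates, scored by list length.
pool = iter([
    ["ai chips"],
    ["ai chips", "nvidia earnings", "gpu export rules"],
    ["ai chips", "nvidia earnings"],
])
best = best_of_n(lambda: next(pool), score=len, n=3)
print(best)  # the three-query candidate wins under this toy scorer
```

This makes the trade-off explicit: each extra sample costs one more forward pass of the model, which is exactly why the distillation step described next matters for serving.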
This detailed analysis demonstrates the potential for further gains with increased compute. Recognising the challenges of deploying a computationally intensive model in a production environment, an on-policy distillation technique was applied. This process transferred the learned policy from a large, complex “teacher” language model to a compact “student” model.
The student model maintains the core interest modelling capabilities while significantly reducing latency and increasing throughput, enabling scalable online serving within the news recommendation system. This distillation ensures practical implementation without sacrificing performance.
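The “on-policy” part of the distillation means the student generates the query lists itself, and the teacher then scores those same tokens, so supervision lands exactly where the student actually errs. A minimal token-level sketch of such an objective is shown below; the per-token reverse-KL-style estimate is an illustrative assumption, not the paper’s exact loss:

```python
def on_policy_distill_loss(student_logprobs, teacher_logprobs):
    """On-policy distillation sketch: the student samples a query list,
    then both models score the student's own tokens. The mean per-token
    log-prob gap (a reverse-KL-style estimate on the sampled trajectory)
    is driven towards zero, pulling the student towards the teacher on
    the outputs the student actually produces."""
    assert len(student_logprobs) == len(teacher_logprobs)
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)

# Student is confident on tokens the teacher rates as unlikely: positive loss.
loss = on_policy_distill_loss([-1.0, -2.0], [-1.5, -2.5])
print(loss)
```

Training on the student’s own samples, rather than on a fixed teacher-generated corpus, is what lets the compact model keep the teacher’s behaviour at a fraction of the serving cost.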
Reinforcement learning infers user interests via generated search queries
The relentless pursuit of relevance in online news has led to a fascinating, and increasingly practical, application of reinforcement learning. For years, news recommendation systems have relied on relatively superficial signals, such as articles a user clicks on and time spent reading, to predict future interests. But these approaches struggle to grasp the underlying reasons behind a user’s behaviour, limiting their ability to suggest genuinely novel and engaging content.
This work represents a step change, moving beyond simple pattern matching to infer deeper, reusable interests from a wider range of user signals. What distinguishes this research is the elegant use of reinforcement learning to generate search queries, effectively asking the system to articulate what the user might be looking for rather than simply reacting to past clicks.
The scaling behaviour observed with increased computational power is particularly encouraging, hinting at a pathway towards more sophisticated and personalised recommendations without prohibitive costs. The distillation process, shrinking a large, powerful model into a more manageable one, is crucial for real-world deployment. However, the reliance on reward signals introduces inherent challenges.
The ablation studies reveal a delicate balancing act; removing any single component degrades performance, highlighting the complexity of defining “meaningful informational interest”. The system still needs careful calibration to avoid reward hacking, where the algorithm optimises for the reward itself rather than genuine user satisfaction. Future work will likely focus on refining these reward functions and exploring more robust methods for filtering out noise from user behaviour.
Ultimately, this isn’t just about better news recommendations. It’s about building systems that can truly understand user intent across diverse online activities, a capability with implications far beyond the news domain, potentially impacting search, education, and even personal assistants. The ability to translate complex user behaviour into actionable queries is a significant advance, and one that promises to reshape how we interact with information online.
👉 More information
🗞 Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation
🧠 ArXiv: https://arxiv.org/abs/2602.15005
