Optimising budget allocation for online advertising while maximising returns remains a significant hurdle for advertisers. Mingxuan Song from Peking University, Yusen Huo from Alibaba Group, and Bohan Zhou, together with Shenglin Yin, Zhen Xiao, Jieyi Long, and colleagues, tackle this problem with a novel approach detailed in their research. They introduce DARA, a dual-phase framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to overcome the limitations of traditional reinforcement learning (RL) methods when historical data is scarce. By combining LLM reasoning with a fine-grained optimisation process, and employing a new post-training strategy called GRPO-Adaptive, DARA demonstrably improves cumulative advertiser value under budget constraints, offering a promising step forward in AI-Generated Bidding (AIGB) technology.
DARA framework blends LLMs and reinforcement learning
The team achieved this by recognising the inherent structure of budget allocation, namely the separation between high-level planning and fine-grained optimisation, and designing a system to exploit this distinction. Building on this foundation, DARA decomposes the decision-making process into two distinct stages: a ‘reasoner’ that generates initial budget plans using in-context prompting, and a ‘fine-grained optimizer’ that refines these plans based on feedback-driven reasoning. Experiments show that this dual-phase approach effectively combines LLM generalisation with RL’s optimisation capabilities, leading to superior results. The researchers also designed a simulation environment, inspired by sim-to-real learning, to generate diverse budget allocation scenarios and address the challenge of limited few-shot data.
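To make this decomposition concrete, here is a minimal Python sketch of such a two-stage loop; the function names and prompt formats are hypothetical stand-ins, not the authors' implementation:

```python
# Illustrative sketch of a dual-phase decision loop; `call_llm` stands in for
# any chat-completion client, and all names and prompt formats are hypothetical.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (local or hosted model)."""
    raise NotImplementedError

def reason_initial_plan(examples: list[str], objective: str, total_budget: float) -> str:
    """Stage 1: a few-shot reasoner drafts a budget plan via in-context prompting."""
    shots = "\n\n".join(examples)  # a handful of past allocation episodes in the prompt
    prompt = (
        f"{shots}\n\n"
        f"Objective: {objective}\n"
        f"Total budget: {total_budget}\n"
        "Propose a budget split across time periods:"
    )
    return call_llm(prompt)

def refine_plan(plan: str, feedback: str) -> str:
    """Stage 2: a fine-grained optimiser adjusts the plan using feedback signals."""
    prompt = (
        f"Current plan:\n{plan}\n\n"
        f"Observed per-period feedback:\n{feedback}\n"
        "Revise the allocation to raise cumulative value within the budget:"
    )
    return call_llm(prompt)
```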
The simulation environment is modelled after real-world data distributions, enabling robust training and evaluation of the proposed framework. The work opens new avenues for personalised budget planning in data-scarce advertising environments, offering a promising solution for optimising return on investment in dynamic online marketplaces. Furthermore, ablation studies revealed that a single-stage prompting setup struggles to effectively capture data regularities and execute fine-grained budget optimisation, highlighting the importance of the dual-phase approach. The research establishes that separating the generalisation and optimisation processes, and aligning model capabilities with the demands of each phase, is crucial for achieving robust and effective budget allocation. This breakthrough reveals a pathway to combine the strengths of LLMs and RL, paving the way for more intelligent and adaptive advertising systems capable of maximising advertiser value even with limited data and dynamic environments.
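As an illustration of the sim-to-real idea, the following sketch generates synthetic allocation episodes with diminishing-returns value curves; the specific distributions are assumptions for demonstration, not the paper's real-data calibration:

```python
import numpy as np

def sample_scenario(n_periods: int = 24, seed=None) -> dict:
    """Draw one synthetic budget-allocation episode.

    Each period's value curve follows diminishing returns,
    value_t(spend) = a_t * spend ** b_t, with coefficients drawn from
    simple distributions; the shapes here are illustrative only.
    """
    rng = np.random.default_rng(seed)
    a = rng.lognormal(mean=0.0, sigma=0.5, size=n_periods)  # period-level scale
    b = rng.uniform(0.4, 0.9, size=n_periods)               # concavity: diminishing returns
    budget = float(rng.uniform(500, 5000))                  # episode-level budget constraint
    return {"a": a, "b": b, "budget": budget}

def episode_value(spend: np.ndarray, scenario: dict) -> float:
    """Cumulative advertiser value of a spend plan that respects the budget."""
    assert spend.sum() <= scenario["budget"] + 1e-6, "plan exceeds budget"
    return float(np.sum(scenario["a"] * spend ** scenario["b"]))
```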
DARA framework targets few-shot AIGB optimisation
Researchers recognised that advertisers frequently operate with personalised objectives but lack extensive interaction data, creating ‘few-shot’ scenarios where RL struggles to generalise effectively. DARA’s first stage addresses this with its reasoner, which harnesses the LLM’s ability to learn from limited examples provided directly in the input prompt, enabling rapid adaptation to new advertiser goals. The generated plans are then refined by a fine-grained optimiser employing feedback-driven reasoning: this stage meticulously adjusts the initial plans, leveraging feedback signals to improve performance within budget constraints. This adaptive process refines the LLM’s ability to handle the precise calculations required for budget allocation.
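As a toy illustration of feedback-driven reasoning, the helper below renders per-period outcomes into a textual signal the optimiser could consume; the real feedback format used by DARA is an assumption here:

```python
def build_feedback(period_spend: list[float], period_value: list[float]) -> str:
    """Render per-period outcomes into a textual feedback signal.

    This just shows the feedback-driven idea: surface where realised ROI is
    high or low so the LLM can shift budget toward under-exploited periods.
    The exact format is hypothetical, not taken from the paper.
    """
    lines = []
    for t, (s, v) in enumerate(zip(period_spend, period_value)):
        roi = v / s if s > 0 else 0.0
        lines.append(f"period {t}: spend={s:.1f}, value={v:.1f}, roi={roi:.2f}")
    return "\n".join(lines)
```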
Experiments employed both real-world and synthetic data environments to rigorously evaluate DARA’s performance. The system consistently outperforms existing baseline methods in terms of cumulative advertiser value, a key metric for assessing advertising campaign success. The research team meticulously measured performance, demonstrating that DARA consistently achieves higher returns on investment under budgetary limitations. The innovative methodology allows for effective budget allocation even with limited data, paving the way for more personalised and efficient online advertising campaigns. The technique also marks a significant improvement on complex decision-making tasks where LLMs previously struggled with numerical precision and generalisation.
DARA framework boosts AI bidding performance
Scientists achieved a significant breakthrough in online advertising by developing DARA, a novel dual-phase framework for budget allocation, demonstrating consistently superior performance over existing methods. The team measured marginal ROI variance to assess the balance of budget distribution across time periods, finding that DARA significantly outperforms baseline algorithms across all steps, reducing variance and demonstrating strong stability in dynamic environments. Data shows that DARA consistently exhibits lower variance compared to DPO, a method relying on supervised or preference-aligned fine-tuning, indicating a more balanced and effective allocation of resources over time. Notably, DARA achieved particularly significant improvements in reducing marginal ROI variance during later stages of allocation, attributed to its fine-grained optimisation phase leveraging recent outcomes for more reliable guidance.
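For readers who want the metric made concrete, the sketch below computes one plausible reading of marginal ROI variance, namely the variance of per-period return ratios; the paper's exact estimator may differ:

```python
import numpy as np

def marginal_roi_variance(period_spend, period_value, eps: float = 1e-8) -> float:
    """Variance of per-period ROI across the allocation horizon.

    Lower variance means the plan extracts similar value per unit of spend in
    every period, i.e. a more balanced allocation over time. This is one
    plausible reading of the metric, assumed for illustration.
    """
    roi = np.asarray(period_value, dtype=float) / (np.asarray(period_spend, dtype=float) + eps)
    return float(np.var(roi))
```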
DARA’s separation of early generalisation from late-stage adaptation allows the model to focus on precise, context-aware refinements, a capability lacking in single-stage baselines. Tests show that DARA’s performance surpasses that of ABPlanner, another RL-based budget allocation strategy, particularly in later steps, due to enhanced numerical sensitivity and the structured separation of tasks. An ablation study, utilising four configurations, demonstrated that a single LLM performing end-to-end planning resulted in the highest marginal ROI variance, highlighting the limitations of LLMs in numerically sensitive, temporally dynamic tasks. Introducing RL fine-tuning to a single-phase setup offered a slight improvement, confirming that reinforcement learning can enhance numerical reasoning, but gains remained limited.
Adopting a dual-phase architecture without RL substantially improved performance, confirming that task decomposition is critical for complex budget allocation. Measurements confirm that the full DARA model, incorporating RL fine-tuning into both LLMs, achieved the best overall performance. The Few-Shot Reasoner benefited from RL by generating more strategic initial plans, while the Fine-grained Optimizer became more numerically sensitive and responsive to feedback, resulting in significantly lower variance and more stable performance, substantially amplifying the RL effect within the dual-phase architecture. The research was conducted on enterprise-level servers with 8 NVIDIA H20 GPUs, each with 96GB of memory, and each experiment was repeated five times to ensure reliability, reporting mean performance with 95% confidence intervals.
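For reference, the four ablation configurations described above can be written out as a small table of flags (labels are illustrative, not the paper's naming):

```python
# Four ablation configurations, paraphrasing the description above.
ABLATIONS = {
    "single_llm":       {"dual_phase": False, "rl_finetune": False},  # highest ROI variance
    "single_llm_rl":    {"dual_phase": False, "rl_finetune": True},   # slight improvement
    "dual_phase_no_rl": {"dual_phase": True,  "rl_finetune": False},  # substantial gain
    "dara_full":        {"dual_phase": True,  "rl_finetune": True},   # best overall
}
```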
DARA framework boosts AI ad spend optimisation
Researchers also introduced GRPO-Adaptive, a post-training strategy that improves both the reasoning abilities and numerical precision of LLMs by dynamically updating a reference policy during training. The authors acknowledge that performance relies on appropriate KL regularisation to maintain a stable policy, and that an intermediate update frequency provides the best balance during training. Future work could explore extending this dual-phase architecture to other budget allocation problems or investigating alternative methods for dynamically updating the reference policy. This work offers a promising direction for integrating few-shot LLM reasoning with reinforcement learning fine-tuning in budget allocation scenarios, potentially leading to more effective and efficient online advertising campaigns.
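Reading that description literally, the sketch below shows what a GRPO-style update with a periodically refreshed reference policy could look like. The `sample_group`, `reward_fn`, and `log_probs_fn` helpers are assumed interfaces, and both the loss form and the update schedule are illustrative rather than the authors' exact objective:

```python
import torch

def grpo_adaptive_step(policy, ref_policy, optimizer, prompts, step,
                       sample_group, reward_fn, log_probs_fn,
                       group_size: int = 8, kl_coef: float = 0.05,
                       ref_update_every: int = 200) -> None:
    """One GRPO-style update with a periodically refreshed reference policy.

    Assumed interfaces: sample_group(policy, prompt, k) returns k sampled
    responses; reward_fn(prompt, response) returns a scalar reward;
    log_probs_fn(model, prompt, response) returns the summed token
    log-probability as a tensor.
    """
    losses = []
    for prompt in prompts:
        responses = sample_group(policy, prompt, group_size)
        rewards = torch.tensor([reward_fn(prompt, r) for r in responses])
        # Group-relative advantage: normalise rewards within the sampled group.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        for response, a in zip(responses, adv):
            logp = log_probs_fn(policy, prompt, response)
            with torch.no_grad():
                ref_logp = log_probs_fn(ref_policy, prompt, response)
            # Policy-gradient term plus KL regularisation toward the reference,
            # using the simple log-ratio as a per-sample KL estimate.
            losses.append(-a * logp + kl_coef * (logp - ref_logp))
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The "adaptive" part: copy the current policy into the reference at an
    # intermediate frequency, rather than freezing it for all of training.
    if step % ref_update_every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```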
👉 More information
🗞 DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs
🧠 ArXiv: https://arxiv.org/abs/2601.14711
