Researchers are tackling the significant challenge of optimising data preparation for large language models (LLMs), recognising that high-quality training data is crucial for performance. Yicheng Chen from Fudan University and Zerun Ma, Xinchen Xie, Yining Li, and Kai Chen from Shanghai AI Laboratory present a novel approach to automating the design of ‘data recipes’, the pipelines used to transform raw data into effective training corpora. Their work introduces DataChef-32B, a system that employs reinforcement learning to generate complete data recipes given a target task and a set of available data sources. Demonstrated across six tasks, DataChef-32B generates recipes that achieve performance comparable to those curated by human experts, notably surpassing Qwen3-1.7B on the AIME’25 benchmark with a score of 66.7, and it offers a pathway towards self-evolving AI systems and automated LLM training.
The performance of these models increasingly relies on meticulously curated datasets, assembled using a process known as a ‘data recipe’: a pipeline that transforms raw information into usable training corpora. Despite progress in using LLMs to automate individual steps within this process, designing the overall recipe remains a manual, expertise-driven undertaking. The researchers have developed DataChef-32B, a system capable of generating complete data recipes given a target benchmark and a selection of available data sources.

DataChef-32B employs online reinforcement learning, guided by a proxy reward that accurately predicts how well a candidate recipe will perform on downstream tasks. To overcome the challenges of limited data and the expense of evaluating full model training, the team developed a ‘data verifier’, a method for assessing training-data quality without requiring complete training runs. This provides a rapid, low-cost reward signal for the reinforcement learning process, enabling scalable and efficient recipe optimisation.

The research introduces a comprehensive task pool encompassing 31 benchmarks across 10 domains, including mathematics, coding, finance, and medicine, alongside 257 associated datasets. Each task draws on between eight and fifteen source datasets, ensuring diversity in the training material. Across six independent evaluation tasks, the recipes produced by DataChef-32B achieve performance comparable to those painstakingly crafted by human experts, highlighting the system’s ability not only to automate recipe creation but also to enhance the capabilities of the base LLM.

At the core of the methodology is a policy language model that generates complete data recipes for adapting base LLMs to target tasks. This moves beyond automating individual data processing steps and focuses instead on end-to-end recipe creation. The system receives a task definition, comprising a natural language instruction, the available data sources, and an evaluation metric, and outputs a data pipeline formulated as Python scripts. These scripts detail the precise sequence of operations that transform raw data into a training dataset.

To enable automated evaluation, a dedicated Data Verifier assesses the generated training data and provides a scalar reward signal reflecting both data quality and the pipeline’s executability. The verifier samples a subset of the generated data and evaluates it against a rubric-guided scheme, assigning scores based on criteria such as validity, format correctness, and task relevance. A key innovation is an integrated Code Interpreter, which executes the generated Python scripts to confirm they function as intended and to surface errors.

Reinforcement learning, specifically the Group Relative Policy Optimisation (GRPO) algorithm, drives the learning process. The policy LLM is trained online, iteratively refining its recipe-generation capabilities based on the rewards received from the Data Verifier. This allows the system to explore a vast space of possible data recipes and identify those that yield the highest downstream performance on the target task. The recipes DataChef-32B produces incorporate several key data processing steps, including outlier filtering, chain-of-thought synthesis, data standardisation, mixing, and de-duplication; a hypothetical sketch of such a recipe follows.
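To make the recipe format concrete, here is a minimal, hypothetical sketch of the kind of Python pipeline a policy model might emit. The file names, field names, and mixing weights are illustrative assumptions rather than DataChef’s actual output, and the chain-of-thought synthesis step, which would require a model call, is omitted for brevity.

```python
"""Illustrative data-recipe script (hypothetical, not DataChef's actual output).

It mirrors the steps named in the paper summary: outlier filtering,
standardisation, mixing, and de-duplication.
"""
import hashlib
import json
import random

SOURCES = ["math_qa.jsonl", "competition_problems.jsonl"]  # assumed source files


def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def filter_outliers(records, min_len=20, max_len=4000):
    # Drop records whose answer text is implausibly short or long.
    return [r for r in records if min_len <= len(r.get("answer", "")) <= max_len]


def standardise(records):
    # Map heterogeneous source fields onto a single prompt/response schema.
    return [{"prompt": r["question"], "response": r["answer"]} for r in records]


def deduplicate(records):
    # Keep only the first occurrence of each distinct prompt.
    seen, unique = set(), []
    for r in records:
        key = hashlib.md5(r["prompt"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def mix(per_source, weights, target_size=10_000, seed=0):
    # Sample from each cleaned source in proportion to its mixing weight.
    rng = random.Random(seed)
    mixed = []
    for records, w in zip(per_source, weights):
        k = min(len(records), int(target_size * w))
        mixed.extend(rng.sample(records, k))
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    per_source = [
        deduplicate(standardise(filter_outliers(load_jsonl(p)))) for p in SOURCES
    ]
    dataset = mix(per_source, weights=[0.6, 0.4])
    with open("train.jsonl", "w", encoding="utf-8") as f:
        for r in dataset:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```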
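The paper summary does not spell out the verifier’s exact scoring formula, so the following is only a sketch of a rubric-guided verifier under assumed criteria names and weights: it samples a subset of the generated data, scores each record for validity, format correctness, and task relevance, and aggregates the results into a single scalar reward.

```python
"""Sketch of a rubric-guided data verifier (assumed rubric and weights)."""
import random

RUBRIC_WEIGHTS = {"validity": 0.4, "format": 0.3, "relevance": 0.3}  # assumed


def score_record(record, task_keywords):
    # Each criterion is scored in [0, 1]; a real verifier would use
    # LLM-based judgements rather than these keyword heuristics.
    validity = 1.0 if record.get("prompt") and record.get("response") else 0.0
    fmt = 1.0 if isinstance(record.get("response"), str) else 0.0
    text = (record.get("prompt", "") + " " + str(record.get("response", ""))).lower()
    relevance = 1.0 if any(k in text for k in task_keywords) else 0.0
    return {"validity": validity, "format": fmt, "relevance": relevance}


def verify(dataset, task_keywords, sample_size=256, seed=0):
    """Return a scalar reward in [0, 1] for a candidate training set."""
    if not dataset:
        return 0.0  # an empty or non-executable recipe earns no reward
    sample = random.Random(seed).sample(dataset, min(sample_size, len(dataset)))
    total = 0.0
    for record in sample:
        scores = score_record(record, task_keywords)
        total += sum(RUBRIC_WEIGHTS[c] * scores[c] for c in RUBRIC_WEIGHTS)
    return total / len(sample)
```

The essential shape is what matters: sample, score each record against a rubric, and collapse the scores into one number the reinforcement-learning loop can optimise against.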
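GRPO itself needs no learned value function: for each task it samples a group of candidate recipes, scores each with the verifier reward, and normalises the rewards within the group to obtain advantages for the policy update. The sketch below shows only this group-relative advantage computation, with hypothetical example values; the full clipped policy-gradient update follows standard PPO-style machinery.

```python
"""Group-relative advantage computation used by GRPO (illustrative)."""
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise verifier rewards within a sampled group of recipes.

    Each recipe's advantage is its reward minus the group mean, divided by
    the group standard deviation, so better-than-average recipes are
    reinforced and worse ones are suppressed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four candidate recipes for the same task, scored by the data verifier.
rewards = [0.42, 0.78, 0.55, 0.31]
print(group_relative_advantages(rewards))
# Higher-reward recipes receive positive advantages; the policy LLM is then
# updated with a clipped policy-gradient objective weighted by these values.
```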
The data verifier proved to be a reliable predictor of downstream performance, providing a low-cost, immediate reward signal for the reinforcement learning process and accelerating the optimisation of data recipes. The framework was also designed for an open-ended setting, accepting arbitrary tasks and datasets as input, thereby moving beyond evaluations limited to static datasets and curated initial code.

The relentless pursuit of better LLMs has largely focused on scaling up, increasing parameters and data volume, yet a fundamental bottleneck remains: assembling the right training data in the first place. For years this has been a painstaking, manual effort, reliant on human expertise to curate and refine datasets, and DataChef-32B is a step towards removing that bottleneck.

The reliance on a proxy reward function does introduce a potential limitation: while the proxy accurately predicts downstream performance here, its generalisability to other domains remains an open question. Furthermore, the system currently operates within a defined pool of existing data sources; truly novel data discovery or creation is not yet within its capabilities. Looking ahead, we can anticipate a convergence of these automated recipe-generation techniques with active learning strategies, potentially unlocking a virtuous cycle of improvement that accelerates progress in LLMs and extends the principle to other areas of artificial intelligence where data curation is a critical constraint.
👉 More information
🗞 DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2602.11089
