Recycling Web Data Boosts Large Language Model Performance Significantly

Research demonstrates a method, REWIRE, to enhance low-quality web data for large language model pre-training. Integrating rewritten data with high-quality texts improves performance across 22 tasks, yielding gains of up to 2.5 percentage points compared to training solely on filtered web data, and exceeding benefits from doubling data volume.

The escalating demand for computational resources in training large language models presents a significant challenge: the supply of readily available, high-quality training data is not keeping pace. Researchers are now investigating methods to maximise the utility of existing datasets, rather than relying solely on ever-larger web crawls. A team led by Thao Nguyen (FAIR at Meta & University of Washington), Yang Li (FAIR at Meta), Olga Golovneva (FAIR at Meta), Luke Zettlemoyer (FAIR at Meta & University of Washington), Sewoong Oh (University of Washington), Ludwig Schmidt (Stanford University) and Xian Li (FAIR at Meta) details their approach in the paper, “Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models”. Their work introduces REWIRE, a technique to refine discarded web data and integrate it with existing high-quality texts, demonstrably improving performance across a range of natural language processing tasks.

Data Recycling Enhances Large Language Model Performance

The escalating computational power dedicated to large language models (LLMs) is not matched by a corresponding increase in readily available training data. Recent research demonstrates a method for improving LLM performance by actively reshaping existing data, rather than solely increasing volume. This approach addresses a critical bottleneck in LLM development.

Researchers developed REWIRE (REcycling the Web with guIded REwrite), a pipeline that transforms lower-quality web-scraped documents into usable training material. This effectively expands the dataset without requiring further large-scale web crawling. Web scraping involves automatically extracting data from websites.
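The paper's exact prompts and quality classifier are not reproduced here, but the recycling idea can be sketched as a small routing step: documents scoring below a quality threshold are sent to a guided-rewrite function instead of being discarded. In this sketch, `rewrite_fn`, the quality scores, and the threshold value are all illustrative assumptions standing in for the LLM-guided rewrite the paper describes.

```python
# Hypothetical sketch of a REWIRE-style recycling step.
# `rewrite_fn` stands in for a guided LLM rewrite; the scores
# and threshold are illustrative, not the paper's values.
from typing import Callable

def recycle_corpus(
    docs: list[tuple[str, float]],           # (text, quality_score) pairs
    rewrite_fn: Callable[[str], str],        # placeholder for an LLM-guided rewrite
    threshold: float = 0.5,                  # assumed quality cutoff
) -> tuple[list[str], list[str]]:
    """Split documents into kept high-quality text and rewritten low-quality text."""
    kept, recycled = [], []
    for text, score in docs:
        if score >= threshold:
            kept.append(text)                    # passes the filter: use as-is
        else:
            recycled.append(rewrite_fn(text))    # would otherwise be discarded
    return kept, recycled

# Toy usage with a trivial stand-in rewriter.
docs = [("clean article", 0.9), ("noisy boilerplate", 0.2)]
kept, recycled = recycle_corpus(docs, rewrite_fn=lambda t: f"[rewritten] {t}")
```

The point of the sketch is the routing itself: low-scoring documents flow through the rewriter and back into the training pool rather than into the bin.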

Experiments conducted at 1B, 3B, and 7B parameter scales, utilising the DCLM benchmark – a suite of diverse language understanding and generation tasks – reveal consistent performance gains when incorporating REWIRE-generated synthetic data alongside high-quality, filtered web text. Models trained with the mixed dataset improved by 1.0, 1.3, and 2.5 percentage points at the three scales respectively, measured across 22 tasks, compared with training on filtered web data alone. Crucially, these gains surpass those obtained by simply doubling the amount of raw web data used for training, highlighting the efficiency of the REWIRE pipeline.
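The gains above come from training on a blend of filtered web text and REWIRE output. A minimal way to assemble such a blend is sketched below; the 1:1 interleaving is an assumption for illustration, not the paper's actual mixing recipe.

```python
# Illustrative dataset mixing: interleave filtered web text with
# synthetic rewrites. The 1:1 ratio is an assumed example ratio.
from itertools import chain, zip_longest

def mix_datasets(filtered: list[str], synthetic: list[str]) -> list[str]:
    """Interleave two document lists, tolerating unequal lengths."""
    paired = zip_longest(filtered, synthetic)            # pads shorter list with None
    return [doc for doc in chain.from_iterable(paired) if doc is not None]

mixed = mix_datasets(["web_a", "web_b"], ["syn_a"])
# mixed == ["web_a", "syn_a", "web_b"]
```

In practice the mixing ratio is a tunable knob; the study's comparison point is that this mixed pool beats simply doubling the raw web data.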

Analysis indicates that approximately 82% of the synthetic data integrated into the training set originates from documents initially deemed low-quality, which would otherwise have been discarded – maximising resource utility. The REWIRE pipeline achieves this through rephrasing and diversification of existing content, effectively ‘recycling’ previously unusable data.

The findings suggest that actively manipulating and repurposing existing web data represents a viable and efficient strategy for scaling LLM pre-training, offering a sustainable solution to data scarcity. By focusing on data quality and transformation, researchers circumvent the limitations imposed by a stagnating data supply and unlock further performance gains.

This method demonstrably outperforms alternative synthetic data generation techniques, including those based on Wikipedia-style paraphrasing, question-answer synthesis, and knowledge extraction. The observed performance improvements are not limited to overall benchmark scores; the research details gains across individual tasks within the DCLM benchmark, indicating a broad and consistent positive effect of the raw-synthetic data mix.

The study highlights the potential of data recycling as a simple yet effective method for advancing the field, offering a practical solution for developers and researchers.

👉 More information
🗞 Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04689

Dr. Donovan


Dr. Donovan is a futurist and technology writer covering the quantum revolution. Where classical computers manipulate bits that are either on or off, quantum machines exploit superposition and entanglement to process information in ways that classical physics cannot. Dr. Donovan tracks the full quantum landscape: fault-tolerant computing, photonic and superconducting architectures, post-quantum cryptography, and the geopolitical race between nations and corporations to achieve quantum advantage. The decisions being made now, in research labs and government offices around the world, will determine who controls the most powerful computers ever built.
