Recycling Web Data Boosts Large Language Model Performance Significantly

Research demonstrates a method, REWIRE, to enhance low-quality web data for large language model pre-training. Integrating rewritten data with high-quality texts improves performance across 22 tasks, yielding gains of up to 2.5 percentage points compared to training solely on filtered web data, and exceeding benefits from doubling data volume.

The escalating demand for computational resources in training large language models presents a significant challenge: the supply of readily available, high-quality training data is not keeping pace. Researchers are now investigating methods to maximise the utility of existing datasets, rather than relying solely on ever-larger web crawls. A team led by Thao Nguyen (FAIR at Meta & University of Washington), Yang Li (FAIR at Meta), Olga Golovneva (FAIR at Meta), Luke Zettlemoyer (FAIR at Meta & University of Washington), Sewoong Oh (University of Washington), Ludwig Schmidt (Stanford University) and Xian Li (FAIR at Meta) details their approach in the paper, “Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models”. Their work introduces REWIRE, a technique to refine discarded web data and integrate it with existing high-quality texts, demonstrably improving performance across a range of natural language processing tasks.

Data Recycling Enhances Large Language Model Performance

The escalating computational power dedicated to large language models (LLMs) is not matched by a corresponding increase in readily available training data. Recent research demonstrates a method for improving LLM performance by actively reshaping existing data, rather than solely increasing volume. This approach addresses a critical bottleneck in LLM development.

Researchers developed REWIRE (REcycling the Web with guIded REwrite), a pipeline that transforms lower-quality web-scraped documents into usable training material. This effectively expands the dataset without requiring further large-scale web crawling. Web scraping involves automatically extracting data from websites.
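The paper's exact prompts and quality classifier are not reproduced here, but the recycling idea can be sketched as a small routing step: documents scoring below a quality threshold are sent to a guided-rewrite function instead of being discarded. In this sketch, `rewrite_fn`, the quality scores, and the threshold value are all illustrative assumptions standing in for the LLM-guided rewrite the paper describes.

```python
# Hypothetical sketch of a REWIRE-style recycling step.
# `rewrite_fn` stands in for a guided LLM rewrite; the scores
# and threshold are illustrative, not the paper's values.
from typing import Callable

def recycle_corpus(
    docs: list[tuple[str, float]],           # (text, quality_score) pairs
    rewrite_fn: Callable[[str], str],        # placeholder for an LLM-guided rewrite
    threshold: float = 0.5,                  # assumed quality cutoff
) -> tuple[list[str], list[str]]:
    """Split documents into kept high-quality text and rewritten low-quality text."""
    kept, recycled = [], []
    for text, score in docs:
        if score >= threshold:
            kept.append(text)                    # passes the filter: use as-is
        else:
            recycled.append(rewrite_fn(text))    # would otherwise be discarded
    return kept, recycled

# Toy usage with a trivial stand-in rewriter.
docs = [("clean article", 0.9), ("noisy boilerplate", 0.2)]
kept, recycled = recycle_corpus(docs, rewrite_fn=lambda t: f"[rewritten] {t}")
```

The point of the sketch is the routing itself: low-scoring documents flow through the rewriter and back into the training pool rather than into the bin.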

Experiments conducted at 1B, 3B, and 7B parameter scales, utilising the DCLM benchmark – a suite of diverse language understanding and generation tasks – reveal consistent performance gains when incorporating REWIRE-generated synthetic data alongside high-quality, filtered web text. Models trained with the mixed dataset improved by 1.0, 1.3, and 2.5 percentage points at the three scales respectively, measured across 22 tasks, compared with training on filtered web data alone. Crucially, these gains surpass those obtained by simply doubling the amount of raw web data used for training, highlighting the efficiency of the REWIRE pipeline.
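The gains above come from training on a blend of filtered web text and REWIRE output. A minimal way to assemble such a blend is sketched below; the 1:1 interleaving is an assumption for illustration, not the paper's actual mixing recipe.

```python
# Illustrative dataset mixing: interleave filtered web text with
# synthetic rewrites. The 1:1 ratio is an assumed example ratio.
from itertools import chain, zip_longest

def mix_datasets(filtered: list[str], synthetic: list[str]) -> list[str]:
    """Interleave two document lists, tolerating unequal lengths."""
    paired = zip_longest(filtered, synthetic)            # pads shorter list with None
    return [doc for doc in chain.from_iterable(paired) if doc is not None]

mixed = mix_datasets(["web_a", "web_b"], ["syn_a"])
# mixed == ["web_a", "syn_a", "web_b"]
```

In practice the mixing ratio is a tunable knob; the study's comparison point is that this mixed pool beats simply doubling the raw web data.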

Analysis indicates that approximately 82% of the synthetic data integrated into the training set originates from documents initially deemed low-quality, which would otherwise have been discarded – maximising resource utility. The REWIRE pipeline achieves this through rephrasing and diversification of existing content, effectively ‘recycling’ previously unusable data.

The findings suggest that actively manipulating and repurposing existing web data represents a viable and efficient strategy for scaling LLM pre-training, offering a sustainable solution to data scarcity. By focusing on data quality and transformation, researchers circumvent the limitations imposed by a stagnating data supply and unlock further performance gains.

This method demonstrably outperforms alternative synthetic data generation techniques, including those based on Wikipedia-style paraphrasing, question-answer synthesis, and knowledge extraction. The observed performance improvements are not limited to overall benchmark scores; the research details gains across individual tasks within the DCLM benchmark, indicating a broad and consistent positive effect of the raw-synthetic data mix.

The study highlights the potential of data recycling as a simple yet effective method for advancing the field, offering a practical solution for developers and researchers.

👉 More information
🗞 Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
🧠 DOI: https://doi.org/10.48550/arXiv.2506.04689

Dr. Donovan


Dr. Donovan is a futurist and technology writer covering the quantum revolution. Where classical computers manipulate bits that are either on or off, quantum machines exploit superposition and entanglement to process information in ways that classical physics cannot. Dr. Donovan tracks the full quantum landscape: fault-tolerant computing, photonic and superconducting architectures, post-quantum cryptography, and the geopolitical race between nations and corporations to achieve quantum advantage. The decisions being made now, in research labs and government offices around the world, will determine who controls the most powerful computers ever built.
