Researchers Boost AI Benchmarks, Cutting Error by 20%

Evaluating the performance of large language models presents a significant challenge, as current benchmarks often provide inconsistent or unreliable results. David Heineman, Valentin Hofmann, and Ian Magnusson, along with colleagues at the Allen Institute for Artificial Intelligence, investigate the underlying properties that determine a benchmark’s reliability. Their work introduces a framework for understanding how much a benchmark can truly distinguish between better and worse models, termed ‘signal’, and how susceptible it is to random fluctuations, termed ‘noise’. The team demonstrates that benchmarks with a stronger signal and lower noise provide more dependable evaluations, particularly when making decisions based on limited data, and also improve the accuracy of predicting performance at larger scales, offering a crucial step towards more trustworthy language model development.

Large, multi-task evaluation suites are increasingly used to assess and compare the capabilities of machine learning models. This work analyses specific properties that determine the reliability of these benchmarks when making critical decisions about model performance. The research introduces two key metrics to characterise existing benchmarks: signal, which measures a benchmark’s ability to differentiate between superior and inferior models, and noise, which quantifies a benchmark’s sensitivity to random variations during training. The results demonstrate that benchmarks exhibiting a higher signal-to-noise ratio provide more reliable assessments at smaller scales, and those with lower noise yield more accurate scaling law predictions. These findings suggest that enhancing either the signal or reducing the noise within evaluation benchmarks will lead to more robust and informative evaluations.

Early Stopping and Weight Averaging Improve LLMs

Researchers have investigated methods to improve the robustness and reliability of large language models (LLMs), specifically open models evaluated under the OLMES suite. A key finding is that averaging model weights from multiple checkpoints during training, using an exponential moving average (EMA), significantly improves performance compared with using only the final checkpoint. This technique helps to prevent overfitting and stabilise training. The research also explored how sensitive LLMs are to variations in the training setup, examining the impact of changing the random seed used during training and of shuffling the order of the training data.
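
The checkpoint-averaging idea can be sketched as follows. This is a minimal, hedged example assuming PyTorch checkpoints that store plain parameter state dicts; the file paths, the decay value, and the helper name are illustrative and not taken from the paper.

```python
import torch

def ema_average_checkpoints(checkpoint_paths, decay=0.5):
    """Blend a sequence of saved checkpoints into one EMA state dict.

    Assumes each file stores a plain parameter state dict (hypothetical
    layout). Running update: ema = decay * ema + (1 - decay) * current,
    so lower decay values put more weight on the most recent checkpoints.
    """
    ema_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if ema_state is None:
            # Initialise the running average with the first checkpoint.
            ema_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                ema_state[k].mul_(decay).add_(v.float(), alpha=1.0 - decay)
    return ema_state

# Usage (paths are hypothetical):
# averaged = ema_average_checkpoints(["step_9000.pt", "step_9500.pt", "step_10000.pt"])
# model.load_state_dict(averaged)
```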

Varying both the random seed and the data order together proved particularly effective at improving model robustness. These tests were conducted across a range of question-answering and reasoning tasks and showed consistent improvements in performance, yielding a more stable and accurate model. The data also consistently demonstrate that EMA checkpoint averaging significantly improves accuracy compared with using a single checkpoint, especially as training progresses. Together, the results suggest that introducing controlled variation during training can mitigate the sensitivity of LLMs to random fluctuations.

Benchmark Signal and Noise Predict Model Scaling

Developing new language models requires significant resources, and researchers must make informed decisions based on experiments with smaller models before scaling up to larger systems. A crucial challenge is identifying which benchmarks provide reliable information for these decisions. Recent work has highlighted that not all benchmarks are equally useful, prompting investigation into how to assess and improve their effectiveness. Researchers have now investigated the properties of benchmarks that contribute to their reliability, focusing on two key metrics: signal and noise. Signal measures a benchmark’s ability to clearly differentiate between better and worse language models, while noise reflects its sensitivity to random variations during the training process.
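
To make these two quantities concrete, the sketch below computes signal, noise, and their ratio from raw benchmark scores. The formulas used here (relative spread of final scores across models for signal, relative standard deviation over one model's final checkpoints for noise) are one plausible formalisation rather than a verbatim reproduction of the paper's definitions, and the example numbers are invented.

```python
import numpy as np

def signal(final_scores_by_model):
    """Spread of final benchmark scores across different models,
    relative to the mean score: larger values mean the benchmark
    separates better models from worse ones more clearly."""
    s = np.asarray(final_scores_by_model, dtype=float)
    return (s.max() - s.min()) / s.mean()

def noise(checkpoint_scores_one_model):
    """Relative standard deviation of one model's scores over its last
    few training checkpoints: a proxy for the benchmark's sensitivity
    to random fluctuations in training."""
    s = np.asarray(checkpoint_scores_one_model, dtype=float)
    return s.std() / s.mean()

# Invented accuracies for four models, plus final checkpoints of one model.
per_model = [0.42, 0.47, 0.55, 0.61]
per_checkpoint = [0.600, 0.610, 0.605, 0.615]
snr = signal(per_model) / noise(per_checkpoint)
print(f"signal-to-noise ratio: {snr:.1f}")
```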

The team discovered a strong correlation between a benchmark’s signal-to-noise ratio and its usefulness in predicting the performance of larger models from experiments with smaller ones. Benchmarks with a high signal-to-noise ratio consistently provided more reliable insights, allowing researchers to confidently scale up promising approaches. Conversely, benchmarks with low signal or high noise led to inaccurate predictions and wasted resources. To improve benchmark quality, the researchers explored several interventions. Switching to evaluation metrics with better signal and lower noise, such as perplexity instead of simple accuracy, significantly improved reliability.
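
As a sketch of why a likelihood-based metric can behave differently from accuracy, the example below contrasts a discrete 0/1 accuracy score with bits per byte, one common perplexity-style formulation. The function names and inputs are hypothetical; this illustrates the two metric types rather than the paper's evaluation code.

```python
import math

def accuracy(is_correct_flags):
    """Discrete metric: each item contributes 0 or 1, so the score moves
    in coarse jumps, which adds noise at small scales."""
    return sum(is_correct_flags) / len(is_correct_flags)

def bits_per_byte(token_logprobs, num_bytes):
    """Continuous metric: total negative log-likelihood of the reference
    text (converted from nats to bits), normalised by its length in
    bytes, so small changes in model quality move the score smoothly."""
    total_nll_nats = -sum(token_logprobs)
    return total_nll_nats / (math.log(2) * num_bytes)

# Hypothetical inputs for a single evaluation item:
print(accuracy([True, False, True, True]))                    # 0.75
print(bits_per_byte([-0.7, -1.2, -0.3, -2.1], num_bytes=17))  # ~0.36
```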

Filtering out noisy subtasks within a benchmark, even if it reduced the overall number of test cases, also boosted the signal-to-noise ratio and enhanced predictive power. Averaging model outputs across multiple training checkpoints reduced noise and consistently improved performance predictions. The team compiled a new dataset of nearly 900,000 benchmark results from 375 language models, ranging in size from 60 million to 32 billion parameters. This resource allows for a comprehensive analysis of benchmark quality and provides a foundation for developing more robust and reliable evaluation procedures. The findings emphasize the importance of prioritizing benchmarks with high signal and low noise, offering a practical guide for researchers seeking to accelerate progress in language model development and ensure efficient allocation of computational resources.
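
A hedged sketch of the subtask-filtering step: given per-subtask scores for several models and for the final checkpoints of one model, drop subtasks whose signal-to-noise ratio falls below a threshold and average only the rest. The data layout, threshold, and helper name are illustrative assumptions, not details from the paper.

```python
import numpy as np

def filter_subtasks_by_snr(subtask_scores, min_snr=2.0):
    """Return the names of subtasks whose signal-to-noise ratio is at
    least `min_snr`.

    `subtask_scores[name]` is a pair: (final scores of several models on
    that subtask, scores of one model's final checkpoints on it).
    """
    kept = []
    for name, (per_model, per_checkpoint) in subtask_scores.items():
        m = np.asarray(per_model, dtype=float)
        c = np.asarray(per_checkpoint, dtype=float)
        sig = (m.max() - m.min()) / m.mean()
        noi = c.std() / c.mean()
        if noi > 0 and sig / noi >= min_snr:
            kept.append(name)
    return kept

# The benchmark score is then the mean over the retained subtasks only,
# trading a smaller test set for a cleaner ranking signal.
```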

Benchmark Quality, Signal and Noise Analysis

This research investigates the reliability of benchmarks used to evaluate large language models, revealing that not all benchmarks are equally effective at distinguishing between model performance. The study introduces two key metrics, signal and noise, to quantify benchmark quality; signal measures a benchmark’s ability to differentiate between good and poor models, while noise reflects its sensitivity to random variations during training. Results demonstrate that benchmarks with a strong signal-to-noise ratio are more dependable for small-scale model comparisons and yield more accurate predictions of performance at larger scales. The authors further propose and test interventions to improve benchmark quality, including using evaluation metrics with better signal and noise characteristics, filtering noisy subtasks within a benchmark, and averaging model outputs across multiple training checkpoints to reduce variability.

These interventions consistently improve the reliability of evaluations. The researchers acknowledge that the effectiveness of these interventions may vary depending on the specific benchmark and model being evaluated, and that further research is needed to fully understand the interplay between signal, noise, and benchmark design. They have made publicly available a large dataset of benchmark results to facilitate future work in this area, and encourage developers to prioritize high signal and low noise when creating or selecting benchmarks for evaluating language models.

👉 More information
🗞 Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
🧠 ArXiv: https://arxiv.org/abs/2508.13144

