Researchers are increasingly focused on ensuring the safety of advanced artificial intelligence systems, yet current benchmarking practices present considerable challenges. Cheng Yu and colleagues, including Severin Engelmann (Cornell University) and Ruoxuan Cao, Dalia Ali, and Orestis Papakyriakopoulos (Technical University of Munich), have undertaken a comprehensive review of 210 benchmarks to identify critical technical, epistemic, and sociotechnical shortcomings. Their work is significant because it moves beyond simply creating benchmarks, instead analysing how we benchmark safety, drawing on established risk management principles and measurement theory. The study not only maps the limitations of existing approaches but also provides a roadmap, including a practical checklist, for developing more robust and responsible AI evaluation methods, ultimately advancing the science of benchmarking itself.
The research team identified imbalances in construct coverage, where 81% of benchmarks focus solely on predefined risks, neglecting emergent behaviours and unforeseen failures.
This study establishes that current risk quantification often lacks probabilistic rigor, with 79% of benchmarks relying on binary pass/fail rates instead of calibrated probabilities and severity assessments. The work unveils a concerning trend of eroding measurement validity through proxy chains, where metrics like refusal rates are mistakenly equated with real-world outcomes.
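To make the contrast concrete, here is a minimal sketch of the difference between a binary pass/fail rate and a calibrated, severity-weighted risk estimate. All data and cutoffs below are hypothetical placeholders, not figures from the paper:

```python
import numpy as np

# Hypothetical per-prompt outcomes from a safety benchmark:
# a calibrated probability that the response causes harm, and
# an assessed severity of that harm on a 0-1 scale.
rng = np.random.default_rng(0)
p_harm = rng.beta(0.5, 20, size=1000)   # calibrated harm probabilities
severity = rng.beta(2, 5, size=1000)    # assessed severity if harm occurs

# Binary pass/fail view: count a prompt as "failed" above a fixed cutoff.
fail_rate = np.mean(p_harm > 0.5)

# Probabilistic view: expected severity-weighted harm per interaction.
expected_risk = np.mean(p_harm * severity)

print(f"binary failure rate:             {fail_rate:.4f}")
print(f"expected severity-weighted risk: {expected_risk:.4f}")
```

The binary view discards both how likely a failure is and how bad it would be, which is precisely the information a deployment decision needs.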
Researchers argue that adhering to established risk management principles, mapping the scope of measurable and unmeasurable factors, and developing robust probabilistic metrics are crucial for improving benchmark validity and usefulness. This approach draws from engineering sciences and long-established theories of risk and safety to address limitations in AI safety evaluation.
This study advances the science of benchmarking by providing a roadmap for improvement, illustrated through both quantitative and qualitative evaluation. The team proposes ten recommendations, grounded in established bodies of knowledge, to address the identified shortcomings in construct coverage, risk quantification, and measurement validity.
A practical checklist is also introduced to assist researchers and practitioners in developing robust and epistemologically sound safety benchmarks. The authors argue that operationalizing safety benchmarking as a normative process, one that connects abstract values to real-world outcomes, is essential for responsible AI system deployment.
The research establishes a clear need to shift from solely maximizing capability to prioritizing risk mitigation in AI evaluation, acknowledging the normative and sociotechnical dimensions of safety. This work opens avenues for more comprehensive and reliable AI safety assessments, ultimately contributing to the more responsible development and deployment of AI systems.
Quantifying Benchmark Reliability Using Probabilistic Risk Assessment offers a robust methodology
Researchers conducted a comprehensive review of 210 benchmarks to identify common challenges in the field, documenting failures and limitations through the lens of engineering sciences and established theories of risk. The study takes a novel approach by applying risk management principles to benchmark design, specifically mapping the measurable space and developing robust probabilistic metrics.
This involved translating abstract values, such as “harm” or “vulnerability”, into concrete observable phenomena to bridge the gap between normative concepts and physical reality. The authors draw on probability theory to address real-world complexity and uncertainty, mirroring functional safety practice, where acceptable risk thresholds are defined as target probabilities of dangerous failure.
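A rough sketch of that functional-safety analogy follows. The Clopper-Pearson bound, the failure counts, and the acceptable-risk target are my own illustrative assumptions, not numbers from the study:

```python
from scipy.stats import beta

def failure_prob_upper_bound(failures: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the per-demand failure probability."""
    if failures == trials:
        return 1.0
    return beta.ppf(confidence, failures + 1, trials - failures)

# Illustrative numbers: 3 dangerous failures observed in 10,000 benchmark prompts,
# checked against a hypothetical acceptable-risk target of 1e-3 per demand.
target = 1e-3
upper = failure_prob_upper_bound(failures=3, trials=10_000)
verdict = "within" if upper <= target else "exceeds"
print(f"95% upper bound: {upper:.2e} -> {verdict} target {target:.0e}")
```

The point of the analogy is that a benchmark result becomes a statistical claim about failure probability, with explicit confidence, rather than a leaderboard score.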
The work quantified and qualified the likelihood and consequence of events, aiming to reduce uncertainty and ensure systems operate within socially accepted bounds. Researchers then developed a benchmark design checklist and deployed it alongside an assessment of deployment risk, operationalizing safety benchmarking as a normative process connecting abstract values to real-world outcomes.
Complete coding results supporting these evaluations are reported in a supplementary appendix. The study distinguished AI safety benchmarks from traditional evaluations by focusing on risk mitigation rather than task proficiency, a fundamental shift in perspective. Traditional benchmarks employ fixed test sets and holdout methods for comparable evaluation, often aggregating performance into a single score and utilizing leaderboards to encourage competition.
In contrast, safety benchmarks are normative, assessing potential for harm rather than simply measuring performance, exemplified by comparing GPT-5 and GPT-2’s capabilities in relation to harmful outputs. This research highlights that safety is a sociotechnical phenomenon, emerging from interactions between systems, users, and contexts, necessitating evaluations beyond purely technical objectives.
AI benchmark characteristics diverge between capability and safety evaluations, often prioritizing performance over robustness
Scientists conducted a comprehensive review of 210 benchmarks, identifying critical technical, epistemic, and sociotechnical shortcomings within the field. The research mapped common challenges in benchmarking, documenting failures and limitations by drawing upon established theories of risk and engineering sciences.
Results demonstrate the need to adhere to established risk management principles to improve benchmark validity and usefulness. The review showed that traditional benchmarks are historically tied to maximizing capability, focusing on technical objectives that reflect technological advancement. AI safety benchmarks, however, fundamentally differ by concentrating on risk mitigation rather than task proficiency.
The team characterized this distinction by observing that safety benchmarks are normative, assessing potential for harm, unlike traditional benchmarks, which measure performance. For example, GPT-5, while technically superior to GPT-2 in coherence and knowledge, may be judged worse normatively because of its greater capacity for harmful outputs.
Researchers applied measurement theory to ensure epistemologically sound construct definitions, traceable calibration, and deployment-grounded proxies. Quantitative and qualitative illustrations were developed for translating benchmark scores to deployment risk, detailed in Appendix C, and a benchmark design checklist was created, found in Appendix D.
These tools operationalize safety benchmarking as a normative process connecting abstract values to real-world outcomes, with complete coding results reported in Appendix E0.2. The study highlights that risk measurement functions as a two-step process, bridging abstract social values and physical reality, and employing probability theory to manage uncertainty.
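To illustrate what such a score-to-risk translation can look like, here is a hedged sketch of scaling a benchmark-level failure estimate to a deployment context. Every deployment figure and conditional factor below is a hypothetical placeholder, not a value from Appendix C:

```python
# Hypothetical translation of a benchmark score into deployment-level risk.
# Step 1: the benchmark estimates how often the measured proxy fails.
unsafe_response_rate = 0.002     # estimated P(unsafe response | prompt), from the benchmark

# Step 2: bridge the proxy to real-world outcomes with explicit, uncertain factors.
p_harm_given_unsafe = 0.05       # assumed P(real-world harm | unsafe response)
mean_severity = 0.3              # assumed mean severity of a harm event (0-1 scale)
daily_interactions = 1_000_000   # assumed deployment volume

expected_harm_events_per_day = daily_interactions * unsafe_response_rate * p_harm_given_unsafe
severity_weighted_risk = expected_harm_events_per_day * mean_severity

print(f"expected harm events/day:       {expected_harm_events_per_day:.1f}")
print(f"severity-weighted risk per day: {severity_weighted_risk:.1f}")
```

Writing the bridging factors down explicitly, even as rough estimates, is what turns an abstract benchmark score into a statement about real-world exposure that can be debated and refined.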
Scientists found that robust safety benchmarks must connect normative values to real-world indicators and handle uncertainty probabilistically. Analyses of benchmarks like TruthfulQA, MACHIAVELLI, and HarmBench revealed a critical gap: they often fail to establish a clear connection between claimed measurements and the values actually captured. Current benchmarks rely on metrics such as refusal rates, keyword matching, and attack success rates that differ significantly from the actual manifestation of harm.
Addressing systemic flaws in artificial intelligence safety evaluation requires interdisciplinary collaboration and rigorous testing
Scientists have identified significant shortcomings in current AI safety benchmarks, revealing limitations in their technical, epistemic, and sociotechnical foundations. A review of 210 benchmarks demonstrates that these evaluations often provide an incomplete and unreliable basis for assessing deployment safety, lacking scientific rigor and failing to adequately address real-world hazards.
The research highlights gaps in risk coverage, a failure to probabilistically quantify dangers, and misalignment with established measurement theory, alongside a neglect of the complex sociotechnical systems within which safety is embedded. This study advances the science of benchmarking by advocating for the integration of risk management principles, a clear mapping of measurable scope, robust probabilistic metrics, and the rigorous application of measurement theory.
Researchers suggest that acknowledging the limitations of existing benchmarks and embracing a system-level perspective is crucial for building truly safe AI systems. The authors acknowledge that proposed severity scales and safety margins could be prematurely standardized or used for superficial compliance, emphasizing the need for iterative validation, transparent calibration, and open methodologies. Future work should focus on developing domain-specific treatments tailored to the unique epistemic structures of different AI safety subcategories and exploring efficient, iterative community involvement in benchmark design.
👉 More information
🗞 How should AI Safety Benchmarks Benchmark Safety?
🧠 ArXiv: https://arxiv.org/abs/2601.23112
