SAGE Benchmark Reveals Performance Gaps in Semantic Understanding, with Classical Metrics Reaching 0.905 in Information Sensitivity

The pursuit of truly intelligent machines demands increasingly sophisticated ways to test their understanding of language, and current benchmarks often fail to capture the nuances of semantic meaning. To address this challenge, Samarth Goel, Reagan J. Lee, and Kannan Ramchandran, all from the University of California, Berkeley, introduce SAGE, a new benchmark designed to rigorously evaluate both modern embedding models and traditional similarity metrics. SAGE assesses semantic understanding through a diverse range of adversarial conditions and human judgment tasks, spanning over thirty datasets and five key categories. This comprehensive evaluation reveals significant performance gaps across different approaches, demonstrating that while state-of-the-art embedding models excel at aligning with human preferences, they often struggle with tasks requiring information sensitivity and exhibit surprising brittleness when faced with noisy data. By exposing these critical limitations and trade-offs, SAGE provides a more realistic and challenging assessment of semantic understanding, paving the way for the development of more robust and reliable artificial intelligence systems.

SAGE Benchmark Probes Holistic Semantic Understanding

This study introduces SAGE, a rigorous benchmark designed to comprehensively assess semantic understanding in language models and similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks, SAGE evaluates performance under adversarial conditions, employing noisy transformations and nuanced human judgment tasks across over 30 datasets, providing a more realistic assessment of model capabilities. To assess Transformation Robustness, the team systematically perturbed documents with corruptions, including character-level noise and semantic alterations, measuring how consistently similarity scores reflected semantic equivalence despite these changes. Information Sensitivity was evaluated by quantifying how effectively metrics detected changes in document content, either through insertion of irrelevant information or removal of key content, with scores reflecting proportionality between perturbation and similarity decrease.
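As a rough sketch of the transformation-robustness setup (not the authors' code), one can corrupt a document with character-level noise and check whether a similarity metric still scores the noisy copy close to the original. Token-level Jaccard similarity stands in for the metric here; the noise rate and example sentence are illustrative:

```python
import random
import string

def perturb_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply character-level noise: substitute each character with probability `rate`."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two documents."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

doc = "the quick brown fox jumps over the lazy dog"
noisy = perturb_chars(doc, rate=0.15)

# A transformation-robust metric should still rate the noisy copy
# as highly similar to the original document.
print(round(jaccard(doc, noisy), 3))
```

A full evaluation in this style would sweep the noise rate and score how gracefully each metric's similarity degrades, rather than checking a single perturbation.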

Scores were normalized to a 0-1 scale, with an overall SAGE score calculated as the unweighted average of the five category scores. The results demonstrate that embedding models generally outperform classical metrics in tasks requiring deep semantic understanding, while classical metrics hold advantages in information sensitivity and transformation robustness. Notably, Jaccard Similarity achieved a score of 0.905 in information sensitivity, surpassing the top embedding score of 0.794, while Levenshtein Ratio led in transformation robustness with a score of 0.333, revealing critical trade-offs.
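The aggregation step described above is simple to state in code. The sketch below uses the paper's five category names, but the per-category numbers are made up for illustration and are not the reported results:

```python
def sage_score(category_scores: dict) -> float:
    """Overall SAGE score: the unweighted mean of the five category scores, each in [0, 1]."""
    assert len(category_scores) == 5, "SAGE averages exactly five category scores"
    return sum(category_scores.values()) / len(category_scores)

# Illustrative (made-up) category scores for one hypothetical model:
scores = {
    "human_preference_alignment": 0.80,
    "transformation_robustness": 0.30,
    "information_sensitivity": 0.70,
    "clustering_performance": 0.45,
    "retrieval_robustness": 0.40,
}
print(round(sage_score(scores), 3))  # → 0.53
```

Because the average is unweighted, a model that is strong in four categories but collapses in one (as happens with transformation robustness below) pays the full price in its overall score.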

SAGE Benchmark Reveals Nuanced Semantic Understanding Performance

The research team introduced SAGE, a new benchmark designed to rigorously evaluate semantic understanding in both embedding models and classical similarity metrics across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness, drawing on over 30 datasets for a comprehensive evaluation. Results demonstrate that no single approach excels across all dimensions of semantic understanding, revealing nuanced performance trade-offs often missed by simpler benchmarks. Among embedding models, OpenAI’s text-embedding-3-large achieved the highest overall SAGE score of 0.524, followed closely by gemini-embedding-001 at 0.504 and voyage-3-large at 0.492. While embedding models substantially outperform classical metrics overall, with text-embedding-3-small scoring 0.474 against 0.423 for the best classical approach, Jaccard Similarity, classical metrics demonstrate strengths in specific areas.

Notably, Jaccard Similarity achieved a score of 0.905 in Information Sensitivity, exceeding the top embedding score of 0.794. The study also uncovered significant trade-offs within embedding models: text-embedding-3-small achieved the highest clustering performance at 0.483 but simultaneously recorded the lowest transformation robustness score of 0.011, highlighting a disconnect between benchmark performance and real-world readiness. The research team found that even the most robust approach retained only 67% effectiveness, demonstrating the need for defensive architectures and safeguards in critical applications and demanding a shift toward benchmarks that mirror production complexity by incorporating real-world corruptions and data diversity.
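The information-sensitivity probe can be illustrated with a small sketch (again, not the authors' code): append irrelevant content to a document and check that the metric's similarity score drops. Lexical measures make this concrete; `difflib.SequenceMatcher.ratio` is used here as a rough stand-in for a Levenshtein-style ratio, and the example sentences are invented:

```python
from difflib import SequenceMatcher

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two documents."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def seq_ratio(a: str, b: str) -> float:
    """difflib's match ratio, a rough stand-in for a Levenshtein-style ratio."""
    return SequenceMatcher(None, a, b).ratio()

doc = "photosynthesis converts light energy into chemical energy in plants"
padded = doc + " unrelated filler about stock prices and weekend weather forecasts"

# An information-sensitive metric should score the padded copy
# strictly lower than a perfect match with the original.
print(round(jaccard(doc, padded), 3))
print(round(seq_ratio(doc, padded), 3))
```

A fuller version would vary how much content is inserted or removed and score whether the similarity decrease is proportional to the perturbation, which is the behavior the benchmark rewards.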

SAGE Benchmark Reveals Semantic Understanding Limits

This research introduces SAGE, a new benchmark designed to rigorously evaluate semantic understanding in both embedding models and classical metrics, assessing performance across five key categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Results demonstrate that current evaluation methods often fail to capture critical performance trade-offs, revealing significant discrepancies between performance on standard benchmarks and performance under more realistic, challenging conditions. The study highlights a crucial limitation: even the most robust approaches can fail significantly when deployed in noisy environments, underscoring the need for robust architectures and safeguards in critical applications.

👉 More information
🗞 SAGE: A Realistic Benchmark for Semantic Understanding
🧠 ArXiv: https://arxiv.org/abs/2509.21310

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
