The pursuit of truly intelligent machines demands increasingly sophisticated ways to evaluate semantic understanding, and current benchmarks often fall short of capturing the nuances of human cognition. To address this challenge, Samarth Goel, Reagan J. Lee, and Kannan Ramchandran, all from the University of California, Berkeley, introduce SAGE, a rigorous new benchmark that comprehensively assesses both embedding models and classical similarity metrics. SAGE moves beyond isolated capability testing by evaluating semantic understanding under realistic, adversarial conditions, utilizing noisy transformations and nuanced human judgments across more than thirty datasets. The research reveals significant performance gaps in current approaches: no single method excels across every dimension of semantic understanding, and critical trade-offs emerge between capabilities such as clustering performance and robustness, yielding a more realistic assessment of how these technologies will fare in real-world deployment.
Robustness of Embeddings and Retrieval Systems
This research investigates the robustness of text embeddings and retrieval systems, moving beyond standard evaluations to assess performance under challenging conditions. The team evaluated how well these systems function with noisy or corrupted text and when subjected to adversarial attacks designed to mislead them. The study covered five key tasks: measuring alignment with human similarity judgments, assessing robustness to text corruptions, testing sensitivity to inserted or removed information, evaluating the quality of text clusters, and measuring retrieval performance on noisy inputs. The researchers drew on a diverse set of datasets, including BEIR, MS MARCO, FEVER, TREC-COVID, ArguAna, CQADupStack, TwentyNewsgroups, and Reddit.
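The basic similarity computation underlying these evaluations can be sketched in a few lines. The snippet below is a minimal illustration, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model as stand-ins; it is not the evaluation code or the model set used in the study.

```python
# Minimal sketch: cosine similarity between dense embeddings of two texts.
# Model name and example sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in model

a = "The vaccine trial showed strong efficacy in adults."
b = "Adults in the trial responded well to the vaccine."

# Encode both texts and compare their embeddings with cosine similarity.
emb = model.encode([a, b], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {cosine:.3f}")
```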
SAGE Benchmark Probes Semantic Understanding and Robustness
Researchers introduced SAGE, a rigorous benchmark designed to comprehensively assess semantic understanding in language models and similarity metrics. Unlike existing benchmarks, SAGE evaluates performance under adversarial conditions, employing noisy transformations and nuanced human judgment tasks across over 30 datasets. This provides a more challenging and realistic evaluation framework, probing specific aspects of semantic understanding beyond simple accuracy. To assess how well systems handle imperfect data, the team systematically perturbed documents with various corruptions, including character-level noise and semantic alterations, then measured how consistently similarity scores reflected semantic equivalence.
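As a rough illustration of the transformation-robustness idea, the sketch below corrupts a document with character-level noise, re-embeds it, and checks how much similarity to the original is retained. The noise function and model are illustrative assumptions, not SAGE's actual corruption pipeline.

```python
# Hedged sketch: perturb a document with character-level noise and measure
# how well embedding similarity to the original is retained.
import random
from sentence_transformers import SentenceTransformer, util

def char_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly delete or duplicate characters at roughly the given rate."""
    rng = random.Random(seed)
    out = []
    for c in text:
        r = rng.random()
        if r < rate / 2:
            continue          # delete this character
        out.append(c)
        if r > 1 - rate / 2:
            out.append(c)     # duplicate this character
    return "".join(out)

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in model
doc = "Embedding models map text to vectors whose distances reflect meaning."
noisy = char_noise(doc)

emb = model.encode([doc, noisy], convert_to_tensor=True)
retention = util.cos_sim(emb[0], emb[1]).item()
print(f"similarity retained after corruption: {retention:.3f}")
```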
They also evaluated how effectively metrics detected changes in content, either through the insertion of irrelevant information or the removal of key content spans, focusing on whether similarity scores decreased in proportion to the change. Clustering performance was determined using agglomerative clustering on a large text embedding benchmark, measuring the quality of the resulting groupings to assess how well semantic structure is preserved. The retrieval robustness task involved creating adversarially augmented corpora by applying transformations to generate perturbed versions of each document, then measuring how well relevant results were retained. The team evaluated popular text embedding models, scored with cosine similarity, alongside classical similarity metrics. Scores across all tasks were normalized, and an unweighted average was used to calculate an overall SAGE score. Results show that OpenAI’s text-embedding-3-large achieved the highest overall score; embedding models generally outperformed classical metrics on tasks requiring deep semantic understanding, while classical metrics retained advantages in certain areas.
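The aggregation step described above can be sketched as follows. Per-task min-max normalization across models is an assumption here (the text states only that scores were normalized), and the raw numbers are placeholders rather than reported results.

```python
# Sketch of the overall-score aggregation: normalize each task's scores across
# models, then take an unweighted mean per model. Min-max normalization is an
# assumption; the values below are placeholders, not results from the paper.
import numpy as np

models = ["model_a", "model_b", "model_c"]
tasks = ["human_pref", "transform_robust", "info_sensitivity",
         "clustering", "retrieval_robust"]

# Rows: models, columns: tasks (placeholder raw scores).
raw = np.array([
    [0.82, 0.40, 0.55, 0.71, 0.33],
    [0.78, 0.52, 0.61, 0.64, 0.29],
    [0.75, 0.47, 0.70, 0.58, 0.35],
])

# Min-max normalize each task (column) so tasks contribute on a common scale.
lo, hi = raw.min(axis=0), raw.max(axis=0)
norm = (raw - lo) / (hi - lo)

# Unweighted average over tasks gives the overall score per model.
overall = norm.mean(axis=1)
for name, score in zip(models, overall):
    print(f"{name}: overall = {score:.3f}")
```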
SAGE Benchmark Reveals Nuanced Semantic Understanding Limits
This research introduced SAGE, a new benchmark designed to rigorously evaluate semantic understanding in both embedding models and classical metrics. Unlike existing benchmarks, SAGE assesses performance under adversarial conditions, noisy transformations, and nuanced human judgment tasks, utilizing over 30 datasets to provide a comprehensive evaluation. Results demonstrate a critical finding: no single approach excels across all dimensions of semantic understanding, revealing nuanced performance trade-offs. Among the models tested, OpenAI’s text-embedding-3-large achieved the highest overall SAGE score, followed closely by models from Google and Voyage.
While embedding models substantially outperform classical similarity metrics overall, classical metrics demonstrate strengths in specific areas, with Jaccard similarity notably achieving a high score on information-sensitivity tasks. The study uncovered significant trade-offs: one embedding model achieved the highest clustering performance while simultaneously recording the lowest transformation-robustness score. Embedding models generally excel in tasks requiring deep semantic understanding, such as aligning with human preferences and retrieval, while classical metrics hold advantages in information sensitivity. This research reveals a disconnect between benchmark performance and real-world readiness, as even the most robust approach tested achieved only limited effectiveness under noisy conditions.
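To make the information-sensitivity behavior concrete, the sketch below computes token-level Jaccard similarity between a document and versions with progressively larger spans removed; a well-behaved metric should show similarity dropping roughly in proportion to the removed content. This is an illustrative approximation, not SAGE's scoring procedure.

```python
# Hedged sketch: token-level Jaccard similarity reacting to removed content.
# The example document is illustrative.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

doc = ("The benchmark evaluates embedding models and classical metrics "
       "under noise, adversarial corpora, and human judgment tasks.")
words = doc.split()

# Remove progressively larger spans from the end of the document and observe
# how the similarity score responds.
for keep in (1.0, 0.75, 0.5, 0.25):
    truncated = " ".join(words[: int(len(words) * keep)])
    print(f"kept {keep:.0%}: jaccard = {jaccard(doc, truncated):.3f}")
```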
Semantic Evaluation Reveals Performance Trade-offs and Noise Sensitivity
This research introduces SAGE, a new benchmark designed to rigorously evaluate semantic understanding in both embedding models and classical metrics. The team demonstrates that current evaluation methods often fail to capture critical performance trade-offs, revealing significant discrepancies between scores achieved in controlled laboratory settings and those likely to be seen in real-world applications. Results show that while state-of-the-art embedding models excel at aligning with human preferences, classical metrics often outperform them on tasks requiring sensitivity to changes in information content. The study highlights a crucial limitation: even the most robust approaches can still fail over 60% of the time when exposed to realistic noise, suggesting that deploying these models without careful safeguards is premature for many applications.
SAGE shows that selecting models solely on aggregate scores can be misleading, as models may exhibit extreme brittleness under even minor perturbations. The authors acknowledge that future evaluations should incorporate a wider range of real-world corruptions, greater data diversity, and practical constraints like latency and memory limitations. They hope SAGE will encourage a more balanced and rigorous approach to evaluating semantic technologies, prompting practitioners to view published scores as upper bounds achievable only under ideal conditions and to implement defensive architectures accordingly.
👉 More information
🗞 SAGE: A Realistic Benchmark for Semantic Understanding
🧠 ArXiv: https://arxiv.org/abs/2509.21310
