Researchers present E2R-FLOPs, new metrics that assess the efficiency of large language model rerankers by measuring relevance and queries delivered per PetaFLOP, offering hardware-independent evaluation. An accompanying FLOPs estimator predicts computational cost without experimentation, enabling comprehensive analysis of the efficiency-effectiveness trade-off.
Large language models (LLMs) increasingly feature in information retrieval systems, specifically as re-rankers that refine initial search results, yet their substantial computational requirements present challenges for practical implementation. Current evaluations often rely on proxy metrics like latency and token counts, which are susceptible to variations in hardware and runtime configurations, hindering meaningful comparisons and obscuring the efficiency-effectiveness trade-off.
This research introduces E2R-FLOPs, a novel evaluation framework centred on ranking metrics per PetaFLOP (RPP) for relevance per unit of compute and queries per PetaFLOP (QPP) for throughput, providing a hardware-agnostic assessment of LLM reranker efficiency and enabling direct comparison of models regardless of implementation details. The authors also present an interpretable floating-point operations (FLOPs) estimator, allowing computational cost to be predicted without requiring experimental runs.
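Both metrics are simple ratios of quality or throughput to compute. A minimal sketch in Python (function names and example figures are illustrative, not taken from the paper):

```python
def rpp(ranking_metric: float, total_petaflops: float) -> float:
    """Ranking metric (e.g. NDCG@10) achieved per PetaFLOP spent."""
    return ranking_metric / total_petaflops

def qpp(num_queries: int, total_petaflops: float) -> float:
    """Queries served per PetaFLOP: a hardware-agnostic throughput."""
    return num_queries / total_petaflops

# Hypothetical reranker: NDCG@10 of 0.72 over 1,000 queries,
# at a total cost of 2.5 PetaFLOPs.
print(rpp(0.72, 2.5))   # ≈ 0.288
print(qpp(1000, 2.5))   # 400.0
```

Because the denominator is FLOPs rather than wall-clock time, these numbers are unchanged by batch size, GPU model, or parallelisation strategy.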
Through comprehensive experimentation with a diverse range of LLM architectures, the study investigates the relationship between efficiency and effectiveness, revealing the variability in computational cost across different models and highlighting the importance of considering FLOPs as a primary metric for evaluating re-ranking performance. The researchers systematically analyse the FLOPs required for various LLM operations, providing insights into the computational bottlenecks and potential areas for optimisation.
The research emphasises the need for a standardised evaluation framework that accounts for computational cost, facilitating a more informed understanding of the efficiency-effectiveness trade-off in LLM-based information retrieval. By introducing E2R-FLOPs and a FLOPs estimator, the study provides the research community with valuable tools for assessing and comparing LLM re-rankers, ultimately improving the scalability and accessibility of advanced search technologies.
Future work should investigate methods for reducing the FLOPs required by LLM-based re-rankers without sacrificing retrieval accuracy, including exploring model compression techniques, such as pruning and quantisation, as well as developing more efficient attention mechanisms. Additionally, research could focus on optimising the FLOPs estimator to improve its accuracy and generalisability across different LLM architectures and datasets. Expanding the evaluation framework to encompass energy consumption alongside FLOPs represents a crucial next step, as considering both computational cost and energy efficiency will provide a more holistic understanding of the sustainability implications of deploying LLM-based information retrieval systems.
Information retrieval systems commonly employ a two-stage process, initially retrieving a large set of documents and then refining their ranking to prioritise relevance. Recent advances utilise LLMs as re-rankers, demonstrably improving ranking quality as measured by metrics like normalised discounted cumulative gain (NDCG). However, these LLM-based re-rankers introduce significant computational demands, creating challenges for real-world deployment at scale and necessitating a more nuanced evaluation of re-ranking systems considering both effectiveness and computational cost.
Existing methods for assessing efficiency, such as latency, the number of LLM calls, and token usage, prove inadequate for precise comparison. Latency is heavily influenced by hardware and implementation details, obscuring algorithmic differences; counting LLM calls fails to account for model size, since a larger model requires substantially more computation per call; and token usage lacks interpretability, failing to differentiate the computational cost of processing input tokens versus generating output tokens and hindering a clear understanding of resource allocation.
Inspired by scaling laws observed in LLMs, which link total compute to performance, researchers now focus on floating-point operations (FLOPs) as a fundamental measure of computational cost. FLOPs provide a hardware-agnostic, intrinsic metric of the work performed by a model during re-ranking, allowing for a fairer comparison of different re-ranking methods irrespective of the specific hardware or runtime choices employed.
To facilitate this evaluation, new metrics are proposed, namely ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. These metrics aim to provide a more comprehensive understanding of the efficiency-effectiveness trade-off in LLM-based re-rankers, guiding development towards more sustainable solutions. Accompanying these metrics is an interpretable FLOPs estimator, enabling researchers to predict the computational cost of a re-ranker without running experiments.
The authors conduct comprehensive experiments evaluating a range of LLM-based re-rankers, examining the relationship between efficiency and effectiveness and highlighting the importance of considering computational cost alongside ranking quality. This work addresses a critical gap in the current evaluation landscape, promoting a more holistic assessment of LLM-based information retrieval systems and fostering innovation in efficient model design. The goal is to encourage the development of more efficient re-ranking algorithms, ultimately improving the scalability and accessibility of advanced search technologies.
The experiments focus specifically on re-ranking: unlike initial retrieval, where a search engine identifies a broad set of potentially relevant documents, re-ranking takes a smaller, pre-selected set of results and reorders them based on a more nuanced understanding of relevance. This task is particularly well-suited to LLMs, which excel at semantic understanding and contextual reasoning, allowing for a targeted evaluation of LLM performance. Using RPP and QPP as primary metrics enables a hardware-agnostic comparison, making the results less susceptible to variations in hardware configuration and supporting the reproducibility and generalisability of the findings.
The authors propose a novel evaluation framework centred around E2R-FLOPs – ranking metrics per PetaFLOP (RPP) and queries per PetaFLOP (QPP) – offering hardware-agnostic metrics for assessing LLM-based re-ranker performance. The core contribution lies in metrics that directly quantify the computational cost, measured in PetaFLOPs, required to achieve a given level of retrieval relevance or throughput, establishing a standardised measure independent of specific hardware configurations or runtime choices such as batch size or parallelisation. Furthermore, the authors develop an interpretable FLOPs estimator, enabling prediction of computational demands without necessitating experimental runs, thereby streamlining the evaluation process.
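As an illustration of what an interpretable estimator can expose, the sketch below uses a simplified per-layer transformer cost model (the constants and structure are textbook approximations of my own, not the authors' formula). Note how it naturally separates the prefill cost of input tokens from the growing per-token cost of generated output:

```python
def layer_flops_per_token(d_model: int, context_len: int) -> float:
    """Approximate FLOPs for one token through one transformer layer:
    ~8*d^2 for the Q/K/V/output projections, ~16*d^2 for a 4x-expansion
    MLP, plus ~4*d*context_len for attention scores and value mixing."""
    return 24.0 * d_model ** 2 + 4.0 * d_model * context_len

def estimate_rerank_flops(n_layers: int, d_model: int,
                          n_input: int, n_output: int) -> float:
    """Input tokens are consumed in a prefill pass; each generated output
    token then attends to the full context so far, so output tokens are
    individually more expensive than input tokens."""
    prefill = sum(layer_flops_per_token(d_model, t)
                  for t in range(1, n_input + 1))
    decode = sum(layer_flops_per_token(d_model, n_input + t)
                 for t in range(1, n_output + 1))
    return n_layers * (prefill + decode)

# Toy listwise reranker call: 32 layers, d_model = 4096,
# 4,000 prompt tokens (query + candidates), 50 generated tokens.
print(estimate_rerank_flops(32, 4096, 4000, 50) / 1e15)  # cost in PetaFLOPs
```

Because every term is attributable to a component (projections, MLP, attention) and to input versus output tokens, an estimator of this shape stays interpretable while requiring no experimental runs at all.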
The results underscore the importance of considering computational demands alongside relevance metrics when selecting and deploying LLM-based re-ranking systems, guiding the development of more sustainable and efficient search technologies. Finally, investigating the applicability of E2R-FLOPs to other natural language processing tasks beyond re-ranking warrants further exploration, potentially unlocking new avenues for efficient and sustainable AI development.
👉 More information
🗞 Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
🧠 DOI: https://doi.org/10.48550/arXiv.2507.06223
