Multivector Reranking Achieves Superior Retrieval While Cutting the Cost of Token-Level Indexes

Modern information retrieval systems increasingly rely on learned multivector representations to achieve high accuracy, yet practical implementation is hampered by the computational expense of searching through extensive token-level indexes. Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, et al. from ISTI-CNR and the University of Pisa address this challenge by investigating the inefficiencies of current ‘gather-and-refine’ strategies. Their research demonstrates that replacing the costly token-level gathering phase with a learned sparse retriever significantly reduces the candidate set while maintaining semantic coherence, effectively transforming the pipeline into a more efficient two-stage retrieval process. By integrating recent advances in inference-free sparse retrieval and introducing novel optimisation techniques, the team achieves substantial speedups, up to 1.8x faster, without compromising retrieval quality. This work offers a compelling pathway towards deploying high-performance multivector retrieval in real-world applications by balancing efficiency, memory usage, and overall effectiveness.

Modern search relies on learned multivector representations for strong retrieval performance, but exhaustive token-level retrieval proves costly in real-world applications. The research team tackled this challenge by reproducing state-of-the-art multivector retrieval techniques on two public datasets, clearly illustrating the inefficiencies inherent in token-level gathering strategies. This initial work established a baseline understanding of the field and highlighted the need for a more efficient approach to candidate selection.

Building on this foundation, the study unveils a novel approach that replaces the expensive token-level gather phase with a single-vector document retriever, specifically a learned sparse retriever (LSR). This innovative substitution recasts the retrieval pipeline into a well-established two-stage retrieval architecture, streamlining the process and reducing computational demands. Experiments show that as retrieval latency decreases, query encoding with dual encoders becomes the primary bottleneck, prompting the integration of recent inference-free LSR methods. These methods preserve retrieval effectiveness while substantially reducing query encoding time, further optimizing the system’s performance.
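
To make this two-stage structure concrete, the sketch below shows the overall control flow in Python. The names here (sparse_index, token_embeddings, query.sparse_terms, query.token_vectors) are illustrative assumptions rather than the authors' actual API; the sketch assumes a learned sparse inverted index for the gather stage and a ColBERT-style store of per-token document embeddings for the refine stage.

```python
import numpy as np

def max_sim(q_vecs, d_vecs):
    # Late-interaction (ColBERT-style) scoring: each query token is matched
    # to its most similar document token, and the similarities are summed.
    sims = q_vecs @ d_vecs.T            # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())

def two_stage_retrieve(query, sparse_index, token_embeddings,
                       k_gather=50, k_final=10):
    """Hypothetical pipeline: a learned sparse retriever gathers a small
    document-level candidate set, which exact multivector scoring refines."""
    # Stage 1: document-level gather over an inverted index of learned
    # sparse vectors; assumed to return (doc_id, score) pairs.
    candidates = sparse_index.search(query.sparse_terms, top_k=k_gather)

    # Stage 2: exact multivector reranking restricted to the candidates,
    # instead of searching a token-level index of the whole collection.
    reranked = [
        (doc_id, max_sim(query.token_vectors, token_embeddings[doc_id]))
        for doc_id, _ in candidates
    ]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:k_final]
```

Because the gather stage returns only a few dozen documents, the expensive MaxSim computation runs over a candidate set orders of magnitude smaller than the full token-level index.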

The research establishes that this two-stage approach achieves a speedup exceeding 24x compared to existing state-of-the-art multivector retrieval systems, all while maintaining comparable or even superior retrieval quality. Scientists investigated multiple reranking configurations, carefully balancing efficiency, memory usage, and overall effectiveness. They also introduced two optimization techniques designed to prune low-quality candidates early in the process, improving retrieval efficiency by up to 1.8x without compromising quality. Detailed analysis reveals that token-level gathering suffers from high computational costs and a lack of selectivity, often returning an excessive number of candidates for full scoring.
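
The paper's two specific pruning techniques are not spelled out in this summary, but the general pattern is common in reranking systems: use a cheap score to decide whether a candidate can still enter the top-k before paying for full multivector scoring. A minimal sketch of that pattern, with approx_score and exact_score as assumed callables:

```python
import heapq

def rerank_with_pruning(candidates, approx_score, exact_score, k=10):
    """Hypothetical early-pruning loop: keep a top-k heap of exact scores
    and skip full multivector scoring when a cheap score for a candidate
    cannot beat the current k-th best."""
    top_k = []  # min-heap of (score, doc_id) pairs
    for doc_id in candidates:
        # Cheap estimate first, e.g. from quantized or pooled embeddings.
        bound = approx_score(doc_id)
        if len(top_k) == k and bound <= top_k[0][0]:
            continue  # candidate pruned before full scoring
        score = exact_score(doc_id)  # full multivector (MaxSim) scoring
        if len(top_k) < k:
            heapq.heappush(top_k, (score, doc_id))
        elif score > top_k[0][0]:
            heapq.heapreplace(top_k, (score, doc_id))
    return sorted(top_k, key=lambda pair: pair[0], reverse=True)
```

If the cheap score is a true upper bound on the exact score, pruning of this kind is lossless; if it is only an estimate, it trades a little accuracy for speed, consistent with the paper's framing of pruning low-quality candidates without compromising quality.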

By shifting to a document-oriented gather strategy using learned sparse representations, the team achieved a more semantically coherent candidate set and drastically reduced retrieval costs. Results, as illustrated in accompanying figures, demonstrate a substantial recall gap between traditional BM25 methods and the learned sparse representation, highlighting the benefits of the new approach. This work opens avenues for developing more scalable and effective information retrieval systems capable of handling large datasets with improved speed and accuracy.

Learned Sparse Retrieval for Efficient Multivector Search

The study addressed limitations in modern multivector retrieval systems, specifically the computational cost of exhaustive token-level retrieval. Researchers initially reproduced results from several state-of-the-art multivector methods using two publicly available datasets to establish a baseline understanding of current performance and identify inefficiencies in existing gather-and-refine strategies. This reproducibility work highlighted the substantial cost associated with token-level gathering, which necessitates searching over large indexes and frequently overlooks highly relevant documents. To overcome these challenges, the team recast the pipeline as a well-established two-stage retrieval architecture, replacing the token-level gather phase with a learned sparse retriever (LSR) operating on single-vector document representations.

This innovative approach generates a smaller, more semantically coherent candidate set, effectively recasting the pipeline into a well-established two-stage framework. As retrieval latency decreased, the study identified query encoding with dual neural encoders as the primary computational bottleneck. Scientists integrated recent inference-free LSR methods to preserve retrieval effectiveness while significantly reducing query encoding time, demonstrating a crucial optimisation for real-world applications. Further refinement involved investigating multiple reranking configurations to balance efficiency, memory usage, and retrieval quality.
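
Inference-free LSR methods avoid running a neural encoder at query time by pushing all learned expansion and weighting to the document side of the index; the query is reduced to its raw terms with precomputed weights. A minimal sketch of that idea follows; the tokenization and weighting choices here are assumptions for illustration, not the specific methods evaluated in the paper:

```python
from collections import Counter

def inference_free_query(query_text, vocab, idf=None):
    """Hypothetical inference-free sparse query representation: the query
    is tokenized and weighted with precomputed statistics (here, IDF),
    so no neural forward pass is needed at query time. All learned
    expansion and weighting lives on the document side of the index."""
    tokens = [t for t in query_text.lower().split() if t in vocab]
    counts = Counter(tokens)
    if idf is None:
        return {t: float(c) for t, c in counts.items()}  # uniform weights
    return {t: c * idf[t] for t, c in counts.items()}
```

Because this step is a tokenizer pass plus a dictionary lookup, query encoding drops from a neural-network forward pass to microseconds, removing the bottleneck identified above.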

The research introduced two novel optimisation techniques designed to prune low-quality candidates early in the process, improving retrieval efficiency by up to 1.8x without compromising accuracy. Empirical results demonstrate that this two-stage approach achieves over a 24-fold speedup compared to existing state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality. This work showcases a significant advancement in retrieval methodology, enabling more efficient and effective information access.

Learned Sparse Retrieval Boosts Information Access

Scientists achieved a significant breakthrough in information retrieval systems by fundamentally altering the initial candidate selection process. The research team demonstrated that replacing a token-level gather phase with a single-vector document retriever, specifically a learned sparse retriever (LSR), dramatically reduces computational cost while maintaining retrieval quality. Experiments revealed that this new two-stage approach bypasses the inefficiencies inherent in existing ‘gather-and-refine’ strategies commonly used with multivector retrieval methods. This recasting of the pipeline leverages the strengths of established two-stage retrieval techniques, offering a more streamlined and effective solution.

The study meticulously reproduced results from state-of-the-art multivector retrieval methods on the MS MARCO-v1.2 and MS MARCO-v1 datasets, clearly illustrating the inefficiencies of token-level gathering. Data shows that current systems struggle with scalability due to the expanding index cardinality associated with token-level searches, which is often one or two orders of magnitude larger than the number of documents. Replacing this with a document-oriented approach significantly reduces the search space, leading to substantial performance gains. Measurements confirm that the team’s method achieves over a 24x speedup compared to existing multivector systems, all while preserving comparable or superior retrieval quality.
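
A back-of-the-envelope calculation shows why this matters. With illustrative numbers (assumed here, not taken from the paper), a collection of 8.8 million passages averaging 60 tokens each yields a token-level index of roughly 530 million vectors, versus 8.8 million entries for a document-level index:

```python
# Illustrative back-of-the-envelope (assumed numbers, not from the paper):
# a token-level index stores one vector per token, while a document-level
# index stores one representation per document.
num_docs = 8_800_000          # e.g., an MS MARCO-scale passage collection
avg_tokens_per_doc = 60       # assumed average passage length

token_level_entries = num_docs * avg_tokens_per_doc   # 528,000,000
doc_level_entries = num_docs                          # 8,800,000
print(token_level_entries // doc_level_entries)       # 60x more index entries
```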

Sparse Retrieval Boosts Multivector Pipeline Efficiency

Recent advances in multivector representations have significantly improved retrieval effectiveness; however, practical application is hampered by the computational expense of exhaustive token-level searches. Researchers reproduced several state-of-the-art multivector retrieval methods to comprehensively assess the field and identified inefficiencies inherent in token-level gathering strategies. This work demonstrates that replacing token-level gathering with a document-level approach, utilising learned sparse retrieval, creates a more efficient two-stage retrieval pipeline. The investigation reveals that this revised pipeline requires substantially fewer candidates for reranking, between 20 and 50, to achieve comparable or improved effectiveness compared to existing gather-based systems, while also markedly reducing query latency.

Further analysis explored quantization techniques and pruning strategies, confirming that quantization effectively reduces memory usage and adaptive reranking consistently enhances efficiency. The resulting approach achieves speedups of up to 24x over current state-of-the-art multivector retrieval systems. The authors acknowledge that their analysis does not fully decompose the internal costs of gather and refine phases, representing a limitation of the study. Future work could focus on a more granular breakdown of these costs to further optimise performance. Additionally, while the presented methods demonstrate strong performance, the benefits of certain techniques, such as joint multi-probe quantization, require supervised training data. This highlights a potential area for future research exploring unsupervised or self-supervised approaches to enhance generalizability.
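
The quantization schemes evaluated in the paper, including joint multi-probe quantization, are not reproduced here; as a baseline intuition, even a simple unsupervised per-dimension scalar quantizer illustrates the memory trade-off, mapping float32 embeddings to one byte per dimension:

```python
import numpy as np

def scalar_quantize(vectors):
    """Minimal per-dimension scalar quantization sketch (not the paper's
    scheme): map float32 embeddings to uint8, cutting memory 4x at the
    cost of a small approximation error during candidate scoring."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                 # guard against constant dims
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

# Example: 1M x 128-dim float32 embeddings occupy ~512 MB; the uint8
# codes occupy ~128 MB, with lo/scale adding only 2 x 128 floats.
```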

👉 More information
🗞 Multivector Reranking in the Era of Strong First-Stage Retrievers
🧠 arXiv: https://arxiv.org/abs/2601.05200
