The increasing demand for rapid data processing fuels a need for efficient data structures on modern graphics processing units, but current GPU hash tables often lack the features required for large-scale applications. Hunter McCoy and Prashant Pandey, from Northeastern University, address this challenge with WarpSpeed, a new library of high-performance concurrent GPU hash tables. This work introduces a unified benchmarking framework alongside implementations of eight state-of-the-art designs, offering a rich application programming interface for developers. By implementing novel optimization techniques to reduce concurrency overhead and demonstrating real-world impact through integration into three downstream applications, WarpSpeed provides both new insights into concurrent GPU hash table design and practical guidance for building efficient, scalable data structures.
Hash Table Collision Resolution Strategies Explained
The work surveys a broad body of hash table research, covering collision resolution techniques and their application in modern computing systems. That literature ranges from foundational work to designs for high-performance parallel processing, including cuckoo hashing, optimizations for DNA k-mer counting, and GPU-based sparsification techniques. It also covers high-performance distributed-memory parallel hash tables, counting filters for efficient data management, massively parallel multi-GPU architectures, efficient join algorithms for large database tables, and the performance characteristics of GPUs and CPUs for database analytics. Related studies examine sparse tensor representations and tools, data placement and query execution in heterogeneous CPU-GPU systems, dynamic hash tables on GPUs, and unified memory models for heterogeneous computing.
Concurrent GPU Hash Tables and Benchmarking
The authors developed WarpSpeed, a comprehensive library of high-performance concurrent GPU hash tables, together with a unified benchmarking framework that assesses the correctness and scalability of eight state-of-the-art designs. An adversarial workload verifies hash table correctness under concurrency and shows that concurrent insertions and deletions require external synchronization. To reduce concurrency overhead, the team implemented fingerprint-based metadata, a compact key representation previously used in CPU hash tables, which cuts the number of cache line probes per operation. GPU vector loads enable lock-free queries with only 1% overhead. The authors also show that stability, the guarantee that a key remains in its initial location after insertion, is crucial for downstream applications: it removes the need for locking and improves performance by over 10x.
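To make the fingerprint idea concrete, here is a minimal CPU-side sketch in Python. It is not WarpSpeed's implementation: the class name `FingerprintTable`, the bucket width, the one-byte tag, and the use of BLAKE2b as the hash are all assumptions chosen for illustration. The point it shows is that a query scans small tags first and compares a full key only when a tag matches, so most misses never touch the stored keys.

```python
import hashlib

BUCKET_SLOTS = 8  # hypothetical bucket width, not taken from the paper


def _hash64(key: int) -> int:
    """Deterministic 64-bit hash of a non-negative integer key."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class FingerprintTable:
    """Bucketed table that keeps a 1-byte fingerprint beside each key."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets
        # Compact metadata: one small tag per slot, scanned before any key.
        self.tags = [[None] * BUCKET_SLOTS for _ in range(num_buckets)]
        self.slots = [[None] * BUCKET_SLOTS for _ in range(num_buckets)]
        self.key_compares = 0  # counts full-key comparisons made by queries

    def _locate(self, key: int):
        h = _hash64(key)
        return h % self.num_buckets, (h >> 56) & 0xFF  # bucket, fingerprint

    def insert(self, key: int, value: int) -> bool:
        """Place the pair in the first free slot (no duplicate check here)."""
        bucket, tag = self._locate(key)
        for i in range(BUCKET_SLOTS):
            if self.tags[bucket][i] is None:
                self.tags[bucket][i] = tag
                self.slots[bucket][i] = (key, value)
                return True
        return False  # bucket full; a real design would probe elsewhere

    def query(self, key: int):
        bucket, tag = self._locate(key)
        for i in range(BUCKET_SLOTS):
            if self.tags[bucket][i] != tag:
                continue  # tag mismatch: skip without reading the key
            self.key_compares += 1
            stored_key, value = self.slots[bucket][i]
            if stored_key == key:
                return value
        return None


if __name__ == "__main__":
    table = FingerprintTable(num_buckets=64)
    stored = [k for k in range(256) if table.insert(k, k * 10)]
    table.key_compares = 0
    misses = sum(table.query(k) is None for k in range(10_000, 11_000))
    print(misses, "misses cost", table.key_compares, "full-key comparisons")
```

With a one-byte tag, a non-matching slot passes the tag check by chance only about once in 256 times, which is one intuition for why such metadata helps negative queries in particular.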
Experiments revealed that IcebergHT, P2HT, and DoubleHT achieve the fastest insertion speeds, exceeding alternatives by 21%, while DoubleHT leads in query performance by up to 20% and CuckooHT leads in deletion speed by up to 31%. Supporting full concurrency costs between 1% and 25% compared with bulk synchronous parallel execution. Careful tuning of tile and bucket sizes yielded a performance increase of over 1300%, which shows how strongly memory access patterns affect throughput. Metadata-enabled tables were up to 29% faster on aging and caching workloads, while DoubleHT and P2HT were up to 50% faster on sparse tensor contractions due to their stability and adaptability.
Fast Concurrent GPU Hash Tables
WarpSpeed implements eight state-of-the-art hash table designs behind a rich API for modern GPU applications, and its benchmarking framework analyzes their performance. IcebergHT, P2HT, and DoubleHT achieve the fastest insertion speeds, exceeding alternatives by over 21 percent, while DoubleHT delivers the fastest queries, up to 20 percent faster than other designs. CuckooHT achieves the fastest deletions, up to 31 percent faster than alternatives.
Full concurrency incurs between 1 and 25 percent overhead compared to bulk synchronous parallel execution. Tile and bucket size choices have a significant impact on performance, with the best configurations achieving over 1300 percent higher throughput. GPU vector loads enable lock-free queries with only 1 percent overhead. A fingerprint-based metadata scheme improves performance at high load factors and triples negative query performance. Metadata-enabled tables perform best under aging and caching workloads, achieving up to 29 percent speedup.
DoubleHT and P2HT, benefiting from stability, are up to 50 percent faster on sparse tensor contractions. DoubleHT and DoubleHT with metadata optimization achieve the highest performance on YCSB workloads, optimized for high load factors. These findings guide the development of new optimizations, such as lock-free queries and fingerprint metadata, to design fully concurrent GPU hash tables more efficiently. The research provides a clear understanding of the current GPU hash table landscape and offers valuable insights for developing numerous other GPU-based data structures.
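Stability is easiest to see by contrast. The Python sketch below compares two toy insertion schemes on a five-slot table: open addressing with double hashing, used here only as a generic example of a stable scheme, and a single-table cuckoo scheme that evicts resident keys. The function names and the hash functions (`key % n` and `(key // n) % n`) are assumptions for illustration and do not reproduce the paper's DoubleHT or CuckooHT.

```python
from typing import List, Optional

Table = List[Optional[int]]


def stable_insert(table: Table, key: int) -> Optional[int]:
    """Double hashing: a key is written once and never moved afterwards."""
    n = len(table)  # assumed prime so the probe sequence visits every slot
    step = 1 + (key // n) % (n - 1)
    slot = key % n
    for _ in range(n):
        if table[slot] is None:
            table[slot] = key
            return slot  # this index stays valid for the key's lifetime
        slot = (slot + step) % n
    return None  # table full


def cuckoo_insert(table: Table, key: int, max_kicks: int = 8) -> bool:
    """Cuckoo hashing: a new key may evict a resident key to its other slot."""
    n = len(table)
    slot = key % n
    for _ in range(max_kicks):
        if table[slot] is None:
            table[slot] = key
            return True
        # Swap in the new key and carry the evicted one to its alternate slot.
        table[slot], key = key, table[slot]
        h1, h2 = key % n, (key // n) % n
        slot = h2 if slot == h1 else h1
    return False  # gave up; a real table would rehash or grow


if __name__ == "__main__":
    stable: Table = [None] * 5
    handle = stable_insert(stable, 1)
    stable_insert(stable, 11)          # collides with key 1, probes onward
    print("stable slot of key 1:", handle, "->", stable.index(1))

    cuckoo: Table = [None] * 5
    cuckoo_insert(cuckoo, 1)
    before = cuckoo.index(1)
    cuckoo_insert(cuckoo, 11)          # collides with key 1 and evicts it
    print("cuckoo slot of key 1:", before, "->", cuckoo.index(1))
```

In the stable table, a caller can keep the returned slot index and later update the value in place. In the cuckoo table, the same index can end up pointing at a different key after a later insertion, so a caller that holds on to a location needs some form of locking.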
Concurrent GPU Hash Tables Outperform Alternatives
This research presents WarpSpeed, a library of high-performance concurrent GPU hash tables, and a framework for evaluating their designs across diverse workloads. The team implemented and assessed eight state-of-the-art hash table designs, demonstrating the importance of external synchronization for supporting concurrent insertions and deletions. Results indicate that DoubleHT and P2HT(M) consistently achieve superior performance compared to other designs, particularly when incorporating metadata variants for aging and caching applications. The study quantifies the overhead associated with concurrency and introduces optimizations, including fingerprint-based metadata and GPU-specific instructions for lock-free queries, which significantly reduce this cost. While concurrency introduces some overhead compared to traditional bulk synchronous parallel models, these enhancements enable stable, concurrent, and composable operations critical for modern data processing. This work provides new insights into the design of scalable, concurrent GPU data structures and offers practical guidance for developers building high-throughput applications.
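The need for external synchronization can be illustrated with a small deterministic simulation in Python. It is not the paper's adversarial workload. Generators stand in for threads, a `yield` marks the point where another thread may run, and every name here is hypothetical. The interleaving shown is one plausible failure: an insert checks that a key is absent, a delete then frees an earlier slot, and a second insert of the same key lands in that slot before the first insert finishes.

```python
from typing import Iterator, List, Optional

Bucket = List[Optional[int]]


def insert_steps(bucket: Bucket, key: int) -> Iterator[None]:
    """Unsynchronized insert; the yield marks where another thread can run."""
    if key in bucket:              # phase 1: confirm the key is absent ...
        return
    slot = bucket.index(None)      # ... and pick the first empty slot
    yield                          # (assumes the bucket has a free slot)
    if bucket[slot] is None:       # phase 2: claim the slot, like a CAS
        bucket[slot] = key


def delete(bucket: Bucket, key: int) -> None:
    """Remove the key if present, freeing its slot."""
    if key in bucket:
        bucket[bucket.index(key)] = None


def finish(op: Iterator[None]) -> None:
    """Run an operation's remaining phases to completion."""
    for _ in op:
        pass


def interleaved() -> Bucket:
    """Two inserts of key 42 race with a delete of key 7."""
    bucket: Bucket = [7, None]
    a = insert_steps(bucket, 42)
    next(a, None)                  # A: 42 absent, picks slot 1, then pauses
    delete(bucket, 7)              # slot 0 is freed
    finish(insert_steps(bucket, 42))  # B: 42 still absent, takes slot 0
    finish(a)                      # A resumes and writes slot 1
    return bucket


def serialized() -> Bucket:
    """Same operations, each run to completion as if holding a bucket lock."""
    bucket: Bucket = [7, None]
    finish(insert_steps(bucket, 42))
    delete(bucket, 7)
    finish(insert_steps(bucket, 42))
    return bucket


if __name__ == "__main__":
    print("interleaved:", interleaved())
    print("serialized: ", serialized())
```

The interleaved run leaves key 42 in the bucket twice, even though each slot write is individually guarded. Running every operation to completion, as a per-bucket lock would enforce, leaves a single copy.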
👉 More information
🗞 WarpSpeed: A High-Performance Library for Concurrent GPU Hash Tables
🧠 ArXiv: https://arxiv.org/abs/2509.16407
