Hive Hash Table Achieves 95% Efficiency with Warp-cooperative, Dynamically Resizable GPU Data Storage

Efficiently managing and accessing data remains a critical challenge in modern computing, and hash tables are fundamental to solving this problem, yet existing designs for graphics processing units often struggle with the demands of concurrent access and fluctuating data volumes. Md Sabbir Hossain Polak, David Troendle, and Byunghyun Jang, all from the University of Mississippi, present a new approach with their development of the Hive hash table, a dynamically resizable system designed to overcome these limitations. This innovative hash table achieves high performance through warp-cooperative techniques, allowing multiple processing threads to work together and minimise contention during updates and lookups. The team’s design sustains remarkably high load factors, up to 95%, and delivers significantly improved throughput, achieving up to two times the performance of existing GPU hash tables, demonstrating a substantial advance for GPU-accelerated data processing and unlocking new possibilities for data-intensive applications.

Hive hash table is a high-performance, warp-cooperative, dynamically resizable GPU hash table that adapts to varying workloads without global rehashing. This work makes three key contributions: a cache-aligned packed bucket layout storing key-value pairs as 64-bit words, enabling coalesced memory access and atomic updates; warp-synchronous concurrency protocols, Warp-Aggregated-Bitmask-Claim (WABC) and Warp-Cooperative Match-and-Elect (WCME), which reduce contention to one atomic operation per warp while ensuring lock-free progress; and a load-factor-aware dynamic resizing strategy that expands or contracts capacity in warp-parallel batches.
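To make the packed bucket layout concrete, here is a minimal CUDA C++ sketch of how a cache-aligned, 16-slot bucket of 64-bit key-value words could be declared and how a slot could be claimed with a single compare-and-swap. The identifiers (Bucket, SLOTS_PER_BUCKET, EMPTY_SLOT, try_claim) and the 32-bit key and value widths are assumptions for illustration, not the authors' actual code.

```cuda
#include <cstdint>

// 16 slots x 8 bytes = 128 bytes, so one bucket occupies a cache-aligned chunk
// that a warp can read with coalesced accesses.
constexpr int SLOTS_PER_BUCKET = 16;
constexpr uint64_t EMPTY_SLOT  = 0xFFFFFFFFFFFFFFFFull;   // sentinel for an unused slot

struct __align__(128) Bucket {
    uint64_t slots[SLOTS_PER_BUCKET];                      // packed key|value words
};

// Pack an assumed 32-bit key and 32-bit value into one 64-bit word.
__device__ __forceinline__ uint64_t pack(uint32_t key, uint32_t val) {
    return (static_cast<uint64_t>(key) << 32) | val;
}

__device__ __forceinline__ uint32_t unpack_key(uint64_t w) { return static_cast<uint32_t>(w >> 32); }
__device__ __forceinline__ uint32_t unpack_val(uint64_t w) { return static_cast<uint32_t>(w); }

// Claim an empty slot with one 64-bit atomic compare-and-swap; true on success.
__device__ bool try_claim(Bucket* b, int slot, uint32_t key, uint32_t val) {
    uint64_t prev = atomicCAS(reinterpret_cast<unsigned long long*>(&b->slots[slot]),
                              EMPTY_SLOT, pack(key, val));
    return prev == EMPTY_SLOT;
}
```

Because key and value travel in one word, an insertion or update never exposes a half-written pair to concurrent readers.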

GPU Hash Table With Warp-Synchronous Operations

This document details the design and performance of Hive hash table, a new GPU-based hash table implementation prioritizing high throughput and scalability through a warp-synchronous design, lock-free operations, and a dynamic resizing strategy. Key features include warp-level primitives for efficient synchronization and data access, and a claim-then-commit technique that reduces contention during insertions. Bounded eviction limits the cost of handling collisions and evictions, while dynamic resizing adapts to changing data sizes with a fast resizing mechanism. A stash fallback handles overflow insertions when no vacant slots are available.
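As one illustration of the warp-level data access this design relies on, the sketch below shows a hypothetical warp-cooperative lookup over a 16-slot bucket of packed 64-bit words: each lane inspects one slot, a single warp vote locates a matching key, and one lane is elected to broadcast the value. The layout and names are assumptions carried over from the sketch above, not the paper's API.

```cuda
#include <cstdint>

constexpr int SLOTS_PER_BUCKET = 16;

// All 32 lanes of a warp must call this together; the warp looks up one key.
__device__ bool warp_lookup(const uint64_t* bucket, uint32_t key, uint32_t* out_val) {
    const unsigned full_mask = 0xFFFFFFFFu;
    int lane = threadIdx.x & 31;

    // Lanes 0..15 each load one packed slot; the remaining lanes vote "no match".
    uint64_t word = (lane < SLOTS_PER_BUCKET) ? bucket[lane] : 0;
    bool hit = (lane < SLOTS_PER_BUCKET) && (static_cast<uint32_t>(word >> 32) == key);

    unsigned hit_mask = __ballot_sync(full_mask, hit);
    if (hit_mask == 0) return false;                  // key not in this bucket

    int src = __ffs(hit_mask) - 1;                    // elect the lowest matching lane
    uint64_t found = __shfl_sync(full_mask, word, src);
    *out_val = static_cast<uint32_t>(found);          // lower 32 bits hold the value
    return true;
}
```

The entire probe costs one coalesced bucket read plus two warp-wide register exchanges, with no atomics or locks on the lookup path.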

Experiments demonstrate that Hive hash table achieves up to four times higher throughput compared to existing GPU hash table implementations and maintains performance up to a 95% load factor. Dynamic resizing is three to four times faster than previous approaches, and insertion overhead remains low until near saturation. The implementation, written in CUDA C++, employs a combination of open addressing and a stash for overflow insertions, relies on warp-level primitives for efficient synchronization, and utilizes universal hashing (MurmurHash, CityHash) and CRC for collision resolution. This research delivers a novel GPU hash table design prioritizing throughput and scalability, a fast and efficient dynamic resizing strategy, and a detailed performance evaluation demonstrating significant improvements over existing approaches. The authors plan to release the Hive hash table library and benchmark suite as open source.
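The hashing step can be pictured with a Murmur-style finalizer that maps a key to a bucket index. The paper draws on MurmurHash, CityHash, and CRC variants; the particular mixer and the modulo bucket mapping below are only a hedged sketch, not the authors' exact functions.

```cuda
#include <cstdint>

// Murmur-style 32-bit finalizer (the widely used fmix32 mixing steps).
__host__ __device__ __forceinline__ uint32_t murmur_fmix32(uint32_t x) {
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

// Map a key to its home bucket; open addressing then probes from this bucket,
// with the stash absorbing insertions that find no vacant slot.
__host__ __device__ __forceinline__ uint32_t bucket_index(uint32_t key, uint32_t num_buckets) {
    return murmur_fmix32(key) % num_buckets;
}
```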

Hive Hash Table Boosts GPU Throughput

The research team has developed Hive hash table, a new approach to data storage on GPUs that significantly improves performance for data-intensive applications. Overcoming limitations in existing GPU hash table implementations, Hive hash table focuses on warp-level cooperation and dynamic resizing. The core of this achievement lies in a cache-aligned packed bucket layout, storing key-value pairs as 64-bit words to enable efficient memory access and single-instruction atomic updates. Experiments on an RTX 4090 GPU demonstrate that Hive hash table sustains load factors up to 95%, delivering 1.5 to 2 times higher throughput than state-of-the-art GPU hash tables such as SlabHash, DyCuckoo, and WarpCore under mixed workloads involving insertions, deletions, and lookups.

Specifically, the team measured a peak performance of 3.5 billion updates per second and nearly 4 billion lookups per second, showcasing the scalability and efficiency of the new approach for GPU-accelerated data processing. This breakthrough is achieved through a four-step insertion strategy that separates lock-free fast paths from bounded synchronization phases, and a dynamic resizing strategy that expands or contracts capacity in warp-parallel batches.

The researchers developed two key synchronization protocols: Warp-Aggregated-Bitmask-Claim (WABC) and Warp-Cooperative Match-and-Elect (WCME). WABC reduces contention by letting an entire warp claim a slot with a single atomic operation in constant time, while WCME elects a single lane to perform critical updates, eliminating redundant memory traffic and maximizing GPU utilization. Analysis of hash function performance revealed that CRC functions consistently achieve excellent spread, with a Collision Speedup Ratio of approximately 1 across all scales, while BitHash variants offer a lightweight trade-off between entropy and throughput. These advancements collectively deliver a highly efficient and scalable hash table implementation for modern GPUs, paving the way for faster and more responsive data-intensive applications.
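A minimal sketch of the warp-aggregated claim idea behind WABC, using the same assumed 16-slot packed bucket as earlier: the warp votes once on which slots look empty, the lane matching the lowest candidate slot is elected, and only that lane issues the atomic compare-and-swap, so the other 31 lanes generate no redundant atomic traffic. All identifiers are illustrative, not the authors' implementation.

```cuda
#include <cstdint>

constexpr uint64_t EMPTY_SLOT = 0xFFFFFFFFFFFFFFFFull;

// All 32 lanes call this together; the warp cooperatively inserts one packed
// key-value word. Returns the claimed slot index, or -1 if the bucket is full.
__device__ int warp_claim_slot(uint64_t* bucket, uint64_t packed_kv) {
    const unsigned full_mask = 0xFFFFFFFFu;
    int lane = threadIdx.x & 31;
    bool looks_empty = (lane < 16) && (bucket[lane] == EMPTY_SLOT);

    // One warp vote builds a bitmask of candidate slots.
    unsigned free_mask = __ballot_sync(full_mask, looks_empty);
    while (free_mask != 0) {
        int slot = __ffs(free_mask) - 1;               // lowest-numbered candidate
        uint64_t prev = EMPTY_SLOT;
        if (lane == slot) {                            // elected leader: one atomic per attempt
            prev = atomicCAS(reinterpret_cast<unsigned long long*>(&bucket[slot]),
                             EMPTY_SLOT, packed_kv);
        }
        prev = __shfl_sync(full_mask, prev, slot);     // broadcast the CAS outcome
        if (prev == EMPTY_SLOT) return slot;           // claim succeeded for the whole warp
        free_mask &= ~(1u << slot);                    // slot taken concurrently; try the next one
    }
    return -1;                                         // no vacancy: fall back to eviction or the stash
}
```

If every candidate is lost to a concurrent writer, the caller would move on to the bounded eviction or stash fallback described earlier, so progress never depends on a lock.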

GPU Hash Table Performance with Hive

The research team presents Hive hash table, a new approach to data storage and retrieval on graphics processing units (GPUs). This work addresses limitations in existing GPU hash table implementations, specifically challenges with concurrent updates, high load factors, and irregular memory access. The team’s design achieves significant performance gains through a combination of techniques, including a cache-aligned packed bucket layout, warp-synchronous concurrency protocols, and a load-factor-aware dynamic resizing strategy. Experimental results demonstrate that Hive hash table sustains load factors up to 95% while delivering 1.5 to 2 times higher throughput than state-of-the-art GPU hash tables under mixed workloads.

On balanced workloads, the system achieves update rates of 3.5 billion per second and lookup rates approaching 4 billion per second, demonstrating substantial scalability and efficiency for GPU-accelerated data processing. Profiling reveals that most insertions complete rapidly and eviction costs remain bounded, validating the efficiency of the concurrency pipeline. The authors recommend initiating table expansion at a load factor of 0.9 to maintain optimal performance, since throughput degrades near full capacity as overflow handling becomes the dominant cost. The team intends to release the Hive hash table library and benchmark suite as open source to encourage reproducibility and further research in GPU data structures.
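Acting on that 0.9 recommendation can be as simple as a host-side check of the table's occupancy before launching an expansion pass. The handle structure and counter below are hypothetical, meant only to show where such a threshold test would sit; the paper's own resizing machinery then migrates entries in warp-parallel batches.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical handle: total slot count plus a device-side occupancy counter
// that successful insertions would increment (e.g., with atomicAdd).
struct HashTableHandle {
    uint64_t  capacity;      // slots currently allocated
    uint64_t* d_occupancy;   // device counter of occupied slots
};

// Copy the counter back and compare the load factor against the expansion
// threshold (0.9 by default, following the recommendation above).
bool should_expand(const HashTableHandle& t, double threshold = 0.9) {
    uint64_t occupied = 0;
    cudaMemcpy(&occupied, t.d_occupancy, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    double load_factor = static_cast<double>(occupied) / static_cast<double>(t.capacity);
    return load_factor >= threshold;
}
```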

👉 More information
🗞 Hive Hash Table: A Warp-Cooperative, Dynamically Resizable Hash Table for GPUs
🧠 ArXiv: https://arxiv.org/abs/2510.15095

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
