Scientists have compressed nine months of X-ray data analysis into less than four hours using a new workflow called XANI, a development poised to accelerate materials science research. The breakthrough centers on processing the massive datasets, up to 42 terabytes in size, generated by facilities like SwissFEL, where experiments can produce up to 1 million X-ray shots per second captured on 35-megapixel cameras. The NVIDIA team demonstrated this speedup using 32 NVIDIA GB200 Grace Blackwell Superchips together with newly developed cuPyNumeric-based libraries, including LMFIT and multithreaded Hierarchical Data Format 5 (HDF5). The accelerated analysis preserves the precision of the acquired data and lets researchers move beyond post-experiment analysis to steer scientific experiments in real time, unlocking insights into materials at the atomic level. By contrast, traditional CPU-bound pipelines often processed only 10% of a dataset during an experiment.
XFEL Facilities Enable Ultrafast Material Dynamics Research
The ability to observe atomic motion in real time has improved now that researchers can process data from X-ray free-electron lasers (XFELs) at far higher speeds. Facilities like SwissFEL, the SPring-8 Angstrom Compact free-electron LAser (SACLA), and others generate datasets rich in physical information about the fastest microscopic movements of electrons and atoms, but extracting meaningful physics from these experiments has traditionally required more than nine months of computation. That bottleneck is now being overcome by the Accelerated X-ray Analysis for Nanoscale Imaging (XANI) workflow, which has compressed analysis timelines from nearly a year to under four hours. Traditional CPU-bound pipelines often required manual parameter tuning and subsampling, processing only 10% of a dataset during an experiment. XANI addresses these challenges through a combination of optimized libraries and a shift toward GPU-centric distributed computing, an acceleration that relies on a specific hardware and software configuration.
Specifically, the team achieved a 165x acceleration in I/O throughput by combining GPUDirect Storage (GDS), which delivers higher throughput than conventional POSIX reads routed through the host CPU, with multithreaded HDF5. “GDS offers a new storage technology that enables data to be read into the GPU bypassing the host CPU and memory,” the team notes, eliminating the bottlenecks associated with data staging. The XANI architecture supports migration from a CPU-orchestrated workflow to a GPU-centric distributed model, enabling live feedback and automated experimental steering. This is not merely a technical feat; it fundamentally alters the scientific process. By minimizing the time-to-solution for high-resolution X-ray material characterization, XANI allows researchers to interact with experiments in real time, steering investigations and accelerating discovery across disciplines from quantum physics to materials chemistry. The project’s adoption by diverse research communities underscores the broad applicability of CUDA Python and distributed computing in advancing scientific frontiers.
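To make that data path concrete, here is a minimal sketch of a GDS-style read using KvikIO, the Python bindings for cuFile. The file path and buffer size are hypothetical, and KvikIO transparently falls back to a POSIX read plus device copy on systems without GDS:

```python
import cupy as cp
import kvikio

# Destination buffer allocated directly in GPU memory.
buf = cp.empty(35_000_000, dtype=cp.uint16)  # e.g., one flattened detector frame

# cuFile reads from storage straight into the GPU buffer,
# bypassing the host CPU and memory when GDS is available.
f = kvikio.CuFile("frames/run0001.bin", "r")  # hypothetical path
nbytes = f.read(buf)  # blocking read into device memory
f.close()
```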
XANI Workflow Accelerates Data Processing with GB200 Superchips
The ever-growing volume of data generated at large-scale X-ray free-electron laser (XFEL) facilities has long been a challenge for materials science research. Analysis times stretching to nine months significantly delayed scientific discovery, restricting researchers to post-experiment analysis rather than real-time feedback and experimental steering. The challenge wasn’t a lack of data, but the inability to efficiently extract meaningful physics from the immense volume. Developed by a team including Irina Demeshko, Supun Kamburugamuve, Kibibi Moseley, and Quynh L. Nguyen, XANI has demonstrated the ability to compress the analysis of 42 terabytes of X-ray data from nine months to less than four hours, using a specific hardware and software configuration demonstrated on 32 NVIDIA GB200 Grace Blackwell Superchips. The team’s work on characterizing quantum materials served as a proving ground for this new approach, successfully reconstructing phonon dispersion from ultrafast experiments in a fraction of the previously required time.
The cuPyNumeric libraries at the core of XANI optimize both numerical computation and data input/output. On the I/O side, GDS moves data from storage directly into GPU memory, bypassing the host CPU and system memory. For computation, the XANI architecture leverages data layout optimization and distributed execution, partitioning arrays across a cluster’s aggregate memory. The NVIDIA team accelerated the XANI workflow 43x on a single GPU of a GB200 Grace Blackwell Superchip and 1,000x on 64 GPUs, demonstrating the scalability of the solution.
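This drop-in model is what cuPyNumeric is designed for: unmodified NumPy-style code whose arrays the Legate runtime implicitly partitions across however many GPUs and nodes the job was launched with. A hedged sketch, with hypothetical shapes and assuming the operations used are among those cuPyNumeric covers:

```python
import cupynumeric as np  # drop-in replacement for NumPy

# A large stack of pixel traces; the runtime partitions this array
# across the aggregate GPU memory of the cluster.
traces = np.random.random((1_000_000, 128))

baseline = traces.mean(axis=1, keepdims=True)  # distributed reduction
signal = traces - baseline                     # element-wise, no explicit loops
print(signal.shape)
```

Launched through the Legate driver (for example, `legate --gpus 4 script.py`), the same script scales from one GPU to many without code changes.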
Each tile’s compute (fitting 16 pixel traces) is too small to saturate a modern GPU, so execution time is dominated by fixed per-tile overhead: task dispatch, data movement, and setup.
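Batching is the remedy: stacking every tile into one large array so a single kernel launch amortizes that fixed overhead. Below is a minimal sketch of the idea using a simple per-trace linear model and hypothetical shapes. This is plain NumPy, not the project’s actual fitting code; swapping the import for cuPyNumeric would run the same batched solve on GPUs, assuming operator coverage:

```python
import numpy as np

n_tiles, n_pixels, n_samples = 10_000, 16, 256
t = np.linspace(0.0, 1.0, n_samples)
traces = np.random.random((n_tiles, n_pixels, n_samples))  # all tiles at once

# Shared design matrix with columns [1, t]: one fit model for every trace.
A = np.stack([np.ones_like(t), t], axis=1)        # (n_samples, 2)

# Closed-form least squares, coeffs = (A^T A)^{-1} A^T y, broadcast over
# every (tile, pixel) pair in a single batched matmul instead of a loop
# of tiny 16-trace fits.
AtA_inv = np.linalg.inv(A.T @ A)                  # (2, 2), symmetric
coeffs = traces @ A @ AtA_inv                     # (n_tiles, n_pixels, 2)
```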
cuPyNumeric Libraries Enhance GPU-Based Numerical Computation
The pursuit of faster data analysis in materials science led NVIDIA researchers to develop a suite of cuPyNumeric libraries that significantly accelerate X-ray data processing. Previously, analyzing 42 terabytes of data from these experiments could require nine months of computational time. On the storage side, GDS bypasses the CPU and system memory during data transfer, allowing data to be read directly into the GPU at far higher throughput than conventional POSIX reads. “Fully utilizing GDS on modern clusters requires tuning cuFile configuration parameters to enable higher read parallelism,” the researchers note, highlighting the importance of careful optimization. Addressing the single-threaded nature of the HDF5 library, the team developed multithreading support, integrated it with cuPyNumeric, and is actively preparing it for inclusion in the HDF5 main branch.
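KvikIO exposes this kind of read-parallelism tuning through environment variables that must be set before the library is imported; the values below are illustrative only, and cuFile itself offers analogous knobs through its cufile.json configuration file:

```python
import os

# Illustrative settings, not the team's actual configuration.
os.environ["KVIKIO_NTHREADS"] = "16"                    # worker threads per read
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)  # bytes per subrequest
os.environ["KVIKIO_GDS_THRESHOLD"] = str(1024 * 1024)   # minimum size for GDS path

import cupy as cp
import kvikio

buf = cp.empty(1 << 28, dtype=cp.uint8)       # 256 MiB GPU buffer
f = kvikio.CuFile("frames/run0001.bin", "r")  # hypothetical path
f.read(buf)  # one logical read, internally split into parallel subrequests
f.close()
```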
Reading directly into GPU memory eliminates the overhead of staging data through host memory and issuing a separate device copy.
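For contrast, here is the conventional staged path that a direct read replaces (same hypothetical file as above): the bytes land in host memory first, and a second, separate copy then moves them to the device.

```python
import numpy as np
import cupy as cp

# Step 1: POSIX read through the CPU into host memory.
with open("frames/run0001.bin", "rb") as f:
    host = np.frombuffer(f.read(), dtype=np.uint16)

# Step 2: an explicit host-to-device copy, the extra hop GDS removes.
device = cp.asarray(host)
```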
GPUDirect Storage Achieves 165x I/O Performance Improvement
The demand for faster data processing in materials science is now being met by significant advances in storage technology, directly impacting the pace of discovery at large-scale facilities. This leap forward allows the analysis of previously insurmountable datasets, opening new avenues for understanding complex material behaviors. The bottleneck traditionally lay in the data transfer process itself: conventional systems required data to pass through the CPU and system memory before reaching the GPU, creating significant overhead. GDS establishes a direct pathway from storage to GPU memory, eliminating the delays associated with staging data through host memory and initiating a separate device copy, dramatically accelerating the process. Experiments on high-performance Lustre storage systems confirmed these gains, achieving 76 GB/second on a single node with two GB200 Grace Blackwell Superchips and scaling to 700 GB/second across 16 nodes with 32 of the same superchips.
Further optimization involved addressing limitations within the HDF5 data format, a common standard for storing the massive datasets generated by X-ray free-electron lasers. While cuFile, used with GDS, can break down a single HDF5 read into multiple subrequests, the HDF5 library itself was previously single-threaded. To resolve this, developers created multithreading support for HDF5 and integrated it with cuPyNumeric. Additionally, careful attention was paid to data layout on disk, ensuring that reads were consistent with the HDF5 structure to maximize efficiency.
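A small h5py sketch of what layout-aware writing can look like: choosing a chunk shape that matches the later read pattern, so each frame read maps onto exactly one contiguous chunk on disk. Shapes and dataset names are illustrative, not the facility’s actual schema:

```python
import h5py
import numpy as np

n_frames, height, width = 1000, 512, 1024
with h5py.File("run0001.h5", "w") as f:
    frames = f.create_dataset(
        "frames",
        shape=(n_frames, height, width),
        dtype="uint16",
        chunks=(1, height, width),  # one chunk per frame
    )
    frames[0] = np.zeros((height, width), dtype=np.uint16)
```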
The new batched GPU implementation achieves roughly 3x better GPU utilization than earlier implementations, as detailed in Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities.
XANI Architecture Enables Real-Time Experiment Steering
The expectation that analyzing complex X-ray data requires months of processing time is rapidly becoming outdated. A new workflow, dubbed XANI (Accelerated X-ray Analysis for Nanoscale Imaging), is dramatically reducing the lag between data acquisition and scientific insight, enabling what amounts to real-time steering of experiments at facilities like SwissFEL and LCLS-II. Where previously 42 terabytes of data from materials science investigations could take nine months to fully analyze, the XANI workflow now accomplishes the same task in under four hours. Processing this deluge traditionally required significant manual parameter tuning and often limited analysis to only 10% of a dataset during the experiment itself.
By focusing on batching, explicit Python tasks, and GDS-backed I/O, the team has shown that scientific pipelines can scale to meet the petabyte-scale demands of next-generation facilities without requiring scientists to become low-level C++/MPI experts.
