Datastates-LLM Achieves Scalable Checkpointing for Transformer Models with Trillions of Parameters

Scientists are tackling the escalating challenge of checkpointing the enormous state of modern Large Language Models (LLMs), which now scale to trillions of parameters and are trained across vast GPU clusters. Avinash Maurya from Argonne National Laboratory, M. Mustafa Rafique from Rochester Institute of Technology, and Franck Cappello, together with colleagues, present DataStates-LLM, a novel checkpointing system designed to overcome limitations in existing approaches. This research is significant because it moves beyond treating model state as a simple binary file, instead intelligently managing the “3D heterogeneity” of the data (its location, structure, and type) to dramatically improve performance. By employing State Providers and asynchronous snapshots, DataStates-LLM achieves up to four times higher checkpointing throughput and reduces overall training time by up to 2.2× on 70-billion-parameter models, paving the way for more efficient and scalable LLM training.

The research addresses a critical bottleneck in scaling LLMs, which now routinely exceed 70 billion parameters and require training across thousands of GPUs using complex parallelisation strategies. Checkpointing (saving the model’s state) is essential for resilience, allowing recovery from hardware failures and enabling investigation of training trajectories, but existing methods struggle with the “3D heterogeneity” of LLM data. This heterogeneity arises from variations in data location (GPU versus host memory), the number of fragmented data objects, differing data types, and their unique serialisation requirements.
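To make those three axes concrete, the sketch below defines a minimal Python descriptor for a single checkpoint shard. It is purely illustrative: the class, its field names, and the example layer name are assumptions for exposition, not the actual DataStates-LLM data structures.

```python
from dataclasses import dataclass
from enum import Enum

import torch


class Location(Enum):
    """Where a shard of model or optimizer state currently lives."""
    GPU = "gpu"    # device memory, e.g. live model parameters
    HOST = "host"  # pinned or pageable CPU memory, e.g. offloaded optimizer state


@dataclass
class ShardDescriptor:
    """Hypothetical descriptor capturing the three axes of heterogeneity:
    data location, fragmentation into logical objects, and type/serialization
    requirements. Field names are illustrative, not from DataStates-LLM."""
    name: str                   # e.g. "layers.12.mlp.dense_h_to_4h.weight" (hypothetical)
    tensor: torch.Tensor        # the shard itself, possibly a view into a larger buffer
    location: Location          # GPU versus host memory
    dtype: torch.dtype          # bf16/fp16 parameters versus fp32 optimizer moments
    needs_pickle: bool = False  # small non-tensor state (RNG seeds, step counters, ...)
```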

The team achieved significant performance gains by decoupling state abstraction from data movement, introducing State Providers to facilitate efficient checkpointing. DataStates-LLM exploits the immutability of model parameters during training to perform “lazy”, non-blocking asynchronous snapshots. This innovative approach avoids blocking device-to-host transfers, data-oblivious serialisation, and storage I/O contention, all of which contribute to runtime overheads in conventional checkpointing solutions. By coalescing fragmented data shards and overlapping metadata serialisation with bulk tensor I/O, the researchers have created a system that dramatically improves checkpointing efficiency.
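The sketch below illustrates this lazy, non-blocking pattern under simplifying assumptions: GPU shards are copied into pinned host buffers on a side CUDA stream and flushed to storage from a background thread, so the training loop only pays for launching asynchronous copies. It is a conceptual Python sketch, not the DataStates-LLM implementation, which routes these steps through its State Providers and an asynchronous C++ I/O engine.

```python
import threading

import torch


def lazy_snapshot(shards: dict, path: str) -> threading.Thread:
    """Illustrative lazy, non-blocking snapshot (relies on parameter immutability
    during the forward/backward passes, as described in the article)."""
    copy_stream = torch.cuda.Stream()
    host_copies, metadata = {}, {}

    with torch.cuda.stream(copy_stream):
        for name, tensor in shards.items():
            # Stage each shard in pinned host memory so the copy can be asynchronous.
            pinned = torch.empty(tensor.shape, dtype=tensor.dtype,
                                 device="cpu", pin_memory=True)
            pinned.copy_(tensor, non_blocking=True)
            host_copies[name] = pinned
            metadata[name] = {"shape": list(tensor.shape), "dtype": str(tensor.dtype)}

    def flush():
        copy_stream.synchronize()  # wait for device-to-host copies off the critical path
        torch.save({"metadata": metadata, "tensors": host_copies}, path)

    flusher = threading.Thread(target=flush, daemon=True)
    flusher.start()
    return flusher  # synchronize copy_stream (or join) before the optimizer step mutates the shards
```

The design point mirrored here is that waiting and writing happen off the training-critical path; only the cheap launch of asynchronous copies interrupts the iteration.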
Experiments conducted on models of up to 70B parameters, distributed across 256 A100-40GB GPUs, demonstrate the effectiveness of DataStates-LLM. The study reveals that the new checkpointing method achieves up to 4× higher checkpointing throughput compared to state-of-the-art solutions. Crucially, this translates into a reduction in end-to-end training time of up to 2.2×, effectively mitigating the serialisation and heterogeneity bottlenecks that plague extreme-scale LLM training. This breakthrough is particularly important given the increasing frequency of failures during long training runs; the Llama 3 405B training run, for example, experienced a failure roughly every 2.8 hours.

The research establishes a new paradigm for managing the massive, distributed state of LLMs, paving the way for more robust and efficient training workflows. Beyond resilience, DataStates-LLM supports critical applications like reinforcement learning from human feedback and transfer learning, where frequent checkpointing is essential. The research team tackled inefficiencies stemming from treating model state as opaque data, instead focusing on the “3D heterogeneity” inherent in LLM data structures: varying memory location, logical object sharding, data types, and serialization needs. To achieve this, they engineered State Providers, decoupling state abstraction from data movement and enabling lazy, non-blocking asynchronous snapshots of model parameters during the forward and backward passes. Experiments employed five production-representative LLM configurations (BLOOM-3B and Llama 7B, 13B, 33B, and 70B), utilizing tensor parallelism (TP=4) to exploit NVLink for intra-layer collectives.
The study partitioned models with pipeline parallelism (PP) across nodes, employing DeepSpeed/Megatron’s default uniform partitioning to balance trainable parameters per stage, and used DeepSpeed ZeRO-1 to shard optimizer state across replicas in the data parallelism (DP) experiments. Training was conducted on the OSCAR-en dataset, tokenized with the Llama 2 tokenizer, using a sequence length of 2048 and a micro-batch size of 16 to prevent out-of-memory errors. The team configured a bounded pinned host cache of 80 GB per node, with the remaining host memory reserved for dataloader buffers and runtime allocations, ensuring sufficient space to hold one full checkpoint version for the four GPUs on each node so that checkpoint flushes can overlap with training. Checkpoint shards were flushed directly to a Lustre parallel file system (PFS), which served as stable shared storage.
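A small arithmetic sketch makes that layout concrete. The pipeline depth below is a hypothetical value chosen only for illustration; the article fixes TP=4, 256 GPUs in total, and four GPUs with an 80 GB pinned cache per node.

```python
def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Data-parallel degree implied by a TP x PP x DP decomposition."""
    assert world_size % (tp * pp) == 0, "GPU count must factor into TP * PP * DP"
    return world_size // (tp * pp)


WORLD_SIZE = 256      # A100-40GB GPUs used in the experiments
TP = 4                # tensor parallelism over NVLink within a node
PP = 8                # hypothetical pipeline depth, e.g. for the 70B configuration
DP = data_parallel_degree(WORLD_SIZE, TP, PP)   # -> 8 data-parallel replicas

# Per-node pinned host cache budget: 80 GB shared by four GPUs,
# i.e. roughly 20 GB of staged checkpoint data per GPU at any time.
GPUS_PER_NODE = 4
PER_GPU_CACHE_GB = 80 / GPUS_PER_NODE
print(f"DP degree: {DP}, pinned cache per GPU: {PER_GPU_CACHE_GB:.0f} GB")
```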

Performance was evaluated using checkpointing throughput, iteration duration during checkpointing, and end-to-end training runtime, varying data parallelism and checkpointing frequency to assess scalability and I/O pressure. Results demonstrate that DataStates-LLM achieves up to 4× higher checkpointing throughput and reduces end-to-end training time by up to 2.2× compared to state-of-the-art solutions, effectively mitigating serialization and heterogeneity bottlenecks. The approach leverages asynchronous C++-based I/O and zero training-to-checkpoint cross-process serialization, achieving substantial gains in throughput and reducing checkpointing overheads to negligible levels, with checkpointed iterations up to 2× faster than those of its predecessor, DataStates-LLM-Old.
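For readers reproducing such measurements, two of the reported metrics reduce to simple ratios; the helpers below are an illustrative sketch, not part of the DataStates-LLM tooling.

```python
def checkpoint_throughput_gb_s(total_bytes: int, start_s: float, end_s: float) -> float:
    """Checkpointing throughput in GB/s: bytes persisted divided by the
    wall-clock window of the flush (training may proceed concurrently)."""
    return total_bytes / (end_s - start_s) / 1e9


def checkpointed_iteration_overhead(plain_iter_s: float, ckpt_iter_s: float) -> float:
    """Relative slowdown of an iteration that overlaps with a checkpoint;
    a fully overlapped, non-blocking checkpoint drives this toward zero."""
    return (ckpt_iter_s - plain_iter_s) / plain_iter_s
```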

DataStates-LLM tackles 3D heterogeneity in LLM checkpointing efficiently

Scientists have developed DataStates-LLM, a novel checkpointing architecture designed to overcome limitations in training extremely large language models (LLMs). The research addresses critical challenges associated with checkpointing model state in distributed training environments, where models now routinely scale to 70 billion parameters and beyond, utilising 256 A100-40GB GPUs. Experiments revealed that existing checkpointing solutions treat model state as simple binary data, failing to account for the “3D heterogeneity” inherent in the underlying data structures: variations in memory location, logical object sharding, data types, and serialization needs. This inefficiency leads to substantial runtime overheads stemming from blocking device-to-host transfers, data-oblivious serialization, and storage I/O contention.

The team measured significant improvements in checkpointing throughput and overall training time using DataStates-LLM. Results demonstrate that the new architecture achieves up to 4× higher checkpointing throughput compared to state-of-the-art solutions. Furthermore, end-to-end training time is reduced by up to 2.2×, effectively mitigating the bottlenecks caused by serialization and data heterogeneity in large-scale LLM training. By efficiently coalescing fragmented data shards and overlapping metadata serialization with bulk tensor I/O, the system minimizes interruptions to the training process.
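As a rough illustration of that coalescing step, the sketch below packs many small host-resident shards into one contiguous byte buffer written with a single bulk call, while a compact metadata index is serialized separately. The real system performs this through its asynchronous C++ engine; the file paths and JSON index format here are assumptions for exposition.

```python
import json

import torch


def coalesce_and_write(shards: dict, data_path: str, meta_path: str) -> None:
    """Conceptual shard coalescing: one bulk write for tensor bytes, plus a tiny
    metadata file mapping each shard name to its offset, shape, and dtype."""
    index, chunks, cursor = {}, [], 0
    for name, t in shards.items():
        # Reinterpret the (host-resident) tensor as raw bytes, regardless of dtype.
        raw = t.detach().cpu().contiguous().flatten().view(torch.uint8).numpy().tobytes()
        index[name] = {"offset": cursor, "nbytes": len(raw),
                       "shape": list(t.shape), "dtype": str(t.dtype)}
        chunks.append(raw)
        cursor += len(raw)

    with open(data_path, "wb") as f:   # one large sequential write instead of many small ones
        f.write(b"".join(chunks))
    with open(meta_path, "w") as f:    # metadata stays small and cheap to (de)serialize
        json.dump(index, f)
```

Writing one large object per rank also keeps the number of files, and hence parallel-file-system metadata operations, low, which is the same pressure point the authors flag for their planned shard-aggregation work.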

The researchers note that the immutability of model parameters during the forward and backward passes is exploited to perform these asynchronous snapshots. The work highlights the importance of addressing the 3D heterogeneity of LLM checkpoints, which arises from the combination of data, pipeline, and tensor parallelism, alongside techniques like ZeRO and FSDP. Measurements confirm that DataStates-LLM’s approach avoids the blocking behaviour of conventional checkpointing methods, which can significantly impede training progress. The result is a scalable solution for capturing distributed AI model state frequently without incurring substantial overheads.

This research is particularly relevant given the increasing frequency of hardware failures and software bugs in large-scale LLM training runs; the training of the Llama 3 405B model, for example, experienced a failure roughly every 2.8 hours. The ability to quickly and efficiently checkpoint model state is crucial for resilience and productivity, allowing researchers to resume training from the last checkpoint rather than restarting from scratch. The 43.4% failure rate reported for Alibaba’s Unicron training further underscores the need for robust checkpointing mechanisms. DataStates-LLM’s performance gains promise to accelerate LLM development and enable more ambitious training runs.

DataStates-LLM accelerates large language model checkpointing significantly

Scientists have developed DataStates-LLM, a novel checkpointing system designed to address the challenges of training extremely large language models (LLMs) with trillions of parameters. This new approach leverages State Providers to separate state abstraction from data movement, enabling more efficient checkpointing in distributed training environments. DataStates-LLM exploits the immutability of model parameters during training to perform lazy, asynchronous snapshots, streamlining data handling and reducing bottlenecks, specifically those related to device-to-host transfers, data serialization, and storage I/O contention. Evaluations on models of up to 70 billion parameters, utilising 256 A100-40GB GPUs, demonstrate that DataStates-LLM achieves up to four times higher checkpointing throughput and reduces overall training time by up to 2.2× compared to existing state-of-the-art solutions.

The authors acknowledge limitations related to network and storage costs at high checkpoint rates, as well as potential metadata pressure on the parallel file system, areas they intend to address in future work. They plan to extend DataStates-LLM with data reduction techniques like differential checkpointing and compression, and to explore support for offloaded model states across deeper memory tiers, alongside shard aggregation to mitigate metadata issues without compromising parallelism. These improvements aim to further optimise the efficiency and scalability of training large language models, paving the way for even more powerful AI systems.

👉 More information
🗞 DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers
🧠 ArXiv: https://arxiv.org/abs/2601.16956

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
