Training large language models (LLMs) comes with two primary storage requirements:
- Training data: drives weight updates and convergence.
- Checkpoints: periodically save model states from GPU memory to persistent storage.
Training Data #
Transformer training is compute-intensive rather than data-intensive. Tokenized datasets are relatively compact:
- Each English token is ~4 bytes.
- One trillion tokens require only a few terabytes of storage.
State-of-the-art LLMs train on tens of trillions of tokens, translating into tens of terabytes of processed text.
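As a quick back-of-the-envelope check (a minimal sketch, assuming tokens are stored at ~4 bytes each, as noted above):

```python
# Rough dataset sizing, assuming ~4 bytes per stored token.
BYTES_PER_TOKEN = 4

def dataset_size_tb(num_tokens: float) -> float:
    """Approximate on-disk size of a tokenized dataset, in terabytes."""
    return num_tokens * BYTES_PER_TOKEN / 1e12

print(dataset_size_tb(1e12))   # 1 trillion tokens   -> ~4 TB
print(dataset_size_tb(15e12))  # 15 trillion tokens  -> ~60 TB
```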
Checkpoints #
Checkpoint requirements scale with model size, not cluster size. A rough estimate assumes 16 bytes per parameter, covering weights, gradients, and optimizer states.
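One common accounting behind the 16-bytes-per-parameter figure is mixed-precision training with Adam; the breakdown below is an illustrative assumption, not a universal rule:

```python
# Assumed mixed-precision/Adam accounting for ~16 bytes per parameter:
#   2 B fp16 weights + 2 B fp16 gradients
# + 4 B fp32 master weights + 4 B Adam momentum + 4 B Adam variance = 16 B
BYTES_PER_PARAM = 16

def checkpoint_size_tb(num_params: float) -> float:
    """Rough upper bound on the full training-state checkpoint size, in TB."""
    return num_params * BYTES_PER_PARAM / 1e12

# 1T parameters -> ~16 TB of full training state; real layouts that drop
# gradients or shard optimizer state can come in lower.
print(checkpoint_size_tb(1e12))
```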
Performance considerations also depend primarily on model size. To reduce bottlenecks, large systems often adopt asynchronous multi-tier checkpointing:
- Copy weights from GPU memory to host memory (GPU briefly stalls).
- GPU resumes training while CPU asynchronously flushes to non-volatile storage.
ByteDance’s MegaScale system and Microsoft’s Nebula framework employ such strategies, layering checkpoints across host memory, local SSDs, and distributed object storage for resilience.
Fast Checkpointing and Recovery (ByteDance MegaScale) #
Recovery requires loading the most recent checkpointed weights and optimizer state—ideally close to the failure point. Frequent checkpoints reduce lost work but add latency to the training pipeline.
MegaScale introduced a two-phase design (sketched below):
- Phase 1: Each GPU writes its state to host memory in seconds, minimizing disruption.
- Phase 2: A background process asynchronously transfers data to HDFS, decoupling long I/O from the critical path.
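A minimal sketch of this two-phase pattern in PyTorch-style Python (illustrative only, not ByteDance's implementation; the persist_fn callback stands in for the HDFS upload):

```python
import threading
import torch

def _to_cpu(obj):
    """Recursively copy any tensors in a (nested) state dict to host memory."""
    if torch.is_tensor(obj):
        return obj.detach().to("cpu", non_blocking=True)
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def two_phase_checkpoint(model, optimizer, step, persist_fn):
    # Phase 1: stage model and optimizer state in host RAM.
    # Training stalls only for this device-to-host copy.
    staged = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
    }
    torch.cuda.synchronize()  # make sure all device-to-host copies have finished

    # Phase 2: persist to durable storage (HDFS, object store, ...) in a
    # background thread so the GPUs can resume training immediately.
    threading.Thread(target=persist_fn, args=(staged,), daemon=True).start()
```

In practice persist_fn might simply torch.save the staged dictionary to a network path; the important property is that the slow, durable write happens off the training critical path.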
For recovery, MegaScale optimized data retrieval: one GPU in each data-parallel group reads a shared state partition from HDFS and broadcasts it to peers, reducing bandwidth demands and speeding up recovery.
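The idea can be sketched with torch.distributed (assumptions: a recent PyTorch build and per-data-parallel-group process groups already set up; the path and group handle are illustrative):

```python
import torch
import torch.distributed as dist

def load_checkpoint_with_broadcast(path, dp_group):
    """One rank per data-parallel group reads shared storage; its peers
    receive the state over the network instead of re-reading it."""
    src_rank = dist.get_global_rank(dp_group, 0)      # designated reader
    if dist.get_rank() == src_rank:
        state = torch.load(path, map_location="cpu")  # single remote read
    else:
        state = None

    # Broadcast the checkpoint object within the data-parallel group.
    obj_list = [state]
    dist.broadcast_object_list(obj_list, src=src_rank, group=dp_group)
    return obj_list[0]
```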
The Checkpointing Challenge in LLMs #
Modern LLMs face explosive growth in datasets (petabyte scale) and models (hundreds of gigabytes to terabytes). With models far exceeding a single GPU’s memory, efficient parallelization and recoverability are critical.
Full-Dimension Parallelism #
Training GPT-3 (175B parameters, ~500 GB of model state) on a single GPU would take 300+ years, and the model itself far exceeds a single GPU's 80 GB of memory. Hybrid parallelism is therefore essential:
- Data parallelism: replicate model across devices, split data. Simple but memory-hungry.
- Model parallelism: split layers or tensors across devices. Memory efficient but complex.
- Pipeline parallelism: stage execution across devices. Improves throughput, may add latency.
Stanford, NVIDIA, and Microsoft demonstrated that combining all three significantly boosts LLM training efficiency.
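As a small illustration of how the three dimensions compose, the product of the tensor-, pipeline-, and data-parallel degrees must equal the total GPU count (the degrees below are assumed values for illustration):

```python
def data_parallel_degree(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the implied data-parallel degree for a hybrid-parallel job."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    assert world_size % gpus_per_replica == 0, "GPUs must divide evenly into replicas"
    return world_size // gpus_per_replica

# Example: 1024 GPUs with tensor parallel 8 and pipeline parallel 16
# -> 128 GPUs per model replica, 8 data-parallel replicas.
print(data_parallel_degree(1024, tensor_parallel=8, pipeline_parallel=16))  # 8
```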
Checkpoints for Recoverability #
Since LLM training may span months, checkpointing ensures:
- Progress can resume after failures.
- Hyperparameters can be rolled back.
- Training is reproducible.
Although training is GPU-bound, checkpointing is I/O-bound: writes dominate during saving, while reads bottleneck recovery.
Checkpoint Size: A Matter of Model Scale #
Checkpoint size scales linearly with model size, independent of dataset size, frequency, or GPU count.
Example – GPT-3 (175B parameters):
- Model size: ~500 GB.
- Checkpointing across 1024 GPUs (128 DGX-A100 nodes).
- Combined tensor, pipeline, and data parallelism yields efficient checkpoint writes.
Key point: only one data-parallel replica (a single tensor- and pipeline-parallel group) writes the checkpoint, not every GPU.
This approach extends to trillion-parameter models with near-linear scaling efficiency.
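To see why per-GPU write demands stay modest, divide the checkpoint across the GPUs of the single writing replica (the 8-way tensor by 16-way pipeline split below is an assumed layout for illustration, not a published configuration):

```python
# Per-GPU checkpoint share when only one data-parallel replica writes.
checkpoint_gb = 500                          # ~GPT-3-scale model state, per the text
tensor_parallel, pipeline_parallel = 8, 16   # assumed layout for illustration
writers = tensor_parallel * pipeline_parallel  # 128 GPUs in the writing replica

per_gpu_gb = checkpoint_gb / writers
print(f"{writers} writers, ~{per_gpu_gb:.1f} GB written per GPU")  # ~3.9 GB each
```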
Mathematical Analysis of Checkpoints #
For trillion-parameter models:
- Checkpoint writes: ~13.8 TB.
- Write bandwidth: ~273 GB/s (40% of system peak).
- Write latency: ~50 seconds, every 4 hours → just 0.3% overhead.
Recovery requires higher throughput:
- Read bandwidth = data-parallel degree × write bandwidth, because every replica must reload the full model state during recovery.
- Example: 6× replication → 1.64 TB/s read bandwidth needed.
Storage must therefore sustain ~1.6 TB/s of reads for a 50-second recovery and ~273 GB/s of writes for a 50-second checkpoint save.
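The arithmetic can be checked directly from the figures above:

```python
checkpoint_tb = 13.8        # TB written per checkpoint
write_bw_gbs = 273          # GB/s sustained write bandwidth
interval_s = 4 * 3600       # checkpoint every 4 hours
data_parallel = 6           # replicas that must reload state on recovery

write_s = checkpoint_tb * 1000 / write_bw_gbs
overhead = write_s / interval_s
read_bw_tbs = data_parallel * write_bw_gbs / 1000

print(f"write time  ~{write_s:.0f} s")         # ~51 s
print(f"overhead    ~{overhead:.2%}")          # ~0.35%
print(f"read demand ~{read_bw_tbs:.2f} TB/s")  # ~1.64 TB/s
```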
Common Misconceptions #
- Myth: Each GPU requires 1 GB/s for checkpointing.
- Reality: Only one data-parallel replica writes each checkpoint, so per-GPU bandwidth requirements are far lower.
- Myth: Checkpoint size depends on dataset size.
- Reality: Checkpoint size depends only on the model's parameters and optimizer state.
Conclusion #
As LLM training scales to the trillion-parameter era, storage architecture becomes as critical as compute.
Checkpointing must be guided by rigorous analysis, not rough guesses—because small inefficiencies at scale translate into huge costs. With proper checkpointing design, training systems can remain resilient, efficient, and scalable in the face of massive model and data growth.