How Gradient Accumulation Works in an 8xH100 Distributed Setup
In an 8xH100 distributed setup, the number of gradient accumulation steps scales inversely with the number of GPUs to keep the effective batch size constant: each GPU processes one micro-batch per optimizer step when training is fully distributed (8 GPUs), and a single GPU accumulates over 8 micro-batches.
The openai/parameter-golf repository implements a clever scaling strategy that keeps training behavior identical across different hardware configurations. Whether you are running on a full 8xH100 node or a single GPU workstation, the total number of tokens processed per optimizer step remains constant through dynamic gradient accumulation.
The Mathematical Relationship Between World Size and Accumulation Steps
The core logic resides in train_gpt.py at lines 750–751, where the number of gradient accumulation steps is calculated inversely to the distributed world size:
grad_accum_steps = 8 // world_size # world_size = # of GPUs
grad_scale = 1.0 / grad_accum_steps
This formula creates a fixed total of 8 "micro-batch slots" that are distributed across available GPUs:
- 8 GPUs (world_size = 8): grad_accum_steps = 1. No accumulation occurs. Each GPU processes one micro-batch and synchronizes gradients immediately.
- 4 GPUs (world_size = 4): grad_accum_steps = 2. Two forward-backward passes are performed before gradient synchronization, doubling the number of tokens each GPU processes per optimizer step while maintaining the global effective batch size.
- 2 GPUs (world_size = 2): grad_accum_steps = 4. Four micro-steps accumulate gradients locally before the all-reduce operation.
- 1 GPU (world_size = 1): grad_accum_steps = 8. Full 8-step accumulation on a single device, matching the 8-GPU effective batch size without distributed communication.
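The mapping above can be reproduced in a few lines of Python. This is a minimal sketch rather than code from the repository; the names `accumulation_plan` and `TOTAL_SLOTS` are hypothetical and exist only for illustration.

```python
TOTAL_SLOTS = 8  # hypothetical name for the fixed number of micro-batch slots


def accumulation_plan(world_size: int) -> tuple[int, float]:
    """Return (grad_accum_steps, grad_scale) for a given number of GPUs."""
    grad_accum_steps = TOTAL_SLOTS // world_size
    grad_scale = 1.0 / grad_accum_steps
    return grad_accum_steps, grad_scale


for world_size in (8, 4, 2, 1):
    steps, scale = accumulation_plan(world_size)
    # world_size * steps is the number of micro-batches per optimizer step
    print(f"world_size={world_size} grad_accum_steps={steps} "
          f"grad_scale={scale} slots={world_size * steps}")
```

For each of the four world sizes, the product of world_size and grad_accum_steps comes out to 8, which is what keeps the effective batch size fixed.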
Memory Optimization Through Micro-Batching
By dividing the effective batch into smaller micro-batches, the implementation reduces peak activation memory. The logic at lines 886–889 in train_gpt.py calculates local token counts by dividing the global token count by both world_size and grad_accum_steps:
local_tokens = global_tokens // (self.world_size * grad_accum_steps)
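This ensures each forward pass processes only a fraction of the data. Because world_size * grad_accum_steps equals 8 when world_size is 1, 2, 4, or 8, every micro-batch holds one eighth of the global token count regardless of GPU count, so peak activation memory per forward pass stays the same. This keeps memory usage manageable even on 16 GB GPUs when the effective batch size would otherwise exceed VRAM capacity.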
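A short worked example makes the arithmetic concrete. The sketch below assumes a global token count of 524,288 purely for illustration (the value is not taken from the repository), and the helper name `local_tokens_per_micro_batch` is hypothetical.

```python
def local_tokens_per_micro_batch(global_tokens: int, world_size: int) -> int:
    """Tokens one GPU processes in a single forward pass."""
    grad_accum_steps = 8 // world_size
    return global_tokens // (world_size * grad_accum_steps)


global_tokens = 524_288  # illustrative value, not taken from the repository

for world_size in (8, 4, 2, 1):
    tokens = local_tokens_per_micro_batch(global_tokens, world_size)
    print(f"world_size={world_size} local_tokens={tokens}")
```

Under these assumptions the micro-batch size is the same for every world size, so the memory needed for one forward pass does not grow when fewer GPUs are used. Only the number of sequential passes per GPU changes.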
Controlled Gradient Synchronization
The training loop explicitly controls when distributed gradient synchronization occurs. According to lines 945–946 in train_gpt.py, require_backward_grad_sync is set to True only on the final micro-step:
model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
This prevents unnecessary all-reduce operations during intermediate accumulation steps, reducing communication overhead by a factor equal to grad_accum_steps.
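The saving can be illustrated with a small pure-Python simulation that counts how many synchronizations the flag would allow. This is a sketch under the assumption that one all-reduce is triggered per backward pass in which the flag is True; the function name `count_all_reduces` is hypothetical.

```python
def count_all_reduces(grad_accum_steps: int, sync_every_step: bool) -> int:
    """Count gradient synchronizations in one optimizer step."""
    all_reduces = 0
    for micro_step in range(grad_accum_steps):
        # Mirrors the flag logic: sync only on the final micro-step
        # unless we deliberately sync on every step for comparison.
        require_sync = sync_every_step or (micro_step == grad_accum_steps - 1)
        if require_sync:
            all_reduces += 1
    return all_reduces


for steps in (1, 2, 4):
    naive = count_all_reduces(steps, sync_every_step=True)
    gated = count_all_reduces(steps, sync_every_step=False)
    print(f"grad_accum_steps={steps} naive={naive} gated={gated}")
```

PyTorch's DistributedDataParallel also provides a documented `no_sync()` context manager that serves the same purpose of skipping synchronization on intermediate micro-steps.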
Loss Scaling and Gradient Magnitude
To maintain consistent gradient magnitudes across different world sizes, the loss is scaled before backpropagation. Line 1019 in train_gpt.py implements:
train_loss /= grad_accum_steps
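Additionally, the grad_scale factor (defined as 1.0 / grad_accum_steps) ensures that accumulated gradients represent the average across all micro-batches. Without it, the summed gradients would be grad_accum_steps times larger than the gradient of the mean loss, so a single-GPU run with 8 accumulation steps would take effectively larger optimizer steps than an 8-GPU run.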
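The effect of the scaling can be checked numerically on a toy problem. The sketch below uses a one-parameter least-squares model with a hand-written gradient instead of the repository's model; all names and data values are illustrative assumptions. It shows that summing micro-batch gradients scaled by 1.0 / grad_accum_steps reproduces the full-batch gradient when the micro-batches are equally sized.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

grad_accum_steps = 4
grad_scale = 1.0 / grad_accum_steps
micro_size = len(xs) // grad_accum_steps

accumulated = 0.0  # scaled sum, as when backpropagating loss * grad_scale
unscaled = 0.0     # plain sum, for comparison
for step in range(grad_accum_steps):
    lo, hi = step * micro_size, (step + 1) * micro_size
    g = mse_grad(w, xs[lo:hi], ys[lo:hi])
    accumulated += grad_scale * g
    unscaled += g

full_batch = mse_grad(w, xs, ys)
print(accumulated, full_batch, unscaled)
```

In this toy setup the scaled accumulation matches the full-batch gradient, while the unscaled sum is grad_accum_steps times larger.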
Practical Training Loop Example
The following fragment, a simplified sketch based on lines 940–975 in train_gpt.py, shows the accumulation logic end to end:
    grad_scale = 1.0 / grad_accum_steps
    train_loss = 0.0
    for micro_step in range(grad_accum_steps):
        # Only synchronize gradients across GPUs on the last micro-step
        model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
        # Load one micro-batch
        x, y = train_loader.next_batch(
            args.train_batch_tokens,
            args.train_seq_len,
            grad_accum_steps,
        )
        # Forward pass; keep the unscaled loss for logging
        loss = model(x, y)
        train_loss += loss.item()
        # Backward pass on the scaled loss; gradients accumulate in each parameter's .grad
        (loss * grad_scale).backward()
    # Average the logged loss over the micro-batches
    train_loss /= grad_accum_steps
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
Summary
- Inverse scaling of grad_accum_steps (calculated as 8 // world_size) maintains constant effective batch size across 1–8 H100 GPUs.
- Memory efficiency is achieved by processing smaller micro-batches (local_tokens = global_tokens // (world_size * grad_accum_steps)), reducing peak VRAM usage.
- Communication optimization occurs through require_backward_grad_sync logic that triggers all-reduce only on the final accumulation step.
- Numerical stability is preserved by scaling loss and gradients by 1.0 / grad_accum_steps, ensuring consistent optimizer behavior regardless of GPU count.
Frequently Asked Questions
What is the relationship between world_size and grad_accum_steps in parameter-golf?
The relationship is inverse: grad_accum_steps = 8 // world_size. As you add GPUs to your distributed setup, the number of gradient accumulation steps decreases proportionally. On a full 8xH100 node, grad_accum_steps equals 1, while on a single GPU it equals 8. Because the formula uses integer division, the effective batch size is identical only for world sizes that divide 8 evenly (1, 2, 4, and 8).
How does gradient accumulation reduce memory usage in distributed training?
Gradient accumulation reduces memory usage by splitting the effective batch into smaller micro-batches. In train_gpt.py, the token loader calculates local_tokens = global_tokens // (world_size * grad_accum_steps), meaning each forward pass processes only a fraction of the total data. This reduces peak activation memory, allowing large effective batch sizes to fit on GPUs with limited VRAM, such as 16 GB cards.
When does gradient synchronization occur during accumulation loops?
Gradient synchronization occurs only on the final micro-step of the accumulation loop. According to lines 945–946 in train_gpt.py, the code sets model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1). This Boolean flag ensures that the expensive all-reduce operation happens only once per effective batch, after all local gradients have been accumulated, minimizing communication overhead in the distributed 8xH100 setup.
Can I run the parameter-golf training script on fewer than 8 H100 GPUs?
Yes, the training script is designed to run on 1, 2, 4, or 8 GPUs without code changes. The automatic scaling of grad_accum_steps (calculated as 8 // world_size) ensures that the effective batch size remains constant across those configurations. For GPU counts that do not divide 8 (for example 3), integer division would give fewer than 8 micro-batch slots in total, so the effective batch size would no longer match. Wall-clock time increases as GPU count decreases because each GPU must run more forward and backward passes sequentially to complete the accumulation steps, trading computation time for the flexibility to run on smaller hardware configurations.
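If you want to guard against unsupported GPU counts in your own launcher, a check along the following lines is one option. This is a hypothetical helper, not something the article attributes to train_gpt.py; the name `grad_accum_steps_for` and the `total_slots` parameter are assumptions.

```python
def grad_accum_steps_for(world_size: int, total_slots: int = 8) -> int:
    """Return the accumulation steps, rejecting world sizes that do not divide total_slots."""
    if world_size <= 0 or total_slots % world_size != 0:
        raise ValueError(
            f"world_size={world_size} must be a positive divisor of {total_slots}"
        )
    return total_slots // world_size


print(grad_accum_steps_for(4))

try:
    grad_accum_steps_for(3)
except ValueError as err:
    # 8 // 3 would give 2 steps, i.e. only 6 micro-batch slots instead of 8
    print(f"rejected: {err}")
```

Failing early in this way avoids silently training with a smaller effective batch size than intended.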