
How Gradient Accumulation Works in an 8xH100 Distributed Setup

April 17, 2026 openai/parameter-golf

In an 8xH100 distributed setup, the number of gradient accumulation steps scales inversely with the number of GPUs so that the effective batch size stays constant: each GPU processes one micro-batch per optimizer step when all 8 GPUs are used, and a single GPU accumulates over 8 micro-batches.

The openai/parameter-golf repository implements a scaling strategy that keeps the effective batch size the same across different hardware configurations. Whether you are running on a full 8xH100 node or a single GPU workstation, the total number of tokens processed per optimizer step remains constant through dynamic gradient accumulation.

The Mathematical Relationship Between World Size and Accumulation Steps

The core logic resides in train_gpt.py at lines 750–751, where the number of gradient accumulation steps is calculated inversely to the distributed world size:

grad_accum_steps = 8 // world_size            # world_size = # of GPUs

grad_scale = 1.0 / grad_accum_steps

This formula creates a fixed total of 8 "micro-batch slots" that are distributed across available GPUs:

8 GPUs (world_size = 8): grad_accum_steps = 1
No accumulation occurs. Each GPU processes a full micro-batch and synchronizes gradients immediately.
4 GPUs (world_size = 4): grad_accum_steps = 2
Each GPU performs two forward-backward passes before gradient synchronization, so it handles twice as many micro-batches per optimizer step while the micro-batch size and the global effective batch size stay the same.
2 GPUs (world_size = 2): grad_accum_steps = 4
Four micro-steps accumulate gradients locally before the all-reduce operation.
1 GPU (world_size = 1): grad_accum_steps = 8
Full 8-step accumulation on a single device, matching the 8-GPU effective batch size without distributed communication.
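
The mapping above can be checked with a few lines of Python. This is a minimal sketch rather than code from the repository: the helper name accum_steps_for and the constant TOTAL_SLOTS are made up for illustration, and only the 8 // world_size formula comes from train_gpt.py.

```python
TOTAL_SLOTS = 8  # assumed name for the fixed number of micro-batch slots


def accum_steps_for(world_size: int) -> int:
    """Return grad_accum_steps for a given number of GPUs (illustrative helper)."""
    return TOTAL_SLOTS // world_size


for world_size in (8, 4, 2, 1):
    steps = accum_steps_for(world_size)
    # world_size * steps is the number of micro-batches per optimizer step.
    print(f"world_size={world_size} grad_accum_steps={steps} slots={world_size * steps}")
```

Every row reports 8 slots, which is why the effective batch size does not change between these configurations.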

Memory Optimization Through Micro-Batching

By dividing the effective batch into smaller micro-batches, the implementation reduces peak activation memory. The logic at lines 886–889 in train_gpt.py calculates local token counts by dividing the global token count by both world_size and grad_accum_steps:

local_tokens = global_tokens // (self.world_size * grad_accum_steps)

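This ensures each forward pass processes only a fraction of the data. Because world_size * grad_accum_steps equals 8 whenever world_size divides 8, that fraction is always one eighth of global_tokens, so the micro-batch size per GPU is the same in every supported configuration. This helps keep memory usage manageable on smaller cards, such as 16 GB GPUs, when the full effective batch would otherwise exceed VRAM capacity.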
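
As a worked example of this arithmetic, the sketch below plugs in a hypothetical global batch of 524,288 tokens. The value is chosen only because it divides evenly and is not taken from the repository's configuration; local_tokens_per_pass is an illustrative helper name.

```python
def local_tokens_per_pass(global_tokens: int, world_size: int) -> int:
    """Tokens one GPU sees in a single forward pass (illustrative helper)."""
    grad_accum_steps = 8 // world_size
    return global_tokens // (world_size * grad_accum_steps)


GLOBAL_TOKENS = 524_288  # hypothetical value, not the repository default

for world_size in (8, 4, 2, 1):
    print(world_size, local_tokens_per_pass(GLOBAL_TOKENS, world_size))
```

For this input, each of the four configurations yields 65,536 tokens per forward pass, so the per-pass workload on a GPU does not depend on how many GPUs are present.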

Controlled Gradient Synchronization

The training loop explicitly controls when distributed gradient synchronization occurs. According to lines 945–946 in train_gpt.py, require_backward_grad_sync is set to True only on the final micro-step:

model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)

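This skips the all-reduce on intermediate accumulation steps, so gradients are synchronized once per optimizer step rather than once per micro-step. Compared with synchronizing after every backward pass, that cuts the number of synchronizations by a factor of grad_accum_steps. With 8 GPUs (grad_accum_steps = 1) every micro-step is already the final one, so nothing is skipped.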
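
The saving can be illustrated with a small counting model. This sketch does not call PyTorch and is not code from the repository: it treats each synchronized backward pass as one synchronization event, and the function name count_all_reduces and the figure of 100 optimizer steps are assumptions made for the example.

```python
def count_all_reduces(optimizer_steps: int, grad_accum_steps: int,
                      sync_every_micro_step: bool) -> int:
    """Count gradient synchronizations over a number of optimizer steps."""
    syncs = 0
    for _ in range(optimizer_steps):
        for micro_step in range(grad_accum_steps):
            # Same condition as the flag assignment shown above.
            is_last = micro_step == grad_accum_steps - 1
            if sync_every_micro_step or is_last:
                syncs += 1
    return syncs


# 2 GPUs -> 4 accumulation steps per optimizer step
naive = count_all_reduces(100, 4, sync_every_micro_step=True)
gated = count_all_reduces(100, 4, sync_every_micro_step=False)
print(naive, gated)  # 400 100
```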

Loss Scaling and Gradient Magnitude

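To maintain consistent gradient magnitudes across different world sizes, each micro-batch loss is multiplied by grad_scale (defined as 1.0 / grad_accum_steps) before backpropagation. Because backward passes add into the existing gradients, the accumulated result is then the average over all micro-batches rather than their sum. Separately, line 1019 in train_gpt.py averages the accumulated loss value so that the logged loss is comparable across configurations:

train_loss /= grad_accum_steps

Without the grad_scale factor, the accumulated gradient would grow with the number of accumulation steps and would be up to 8 times larger on a single GPU than on a full 8-GPU node.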
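
The effect of the scaling can be verified numerically. The sketch below is a toy example, not repository code: it uses a one-parameter least-squares model with a hand-written gradient (grad_mse and the data values are invented for illustration) and compares the accumulated gradient against the gradient of the full batch. It assumes equally sized micro-batches and a loss that is a mean over the batch.

```python
import math


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w*x - y)^2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]

full_batch_grad = grad_mse(w, xs, ys)

grad_accum_steps = 4
grad_scale = 1.0 / grad_accum_steps
micro = len(xs) // grad_accum_steps

accumulated = 0.0   # what accumulates when the loss is scaled
unscaled_sum = 0.0  # what would accumulate without scaling
for step in range(grad_accum_steps):
    lo, hi = step * micro, (step + 1) * micro
    g = grad_mse(w, xs[lo:hi], ys[lo:hi])
    accumulated += g * grad_scale  # gradient of (loss * grad_scale)
    unscaled_sum += g

print(math.isclose(accumulated, full_batch_grad))
print(math.isclose(unscaled_sum, grad_accum_steps * full_batch_grad))
```

Both checks hold: the scaled accumulation reproduces the full-batch gradient, while the unscaled sum is grad_accum_steps times too large.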

Practical Training Loop Example

The following simplified fragment, based on lines 940–975 in train_gpt.py, shows the complete accumulation logic. The loss is scaled exactly once, and gradients accumulate in the parameters' .grad buffers through repeated backward() calls:

grad_scale = 1.0 / grad_accum_steps
train_loss = 0.0

for micro_step in range(grad_accum_steps):
    # Only synchronize gradients across GPUs on the last micro-step
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)

    # Load one micro-batch
    x, y = train_loader.next_batch(
        args.train_batch_tokens,
        args.train_seq_len,
        grad_accum_steps
    )

    # Forward pass; keep the unscaled loss for logging
    loss = model(x, y)
    train_loss += loss.detach()

    # Backward pass on the scaled loss, so the summed gradients form an average
    (loss * grad_scale).backward()

# Average loss over the micro-batches for logging
train_loss /= grad_accum_steps
optimizer.step()
optimizer.zero_grad(set_to_none=True)

Summary

Inverse scaling of grad_accum_steps (calculated as 8 // world_size) maintains constant effective batch size across 1, 2, 4, or 8 H100 GPUs.
Memory efficiency is achieved by processing smaller micro-batches (local_tokens = global_tokens // (world_size * grad_accum_steps)), reducing peak VRAM usage.
Communication optimization occurs through require_backward_grad_sync logic that triggers all-reduce only on the final accumulation step.
Gradient magnitude is kept consistent by scaling each micro-batch loss by 1.0 / grad_accum_steps before the backward pass, so the optimizer sees an averaged gradient regardless of GPU count.

Frequently Asked Questions

What is the relationship between world_size and grad_accum_steps in parameter-golf?

The relationship is strictly inverse: grad_accum_steps = 8 // world_size. This means that as you add GPUs to your distributed setup, the number of gradient accumulation steps decreases proportionally. On a full 8xH100 node, grad_accum_steps equals 1, while on a single GPU, it equals 8, ensuring the effective batch size remains identical across configurations.

How does gradient accumulation reduce memory usage in distributed training?

Gradient accumulation reduces memory usage by splitting the effective batch into smaller micro-batches. In train_gpt.py, the token loader calculates local_tokens = global_tokens // (world_size * grad_accum_steps), meaning each forward pass processes only a fraction of the total data. This reduces peak activation memory, allowing large effective batch sizes to fit on GPUs with limited VRAM, such as 16 GB cards.

When does gradient synchronization occur during accumulation loops?

Gradient synchronization occurs only on the final micro-step of the accumulation loop. According to lines 945–946 in train_gpt.py, the code sets model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1). This Boolean flag ensures that the expensive all-reduce operation happens only once per effective batch, after all local gradients have been accumulated, minimizing communication overhead in the distributed 8xH100 setup.

Can I run the parameter-golf training script on fewer than 8 H100 GPUs?

Yes, the formula supports 8, 4, 2, or 1 GPU without code changes. The automatic scaling of grad_accum_steps (calculated as 8 // world_size) keeps the effective batch size constant for these world sizes. Because the division is an integer division, a GPU count that does not divide 8 (for example 3) would yield fewer than 8 micro-batches per optimizer step and a smaller effective batch. Wall-clock time increases as GPU count decreases because each GPU must run more forward-backward passes in sequence to complete the accumulation steps, trading computation time for the flexibility to run on smaller hardware configurations.
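
Since 8 // world_size is an integer division, the constant-batch guarantee only holds when world_size divides 8. The guard below is a hypothetical sketch that makes the constraint explicit: checked_accum_steps is not a function from the repository, and whether train_gpt.py performs a similar check is not confirmed here.

```python
def checked_accum_steps(world_size: int, total_slots: int = 8) -> int:
    """Return grad_accum_steps, rejecting world sizes that would change the batch size."""
    if world_size < 1 or total_slots % world_size != 0:
        raise ValueError(
            f"world_size={world_size} must be a positive divisor of {total_slots}"
        )
    return total_slots // world_size


print([checked_accum_steps(n) for n in (1, 2, 4, 8)])  # [8, 4, 2, 1]
print(3 * (8 // 3))  # 6 slots instead of 8 when world_size does not divide 8
```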
