# How Gradient Accumulation Works in an 8xH100 Distributed Setup

> Discover how gradient accumulation optimizes an 8xH100 distributed setup. Learn how micro-batches and GPU processing maintain a constant effective batch size for superior performance.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: deep-dive
- Published: 2026-04-17

---

**In an 8xH100 distributed setup, gradient accumulation inversely scales with the number of GPUs to maintain a constant effective batch size, processing one micro-batch per GPU when fully distributed (8 GPUs) and accumulating up to 8 steps on a single GPU.**

The `openai/parameter-golf` repository implements a clever scaling strategy that keeps training behavior identical across different hardware configurations. Whether you are running on a full 8xH100 node or a single GPU workstation, the total number of tokens processed per optimizer step remains constant through dynamic gradient accumulation.

## The Mathematical Relationship Between World Size and Accumulation Steps

The core logic resides in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at lines 750–751, where the number of gradient accumulation steps is calculated inversely to the distributed world size:

```python
grad_accum_steps = 8 // world_size            # world_size = number of GPUs
grad_scale = 1.0 / grad_accum_steps
```

This formula creates a fixed total of 8 "micro-batch slots" that are divided evenly among the available GPUs, provided `world_size` divides 8:

- **8 GPUs (`world_size = 8`)**: `grad_accum_steps = 1`  
  No accumulation occurs. Each GPU processes a full micro-batch and synchronizes gradients immediately.

- **4 GPUs (`world_size = 4`)**: `grad_accum_steps = 2`  
  Each GPU runs two forward-backward passes before gradient synchronization, so it handles twice as many tokens per optimizer step while the micro-batch size and the global effective batch size stay the same.

- **2 GPUs (`world_size = 2`)**: `grad_accum_steps = 4`  
  Four micro-steps accumulate gradients locally before the all-reduce operation.

- **1 GPU (`world_size = 1`)**: `grad_accum_steps = 8`  
  Full 8-step accumulation on a single device, matching the 8-GPU effective batch size without distributed communication.
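
The mapping above can be checked with a few lines of plain Python. This is a minimal sketch rather than code from the repository: `TOTAL_SLOTS` and `accumulation_table` are illustrative names, and only world sizes that divide 8 are considered.

```python
# Illustrative sketch; these names are not taken from train_gpt.py.
TOTAL_SLOTS = 8  # fixed number of micro-batch slots per optimizer step


def accumulation_table(world_sizes=(8, 4, 2, 1)):
    """Return (world_size, grad_accum_steps, total_slots) for each world size."""
    rows = []
    for world_size in world_sizes:
        grad_accum_steps = TOTAL_SLOTS // world_size
        rows.append((world_size, grad_accum_steps, world_size * grad_accum_steps))
    return rows


for world_size, steps, slots in accumulation_table():
    print(f"world_size={world_size} grad_accum_steps={steps} slots={slots}")
```

The last column is the product `world_size * grad_accum_steps`, which stays at 8 for each of the four configurations.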

## Memory Optimization Through Micro-Batching

By dividing the effective batch into smaller micro-batches, the implementation reduces peak activation memory. The logic at lines 886–889 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) calculates local token counts by dividing the global token count by both `world_size` and `grad_accum_steps`:

```python
local_tokens = global_tokens // (self.world_size * grad_accum_steps)
```

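Because `world_size * grad_accum_steps` equals 8 whenever `world_size` divides 8, each forward pass processes one eighth of the global token count regardless of GPU count. This keeps memory usage manageable even on 16 GB GPUs when the effective batch size would otherwise exceed VRAM capacity.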
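
A short sketch of the division makes the constant micro-batch size visible. The helper name `local_tokens_per_pass` and the global token count used here are assumptions chosen for illustration, not values read from the repository.

```python
# Illustrative sketch; the token count below is an assumed value.
def local_tokens_per_pass(global_tokens, world_size, total_slots=8):
    """Tokens one GPU processes in a single forward pass."""
    grad_accum_steps = total_slots // world_size
    return global_tokens // (world_size * grad_accum_steps)


GLOBAL_TOKENS = 524_288  # hypothetical tokens per optimizer step

for world_size in (8, 4, 2, 1):
    tokens = local_tokens_per_pass(GLOBAL_TOKENS, world_size)
    print(f"world_size={world_size} local_tokens={tokens}")
```

Under these assumptions every configuration sees the same number of tokens per forward pass, so peak activation memory per GPU does not grow as GPUs are removed.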

## Controlled Gradient Synchronization

The training loop explicitly controls when distributed gradient synchronization occurs. According to lines 945–946 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), `require_backward_grad_sync` is set to `True` only on the final micro-step:

```python
model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
```

This prevents unnecessary all-reduce operations during intermediate accumulation steps. Compared with synchronizing after every micro-step, it cuts the number of all-reduce rounds per optimizer step by a factor of `grad_accum_steps`.
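
The flag's behavior over one optimizer step can be sketched without any distributed runtime. The function `sync_schedule` is a hypothetical helper that only evaluates the same Boolean expression for each micro-step; it does not perform communication.

```python
# Illustrative sketch; sync_schedule is a hypothetical helper.
def sync_schedule(grad_accum_steps):
    """Per-micro-step flags: True where the gradient all-reduce would run."""
    return [
        micro_step == grad_accum_steps - 1
        for micro_step in range(grad_accum_steps)
    ]


for world_size in (8, 4, 2):
    steps = 8 // world_size
    schedule = sync_schedule(steps)
    print(f"world_size={world_size} schedule={schedule} syncs={sum(schedule)}")
```

Each configuration ends up with exactly one synchronizing micro-step per optimizer step, however many local passes precede it.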

## Loss Scaling and Gradient Magnitude

To keep gradient magnitudes consistent across world sizes, each micro-batch loss is scaled before backpropagation, and the loss reported for logging is averaged over the micro-steps. Line 1019 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) implements the averaging:

```python
train_loss /= grad_accum_steps
```

Additionally, the `grad_scale` factor (defined as `1.0 / grad_accum_steps`) multiplies each micro-batch loss before `backward()`, so the accumulated gradients equal the average over all micro-batches rather than their sum. Without it, gradients would be `grad_accum_steps` times larger on configurations that accumulate over many steps, such as a single GPU.
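
The effect of the scaling can be verified on a toy problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss and hand-written gradients; the model, data, and function names are assumptions made for illustration and have nothing to do with the GPT model in the repository.

```python
# Illustrative sketch on a toy model; not code from train_gpt.py.
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, grad_accum_steps):
    """Sum the gradients of micro-batch losses, each scaled by grad_scale."""
    micro = len(xs) // grad_accum_steps
    grad_scale = 1.0 / grad_accum_steps
    total = 0.0
    for step in range(grad_accum_steps):
        lo, hi = step * micro, (step + 1) * micro
        total += grad_scale * mse_grad(w, xs[lo:hi], ys[lo:hi])
    return total


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 5.8, 7.1, 8.0]
w = 0.3

full_batch = mse_grad(w, xs, ys)
for steps in (1, 2, 4, 8):
    # Scaled accumulation reproduces the full-batch gradient.
    assert abs(accumulated_grad(w, xs, ys, steps) - full_batch) < 1e-9
```

The equality holds because the micro-batches are equally sized, which matches the constant per-pass token count described earlier.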

## Practical Training Loop Example

The following fragment, adapted from lines 940–975 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), shows the accumulation logic in one place:

```python
train_loss = 0.0
grad_scale = 1.0 / grad_accum_steps

for micro_step in range(grad_accum_steps):
    # Only synchronize (all-reduce) gradients on the last micro-step
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)

    # Load this rank's micro-batch
    x, y = train_loader.next_batch(
        args.train_batch_tokens,
        args.train_seq_len,
        grad_accum_steps,
    )

    # Forward pass; keep the unscaled loss for logging
    loss = model(x, y)
    train_loss += loss.item()

    # Backward pass on the scaled loss; gradients accumulate in param.grad
    (loss * grad_scale).backward()

# Average the per-micro-step losses for logging
train_loss /= grad_accum_steps
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

## Summary

- **Inverse scaling** of `grad_accum_steps` (calculated as `8 // world_size`) maintains a constant effective batch size across 1, 2, 4, or 8 H100 GPUs.
- **Memory efficiency** is achieved by processing smaller micro-batches (`local_tokens = global_tokens // (world_size * grad_accum_steps)`), reducing peak VRAM usage.
- **Communication optimization** occurs through `require_backward_grad_sync` logic that triggers all-reduce only on the final accumulation step.
- **Numerical stability** is preserved by scaling each micro-batch loss by `1.0 / grad_accum_steps`, so accumulated gradients are averages and the optimizer behaves consistently regardless of GPU count.

## Frequently Asked Questions

### What is the relationship between world_size and grad_accum_steps in parameter-golf?

The relationship is inverse: `grad_accum_steps = 8 // world_size`. For world sizes that divide 8, the number of gradient accumulation steps decreases in proportion as you add GPUs. On a full 8xH100 node, `grad_accum_steps` equals 1, while on a single GPU it equals 8, so the effective batch size remains identical across configurations.

### How does gradient accumulation reduce memory usage in distributed training?

Gradient accumulation reduces memory usage by splitting the effective batch into smaller micro-batches. In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), the token loader calculates `local_tokens = global_tokens // (world_size * grad_accum_steps)`, meaning each forward pass processes only a fraction of the total data. This reduces peak activation memory, allowing large effective batch sizes to fit on GPUs with limited VRAM, such as 16 GB cards.

### When does gradient synchronization occur during accumulation loops?

Gradient synchronization occurs only on the final micro-step of the accumulation loop. According to lines 945–946 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), the code sets `model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)`. This Boolean flag ensures that the expensive all-reduce operation happens only once per effective batch, after all local gradients have been accumulated, minimizing communication overhead in the distributed 8xH100 setup.

### Can I run the parameter-golf training script on fewer than 8 H100 GPUs?

Yes, the training script runs on 1, 2, 4, or 8 GPUs without code changes. The automatic scaling of `grad_accum_steps` (calculated as `8 // world_size`) keeps the effective batch size constant across those configurations. World sizes that do not divide 8 would not preserve it, because integer division rounds down: with 3 GPUs, `8 // 3 = 2` yields only 6 micro-batch slots. Wall-clock time also increases as GPU count decreases, because each GPU must run more sequential forward-backward passes per optimizer step, trading computation time for the flexibility to run on smaller hardware configurations.
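
A hypothetical guard could reject world sizes for which `8 // world_size` would silently drop micro-batch slots. The name `resolve_grad_accum_steps` is illustrative and not taken from the repository; treat this as a sketch of the check, not a description of what the script does.

```python
# Illustrative sketch; resolve_grad_accum_steps is a hypothetical helper.
def resolve_grad_accum_steps(world_size, total_slots=8):
    """Return the accumulation steps, or raise if the slots cannot be split evenly."""
    if world_size <= 0 or total_slots % world_size != 0:
        raise ValueError(
            f"world_size={world_size} must be a positive divisor of {total_slots}"
        )
    return total_slots // world_size


for world_size in (1, 2, 3, 4, 8):
    try:
        steps = resolve_grad_accum_steps(world_size)
        print(f"world_size={world_size} grad_accum_steps={steps}")
    except ValueError as err:
        print(f"rejected: {err}")
```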