# Exponential Moving Average (EMA) Implementation for Small Models in Parameter-Golf

> Learn how Exponential Moving Average (EMA) is implemented for small models in Parameter-Golf, using a lightweight dictionary of shadow tensors.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: internals
- Published: 2026-04-17

---

**Exponential Moving Average (EMA) in the Parameter-Golf repository is implemented as a lightweight, dictionary-based weight-averaging system controlled by environment variables. It updates a shadow copy of the model's tensors in place after each optimizer step, without instantiating a second model object.**

The `openai/parameter-golf` repository provides a minimal, efficient training framework for small-scale language models. EMA is an optional weight-averaging technique that maintains a shadow copy of the model weights, helping to smooth out step-to-step noise and improve final model quality without the overhead of storing multiple checkpoints for averaging.

## How EMA Is Configured in Parameter-Golf

### Environment Variable Controls

EMA is disabled by default and activated through environment variables read at startup. In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), lines 106–107 parse these settings directly from `os.environ`:

```python
ema_enabled = int(os.environ.get('EMA_ENABLED', '0'))
ema_decay = float(os.environ.get('EMA_DECAY', '0.997'))
```

- **EMA_ENABLED**: Set to `1` to activate EMA tracking (default `0`)
- **EMA_DECAY**: Controls the smoothing factor, typically set between `0.99` and `0.9997` (default `0.997`)
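The parsing above can be wrapped in a small helper to make the defaults easy to test. This is a minimal sketch: the function name `read_ema_config` and the range check are assumptions added for illustration and are not part of `train_gpt.py`.

```python
import os

def read_ema_config(env=None):
    """Return (enabled, decay) parsed from environment-style string values."""
    env = os.environ if env is None else env
    enabled = bool(int(env.get("EMA_ENABLED", "0")))
    decay = float(env.get("EMA_DECAY", "0.997"))
    # Illustrative sanity check (an assumption, not taken from the repository):
    # a decay outside [0, 1) would make the average diverge or never move.
    if not 0.0 <= decay < 1.0:
        raise ValueError(f"EMA_DECAY must be in [0, 1), got {decay}")
    return enabled, decay

# Passing a plain dict stands in for os.environ here.
print(read_ema_config({}))
print(read_ema_config({"EMA_ENABLED": "1", "EMA_DECAY": "0.999"}))
```

In a shell, the equivalent would be launching the script with `EMA_ENABLED=1 EMA_DECAY=0.999` set in the environment.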

## The Dictionary-Based EMA State Initialization

When EMA is enabled, the script creates a shadow state dictionary that mirrors the model's `state_dict()`. Because only tensors are stored, no second module has to be instantiated, which keeps the overhead modest for small models (e.g., 9 layers with a hidden size of 512).

At lines 1342–1343 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), the initialization occurs:

```python
ema_state = {name: t.detach().float().clone()
             for name, t in base_model.state_dict().items()}
```

This creates **float32 copies** of every tensor in the state dict (parameters and any registered buffers), preserving precision during averaging regardless of the model's training dtype (FP16 or BF16).
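The effect of the initialization can be checked on a toy module. The sketch below assumes a tiny `nn.Linear` layer as a stand-in for `base_model`; the real model in `train_gpt.py` is a GPT, but the dictionary comprehension is the same.

```python
import torch
import torch.nn as nn

# Toy stand-in for base_model, cast to bfloat16 to mimic a low-precision run.
base_model = nn.Linear(4, 2).to(torch.bfloat16)

ema_state = {name: t.detach().float().clone()
             for name, t in base_model.state_dict().items()}

for name, t in ema_state.items():
    print(name, t.dtype, tuple(t.shape))

# The shadow tensors own their storage: changing the model afterwards
# leaves the shadow copy untouched.
with torch.no_grad():
    base_model.weight.add_(1.0)
shadow_is_independent = not torch.equal(
    ema_state["weight"], base_model.weight.float()
)
print("shadow independent of model:", shadow_is_independent)
```

The `.clone()` call matters most when the model is already float32, because `.float()` then returns the same tensor and `.detach()` shares its storage.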

## Per-Step EMA Update Mechanism

After each optimizer step, the EMA state is updated in-place using a decay formula. The implementation avoids gradient tracking and uses efficient in-place operations to minimize overhead.

Lines 1420–1424 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) contain the core update logic:

```python
with torch.no_grad():
    for name, t in base_model.state_dict().items():
        ema_state[name].mul_(ema_decay).add_(
            t.detach().float(),
            alpha=1.0 - ema_decay
        )
```

The formula `ema_state = ema_decay * ema_state + (1 - ema_decay) * current_param` computes the exponential moving average efficiently.
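To get a feel for what a decay of `0.997` means, the scalar sketch below applies the same update rule to a constant input and computes the half-life of a past value's weight. The helper names `ema_update` and `half_life_steps` are illustrative and do not come from the repository.

```python
import math

def ema_update(ema, value, decay):
    """One EMA step: decay * ema + (1 - decay) * value."""
    return decay * ema + (1.0 - decay) * value

def half_life_steps(decay):
    """Number of steps after which an old value's weight has halved."""
    return math.log(0.5) / math.log(decay)

decay = 0.997
ema = 0.0
for _ in range(1000):
    ema = ema_update(ema, 1.0, decay)

# Starting from 0 and averaging a constant 1.0, the EMA after n steps
# equals 1 - decay**n, so it approaches 1.0 without overshooting.
closed_form = 1.0 - decay ** 1000
print(f"ema after 1000 steps: {ema:.4f} (closed form {closed_form:.4f})")
print(f"half-life at decay {decay}: {half_life_steps(decay):.1f} steps")
```

With the default decay, a weight snapshot loses half its influence after roughly 231 steps, and the effective averaging window is about `1 / (1 - 0.997)`, or roughly 333 steps.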

### Optional Warm-Up Schedule

Some experimental runs apply a warm-up decay schedule that gradually increases the effective decay rate early in training:

```python
decay = min(args.ema_decay, (1.0 + step) / (10.0 + step))
```

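Because the effective decay starts low, the average tracks the current weights closely during the first steps and forgets them quickly, which helps prevent the EMA from locking onto noisy initial parameter states.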
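The shape of this schedule is easiest to see by evaluating it at a few steps. The sketch below wraps the expression in a hypothetical helper, `warmup_decay`, and assumes the default cap of `0.997`.

```python
def warmup_decay(step, max_decay=0.997):
    """Effective decay at a given step: a ramp capped at max_decay."""
    return min(max_decay, (1.0 + step) / (10.0 + step))

for step in (0, 10, 100, 1000, 5000):
    print(f"step {step:>5}: decay = {warmup_decay(step):.4f}")
```

The ramp starts at `0.1` on step 0, so the first updates almost overwrite the shadow weights. With a cap of `0.997`, the ramp term `(1 + step) / (10 + step)` reaches the cap at about step 2990, after which the schedule is constant.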

## Applying EMA Weights After Training

When training completes (or when evaluation occurs after a wall-clock time cap), the EMA weights are copied back into the base model. This replaces the trained weights with the smoothed averages, typically yielding better generalization.

Lines 1460–1464 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) handle this finalization:

```python
avg_state = {
    name: ema_state[name].to(dtype=base_model.state_dict()[name].dtype)
    for name in ema_state
}
base_model.load_state_dict(avg_state, strict=True)
```

The dtype conversion ensures compatibility with the original model's precision settings before loading.
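The finalization step can be exercised end to end on a toy module. This is a sketch under assumptions: `nn.Linear` stands in for `base_model`, and the shadow state is filled with a constant `0.5` so the result is easy to verify. It also fetches `state_dict()` once instead of once per key, which is a small variation on the snippet above rather than the repository's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for base_model, kept in bfloat16 like a low-precision run.
base_model = nn.Linear(4, 2).to(torch.bfloat16)

# Pretend these float32 tensors are the accumulated averages.
ema_state = {name: torch.full_like(t, 0.5, dtype=torch.float32)
             for name, t in base_model.state_dict().items()}

# Cast each averaged tensor back to the dtype the model expects.
model_state = base_model.state_dict()
avg_state = {name: t.to(dtype=model_state[name].dtype)
             for name, t in ema_state.items()}

# strict=True raises if any key is missing or unexpected.
base_model.load_state_dict(avg_state, strict=True)
print(base_model.weight.dtype, base_model.weight[0, 0].item())
```

Because the trained weights are overwritten, any run that still needs them after this point would have to save a copy first.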

## Advanced Usage with Delayed Start

For experiments requiring stable initialization before averaging begins, the [`train_gpt_cuda_binary.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_cuda_binary.py) script supports delayed EMA activation. The `EMA_START_FRACTION` environment variable sets the fraction of the run that must elapse before EMA starts, measured against both the wall-clock budget and the iteration count; averaging begins as soon as either threshold is reached.

The implementation checks whether to initiate EMA at each step:

```python
if not _ema_started:
    should_start = (elapsed_ms >= args.ema_start_fraction * max_wallclock_ms) or \
                   (step >= int(args.iterations * args.ema_start_fraction))
    if should_start:
        _ema_started = True
        # Initialize EMA parameters from current model state
        for ema_p, base_p in zip(ema_model.parameters(), base_model.parameters()):
            ema_p.data.copy_(base_p.data)
```

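Once started, the update logic follows the same decay formula as the baseline implementation. Note that this snippet holds its average in a separate `ema_model` rather than in a dictionary, so the shadow weights are seeded from the model state at the moment EMA starts, not from the initial weights.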
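The start condition can be isolated as a pure function, which makes its either-or behavior easy to test. The function name `should_start_ema` and its argument list are assumptions for illustration; only the two comparisons mirror the snippet above.

```python
def should_start_ema(elapsed_ms, step, start_fraction,
                     max_wallclock_ms, iterations):
    """Return True once either the time or the step threshold is reached."""
    time_reached = elapsed_ms >= start_fraction * max_wallclock_ms
    step_reached = step >= int(iterations * start_fraction)
    return time_reached or step_reached

# Hypothetical run: 10-minute wall-clock cap, 1000 iterations, start at 20%.
cap_ms, iters, frac = 600_000, 1000, 0.2

print(should_start_ema(100_000, 50, frac, cap_ms, iters))   # neither threshold met
print(should_start_ema(150_000, 50, frac, cap_ms, iters))   # time threshold met
print(should_start_ema(100_000, 250, frac, cap_ms, iters))  # step threshold met
```

One consequence of the `or` in this sketch: if `max_wallclock_ms` is `0`, the time comparison is true immediately, so the step threshold never gets a chance to delay the start.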

## Summary

- **EMA in Parameter-Golf** is controlled via environment variables (`EMA_ENABLED`, `EMA_DECAY`) and implemented as a dictionary of float tensors shadowing the model parameters.
- **Memory efficiency** comes from storing only a dictionary of tensors rather than a second model object, making it suitable for small models with 9–12 layers and 512–1024 hidden dimensions.
- **Update mechanism** uses in-place tensor operations with `torch.no_grad()` to compute `ema_state = decay * ema_state + (1 - decay) * params` after each optimizer step.
- **Finalization** replaces the trained weights with EMA-averaged weights before evaluation or checkpointing by loading the shadow state back into the model.
- **Advanced features** include warm-up decay schedules and delayed start options based on wall-clock time or iteration fractions.

## Frequently Asked Questions

### What is the default EMA decay rate in Parameter-Golf?

The default decay rate is **0.997**, set via the `EMA_DECAY` environment variable. This value provides strong smoothing for small models while still allowing the average to adapt to improvements during training. You can override this by setting `EMA_DECAY=0.999` for slower adaptation or `EMA_DECAY=0.99` for faster tracking.

### Does enabling EMA significantly increase memory usage?

No, the implementation is designed to be **memory-efficient for small models**. Rather than building a second model object, Parameter-Golf stores only a dictionary of float32 tensors, which costs 4 bytes per state-dict element. For a small model with ~9 layers and a hidden size of 512, this adds at most a few hundred megabytes of GPU memory, making it practical for resource-constrained training runs.
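As a rough back-of-the-envelope check, the shadow dictionary costs four bytes per state-dict element, since every tensor is stored as float32. The sketch below uses a hypothetical element count of 20 million purely for illustration; it is not a measured figure for any Parameter-Golf model.

```python
def ema_shadow_bytes(num_elements, bytes_per_element=4):
    """Memory taken by a float32 shadow copy of num_elements values."""
    return num_elements * bytes_per_element

# Hypothetical size, for illustration only. With a real model the count
# would be sum(t.numel() for t in base_model.state_dict().values()).
num_elements = 20_000_000
shadow_mib = ema_shadow_bytes(num_elements) / 2**20
print(f"{num_elements:,} elements -> {shadow_mib:.1f} MiB of float32 shadow state")
```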

### Can I use EMA with the CUDA binary training script?

Yes, the [`train_gpt_cuda_binary.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_cuda_binary.py) script supports EMA with additional features like **delayed start**. You can set `EMA_ENABLED=1` and optionally define `EMA_START_FRACTION` to begin averaging only after a certain percentage of training time or iterations has elapsed. This is particularly useful when early training steps produce noisy gradients that might contaminate the moving average.

### When should I use the delayed start option for EMA?

Use the delayed start option when your training exhibits **high gradient variance during the warm-up phase**. In Parameter-Golf, this is controlled by `EMA_START_FRACTION` in `train_gpt_cuda_binary.py`. Setting it to `0.2` or `0.3` delays averaging until 20–30% of the run has elapsed, by which point the learning rate has typically stabilized and the model has left the high-loss initial region, which tends to give a more reliable final average.