Exponential Moving Average (EMA) Implementation for Small Models in Parameter-Golf

Exponential Moving Average (EMA) in the Parameter-Golf repository is implemented as a lightweight, dictionary-based weight-averaging scheme controlled by environment variables. A dictionary of float32 shadow tensors is updated in place after each optimizer step, so no second model object has to be instantiated.

The openai/parameter-golf repository provides a minimal, efficient training framework for small-scale language models. EMA is an optional technique that maintains a shadow copy of the model weights, helping to stabilize training and improve final model quality without having to store and average multiple checkpoints.

How EMA Is Configured in Parameter-Golf

Environment Variable Controls

EMA is disabled by default and activated through environment variables read at startup. In train_gpt.py, lines 106–107 parse these settings directly from os.environ:

ema_enabled = int(os.environ.get('EMA_ENABLED', '0'))
ema_decay = float(os.environ.get('EMA_DECAY', '0.997'))

  • EMA_ENABLED: Set to 1 to activate EMA tracking (default 0)
  • EMA_DECAY: Controls the smoothing factor, typically set between 0.99 and 0.9997 (default 0.997)
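
The two settings above can be read and sanity-checked along these lines. This is a minimal sketch: the helper name read_ema_config and the range check on the decay are assumptions for illustration, not part of train_gpt.py, which simply calls os.environ.get as quoted.

```python
import os


def read_ema_config(env=None):
    """Read EMA settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    enabled = bool(int(env.get("EMA_ENABLED", "0")))
    decay = float(env.get("EMA_DECAY", "0.997"))
    # Assumption: a decay outside (0, 1) is treated as a configuration error.
    if not 0.0 < decay < 1.0:
        raise ValueError(f"EMA_DECAY must be in (0, 1), got {decay}")
    return enabled, decay


if __name__ == "__main__":
    print(read_ema_config({}))  # falls back to the documented defaults
    print(read_ema_config({"EMA_ENABLED": "1", "EMA_DECAY": "0.999"}))
```

Passing a plain dictionary instead of os.environ keeps the helper easy to test without touching the real process environment.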

The Dictionary-Based EMA State Initialization

When EMA is enabled, the system creates a shadow state dictionary that mirrors the model's state dict. This approach stores tensors only rather than a second module instance; the extra cost is one float32 copy of the weights, which is modest for small models (e.g., 9 layers × 512 hidden size).

At lines 1342–1343 in train_gpt.py, the initialization occurs:

ema_state = {name: t.detach().float().clone() 
             for name, t in base_model.state_dict().items()}

This creates float32 copies of every tensor in the state dict (parameters and any registered buffers), ensuring precision during the averaging process regardless of the model's training dtype (FP16 or BF16).
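
To make the cost of the shadow dictionary concrete, the sketch below builds it for a toy module and compares byte counts. The toy model is an assumption standing in for base_model; only the dictionary comprehension is taken from the snippet above.

```python
import torch
import torch.nn as nn

# Toy stand-in for base_model (assumption: any nn.Module behaves the same way).
base_model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16)).to(torch.bfloat16)

# Same expression as in the article: detached float32 clones keyed by name.
ema_state = {name: t.detach().float().clone()
             for name, t in base_model.state_dict().items()}

# The shadow copy costs 4 bytes per element, whatever the training dtype is.
ema_bytes = sum(t.numel() * t.element_size() for t in ema_state.values())
model_bytes = sum(t.numel() * t.element_size()
                  for t in base_model.state_dict().values())
print(f"model: {model_bytes} bytes, EMA shadow: {ema_bytes} bytes")
```

With a BF16 model the shadow dictionary is twice the size of the weights themselves, since each element grows from 2 to 4 bytes.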

Per-Step EMA Update Mechanism

After each optimizer step, the EMA state is updated in-place using a decay formula. The implementation avoids gradient tracking and uses efficient in-place operations to minimize overhead.

Lines 1420–1424 in train_gpt.py contain the core update logic:

with torch.no_grad():
    for name, t in base_model.state_dict().items():
        ema_state[name].mul_(ema_decay).add_(
            t.detach().float(), 
            alpha=1.0 - ema_decay
        )

The formula ema_state = ema_decay * ema_state + (1 - ema_decay) * current_param computes the exponential moving average efficiently.
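
The update can be exercised end to end on a toy model. The following is a sketch under stated assumptions: the nn.Linear model, the SGD optimizer, the small decay of 0.9, and the wrapper function ema_update are all illustrative choices, while the body of the update loop matches the snippet above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # toy stand-in for base_model
ema_decay = 0.9          # deliberately small so the lag is visible in a few steps
ema_state = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def ema_update(ema_state, model, decay):
    """In-place update: ema = decay * ema + (1 - decay) * current."""
    with torch.no_grad():
        for name, t in model.state_dict().items():
            ema_state[name].mul_(decay).add_(t.detach().float(), alpha=1.0 - decay)


for step in range(5):
    x = torch.randn(16, 4)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    before = ema_state["weight"].clone()
    ema_update(ema_state, model, ema_decay)
    # The in-place form agrees with the written-out formula.
    expected = ema_decay * before + (1 - ema_decay) * model.weight.detach()
    assert torch.allclose(ema_state["weight"], expected)
```

Because mul_ and add_ modify the stored tensors directly, no temporary copy of the full state is allocated per step.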

Optional Warm-Up Schedule

Some experimental runs apply a warm-up decay schedule that gradually increases the effective decay rate early in training:

decay = min(args.ema_decay, (1.0 + step) / (10.0 + step))

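With this schedule the effective decay is only 0.1 at step 0, so the EMA tracks the current weights closely at first, and the decay then rises toward the configured cap. This conservative start helps prevent the EMA from locking in noisy initial parameter states.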
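
The shape of the warm-up schedule is easy to inspect numerically. The sketch below wraps the formula in a function (the name warmup_decay is an assumption) and prints the effective decay at a few steps, using the default cap of 0.997.

```python
def warmup_decay(step, ema_decay=0.997):
    """Effective decay at a given step, using the warm-up formula from the text."""
    return min(ema_decay, (1.0 + step) / (10.0 + step))


if __name__ == "__main__":
    for step in (0, 10, 100, 1000, 5000):
        print(step, round(warmup_decay(step), 4))
```

Solving (1 + step) / (10 + step) >= 0.997 for step shows that the ramp reaches the 0.997 cap at roughly step 2990; a higher cap pushes that point later.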

Applying EMA Weights After Training

When training completes (or when evaluation occurs after a wall-clock time cap), the EMA weights are copied back into the base model. This replaces the trained weights with the smoothed averages, typically yielding better generalization.

Lines 1460–1464 in train_gpt.py handle this finalization:

avg_state = {
    name: ema_state[name].to(dtype=base_model.state_dict()[name].dtype)
    for name in ema_state
}
base_model.load_state_dict(avg_state, strict=True)

The dtype conversion ensures compatibility with the original model's precision settings before loading.
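
A small round trip illustrates the cast-and-load step. In this sketch the toy model and the all-ones shadow values are assumptions chosen so the result is easy to check; the state dict is also read once up front rather than once per name, which is a minor variation on the quoted snippet.

```python
import torch
import torch.nn as nn

base_model = nn.Linear(4, 2).to(torch.bfloat16)  # toy stand-in for base_model

# Pretend this came out of training: float32 shadow weights (here all ones).
ema_state = {name: torch.ones_like(t, dtype=torch.float32)
             for name, t in base_model.state_dict().items()}

# Look up each target dtype once, then cast the float32 averages back to it.
target_dtypes = {name: t.dtype for name, t in base_model.state_dict().items()}
avg_state = {name: ema_state[name].to(dtype=target_dtypes[name])
             for name in ema_state}

# strict=True fails loudly if any key is missing or unexpected.
base_model.load_state_dict(avg_state, strict=True)
print(base_model.weight.dtype, base_model.weight.float().mean().item())
```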

Advanced Usage with Delayed Start

For experiments requiring stable initialization before averaging begins, the train_gpt_cuda_binary.py script supports delayed EMA activation. The EMA_START_FRACTION environment variable sets the fraction of the run that must elapse before EMA starts, measured against both the wall-clock budget and the iteration count; averaging begins as soon as either threshold is reached.

The implementation checks whether to initiate EMA at each step:

if not _ema_started:
    # Start once either the time budget or the step budget passes the fraction.
    should_start = (
        elapsed_ms >= args.ema_start_fraction * max_wallclock_ms
        or step >= int(args.iterations * args.ema_start_fraction)
    )
    if should_start:
        _ema_started = True
        # Initialize EMA parameters from current model state
        for ema_p, base_p in zip(ema_model.parameters(), base_model.parameters()):
            ema_p.data.copy_(base_p.data)

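Note that this variant holds the average in a second module (ema_model) rather than in a dictionary. Once started, the update logic follows the same decay formula as the baseline implementation.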
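
The start condition can be isolated into a pure function, which makes its behavior easy to check. This is a sketch: the function name should_start_ema and its argument list are hypothetical, and the guard for a zero wall-clock cap is an added assumption that is not present in the quoted snippet.

```python
def should_start_ema(step, elapsed_ms, iterations, max_wallclock_ms, start_fraction):
    """Return True once either the time or the step threshold has been reached."""
    # Assumption: a cap of 0 means "no wall-clock limit", so the time test is
    # skipped; without this guard elapsed_ms >= 0 would be true immediately.
    time_reached = (max_wallclock_ms > 0
                    and elapsed_ms >= start_fraction * max_wallclock_ms)
    step_reached = step >= int(iterations * start_fraction)
    return time_reached or step_reached


if __name__ == "__main__":
    common = dict(iterations=1000, max_wallclock_ms=600_000, start_fraction=0.2)
    print(should_start_ema(step=50, elapsed_ms=10_000, **common))   # neither reached
    print(should_start_ema(step=50, elapsed_ms=150_000, **common))  # time reached
    print(should_start_ema(step=200, elapsed_ms=10_000, **common))  # steps reached
```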

Summary

  • EMA in Parameter-Golf is controlled via environment variables (EMA_ENABLED, EMA_DECAY) and implemented as a dictionary of float tensors shadowing the model parameters.
  • Memory cost is one float32 copy of the state dict tensors, with no second module instance, which keeps it manageable for small models with 9–12 layers and 512–1024 hidden dimensions.
  • Update mechanism uses in-place tensor operations with torch.no_grad() to compute ema_state = decay * ema_state + (1 - decay) * params after each optimizer step.
  • Finalization replaces the trained weights with EMA-averaged weights before evaluation or checkpointing by loading the shadow state back into the model.
  • Advanced features include warm-up decay schedules and delayed start options based on wall-clock time or iteration fractions.

Frequently Asked Questions

What is the default EMA decay rate in Parameter-Golf?

The default decay rate is 0.997, set via the EMA_DECAY environment variable. This value provides strong smoothing for small models while still allowing the average to adapt to improvements during training. You can override this by setting EMA_DECAY=0.999 for slower adaptation or EMA_DECAY=0.99 for faster tracking.
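
One way to compare decay values is by half-life: the number of updates after which the weight given to a particular snapshot has dropped by half. The sketch below computes it from the standard relation decay^n = 0.5; the function name ema_half_life is an assumption for illustration.

```python
import math


def ema_half_life(decay):
    """Number of updates after which a snapshot's weight in the EMA has halved."""
    return math.log(0.5) / math.log(decay)


if __name__ == "__main__":
    for decay in (0.99, 0.997, 0.999):
        print(decay, round(ema_half_life(decay)))
```

For these three settings the half-life comes out at roughly 69, 231, and 693 updates, so the default of 0.997 averages over a window a few hundred steps wide.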

Does enabling EMA significantly increase memory usage?

No, not for small models. Rather than instantiating a second model object, Parameter-Golf stores only a dictionary of tensors in float32 precision, so the overhead is about 4 bytes per element in the state dict. For a small model with ~9 layers and 512 hidden dimensions this remains a small amount of GPU memory in absolute terms, making it practical for resource-constrained training runs.

Can I use EMA with the CUDA binary training script?

Yes, the train_gpt_cuda_binary.py script supports EMA with additional features like delayed start. You can set EMA_ENABLED=1 and optionally define EMA_START_FRACTION to begin averaging only after a certain percentage of training time or iterations has elapsed. This is particularly useful when early training steps produce noisy gradients that might contaminate the moving average.

When should I use the delayed start option for EMA?

Use the delayed start option when your training exhibits high gradient variance during the warm-up phase. In Parameter-Golf, this is controlled by EMA_START_FRACTION in the train_gpt_cuda_binary.py script. Setting it to 0.2 or 0.3 means EMA begins only after the first 20–30% of the run, by which point the learning rate has typically stabilized and the model has left the high-loss initial region, resulting in a more reliable final average.
