# How the Muon Momentum Warmup Schedule Works in Parameter-Golf

> Discover how the Muon momentum warmup schedule works in Parameter-Golf. Learn its linear interpolation implementation and default values for optimizer momentum.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: internals
- Published: 2026-04-17

---

**The Muon momentum warmup schedule in openai/parameter-golf linearly interpolates the optimizer's momentum from a conservative start value (default 0.85) to the target value over a configurable number of steps (default 500) by updating the optimizer's `param_groups` each training step.**

The Muon momentum warmup schedule is a stabilization technique in the Parameter-Golf training scripts, intended to reduce instability early in training with the Muon optimizer. Implemented in the PyTorch training script, it ramps momentum gradually from a conservative initial value to the final target instead of applying full momentum from the first step.

## Hyperparameters Controlling the Warmup

The warmup behavior is governed by two environment variables parsed in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at lines 73-84:

| Hyperparameter | Purpose | Default |
|----------------|---------|---------|
| `MUON_MOMENTUM_WARMUP_START` | Momentum value at step 0 | `0.85` |
| `MUON_MOMENTUM_WARMUP_STEPS` | Number of steps to reach target momentum | `500` |

These values are stored in the `Hyperparameters` class and accessed via `args.muon_momentum_warmup_start` and `args.muon_momentum_warmup_steps` during the training loop.
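As a rough illustration of environment-backed defaults like these, the sketch below declares the three warmup-related fields and casts the string values explicitly. It is a simplified stand-in, not a copy of the repository's `Hyperparameters` class: the `WarmupHyperparameters` name and the `load_hyperparameters` helper are hypothetical, and the `0.95` default for `MUON_MOMENTUM` is assumed from the typical target value quoted in this article.

```python
import os
from dataclasses import dataclass


@dataclass
class WarmupHyperparameters:
    """Simplified stand-in for the warmup-related fields (illustrative only)."""
    muon_momentum: float
    muon_momentum_warmup_start: float
    muon_momentum_warmup_steps: int


def load_hyperparameters(env=os.environ) -> WarmupHyperparameters:
    # Environment values arrive as strings, so each one is cast explicitly.
    # The 0.95 default for MUON_MOMENTUM is an assumption for this sketch.
    return WarmupHyperparameters(
        muon_momentum=float(env.get("MUON_MOMENTUM", "0.95")),
        muon_momentum_warmup_start=float(env.get("MUON_MOMENTUM_WARMUP_START", "0.85")),
        muon_momentum_warmup_steps=int(env.get("MUON_MOMENTUM_WARMUP_STEPS", "500")),
    )


args = load_hyperparameters()
print(args)
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.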

## Calculating the Warmup Fraction

During each training step, the script computes a progress fraction `frac` that determines how much of the warmup has completed. This logic appears in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) around lines 1021-1024:

```python
frac = min(step / args.muon_momentum_warmup_steps, 1.0) \
       if args.muon_momentum_warmup_steps > 0 else 1.0
```

The `step` variable is the current training step, starting at 0. The fraction is clamped at 1.0 so that momentum stops increasing once the warmup period ends, and the `> 0` guard avoids a division by zero when the warmup is disabled.

## Linear Interpolation of Momentum Values

Using the computed fraction, the implementation performs linear interpolation between the start and target momentum values:

```python
muon_momentum = (1 - frac) * args.muon_momentum_warmup_start \
                + frac * args.muon_momentum
```

This produces a linear ramp from `MUON_MOMENTUM_WARMUP_START` (default 0.85) to the final `MUON_MOMENTUM` value (typically 0.95).
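To make the ramp concrete, the fraction and interpolation can be combined into one standalone function. This is a minimal sketch: `warmup_momentum` is a hypothetical helper name, and the `0.95` target is assumed from the typical value mentioned above rather than read from the script.

```python
def warmup_momentum(step: int, start: float = 0.85, target: float = 0.95,
                    warmup_steps: int = 500) -> float:
    """Linearly interpolate momentum from `start` to `target` over `warmup_steps`."""
    # A non-positive duration skips the warmup and avoids dividing by zero.
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    return (1 - frac) * start + frac * target


# Sample the schedule before, during, and after the warmup window.
for step in (0, 125, 250, 500, 1000):
    print(f"step={step:4d} momentum={warmup_momentum(step):.4f}")
```

With these assumed values, momentum is 0.85 at step 0, about 0.875 at step 125, about 0.90 at step 250, and 0.95 from step 500 onward.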

## Updating the Optimizer State

The calculated momentum value is applied to every parameter group in the Muon optimizer immediately before the optimization step:

```python
for group in optimizer_muon.param_groups:
    group["momentum"] = muon_momentum
```

Because the update runs on every step, the optimizer always uses the current scheduled momentum for the upcoming parameter update.
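The pattern of writing into `param_groups` can be tried without PyTorch or the Muon optimizer installed. In the sketch below, the `SimpleNamespace` object is only a stand-in that exposes a `param_groups` list of dictionaries, and `set_momentum` is a hypothetical helper name.

```python
from types import SimpleNamespace

# Stand-in for the Muon optimizer: only the `param_groups` attribute matters here.
optimizer_muon = SimpleNamespace(
    param_groups=[
        {"params": [], "momentum": 0.95},
        {"params": [], "momentum": 0.95},
    ]
)


def set_momentum(optimizer, momentum: float) -> None:
    """Write the scheduled momentum into every parameter group."""
    for group in optimizer.param_groups:
        group["momentum"] = momentum


# Apply a mid-warmup value before the (omitted) optimizer step.
set_momentum(optimizer_muon, 0.875)
print([group["momentum"] for group in optimizer_muon.param_groups])
```

Overwriting the dictionary entry is enough because PyTorch optimizers conventionally read hyperparameters such as `momentum` from each group when `step()` runs.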

## Configuration Examples

### Default Behavior

To use the default warmup schedule (0.85 → 0.95 over 500 steps), simply run the training script without modification:

```bash
python train_gpt.py
```

### Custom Environment Variables

Adjust the warmup behavior by setting environment variables before execution:

```bash
MUON_MOMENTUM_WARMUP_START=0.9 \
MUON_MOMENTUM_WARMUP_STEPS=1000 \
python train_gpt.py
```

This configuration starts momentum at 0.9 and extends the warmup period to 1000 steps for more gradual stabilization.
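One way to quantify "more gradual" is the per-step momentum increase during the ramp, which is the gap between target and start divided by the number of warmup steps. The sketch below compares the default and custom settings; `momentum_increment` is a hypothetical helper, and the `0.95` target is again an assumption.

```python
def momentum_increment(start: float, target: float, warmup_steps: int) -> float:
    """Per-step momentum change during the warmup (0.0 when warmup is disabled)."""
    if warmup_steps <= 0:
        return 0.0
    return (target - start) / warmup_steps


# Target momentum of 0.95 is assumed for both configurations.
default_rate = momentum_increment(0.85, 0.95, 500)
custom_rate = momentum_increment(0.90, 0.95, 1000)

print(f"default: {default_rate:.1e} per step")
print(f"custom:  {custom_rate:.1e} per step")
```

Under that assumption, the custom configuration raises momentum by roughly 5e-5 per step, about a quarter of the default rate of roughly 2e-4.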

### Debugging Momentum Values

To inspect momentum values during training, add debugging output after the warmup computation in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py):

```python
# After the muon_momentum calculation
print(f"[debug] step={step} momentum={muon_momentum:.4f}")
```

## Implementation in MLX Backend

The same warmup logic appears in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) with a different default duration of 1500 steps. The formulas are identical, so by default the two backends differ only in how long the ramp lasts.

## Summary

- The Muon momentum warmup schedule linearly interpolates from `MUON_MOMENTUM_WARMUP_START` (default 0.85) to the target momentum over `MUON_MOMENTUM_WARMUP_STEPS` (default 500).
- The fraction calculation clamps at 1.0 to maintain final momentum after warmup completion.
- Each training step updates `optimizer_muon.param_groups` with the computed momentum value before the optimizer step executes.
- Configuration requires no code changes: set environment variables to customize behavior.
- The implementation is framework-agnostic, appearing in both [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) and [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py).

## Frequently Asked Questions

### What is the default momentum warmup schedule in Parameter-Golf?

The default schedule starts momentum at 0.85 and linearly increases it to the target value (typically 0.95) over 500 training steps. This gradual ramp prevents early training instability while allowing the Muon optimizer to reach full momentum quickly.

### How do I disable the Muon momentum warmup?

Set `MUON_MOMENTUM_WARMUP_STEPS` to 0 or a negative value. When `args.muon_momentum_warmup_steps` is less than or equal to 0, the fraction calculation immediately returns 1.0, causing the optimizer to use the final target momentum from the first step.

### Why does the momentum warmup use linear interpolation?

Linear interpolation provides a predictable, smooth transition between the conservative starting momentum and the higher target momentum. A plausible rationale is that lower momentum early on gives less weight to accumulated gradient history while gradients are still changing quickly, whereas higher momentum later accelerates progress along consistent gradient directions.

### Does the warmup schedule work with the MLX backend?

Yes, the identical warmup logic is implemented in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) with the same linear interpolation formulas and environment variable configuration. The only difference is the default warmup duration, which is set to 1500 steps in the MLX version rather than 500 steps in the PyTorch version.