How the Muon Momentum Warmup Schedule Works in Parameter-Golf
The Muon momentum warmup schedule in openai/parameter-golf linearly interpolates the optimizer's momentum from a conservative start value (default 0.85) to the target value over a configurable number of steps (default 500) by updating the optimizer's param_groups each training step.
The Muon momentum warmup schedule is a stabilization technique in the Parameter-Golf training framework, intended to prevent early training instability with the Muon optimizer. Implemented in the PyTorch training script (train_gpt.py), the schedule ramps momentum from a conservative initial value to the final target instead of applying full momentum from the first step.
Hyperparameters Controlling the Warmup
The warmup behavior is governed by two environment variables parsed in train_gpt.py at lines 73-84:
| Hyperparameter | Purpose | Default |
|---|---|---|
| MUON_MOMENTUM_WARMUP_START | Momentum value at step 0 | 0.85 |
| MUON_MOMENTUM_WARMUP_STEPS | Number of steps to reach target momentum | 500 |
These values are stored in the Hyperparameters class and accessed via args.muon_momentum_warmup_start and args.muon_momentum_warmup_steps during the training loop.
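The parsing described above might look roughly like the following. This is a minimal sketch, not the repository's code: the class layout, the `os.environ.get` pattern, and the `MUON_MOMENTUM` default of 0.95 are assumptions based on the names and defaults mentioned in this article.

```python
import os


class Hyperparameters:
    """Illustrative stand-in for the Hyperparameters class in train_gpt.py (layout assumed)."""

    # Target momentum reached at the end of warmup (0.95 is the typical value cited here).
    muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95))
    # Momentum used at step 0.
    muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85))
    # Number of steps over which momentum ramps to the target.
    muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500))


args = Hyperparameters()
print(args.muon_momentum_warmup_start, args.muon_momentum_warmup_steps)
```

Reading the values once at class definition keeps the training loop free of environment lookups; the loop only touches `args`.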
Calculating the Warmup Fraction
During each training step, the script computes a progress fraction frac that determines how much of the warmup has completed. This logic appears in train_gpt.py around lines 1021-1024:
frac = min(step / args.muon_momentum_warmup_steps, 1.0) \
    if args.muon_momentum_warmup_steps > 0 else 1.0
The step variable represents the current training step starting at 0. The fraction is clamped at 1.0 to ensure momentum stops increasing once the warmup period concludes.
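To make the clamping concrete, here is a small standalone sketch of the fraction calculation. The function name `warmup_fraction` is hypothetical; the expression inside mirrors the snippet above.

```python
def warmup_fraction(step: int, warmup_steps: int) -> float:
    """Return warmup progress in [0, 1]; 1.0 when warmup is disabled (warmup_steps <= 0)."""
    return min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0


for step in (0, 125, 500, 2000):
    print(step, warmup_fraction(step, 500))
# 0 0.0
# 125 0.25
# 500 1.0
# 2000 1.0
```

The `warmup_steps > 0` guard also avoids a division by zero when the warmup is turned off.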
Linear Interpolation of Momentum Values
Using the computed fraction, the implementation performs linear interpolation between the start and target momentum values:
muon_momentum = (1 - frac) * args.muon_momentum_warmup_start \
    + frac * args.muon_momentum
This produces a smooth transition from the conservative MUON_MOMENTUM_WARMUP_START (typically 0.85) to the final MUON_MOMENTUM value (typically 0.95).
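As a worked example under the defaults cited in this article (start 0.85, target 0.95, 500 steps), the sketch below evaluates the schedule at a few steps. `momentum_at` is a hypothetical helper, not a function from the repository.

```python
def momentum_at(step: int, start: float = 0.85, target: float = 0.95,
                warmup_steps: int = 500) -> float:
    """Linearly interpolate momentum from `start` to `target` over `warmup_steps` steps."""
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    return (1 - frac) * start + frac * target


for step in (0, 100, 250, 500, 1000):
    print(f"step={step:4d} momentum={momentum_at(step):.4f}")
# step=   0 momentum=0.8500
# step= 100 momentum=0.8700
# step= 250 momentum=0.9000
# step= 500 momentum=0.9500
# step=1000 momentum=0.9500
```

With these values momentum rises by 0.0002 per step during the warmup and stays at the target afterwards.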
Updating the Optimizer State
The calculated momentum value is applied to every parameter group in the Muon optimizer immediately before the optimization step:
for group in optimizer_muon.param_groups:
    group["momentum"] = muon_momentum
Because the value is written on every step, the optimizer always uses the current warmup momentum for the upcoming parameter update.
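The per-step update pattern can be exercised without PyTorch. In the sketch below, `StubOptimizer` is a hypothetical stand-in that only mimics the `param_groups` list of dictionaries exposed by PyTorch optimizers; the start, target, and step values are the defaults cited in this article.

```python
class StubOptimizer:
    """Hypothetical stand-in exposing `param_groups` like a torch.optim.Optimizer."""

    def __init__(self, num_groups: int, momentum: float):
        self.param_groups = [{"momentum": momentum} for _ in range(num_groups)]

    def step(self) -> None:
        # A real optimizer would read group["momentum"] here while updating parameters.
        pass


start, target, warmup_steps = 0.85, 0.95, 500  # defaults cited in the article
optimizer_muon = StubOptimizer(num_groups=2, momentum=target)

history = []
for step in range(0, 1001, 250):
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    muon_momentum = (1 - frac) * start + frac * target
    # Overwrite the momentum in every group before the optimizer step runs.
    for group in optimizer_muon.param_groups:
        group["momentum"] = muon_momentum
    optimizer_muon.step()
    history.append(round(muon_momentum, 4))

print(history)  # [0.85, 0.9, 0.95, 0.95, 0.95]
```

Writing into `param_groups` works because PyTorch-style optimizers typically read their hyperparameters from each group when `step()` runs, so no optimizer state has to be rebuilt.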
Configuration Examples
Default Behavior
To use the default warmup schedule (0.85 → 0.95 over 500 steps), simply run the training script without modification:
python train_gpt.py
Custom Environment Variables
Adjust the warmup behavior by setting environment variables before execution:
MUON_MOMENTUM_WARMUP_START=0.9 \
MUON_MOMENTUM_WARMUP_STEPS=1000 \
python train_gpt.py
This configuration starts momentum at 0.9 and extends the warmup period to 1000 steps for more gradual stabilization.
Debugging Momentum Values
To inspect momentum values during training, add debugging output after the warmup computation in train_gpt.py:
# After the muon_momentum calculation
print(f"[debug] step={step} momentum={muon_momentum:.4f}")
Implementation in MLX Backend
The same warmup logic appears in train_gpt_mlx.py with a longer default duration of 1500 steps. The formulas are identical, so given the same settings the PyTorch and MLX implementations follow the same schedule.
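To see what the longer MLX default means in practice, the sketch below compares both default durations at the same steps. It assumes, as this article states, that only the duration differs and that both scripts use the 0.85 start and 0.95 target; `momentum_at` is a hypothetical helper.

```python
def momentum_at(step: int, warmup_steps: int,
                start: float = 0.85, target: float = 0.95) -> float:
    """Linear momentum warmup; the same formula for both backends (start/target assumed)."""
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    return (1 - frac) * start + frac * target


PYTORCH_WARMUP_STEPS = 500   # default cited for train_gpt.py
MLX_WARMUP_STEPS = 1500      # default cited for train_gpt_mlx.py

for step in (0, 500, 1000, 1500):
    pt = momentum_at(step, PYTORCH_WARMUP_STEPS)
    mlx = momentum_at(step, MLX_WARMUP_STEPS)
    print(f"step={step:4d} pytorch={pt:.4f} mlx={mlx:.4f}")
```

At step 500 the PyTorch default has already reached 0.95, while the MLX default is a third of the way there, at roughly 0.883.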
Summary
- The Muon momentum warmup schedule linearly interpolates from MUON_MOMENTUM_WARMUP_START (default 0.85) to the target momentum over MUON_MOMENTUM_WARMUP_STEPS (default 500).
- The fraction calculation clamps at 1.0 to maintain final momentum after warmup completion.
- Each training step updates optimizer_muon.param_groups with the computed momentum value before the optimizer step executes.
- Configuration requires no code changes; set environment variables to customize behavior.
- The implementation is framework-agnostic, appearing in both train_gpt.py and train_gpt_mlx.py.
Frequently Asked Questions
What is the default momentum warmup schedule in Parameter-Golf?
The default schedule starts momentum at 0.85 and linearly increases it to the target value (typically 0.95) over 500 training steps. This gradual ramp prevents early training instability while allowing the Muon optimizer to reach full momentum quickly.
How do I disable the Muon momentum warmup?
Set MUON_MOMENTUM_WARMUP_STEPS to 0 or a negative value. When args.muon_momentum_warmup_steps is less than or equal to 0, the fraction calculation immediately returns 1.0, causing the optimizer to use the final target momentum from the first step.
Why does the momentum warmup use linear interpolation?
Linear interpolation provides a predictable, smooth transition between the conservative starting momentum and the higher target momentum. The intent is to balance early training stability (a lower momentum gives less weight to accumulated past gradients, which tend to change quickly at the start of training) with later convergence speed (a higher momentum accelerates optimization along consistent gradient directions).
Does the warmup schedule work with the MLX backend?
Yes, the identical warmup logic is implemented in train_gpt_mlx.py with the same linear interpolation formulas and environment variable configuration. The only difference is the default warmup duration, which is set to 1500 steps in the MLX version rather than 500 steps in the PyTorch version.