How the Learning Rate Warmdown Schedule Works in OpenAI Parameter Golf

The learning rate warmdown schedule in OpenAI's parameter-golf repository computes a linear decay multiplier that falls from 1.0 toward 0.0 at the end of a run. The decay window can be defined by a number of final training steps, by the remaining wall-clock time, or by a configurable fraction of the training budget.

The parameter-golf repository does not use traditional cosine annealing. Instead of decaying the learning rate throughout training, it holds the rate constant until the final phase, then linearly decays the optimizer's step size toward zero. The intent is to let the model settle during the final convergence phase without sacrificing early training speed.

Core Implementation in train_gpt.py

The canonical implementation resides in train_gpt.py, where the lr_mul function calculates the per-step learning rate multiplier. This function implements dual-mode logic that switches between iteration-based and wall-clock-based warmdown depending on whether a maximum runtime is specified.

# `args` and `max_wallclock_ms` are free variables taken from the enclosing training script
def lr_mul(step: int, elapsed_ms: float) -> float:
    # No warmdown requested
    if args.warmdown_iters <= 0:
        return 1.0

    # 1. Iteration-based warmdown (no wall-clock budget)
    if max_wallclock_ms is None:
        warmdown_start = max(args.iterations - args.warmdown_iters, 0)
        # Linear decay from 1 → 0 over the final `warmdown_iters` steps
        if warmdown_start <= step < args.iterations:
            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0)
        return 1.0

    # 2. Wall-clock-based warmdown (budget supplied)
    step_ms = elapsed_ms / max(step, 1)            # avg ms per step so far
    warmdown_ms = args.warmdown_iters * step_ms    # duration of warmdown window
    remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
    # Once inside the warmdown window, decay in proportion to the remaining time
    if remaining_ms <= warmdown_ms:
        return remaining_ms / max(warmdown_ms, 1e-9)
    return 1.0

After computing the multiplier (named scale in the snippet below), the training loop applies it to every optimizer's parameter groups, scaling each group's stored base_lr:

for opt in optimizers:
    for group in opt.param_groups:
        group["lr"] = group["base_lr"] * scale
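
As a minimal sketch of that update, the snippet below wraps the loop in a function and runs it against a hypothetical FakeOptimizer stand-in. FakeOptimizer and apply_lr_scale are illustrative names, not part of the repository or of PyTorch, and the sketch assumes each parameter group keeps its unscaled rate under base_lr, as the loop above implies.

```python
class FakeOptimizer:
    """Hypothetical stand-in exposing only `param_groups`, like a torch optimizer."""

    def __init__(self, base_lrs):
        # Each group remembers its original rate so scaling never compounds.
        self.param_groups = [{"lr": lr, "base_lr": lr} for lr in base_lrs]


def apply_lr_scale(optimizers, scale):
    """Set every group's lr to its stored base_lr times the schedule multiplier."""
    for opt in optimizers:
        for group in opt.param_groups:
            group["lr"] = group["base_lr"] * scale


optimizers = [FakeOptimizer([0.04]), FakeOptimizer([0.6, 0.008])]
apply_lr_scale(optimizers, 0.5)
print([group["lr"] for opt in optimizers for group in opt.param_groups])
```

Because the update always starts from base_lr rather than from the current lr, calling it every step with a fresh multiplier does not compound earlier decays.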

Three Learning Rate Warmdown Strategies

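The repository contains three flavors of the learning rate warmdown schedule: the iteration-based and wall-clock-based modes built into lr_mul, and a fraction-based variant found in experiment scripts. They are selected via configuration flags or environment variables.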

Iteration-Based Learning Rate Warmdown

The default strategy triggers a linear decay over the final warmdown_iters training steps. When args.max_wallclock_seconds is omitted, the system calculates the warmdown start point as max(args.iterations - args.warmdown_iters, 0). The multiplier falls linearly from 1.0 to 0.0 during this window, ensuring the optimizer takes smaller steps as training concludes.
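
To make the arithmetic concrete, here is a standalone sketch of the iteration-based branch. The function name iteration_lr_mul and its explicit parameters are illustrative; the repository version reads these values from args instead.

```python
def iteration_lr_mul(step, iterations, warmdown_iters):
    """Iteration-based multiplier, mirroring the first branch of lr_mul."""
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = max(iterations - warmdown_iters, 0)
    if warmdown_start <= step < iterations:
        # Fraction of the warmdown window that is still ahead of us.
        return max((iterations - step) / max(warmdown_iters, 1), 0.0)
    return 1.0


# 20k-step run with a 3000-step warmdown: the decay begins at step 17000.
for step in (0, 16999, 17000, 18500, 19999):
    print(step, iteration_lr_mul(step, iterations=20000, warmdown_iters=3000))
```

With these numbers the multiplier is 1.0 up to and including step 17000, 0.5 at step 18500, and 1/3000 at step 19999, so the last step in the range still uses a small nonzero rate rather than exactly zero.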

Wall-Clock-Based Learning Rate Warmdown

When training under a strict time budget (args.max_wallclock_seconds is set), the learning rate warmdown schedule adapts dynamically. The system estimates the average time per step so far (step_ms = elapsed_ms / step) and calculates the warmdown window duration as warmdown_iters * step_ms. Once the remaining wall-clock time is no longer than this window, the multiplier decays linearly to zero based on the remaining milliseconds rather than the remaining steps. Because step_ms is a running average, the window length shifts if step latency changes during the run.
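
The wall-clock branch can be sketched the same way. The function name wallclock_lr_mul and its explicit parameters are illustrative, and the example assumes a steady 100 ms per step purely to keep the numbers round.

```python
def wallclock_lr_mul(step, elapsed_ms, max_wallclock_ms, warmdown_iters):
    """Wall-clock multiplier, mirroring the second branch of lr_mul."""
    if warmdown_iters <= 0:
        return 1.0
    step_ms = elapsed_ms / max(step, 1)        # running average ms per step
    warmdown_ms = warmdown_iters * step_ms     # window length in milliseconds
    remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
    if remaining_ms <= warmdown_ms:
        return remaining_ms / max(warmdown_ms, 1e-9)
    return 1.0


BUDGET_MS = 600_000.0  # 10-minute cap
# Assumed latency: 100 ms per step, so 1200 warmdown steps span 120 seconds.
for step in (0, 4000, 4800, 5400, 6000):
    elapsed_ms = step * 100.0
    print(step, wallclock_lr_mul(step, elapsed_ms, BUDGET_MS, warmdown_iters=1200))
```

Under that assumption the decay starts at the 480-second mark, reaches 0.5 with 60 seconds left, and hits 0.0 when the budget is exhausted. Slower steps would widen the window and start the decay earlier.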

Fraction-Based Learning Rate Warmdown

Advanced experiment scripts such as train_gpt_cuda_binary.py and train_gpt_decode.py support fraction-based configuration via the WARMDOWN_FRACTION or WARMDOWN_FRAC environment variables. This approach computes the warmdown start as a proportion of the total training budget:

warmdown_start = int(args.iterations * (1.0 - args.warmdown_fraction))

This variant may also clamp to a configurable min_lr to prevent the learning rate from dropping below a functional threshold.
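
The sketch below shows one way the fraction-based start point and a learning rate floor could fit together. Only the warmdown_start formula comes from the scripts described above; the linear decay shape, the helper names fraction_lr_mul and floored_lr, and the way min_lr is applied are assumptions made for illustration and may differ from the experiment scripts.

```python
def fraction_lr_mul(step, iterations, warmdown_fraction):
    """Linear multiplier over the final `warmdown_fraction` of the run (assumed shape)."""
    warmdown_start = int(iterations * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return 1.0
    window = max(iterations - warmdown_start, 1)
    return max((iterations - step) / window, 0.0)


def floored_lr(base_lr, multiplier, min_lr):
    """Hypothetical floor: never let the scaled rate fall below min_lr."""
    return max(base_lr * multiplier, min_lr)


# 20k steps with a 25% warmdown: the decay starts at step 15000.
for step in (0, 15000, 17500, 19999):
    mul = fraction_lr_mul(step, iterations=20000, warmdown_fraction=0.25)
    print(step, mul, floored_lr(0.04, mul, min_lr=0.004))
```

In this sketch the multiplier reaches 0.5 at step 17500, and near the end of the run the floor takes over so the effective rate stays at min_lr instead of approaching zero.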

Configuring the Learning Rate Warmdown Schedule

Enable and customize the learning rate warmdown schedule using environment variables and command-line flags.

Basic iteration-based configuration:

# Warmdown over the final 3000 steps of a 20k-step run
WARMDOWN_ITERS=3000 python train_gpt.py \
    --iterations 20000 \
    --train_batch_tokens 524288 \
    --train_seq_len 1024

Wall-clock-based configuration:

# 10-minute training cap; the warmdown window covers 1200 steps' worth of time
# (about 2 minutes before the deadline if steps average roughly 100 ms)
MAX_WALLCLOCK_SECONDS=600 WARMDOWN_ITERS=1200 python train_gpt.py \
    --iterations 20000 \
    --train_batch_tokens 524288 \
    --train_seq_len 1024

Fraction-based configuration:

# Warmdown over the final 30% of training steps
WARMDOWN_FRACTION=0.30 python train_gpt_cuda_binary.py \
    --iterations 20000 \
    --train_batch_tokens 524288

Key Source Files and Functions

| File | Description | Key Function |
| --- | --- | --- |
| train_gpt.py | Main training entry point containing the canonical lr_mul implementation | lr_mul(step: int, elapsed_ms: float) -> float |
| train_gpt_mlx.py | Apple Silicon (MLX) variant mirroring the warmdown logic | lr_mul (equivalent implementation) |
| records/track_non_record_16mb/.../train_gpt_cuda_binary.py | Binary training experiments with fraction-based warmdown | Fraction calculation using WARMDOWN_FRACTION |
| records/track_10min_16mb/.../train_gpt_decode.py | Decoding experiments with WARMDOWN_FRAC and min_lr clamping | Fraction-based warmdown with minimum LR floor |

Summary

  • The learning rate warmdown schedule in parameter-golf applies a linear decay multiplier to optimizer learning rates during the final phase of training.
  • The lr_mul function in train_gpt.py implements the core logic, supporting both iteration-based and wall-clock-based warmdown strategies.
  • Iteration-based warmdown decays over a fixed number of final steps (WARMDOWN_ITERS), while wall-clock-based warmdown dynamically calculates the decay window based on remaining time and average step latency.
  • Fraction-based variants (WARMDOWN_FRACTION) allow specifying the warmdown window as a percentage of total training budget, optionally clamping to a minimum learning rate.
  • The schedule is enabled by setting WARMDOWN_ITERS or related environment variables, and the multiplier is applied to each optimizer's base_lr every training step.

Frequently Asked Questions

How do I enable the learning rate warmdown schedule in parameter-golf?

Set the WARMDOWN_ITERS environment variable or pass --warmdown_iters as a command-line argument when running train_gpt.py. For example, WARMDOWN_ITERS=3000 python train_gpt.py --iterations 20000 enables a 3000-step linear decay during the final phase of training.

What is the difference between iteration-based and wall-clock-based warmdown?

Iteration-based warmdown defines the decay window as a fixed number of training steps (warmdown_iters) before the total iteration count ends. Wall-clock-based warmdown instead monitors elapsed time, estimates the average milliseconds per step, and starts decaying once the remaining wall-clock budget is no longer than warmdown_iters steps' worth of time. This suits time-constrained runs where the final step count is not known in advance.

Can I use a fractional warmdown window instead of fixed iterations?

Yes. Experimental scripts like train_gpt_cuda_binary.py support the WARMDOWN_FRACTION or WARMDOWN_FRAC environment variables. These interpret the warmdown window as a fraction of the total training budget (e.g., WARMDOWN_FRACTION=0.30 warms down over the final 30% of steps), often with an optional min_lr floor to prevent the learning rate from dropping too low.
