# How the Learning Rate Warmdown Schedule Works in OpenAI Parameter Golf

> Learn how the learning rate warmdown schedule in OpenAI parameter golf stabilizes convergence with linear decay, and how its iteration-based, wall-clock-based, and fraction-based modes work.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: deep-dive
- Published: 2026-04-17

---

**The learning rate warmdown schedule in OpenAI's parameter-golf repository computes a linear decay multiplier that ramps down from 1.0 toward 0.0 over the final training steps, the final stretch of the wall-clock budget, or a configurable fraction of the run, in order to stabilize convergence.**

The parameter-golf repository schedules the learning rate differently from the common cosine-annealing approach. Instead of decaying throughout training, it holds the learning rate constant until the final phase, then runs a **learning rate warmdown schedule** that linearly scales each optimizer's learning rate down toward zero. This lets the model settle during the final convergence phase without sacrificing early training speed.

## Core Implementation in train_gpt.py

The canonical implementation resides in **[`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py)**, where the `lr_mul` function calculates the per-step learning rate multiplier. This function implements dual-mode logic that switches between iteration-based and wall-clock-based warmdown depending on whether a maximum runtime is specified.

```python
# `args` and `max_wallclock_ms` are free variables taken from the enclosing script.
def lr_mul(step: int, elapsed_ms: float) -> float:
    # No warmdown requested
    if args.warmdown_iters <= 0:
        return 1.0

    # -----------------------------------------------------------------
    # 1️⃣ Iteration-based warmdown (no wall-clock budget)
    # -----------------------------------------------------------------
    if max_wallclock_ms is None:
        warmdown_start = max(args.iterations - args.warmdown_iters, 0)
        # Linear decay from 1 → 0 over the final `warmdown_iters` steps
        if warmdown_start <= step < args.iterations:
            return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0)
        return 1.0

    # -----------------------------------------------------------------
    # 2️⃣ Wall-clock-based warmdown (budget supplied)
    # -----------------------------------------------------------------
    step_ms = elapsed_ms / max(step, 1)           # avg ms per step so far
    warmdown_ms = args.warmdown_iters * step_ms   # duration of warmdown window
    remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)

    # If we have entered the warmdown window, decay proportionally
    if remaining_ms <= warmdown_ms:
        return remaining_ms / max(warmdown_ms, 1e-9)
    return 1.0
```

After computing the multiplier, the training loop applies it to every optimizer's parameter groups:

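```python
# `scale` is the multiplier returned by lr_mul for the current step
scale = lr_mul(step, elapsed_ms)
for opt in optimizers:
    for group in opt.param_groups:
        group["lr"] = group["base_lr"] * scale
```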
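The loop above depends on every parameter group carrying a `base_lr` entry next to `lr`. PyTorch optimizers do not add that key on their own, so the training script presumably stores it when the groups are built. The sketch below illustrates why the pattern is useful, using a hypothetical `FakeOptimizer` stand-in (not part of the repository) so that it runs without PyTorch: because the scaled value is always derived from `base_lr`, applying the multiplier every step never compounds.

```python
class FakeOptimizer:
    """Hypothetical stand-in that only exposes `param_groups`."""

    def __init__(self, base_lrs):
        # Store the unscaled rate once; `lr` is what the optimizer would use.
        self.param_groups = [{"lr": lr, "base_lr": lr} for lr in base_lrs]


def apply_lr_scale(optimizers, scale):
    """Set each group's lr to its base_lr times the warmdown multiplier."""
    for opt in optimizers:
        for group in opt.param_groups:
            group["lr"] = group["base_lr"] * scale


optimizers = [FakeOptimizer([0.04, 0.008]), FakeOptimizer([0.6])]

apply_lr_scale(optimizers, 0.5)
apply_lr_scale(optimizers, 0.5)  # applying again does not halve the rate twice
scaled = [g["lr"] for opt in optimizers for g in opt.param_groups]

apply_lr_scale(optimizers, 1.0)  # a multiplier of 1.0 restores the base rates
restored = [g["lr"] for opt in optimizers for g in opt.param_groups]
```

Overwriting `lr` from a stored base value, rather than multiplying `lr` in place, is what makes it safe to call the update on every step.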

## Three Learning Rate Warmdown Strategies

The repository supports three distinct flavors of the **learning rate warmdown schedule**, selected via configuration flags or environment variables.

### Iteration-Based Learning Rate Warmdown

The default strategy triggers a linear decay over the final `warmdown_iters` training steps. When `args.max_wallclock_seconds` is omitted, the system calculates the warmdown start point as `max(args.iterations - args.warmdown_iters, 0)`. Inside this window the multiplier is `(iterations - step) / warmdown_iters`, so it falls linearly from 1.0 at the start of the window to `1 / warmdown_iters` on the last step (`step = iterations - 1`), and the optimizer takes progressively smaller steps as training concludes.
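As a rough illustration, the iteration-based branch can be lifted into a standalone function and evaluated at a few steps. The function name `iteration_warmdown` and its explicit parameters are assumptions made for this sketch; in the repository the values come from `args`.

```python
def iteration_warmdown(step: int, iterations: int, warmdown_iters: int) -> float:
    """Standalone copy of the iteration-based branch of lr_mul."""
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = max(iterations - warmdown_iters, 0)
    if warmdown_start <= step < iterations:
        return max((iterations - step) / max(warmdown_iters, 1), 0.0)
    return 1.0


iterations, warmdown_iters = 20_000, 3_000

before_window = iteration_warmdown(16_999, iterations, warmdown_iters)  # constant phase
window_start = iteration_warmdown(17_000, iterations, warmdown_iters)   # 3000 / 3000
midpoint = iteration_warmdown(18_500, iterations, warmdown_iters)       # 1500 / 3000
last_step = iteration_warmdown(19_999, iterations, warmdown_iters)      # 1 / 3000
```

With these numbers the multiplier stays at 1.0 through step 17,000, reaches 0.5 at step 18,500, and ends at 1/3000 on the last step.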

### Wall-Clock-Based Learning Rate Warmdown

When training under a strict time budget using `--max_wallclock_seconds`, the **learning rate warmdown schedule** adapts dynamically. The system estimates the average time per step so far (`step_ms`) and calculates the warmdown window duration as `warmdown_iters * step_ms`. Because `step_ms` is a running average, the window is re-estimated on every step. Once the remaining wall-clock time is at or below this window, the multiplier becomes `remaining_ms / warmdown_ms`, decaying linearly to zero based on the remaining milliseconds rather than the remaining steps.
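The wall-clock branch can be sketched the same way. The function name `wallclock_warmdown`, its explicit parameters, and the steady 100 ms step time are assumptions chosen for illustration; they are not measurements from the repository.

```python
def wallclock_warmdown(step: int, elapsed_ms: float,
                       max_wallclock_ms: float, warmdown_iters: int) -> float:
    """Standalone copy of the wall-clock branch of lr_mul."""
    if warmdown_iters <= 0:
        return 1.0
    step_ms = elapsed_ms / max(step, 1)          # running average step time
    warmdown_ms = warmdown_iters * step_ms       # estimated window length
    remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0)
    if remaining_ms <= warmdown_ms:
        return remaining_ms / max(warmdown_ms, 1e-9)
    return 1.0


budget_ms = 600_000.0      # 10-minute budget
warmdown_iters = 1_200
step_time_ms = 100.0       # assumed constant step latency

def at(step: int) -> float:
    return wallclock_warmdown(step, step * step_time_ms, budget_ms, warmdown_iters)

early = at(4_000)      # 200 s remaining, 120 s window: not yet decaying
halfway = at(5_400)    # 60 s remaining of a 120 s window
deadline = at(6_000)   # budget exhausted
```

Under these assumptions the multiplier is 1.0 at step 4,000, 0.5 at step 5,400, and 0.0 when the budget runs out at step 6,000.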

### Fraction-Based Learning Rate Warmdown

Advanced experiment scripts such as **[`train_gpt_cuda_binary.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_cuda_binary.py)** and **[`train_gpt_decode.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_decode.py)** support fraction-based configuration via the `WARMDOWN_FRACTION` or `WARMDOWN_FRAC` environment variables. This approach computes the warmdown start as a proportion of the total training budget:

```python
warmdown_start = int(args.iterations * (1.0 - args.warmdown_fraction))
```

This variant may also clamp to a configurable `min_lr` to prevent the learning rate from dropping below a functional threshold.
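One plausible shape for a fraction-based schedule with a floor is sketched below. Only the `warmdown_start` formula comes from the text above. The function name `fraction_warmdown`, the linear ramp inside the window, and the way the floor is expressed (as a minimum multiplier, `min_lr_mul`, rather than an absolute learning rate) are assumptions, and the experiment scripts may implement the clamp differently.

```python
def fraction_warmdown(step: int, iterations: int,
                      warmdown_fraction: float, min_lr_mul: float = 0.0) -> float:
    """Hypothetical fraction-based warmdown with a floor on the multiplier."""
    warmdown_start = int(iterations * (1.0 - warmdown_fraction))
    if step < warmdown_start:
        return 1.0
    # Assumed: linear ramp over the steps between warmdown_start and iterations.
    window = max(iterations - warmdown_start, 1)
    linear = max((iterations - step) / window, 0.0)
    # Assumed: the floor is a lower bound on the multiplier itself.
    return max(linear, min_lr_mul)


iterations = 20_000
constant_phase = fraction_warmdown(14_999, iterations, 0.25)
window_start = fraction_warmdown(15_000, iterations, 0.25)
midpoint = fraction_warmdown(17_500, iterations, 0.25)
floored = fraction_warmdown(19_999, iterations, 0.25, min_lr_mul=0.1)
```

Expressing the window as a fraction keeps the schedule proportional when `iterations` changes, whereas a fixed `warmdown_iters` has to be retuned for each run length.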

## Configuring the Learning Rate Warmdown Schedule

Enable and customize the **learning rate warmdown schedule** using environment variables and command-line flags.

**Basic iteration-based configuration:**

```bash
# Warmdown over the final 3000 steps of a 20k-step run
WARMDOWN_ITERS=3000 python train_gpt.py \
    --iterations 20000 \
    --train_batch_tokens 524288 \
    --train_seq_len 1024
```

**Wall-clock-based configuration:**

```bash
# 10-minute training cap. Warmdown starts about 1200 x (average step time)
# before the deadline, e.g. roughly 2 minutes at ~100 ms per step.
MAX_WALLCLOCK_SECONDS=600 WARMDOWN_ITERS=1200 python train_gpt.py \
    --iterations 20000 \
    --train_batch_tokens 524288 \
    --train_seq_len 1024
```
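Because the wall-clock window is `warmdown_iters` times the average step time, the same `WARMDOWN_ITERS` value yields different warmdown durations on different hardware. The helper below is a small back-of-the-envelope sketch for choosing a value. The function name and the step latencies are hypothetical, and it assumes a constant step time.

```python
def warmdown_start_seconds(max_wallclock_seconds: float,
                           warmdown_iters: int, avg_step_ms: float) -> float:
    """Estimate when (in seconds from the start) the decay would begin."""
    window_seconds = warmdown_iters * avg_step_ms / 1000.0
    # If the window exceeds the budget, decay is active from the first step.
    return max(max_wallclock_seconds - window_seconds, 0.0)


start_at_100ms = warmdown_start_seconds(600, 1_200, 100.0)  # 120 s window
start_at_50ms = warmdown_start_seconds(600, 1_200, 50.0)    # 60 s window
start_at_600ms = warmdown_start_seconds(600, 1_200, 600.0)  # 720 s window > budget
```

At 100 ms per step the decay would begin around the 480-second mark, at 50 ms per step around 540 seconds, and at 600 ms per step the window is longer than the whole budget, so the multiplier is below 1.0 for the entire run.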

**Fraction-based configuration:**

```bash
# Warmdown over the final 30% of training steps
WARMDOWN_FRACTION=0.30 python train_gpt_cuda_binary.py \
    --iterations 20000 \
    --train_batch_tokens 524288
```

## Key Source Files and Functions

| File | Description | Key Function |
|------|-------------|--------------|
| [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) | Main training entry point containing the canonical `lr_mul` implementation | `lr_mul(step: int, elapsed_ms: float) -> float` |
| [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) | Apple Silicon (MLX) variant mirroring the warmdown logic | `lr_mul` (equivalent implementation) |
| [`records/track_non_record_16mb/.../train_gpt_cuda_binary.py`](https://github.com/openai/parameter-golf/blob/main/records/track_non_record_16mb/.../train_gpt_cuda_binary.py) | Binary training experiments with fraction-based warmdown | Fraction calculation using `WARMDOWN_FRACTION` |
| [`records/track_10min_16mb/.../train_gpt_decode.py`](https://github.com/openai/parameter-golf/blob/main/records/track_10min_16mb/.../train_gpt_decode.py) | Decoding experiments with `WARMDOWN_FRAC` and `min_lr` clamping | Fraction-based warmdown with minimum LR floor |

## Summary

- The **learning rate warmdown schedule** in parameter-golf applies a linear decay multiplier to optimizer learning rates during the final phase of training.
- The `lr_mul` function in **[`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py)** implements the core logic, supporting both iteration-based and wall-clock-based warmdown strategies.
- **Iteration-based warmdown** decays over a fixed number of final steps (`WARMDOWN_ITERS`), while **wall-clock-based warmdown** dynamically calculates the decay window based on remaining time and average step latency.
- **Fraction-based variants** (`WARMDOWN_FRACTION`) allow specifying the warmdown window as a percentage of total training budget, optionally clamping to a minimum learning rate.
- The schedule is enabled by setting `WARMDOWN_ITERS` or related environment variables, and on every training step the multiplier scales the `base_lr` of each parameter group in each optimizer.

## Frequently Asked Questions

### How do I enable the learning rate warmdown schedule in parameter-golf?

Set the `WARMDOWN_ITERS` environment variable or pass `--warmdown_iters` as a command-line argument when running [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py). For example, `WARMDOWN_ITERS=3000 python train_gpt.py --iterations 20000` enables a 3000-step linear decay during the final phase of training.

### What is the difference between iteration-based and wall-clock-based warmdown?

**Iteration-based warmdown** defines the decay window as a fixed number of training steps (`warmdown_iters`) before the total iteration count is reached. **Wall-clock-based warmdown** instead tracks elapsed time and estimates the average milliseconds per step, then starts decaying once the remaining wall-clock budget fits inside the estimated window, which makes it well suited to time-constrained training runs.

### Can I use a fractional warmdown window instead of fixed iterations?

Yes. Experimental scripts like [`train_gpt_cuda_binary.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_cuda_binary.py) support the `WARMDOWN_FRACTION` or `WARMDOWN_FRAC` environment variables. These interpret the warmdown window as a fraction of the total training budget (e.g., `WARMDOWN_FRACTION=0.30` warms down over the final 30% of steps), often with an optional `min_lr` floor to prevent the learning rate from dropping too low.