# Optimal Hyperparameters for Training Under 10 Minutes: The Complete Parameter Golf Record

> Discover optimal hyperparameters for training under 10 minutes. Achieve 1.0810 bits-per-byte using an 11-layer transformer, MuonEq-R, and int6 quantization on 8xH100 GPUs.

- Repository: [openai/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: deep-dive
- Published: 2026-04-17

---

**The optimal hyperparameters for training under 10 minutes on 8×H100 GPUs combine an 11-layer transformer with 3-layer depth recurrence, MuonEq-R optimizer, and aggressive int6 quantization, achieving 1.0810 bits-per-byte in approximately 588 seconds.**

The `openai/parameter-golf` repository tracks extreme optimization challenges for training small language models under strict time and size constraints. The current 10-minute record leverages architectural innovations and training dynamics specifically tuned for the 8×H100 SXM hardware configuration.

## Record-Breaking Architecture Configuration

The winning submission uses the **SP8192** configuration with structural modifications that maximize parameter efficiency without exceeding the 16 MB artifact limit.

### Model Dimensions and Tokenization

The architecture employs an 8192-token vocabulary with compact hidden dimensions:

- **512** dimensional model (`d_model=512`)
- **11** physical transformer layers
- **8** attention heads with **4** KV heads (GQA)
- **MLP ratio 4×** (`d_ff=2048`)
- **LeakyReLU²** activation
- **Partial RoPE** (16/64 dimensions)

These settings are defined in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) within the record submission directory.
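The dimensions above can be collected into a small configuration object, which makes the derived quantities (head size, MLP width, GQA sharing) explicit. This is a minimal sketch: the names `ModelConfig` and `apply_partial_rope` are hypothetical and not taken from `train_gpt.py`, and the half-split pairing of rotated dimensions and the RoPE base of 10000 are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the dimensions listed in the article."""
    vocab_size: int = 8192
    d_model: int = 512
    n_layers: int = 11
    n_heads: int = 8
    n_kv_heads: int = 4
    mlp_ratio: int = 4
    rope_dims: int = 16

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads  # 512 / 8 = 64

    @property
    def d_ff(self) -> int:
        return self.mlp_ratio * self.d_model  # 4 * 512 = 2048

    @property
    def queries_per_kv_head(self) -> int:
        return self.n_heads // self.n_kv_heads  # GQA: 2 query heads per KV head


def apply_partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` entries of one head vector.

    Assumption: dimension i is paired with dimension i + rope_dims // 2.
    The remaining entries carry no positional signal and pass through unchanged.
    """
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


cfg = ModelConfig()
head = [float(i + 1) for i in range(cfg.head_dim)]
rotated = apply_partial_rope(head, position=7, rope_dims=cfg.rope_dims)
```

With a 64-dimensional head and 16 rotated dimensions, 48 dimensions per head are left position-independent in this sketch.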

### Depth Recurrence and Parallel Residuals

The model achieves **17 virtual layers** from 11 physical layers through depth recurrence:

- **3-layer recurrence** applied to layers 3-5, activated at fractional step **0.35** (that is, 35% of the way through training)
- **Parallel residuals** enabled from layer 7 onward, allowing attention and MLP blocks to share the same pre-residual input

These modifications add effective depth without adding parameters, which matters under both the 16 MB artifact limit and the 10-minute window.
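To make the layer arithmetic concrete, the sketch below builds one possible execution order and contrasts sequential and parallel residual blocks. It is an illustration under assumptions: layers are taken to be 0-indexed, the two extra passes over layers 3-5 are inferred from 17 = 11 + 3 × 2, the exact interleaving used by the record may differ, and normalization is omitted. All function names are hypothetical.

```python
def virtual_layer_order(n_layers=11, loop_start=3, loop_end=5, extra_passes=2):
    """Return the order in which physical layers are applied.

    Assumption: the looped span is simply replayed `extra_passes` times
    right after its first pass.
    """
    order = []
    for layer in range(n_layers):
        order.append(layer)
        if layer == loop_end:
            for _ in range(extra_passes):
                order.extend(range(loop_start, loop_end + 1))
    return order


def sequential_block(x, attn, mlp):
    """Standard residual block: the MLP sees the output of attention."""
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    """Parallel residual block: attention and MLP read the same input."""
    return x + attn(x) + mlp(x)


order = virtual_layer_order()
print(len(order), order)
```

In the parallel form the two branches have no data dependency on each other, which is the property the bullet on parallel residuals describes.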

## Optimizer and Training Dynamics

The record relies on a hybrid optimizer strategy and aggressive learning rate scheduling to converge within 4550 steps.

### MuonEq-R and AdamW Hybrid

The **MuonEq-R** optimizer handles most parameters, featuring:

- Row-normalized Muon with **5 Newton-Schulz steps**
- Full µTransfer compatibility for hyperparameter stability

**AdamW** is reserved for the embedding and scalar parameters, providing stable updates for high-dimensional sparse gradients.
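The split between the two optimizers can be pictured as a routing rule over named parameters. The sketch below is illustrative only: the rule (embeddings and anything with fewer than two dimensions go to AdamW, other matrices go to Muon) and the parameter names are assumptions, not the logic in `train_gpt.py`.

```python
def assign_optimizer(name: str, shape: tuple) -> str:
    """Route a parameter to 'adamw' or 'muon' (hypothetical rule)."""
    if "embed" in name or len(shape) < 2:
        return "adamw"  # embeddings, gains, biases, other scalars and vectors
    return "muon"       # 2-D weight matrices in attention and MLP blocks


# Hypothetical parameter names and shapes, for illustration only.
params = {
    "tok_embed.weight": (8192, 512),
    "blocks.0.attn.q_proj.weight": (512, 512),
    "blocks.0.mlp.up_proj.weight": (2048, 512),
    "blocks.0.attn.qk_gain": (8,),
}

groups = {"muon": [], "adamw": []}
for name, shape in params.items():
    groups[assign_optimizer(name, shape)].append(name)
```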

### Learning Rate Schedule

The schedule maximizes early progress while ensuring late-stage convergence:

- **Base learning rate**: 0.005 (for TTT components)
- **Cosine decay** with linear warm-down to 0 over the final **72%** of training
- **Training steps**: **4550** (approximately 588 seconds on 8×H100)
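One way to read the warm-down setting is as a multiplier on the peak learning rate that stays flat and then falls linearly to zero over the last 72% of steps. The sketch below models only that linear phase; how it combines with the cosine component is not specified here, so treat the shape as an assumption. The function name is hypothetical.

```python
def warmdown_scale(step: int, total_steps: int = 4550, warmdown_frac: float = 0.72) -> float:
    """Multiplier applied to the peak learning rate at a given step."""
    warmdown_steps = warmdown_frac * total_steps
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0  # flat phase before the warm-down begins
    return max(remaining / warmdown_steps, 0.0)


# Average wall-clock budget per step implied by the article's figures.
seconds_per_step = 588 / 4550

for step in (0, 1000, 2912, 4550):
    print(step, round(warmdown_scale(step), 4))
```

At roughly 0.13 seconds per step, the flat phase in this reading lasts only about the first 28% of the run.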

### Regularization Parameters

Critical numerical settings from the record's configuration:

- **Weight decay**: **0.095**
- **Max learning rate (Adam)**: **0.022**
- **EMA decay**: **0.9965**
- **Warm-down factor**: **0.72**
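The EMA decay determines how many recent steps the averaged weights effectively summarize. A minimal sketch of the standard exponential-moving-average update follows, using plain floats in place of tensors; the function names are hypothetical and the record's exact update schedule is not shown in this article.

```python
def ema_update(ema: dict, weights: dict, decay: float = 0.9965) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return {k: decay * ema[k] + (1.0 - decay) * weights[k] for k in weights}


def ema_horizon(decay: float = 0.9965) -> float:
    """Approximate number of steps the average looks back over."""
    return 1.0 / (1.0 - decay)


ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0})
print(ema["w"], ema_horizon())
```

With a decay of 0.9965 the horizon is about 286 steps, a small fraction of the 4550-step run.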

## Test-Time Training (TTT) Configuration

The "Legal Score-First" TTT strategy provides post-training adaptation without violating causality constraints.

The configuration uses:

- **3 epochs** per 32K-token chunk
- **SGD** with learning rate **0.005** and momentum **0.9**
- **Cosine decay** schedule
- **Gradient clipping** at **1.0**

This TTT approach is specifically designated as "legal" within the parameter-golf rules, allowing score improvements without additional training time penalties.
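The "score-first" ordering is the key property: each chunk is evaluated with weights that have only been adapted on earlier chunks, and only afterwards used for updates. The toy sketch below shows that control flow with a one-parameter model and a squared-error loss standing in for the language model. The per-chunk granularity of the cosine decay, clipping of the scalar gradient, and all names are assumptions for illustration.

```python
import math


def chunk_loss_and_grad(w, chunk):
    """Toy objective: mean squared error between w and each value in the chunk."""
    loss = sum((w - x) ** 2 for x in chunk) / len(chunk)
    grad = sum(2.0 * (w - x) for x in chunk) / len(chunk)
    return loss, grad


def score_first_ttt(w, chunks, epochs=3, base_lr=0.005, momentum=0.9, clip=1.0):
    scores, velocity = [], 0.0
    for i, chunk in enumerate(chunks):
        # 1) Score the chunk BEFORE adapting on it, so no future data leaks in.
        loss, _ = chunk_loss_and_grad(w, chunk)
        scores.append(loss)
        # 2) Then adapt on the chunk: SGD with momentum, clipping, cosine-decayed LR.
        lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * i / len(chunks)))
        for _ in range(epochs):
            _, grad = chunk_loss_and_grad(w, chunk)
            grad = max(-clip, min(clip, grad))
            velocity = momentum * velocity + grad
            w -= lr * velocity
    return scores, w


scores, adapted_w = score_first_ttt(0.0, [[1.0, 1.0], [1.0, 1.0]])
print(scores, adapted_w)
```

In the real setting a chunk would hold 32K tokens and the loss would be the model's next-token loss.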

## Quantization and Compression Strategy

The record achieves the 16 MB size limit through aggressive quantization and post-training compression.

### GPTQ with SDClip

The quantization pipeline employs:

- **Full-Hessian GPTQ** with **SDClip** protection
- **int6** precision for attention and MLP weights
- **int8** precision for embedding tables
- **Byte-shuffle** preprocessing and **Brotli-11** compression
- **Zero-selective pruning** to remove redundant parameters
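The clipping-plus-int6 idea can be illustrated with plain round-to-nearest quantization. This sketch is deliberately not GPTQ: there is no Hessian-based error compensation. The interpretation of SDClip as clipping at a multiple of the row's standard deviation, the default multiple of 3.0, the symmetric range of ±31, and the function names are all assumptions.

```python
import statistics


def quantize_row(row, bits=6, clip_sigmas=3.0):
    """Clip a weight row at clip_sigmas standard deviations, then round to signed ints."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    clip = clip_sigmas * statistics.pstdev(row)
    if clip == 0.0:
        clip = max(abs(v) for v in row) or 1.0  # constant or all-zero row
    scale = clip / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    return [c * scale for c in codes]


row = [0.1, -0.2, 0.05, 0.3, -0.15, 2.0]  # last value is an outlier
codes, scale = quantize_row(row, clip_sigmas=2.0)
restored = dequantize_row(codes, scale)
print(codes, scale)
```

Clipping the outlier saturates it at the largest code but keeps the step size small for the bulk of the row, which is the trade-off clipping schemes of this kind make.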

### LZMA Compression Wrapper

A final **LZMA** code wrapper (~16 KB) reduces the artifact by approximately **43 KB**, bringing the final submission to about 15.99 MB, just under the 16 MB limit.
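A code wrapper of this kind can be sketched as a tiny stub that decompresses and executes the real source. The version below is an assumption about the general shape, not the record's actual wrapper: it uses the standard-library `lzma` and `base64` modules, and `make_wrapper` is a hypothetical name.

```python
import base64
import lzma


def make_wrapper(source: str) -> str:
    """Return a small Python stub that decompresses and runs `source`."""
    packed = lzma.compress(source.encode("utf-8"), preset=9)
    payload = base64.b85encode(packed).decode("ascii")
    # The Base85 alphabet contains no quote or backslash characters,
    # so the payload can sit inside a double-quoted string literal.
    stub = 'import base64, lzma\nexec(lzma.decompress(base64.b85decode("{}")).decode("utf-8"))\n'
    return stub.format(payload)


original = "x = 6 * 7\n" + "# padding comment line\n" * 200
wrapper = make_wrapper(original)
print(len(original), len(wrapper))

namespace = {}
exec(wrapper, namespace)  # running the stub behaves like running the original
```

The saving depends on how compressible the source is; repetitive code shrinks far more than the fixed overhead of the stub.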

## Reproducing the 10-Minute Run

Execute the following commands to replicate the record on 8×H100 GPUs:

```bash
# Install required dependencies
pip install brotli sentencepiece \
    flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Download SP8192 tokenizer and FineWeb shards
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

# Launch training with optimal hyperparameters
SEED=42 \
QK_GAIN_INIT=5.25 \
TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

The [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) script in the record directory contains the default values for weight decay (0.095), MLR (0.022), EMA decay (0.9965), and architectural configurations.
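Because the launch command passes its settings as environment variables, a script can pick them up with simple typed lookups. The sketch below shows one way to do that; `RunConfig` and `load_run_config` are hypothetical names, and the fallback values simply mirror the launch command above rather than the script's real defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    seed: int
    qk_gain_init: float
    ttt_enabled: bool
    ttt_lr: float
    ttt_epochs: int


def load_run_config(env=None) -> RunConfig:
    """Read run settings from environment variables (fallbacks are illustrative)."""
    env = os.environ if env is None else env
    return RunConfig(
        seed=int(env.get("SEED", "42")),
        qk_gain_init=float(env.get("QK_GAIN_INIT", "5.25")),
        ttt_enabled=env.get("TTT_ENABLED", "1") == "1",
        ttt_lr=float(env.get("TTT_LR", "0.005")),
        ttt_epochs=int(env.get("TTT_EPOCHS", "3")),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
config = load_run_config({"SEED": "7", "TTT_ENABLED": "0"})
print(config)
```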

## Summary

- **Architecture**: 11-layer transformer with 3-layer depth recurrence and parallel residuals yields 17 virtual layers within 16 MB.
- **Optimization**: MuonEq-R hybrid with AdamW, base LR 0.005, and 72% warm-down achieves convergence in 4550 steps (~588 seconds).
- **Regularization**: Weight decay 0.095, EMA decay 0.9965, and gradient clipping stabilize the aggressive training schedule.
- **Post-Processing**: Legal score-first TTT (3 epochs, SGD 0.005) and int6/int8 GPTQ quantization with LZMA compression fit the model into 15.99 MB while achieving 1.0810 BPB.

## Frequently Asked Questions

### What hardware is required to reproduce the 10-minute training record?

The record requires **8× NVIDIA H100 SXM GPUs** with high-bandwidth interconnects. The 4550 training steps complete in approximately 588 seconds on this configuration, fitting within the strict 10-minute window. While other hardware may run the code, achieving the same wall-clock time requires equivalent computational throughput.

### Why does the record use MuonEq-R instead of standard AdamW?

**MuonEq-R** (row-normalized Muon with 5 Newton-Schulz steps) provides faster convergence on low-precision weights compared to standard AdamW. The hybrid approach uses MuonEq-R for most parameters while reserving AdamW specifically for embeddings and scalars, ensuring stable updates for high-dimensional sparse gradients that Muon handles poorly.

### How does depth recurrence improve model capacity without adding parameters?

The **3-layer recurrence** reuses the weights of layers 3-5 on additional passes, switching on at fractional step 0.35 and effectively creating **17 virtual layers** from 11 physical layers. This technique adds effective depth without increasing the parameter count, so the model gains capacity while still fitting the parameter-golf 16 MB artifact limit.

### What is the purpose of the QK-Gain initialization at 5.25?

The **QK-Gain** (learnable per-head query scaling) initialized at **5.25** scales attention logits to compensate for the partial RoPE configuration (16/64 dimensions) and the depth recurrence structure. This specific initialization value prevents attention entropy collapse during the early stages of the aggressive training schedule, ensuring stable gradient flow through the 17 effective layers.
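The mechanism of a query gain can be shown in isolation: multiplying the query by a scalar multiplies every attention logit for that head, which changes how peaked the softmax is. The sketch below is a toy, single-head illustration; the RMS normalization of queries and keys and the division by the head dimension are assumptions made to keep logits in a readable range, and the function names are hypothetical.

```python
import math


def rms_normalize(vec):
    rms = math.sqrt(sum(v * v for v in vec) / len(vec))
    return [v / rms for v in vec]


def attention_weights(query, keys, gain):
    """Softmax over gain-scaled similarities between one query and several keys."""
    q_hat = [gain * v for v in rms_normalize(query)]
    dim = len(query)
    logits = [sum(a * b for a, b in zip(q_hat, rms_normalize(k))) / dim for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


query = [1.0, 0.0, 0.0, 0.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
low_gain = attention_weights(query, keys, gain=1.0)
high_gain = attention_weights(query, keys, gain=5.25)
print(entropy(low_gain), entropy(high_gain))
```

In this toy setup the larger gain yields a sharper, lower-entropy distribution. How that initial sharpness interacts with training stability is not something the sketch demonstrates.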