deep-dive

Optimal Hyperparameters for Training Under 10 Minutes: The Complete Parameter Golf Record

April 17, 2026 openai/parameter-golf ↗

The optimal hyperparameters for training under 10 minutes on 8×H100 GPUs combine an 11-layer transformer with 3-layer depth recurrence, MuonEq-R optimizer, and aggressive int6 quantization, achieving 1.0810 bits-per-byte in approximately 588 seconds.

The openai/parameter-golf repository tracks extreme optimization challenges for training small language models under strict time and size constraints. The current 10-minute record leverages architectural innovations and training dynamics specifically tuned for the 8×H100 SXM hardware configuration.

Record-Breaking Architecture Configuration

The winning submission uses the SP8192 configuration with structural modifications that maximize parameter efficiency without exceeding the 16 MB artifact limit.

Model Dimensions and Tokenization

The architecture employs an 8192-token vocabulary with compact hidden dimensions:

512 dimensional model (d_model=512)
11 physical transformer layers
8 attention heads with 4 KV heads (GQA)
MLP ratio 4× (d_ff=2048)
LeakyReLU² activation
Partial RoPE (16/64 dimensions)

These settings are defined in train_gpt.py within the record submission directory.

Depth Recurrence and Parallel Residuals

The model achieves 17 virtual layers from 11 physical layers through depth recurrence:

3-layer recurrence applied to layers 3-5, activated at fractional step 0.35
Parallel residuals enabled from layer 7 onward, allowing attention and MLP blocks to share the same pre-residual input

These modifications reduce computational overhead while maintaining model capacity critical for the 10-minute window.

Optimizer and Training Dynamics

The record relies on a hybrid optimizer strategy and aggressive learning rate scheduling to converge within 4550 steps.

MuonEq-R and AdamW Hybrid

The MuonEq-R optimizer handles most parameters, featuring:

Row-normalised Muon with 5 Newton-Schulz steps
Full µTransfer compatibility for hyperparameter stability

AdamW optimizes embedding and scalar parameters exclusively, providing stable updates for high-dimensional sparse gradients.

Learning Rate Schedule

The schedule maximizes early progress while ensuring late-stage convergence:

Base learning rate: 0.005 (for TTT components)
Cosine decay with linear warm-down to 0 over the final 72% of training
Training steps: 4550 (approximately 588 seconds on 8×H100)

Regularization Parameters

Critical numerical settings from the record's configuration:

Weight decay: 0.095
Max learning rate (Adam): 0.022
EMA decay: 0.9965
Warm-down factor: 0.72

Test-Time Training (TTT) Configuration

The "Legal Score-First" TTT strategy provides post-training adaptation without violating causality constraints.

The configuration uses:

3 epochs per 32K-token chunk
SGD with learning rate 0.005 and momentum 0.9
Cosine decay schedule
Gradient clipping at 1.0

This TTT approach is specifically designated as "legal" within the parameter-golf rules, allowing score improvements without additional training time penalties.

Quantization and Compression Strategy

The record achieves the 16 MB size limit through aggressive quantization and post-training compression.

GPTQ with SDClip

The quantization pipeline employs:

Full-Hessian GPTQ with SDClip protection
int6 precision for attention and MLP weights
int8 precision for embedding tables
Byte-shuffle preprocessing and Brotli-11 compression
Zero-selective pruning to remove redundant parameters

LZMA Compression Wrapper

A final LZMA code wrapper (~16 KB) reduces the artifact by approximately 43 KB, ensuring the final submission fits within the strict 15.99 MB limit.

Reproducing the 10-Minute Run

Execute the following commands to replicate the record on 8×H100 GPUs:


# Install required dependencies

pip install brotli sentencepiece \
    flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Download SP8192 tokenizer and FineWeb shards

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

# Launch training with optimal hyperparameters

SEED=42 \
QK_GAIN_INIT=5.25 \
TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

The train_gpt.py script in the record directory contains the default values for weight decay (0.095), MLR (0.022), EMA decay (0.9965), and architectural configurations.

Summary

Architecture: 11-layer transformer with 3-layer depth recurrence and parallel residuals yields 17 virtual layers within 16 MB.
Optimization: MuonEq-R hybrid with AdamW, base LR 0.005, and 72% warm-down achieves convergence in 4550 steps (~588 seconds).
Regularization: Weight decay 0.095, EMA decay 0.9965, and gradient clipping stabilize the aggressive training schedule.
Post-Processing: Legal score-first TTT (3 epochs, SGD 0.005) and int6/int8 GPTQ quantization with LZMA compression fit the model into 15.99 MB while achieving 1.0810 BPP.

Frequently Asked Questions

What hardware is required to reproduce the 10-minute training record?

The record requires 8× NVIDIA H100 SXM GPUs with high-bandwidth interconnects. The 4550 training steps complete in approximately 588 seconds on this configuration, fitting within the strict 10-minute window. While other hardware may run the code, achieving the same wall-clock time requires equivalent computational throughput.

Why does the record use MuonEq-R instead of standard AdamW?

MuonEq-R (row-normalized Muon with 5 Newton-Schulz steps) provides faster convergence on low-precision weights compared to standard AdamW. The hybrid approach uses MuonEq-R for most parameters while reserving AdamW specifically for embeddings and scalars, ensuring stable updates for high-dimensional sparse gradients that Muon handles poorly.

How does depth recurrence improve model capacity without adding parameters?

The 3-layer recurrence (layers 3-5) reuses physical layer weights at fractional step 0.35, effectively creating 17 virtual layers from 11 physical layers. This technique multiplies effective depth without increasing parameter count or memory usage, satisfying the parameter-golf constraint of fitting within 16 MB while maintaining architectural capacity.

What is the purpose of the QK-Gain initialization at 5.25?

The QK-Gain (learnable per-head query scaling) initialized at 5.25 scales attention logits to compensate for the partial RoPE configuration (16/64 dimensions) and the depth recurrence structure. This specific initialization value prevents attention entropy collapse during the early stages of the aggressive training schedule, ensuring stable gradient flow through the 17 effective layers.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:

curl -s "https://instagit.com/install.md"

Add to your MCP client configuration:

{
  "mcpServers": {
    "instagit": {
      "command": "npx",
      "args": ["-y", "instagit@latest"]
    }
  }
}

Ask your agent:

"Use Instagit MCP to understand how openai/parameter-golf works."

Works with

Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →