Optimal Hyperparameters for Training Under 10 Minutes: The Complete Parameter Golf Record
The optimal hyperparameters for training under 10 minutes on 8×H100 GPUs combine an 11-layer transformer with 3-layer depth recurrence, MuonEq-R optimizer, and aggressive int6 quantization, achieving 1.0810 bits-per-byte in approximately 588 seconds.
The openai/parameter-golf repository tracks extreme optimization challenges for training small language models under strict time and size constraints. The current 10-minute record leverages architectural innovations and training dynamics specifically tuned for the 8×H100 SXM hardware configuration.
Record-Breaking Architecture Configuration
The winning submission uses the SP8192 configuration with structural modifications that maximize parameter efficiency without exceeding the 16 MB artifact limit.
Model Dimensions and Tokenization
The architecture employs an 8192-token vocabulary with compact hidden dimensions:
- 512-dimensional model (d_model=512)
- 11 physical transformer layers
- 8 attention heads with 4 KV heads (GQA)
- MLP ratio 4× (d_ff=2048)
- LeakyReLU² activation
- Partial RoPE (16/64 dimensions)
These settings are defined in train_gpt.py within the record submission directory.
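As a quick consistency check, the dimensions listed above can be collected in a small config object. This is a minimal sketch: the class and field names are illustrative and are not claimed to match the variable names in train_gpt.py. It shows that the per-head dimension works out to 64, which is the denominator in the "16/64" partial RoPE setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values are the ones quoted in the article; names are hypothetical.
    vocab_size: int = 8192
    d_model: int = 512
    n_layers: int = 11
    n_heads: int = 8
    n_kv_heads: int = 4
    mlp_ratio: int = 4
    rope_dims: int = 16

    @property
    def head_dim(self) -> int:
        # Per-head width: model width split evenly across attention heads.
        return self.d_model // self.n_heads

    @property
    def d_ff(self) -> int:
        # MLP hidden width implied by the 4x ratio.
        return self.mlp_ratio * self.d_model

    @property
    def queries_per_kv_head(self) -> int:
        # GQA: how many query heads share one key/value head.
        return self.n_heads // self.n_kv_heads


cfg = ModelConfig()
print(cfg.head_dim, cfg.d_ff, cfg.queries_per_kv_head)  # 64 2048 2
```

With 8 query heads and 4 KV heads, each key/value head serves two query heads, and RoPE is applied to 16 of the 64 dimensions in each head.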
Depth Recurrence and Parallel Residuals
The model achieves 17 virtual layers from 11 physical layers through depth recurrence:
- 3-layer recurrence applied to layers 3-5, activated at fractional step 0.35
- Parallel residuals enabled from layer 7 onward, allowing attention and MLP blocks to share the same pre-residual input
These modifications add effective depth without adding parameters, preserving the model capacity needed within the 10-minute window.
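The count of 17 virtual layers is consistent with the looped block running two extra times (11 + 3 × 2 = 17). The sketch below spells out one execution order that gives this count. It assumes 0-indexed layers and that the extra passes run back to back. Neither detail is confirmed by the record, and the function name is hypothetical.

```python
def virtual_layer_order(n_layers=11, loop_start=3, loop_end=5, extra_passes=2):
    """Return the sequence of physical layer indices executed in one forward pass.

    Assumption: the block loop_start..loop_end is re-run `extra_passes` times
    immediately after its first pass.
    """
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == loop_end:
            block = list(range(loop_start, loop_end + 1))
            order.extend(block * extra_passes)
    return order


order = virtual_layer_order()
print(len(order), len(set(order)))  # 17 11
```

The pass executes 17 layer applications while only 11 distinct sets of weights exist, which is why the artifact size does not grow.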
Optimizer and Training Dynamics
The record relies on a hybrid optimizer strategy and aggressive learning rate scheduling to converge within 4550 steps.
MuonEq-R and AdamW Hybrid
The MuonEq-R optimizer handles most parameters, featuring:
- Row-normalised Muon with 5 Newton-Schulz steps
- Full µTransfer compatibility for hyperparameter stability
AdamW optimizes embedding and scalar parameters exclusively, providing stable updates for high-dimensional sparse gradients.
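To make "row-normalised Muon with 5 Newton-Schulz steps" more concrete, here is a dependency-free sketch of the orthogonalization step. The quintic coefficients are those of the public Muon reference implementation. The row normalization shown, scaling each gradient row to unit L2 norm before the iteration, is an assumption about what "row-normalised" means and may differ from the record's MuonEq-R code.

```python
import math

# Quintic coefficients from the public Muon reference implementation.
A, B, C = 3.4445, -4.7750, 2.0315


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def row_normalize(g, eps=1e-7):
    # Assumption: each row of the gradient is scaled to unit L2 norm.
    return [[v / (math.sqrt(sum(u * u for u in row)) + eps) for v in row] for row in g]


def newton_schulz(g, steps=5, eps=1e-7):
    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))    # X X^T
        gram2 = matmul(gram, gram)        # (X X^T)^2
        mix = [[B * p + C * q for p, q in zip(r1, r2)] for r1, r2 in zip(gram, gram2)]
        mx = matmul(mix, x)
        # X <- a X + (b X X^T + c (X X^T)^2) X
        x = [[A * p + q for p, q in zip(r1, r2)] for r1, r2 in zip(x, mx)]
    return x


grad = [[3.0, 0.0], [0.0, 1.0]]
update = newton_schulz(row_normalize(grad))
```

The iteration pushes the singular values of the update toward a band around 1 rather than exactly to 1, so the result is only approximately orthogonal.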
Learning Rate Schedule
The schedule maximizes early progress while ensuring late-stage convergence:
- Base learning rate: 0.005 (for TTT components)
- Cosine decay with linear warm-down to 0 over the final 72% of training
- Training steps: 4550 (approximately 588 seconds on 8×H100)
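The warm-down arithmetic can be sketched as a learning-rate multiplier. The function below models only the linear warm-down over the final 72% of steps. Holding the rate at its peak before the warm-down starts is an assumption, and the function name is hypothetical.

```python
TOTAL_STEPS = 4550
WARMDOWN_FRAC = 0.72


def lr_multiplier(step, total_steps=TOTAL_STEPS, warmdown_frac=WARMDOWN_FRAC):
    """Multiplier applied to the peak learning rate at a given step."""
    warmdown_start = round(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0  # assumption: constant at peak before the warm-down
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))


warmdown_start = round(TOTAL_STEPS * (1.0 - WARMDOWN_FRAC))
seconds_per_step = 588 / TOTAL_STEPS
print(warmdown_start, round(seconds_per_step, 3))  # 1274 0.129
```

Under these assumptions the warm-down begins at step 1274, and the quoted 588 seconds over 4550 steps corresponds to roughly 0.13 seconds per step.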
Regularization Parameters
Critical numerical settings from the record's configuration:
- Weight decay: 0.095
- Max learning rate (Adam): 0.022
- EMA decay: 0.9965
- Warm-down factor: 0.72
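For intuition about the EMA decay, the sketch below applies the standard exponential-moving-average update to a flat list of weights and computes the averaging horizon implied by 0.9965. It is a generic illustration with hypothetical names, not the record's implementation.

```python
EMA_DECAY = 0.9965


def ema_update(ema, weights, decay=EMA_DECAY):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Rough number of recent steps that dominate the average.
horizon = 1.0 / (1.0 - EMA_DECAY)

ema = ema_update([0.0, 0.0], [1.0, 2.0])
```

A decay of 0.9965 gives a horizon of about 286 steps, a small fraction of the 4550-step run, so the averaged weights track the late-stage model closely.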
Test-Time Training (TTT) Configuration
The "Legal Score-First" TTT strategy provides post-training adaptation without violating causality constraints.
The configuration uses:
- 3 epochs per 32K-token chunk
- SGD with learning rate 0.005 and momentum 0.9
- Cosine decay schedule
- Gradient clipping at 1.0
This TTT approach is specifically designated as "legal" within the parameter-golf rules, allowing score improvements without additional training time penalties.
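The "score-first" ordering can be sketched as a loop in which each chunk is scored before the model is allowed to train on it, so no token is evaluated by weights that have already seen it. In this sketch `score_fn` and `train_fn` are hypothetical callables standing in for the real evaluation and SGD step (momentum and gradient clipping would live inside `train_fn`). Applying the cosine decay per chunk is also an assumption.

```python
import math


def cosine_lr(base_lr, i, n):
    # Cosine decay from base_lr at i=0 toward 0 at i=n.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * i / max(1, n)))


def score_first_ttt(chunks, score_fn, train_fn, base_lr=0.005, epochs=3):
    """Score each chunk first, then adapt on it for `epochs` passes."""
    total = 0.0
    for i, chunk in enumerate(chunks):
        total += score_fn(chunk)      # scored before any update sees this chunk
        lr = cosine_lr(base_lr, i, len(chunks))
        for _ in range(epochs):
            train_fn(chunk, lr)       # adaptation only benefits later chunks
    return total


# Usage with stand-in callables that just record the call order.
log = []
total = score_first_ttt(
    ["chunk0", "chunk1"],
    score_fn=lambda c: log.append(("score", c)) or 1.0,
    train_fn=lambda c, lr: log.append(("train", c)),
)
```

The recorded call order shows each chunk's score entry preceding its three training entries, which is the property that keeps the evaluation causal.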
Quantization and Compression Strategy
The record achieves the 16 MB size limit through aggressive quantization and post-training compression.
GPTQ with SDClip
The quantization pipeline employs:
- Full-Hessian GPTQ with SDClip protection
- int6 precision for attention and MLP weights
- int8 precision for embedding tables
- Byte-shuffle preprocessing and Brotli-11 compression
- Zero-selective pruning to remove redundant parameters
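To illustrate what int6 storage implies, the sketch below quantizes one row of weights to the symmetric range [-31, 31] after clipping at a multiple of the row's standard deviation. This is plain round-to-nearest, not GPTQ. Reading "SDClip" as standard-deviation clipping, the default of 3 sigmas, and the function names are all assumptions made for illustration.

```python
import statistics


def quantize_int6(row, clip_sigmas=3.0):
    """Symmetric 6-bit quantization of one weight row (round-to-nearest)."""
    # Assumption: clip at clip_sigmas standard deviations of the row.
    limit = clip_sigmas * statistics.pstdev(row)
    # Fall back for constant or all-zero rows.
    limit = limit or max(abs(v) for v in row) or 1.0
    scale = limit / 31.0
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


row = [0.1, -0.2, 0.05, 0.3, -0.15]
q, scale = quantize_int6(row)
restored = dequantize(q, scale)
```

Values inside the clipping range are reconstructed to within half a quantization step, while outliers beyond it saturate at ±31. That trade gives the bulk of the weights finer resolution.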
LZMA Compression Wrapper
A final LZMA code wrapper (~16 KB) reduces the artifact by approximately 43 KB, bringing the final submission to about 15.99 MB, just under the 16 MB limit.
Reproducing the 10-Minute Run
Execute the following commands to replicate the record on 8×H100 GPUs:
# Install required dependencies
pip install brotli sentencepiece \
flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
# Download SP8192 tokenizer and FineWeb shards
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192
# Launch training with optimal hyperparameters
SEED=42 \
QK_GAIN_INIT=5.25 \
TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
The train_gpt.py script in the record directory contains the default values for weight decay (0.095), MLR (0.022), EMA decay (0.9965), and architectural configurations.
Summary
- Architecture: 11-layer transformer with 3-layer depth recurrence and parallel residuals yields 17 virtual layers within 16 MB.
- Optimization: MuonEq-R hybrid with AdamW, base LR 0.005, and 72% warm-down achieves convergence in 4550 steps (~588 seconds).
- Regularization: Weight decay 0.095, EMA decay 0.9965, and gradient clipping stabilize the aggressive training schedule.
- Post-Processing: Legal score-first TTT (3 epochs, SGD 0.005) and int6/int8 GPTQ quantization with LZMA compression fit the model into 15.99 MB while achieving 1.0810 BPB.
Frequently Asked Questions
What hardware is required to reproduce the 10-minute training record?
The record requires 8× NVIDIA H100 SXM GPUs with high-bandwidth interconnects. The 4550 training steps complete in approximately 588 seconds on this configuration, fitting within the strict 10-minute window. While other hardware may run the code, achieving the same wall-clock time requires equivalent computational throughput.
Why does the record use MuonEq-R instead of standard AdamW?
MuonEq-R (row-normalized Muon with 5 Newton-Schulz steps) provides faster convergence on low-precision weights compared to standard AdamW. The hybrid approach uses MuonEq-R for most parameters while reserving AdamW specifically for embeddings and scalars, ensuring stable updates for high-dimensional sparse gradients that Muon handles poorly.
How does depth recurrence improve model capacity without adding parameters?
The 3-layer recurrence (layers 3-5) reuses physical layer weights and is activated at fractional step 0.35, effectively creating 17 virtual layers from 11 physical layers. This technique increases effective depth without increasing parameter count or artifact size, satisfying the parameter-golf constraint of fitting within 16 MB while maintaining architectural capacity.
What is the purpose of the QK-Gain initialization at 5.25?
The QK-Gain (learnable per-head query scaling) initialized at 5.25 scales attention logits to compensate for the partial RoPE configuration (16/64 dimensions) and the depth recurrence structure. This specific initialization value prevents attention entropy collapse during the early stages of the aggressive training schedule, ensuring stable gradient flow through the 17 effective layers.