# parameter-golf | OpenAI | Knowledge Base | Instagit

Train the smallest LM you can that fits in 16MB. Best model wins!

GitHub Stars: 4.9k

Repository: https://github.com/openai/parameter-golf

---

## Articles

### [How to Configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf](/openai/parameter-golf/how-to-configure-flash-attention-grouped-query-attention-gqa)

Learn to configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf for faster training. Enable Flash Attention and set num_kv_heads below num_heads.

- Tags: how-to-guide
- Published: 2026-04-17

### [Logit Softcap Transformation in OpenAI Parameter-Golf: PyTorch and MLX Implementation](/openai/parameter-golf/how-logit-softcap-transformation-implemented)

Learn how the logit softcap transformation in OpenAI parameter-golf bounds logits to prevent instability while preserving the prediction distribution, in both the PyTorch and MLX implementations.

- Tags: internals
- Published: 2026-04-17
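
As a rough illustration of what "bounding logits" can look like, here is a minimal sketch using a scaled tanh, which is one common way to softcap. The tanh form, the function name `softcap`, and the cap value of 30.0 are assumptions for illustration, not details confirmed by this page.

```python
import math

def softcap(logits, cap=30.0):
    """Squash each logit into (-cap, cap) with a scaled tanh.

    Assumption: cap=30.0 is an illustrative value, not the repository's setting.
    """
    return [cap * math.tanh(x / cap) for x in logits]

raw = [-200.0, -1.0, 0.0, 1.0, 200.0]
capped = softcap(raw)
```

Because tanh is monotonic, the ranking of the logits is unchanged, and values that are small relative to the cap pass through almost unmodified.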

### [How Gradient Accumulation Works in an 8xH100 Distributed Setup](/openai/parameter-golf/how-gradient-accumulation-affect-8xh100-distributed-setup)

Learn how gradient accumulation works in an 8xH100 distributed setup, and how the number of micro-batches and the number of GPUs together keep the effective batch size constant.

- Tags: deep-dive
- Published: 2026-04-17
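
The relationship between micro-batches, GPU count, and effective batch size can be sketched as simple arithmetic. The function name and the token counts below are hypothetical; only the idea of holding the effective batch constant comes from the article summary.

```python
def grad_accum_steps(global_batch_tokens, micro_batch_tokens, world_size):
    """Return how many micro-batches each GPU must accumulate per optimizer step.

    All argument names and example values are illustrative assumptions.
    """
    tokens_per_pass = micro_batch_tokens * world_size
    if global_batch_tokens % tokens_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * world_size")
    return global_batch_tokens // tokens_per_pass

# Same effective batch on 8 GPUs and on 1 GPU: only the accumulation count changes.
steps_on_8_gpus = grad_accum_steps(524_288, 65_536, world_size=8)
steps_on_1_gpu = grad_accum_steps(524_288, 65_536, world_size=1)
```

With fewer GPUs, each one runs more micro-batches before the optimizer step, so the gradient is averaged over the same number of tokens either way.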

### [How Tied Embeddings Are Implemented and Initialized Efficiently in Parameter-Golf](/openai/parameter-golf/how-tied-embeddings-implemented-initialized-efficiently)

Learn how tied embeddings are efficiently implemented and initialized in parameter-golf. This technique stores one weight matrix instead of two by reusing the input embedding weights for the language modeling head.

- Tags: deep-dive
- Published: 2026-04-17

### [How to Configure bfloat16 Mixed Precision Training in Parameter-Golf](/openai/parameter-golf/how-to-configure-bfloat16-mixed-precision-training)

Configure bfloat16 mixed precision training in OpenAI parameter-golf. Learn how to speed up training with PyTorch autocast or the MLX backend settings.

- Tags: how-to-guide
- Published: 2026-04-17

### [How to Configure Partial Rotary Position Embedding (RoPE) in OpenAI Parameter Golf](/openai/parameter-golf/how-to-configure-partial-rotary-position-embedding-rope)

Learn to configure partial Rotary Position Embedding (RoPE) in OpenAI Parameter Golf. Reduce computation by applying RoPE to a subset of dimensions using rope_dims.

- Tags: how-to-guide
- Published: 2026-04-17

### [LeakyReLU Squared Activation Function Implementation in OpenAI Parameter-Golf](/openai/parameter-golf/how-leakyrelu-squared-activation-function-implemented)

Discover how the LeakyReLU squared activation function is implemented in OpenAI's parameter-golf repository. Learn the efficient two-step inline operation.

- Tags: internals
- Published: 2026-04-17
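
The "two-step" operation can be sketched on a single scalar: apply LeakyReLU, then square the result. The function name and the negative slope of 0.5 are assumptions for illustration; the repository's actual slope is not stated on this page.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring, on one scalar.

    Assumption: negative_slope=0.5 is an illustrative value.
    """
    y = x if x >= 0.0 else negative_slope * x  # step 1: LeakyReLU
    return y * y                               # step 2: square

outputs = [leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)]
```

Note that squaring makes the output non-negative for every input, so negative inputs contribute a smaller, slope-scaled positive value rather than zero.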

### [How LZMA Code Compression Reduces Submission Artifact Size in Parameter Golf](/openai/parameter-golf/how-lzma-code-compression-reduce-submission-artifact-size)

Discover how LZMA code compression slashes submission artifact size by 39% for OpenAI parameter golf. Learn about high-ratio entropy coding and quantized weight streams enabling smaller wrappers.

- Tags: deep-dive
- Published: 2026-04-17

### [Exponential Moving Average (EMA) Implementation for Small Models in Parameter-Golf](/openai/parameter-golf/how-exponential-moving-average-ema-implemented-small-models)

Learn how Exponential Moving Average (EMA) is implemented for small models in Parameter-Golf. Discover its lightweight dictionary-based system.

- Tags: internals
- Published: 2026-04-17
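
A dictionary-based EMA can be sketched in a few lines, here with plain floats standing in for tensors. The function name, the decay value, and the choice to seed the average with the first observed value are assumptions, not the repository's confirmed behavior.

```python
def ema_update(ema_state, params, decay=0.999):
    """Update a name -> value dictionary of exponential moving averages in place.

    Assumption: decay=0.999 and first-value seeding are illustrative choices.
    """
    for name, value in params.items():
        if name not in ema_state:
            ema_state[name] = value  # seed with the first observation
        else:
            ema_state[name] = decay * ema_state[name] + (1.0 - decay) * value
    return ema_state

state = {}
ema_update(state, {"w": 1.0}, decay=0.9)
ema_update(state, {"w": 0.0}, decay=0.9)
```

Keeping the averages in a dictionary keyed by parameter name means the shadow weights can be swapped in for evaluation without a second model object.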

### [How the Learning Rate Warmdown Schedule Works in OpenAI Parameter Golf](/openai/parameter-golf/how-learning-rate-warmdown-schedule-work)

Learn how the learning rate warmdown schedule in OpenAI parameter golf stabilizes convergence with linear decay. Understand its final training steps, wall-clock time, and budget fraction.

- Tags: deep-dive
- Published: 2026-04-17
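
A linear warmdown over the final training steps can be sketched as a learning-rate multiplier. This is a step-based version only; the function name and the step counts are hypothetical, and the wall-clock variant mentioned in the summary is not shown.

```python
def lr_scale(step, total_steps, warmdown_steps):
    """Return a multiplier in [0, 1]: flat at 1.0, then linear decay to 0.0.

    All names and values are illustrative assumptions.
    """
    if warmdown_steps <= 0:
        return 1.0
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)

schedule = [lr_scale(s, total_steps=1000, warmdown_steps=300) for s in (0, 700, 850, 1000)]
```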

### [Optimal Hyperparameters for Training Under 10 Minutes: The Complete Parameter Golf Record](/openai/parameter-golf/optimal-hyperparameters-training-under-10-minutes)

Discover optimal hyperparameters for training under 10 minutes. Achieve 1.0810 bits-per-byte using an 11-layer transformer, MuonEq-R, and int6 quantization on 8xH100 GPUs.

- Tags: deep-dive
- Published: 2026-04-17

### [How Skip Weights Enable Depth Recurrence in Parameter-Golf's GPT Architecture](/openai/parameter-golf/how-skip-weights-enable-depth-recurrence)

Discover how skip weights enable depth recurrence in Parameter-Golf's GPT architecture. Learn how learned scalars modulate encoder activations for dynamic revisiting of depth representations.

- Tags: internals
- Published: 2026-04-17

### [How the Muon Momentum Warmup Schedule Works in Parameter-Golf](/openai/parameter-golf/how-muon-momentum-warmup-schedule-implemented)

Discover how the Muon momentum warmup schedule works in Parameter-Golf. Learn its linear interpolation implementation and default values for optimizer momentum.

- Tags: internals
- Published: 2026-04-17
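
Linear interpolation of momentum during warmup can be sketched as follows. The function name and the default values (0.85, 0.95, and 500 steps) are placeholders chosen for illustration and should not be read as the repository's defaults.

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    """Linearly interpolate momentum from start to end over warmup_steps.

    Assumption: every default here is a hypothetical value.
    """
    frac = min(step / warmup_steps, 1.0) if warmup_steps > 0 else 1.0
    return (1.0 - frac) * start + frac * end

momentum_at = {s: muon_momentum(s) for s in (0, 250, 500, 10_000)}
```

After the warmup window the fraction is clamped at 1.0, so the momentum stays at its final value for the rest of training.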

### [How Distributed Training with DDP and Gradient Accumulation Works in Parameter-Golf](/openai/parameter-golf/how-distributed-training-ddp-gradient-accumulation-work)

Explore distributed training with DDP and gradient accumulation in parameter-golf. Scale batch sizes across GPUs while maintaining low per-GPU memory usage.

- Tags: how-to-guide
- Published: 2026-04-17

### [How to Implement Learnable QK-Gain Scaling for Attention Heads in parameter-golf](/openai/parameter-golf/how-to-implement-learnable-qk-gain-scaling-attention-heads)

Implement learnable QK-Gain scaling for attention heads in parameter-golf. Discover how to introduce a trainable scalar that scales the query vectors.

- Tags: how-to-guide
- Published: 2026-04-17

### [How Tokenizer-Agnostic Bits-Per-Byte (BPB) Is Calculated in OpenAI's Parameter-Golf](/openai/parameter-golf/how-tokenizer-agnostic-bits-per-byte-bpb-calculated)

Learn how tokenizer-agnostic bits-per-byte is calculated in OpenAI parameter-golf. Discover the formula converting cross-entropy loss to bits using token-to-byte ratios.

- Tags: deep-dive
- Published: 2026-04-17
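
The conversion from cross-entropy loss to bits-per-byte can be sketched as two steps: nats to bits, then per-token to per-byte. The function name is hypothetical, and the sketch assumes the loss is a mean natural-log cross-entropy per token.

```python
import math

def bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte of text.

    Assumption: the loss uses the natural logarithm, as is typical in PyTorch.
    """
    bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
    return bits_per_token * num_tokens / num_bytes  # per token -> per byte

# Hypothetical numbers: 2 bits per token, 4 bytes per token on average.
bpb = bits_per_byte(2 * math.log(2), num_tokens=1000, num_bytes=4000)
```

Dividing by the byte count rather than the token count is what makes the metric comparable across tokenizers with different vocabulary sizes.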

### [How the 16MB Model Artifact Size Limit Is Enforced in OpenAI Parameter Golf](/openai/parameter-golf/how-16mb-model-artifact-size-limit-enforced)

Discover how OpenAI enforces the 16MB model artifact size limit by calculating combined compressed weights and source code against a hard cap.

- Tags: internals
- Published: 2026-04-17
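
The size check described here can be sketched as a sum compared against a cap. The function names are hypothetical, and the cap of 16,000,000 bytes assumes a decimal megabyte; treat that value as an assumption to verify against the repository's rules.

```python
LIMIT_BYTES = 16_000_000  # assumption: decimal 16MB, not 16 MiB

def total_artifact_bytes(code: bytes, compressed_weights: bytes) -> int:
    """Combined size of the source code and the compressed weights."""
    return len(code) + len(compressed_weights)

def within_limit(code: bytes, compressed_weights: bytes, limit: int = LIMIT_BYTES) -> bool:
    """True when the combined artifact fits under the hard cap."""
    return total_artifact_bytes(code, compressed_weights) <= limit

fits = within_limit(b"x" * 1_000, b"y" * 15_999_000)
too_big = within_limit(b"x" * 1_000, b"y" * 15_999_001)
```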

### [Test-Time Training (TTT) in OpenAI Parameter-Golf: Implementation and Legal Constraints](/openai/parameter-golf/what-is-test-time-training-ttt-applied)

Discover Test-Time Training (TTT) and its OpenAI parameter-golf implementation. Learn how TTT enhances compression rates while respecting causality and single-pass rules.

- Tags: deep-dive
- Published: 2026-04-17

### [How Int8 Quantization with Zlib Compression Reduces Model Artifacts in OpenAI's Parameter Golf](/openai/parameter-golf/how-int8-quantization-zlib-compression-model-artifacts)

Learn how Int8 quantization with zlib compression slashes model artifact sizes to 30% of FP32 by converting weights to 8-bit integers and applying DEFLATE compression. Explore OpenAI's parameter golf approach.

- Tags: deep-dive
- Published: 2026-04-17
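
The quantize-then-compress pipeline can be sketched with the standard library alone. This is a simplified version using one symmetric scale for the whole weight list; the function names, the byte layout, and the single-scale choice are assumptions, and the repository may use a different scheme.

```python
import struct
import zlib

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def pack_and_compress(q, scale, level=9):
    """Serialize the scale (float32) plus int8 values, then apply DEFLATE via zlib."""
    payload = struct.pack("<f", scale) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(payload, level)

def decompress_and_dequantize(blob):
    """Invert pack_and_compress and rescale back to floats."""
    payload = zlib.decompress(blob)
    (scale,) = struct.unpack_from("<f", payload, 0)
    q = struct.unpack_from(f"<{len(payload) - 4}b", payload, 4)
    return [v * scale for v in q]

weights = [0.5, -0.25, 0.0, 1.0, -1.0] * 100
q, scale = quantize_int8(weights)
blob = pack_and_compress(q, scale)
restored = decompress_and_dequantize(blob)
```

The round trip is lossy: each restored weight differs from the original by roughly half a quantization step at most, which is the price paid for the smaller artifact.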

### [How Depth Recurrence Works in the Parameter-Golf GPT Model](/openai/parameter-golf/how-depth-recurrence-implemented-gpt-model)

Discover how depth recurrence works in the parameter-golf GPT model. Learn about splitting the transformer stack, caching hidden states, and re-injecting them with learned skip weights.

- Tags: internals
- Published: 2026-04-17

### [How the Muon Optimizer Works for Parameter-Constrained Training](/openai/parameter-golf/how-does-muon-optimizer-work-parameter-constrained-training)

Discover how the Muon optimizer achieves stable parameter-constrained training by orthogonalizing gradients and scaling them effectively for quantized transformers.

- Tags: deep-dive
- Published: 2026-04-17