parameter-golf
Train the smallest LM you can that fits in 16MB. Best model wins!
Learn to configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf for faster training. Enable Flash Attention and set num_kv_heads below num_heads.
Logit Softcap Transformation in OpenAI Parameter-Golf: PyTorch and MLX Implementation
Discover the logit softcap transformation in OpenAI parameter-golf. Learn how this PyTorch and MLX implementation bounds logits to prevent instability while preserving the prediction distribution.
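As a rough illustration of what bounding logits can look like, the sketch below uses the common scaled-tanh form of a softcap. It assumes that form and a cap value of 30.0, both chosen here for illustration and not confirmed by the teaser above; the function name `softcap` is hypothetical.

```python
import math

def softcap(logits, cap=30.0):
    """Squash each logit into the open interval (-cap, cap) with a scaled tanh.

    `cap` is an illustrative value, not one taken from the repository.
    """
    return [cap * math.tanh(z / cap) for z in logits]

# Large logits saturate near +/-cap; small logits pass through almost unchanged.
raw = [-200.0, -1.0, 0.0, 1.0, 200.0]
capped = softcap(raw)
```

Because tanh is monotonic, the ordering of the logits is unchanged, so the top prediction stays the same while extreme values are bounded.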
How Gradient Accumulation Works in an 8xH100 Distributed Setup
Discover how gradient accumulation works in an 8xH100 distributed setup. Learn how micro-batches and per-GPU processing keep the effective batch size constant.
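The arithmetic behind a constant effective batch size can be sketched as follows. This assumes a fixed number of micro-batches per optimizer step (8 here, a hypothetical default) that is divided evenly across however many GPUs are available; the function names are illustrative and not taken from the repository.

```python
def grad_accum_steps(world_size, total_micro_batches=8):
    """Micro-batches each GPU accumulates before one optimizer step.

    `total_micro_batches` is an assumed constant for this sketch.
    """
    if total_micro_batches % world_size != 0:
        raise ValueError("world_size must divide total_micro_batches")
    return total_micro_batches // world_size

def effective_batch_tokens(micro_batch_tokens, world_size, total_micro_batches=8):
    """Tokens contributing to a single optimizer step across all GPUs."""
    steps = grad_accum_steps(world_size, total_micro_batches)
    return micro_batch_tokens * world_size * steps
```

Under these assumptions, 8 GPUs run one micro-batch each per step while a single GPU runs eight in sequence, and both produce the same effective batch.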
How Tied Embeddings Are Implemented and Initialized Efficiently in Parameter-Golf
Learn how tied embeddings are implemented and initialized in parameter-golf. This technique halves the memory used by the embedding and output-projection weights by reusing the input embedding matrix for the language modeling head.
How to Configure bfloat16 Mixed Precision Training in Parameter-Golf
Configure bfloat16 mixed precision training in OpenAI parameter-golf. Learn how to speed up training with PyTorch autocast or the MLX backend settings.
How to Configure Partial Rotary Position Embedding (RoPE) in OpenAI Parameter Golf
Learn to configure partial Rotary Position Embedding (RoPE) in OpenAI Parameter Golf. Reduce computation by applying RoPE to a subset of dimensions using rope_dims for improved efficiency.
LeakyReLU Squared Activation Function Implementation in OpenAI Parameter-Golf
Discover how the LeakyReLU squared activation function is implemented in OpenAI's parameter-golf repository. Learn the efficient two-step inline operation.
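A scalar sketch of the two steps (apply LeakyReLU, then square the result) might look like this. The `negative_slope` default of 0.5 is an assumption made for illustration, and the function name is hypothetical.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring, applied to a single float.

    `negative_slope` is an assumed value, not confirmed by the source.
    """
    # Step 1: LeakyReLU scales negative inputs instead of zeroing them.
    y = x if x >= 0.0 else negative_slope * x
    # Step 2: square the result; the output is always non-negative.
    return y * y
```

Note that squaring discards the sign, so negative inputs still yield a small positive activation.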
How LZMA Code Compression Reduces Submission Artifact Size in Parameter Golf
Discover how LZMA code compression slashes submission artifact size by 39% for OpenAI parameter golf. Learn about high-ratio entropy coding and quantized weight streams enabling smaller wrappers.
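For readers unfamiliar with LZMA, the following minimal sketch compresses a source string with Python's standard `lzma` module and restores it. The helper names and the sample text are hypothetical, and the ratio obtained here says nothing about the 39% figure quoted above.

```python
import lzma

def compress_source(source: str) -> bytes:
    """Compress source text at the highest standard preset."""
    return lzma.compress(source.encode("utf-8"), preset=9 | lzma.PRESET_EXTREME)

def decompress_source(blob: bytes) -> str:
    """Recover the original text from the compressed bytes."""
    return lzma.decompress(blob).decode("utf-8")

# Highly repetitive text, used only to demonstrate the round trip.
sample = "def forward(x):\n    return x\n" * 200
blob = compress_source(sample)
```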
Exponential Moving Average (EMA) Implementation for Small Models in Parameter-Golf
Learn how an Exponential Moving Average (EMA) is implemented for small models in Parameter-Golf. Discover its lightweight dictionary-based system.
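One way a dictionary-based EMA could work is sketched below, using plain floats in place of tensors. The function name, the decay value, and the choice to seed new entries with the first observed value are all assumptions for illustration.

```python
def ema_update(ema_state, params, decay=0.99):
    """Blend current parameter values into a running average, keyed by name."""
    for name, value in params.items():
        if name not in ema_state:
            # First sighting: start the average at the current value.
            ema_state[name] = value
        else:
            ema_state[name] = decay * ema_state[name] + (1.0 - decay) * value
    return ema_state

ema = {}
ema_update(ema, {"w": 1.0}, decay=0.9)  # seeds ema["w"] with 1.0
ema_update(ema, {"w": 0.0}, decay=0.9)  # moves 10% of the way toward 0.0
```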
How the Learning Rate Warmdown Schedule Works in OpenAI Parameter Golf
Learn how the learning rate warmdown schedule in OpenAI parameter golf stabilizes convergence with linear decay. Understand its final training steps, wall-clock time, and budget fraction.
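A linear warmdown over the final fraction of a training budget can be sketched as a multiplier on the base learning rate. The function name and the 0.25 default fraction are hypothetical; `progress` is assumed to be the share of the budget already used, measured either in steps or in wall-clock time.

```python
def lr_scale(progress, warmdown_frac=0.25):
    """Learning-rate multiplier: 1.0 until warmdown begins, then linear to 0.0."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    warmdown_start = 1.0 - warmdown_frac
    if progress <= warmdown_start:
        return 1.0
    # Linear decay from 1.0 at warmdown_start down to 0.0 at progress == 1.0.
    return (1.0 - progress) / warmdown_frac
```

With a time-limited run, `progress` could be computed as elapsed seconds divided by the allowed seconds, so the decay ends exactly when the budget does.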
Optimal Hyperparameters for Training Under 10 Minutes: The Complete Parameter Golf Record
Discover optimal hyperparameters for training under 10 minutes. Achieve 1.0810 bits-per-byte using an 11-layer transformer, MuonEq-R, and int6 quantization on 8xH100 GPUs.
How Skip Weights Enable Depth Recurrence in Parameter-Golf's GPT Architecture
Discover how skip weights enable depth recurrence in Parameter-Golf's GPT architecture. Learn how learned scalars modulate encoder activations for dynamic revisiting of depth representations.