How to Configure Partial Rotary Position Embedding (RoPE) in OpenAI Parameter Golf
Partial Rotary Position Embedding (RoPE) reduces computational overhead by applying rotary positional encodings to only a subset of the model's hidden dimensions, controlled via the rope_dims parameter in the openai/parameter-golf repository.
The openai/parameter-golf codebase implements a flexible partial RoPE mechanism that allows you to specify exactly how many dimensions receive rotary embeddings. This configuration can be adjusted through environment variables, command-line arguments, or direct Python instantiation, making it straightforward to experiment with different positional encoding strategies.
Understanding Partial RoPE in Parameter Golf
Rotary Position Embedding (RoPE) traditionally rotates every dimension of each attention head's query and key vectors to encode position. Partial RoPE limits the rotation to the first rope_dims dimensions of each head and leaves the remaining dimensions untouched. In the parameter-golf implementation, setting rope_dims to 0 or to a value that exceeds the head dimension falls back to full-dimensional RoPE.
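The fallback rule can be written down as a small helper. This is only an illustrative sketch: effective_rope_dims is a hypothetical name that does not appear in the repository, and it assumes that a value equal to the head dimension is treated the same as full RoPE.

```python
def effective_rope_dims(rope_dims: int, head_dim: int) -> int:
    """Return how many leading dimensions of each head would be rotated.

    Hypothetical helper, not part of the repository: 0 (or any value that
    does not fit inside the head) is read as "rotate the whole head".
    """
    if rope_dims <= 0 or rope_dims >= head_dim:
        return head_dim
    return rope_dims


# Show how a few requested values resolve for a 64-dimensional head.
for requested in (0, 16, 64, 128):
    print(requested, "->", effective_rope_dims(requested, head_dim=64))
```

With a head dimension of 64, only the request for 16 produces a genuinely partial rotation; the other three values all resolve to the full 64 dimensions.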
Configuration Methods
Environment Variable (ROPE_DIMS)
The simplest way to configure partial RoPE is through the ROPE_DIMS environment variable. Training scripts in the repository read this value at startup:
rope_dims = int(os.environ.get('ROPE_DIMS', 16))
This snippet appears in record scripts such as records/track_10min_16mb/2026-04-06_SP8192_HessianSDClip_ProgressiveRecurrence/train_gpt_decode.py at line 73. If the environment variable is unset, it defaults to 16 dimensions.
To use this method:
export ROPE_DIMS=32
python train_gpt.py
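If you want the script to reject bad values early, the one-line read can be wrapped in a small validator. The sketch below is an assumption rather than code from the repository: read_rope_dims is a hypothetical helper, and the even-number check reflects the fact that RoPE rotates dimensions in pairs.

```python
import os


def read_rope_dims(default: int = 16, env=os.environ) -> int:
    """Read ROPE_DIMS from the environment and validate it.

    Hypothetical helper: the record scripts read the variable directly
    with int(os.environ.get('ROPE_DIMS', 16)).
    """
    raw = env.get("ROPE_DIMS")
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError if the text is not an integer
    if value < 0:
        raise ValueError(f"ROPE_DIMS must be non-negative, got {value}")
    if value % 2 != 0:
        # Rotary embeddings rotate dimensions in pairs.
        raise ValueError(f"ROPE_DIMS must be even, got {value}")
    return value


print(read_rope_dims(env={"ROPE_DIMS": "32"}))  # 32
```

Passing the mapping as an argument keeps the helper easy to test without touching the real shell environment.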
Command-Line Interface (--rope-dims)
The main training entry point train_gpt.py exposes a --rope-dims argument through argparse. This CLI flag overrides any environment variable setting:
python train_gpt.py --rope-dims 64
This approach provides per-run flexibility without modifying shell environments. The argument parser definition resides in the top-level train_gpt.py file.
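The parser definition itself is not reproduced in this article, so the following is only a sketch of one common way to let a CLI flag take precedence over an environment variable: the environment value becomes the argparse default. The build_parser function and the help text are hypothetical.

```python
import argparse
import os


def build_parser(env=os.environ) -> argparse.ArgumentParser:
    """Build a parser whose --rope-dims default comes from ROPE_DIMS.

    Hypothetical sketch; the real definition lives in train_gpt.py.
    """
    parser = argparse.ArgumentParser(description="Partial RoPE flag sketch")
    parser.add_argument(
        "--rope-dims",
        type=int,
        default=int(env.get("ROPE_DIMS", 16)),
        help="number of leading head dimensions that receive rotary embeddings",
    )
    return parser


# An explicit flag wins over the environment-derived default.
args = build_parser(env={"ROPE_DIMS": "32"}).parse_args(["--rope-dims", "64"])
print(args.rope_dims)  # 64
```

With this pattern, omitting the flag falls back to ROPE_DIMS, and omitting both falls back to 16.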
Programmatic Configuration
For notebook experimentation or custom training loops, you can pass rope_dims directly to the GPT model constructor:
from train_gpt import GPT

model = GPT(
    vocab_size=50257,
    num_layers=12,
    model_dim=768,
    num_heads=12,
    num_kv_heads=12,
    rope_dims=64,  # Apply partial RoPE to first 64 dimensions only
)
This method is demonstrated in record scripts such as records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py at lines 853-857, where the rope_dims value propagates through each transformer block.
Implementation Details
The Rotary class in train_gpt_decode.py (lines 354-389) implements the partial embedding logic. When rope_dims is positive and less than the head dimension, the implementation splits the input tensor into a rotated part and a pass-through part:
# From the Rotary class: a rope_dims of 0 falls back to the full dimension
self.rope_dims = rope_dims if rope_dims > 0 else dim
inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2) / self.rope_dims))

# Rotation helper: split only when 0 < rope_dims < head dimension
def apply_rotary_emb(x, cos, sin, rope_dims=0):
    if rope_dims > 0 and rope_dims < x.size(-1):
        x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:]
        # Rotation applied only to x_rope
Because only the first rope_dims dimensions of each head are rotated, the rotation step does less work. The remaining dimensions pass through unchanged, so they contribute to attention scores without any positional signal.
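To make the arithmetic concrete, here is a dependency-free sketch of the same idea on a single head vector, using plain Python lists instead of tensors. It is not the repository's code: apply_partial_rope and rope_angles are hypothetical names, and pairing dimension i with dimension i + rope_dims/2 (the "half-split" layout) is an assumption about how the rotated slice is arranged.

```python
import math
from typing import List


def rope_angles(position: int, rope_dims: int, base: float = 10000.0) -> List[float]:
    """One rotation angle per pair of rotated dimensions."""
    return [position / (base ** (i / rope_dims)) for i in range(0, rope_dims, 2)]


def apply_partial_rope(
    x: List[float], position: int, rope_dims: int = 0, base: float = 10000.0
) -> List[float]:
    """Rotate the first rope_dims entries of x and pass the rest through."""
    head_dim = len(x)
    if rope_dims <= 0 or rope_dims >= head_dim:
        rope_dims = head_dim  # fall back to full RoPE
    if rope_dims % 2 != 0:
        raise ValueError("rotated dimensions must come in pairs")

    x_rope, x_pass = x[:rope_dims], x[rope_dims:]
    half = rope_dims // 2
    x1, x2 = x_rope[:half], x_rope[half:]  # assumed half-split pairing
    angles = rope_angles(position, rope_dims, base)

    # Standard 2D rotation applied to each (x1[i], x2[i]) pair.
    out1 = [a * math.cos(t) + b * math.sin(t) for a, b, t in zip(x1, x2, angles)]
    out2 = [-a * math.sin(t) + b * math.cos(t) for a, b, t in zip(x1, x2, angles)]
    return out1 + out2 + x_pass


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rope(q, position=3, rope_dims=4)
print(rotated[4:])  # the pass-through tail: [5.0, 6.0, 7.0, 8.0]
```

The first four entries change with position while their overall length (norm) is preserved, and the last four entries are identical at every position.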
Summary
- Partial RoPE in openai/parameter-golf is controlled by the rope_dims parameter, which specifies how many dimensions receive rotary embeddings.
- Configuration options include the ROPE_DIMS environment variable, the --rope-dims CLI flag, and direct constructor arguments.
- When rope_dims exceeds the head dimension or equals 0, the system defaults to full-dimensional RoPE.
- The implementation splits query/key tensors in the Rotary class, processing only the first rope_dims elements and passing the remainder unchanged.
Frequently Asked Questions
What happens if I set rope_dims larger than the head dimension?
If rope_dims exceeds the head dimension, the partial path is skipped: in the excerpt under Implementation Details, the split only happens when rope_dims is positive and less than x.size(-1), so any larger value falls through to standard full RoPE. This guard keeps the tensor slicing operations valid.
Does partial RoPE affect model convergence or final performance?
Partial RoPE does not change the parameter count, because rotary embeddings have no learned weights; it only reduces the work done in the rotation step, which may modestly speed up training on hardware-limited setups. According to the parameter-golf source code, experiments vary rope_dims across different training tracks, suggesting that the effect on convergence and final quality depends on the specific sequence length and model architecture.
How do I verify that partial RoPE is actually being used during training?
You can inspect the model configuration at runtime by checking the rotary attribute of any attention block. The Rotary instance stores its rope_dims value, which you can print to confirm it matches your intended configuration:
print(model.blocks[0].attn.rotary.rope_dims)
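To check every block rather than just the first, the same attribute path can be walked in a loop. This sketch assumes each block exposes attn.rotary.rope_dims as in the line above; collect_rope_dims and check_rope_dims are hypothetical helpers, and the SimpleNamespace objects are stand-ins so the example runs without a trained model.

```python
from types import SimpleNamespace


def collect_rope_dims(model) -> list:
    """Return the rope_dims value stored in each block's Rotary instance."""
    return [block.attn.rotary.rope_dims for block in model.blocks]


def check_rope_dims(model, expected: int) -> None:
    """Raise if any block disagrees with the intended configuration."""
    found = collect_rope_dims(model)
    mismatched = [i for i, value in enumerate(found) if value != expected]
    if mismatched:
        raise ValueError(f"blocks {mismatched} do not use rope_dims={expected}: {found}")


# Stand-in for a real model: three blocks that all use rope_dims=16.
fake_model = SimpleNamespace(
    blocks=[
        SimpleNamespace(attn=SimpleNamespace(rotary=SimpleNamespace(rope_dims=16)))
        for _ in range(3)
    ]
)
print(collect_rope_dims(fake_model))  # [16, 16, 16]
check_rope_dims(fake_model, expected=16)  # passes silently
```

With a real model, pass the model object in place of fake_model.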
Can I change rope_dims for inference only without retraining?
The repository training scripts suggest rope_dims is a structural hyperparameter fixed at model initialization. While the Rotary class could theoretically accept different dimensions at inference time, the model weights are trained with specific positional encoding patterns, so changing rope_dims after training would likely degrade output quality.