How Depth Recurrence Works in the Parameter-Golf GPT Model
Depth recurrence in the parameter-golf GPT model is implemented by splitting the transformer stack into encoder and decoder halves, caching intermediate hidden states during the encoder phase, and re-injecting them in reverse order during the decoder phase using learned skip weights.
The openai/parameter-golf repository explores parameter-efficient transformer architectures. Depth recurrence is a central technique that allows the model to reuse representations across layers while adding only a small number of learned weights. This article examines the specific implementation details found in the source code.
Understanding the Encoder-Decoder Split in Depth Recurrence
The architecture divides the standard transformer stack into two functional groups. The first half processes the input sequence and accumulates intermediate representations, while the second half consumes those representations to produce the final output.
Layer Partitioning Logic
In train_gpt.py, the model initializes the split by integer-dividing the total layer count by two. The code assigns that many layers to the encoder and the remainder to the decoder:
self.num_encoder_layers = num_layers // 2
self.num_decoder_layers = num_layers - self.num_encoder_layers
This logic appears at lines 70-73 in train_gpt.py. For a model with 8 layers, this creates 4 encoder layers and 4 decoder layers. With an odd count such as 9, the extra layer goes to the decoder (4 encoder, 5 decoder).
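The split can be checked for other layer counts with a small standalone helper. This is a minimal sketch that mirrors the two assignments above; the function name split_layers is hypothetical and does not exist in train_gpt.py.

```python
from typing import Tuple


def split_layers(num_layers: int) -> Tuple[int, int]:
    """Floor half of the layers go to the encoder, the rest to the decoder."""
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers
    return num_encoder_layers, num_decoder_layers


if __name__ == "__main__":
    for n in (8, 9):
        enc, dec = split_layers(n)
        print(f"{n} layers -> {enc} encoder, {dec} decoder")
```

Because the encoder count is rounded down, the decoder is never smaller than the encoder.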
Storage of Intermediate Activations
During the forward pass, the model collects the output of each encoder block in a list called skips. This caching happens immediately after each encoder transformation:
for i in range(self.num_encoder_layers):
    x = self.blocks[i](x, x0)
    skips.append(x)
You can find this loop at lines 6-10 in train_gpt.py. The skips list holds the depth-recurrent connections that will feed into the decoder.
Implementing Skip Connections in the Decoder
The decoder phase distinguishes this architecture from standard transformers. Instead of processing layers sequentially with only immediate previous-layer access, each decoder step can reference specific encoder states from the first half of the network.
Reverse-Order Skip Consumption
The decoder walks its layers in forward order but consumes the cached skips in reverse order relative to how they were stored. At each decoder layer, the code pops the most recently stored skip from the list and adds it, scaled by that layer's skip weight, before running the block:
for i in range(self.num_decoder_layers):
    if skips:
        x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
    x = self.blocks[self.num_encoder_layers + i](x, x0)
This implementation at lines 11-14 in train_gpt.py ensures that the deepest encoder layer connects to the shallowest decoder layer, creating a U-Net-like structure within the transformer stack.
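The pairing produced by append followed by pop can be traced without any tensors. The sketch below stores block indices instead of activations and reports which encoder block feeds each decoder block; skip_pairing is a hypothetical helper written for illustration, not code from the repository.

```python
from typing import List, Optional, Tuple


def skip_pairing(num_layers: int) -> List[Tuple[int, Optional[int]]]:
    """Return (decoder_block_index, encoder_block_index) pairs.

    The encoder index is None when the skips list is already empty.
    """
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers

    # Encoder phase: record which block produced each cached state.
    skips = []
    for i in range(num_encoder_layers):
        skips.append(i)

    # Decoder phase: pop() removes the most recently stored entry (LIFO).
    pairs = []
    for i in range(num_decoder_layers):
        source = skips.pop() if skips else None
        pairs.append((num_encoder_layers + i, source))
    return pairs


if __name__ == "__main__":
    for decoder_block, encoder_block in skip_pairing(8):
        print(f"block {decoder_block} receives skip from block {encoder_block}")
```

For 8 layers, blocks 4, 5, 6, 7 receive skips from blocks 3, 2, 1, 0 respectively. With an odd layer count the final decoder block finds the list empty, which is the case the `if skips:` guard covers.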
Learned Skip Weights
The model does not simply add the cached activations directly. Instead, it applies a learnable weight vector skip_weights to each skip connection:
self.skip_weights = torch.nn.Parameter(torch.zeros(num_decoder_layers, model_dim))
This parameterization allows the network to learn how much influence each specific encoder layer should have on its corresponding decoder layer, with initial values of zero ensuring the skip connections do not disrupt training at initialization.
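The effect of a per-channel weight can be illustrated on a single hidden vector. In the model, the [None, None, :] indexing broadcasts the weight across the batch and sequence dimensions; the sketch below drops those dimensions and uses plain Python lists. The function name and all numeric values are hypothetical.

```python
from typing import List


def add_weighted_skip(x: List[float], skip: List[float], weight: List[float]) -> List[float]:
    """Compute x + weight * skip independently for each channel."""
    return [xi + wi * si for xi, si, wi in zip(x, skip, weight)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    skip = [10.0, 20.0, 30.0]
    # All-zero weights leave x unchanged, so the skip path contributes nothing.
    print(add_weighted_skip(x, skip, [0.0, 0.0, 0.0]))
    # Non-zero weights gate each channel of the skip separately.
    print(add_weighted_skip(x, skip, [0.5, 0.0, 1.0]))
```

Since the weight is a vector rather than a scalar, the network can pass some channels of an encoder state through strongly while suppressing others.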
Residual Mixing and Token Embedding Recurrence
Beyond the encoder-decoder split, each transformer block maintains a connection to the original token embeddings through a learned mixing parameter.
The resid_mix Parameter
Every Block contains a resid_mix parameter of shape [2, dim] that blends the current hidden state with the initial token embedding x0:
mix = self.resid_mix.to(dtype=x.dtype)
x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
This mechanism, found at lines 637-640 in train_gpt.py, lets each layer learn, per channel, how much to draw from the original embeddings versus the processed representation, creating another form of recurrence through the network depth.
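The two rows of resid_mix act as separate per-channel scales, which a list-based sketch makes concrete. The batch and sequence dimensions are omitted, and the helper name and coefficient values are hypothetical choices for illustration, not values taken from the repository.

```python
from typing import List


def mix_with_embedding(x: List[float], x0: List[float], resid_mix: List[List[float]]) -> List[float]:
    """Compute resid_mix[0] * x + resid_mix[1] * x0 for each channel."""
    scale_x, scale_x0 = resid_mix
    return [a * xi + b * x0i for a, b, xi, x0i in zip(scale_x, scale_x0, x, x0)]


if __name__ == "__main__":
    x = [4.0, 4.0]   # current hidden state
    x0 = [1.0, 2.0]  # original token embedding
    # Row 0 all ones and row 1 all zeros passes x through unchanged.
    print(mix_with_embedding(x, x0, [[1.0, 1.0], [0.0, 0.0]]))
    # The rows are independent, so the coefficients need not sum to one.
    print(mix_with_embedding(x, x0, [[0.5, 1.0], [0.5, -1.0]]))
```

The second call shows that a channel can average the two inputs, or even subtract the embedding, depending on what the rows contain.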
Complete Implementation Example
The following code demonstrates how to instantiate the GPT model with depth recurrence and inspect its configuration:
import torch
from train_gpt import GPT, Hyperparameters

# Configure model with 8 layers (4 encoder + 4 decoder)
args = Hyperparameters()
model = GPT(
    vocab_size=args.vocab_size,
    num_layers=8,  # Creates depth recurrence split
    model_dim=args.model_dim,
    num_heads=args.num_heads,
    num_kv_heads=args.num_kv_heads,
    mlp_mult=args.mlp_mult,
    tie_embeddings=args.tie_embeddings,
    tied_embed_init_std=args.tied_embed_init_std,
    logit_softcap=args.logit_softcap,
    rope_base=args.rope_base,
    qk_gain_init=args.qk_gain_init,
)

# Verify depth recurrence structure
print(f"Encoder layers: {model.num_encoder_layers}")  # → 4
print(f"Decoder layers: {model.num_decoder_layers}")  # → 4
print(f"Skip weights shape: {model.skip_weights.shape}")  # (4, model_dim)

# Forward pass with automatic skip connections
batch = torch.randint(0, args.vocab_size, (2, args.train_seq_len))
logits = model(batch, batch)  # Depth recurrence applied internally
Summary
- Depth recurrence in openai/parameter-golf splits the transformer into encoder and decoder halves using the integer division num_layers // 2.
- Skip connections store encoder outputs in a skips list and consume them in reverse order during the decoder phase.
- Learned parameters (skip_weights and resid_mix) control the strength of connections between encoder-decoder pairs and between current states and original embeddings.
- Implementation resides primarily in train_gpt.py at lines 70-73 (layer split), 6-14 (skip storage and consumption), and 637-640 (residual mixing).
Frequently Asked Questions
What is the purpose of splitting layers into encoder and decoder?
The split lets the model reuse intermediate representations from early layers to inform deeper computation while adding only the small skip_weights tensor. By treating the first half as an encoder that saves "memories" and the second half as a decoder that recalls them, the architecture gives late layers direct access to early features at little parameter cost.
How does the reverse-order skip mechanism work?
The mechanism pops cached activations from the end of the skips list as the decoder progresses. Because the list was filled sequentially during encoding (layer 0, then 1, then 2...), popping removes items in reverse order (last encoder layer first). This wiring connects the deepest encoder layer to the shallowest decoder layer, creating a symmetric U-Net-like information flow through the transformer stack.
What role does resid_mix play in depth recurrence?
The resid_mix parameter provides a learned mechanism to blend the current hidden state with the original token embeddings at every layer. This creates a form of depth recurrence by ensuring that information from the input tokens can directly influence deep layers without being filtered through many intermediate transformations. The [2, dim] shape gives the model two independent per-channel coefficients, one scaling the current state and one scaling the initial embedding. They are not constrained to sum to one, so the blend is a general weighted sum rather than a strict interpolation.
Where is the depth recurrence logic located in the repository?
The core implementation resides in train_gpt.py within the GPT class. The layer split logic appears at lines 70-73, the skip storage mechanism is at lines 6-10, the decoder skip consumption is at lines 11-14, and the residual mixing occurs at lines 637-640. An alternative implementation for the MLX backend is available in train_gpt_mlx.py, which mirrors the same recurrence patterns.