# How Depth Recurrence Works in the Parameter-Golf GPT Model

> Discover how depth recurrence works in the parameter-golf GPT model. Learn about splitting the transformer stack, caching hidden states, and re-injecting them with learned skip weights.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: internals
- Published: 2026-04-17

---

**Depth recurrence in the parameter-golf GPT model is implemented by splitting the transformer stack into encoder and decoder halves, caching intermediate hidden states during the encoder phase, and re-injecting them in reverse order during the decoder phase using learned skip weights.**

The `openai/parameter-golf` repository explores parameter-efficient transformer architectures. Depth recurrence is a central technique that lets later layers reuse representations computed by earlier layers at the cost of only a small number of extra parameters (the learned skip weights). This article examines the specific implementation details found in the source code.

## Understanding the Encoder-Decoder Split in Depth Recurrence

The architecture divides the standard transformer stack into two functional groups. The first half processes the input sequence and accumulates intermediate representations, while the second half consumes those representations to produce the final output.

### Layer Partitioning Logic

In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), the model computes the split with integer division. The encoder receives `num_layers // 2` layers and the decoder receives the remainder:

```python
self.num_encoder_layers = num_layers // 2
self.num_decoder_layers = num_layers - self.num_encoder_layers
```

This logic appears at lines 70-73 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py). For a model with 8 layers, this creates 4 encoder layers and 4 decoder layers. With an odd count such as 9, the decoder receives the extra layer (4 encoder, 5 decoder).
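To see how the floor division behaves for different layer counts, here is a small standalone sketch. The helper name `split_layers` is hypothetical and not part of the repository; it only mirrors the two assignments shown above.

```python
def split_layers(num_layers: int) -> tuple[int, int]:
    """Mirror the split: floor(num_layers / 2) encoder layers, the rest decoder."""
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers
    return num_encoder_layers, num_decoder_layers


for n in (8, 9, 1):
    enc, dec = split_layers(n)
    print(f"{n} layers -> {enc} encoder, {dec} decoder")
```

Because the encoder count is rounded down, the decoder is never smaller than the encoder.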

### Storage of Intermediate Activations

During the forward pass, the model collects the output of each encoder block in a list called `skips`. This caching happens immediately after each encoder transformation:

```python
for i in range(self.num_encoder_layers):
    x = self.blocks[i](x, x0)
    skips.append(x)
```

You can find this loop in the forward pass of the `GPT` class in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py). The `skips` list holds the encoder outputs that the decoder later consumes as skip connections.

## Implementing Skip Connections in the Decoder

The decoder phase distinguishes this architecture from standard transformers. Instead of seeing only the output of the previous layer, a decoder layer can also receive a cached encoder state from the first half of the network.

### Reverse-Order Skip Consumption

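The decoder runs its layers in forward order but consumes the cached skips in reverse order relative to how they were stored. Before each decoder block, the code pops the most recently stored skip from the end of the list, scales it by that layer's skip weight, and adds it to the hidden state: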

```python
for i in range(self.num_decoder_layers):
    if skips:
        x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop()
    x = self.blocks[self.num_encoder_layers + i](x, x0)
```

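This loop, also in the forward pass of the `GPT` class in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), connects the deepest encoder layer to the shallowest decoder layer, creating a U-Net-like structure within the transformer stack. The `if skips:` guard lets a decoder layer run without a skip once the list is empty, which happens when an odd layer count gives the decoder one more layer than the encoder.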
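The pairing produced by the append-then-pop pattern can be traced without any tensors. The sketch below is a hypothetical helper (`skip_pairing` is not a function in the repository) that replaces each cached activation with the index of the encoder block that produced it.

```python
from typing import Optional


def skip_pairing(num_layers: int) -> list[tuple[int, Optional[int]]]:
    """Return (decoder_block_index, encoder_block_index) pairs in execution order."""
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers

    skips = []
    for i in range(num_encoder_layers):
        skips.append(i)  # stands in for the output cached by encoder block i

    pairs = []
    for i in range(num_decoder_layers):
        # LIFO consumption; None means the stack was already empty
        source = skips.pop() if skips else None
        pairs.append((num_encoder_layers + i, source))
    return pairs


print(skip_pairing(8))  # [(4, 3), (5, 2), (6, 1), (7, 0)]
print(skip_pairing(9))  # [(4, 3), (5, 2), (6, 1), (7, 0), (8, None)]
```

For 8 layers, block 4 reads from block 3 and block 7 reads from block 0. For 9 layers, the final decoder block has no skip left to consume.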

### Learned Skip Weights

The model does not simply add the cached activations directly. Instead, it applies a learnable weight vector `skip_weights` to each skip connection:

```python
self.skip_weights = torch.nn.Parameter(torch.zeros(num_decoder_layers, model_dim))
```

This parameterization allows the network to learn how much influence each specific encoder layer should have on its corresponding decoder layer, with initial values of zero ensuring the skip connections do not disrupt training at initialization.
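The `[None, None, :]` indexing in the decoder loop broadcasts one weight row of length `model_dim` across the batch and sequence dimensions, so every channel gets its own scale. A framework-free sketch for a single token's feature vector, with plain lists standing in for tensors (the helper name `add_weighted_skip` is hypothetical):

```python
def add_weighted_skip(x, skip, weight):
    """Per-channel x + weight * skip for one token's feature vector."""
    return [xi + wi * si for xi, wi, si in zip(x, weight, skip)]


x = [1.0, 2.0, 3.0]         # current hidden state (3 channels)
skip = [10.0, 20.0, 30.0]   # cached encoder output for the same token

# All-zero weights leave the hidden state untouched.
print(add_weighted_skip(x, skip, [0.0, 0.0, 0.0]))   # [1.0, 2.0, 3.0]

# Non-zero weights scale each channel of the skip independently.
print(add_weighted_skip(x, skip, [0.5, 0.0, -1.0]))  # [6.0, 2.0, -27.0]
```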

## Residual Mixing and Token Embedding Recurrence

Beyond the encoder-decoder split, each transformer block maintains a connection to the original token embeddings through a learned mixing parameter.

### The resid_mix Parameter

Every `Block` contains a `resid_mix` parameter of shape `[2, dim]` that blends the current hidden state with the initial token embedding `x0`:

```python
mix = self.resid_mix.to(dtype=x.dtype)
x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
```

This mechanism, found at lines 637-640 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), lets each layer learn, per channel, how much to draw from the original embeddings versus the processed representation, creating another form of recurrence through the network depth.
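A minimal sketch of the same blend for one token, again with plain lists in place of tensors. The helper name `resid_mix_step` and the example coefficients are illustrative assumptions, not values taken from the repository.

```python
def resid_mix_step(x, x0, mix):
    """Blend hidden state x with embedding x0 per channel.

    mix has shape [2][dim]: row 0 scales x, row 1 scales x0.
    """
    return [m0 * xi + m1 * x0i for m0, m1, xi, x0i in zip(mix[0], mix[1], x, x0)]


x = [4.0, 4.0]    # current hidden state (2 channels)
x0 = [1.0, 2.0]   # original token embedding

# Row 0 all ones, row 1 all zeros: the block ignores x0.
print(resid_mix_step(x, x0, [[1.0, 1.0], [0.0, 0.0]]))  # [4.0, 4.0]

# Channel 0 averages x and x0; channel 1 keeps x unchanged.
print(resid_mix_step(x, x0, [[0.5, 1.0], [0.5, 0.0]]))  # [2.5, 4.0]
```

The two rows are independent parameters, so the blend is a general per-channel linear combination and not necessarily a convex interpolation.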

## Complete Implementation Example

The following code demonstrates how to instantiate the GPT model with depth recurrence and inspect its configuration:

```python
import torch
from train_gpt import GPT, Hyperparameters

# Configure a model with 8 layers (4 encoder + 4 decoder)
args = Hyperparameters()
model = GPT(
    vocab_size=args.vocab_size,
    num_layers=8,  # split into 4 encoder and 4 decoder layers
    model_dim=args.model_dim,
    num_heads=args.num_heads,
    num_kv_heads=args.num_kv_heads,
    mlp_mult=args.mlp_mult,
    tie_embeddings=args.tie_embeddings,
    tied_embed_init_std=args.tied_embed_init_std,
    logit_softcap=args.logit_softcap,
    rope_base=args.rope_base,
    qk_gain_init=args.qk_gain_init,
)

# Verify the depth recurrence structure
print(f"Encoder layers: {model.num_encoder_layers}")      # → 4
print(f"Decoder layers: {model.num_decoder_layers}")      # → 4
print(f"Skip weights shape: {model.skip_weights.shape}")  # (4, model_dim)

# Forward pass; the skip connections are applied inside the model
batch = torch.randint(0, args.vocab_size, (2, args.train_seq_len))
# The second argument is assumed to be the targets; reusing `batch` is only a smoke test
out = model(batch, batch)
```

## Summary

- **Depth recurrence** in `openai/parameter-golf` splits the transformer into encoder and decoder halves using the integer division `num_layers // 2`, with the decoder taking any leftover layer.
- **Skip connections** store encoder outputs in a `skips` list and consume them in reverse order during the decoder phase.
- **Learned parameters** (`skip_weights` and `resid_mix`) control the strength of connections between encoder-decoder pairs and between current states and original embeddings.
- **Implementation** resides primarily in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py): the layer split at lines 70-73, skip storage and consumption in the forward pass of the `GPT` class, and residual mixing at lines 637-640.
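As a rough size check on the learned parameters listed above, the sketch below counts the elements implied by the shapes given in this article. The helper name `recurrence_param_count` and the value `model_dim=512` are illustrative assumptions, and the count assumes each block's `dim` equals `model_dim`.

```python
def recurrence_param_count(num_layers: int, model_dim: int) -> dict[str, int]:
    """Count skip_weights and resid_mix elements from their stated shapes."""
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers
    return {
        # skip_weights has shape (num_decoder_layers, model_dim)
        "skip_weights": num_decoder_layers * model_dim,
        # every block holds one resid_mix tensor of shape [2, dim]
        "resid_mix": num_layers * 2 * model_dim,
    }


print(recurrence_param_count(num_layers=8, model_dim=512))
# {'skip_weights': 2048, 'resid_mix': 8192}
```

Both totals grow linearly in `model_dim`, whereas the attention and MLP weight matrices in each block grow quadratically.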

## Frequently Asked Questions

### What is the purpose of splitting layers into encoder and decoder?

The split lets the model reuse intermediate representations from early layers to inform deeper computation while adding only the small `skip_weights` tensor. By treating the first half as an encoder that saves "memories" and the second half as a decoder that recalls them, the architecture gets a form of depth recurrence intended to improve representation quality while staying parameter-efficient.

### How does the reverse-order skip mechanism work?

The mechanism pops cached activations from the end of the `skips` list as the decoder progresses. Because the list was filled sequentially during encoding (layer 0, then 1, then 2...), popping removes items in reverse order (last encoder layer first). This wiring connects the deepest encoder layer to the shallowest decoder layer, creating a symmetric U-Net-like information flow through the transformer stack.

### What role does resid_mix play in depth recurrence?

The `resid_mix` parameter provides a learned mechanism to blend the current hidden state with the original token embeddings at every layer. This creates a form of depth recurrence by ensuring that information from the input tokens can directly influence deep layers without being filtered through many intermediate transformations. The `[2, dim]` shape gives each layer two per-channel coefficient vectors, one for the current state and one for the initial embedding; nothing in the mixing code constrains them to sum to one.

### Where is the depth recurrence logic located in the repository?

The core implementation resides in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py). The layer split logic appears at lines 70-73 in the `GPT` class, skip storage and decoder skip consumption happen in the two loops of the `GPT` forward pass, and the residual mixing occurs at lines 637-640 in `Block`. An alternative implementation for the MLX backend is available in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py), which mirrors the same recurrence patterns.