How Skip Weights Enable Depth Recurrence in Parameter-Golf's GPT Architecture

Skip weights enable depth recurrence by acting as learned per-connection vectors, one scale factor per hidden dimension, that control how much of each stored encoder activation is re-injected into the decoder stream. This lets the second half of the network revisit representations computed at earlier depths.

In the openai/parameter-golf repository, depth recurrence is implemented through a unique encoder-decoder split within a single transformer stack. This mechanism allows the model to store intermediate activations from the first half of the network and reuse them in the second half, creating a recurrent pathway through depth. The learned skip weights are central to this architecture, determining how much past information to preserve versus overwrite as the network processes deeper layers.

Understanding Depth Recurrence in Transformer Architectures

Traditional transformer models process tokens through a linear stack of blocks, where each layer feeds sequentially into the next. Depth recurrence breaks this linearity by allowing the network to revisit earlier computation states after processing intermediate layers.

In the Parameter-Golf implementation, the architecture splits the total layer count into two halves:

  • Encoder half: Computes initial representations and stores them
  • Decoder half: Reprocesses representations while accessing the stored encoder states

This creates a U-shaped information flow where the model can "look back" at its own intermediate computations, effectively making the architecture recurrent along the depth dimension rather than the sequence dimension.
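The pairing produced by this U-shaped flow can be sketched in plain Python. This is an illustration only: skip_pairs is a hypothetical helper, and the split rule (the encoder takes num_layers // 2 blocks, the decoder takes the rest) is an assumption rather than something confirmed by the repository.

```python
def skip_pairs(num_layers: int) -> list[tuple[int, int]]:
    """Return (decoder_block_index, encoder_block_index) pairs joined by a skip.

    Assumption for illustration: the encoder takes num_layers // 2 blocks and
    the decoder takes the remainder.
    """
    num_encoder_layers = num_layers // 2
    num_decoder_layers = num_layers - num_encoder_layers

    # The encoder pushes one entry per block, shallowest first.
    skips = list(range(num_encoder_layers))

    pairs = []
    for i in range(num_decoder_layers):
        if skips:
            # pop() returns the most recently stored (deepest) encoder output.
            pairs.append((num_encoder_layers + i, skips.pop()))
    return pairs


if __name__ == "__main__":
    for decoder_idx, encoder_idx in skip_pairs(6):
        print(f"decoder block {decoder_idx} receives encoder block {encoder_idx}")
```

With six blocks, the first decoder block is paired with the deepest encoder block and the last decoder block with the shallowest one, which is the "U" shape described above.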

The Role of Skip Weights in Depth Recurrence

Encoder Half: Storing Intermediate Activations

The encoder portion of the forward pass runs through the first half of the transformer blocks and stores each layer's output in a list called skips. This occurs in train_gpt.py lines 1006-1010:

skips: list[Tensor] = []
for i in range(self.num_encoder_layers):
    x = self.blocks[i](x, x0)      # run block
    skips.append(x)                # remember output

Each append operation captures the activation tensor at that specific depth, creating a stack of intermediate representations that range from shallow (early layers) to deep (later encoder layers).

Decoder Half: Reusing Activations with Learned Scalars

The decoder half retrieves these stored activations in reverse order (last encoder layer first) and blends them into the current processing stream using the learned skip weights. This happens in train_gpt.py lines 1011-1014:

for i in range(self.num_decoder_layers):
    if skips:
        x = x + self.skip_weights[i].to(dtype=x.dtype)[None,None,:] * skips.pop()
    x = self.blocks[self.num_encoder_layers + i](x, x0)

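The operation skips.pop() retrieves the most recently stored encoder activation, and skip_weights[i] scales it element-wise along the hidden dimension before it is added to the current decoder stream x. The if skips guard means that a decoder layer with no stored activation left simply runs without a skip input.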
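The two loops can be tried end to end with a small stand-in model. The sketch below is not the repository's GPT class: ToyBlock and ToySkipStack are hypothetical names, the block is a residual linear layer rather than a transformer block, the x0 argument is omitted, and the num_layers // 2 split is an assumption.

```python
import torch
from torch import Tensor, nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block: a residual linear layer."""

    def __init__(self, model_dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(model_dim, model_dim)

    def forward(self, x: Tensor) -> Tensor:
        return x + self.proj(x)


class ToySkipStack(nn.Module):
    """Encoder/decoder split with learned per-dimension skip weights."""

    def __init__(self, num_layers: int, model_dim: int) -> None:
        super().__init__()
        # Assumed split: first half encoder, remainder decoder.
        self.num_encoder_layers = num_layers // 2
        self.num_decoder_layers = num_layers - self.num_encoder_layers
        self.blocks = nn.ModuleList([ToyBlock(model_dim) for _ in range(num_layers)])
        num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
        self.skip_weights = nn.Parameter(
            torch.ones(num_skip_weights, model_dim, dtype=torch.float32)
        )

    def forward(self, x: Tensor) -> Tensor:
        skips: list[Tensor] = []
        # Encoder half: run each block and remember its output.
        for i in range(self.num_encoder_layers):
            x = self.blocks[i](x)
            skips.append(x)
        # Decoder half: add the scaled encoder output, deepest first, then run the block.
        for i in range(self.num_decoder_layers):
            if skips:
                w = self.skip_weights[i].to(dtype=x.dtype)
                x = x + w[None, None, :] * skips.pop()
            x = self.blocks[self.num_encoder_layers + i](x)
        return x


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToySkipStack(num_layers=4, model_dim=8)
    out = model(torch.randn(2, 5, 8))  # (batch, sequence, model_dim)
    out.sum().backward()
    print(out.shape, model.skip_weights.grad.shape)
```

Calling backward() populates skip_weights.grad, which shows that the scale factors receive gradients like any other parameter in this toy setup.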

Skip Weight Parameter Definition

The skip weights themselves are defined as learnable parameters in train_gpt.py lines 672-674:

self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers)
self.skip_weights = nn.Parameter(
    torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)
)

Key characteristics of this parameterization:

  • Shape: (num_skip_weights, model_dim) - one weight vector for each possible skip connection
  • Initialization: All ones, so at the start of training every skip is added at unit gain, exactly like a plain additive skip connection
  • Broadcasting: During forward pass, the weight vector broadcasts over batch and sequence dimensions via [None,None,:]
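The broadcasting step can be checked in isolation. The sizes and the bfloat16 activation dtype below are illustrative assumptions, not repository defaults; the snippet only mirrors the cast and the [None, None, :] indexing shown in the decoder loop.

```python
import torch

batch, seq_len, model_dim = 2, 4, 8  # illustrative sizes only

# One row of skip_weights, stored in float32 and initialized to ones.
skip_weight = torch.ones(model_dim, dtype=torch.float32)

# A stored encoder activation in a lower-precision dtype (assumed here).
skip = torch.randn(batch, seq_len, model_dim).to(torch.bfloat16)

# Cast to the activation dtype, then add two leading axes:
# (model_dim,) -> (1, 1, model_dim)
w = skip_weight.to(dtype=skip.dtype)[None, None, :]

# Broadcasting stretches the leading axes to (batch, seq_len, model_dim).
scaled = w * skip

print(w.shape, scaled.shape)
# With all-ones weights the scaled activation is unchanged.
print(torch.equal(scaled, skip))
```

The same weight vector is applied at every batch element and every sequence position, so the learned scaling depends only on the hidden dimension and on which skip connection is being used.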

Implementation Details in train_gpt.py

The complete mechanism spans several sections of the main training script. The forward pass logic resides in the GPT class definition, specifically within the forward method implementation.

When inspecting the model after initialization, you can verify the skip weight configuration:

from train_gpt import GPT, Hyperparameters

# Use the same hyper-parameters as the training script
hp = Hyperparameters()
model = GPT(
    vocab_size=hp.vocab_size,
    num_layers=hp.num_layers,
    model_dim=hp.model_dim,
    num_heads=hp.num_heads,
    num_kv_heads=hp.num_kv_heads,
    mlp_mult=hp.mlp_mult,
    tie_embeddings=hp.tie_embeddings,
    tied_embed_init_std=hp.tied_embed_init_std,
    logit_softcap=hp.logit_softcap,
    rope_base=hp.rope_base,
    qk_gain_init=hp.qk_gain_init,
)

print("Number of skip-weights:", model.skip_weights.shape[0])
print("Shape of each weight vector:", model.skip_weights.shape[1])
print("First skip-weight vector (first 5 elements):", model.skip_weights[0, :5])

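These prints report the two dimensions of skip_weights, (num_skip_weights, model_dim), followed by the first five entries of the first weight vector, which are all ones before training. This confirms that each skip connection maintains a full vector of scaling factors across the hidden dimension.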

Why Skip Weights Create Depth Recurrence

Skip weights enable depth recurrence by transforming the static U-Net style skip connections into dynamic, learnable pathways. Without these weights, the architecture would simply add encoder activations to decoder layers with fixed unity gain, providing residual connections but not true recurrence.

With learned skip weights:

  • Selective memory: The model learns which depth levels are worth preserving and which should be attenuated via near-zero weights
  • Adaptive blending: Each decoder layer receives a differently weighted mixture of its current state and the historical encoder state
  • Gradient flow: The skip path gives gradients a direct route from decoder layers back to encoder representations, scaled by the learned per-dimension weights, encouraging the encoder to produce useful "memories" for later retrieval

The recurrence emerges because the decoder effectively "revisits" each encoder depth level in reverse order, carrying forward accumulated processing while modulating how much of the past to retain at each step.

Training Considerations

Skip weights are trained by gradient-based optimization together with the rest of the model. In train_gpt.py around line 502, the optimizer configuration places skip weights in the set of scalar parameters:

if k == "skip_weights" or (k.startswith("blocks.") and ...):
    scalar_params.append(p)

Grouping parameters this way lets the optimizer apply a different learning rate or weight-decay setting to these small parameters than to the large weight matrices. Because skip weights are one scale factor per hidden dimension per skip connection, they add negligible parameter overhead (exactly min(encoder_layers, decoder_layers) * model_dim parameters) while giving the model fine-grained control over each skip path.
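A minimal sketch of name-based parameter grouping is shown here. It is not the repository's optimizer setup: TinyModel is hypothetical, the routing rule covers only the skip_weights name (the rest of the real condition is elided in the excerpt above and is not reproduced), and the optimizer choice and learning rates are placeholder assumptions.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical model with one weight matrix and one skip_weights parameter."""

    def __init__(self, model_dim: int = 8, num_skip_weights: int = 2) -> None:
        super().__init__()
        self.proj = nn.Linear(model_dim, model_dim, bias=False)
        self.skip_weights = nn.Parameter(torch.ones(num_skip_weights, model_dim))


model = TinyModel()

scalar_params, matrix_params = [], []
for name, p in model.named_parameters():
    # Assumed rule for this sketch: route skip_weights by name,
    # send everything else to the matrix group.
    if name == "skip_weights":
        scalar_params.append(p)
    else:
        matrix_params.append(p)

# Placeholder learning rates; each group can be tuned independently.
optimizer = torch.optim.Adam(
    [
        {"params": scalar_params, "lr": 1e-2},
        {"params": matrix_params, "lr": 1e-3},
    ]
)

for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"])
```

Matching on the name matters because skip_weights is two-dimensional, so a rule based only on tensor rank would treat it like a weight matrix.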

Summary

  • Skip weights are learned per-layer vectors that scale encoder activations before they are added to decoder layers in the Parameter-Golf GPT architecture.
  • Depth recurrence is achieved by splitting the transformer into encoder and decoder halves, storing intermediate encoder states (skips), and re-injecting them into the decoder in reverse order.
  • The mechanism is implemented in train_gpt.py lines 672-674 (parameter definition), 1006-1010 (encoder storage), and 1011-1014 (decoder retrieval).
  • Skip weights allow the model to dynamically control information flow across depth levels, creating a recurrent pathway through the network's layer stack.

Frequently Asked Questions

What is the difference between skip weights and standard residual connections?

Standard residual connections add a layer's input to its output with a fixed weight of 1.0, helping gradients flow through deep networks. Skip weights in Parameter-Golf are learned scaling factors that modulate how much of an encoder layer's activation is added to a decoder layer, allowing the model to learn which depth levels to preserve and which to attenuate.

How many parameters do skip weights add to the model?

Skip weights add min(num_encoder_layers, num_decoder_layers) * model_dim parameters to the model. For a model with 12 encoder layers, 12 decoder layers, and a model dimension of 768, this adds only 12 * 768 = 9,216 parameters, which is negligible compared to the millions of parameters in the transformer blocks themselves.
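The arithmetic can be reproduced with a short helper. skip_weight_param_count is a hypothetical function written for this example, and the layer counts are the illustrative figures from the answer above rather than repository defaults.

```python
def skip_weight_param_count(
    num_encoder_layers: int, num_decoder_layers: int, model_dim: int
) -> int:
    """Number of learnable values in a (num_skip_weights, model_dim) tensor."""
    num_skip_weights = min(num_encoder_layers, num_decoder_layers)
    return num_skip_weights * model_dim


if __name__ == "__main__":
    # 12 encoder layers, 12 decoder layers, model dimension 768
    print(skip_weight_param_count(12, 12, 768))  # 9216
```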

Can skip weights be disabled during inference?

Skip weights are fixed parameters after training, not gating mechanisms that can be dynamically disabled. However, if a skip weight vector converges to near-zero values during training, that specific skip connection effectively becomes disabled, allowing the model to learn which depth recurrence pathways are actually useful for the task.

Where are skip weights initialized in the codebase?

Skip weights are initialized in the GPT class constructor in train_gpt.py at lines 672-674. They are created as nn.Parameter objects initialized to ones (torch.ones), providing a neutral starting point where each skip connection initially passes through its full activation value before the network learns optimal scaling during training.
