LeakyReLU Squared Activation Function Implementation in OpenAI Parameter-Golf
The LeakyReLU squared activation function in the parameter-golf repository is implemented as an inline two-step operation: applying F.leaky_relu with a negative slope of 0.5 followed by an element-wise .square() operation.
The OpenAI parameter-golf repository explores parameter-efficient neural architectures through constrained training experiments. The LeakyReLU squared activation function appears consistently across multiple model configurations, providing a smooth, non-negative quadratic output while preserving gradient flow for negative inputs.
Implementation Pattern
The implementation is not encapsulated as a dedicated layer module. Instead, it is built inline where the MLP’s linear projection is applied, following a strict two-stage pipeline:
- Leaky ReLU: Applied with negative_slope=0.5 to preserve gradient information for negative values
- Element-wise squaring: The .square() operation ensures non-negative outputs and adds quadratic curvature
In train_gpt_decode.py at line 461, the activation appears as:
return self.proj(F.leaky_relu(self.fc(x), negative_slope=0.5).square())
This same pattern repeats across multiple experiment files in the records/track_10min_16mb/ directory:
- train_gpt_human.py at line 429 for the GPTQ-Embeddings experiment
- train_gpt.py at line 545 for the Vocab-4096 MLP-mult 4 configuration
- train_gpt.py at line 331 for the Mini-Depth Recurrence model (using a configurable self.neg_slope parameter)
Technical Breakdown
Mathematical Formulation
The LeakyReLU squared activation function computes:
f(x) = max(0.5 * x, x)^2
Where:
- For positive inputs: f(x) = x^2 (standard quadratic)
- For negative inputs: f(x) = (0.5 * x)^2 = 0.25 * x^2 (attenuated quadratic)
This formulation ensures non-negative outputs and a continuous derivative everywhere: f'(x) = 2 * x for x >= 0 and f'(x) = 0.5 * x for x < 0, and both branches vanish at x = 0. Standard ReLU, by contrast, yields zero gradients for all negative inputs.
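The piecewise formula can be checked numerically without PyTorch. The following is a minimal sketch in plain Python; the helper names `leaky_relu_squared`, `leaky_relu_squared_grad`, and `finite_difference` are illustrative and do not appear in the repository.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Scalar version of F.leaky_relu(x, negative_slope).square()."""
    leaky = x if x >= 0 else negative_slope * x
    return leaky * leaky


def leaky_relu_squared_grad(x: float, negative_slope: float = 0.5) -> float:
    """Analytic derivative: 2x for x >= 0, 2 * slope^2 * x for x < 0."""
    return 2.0 * x if x >= 0 else 2.0 * negative_slope ** 2 * x


def finite_difference(f, x: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Compare the analytic derivative with the numeric estimate on both branches.
for x in (-2.0, -0.5, 0.5, 2.0):
    analytic = leaky_relu_squared_grad(x)
    numeric = finite_difference(leaky_relu_squared, x)
    assert abs(analytic - numeric) < 1e-4
    print(f"x={x:+.1f}  f(x)={leaky_relu_squared(x):.4f}  f'(x)={analytic:+.4f}")
```

With the default slope of 0.5, an input of -2.0 maps to 1.0 with derivative -1.0, while +2.0 maps to 4.0 with derivative +4.0.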
Reusable Module Implementation
While the repository uses inline composition for brevity, the pattern can be encapsulated for reuse:
import torch
import torch.nn as nn
import torch.nn.functional as F


class LeakyReLUSquared(nn.Module):
    """Leaky ReLU with slope 0.5, followed by element-wise square."""

    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x):
        return F.leaky_relu(x, negative_slope=self.negative_slope).square()


# Example usage inside an MLP block
class MLPBlock(nn.Module):
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.proj = nn.Linear(dim_out, dim_out)
        self.act = LeakyReLUSquared()

    def forward(self, x):
        x = self.fc(x)
        x = self.act(x)
        return self.proj(x)
Running the example:
x = torch.randn(4, 8)
block = MLPBlock(8, 32)
y = block(x)
print(y.shape) # torch.Size([4, 32])
Summary
- The LeakyReLU squared activation function in parameter-golf is implemented as F.leaky_relu(..., negative_slope=0.5).square() inline within MLP blocks
- Key source locations include train_gpt_decode.py (line 461), train_gpt_human.py (line 429), and train_gpt.py (lines 331 and 545)
- The activation uses a negative slope of 0.5, so the gradient for a negative input is 25% of the gradient for a positive input of the same magnitude
- No dedicated layer module exists; the function is composed directly from PyTorch operations for brevity (the activation has no learnable parameters, so wrapping it in a module would not change the parameter count)
Frequently Asked Questions
What is the negative slope value used in the LeakyReLU squared implementation?
The implementation uses a negative slope of 0.5 for the LeakyReLU component. This value appears consistently across all experiment files including the Hessian SD-Clip, GPTQ-Embeddings, and Vocab-4096 configurations. The Mini-Depth Recurrence experiment at line 331 of train_gpt.py uses a configurable self.neg_slope parameter, but the standard value remains 0.5.
Why square the output of LeakyReLU instead of using standard ReLU?
Squaring the LeakyReLU output creates a smooth, non-negative quadratic activation that provides non-linear curvature for all inputs while preserving gradients. Unlike standard ReLU, which yields zero gradients for negative inputs, the LeakyReLU squared formulation keeps the gradient for a negative input at 25% (0.5 squared) of the gradient for a positive input of the same magnitude. This approach also guarantees non-negative outputs without requiring absolute value operations.
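The difference in gradient flow can be observed with autograd. The sketch below assumes PyTorch is installed; `input_gradients` is a hypothetical helper written for this illustration, and the ReLU squared variant is included only as a comparison point, not as something taken from the repository.

```python
import torch
import torch.nn.functional as F


def input_gradients(activation, values):
    """Return d(sum(activation(x)))/dx for each input value, as a list."""
    x = torch.tensor(values, requires_grad=True)
    activation(x).sum().backward()
    return x.grad.tolist()


negative_inputs = [-3.0, -1.0]

# ReLU squared: the ReLU zeroes negative inputs, so no gradient reaches them.
relu_sq = input_gradients(lambda t: F.relu(t).square(), negative_inputs)

# LeakyReLU squared with slope 0.5: the gradient is 0.5 * x for negative x.
leaky_sq = input_gradients(
    lambda t: F.leaky_relu(t, negative_slope=0.5).square(), negative_inputs
)

print(relu_sq)   # [0.0, 0.0]
print(leaky_sq)  # [-1.5, -0.5]
```

The zero gradients in the first list are what the leaky variant avoids: units whose pre-activation is negative still receive a learning signal.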
Where can I find the exact implementation in the source code?
The exact implementation appears in multiple training scripts within the records/track_10min_16mb/ directory. The most prominent occurrence is in train_gpt_decode.py at line 461, where the activation appears as F.leaky_relu(self.fc(x), negative_slope=0.5).square(). The same pattern appears in train_gpt_human.py at line 429 and in train_gpt.py at lines 331 and 545 across different experimental configurations; the occurrence at line 331 takes its slope from self.neg_slope rather than a literal 0.5.
Can I use a different negative slope value with this activation function?
Yes, the implementation supports configurable negative slopes. While the standard experiments use 0.5, the Mini-Depth Recurrence experiment in train_gpt.py at line 331 demonstrates a configurable implementation using self.neg_slope. When modifying the slope, remember that the final gradient scaling for negative inputs will be the square of your chosen slope value (e.g., slope 0.3 yields 0.09 gradient scaling).
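The slope-squared relationship can be confirmed with autograd. This is a minimal sketch assuming PyTorch is installed; `negative_gradient_ratio` is a hypothetical helper for illustration, not a function from the repository.

```python
import torch
import torch.nn.functional as F


def negative_gradient_ratio(slope: float, magnitude: float = 2.0) -> float:
    """Ratio |grad at -magnitude| / |grad at +magnitude| for LeakyReLU squared."""
    x = torch.tensor([-magnitude, magnitude], requires_grad=True)
    F.leaky_relu(x, negative_slope=slope).square().sum().backward()
    negative_grad, positive_grad = x.grad
    return (negative_grad.abs() / positive_grad.abs()).item()


# The ratio should track slope ** 2 up to float32 rounding.
for slope in (0.5, 0.3, 0.1):
    ratio = negative_gradient_ratio(slope)
    print(f"slope={slope}  ratio={ratio:.4f}  slope^2={slope ** 2:.4f}")
```

The ratio does not depend on the chosen magnitude, because both branches of the derivative are linear in x.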