# LeakyReLU Squared Activation Function Implementation in OpenAI Parameter-Golf

> Discover how the LeakyReLU squared activation function is implemented in OpenAI's parameter-golf repository. Learn the efficient two-step inline operation.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: internals
- Published: 2026-04-17

---

**The LeakyReLU squared activation function in the parameter-golf repository is implemented as an inline two-step operation: applying `F.leaky_relu` with a negative slope of 0.5 followed by an element-wise `.square()` operation.**

The OpenAI parameter-golf repository explores parameter-efficient neural architectures through constrained training experiments. The **LeakyReLU squared activation function** appears consistently across multiple model configurations, providing a smooth, non-negative quadratic output while preserving gradient flow for negative inputs.

## Implementation Pattern

The implementation is not encapsulated as a dedicated layer module. Instead, it is built inline where the MLP’s linear projection is applied, following a strict two-stage pipeline:

1. **Leaky ReLU**: Applied with `negative_slope=0.5` to preserve gradient information for negative values
2. **Element-wise squaring**: The `.square()` operation ensures non-negative outputs and adds quadratic curvature

In [`train_gpt_decode.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_decode.py) at line 461, the activation appears as:

```python
return self.proj(F.leaky_relu(self.fc(x), negative_slope=0.5).square())
```

This same pattern repeats across multiple experiment files in the `records/track_10min_16mb/` directory:
- [`train_gpt_human.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_human.py) at line 429 for the GPTQ-Embeddings experiment
- [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at line 545 for the Vocab-4096 MLP-mult 4 configuration
- [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at line 331 for the Mini-Depth Recurrence model (using a configurable `self.neg_slope` parameter)

## Technical Breakdown

### Mathematical Formulation

The LeakyReLU squared activation function computes:

```
f(x) = max(0.5 * x, x)^2
```

Where:
- For positive inputs: `f(x) = x^2` (standard quadratic)
- For negative inputs: `f(x) = (0.5 * x)^2 = 0.25 * x^2` (attenuated quadratic)

This formulation ensures **non-negative outputs** and a **continuous gradient** that is nonzero for every input except `x = 0`, unlike standard ReLU, whose gradient is zero for all negative inputs.
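The formula can be checked on a few scalars without any framework. The following is a minimal sketch in plain Python. The helper names `leaky_relu`, `leaky_relu_squared`, and `max_form` are illustrative and do not come from the repository. The `max` form matches leaky ReLU only when the slope lies between 0 and 1, which holds for 0.5.

```python
def leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    """Scalar leaky ReLU: identity for x >= 0, scaled by negative_slope otherwise."""
    return x if x >= 0 else negative_slope * x


def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Two-step activation: leaky ReLU first, then square the result."""
    return leaky_relu(x, negative_slope) ** 2


def max_form(x: float) -> float:
    """Closed form from the text, valid for a slope of 0.5."""
    return max(0.5 * x, x) ** 2


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
values = [leaky_relu_squared(x) for x in samples]
print(values)  # [1.0, 0.0625, 0.0, 0.25, 4.0]

# The two-step pipeline and the max form agree on every sample.
print(all(leaky_relu_squared(x) == max_form(x) for x in samples))  # True
```

Negative inputs land at a quarter of the value a positive input of the same magnitude would produce, matching the `0.25 * x^2` branch above.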

### Reusable Module Implementation

While the repository uses inline composition for brevity, the pattern can be encapsulated for reuse:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquared(nn.Module):
    """Leaky ReLU (default negative slope 0.5) followed by element-wise square."""
    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x):
        return F.leaky_relu(x, negative_slope=self.negative_slope).square()

# Example usage inside an MLP block

class MLPBlock(nn.Module):
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.proj = nn.Linear(dim_out, dim_out)
        self.act = LeakyReLUSquared()

    def forward(self, x):
        x = self.fc(x)
        x = self.act(x)
        return self.proj(x)
```

Running the example:

```python
x = torch.randn(4, 8)
block = MLPBlock(8, 32)
y = block(x)
print(y.shape)  # torch.Size([4, 32])
```

## Summary

- The **LeakyReLU squared activation function** in parameter-golf is implemented as `F.leaky_relu(..., negative_slope=0.5).square()` inline within MLP blocks
- Key source locations include [`train_gpt_decode.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_decode.py) (line 461), [`train_gpt_human.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_human.py) (line 429), and [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) (lines 331 and 545)
- The activation uses a **negative slope of 0.5**, so after squaring, a negative input yields 25% of the output and gradient magnitude of a positive input of the same size
- No dedicated layer module exists; the function is composed directly from PyTorch operations in the MLP forward pass, which keeps the code compact

## Frequently Asked Questions

### What is the negative slope value used in the LeakyReLU squared implementation?

The implementation uses a **negative slope of 0.5** for the LeakyReLU component. This value is hard-coded in the Hessian SD-Clip, GPTQ-Embeddings, and Vocab-4096 experiment files. The Mini-Depth Recurrence experiment at line 331 of [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) is the exception: it reads the slope from a configurable `self.neg_slope` parameter, but the standard value remains 0.5.

### Why square the output of LeakyReLU instead of using standard ReLU?

Squaring the LeakyReLU output creates a **smooth, non-negative quadratic activation** that provides non-linear curvature for all inputs while preserving gradients. Standard ReLU yields zero gradients for negative inputs. LeakyReLU squared instead keeps a gradient on the negative side at 25% (0.5 squared) of the magnitude a positive input of the same size would receive. Squaring also guarantees non-negative outputs without an absolute value operation.
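The gradient behaviour described here can be made concrete with a small plain-Python sketch. It writes out the derivative of the activation by hand, cross-checks it against a central finite difference, and compares it with the ReLU gradient. All function names are hypothetical helpers for illustration, not code from the repository.

```python
def leaky_relu_squared(x: float, s: float = 0.5) -> float:
    """Leaky ReLU with slope s, then square."""
    y = x if x >= 0 else s * x
    return y * y


def leaky_relu_squared_grad(x: float, s: float = 0.5) -> float:
    """Hand-derived derivative: 2x for x >= 0, 2 * s^2 * x for x < 0."""
    return 2.0 * x if x >= 0 else 2.0 * s * s * x


def relu_grad(x: float) -> float:
    """Derivative of standard ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)


points = [-3.0, -1.0, 1.0, 3.0]
analytic = [leaky_relu_squared_grad(x) for x in points]
print(analytic)  # [-1.5, -0.5, 2.0, 6.0]

# The hand-derived gradient agrees with the numerical estimate.
for x, g in zip(points, analytic):
    assert abs(central_difference(leaky_relu_squared, x) - g) < 1e-4

# ReLU passes no gradient for negative inputs; the squared variant does.
print(relu_grad(-1.0), leaky_relu_squared_grad(-1.0))  # 0.0 -0.5
```

Comparing `-0.5` at `x = -1` with `2.0` at `x = 1` shows the 25% ratio. The gradient still shrinks toward zero as the input approaches 0 from either side.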

### Where can I find the exact implementation in the source code?

The exact implementation appears in multiple training scripts within the `records/track_10min_16mb/` directory. The most prominent occurrence is in [`train_gpt_decode.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_decode.py) at **line 461**, where the activation appears as `F.leaky_relu(self.fc(x), negative_slope=0.5).square()`. The same pattern appears in [`train_gpt_human.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_human.py) at line 429 and in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at lines 331 and 545 across different experimental configurations, with the line 331 variant reading the slope from `self.neg_slope` instead of hard-coding it.

### Can I use a different negative slope value with this activation function?

Yes, the implementation supports configurable negative slopes. While the standard experiments use **0.5**, the Mini-Depth Recurrence experiment in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at line 331 demonstrates a configurable implementation using `self.neg_slope`. When modifying the slope, remember that the output and gradient for a negative input, relative to a positive input of the same magnitude, scale with the square of your chosen slope (e.g., slope 0.3 yields a scaling of 0.09).
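To see how the choice of slope affects the negative side, the following plain-Python sketch tabulates the scaling for a few candidate slopes. The function name `negative_side_scale` and the slopes listed are illustrative assumptions, not settings taken from the repository.

```python
def leaky_relu_squared(x: float, negative_slope: float) -> float:
    """Leaky ReLU with the given slope, then square."""
    y = x if x >= 0 else negative_slope * x
    return y * y


def negative_side_scale(negative_slope: float, x: float = 2.0) -> float:
    """Ratio f(-x) / f(x): how strongly negative inputs are attenuated."""
    return leaky_relu_squared(-x, negative_slope) / leaky_relu_squared(x, negative_slope)


for slope in (0.1, 0.3, 0.5, 1.0):
    print(f"slope={slope:.1f} -> scale={negative_side_scale(slope):.2f}")
# slope=0.1 -> scale=0.01
# slope=0.3 -> scale=0.09
# slope=0.5 -> scale=0.25
# slope=1.0 -> scale=1.00
```

At the extremes, a slope of 1.0 reduces the activation to a plain `x^2`, and a slope of 0 reduces it to a squared ReLU with no signal on the negative side.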