# Logit Softcap Transformation in OpenAI Parameter-Golf: PyTorch and MLX Implementation

> How the logit softcap transformation in openai/parameter-golf bounds logits in the PyTorch and MLX training scripts to reduce training instability while preserving the ranking of predictions.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: internals
- Published: 2026-04-17

---

**The logit softcap transformation in openai/parameter-golf bounds extreme logits using the formula `logit_softcap * tanh(raw_logits / logit_softcap)` before computing cross-entropy loss, reducing training instability while preserving the relative ordering of predictions.**

The **logit softcap transformation** is a stabilization technique implemented in the **openai/parameter-golf** repository to keep extreme logit values from destabilizing language model training. It rescales raw logits with a hyperbolic tangent, capping their magnitude while leaving their ranking unchanged. Both the PyTorch and MLX training scripts apply it.

## What Is the Logit Softcap Transformation?

The logit softcap transformation applies a **tanh-based rescaling** to raw logits before they are passed to the loss function. The mathematical operation follows this exact formula:

```
softcapped_logits = logit_softcap * tanh(raw_logits / logit_softcap)
```

The **tanh** function bounds its output to the open interval (-1, 1). Multiplying that output by the `logit_softcap` hyperparameter (default **30.0**) keeps every logit within ±30. Because tanh is strictly increasing, the ranking of the logits is unchanged, and because tanh(z) ≈ z for small z, logits much smaller in magnitude than the cap pass through almost unmodified. Only large values are compressed, which keeps extreme logits from producing very large losses and gradients.
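The behavior of the formula can be checked with a small framework-free sketch. It uses only Python's `math` module rather than the repository's code, and the function name `softcap` and the sample values are chosen here purely for illustration.

```python
import math

def softcap(x: float, cap: float = 30.0) -> float:
    """Smoothly bound x to the interval (-cap, cap) using tanh."""
    return cap * math.tanh(x / cap)

# Small inputs are nearly unchanged; large inputs approach +/- cap.
for raw in (0.5, 5.0, 30.0, 100.0, -100.0):
    print(f"{raw:8.1f} -> {softcap(raw):8.3f}")
```

Inputs far below the cap come out almost identical to what went in, while an input of 100 is compressed to just under 30.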

## How the Logit Softcap Transformation Works in Parameter-Golf

The implementation varies slightly between the PyTorch and MLX versions of the codebase, but both follow the identical mathematical formula.

### PyTorch Implementation in train_gpt.py

In the PyTorch implementation, the softcap is applied inline during the forward pass immediately after computing the projection logits. In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) around lines 723-724, the code implements the transformation as:

```python
logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
```

Here, `logits_proj` represents the raw matrix product of the model's final hidden states and the token embedding matrix. The `self.logit_softcap` attribute is initialized from the hyperparameters and validated to ensure it is a positive float.
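To see what capping before the loss does, the following is a minimal sketch of the softcap-then-cross-entropy step in plain Python. It is not the repository's code: the helpers `softcap` and `cross_entropy` and the three-token example are assumptions made for illustration.

```python
import math

def softcap(x, cap=30.0):
    return cap * math.tanh(x / cap)

def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# Hypothetical confidently wrong prediction over a 3-token vocabulary.
raw_logits = [100.0, 0.0, 0.0]
target = 1

raw_loss = cross_entropy(raw_logits, target)
capped_loss = cross_entropy([softcap(v) for v in raw_logits], target)
print(f"raw loss: {raw_loss:.2f}, softcapped loss: {capped_loss:.2f}")
```

Because every capped logit lies within ±cap, the per-token loss cannot exceed roughly `2 * cap + log(vocab_size)`, whereas the uncapped loss grows without bound as the logits grow.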

### MLX Implementation in train_gpt_mlx.py

The MLX version defines the softcap as a dedicated method within the `GPT` class. In [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) at lines 14-16, the `softcap` method is implemented as:

```python
def softcap(self, x):
    return self.logit_softcap * mx.tanh(x / self.logit_softcap)
```

This method is then invoked during the forward pass where the raw logits are computed. The separation into a dedicated method improves code readability and allows for easier modification or removal of the transformation during experimentation.

## Configuring the Logit Softcap Value

The `logit_softcap` value is configurable via environment variables or command-line arguments, making it a model-level hyperparameter that can be tuned per experiment without architectural changes.

**Configuration sources:**
- **Environment variable:** `LOGIT_SOFTCAP` (e.g., `export LOGIT_SOFTCAP=25.0`)
- **Command-line argument:** `--logit-softcap` (validated to be positive)

**Validation:**
The implementation includes explicit validation to prevent invalid configurations:

```python
if logit_softcap <= 0:
    raise ValueError("logit_softcap must be positive")
```
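As a rough sketch of how an environment-driven setting with this validation could be wired together, the helper below parses `LOGIT_SOFTCAP` and applies the same check. The function `read_logit_softcap` and the constant `DEFAULT_LOGIT_SOFTCAP` are hypothetical names used for illustration and are not taken from the repository.

```python
import os

DEFAULT_LOGIT_SOFTCAP = 30.0  # default value stated in this article

def read_logit_softcap(env=None) -> float:
    """Hypothetical helper: parse LOGIT_SOFTCAP and reject non-positive values."""
    env = os.environ if env is None else env
    value = float(env.get("LOGIT_SOFTCAP", DEFAULT_LOGIT_SOFTCAP))
    if value <= 0:
        raise ValueError("logit_softcap must be positive")
    return value

print(read_logit_softcap({"LOGIT_SOFTCAP": "25.0"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test without mutating the process environment.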

## Practical Code Examples

### PyTorch Example

```python
import os
import torch
from train_gpt import GPT, Hyperparameters

# Configure the softcap value
os.environ["LOGIT_SOFTCAP"] = "25.0"

# Initialize hyperparameters and model
args = Hyperparameters()
model = GPT(
    vocab_size=args.vocab_size,
    num_layers=args.num_layers,
    dim=args.model_dim,
    num_heads=args.num_heads,
    num_kv_heads=args.num_kv_heads,
    mlp_mult=args.mlp_mult,
    logit_softcap=args.logit_softcap,
)

# Forward pass automatically applies softcapping
input_ids = torch.randint(0, args.vocab_size, (4, 128))
target_ids = torch.clone(input_ids)
loss = model.loss(input_ids, target_ids)  # Uses softcapped logits internally

loss.backward()
```

### MLX Example

```python
import os
import mlx.core as mx
from train_gpt_mlx import GPT

# Set the softcap via environment variable, then read it back for the constructor
os.environ["LOGIT_SOFTCAP"] = "30.0"
logit_softcap = float(os.environ["LOGIT_SOFTCAP"])

# Initialize the MLX model
model = GPT(
    vocab_size=50000,
    num_layers=12,
    dim=768,
    num_heads=12,
    num_kv_heads=4,
    mlp_mult=4,
    logit_softcap=logit_softcap,
)

# During the forward pass, the softcap method is applied:
#   logits = self.softcap(logits_proj)
# where softcap implements: logit_softcap * tanh(x / logit_softcap)
```

## Why Use Logit Softcapping?

**Training Stability:** Extreme logit values can produce very large losses and gradients during backpropagation, especially in deep transformer models. The tanh-based softcap bounds the magnitude to ±30 (by default), smoothly compressing extreme values while remaining differentiable everywhere.

**Distribution Preservation:** Unlike hard clipping, the hyperbolic tangent is strictly increasing, so it preserves the relative ordering of the logits. It is also close to the identity for logits well below the cap, so the softmax distribution is nearly unchanged in that regime. Hard clipping, by contrast, maps every value beyond the threshold to the same number and erases the differences between them.
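The difference in how the two approaches treat out-of-range logits can be illustrated with a short sketch. The helpers `softcap` and `hard_clip` and the sample logits are assumptions for illustration, not code from the repository.

```python
import math

def softcap(x, cap=30.0):
    return cap * math.tanh(x / cap)

def hard_clip(x, cap=30.0):
    return max(-cap, min(cap, x))

raw = [40.0, 50.0, 60.0]  # hypothetical logits, all above the cap

clipped = [hard_clip(v) for v in raw]  # every entry collapses to the cap
capped = [softcap(v) for v in raw]     # still strictly increasing

print("hard clip:", clipped)
print("softcap:  ", [round(v, 3) for v in capped])
```

One caveat: for logits many multiples of the cap, floating-point `tanh` eventually rounds to exactly ±1, so distinct values can still collapse in that extreme regime.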

**Architectural Flexibility:** Because the softcap is a configurable hyperparameter rather than a hardcoded constant, researchers can tune it or effectively disable it without modifying the model architecture. Setting it to a very high value works because the transformation approaches the identity as the cap grows.

## Summary

- The **logit softcap transformation** in openai/parameter-golf uses the formula `logit_softcap * tanh(raw_logits / logit_softcap)` to bound extreme values.
- The **PyTorch implementation** applies the transformation inline in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) (lines 723-724) using `torch.tanh`.
- The **MLX implementation** defines a dedicated `softcap` method in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) (lines 14-16) using `mx.tanh`.
- The default **softcap value** is `30.0`, configurable via the `LOGIT_SOFTCAP` environment variable or `--logit-softcap` command-line argument.
- The transformation reduces **training instability** while preserving the **relative ordering** of predictions before cross-entropy loss computation.

## Frequently Asked Questions

### What is the mathematical formula for the logit softcap transformation?

The logit softcap transformation applies the formula `softcapped_logits = logit_softcap * tanh(raw_logits / logit_softcap)`, where `logit_softcap` is a hyperparameter that defaults to 30.0. The hyperbolic tangent bounds its output to (-1, 1), so the rescaled logits stay within ±`logit_softcap` (±30 by default) while the function remains differentiable everywhere.

### Where is the logit softcap implemented in the parameter-golf repository?

The implementation appears in two main files: in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) for the PyTorch version (around lines 723-724) where it is applied inline as `self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)`, and in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) for the MLX version (lines 14-16) where it is encapsulated in the `GPT.softcap` method using `mx.tanh`.

### Why does the parameter-golf model use a softcap instead of hard clipping?

The softcap uses a hyperbolic tangent rather than hard clipping because tanh preserves the relative ordering of the logits while bounding extreme values. Hard clipping has zero gradient for every logit beyond the threshold, so those logits receive no learning signal, whereas the tanh-based approach has a smooth, nonzero gradient that shrinks gradually as a logit grows.
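The gradient behavior follows from the derivative of the formula, which is `1 - tanh(x / logit_softcap) ** 2`. The sketch below compares it with the derivative of a hard clamp. Both helper functions are hypothetical and written only to illustrate the point.

```python
import math

def softcap_grad(x, cap=30.0):
    """Derivative of cap * tanh(x / cap) with respect to x."""
    return 1.0 - math.tanh(x / cap) ** 2

def hard_clip_grad(x, cap=30.0):
    """Derivative of clamp(x, -cap, cap): 1 inside the range, 0 outside."""
    return 1.0 if -cap < x < cap else 0.0

for x in (0.0, 15.0, 30.0, 60.0):
    print(f"x={x:5.1f}  softcap grad={softcap_grad(x):.4f}  "
          f"hard clip grad={hard_clip_grad(x):.1f}")
```

The softcap gradient is 1 at zero and decays smoothly toward 0, while the hard clamp switches abruptly from 1 to 0 at the threshold.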

### How do I configure the logit softcap value for my training run?

You can configure the logit softcap value by setting the `LOGIT_SOFTCAP` environment variable (e.g., `export LOGIT_SOFTCAP=25.0`) or by passing the `--logit-softcap` command-line argument when launching training. The value defaults to 30.0 and must be a positive number; the implementation validates this and raises a `ValueError` if the value is zero or negative.