# How to Configure bfloat16 Mixed Precision Training in Parameter-Golf

> Configure bfloat16 mixed precision training in openai/parameter-golf. Learn how to speed up training with PyTorch autocast or the MLX backend's compute dtype setting.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: how-to-guide
- Published: 2026-04-17

---

**Enable bfloat16 mixed precision training in the openai/parameter-golf repository by wrapping PyTorch forward passes in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` or setting `COMPUTE_DTYPE = mx.bfloat16` in the MLX backend.**

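The Parameter-Golf repository ships two training backends for transformer models, PyTorch and MLX, and both run their heavy arithmetic in bfloat16. bfloat16 stores each value in 16 bits, which halves activation memory and memory traffic relative to float32 and lets matrix multiplications use faster low-precision hardware paths. Because it keeps float32's 8-bit exponent and gives up mantissa bits instead, it covers the same numeric range as float32, which is why bfloat16 training usually does not need the loss scaling that float16 requires.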
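To make the range-versus-precision trade-off concrete, the following standard-library sketch rounds a float32 value to the nearest bfloat16 value by keeping the top 16 bits of its bit pattern. The helper name `round_to_bfloat16` is illustrative and not part of the repository, and NaN handling is omitted for brevity.

```python
import math
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits: 1 sign, 8 exponent, 7 mantissa bits.
    # Adding this bias before truncating implements round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


if __name__ == "__main__":
    print(round_to_bfloat16(3.14159))              # 3.140625
    print(math.isfinite(round_to_bfloat16(1e38)))  # True
```

The first call shows the precision loss (roughly two to three significant decimal digits survive), while the second shows that a value far beyond float16's maximum of 65504 remains representable.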

## PyTorch Backend: Using torch.autocast for bfloat16

The PyTorch script [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) applies mixed precision in two ways: an automatic casting context around the forward pass and explicit tensor conversions.

### Wrapping the Forward Pass

The training loop wraps the loss computation in a `torch.autocast` context manager so that eligible operations run in bfloat16. In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at line 258, the implementation uses:

```python
with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
    loss = model(x, y)
```

Inside this context, autocast-eligible CUDA operations such as matrix multiplications execute in bfloat16, while operations that are sensitive to reduced precision, such as softmax and loss reductions, run in float32. Parameters keep their stored dtype; autocast casts them on the fly for each eligible operation.
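The effect of the context manager can be checked on a toy layer. The sketch below is not taken from the repository: the `autocast_dtypes` helper and the tiny `Linear` model are assumptions for illustration, and the code falls back to CPU autocast when no CUDA device is present so that it runs anywhere PyTorch is installed.

```python
import torch


def autocast_dtypes():
    """Return (output dtype, weight dtype) for a linear layer run under autocast."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(8, 4).to(device)   # parameters are created in float32
    x = torch.randn(2, 8, device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        y = model(x)                           # the matmul runs in bfloat16
    return y.dtype, model.weight.dtype


if __name__ == "__main__":
    print(autocast_dtypes())
```

The output tensor comes back in bfloat16 while the layer's weight is still float32, which is the defining property of mixed precision: low-precision compute over full-precision parameters.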

### Explicit Tensor Casting

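The script also casts tensors to bfloat16 explicitly, outside of any autocast context. At line 100 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), a tensor named `G` is converted:

```python
X = G.bfloat16()
```

Calling `.bfloat16()` returns a bfloat16 copy of a floating-point tensor, so arithmetic between bfloat16 tensors stays in bfloat16 and the copy occupies half the memory of a float32 source. This kind of cast only applies to floating-point data: token-index batches have to stay integer tensors, because embedding lookups require integer indices. The memory savings apply on NVIDIA A100 and other GPUs with bfloat16 support.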
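A short sketch of what an explicit cast does and does not apply to. The tensor names and sizes here are made up for illustration and do not come from `train_gpt.py`.

```python
import torch

weights = torch.randn(4, 4)            # float32 by default
weights_bf16 = weights.bfloat16()      # shorthand for weights.to(torch.bfloat16)

# Bytes per element before and after the cast.
print(weights.element_size(), weights_bf16.element_size())  # 4 2

# Token ids stay integer: nn.Embedding indexes its table with them.
tokens = torch.tensor([[1, 2, 3]])     # int64 token ids
embedding = torch.nn.Embedding(10, 4)
vectors = embedding(tokens)            # floating-point vectors, shape (1, 3, 4)
```

Only the floating-point tensor is a candidate for `.bfloat16()`; the integer token ids pass through to the embedding layer unchanged.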

### Optimizer State Management

When using the Muon optimizer, the Parameter-Golf repository allocates the flat buffer that holds parameter updates in bfloat16, matching the compute dtype:

```python
updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
```

This allocation, derived from the implementation around line 40 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), keeps the update buffer in the same dtype as the autocast forward pass. That halves the buffer's size relative to float32 and avoids dtype mismatches when bfloat16 updates are written into it.
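The sketch below shows one way a flat bfloat16 buffer of this kind can be filled and consumed. It is a simplified single-process illustration, not the repository's Muon implementation: the `apply_flat_updates` helper, the precomputed `updates` list, and the plain `lr`-scaled step are all assumptions made for the example.

```python
import torch


def apply_flat_updates(params, updates, lr):
    """Stage per-parameter updates in one flat bfloat16 buffer, then apply them."""
    total_params = sum(p.numel() for p in params)
    updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)

    # Write each update into its slice of the flat buffer (cast to bfloat16 on copy).
    offset = 0
    for p, u in zip(params, updates):
        updates_flat[offset:offset + p.numel()] = u.reshape(-1)
        offset += p.numel()

    # Read the slices back and apply them in each parameter's own dtype.
    offset = 0
    with torch.no_grad():
        for p in params:
            u = updates_flat[offset:offset + p.numel()].view_as(p).to(p.dtype)
            p.add_(u, alpha=-lr)
            offset += p.numel()
    return updates_flat
```

A single flat buffer is convenient when all updates need to be moved or reduced together, and storing it in bfloat16 halves the bytes involved. The parameters themselves remain in their original dtype in this sketch.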

## MLX Backend: Global Compute Dtype Configuration

The MLX backend in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) uses a global constant to control precision across the entire computational graph, providing a unified approach to mixed precision training on Apple Silicon.

### Setting the COMPUTE_DTYPE Constant

At line 33 in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py), the repository defines a module-level constant that determines the precision of all arithmetic operations:

```python
COMPUTE_DTYPE = mx.bfloat16
```

Changing this single constant to `mx.float32` or `mx.float16` switches every operation that casts to `COMPUTE_DTYPE` to the new precision, without touching individual call sites.

### Model-wide Type Casting

The transformer layers cast tensors to `COMPUTE_DTYPE` before expensive operations. At lines 333-334 in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py), the attention mechanism applies the dtype conversion:

```python
q = self.rope(rms_norm(q).astype(COMPUTE_DTYPE))
```

This pattern appears throughout the model implementation, so embeddings, attention projections, and final logits all operate in bfloat16. Because every cast goes through `COMPUTE_DTYPE`, the precision of the whole model is decided in one place. On Apple M-series chips, halving the bytes per value relative to float32 reduces memory traffic, which is often the bottleneck in transformer training; the effect on accuracy is usually small but is worth checking against a float32 run.

## Hardware Compatibility and Performance

The bfloat16 implementation targets hardware with dedicated bfloat16 support. **NVIDIA A100** GPUs run bfloat16 matrix multiplications on Tensor Cores, and **AMD MI250** accelerators do so on Matrix Cores, so the autocast regions in the PyTorch backend map onto the fastest matrix units available. On **Apple M-series** chips (M1, M2, M3, and M4), the MLX framework supports bfloat16 arrays, and the smaller element size reduces memory bandwidth pressure during transformer training.
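Before relying on bfloat16, it can help to ask PyTorch what the current device supports. The following is a minimal sketch under the assumption that a float16 fallback is acceptable on older GPUs; the `pick_autocast_config` helper is hypothetical and not part of the repository.

```python
import torch


def pick_autocast_config():
    """Choose (device_type, dtype) for torch.autocast based on available hardware."""
    if torch.cuda.is_available():
        # is_bf16_supported() reports whether the current CUDA device can run bfloat16.
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        return "cuda", dtype
    # CPU autocast also accepts bfloat16, which is handy for smoke tests.
    return "cpu", torch.bfloat16


if __name__ == "__main__":
    device_type, dtype = pick_autocast_config()
    with torch.autocast(device_type=device_type, dtype=dtype):
        pass  # the forward pass would go here
```

Falling back to float16 normally also calls for loss scaling, which this sketch does not cover.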

## Summary

- **PyTorch backend**: Wrap forward passes in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` and cast floating-point tensors explicitly via `.bfloat16()` where needed, as implemented in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py).
- **MLX backend**: Set the global `COMPUTE_DTYPE = mx.bfloat16` constant in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) to enable model-wide bfloat16 arithmetic.
- **Optimizer state**: The Muon optimizer in the PyTorch backend allocates its update buffer in bfloat16 so that it matches the compute dtype.
- **Hardware**: bfloat16 training pays off most on hardware with dedicated bfloat16 support, such as NVIDIA A100, AMD MI250, and Apple Silicon (M-series) chips.

## Frequently Asked Questions

### Does Parameter-Golf support automatic mixed precision or manual casting?

Parameter-Golf supports both approaches depending on the backend. The PyTorch implementation in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) uses **automatic mixed precision** via `torch.autocast`, which automatically casts eligible operations to bfloat16 while keeping certain computations in float32 for stability. The MLX implementation in [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) uses **manual casting** through the `COMPUTE_DTYPE` constant and explicit `.astype()` calls throughout the model architecture.

### Which hardware benefits most from bfloat16 training?

**NVIDIA A100 and H100 GPUs** benefit significantly because their Tensor Cores support bfloat16 natively. On the A100, for example, NVIDIA's rated peak for dense bfloat16 Tensor Core math (312 TFLOPS) is twice its TF32 rating (156 TFLOPS); measured training speedups depend on the model and are typically smaller. **AMD MI250 and MI300X** accelerators also provide optimized bfloat16 paths. On **Apple Silicon** (M1, M2, M3, M4), MLX supports bfloat16 arrays, which reduces the memory bandwidth bottlenecks common in transformer training.

### How does the Muon optimizer handle bfloat16 state?

The Muon optimizer implementation in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) allocates its update buffer directly in **bfloat16** using `torch.zeros(..., dtype=torch.bfloat16)`. The buffer therefore shares the compute dtype of the forward pass, which avoids extra dtype conversions when updates are written into it. The optimizer maintains numerical stability by performing certain orthogonalization steps in higher precision when necessary, then casting final updates back to bfloat16.

### Can I switch between float32 and bfloat16 without changing code?

Not entirely: both backends require a source edit, but the size of the edit differs. For the **MLX backend**, you change only the `COMPUTE_DTYPE` constant at the top of [`train_gpt_mlx.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt_mlx.py) (line 33), replacing `mx.bfloat16` with `mx.float32` or `mx.float16`. For the **PyTorch backend**, you need to modify the `torch.autocast` context manager and the explicit `.bfloat16()` casts in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py), because these are hardcoded (lines 100 and 258). Neither backend exposes a runtime configuration flag for precision.
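One way to avoid editing the source for each precision change is to read the dtype from the environment. The sketch below is a suggestion, not something the repository provides: the `TRAIN_DTYPE` variable and the `autocast_from_env` helper are hypothetical names.

```python
import os

import torch

_DTYPES = {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}


def autocast_from_env(device_type: str = "cuda"):
    """Build an autocast context from the hypothetical TRAIN_DTYPE environment variable."""
    name = os.environ.get("TRAIN_DTYPE", "bfloat16")
    if name not in _DTYPES:
        raise ValueError(f"unsupported TRAIN_DTYPE: {name!r}")
    if name == "float32":
        # Disabling autocast leaves every op in the tensors' own (float32) dtype.
        return torch.autocast(device_type=device_type, enabled=False)
    return torch.autocast(device_type=device_type, dtype=_DTYPES[name])
```

In the training loop, `with autocast_from_env():` would then replace the hardcoded `torch.autocast(...)` call, and running with `TRAIN_DTYPE=float32` gives a full-precision baseline. Any explicit `.bfloat16()` casts would still need the same treatment.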