# How to Configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf

> Learn to configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf for faster training. Enable Flash Attention and set num_kv_heads below num_heads.

- Repository: [OpenAI/parameter-golf](https://github.com/openai/parameter-golf)
- Tags: how-to-guide
- Published: 2026-04-17

---

**You configure Flash Attention by calling `enable_flash_sdp(True)` from `torch.backends.cuda` and enable GQA by setting `num_kv_heads` lower than `num_heads` in the `Hyperparameters` class, which triggers the `enable_gqa=True` flag inside `CausalSelfAttention.forward`.**

The `openai/parameter-golf` repository provides a minimal GPT training implementation that supports both Flash Attention (via PyTorch's SDPA dispatcher) and Grouped Query Attention (GQA) for memory-efficient inference. Understanding how to configure these features together allows you to optimize throughput while controlling KV-cache size.

## Understanding Flash Attention and GQA in Parameter-Golf

Flash Attention is a memory-efficient attention algorithm that PyTorch 2.0+ ships as a fused CUDA kernel behind `F.scaled_dot_product_attention` (SDPA). When the SDPA dispatcher selects the Flash backend, attention is computed in tiles without materializing the full attention matrix in GPU memory.

Grouped Query Attention reduces memory bandwidth during inference by sharing key and value heads across multiple query heads. In `parameter-golf`, GQA is activated whenever `num_kv_heads` differs from `num_heads`, creating a **grouped attention pattern** where each KV head serves multiple query heads.
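To make the memory effect concrete, here is a rough back-of-the-envelope sketch of KV-cache size. The helper name `kv_cache_bytes`, the layer count, head dimension, and sequence length are illustrative assumptions, not values taken from the repository; 2 bytes per element assumes fp16 or bf16 storage.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer.

    The leading factor 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes fp16/bf16 (an assumption for this sketch).
    """
    return (2 * num_layers * batch_size * num_kv_heads
            * seq_len * head_dim * bytes_per_elem)

# Hypothetical model: 12 layers, head_dim 64, 1024-token context.
mha = kv_cache_bytes(num_layers=12, num_kv_heads=8, head_dim=64, seq_len=1024)
gqa = kv_cache_bytes(num_layers=12, num_kv_heads=4, head_dim=64, seq_len=1024)

print(mha // 2**20, "MiB vs", gqa // 2**20, "MiB")  # 24 MiB vs 12 MiB
```

Halving the number of KV heads halves the cache, while the number of query heads (and therefore the attention output width) stays the same.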

## Enabling Flash Attention in train_gpt.py

Flash Attention is enabled globally at the start of the training script through PyTorch's backend configuration.

### The enable_flash_sdp Configuration

In [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) at lines 64-68, the script configures the SDPA (Scaled Dot Product Attention) dispatcher:

```python
from torch.backends.cuda import enable_flash_sdp, enable_cudnn_sdp, enable_math_sdp, enable_mem_efficient_sdp

enable_flash_sdp(True)            # Enable Flash Attention kernel
enable_mem_efficient_sdp(False)   # Disable memory-efficient alternative
enable_math_sdp(False)            # Disable fallback math implementation
enable_cudnn_sdp(False)           # Disable CuDNN SDPA (optional)
```

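With the other backends disabled, every subsequent call to `F.scaled_dot_product_attention` must use the Flash Attention kernel. This requires compatible CUDA hardware (Ampere or newer) and half-precision inputs (fp16 or bf16). If the kernel cannot handle a call, PyTorch raises a runtime error instead of silently falling back, because no fallback backend is left enabled.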

## Configuring Grouped Query Attention (GQA)

GQA configuration involves two components: the hyperparameter definition and the runtime activation inside the attention mechanism.

### Setting num_kv_heads vs num_heads

In the `Hyperparameters` class (lines 39-70 in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py)), you define the attention head structure:

```python
class Hyperparameters:
    num_heads: int = 8        # Total number of query heads
    num_kv_heads: int = 4     # Number of key/value heads (GQA when < num_heads)
```

When `num_kv_heads` (4) is less than `num_heads` (8), the model uses a **2:1 grouping ratio**: each KV head serves 2 query heads.
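The grouping for 8 query heads and 4 KV heads can be sketched in plain Python. The function name `kv_head_for_query_head` is hypothetical and only illustrates the index arithmetic; it assumes consecutive query heads share a KV head, which matches the `repeat_interleave` semantics PyTorch documents for `enable_gqa`.

```python
def kv_head_for_query_head(q_head, num_heads, num_kv_heads):
    """Return the index of the KV head that a given query head attends with."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size

# 8 query heads over 4 KV heads: groups of 2 consecutive query heads.
mapping = [kv_head_for_query_head(h, num_heads=8, num_kv_heads=4) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```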

### The enable_gqa Parameter in scaled_dot_product_attention

The actual GQA activation occurs in `CausalSelfAttention.forward` at lines 600-601:

```python
attn_output = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=self.dropout if self.training else 0.0,
    is_causal=True,
    enable_gqa=(self.num_kv_heads != self.num_heads),  # GQA flag
)
```

The `enable_gqa` parameter is set dynamically based on whether the KV head count differs from the query head count. When `True`, PyTorch's SDPA implementation broadcasts the reduced KV heads across the query head groups.
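To see what this broadcasting amounts to, the following sketch expands the KV heads explicitly with `repeat_interleave` and then calls ordinary causal SDPA. It is a reference illustration rather than the repository's code: the helper name `sdpa_with_repeated_kv` and the tensor sizes are assumptions, and it runs on CPU so it does not exercise the Flash kernel itself.

```python
import torch
import torch.nn.functional as F

def sdpa_with_repeated_kv(q, k, v):
    """Reference GQA: copy each KV head across its query group, then run causal SDPA.

    Shapes: q is [batch, num_heads, seq, head_dim];
            k and v are [batch, num_kv_heads, seq, head_dim].
    """
    group_size = q.size(1) // k.size(1)
    k = k.repeat_interleave(group_size, dim=1)  # [batch, num_heads, seq, head_dim]
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.manual_seed(0)
batch, num_heads, num_kv_heads, seq_len, head_dim = 2, 8, 4, 16, 32
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

out = sdpa_with_repeated_kv(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16, 32])
```

Passing `enable_gqa=True` lets PyTorch apply the same grouping without you materializing the repeated K and V tensors yourself.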

## Complete Configuration Example

Here is a complete configuration script that demonstrates both Flash Attention and GQA enabled simultaneously:

```python
import torch
from torch.backends.cuda import enable_flash_sdp, enable_mem_efficient_sdp, enable_math_sdp
from train_gpt import GPT, Hyperparameters

# 1. Enable Flash Attention globally
enable_flash_sdp(True)
enable_mem_efficient_sdp(False)
enable_math_sdp(False)

# 2. Configure GQA via hyperparameters
class GQAConfig(Hyperparameters):
    num_heads = 12       # 12 query heads
    num_kv_heads = 3     # 3 KV heads = 4:1 grouping

# 3. Instantiate model
model = GPT(
    vocab_size=GQAConfig.vocab_size,
    num_layers=2,
    model_dim=768,
    num_heads=GQAConfig.num_heads,
    num_kv_heads=GQAConfig.num_kv_heads,
    mlp_mult=4,
    tie_embeddings=True,
    tied_embed_init_std=0.02,
    logit_softcap=30.0,
    rope_base=10000.0,
    qk_gain_init=1.5,
)

model.cuda().eval()

# 4. Verify configuration
print(f"Flash Attention enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"GQA enabled: {GQAConfig.num_kv_heads != GQAConfig.num_heads}")

# 5. Run forward pass under bf16 autocast (Flash kernels need fp16/bf16 inputs)
tokens = torch.randint(0, GQAConfig.vocab_size, (1, 32), device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(tokens, tokens)
print(f"Output loss: {output.item():.4f}")
```

Alternatively, use environment variables to configure without modifying code:

```bash
# 12 query heads, 3 KV heads (4:1 GQA), Flash Attention enabled by default
NUM_HEADS=12 NUM_KV_HEADS=3 python train_gpt.py \
    --train_batch_tokens 524288 \
    --iterations 1000
```
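For readers wondering how such variables typically reach the model, here is a minimal sketch of reading head counts from the environment. The function `read_head_config` is hypothetical and is not the repository's parsing code; the fallback values 8 and 4 simply mirror the `Hyperparameters` snippet shown earlier.

```python
import os

def read_head_config(env=os.environ, default_heads=8, default_kv_heads=4):
    """Read NUM_HEADS / NUM_KV_HEADS from a mapping, falling back to defaults.

    `env` is any dict-like object, which makes the function easy to test
    without touching the real process environment.
    """
    num_heads = int(env.get("NUM_HEADS", default_heads))
    num_kv_heads = int(env.get("NUM_KV_HEADS", default_kv_heads))
    return num_heads, num_kv_heads

# Simulate `NUM_HEADS=12 NUM_KV_HEADS=3 python train_gpt.py ...`
print(read_head_config({"NUM_HEADS": "12", "NUM_KV_HEADS": "3"}))  # (12, 3)
```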

## Validation and Prerequisites

Before running with Flash Attention and GQA, ensure your configuration meets these requirements:

- **PyTorch 2.5 or newer**: The `enable_flash_sdp` API has been available since PyTorch 2.0, but the `enable_gqa` parameter of `scaled_dot_product_attention` was added in PyTorch 2.5.
- **CUDA 11.8+ with an Ampere or newer GPU**: Flash Attention requires NVIDIA GPUs with compute capability ≥ 8.0 (for example A100, RTX 30xx, or RTX 40xx series).
- **Head count divisibility**: In `CausalSelfAttention.__init__`, the code validates that `num_heads % num_kv_heads == 0`. This ensures uniform grouping of query heads across KV heads.

The validation logic appears in the `CausalSelfAttention` constructor:

```python
assert self.num_heads % self.num_kv_heads == 0, (
    f"num_heads ({self.num_heads}) must be divisible by "
    f"num_kv_heads ({self.num_kv_heads}) for GQA"
)
```
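The hardware and head-count requirements can also be checked up front with a small helper before building the model. This is only a sketch: `preflight_problems` is a hypothetical name, and the compute capability is passed in as a `(major, minor)` tuple (in a real script it could come from `torch.cuda.get_device_capability()`).

```python
def preflight_problems(compute_capability, num_heads, num_kv_heads):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    # Flash Attention needs compute capability >= 8.0 (Ampere or newer).
    if tuple(compute_capability) < (8, 0):
        problems.append(
            f"compute capability {compute_capability} is below (8, 0)"
        )
    # GQA needs query heads to split evenly across KV heads.
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        problems.append(
            f"num_heads ({num_heads}) is not divisible by num_kv_heads ({num_kv_heads})"
        )
    return problems

print(preflight_problems((8, 0), num_heads=12, num_kv_heads=3))  # []
for problem in preflight_problems((7, 5), num_heads=8, num_kv_heads=3):
    print("problem:", problem)
```

Collecting every problem in a list, rather than asserting on the first one, lets you report all configuration issues in a single run.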

## Summary

- **Enable Flash Attention** by calling `enable_flash_sdp(True)` from `torch.backends.cuda` before model initialization, as implemented in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) lines 64-68.
- **Activate GQA** by setting `num_kv_heads` lower than `num_heads` in the `Hyperparameters` class; the `CausalSelfAttention.forward` method automatically passes `enable_gqa=True` to `scaled_dot_product_attention` at lines 600-601.
- **Ensure compatibility** by using PyTorch 2.5+ (required for `enable_gqa`), CUDA 11.8+, and verifying that `num_heads` is divisible by `num_kv_heads`.
- **Configure via environment variables** by setting `NUM_HEADS` and `NUM_KV_HEADS` before launching [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py).

## Frequently Asked Questions

### What is the difference between MHA, MQA, and GQA in the Parameter-Golf implementation?

**Multi-Head Attention (MHA)** uses unique key and value heads for every query head (`num_kv_heads = num_heads`). **Multi-Query Attention (MQA)** shares a single key and value head across all query heads (`num_kv_heads = 1`). **Grouped Query Attention (GQA)** is the intermediate case where `num_kv_heads` is set between 1 and `num_heads`, creating groups where multiple query heads share the same KV head. The Parameter-Golf code does not branch on these names: it only compares `num_kv_heads` with `num_heads` and passes `enable_gqa=True` whenever they differ, so MQA and GQA take the same code path.
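The three cases can be told apart with a few lines of Python. The function `attention_mode` is a hypothetical helper for illustration and does not exist in the repository.

```python
def attention_mode(num_heads, num_kv_heads):
    """Classify a head configuration as MHA, MQA, or GQA."""
    if num_kv_heads <= 0 or num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a positive multiple of num_kv_heads")
    if num_kv_heads == num_heads:
        return "MHA"   # one KV head per query head
    if num_kv_heads == 1:
        return "MQA"   # a single KV head shared by all query heads
    return "GQA"       # several query heads per KV head

for heads, kv_heads in [(8, 8), (8, 1), (8, 4)]:
    print(heads, kv_heads, attention_mode(heads, kv_heads))
```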

### Does Flash Attention work with all GQA configurations?

Flash Attention supports GQA configurations **only when** the number of query heads is a multiple of the number of key/value heads. Parameter-Golf enforces this in `CausalSelfAttention.__init__`, where `assert self.num_heads % self.num_kv_heads == 0` validates the configuration. An invalid combination therefore fails with an `AssertionError` when the model is constructed, before any attention kernel is dispatched.

### Can I enable Flash Attention without using GQA?

Yes. Flash Attention and GQA are **independent features** in the Parameter-Golf codebase. To use Flash Attention without GQA, set `num_kv_heads` equal to `num_heads` in your hyperparameters (for example `NUM_HEADS=8 NUM_KV_HEADS=8`). Note that the `Hyperparameters` snippet above defaults to 4 KV heads for 8 query heads, so omitting `NUM_KV_HEADS` leaves GQA enabled. The `enable_flash_sdp(True)` call still activates the Flash Attention kernel, while `enable_gqa` evaluates to `False` because `num_kv_heads == num_heads`, resulting in standard multi-head attention with Flash Attention optimization.

### Where does the actual GQA broadcasting happen in the code?

The GQA broadcasting logic is **handled internally by PyTorch's SDPA implementation**, not manually in the Parameter-Golf source. The Parameter-Golf code only sets the `enable_gqa=True` flag when calling `F.scaled_dot_product_attention` in [`train_gpt.py`](https://github.com/openai/parameter-golf/blob/main/train_gpt.py) lines 600-601. When this flag is true and the KV tensor dimensions indicate fewer heads than the query tensor, PyTorch automatically broadcasts the KV heads to match the query head groups during the attention computation. Your responsibility is only to ensure the input tensors have the correct shape: `[batch, num_kv_heads, seq_len, head_dim]` for keys/values versus `[batch, num_heads, seq_len, head_dim]` for queries.