How to Configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf

You configure Flash Attention by calling enable_flash_sdp(True) from torch.backends.cuda and enable GQA by setting num_kv_heads lower than num_heads in the Hyperparameters class, which triggers the enable_gqa=True flag inside CausalSelfAttention.forward.

The openai/parameter-golf repository provides a minimal GPT training implementation that supports both Flash Attention (via PyTorch's SDPA dispatcher) and Grouped Query Attention (GQA) for memory-efficient inference. Understanding how to configure these features together allows you to optimize throughput while controlling KV-cache size.

Understanding Flash Attention and GQA in Parameter-Golf

Flash Attention is a memory-efficient attention algorithm that PyTorch 2.0+ ships as one of the fused CUDA backends behind scaled_dot_product_attention. When that backend is selected, attention is computed in tiles, so the full attention matrix is never materialized.

Grouped Query Attention reduces memory bandwidth during inference by sharing key and value heads across multiple query heads. In parameter-golf, GQA is activated whenever num_kv_heads differs from num_heads, creating a grouped attention pattern where each KV head serves multiple query heads.
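To make the memory effect concrete, here is a back-of-the-envelope sketch of KV-cache size as a function of num_kv_heads. The function name kv_cache_bytes and the model dimensions (12 layers, head_dim 64, 4096-token context, 2-byte elements) are hypothetical values chosen for illustration, not settings taken from the repository.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one cached tensor for keys and one for values.
    return (2 * num_layers * batch_size * num_kv_heads
            * seq_len * head_dim * bytes_per_element)


# Hypothetical model: 12 layers, head_dim 64, 4096-token context, bf16 cache.
mha = kv_cache_bytes(num_layers=12, num_kv_heads=8, head_dim=64, seq_len=4096)
gqa = kv_cache_bytes(num_layers=12, num_kv_heads=4, head_dim=64, seq_len=4096)

print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"GQA cache: {gqa / 2**20:.0f} MiB")
```

Under these assumptions the cache scales linearly with num_kv_heads, so halving the KV heads halves the cache while the number of query heads stays the same.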

Enabling Flash Attention in train_gpt.py

Flash Attention is enabled globally at the start of the training script through PyTorch's backend configuration.

The enable_flash_sdp Configuration

In train_gpt.py at lines 64-68, the script configures the SDPA (Scaled Dot Product Attention) dispatcher:

from torch.backends.cuda import enable_flash_sdp, enable_cudnn_sdp, enable_math_sdp, enable_mem_efficient_sdp

enable_flash_sdp(True)            # Enable Flash Attention kernel
enable_mem_efficient_sdp(False)   # Disable memory-efficient alternative
enable_math_sdp(False)            # Disable fallback math implementation
enable_cudnn_sdp(False)           # Disable cuDNN SDPA (optional)

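This configuration restricts all subsequent calls to F.scaled_dot_product_attention to the Flash Attention kernel on compatible CUDA hardware (Ampere or newer). Because the memory-efficient, math, and cuDNN backends are disabled, SDPA raises an error instead of silently falling back when the flash kernel cannot handle the inputs (for example, fp32 tensors).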

Configuring Grouped Query Attention (GQA)

GQA configuration involves two components: the hyperparameter definition and the runtime activation inside the attention mechanism.

Setting num_kv_heads vs num_heads

In the Hyperparameters class (lines 39-70 in train_gpt.py), you define the attention head structure:

class Hyperparameters:
    num_heads: int = 8        # Total number of query heads
    num_kv_heads: int = 4     # Number of key/value heads (GQA when < num_heads)

When num_kv_heads (4) is less than num_heads (8), the model uses a 2:1 grouping ratio: each KV head serves 2 query heads.

The enable_gqa Parameter in scaled_dot_product_attention

The actual GQA activation occurs in CausalSelfAttention.forward at lines 600-601:

attn_output = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=self.dropout if self.training else 0.0,
    is_causal=True,
    enable_gqa=(self.num_kv_heads != self.num_heads),  # GQA flag
)

The enable_gqa parameter is set dynamically based on whether the KV head count differs from the query head count. When True, PyTorch's SDPA implementation broadcasts the reduced KV heads across the query head groups.
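To illustrate what this broadcasting means numerically, the following pure-Python sketch compares attention with shared KV heads against attention where the KV heads have been explicitly duplicated. It is a toy reference written for this article, not PyTorch's kernel: the function names are hypothetical, tensors are nested lists, and the consecutive-group layout is an assumption.

```python
import math
import random


def causal_attention_head(q, k, v):
    """Causal softmax attention for one head; q, k, v are lists of vectors."""
    head_dim = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        # Causal mask: position t attends only to positions 0..t.
        scores = [sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(head_dim)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        out.append([sum(w * v[s][d] for s, w in enumerate(weights)) / total
                    for d in range(head_dim)])
    return out


def gqa_attention(q_heads, k_heads, v_heads):
    """Each KV head is shared by a contiguous group of query heads."""
    group_size = len(q_heads) // len(k_heads)
    return [causal_attention_head(q, k_heads[i // group_size], v_heads[i // group_size])
            for i, q in enumerate(q_heads)]


def repeat_interleave(heads, repeats):
    """Duplicate each head `repeats` times, keeping the copies adjacent."""
    return [head for head in heads for _ in range(repeats)]


rng = random.Random(0)


def rand_heads(n_heads, seq_len=3, head_dim=2):
    return [[[rng.uniform(-1, 1) for _ in range(head_dim)]
             for _ in range(seq_len)] for _ in range(n_heads)]


q_heads, k_heads, v_heads = rand_heads(4), rand_heads(2), rand_heads(2)

shared = gqa_attention(q_heads, k_heads, v_heads)
expanded = gqa_attention(q_heads,
                         repeat_interleave(k_heads, 2),
                         repeat_interleave(v_heads, 2))
print("shared KV matches duplicated KV:", shared == expanded)
```

The two results are identical because sharing a KV head across a group is the same computation as multi-head attention over duplicated KV heads; GQA simply avoids storing the duplicates.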

Complete Configuration Example

Here is a complete configuration script that demonstrates both Flash Attention and GQA enabled simultaneously:

import torch
from torch.backends.cuda import enable_flash_sdp, enable_mem_efficient_sdp, enable_math_sdp
from train_gpt import GPT, Hyperparameters

# 1. Enable Flash Attention globally
enable_flash_sdp(True)
enable_mem_efficient_sdp(False)
enable_math_sdp(False)

# 2. Configure GQA via hyperparameters
class GQAConfig(Hyperparameters):
    num_heads = 12       # 12 query heads
    num_kv_heads = 3     # 3 KV heads = 4:1 grouping

# 3. Instantiate model
model = GPT(
    vocab_size=GQAConfig.vocab_size,
    num_layers=2,
    model_dim=768,
    num_heads=GQAConfig.num_heads,
    num_kv_heads=GQAConfig.num_kv_heads,
    mlp_mult=4,
    tie_embeddings=True,
    tied_embed_init_std=0.02,
    logit_softcap=30.0,
    rope_base=10000.0,
    qk_gain_init=1.5,
)

model.cuda().eval()

# 4. Verify configuration
print(f"Flash Attention enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"GQA enabled: {GQAConfig.num_kv_heads != GQAConfig.num_heads}")

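# 5. Run forward pass (the flash kernel needs fp16/bf16 inputs, hence autocast)
tokens = torch.randint(0, GQAConfig.vocab_size, (1, 32), device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(tokens, tokens)
print(f"Output loss: {output.item():.4f}")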

Alternatively, use environment variables to configure without modifying code:

# 12 query heads, 3 KV heads (4:1 GQA), Flash Attention enabled by default
NUM_HEADS=12 NUM_KV_HEADS=3 python train_gpt.py \
    --train_batch_tokens 524288 \
    --iterations 1000
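The parsing code behind these environment variables is not shown here. As a rough sketch, one common pattern is to read each variable with a typed default. The class EnvHyperparameters is a hypothetical stand-in, and the defaults of 8 and 4 are taken from the Hyperparameters excerpt earlier in this article; the repository's actual mechanism may differ.

```python
import os


class EnvHyperparameters:
    """Hypothetical stand-in showing env-var overrides with typed defaults."""

    def __init__(self, env=None):
        # Accept a plain dict for testing; fall back to the process environment.
        env = os.environ if env is None else env
        self.num_heads = int(env.get("NUM_HEADS", 8))
        self.num_kv_heads = int(env.get("NUM_KV_HEADS", 4))


cfg = EnvHyperparameters({"NUM_HEADS": "12", "NUM_KV_HEADS": "3"})
print(cfg.num_heads, cfg.num_kv_heads, cfg.num_heads // cfg.num_kv_heads)
```

Environment values always arrive as strings, so the explicit int() conversion matters: without it the later head-count comparison and modulo check would operate on text.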

Validation and Prerequisites

Before running with Flash Attention and GQA, ensure your configuration meets these requirements:

  • PyTorch 2.5 or newer: The enable_flash_sdp API has been available since PyTorch 2.0, but the enable_gqa argument to scaled_dot_product_attention was added in PyTorch 2.5.
  • CUDA 11.8+ with Ampere architecture: Flash Attention requires NVIDIA GPUs with compute capability ≥ 8.0 (A100, RTX 30xx/40xx series).
  • Head count divisibility: In CausalSelfAttention.__init__, the code validates that num_heads % num_kv_heads == 0. This ensures uniform grouping of query heads across KV heads.

The validation logic appears in the CausalSelfAttention constructor:

assert self.num_heads % self.num_kv_heads == 0, (
    f"num_heads ({self.num_heads}) must be divisible by "
    f"num_kv_heads ({self.num_kv_heads}) for GQA"
)
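When choosing a head configuration, it can help to enumerate the KV head counts that pass this check before launching a run. The helpers below are a hypothetical sketch that mirrors the divisibility rule; they are not part of train_gpt.py.

```python
def valid_kv_head_counts(num_heads):
    """All num_kv_heads values that divide num_heads evenly."""
    return [n for n in range(1, num_heads + 1) if num_heads % n == 0]


def check_head_config(num_heads, num_kv_heads):
    """Return the number of query heads per KV head, or raise on a bad config."""
    if num_kv_heads < 1 or num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by num_kv_heads "
            f"({num_kv_heads}); valid choices: {valid_kv_head_counts(num_heads)}"
        )
    return num_heads // num_kv_heads


print(valid_kv_head_counts(12))   # [1, 2, 3, 4, 6, 12]
print(check_head_config(12, 3))   # 4
```

Raising ValueError with the list of valid choices is a design choice for this sketch; unlike an assert, it still runs when Python is started with optimizations enabled.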

Summary

  • Enable Flash Attention by calling enable_flash_sdp(True) from torch.backends.cuda before model initialization, as implemented in train_gpt.py lines 64-68.
  • Activate GQA by setting num_kv_heads lower than num_heads in the Hyperparameters class; the CausalSelfAttention.forward method automatically passes enable_gqa=True to scaled_dot_product_attention at lines 600-601.
  • Ensure compatibility by using PyTorch 2.5+ (required for the enable_gqa argument), CUDA 11.8+, and verifying that num_heads is divisible by num_kv_heads.
  • Configure via environment variables by setting NUM_HEADS and NUM_KV_HEADS before launching train_gpt.py.

Frequently Asked Questions

What is the difference between MHA, MQA, and GQA in the Parameter-Golf implementation?

Multi-Head Attention (MHA) uses unique key and value heads for every query head (num_kv_heads = num_heads). Multi-Query Attention (MQA) shares a single key and value head across all query heads (num_kv_heads = 1). Grouped Query Attention (GQA) is the intermediate case where num_kv_heads is set between 1 and num_heads, creating groups where multiple query heads share the same KV head. The Parameter-Golf code automatically detects which mode to use based on the num_kv_heads vs num_heads comparison.
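The three modes can be summarized as a small classification helper. The function attention_mode is hypothetical; the enable_gqa expression mirrors the comparison shown in the CausalSelfAttention.forward excerpt.

```python
def attention_mode(num_heads, num_kv_heads):
    """Name the attention variant implied by the two head counts."""
    if num_kv_heads == num_heads:
        return "MHA"   # one KV head per query head
    if num_kv_heads == 1:
        return "MQA"   # a single KV head shared by every query head
    return "GQA"       # several query heads per KV head


for heads, kv_heads in [(8, 8), (8, 4), (8, 1)]:
    enable_gqa = kv_heads != heads  # same comparison the forward pass uses
    print(heads, kv_heads, attention_mode(heads, kv_heads), enable_gqa)
```

Note that MQA also sets enable_gqa to True under this comparison, since a single shared KV head is just the extreme case of grouping.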

Does Flash Attention work with all GQA configurations?

Flash Attention supports GQA only when the number of query heads is a multiple of the number of key/value heads. Parameter-Golf enforces this up front: the assert self.num_heads % self.num_kv_heads == 0 check in CausalSelfAttention.__init__ fails at model construction, so an invalid head configuration raises an AssertionError before any call reaches PyTorch's SDPA dispatcher.

Can I enable Flash Attention without using GQA?

Yes. Flash Attention and GQA are independent features in the Parameter-Golf codebase. To use Flash Attention without GQA, set num_kv_heads equal to num_heads in your hyperparameters (or omit the NUM_KV_HEADS environment variable to use the default). The enable_flash_sdp(True) call will still activate the Flash Attention kernel, while the enable_gqa parameter will evaluate to False because num_kv_heads == num_heads, resulting in standard multi-head attention with Flash Attention optimization.

Where does the actual GQA broadcasting happen in the code?

The GQA broadcasting logic is handled internally by PyTorch's SDPA implementation, not manually in the Parameter-Golf source. The Parameter-Golf code only sets the enable_gqa=True flag when calling F.scaled_dot_product_attention in train_gpt.py lines 600-601. When this flag is true and the KV tensor dimensions indicate fewer heads than the query tensor, PyTorch automatically broadcasts the KV heads to match the query head groups during the attention computation. Your responsibility is only to ensure the input tensors have the correct shape: [batch, num_kv_heads, seq_len, head_dim] for keys/values versus [batch, num_heads, seq_len, head_dim] for queries.
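A quick way to sanity-check those shapes is to compute them from the configuration. The helper attention_shapes is a hypothetical sketch and assumes head_dim equals model_dim divided by num_heads, which is the usual convention but is not confirmed by the excerpts above.

```python
def attention_shapes(batch, seq_len, model_dim, num_heads, num_kv_heads):
    """Expected [batch, heads, seq_len, head_dim] shapes for q, k, and v."""
    if model_dim % num_heads != 0:
        raise ValueError("model_dim must be divisible by num_heads")
    head_dim = model_dim // num_heads  # assumption: heads split model_dim evenly
    return {
        "q": (batch, num_heads, seq_len, head_dim),
        "k": (batch, num_kv_heads, seq_len, head_dim),
        "v": (batch, num_kv_heads, seq_len, head_dim),
    }


shapes = attention_shapes(batch=1, seq_len=32, model_dim=768,
                          num_heads=12, num_kv_heads=3)
for name, shape in shapes.items():
    print(name, shape)
```

For the 12-head, 3-KV-head example this gives a head_dim of 64, with q carrying 12 heads and k and v carrying 3 each; only the head dimension differs between the query and key/value tensors.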
