How to Configure Flash Attention with Grouped Query Attention (GQA) in Parameter-Golf
You configure Flash Attention by calling enable_flash_sdp(True) from torch.backends.cuda and enable GQA by setting num_kv_heads lower than num_heads in the Hyperparameters class, which triggers the enable_gqa=True flag inside CausalSelfAttention.forward.
The openai/parameter-golf repository provides a minimal GPT training implementation that supports both Flash Attention (via PyTorch's SDPA dispatcher) and Grouped Query Attention (GQA) for memory-efficient inference. Understanding how to configure these features together allows you to optimize throughput while controlling KV-cache size.
Understanding Flash Attention and GQA in Parameter-Golf
Flash Attention is a memory-efficient attention algorithm that PyTorch 2.0+ ships as a fused CUDA kernel behind F.scaled_dot_product_attention. When the dispatcher selects it, attention is computed without materializing the full attention matrix.
Grouped Query Attention reduces memory bandwidth during inference by sharing key and value heads across multiple query heads. In parameter-golf, GQA is activated whenever num_kv_heads differs from num_heads, creating a grouped attention pattern where each KV head serves multiple query heads.
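The grouping pattern can be made concrete with a small index calculation. The sketch below is illustrative only and is not code from the repository: the helper name kv_head_for_query is hypothetical, and it assumes contiguous grouping, which is the layout PyTorch's documentation describes for enable_gqa (equivalent to a repeat_interleave of the KV heads).

```python
def kv_head_for_query(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head shared by a given query head.

    Hypothetical helper; assumes contiguous groups of query heads.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


num_heads, num_kv_heads = 8, 4
mapping = [kv_head_for_query(q, num_heads, num_kv_heads) for q in range(num_heads)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With 8 query heads and 4 KV heads, each KV head is shared by two neighbouring query heads.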
Enabling Flash Attention in train_gpt.py
Flash Attention is enabled globally at the start of the training script through PyTorch's backend configuration.
The enable_flash_sdp Configuration
In train_gpt.py at lines 64-68, the script configures the SDPA (Scaled Dot Product Attention) dispatcher:
from torch.backends.cuda import enable_flash_sdp, enable_cudnn_sdp, enable_math_sdp, enable_mem_efficient_sdp
enable_flash_sdp(True) # Enable Flash Attention kernel
enable_mem_efficient_sdp(False) # Disable memory-efficient alternative
enable_math_sdp(False) # Disable fallback math implementation
enable_cudnn_sdp(False) # Disable CuDNN SDPA (optional)
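With the other backends disabled, subsequent calls to F.scaled_dot_product_attention can only dispatch to the Flash Attention kernel, which requires compatible CUDA hardware (Ampere or newer). If the kernel cannot handle a given call, PyTorch raises an error instead of silently falling back to another implementation.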
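Because no fallback backend is left enabled, it can be useful to check the GPU generation before training starts. The following is a minimal sketch, not part of train_gpt.py: supports_flash_attention is a hypothetical helper that takes a (major, minor) compute capability tuple, which in practice would come from torch.cuda.get_device_capability().

```python
def supports_flash_attention(capability: tuple) -> bool:
    """Return True for compute capability 8.0 or newer (Ampere and later).

    Hypothetical helper; `capability` is a (major, minor) tuple.
    """
    major, _minor = capability
    return major >= 8


# Illustrative capabilities: A100 is (8, 0), RTX 30xx is (8, 6), T4 is (7, 5).
for name, cap in [("A100", (8, 0)), ("RTX 30xx", (8, 6)), ("T4", (7, 5))]:
    print(name, supports_flash_attention(cap))
```

The threshold of 8.0 mirrors the Ampere requirement stated above; it does not account for dtype or head-dimension limits that the kernel may also impose.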
Configuring Grouped Query Attention (GQA)
GQA configuration involves two components: the hyperparameter definition and the runtime activation inside the attention mechanism.
Setting num_kv_heads vs num_heads
In the Hyperparameters class (lines 39-70 in train_gpt.py), you define the attention head structure:
class Hyperparameters:
    num_heads: int = 8     # Total number of query heads
    num_kv_heads: int = 4  # Number of key/value heads (GQA when < num_heads)
When num_kv_heads (4) is less than num_heads (8), the model uses a 2:1 grouping ratio: each KV head serves 2 query heads.
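The effect of the head counts on KV-cache size can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions, not repository code: kv_cache_bytes is a hypothetical helper, and the layer count, head dimension, sequence length, and 2-byte element size are example values rather than parameter-golf defaults.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes (hypothetical helper).

    The leading factor of 2 counts one tensor for keys and one for values.
    """
    return (2 * num_layers * batch_size * num_kv_heads
            * seq_len * head_dim * bytes_per_elem)


# Example values only: 12 layers, head_dim 64, 1024 tokens, 16-bit elements.
mha = kv_cache_bytes(num_layers=12, num_kv_heads=8, head_dim=64, seq_len=1024)
gqa = kv_cache_bytes(num_layers=12, num_kv_heads=4, head_dim=64, seq_len=1024)
print(mha / gqa)  # 2.0
```

Halving the number of KV heads halves the cache, while the number of query heads stays unchanged.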
The enable_gqa Parameter in scaled_dot_product_attention
The actual GQA activation occurs in CausalSelfAttention.forward at lines 600-601:
attn_output = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=self.dropout if self.training else 0.0,
    is_causal=True,
    enable_gqa=(self.num_kv_heads != self.num_heads),  # GQA flag
)
The enable_gqa parameter is set dynamically based on whether the KV head count differs from the query head count. When True, PyTorch's SDPA implementation broadcasts the reduced KV heads across the query head groups.
Complete Configuration Example
Here is a complete configuration script that demonstrates both Flash Attention and GQA enabled simultaneously:
import torch
from torch.backends.cuda import enable_flash_sdp, enable_mem_efficient_sdp, enable_math_sdp
from train_gpt import GPT, Hyperparameters
# 1. Enable Flash Attention globally
enable_flash_sdp(True)
enable_mem_efficient_sdp(False)
enable_math_sdp(False)
# 2. Configure GQA via hyperparameters
class GQAConfig(Hyperparameters):
    num_heads = 12    # 12 query heads
    num_kv_heads = 3  # 3 KV heads = 4:1 grouping
# 3. Instantiate model
model = GPT(
    vocab_size=GQAConfig.vocab_size,
    num_layers=2,
    model_dim=768,
    num_heads=GQAConfig.num_heads,
    num_kv_heads=GQAConfig.num_kv_heads,
    mlp_mult=4,
    tie_embeddings=True,
    tied_embed_init_std=0.02,
    logit_softcap=30.0,
    rope_base=10000.0,
    qk_gain_init=1.5,
)
model.cuda().eval()
# 4. Verify configuration
print(f"Flash Attention enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"GQA enabled: {GQAConfig.num_kv_heads != GQAConfig.num_heads}")
# 5. Run forward pass
tokens = torch.randint(0, GQAConfig.vocab_size, (1, 32), device="cuda")
with torch.no_grad():
    output = model(tokens, tokens)
print(f"Output loss: {output.item():.4f}")
Alternatively, use environment variables to configure without modifying code:
# 12 query heads, 3 KV heads (4:1 GQA), Flash Attention enabled by default
NUM_HEADS=12 NUM_KV_HEADS=3 python train_gpt.py \
--train_batch_tokens 524288 \
--iterations 1000
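How environment variables might map onto the head counts can be sketched as follows. This is an assumption-laden illustration, not the parsing code in train_gpt.py: read_head_config is a hypothetical helper, and the fallback values 8 and 4 are taken from the Hyperparameters excerpt shown earlier.

```python
import os


def read_head_config(env=None):
    """Read NUM_HEADS and NUM_KV_HEADS from an environment mapping.

    Hypothetical helper; defaults (8 and 4) mirror the excerpt above
    and may not match the repository's actual defaults.
    """
    env = os.environ if env is None else env
    num_heads = int(env.get("NUM_HEADS", 8))
    num_kv_heads = int(env.get("NUM_KV_HEADS", 4))
    return num_heads, num_kv_heads


# Pass an explicit mapping so the example does not depend on the real environment.
print(read_head_config({"NUM_HEADS": "12", "NUM_KV_HEADS": "3"}))  # (12, 3)
```

Calling read_head_config() with no argument would read from the process environment instead.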
Validation and Prerequisites
Before running with Flash Attention and GQA, ensure your configuration meets these requirements:
- PyTorch version: The enable_flash_sdp API is available from PyTorch 2.0, but the enable_gqa parameter of scaled_dot_product_attention was added in PyTorch 2.5, so this configuration needs 2.5 or newer.
- CUDA 11.8+ with Ampere architecture: Flash Attention requires NVIDIA GPUs with compute capability ≥ 8.0 (A100, RTX 30xx/40xx series).
- Head count divisibility: In CausalSelfAttention.__init__, the code validates that num_heads % num_kv_heads == 0. This ensures uniform grouping of query heads across KV heads.
The validation logic appears in the CausalSelfAttention constructor:
assert self.num_heads % self.num_kv_heads == 0, (
f"num_heads ({self.num_heads}) must be divisible by "
f"num_kv_heads ({self.num_kv_heads}) for GQA"
)
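To see which KV head counts pass this check for a given number of query heads, the divisors can be enumerated. The sketch below is illustrative; valid_kv_head_counts is a hypothetical helper and does not exist in the repository.

```python
def valid_kv_head_counts(num_heads: int) -> list:
    """List every num_kv_heads value that divides num_heads evenly.

    Hypothetical helper mirroring the divisibility assertion above.
    """
    return [n for n in range(1, num_heads + 1) if num_heads % n == 0]


print(valid_kv_head_counts(12))  # [1, 2, 3, 4, 6, 12]
```

The smallest value corresponds to a single shared KV head, and the largest to one KV head per query head.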
Summary
- Enable Flash Attention by calling enable_flash_sdp(True) from torch.backends.cuda before model initialization, as implemented in train_gpt.py lines 64-68.
- Activate GQA by setting num_kv_heads lower than num_heads in the Hyperparameters class; the CausalSelfAttention.forward method automatically passes enable_gqa=True to scaled_dot_product_attention at lines 600-601.
- Ensure compatibility by using a PyTorch release that supports enable_gqa (2.5 or newer), CUDA 11.8+, and verifying that num_heads is divisible by num_kv_heads.
- Configure via environment variables by setting NUM_HEADS and NUM_KV_HEADS before launching train_gpt.py.
Frequently Asked Questions
What is the difference between MHA, MQA, and GQA in the Parameter-Golf implementation?
Multi-Head Attention (MHA) uses unique key and value heads for every query head (num_kv_heads = num_heads). Multi-Query Attention (MQA) shares a single key and value head across all query heads (num_kv_heads = 1). Grouped Query Attention (GQA) is the intermediate case where num_kv_heads is set between 1 and num_heads, creating groups where multiple query heads share the same KV head. The Parameter-Golf code automatically detects which mode to use based on the num_kv_heads vs num_heads comparison.
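The three cases can be summarized in a few lines. This is a sketch for illustration, not repository code: attention_mode is a hypothetical helper, and the enable_gqa value is computed with the same comparison shown in CausalSelfAttention.forward.

```python
def attention_mode(num_heads: int, num_kv_heads: int) -> str:
    """Name the attention variant implied by the head counts (hypothetical helper)."""
    if num_kv_heads == num_heads:
        return "MHA"  # one KV head per query head
    if num_kv_heads == 1:
        return "MQA"  # a single KV head shared by all query heads
    return "GQA"      # several query heads per KV head


for kv in (8, 4, 1):
    enable_gqa = kv != 8  # same comparison as the forward pass
    print(kv, attention_mode(8, kv), enable_gqa)
# 8 MHA False
# 4 GQA True
# 1 MQA True
```

Note that the flag is also True for MQA, since that case is simply GQA with a single group.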
Does Flash Attention work with all GQA configurations?
Flash Attention supports GQA configurations only when the number of query heads is a multiple of the number of key/value heads. This is enforced in the CausalSelfAttention.__init__ method, where assert self.num_heads % self.num_kv_heads == 0 validates the configuration. Because the assertion fires when the model is constructed, an invalid combination fails before any attention kernel is selected. Without that check, PyTorch would either fall back to a different SDPA implementation or raise an error, depending on which backends are enabled.
Can I enable Flash Attention without using GQA?
Yes. Flash Attention and GQA are independent features in the Parameter-Golf codebase. To use Flash Attention without GQA, set num_kv_heads equal to num_heads in your hyperparameters (or omit the NUM_KV_HEADS environment variable to use the default). The enable_flash_sdp(True) call will still activate the Flash Attention kernel, while the enable_gqa parameter will evaluate to False because num_kv_heads == num_heads, resulting in standard multi-head attention with Flash Attention optimization.
Where does the actual GQA broadcasting happen in the code?
The GQA broadcasting logic is handled internally by PyTorch's SDPA implementation, not manually in the Parameter-Golf source. The Parameter-Golf code only sets the enable_gqa=True flag when calling F.scaled_dot_product_attention in train_gpt.py lines 600-601. When this flag is true and the KV tensor dimensions indicate fewer heads than the query tensor, PyTorch automatically broadcasts the KV heads to match the query head groups during the attention computation. Your responsibility is only to ensure the input tensors have the correct shape: [batch, num_kv_heads, seq_len, head_dim] for keys/values versus [batch, num_heads, seq_len, head_dim] for queries.
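The shape requirement can be checked before calling attention. The following is a minimal sketch, assuming the [batch, heads, seq_len, head_dim] layout described above; check_gqa_shapes is a hypothetical helper that works on plain shape tuples and is not part of the repository or of PyTorch.

```python
def check_gqa_shapes(q_shape: tuple, kv_shape: tuple) -> int:
    """Validate query and key/value shapes and return the group size.

    Hypothetical helper; shapes are (batch, heads, seq_len, head_dim).
    Sequence lengths are not compared, since they may legitimately differ.
    """
    batch_q, heads_q, _, dim_q = q_shape
    batch_kv, heads_kv, _, dim_kv = kv_shape
    if batch_q != batch_kv or dim_q != dim_kv:
        raise ValueError("batch size and head_dim must match")
    if heads_q % heads_kv != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return heads_q // heads_kv  # query heads served by each KV head


print(check_gqa_shapes((2, 12, 32, 64), (2, 3, 32, 64)))  # 4
```

A return value of 1 indicates plain multi-head attention, in which case no broadcasting is needed.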