SDPA (Scaled Dot Product Attention) in OpenAI Whisper: Implementation and Usage

February 27, 2026 openai/whisper

SDPA (Scaled Dot Product Attention) is the core mathematical operation in Whisper's Transformer attention layers. OpenAI implements it with PyTorch's fused scaled_dot_product_attention kernel for performance and keeps a manual fallback for extracting raw attention scores during word-level timestamp alignment.

Whisper is an open-source automatic speech recognition (ASR) system built on the Transformer architecture. At the heart of both its AudioEncoder and TextDecoder lies SDPA (Scaled Dot Product Attention), the fundamental computation that allows the model to focus on relevant audio features and token relationships. According to the openai/whisper source code, this operation is implemented through a hybrid approach that leverages hardware acceleration when possible while preserving flexibility for analysis tasks.

What is Scaled Dot Product Attention?

The Mathematical Foundation

Scaled Dot Product Attention computes a weighted sum of values (V) based on the similarity between queries (Q) and keys (K). Given input tensors, the operation follows this formula:


Attention(Q, K, V) = softmax((Q × K^T) / sqrt(d_k)) × V

Here, d_k represents the dimensionality of each attention head. The division by sqrt(d_k), the scale, keeps dot-product values from growing with the head dimension; large values would otherwise push the softmax function into saturation and impede gradient flow during training.
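To make the saturation argument concrete, here is a minimal standard-library sketch. The head dimension, the number of keys, and the Gaussian inputs are illustrative assumptions, not values taken from Whisper. It compares the largest softmax weight with and without the 1/sqrt(d_k) scale.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d_k = 64       # illustrative head dimension (assumption)
n_keys = 16    # illustrative number of key positions (assumption)

query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw_scores = [dot(query, key) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# Without scaling, one key tends to take almost all of the weight.
print(f"max weight without scaling: {max(unscaled_weights):.3f}")
print(f"max weight with scaling:    {max(scaled_weights):.3f}")
```

Dividing the scores by a constant greater than one can only flatten the softmax distribution, so the scaled maximum is never larger than the unscaled one.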

PyTorch's Fused Kernel Implementation

PyTorch (2.0 and later) provides an optimized implementation accessible via torch.nn.functional.scaled_dot_product_attention. On supported hardware it fuses the scaling, softmax, and weighted-sum operations into a single kernel launch, which reduces memory traffic and latency compared with executing these steps sequentially. Whisper detects and uses this function automatically when it is available.
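The relationship between the single call and the step-by-step computation can be checked directly. The sketch below assumes PyTorch 2.0 or later and uses small illustrative shapes on the CPU; whether a truly fused kernel is selected depends on the device and backend, so this only demonstrates that the two routes agree numerically.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes: batch=2, heads=4, seq_len=10, head_dim=16
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)

# Step-by-step route: scale, softmax, weighted sum as separate operations
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
weights = F.softmax(scores, dim=-1)
manual_out = weights @ v

# Single-call route: PyTorch picks an implementation internally
fused_out = F.scaled_dot_product_attention(q, k, v)

max_diff = (manual_out - fused_out).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```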

How Whisper Implements SDPA in model.py

Architecture Overview

Whisper's architecture consists of two main components that both rely on SDPA:

AudioEncoder: Processes mel-spectrogram inputs through a stack of ResidualAttentionBlock layers using self-attention.
TextDecoder: Generates text tokens using ResidualAttentionBlock layers with both self-attention (for language modeling) and cross-attention (to attend over encoded audio features).

Both attention types are implemented in the MultiHeadAttention class defined in whisper/model.py.

SDPA Detection and Routing Logic

The codebase uses a guarded import to handle different PyTorch versions. In whisper/model.py lines 16-22, Whisper attempts to import the optimized kernel and sets a module-level flag:


# From whisper/model.py L16-22

try:
    from torch.nn.functional import scaled_dot_product_attention as sdpa
    SDPA_AVAILABLE = True
except ImportError:
    SDPA_AVAILABLE = False

The MultiHeadAttention class (L123-133) delegates computation to its qkv_attention method, which chooses between the fused kernel and the manual implementation based on the module-level SDPA_AVAILABLE flag and the class-level use_sdpa attribute.
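The routing decision described above can be sketched as a small standalone dispatcher. The names FUSED_AVAILABLE and route_attention are hypothetical and exist only for this illustration; this is not Whisper's code, just one plausible shape for the same idea: use the fused call when it is present and enabled, otherwise compute the scores by hand and return them.

```python
import math

import torch
import torch.nn.functional as F

# Hypothetical availability flag, analogous in spirit to SDPA_AVAILABLE
FUSED_AVAILABLE = hasattr(F, "scaled_dot_product_attention")


def route_attention(q, k, v, use_fused=True):
    """Hypothetical dispatcher returning (output, raw_scores_or_None)."""
    if FUSED_AVAILABLE and use_fused:
        # Fused route: intermediate scores are not exposed
        return F.scaled_dot_product_attention(q, k, v), None
    # Manual route: raw pre-softmax scores are available to the caller
    qk = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return F.softmax(qk, dim=-1) @ v, qk


torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 5, 8)

out_fused, scores_fused = route_attention(q, k, v, use_fused=True)
out_manual, scores_manual = route_attention(q, k, v, use_fused=False)

print(scores_fused is None, tuple(scores_manual.shape))
```

Returning None for the scores on the fused route makes the trade-off explicit to callers: speed on one side, access to the raw scores on the other.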

The Manual Fallback Path

When the SDPA kernel is unavailable or explicitly disabled, Whisper falls back to a manual implementation (L30-38) that explicitly executes each attention step:


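# Conceptual excerpt from whisper/model.py L30-38
# (wrapped in a standalone function; the name and signature are illustrative)
import torch.nn.functional as F

def manual_attention(q, k, v, n_state, n_head, mask=None):
    # Split the 1/sqrt(d_k) factor evenly between queries and keys
    scale = (n_state // n_head) ** -0.25
    q = q * scale
    k = k * scale

    # Compute raw attention scores (Q·K^T)
    qk = q @ k.transpose(-2, -1)

    # Apply the optional additive mask, then softmax
    if mask is not None:
        qk = qk + mask
    w = F.softmax(qk, dim=-1)

    # Weighted sum with values
    return w @ v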

This fallback applies the scaling factor (n_state // n_head) ** -0.25 to both queries and keys before the matrix multiplication, so their product carries the combined factor 1/sqrt(d_k). It then explicitly computes the softmax over the attention scores.
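A quick arithmetic check shows why a power of -0.25 on each side matches the formula's division by sqrt(d_k). The head dimension and the scalar values below are illustrative assumptions. One possible motivation for splitting the factor, assumed here rather than stated in the source, is to keep intermediate products smaller in reduced precision.

```python
import math

d_head = 64               # illustrative head dimension (assumption)
scale = d_head ** -0.25   # applied to the query and the key separately

# A single query/key component pair, for illustration only
q_val, k_val = 3.0, -1.5

split = (q_val * scale) * (k_val * scale)          # scale each side first
single = (q_val * k_val) / math.sqrt(d_head)       # scale the product once

print(split, single)
```

Since d^(-1/4) · d^(-1/4) = d^(-1/2), the two values agree up to floating-point rounding.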

Disabling SDPA for Raw Attention Extraction

Why Disable the Fused Kernel?

The fused SDPA kernel operates as a black box and does not expose intermediate values. However, Whisper's word-level timestamp alignment algorithm requires access to the raw Q·K matrix (attention scores before softmax normalization).

To support this, Whisper provides the disable_sdpa() context manager (L71-78 in model.py). When active, this forces the manual fallback path, allowing forward hooks to capture the raw attention matrices.
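The pattern behind such a context manager can be sketched without any Whisper code. ToyAttention, use_fused, and disable_fused are hypothetical names for this illustration; the sketch only shows how a class-level switch can be turned off temporarily and restored afterwards, even if the body raises.

```python
from contextlib import contextmanager


class ToyAttention:
    """Hypothetical stand-in for an attention layer with a class-level switch."""

    use_fused = True  # shared by every instance of the class

    def path(self):
        return "fused" if ToyAttention.use_fused else "manual"


@contextmanager
def disable_fused():
    """Temporarily force the manual path, restoring the previous setting on exit."""
    previous = ToyAttention.use_fused
    try:
        ToyAttention.use_fused = False
        yield
    finally:
        ToyAttention.use_fused = previous


layer = ToyAttention()
before = layer.path()
with disable_fused():
    inside = layer.path()
after = layer.path()

print(before, inside, after)  # fused manual fused
```

Because the flag lives on the class rather than on an instance, one `with` block switches every layer at once, which is convenient when a whole model must run on the manual path.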

Usage in timing.py

In whisper/timing.py lines 94-98, the alignment routine explicitly disables SDPA to extract the raw attention scores:


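# Simplified from whisper/timing.py L94-98
with torch.no_grad(), disable_sdpa():
    # Forward hooks registered on the decoder's cross-attention layers
    # capture the raw qk matrices during this forward pass;
    # the model call itself returns only the logits
    logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]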

Without this mechanism, the timing code could not access the pre-softmax attention scores necessary for forced alignment.
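The hook-based capture can be illustrated on a toy module. ToyCrossAttention and save_scores are hypothetical names, and the layer is a simplified single-head stand-in rather than Whisper's implementation; the sketch assumes only that the layer returns an (output, raw_scores) tuple, as a manual attention path would.

```python
import torch
from torch import nn


class ToyCrossAttention(nn.Module):
    """Hypothetical layer returning (output, raw_scores) like a manual path."""

    def forward(self, q, k, v):
        # Raw pre-softmax scores, scaled by 1/sqrt(d_k)
        qk = q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5
        return torch.softmax(qk, dim=-1) @ v, qk


layer = ToyCrossAttention()
captured = []


def save_scores(module, inputs, outputs):
    # The last element of the output tuple holds the raw scores
    captured.append(outputs[-1].detach())


handle = layer.register_forward_hook(save_scores)

torch.manual_seed(0)
q = torch.randn(1, 4, 8)        # 4 text positions
k = v = torch.randn(1, 6, 8)    # 6 audio positions

with torch.no_grad():
    layer(q, k, v)
handle.remove()  # always detach hooks once the capture is done

print(tuple(captured[0].shape))  # (1, 4, 6)
```

Each row of the captured matrix relates one text position to every audio position, which is the kind of information an alignment step needs.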

Code Examples

Using PyTorch's SDPA Kernel Directly

import torch
from torch.nn.functional import scaled_dot_product_attention

# Example dimensions: batch=2, heads=8, seq_len=100, head_dim=64

Q = torch.randn(2, 8, 100, 64, device="cuda")
K = torch.randn(2, 8, 100, 64, device="cuda")
V = torch.randn(2, 8, 100, 64, device="cuda")

# Compute with fused kernel (causal masking for autoregressive decoding)

output = scaled_dot_product_attention(Q, K, V, is_causal=True)
print(output.shape)  # torch.Size([2, 8, 100, 64])
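For readers without a GPU, the effect of is_causal=True can be reproduced on the CPU with an explicit additive mask. This is a minimal sketch with illustrative shapes, assuming PyTorch 2.0 or later: positions above the diagonal receive -inf so that each query attends only to itself and earlier positions.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes: batch=1, heads=2, seq_len=6, head_dim=8
q = torch.randn(1, 2, 6, 8)
k = torch.randn(1, 2, 6, 8)
v = torch.randn(1, 2, 6, 8)

n_ctx = q.shape[-2]
# Additive mask: -inf above the diagonal blocks attention to future positions
mask = torch.full((n_ctx, n_ctx), float("-inf")).triu(1)

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]) + mask
manual = F.softmax(scores, dim=-1) @ v

fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-4))
```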

Controlling SDPA in Whisper's MultiHeadAttention

import torch
from whisper.model import MultiHeadAttention, disable_sdpa

# Initialize attention layer (state dimension 512, 8 heads)
mha = MultiHeadAttention(n_state=512, n_head=8)

# Sample input: batch=2, sequence=150, features=512
x = torch.randn(2, 150, 512)

# Default path: uses SDPA if PyTorch supports it.
# forward() returns an (output, qk) pair; qk is None on the fused path.
output_sdpa, qk_none = mha(x)

# Force manual fallback to extract raw QK scores
with disable_sdpa():
    output_manual, qk_scores = mha(x)
    # qk_scores contains raw attention logits before softmax

Extracting Attention for Word-Level Timestamps

from whisper import load_model
from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim
from whisper.timing import find_alignment
from whisper.tokenizer import get_tokenizer

# Load model and audio
model = load_model("base")
audio = load_audio("speech.wav")
mel = log_mel_spectrogram(pad_or_trim(audio)).to(model.device)

# Build the tokenizer separately (the model object does not carry one)
tokenizer = get_tokenizer(model.is_multilingual, language="en", task="transcribe")

# Tokenize transcript
text = "This is a test transcription"
tokens = tokenizer.encode(text)

# Alignment internally disables SDPA to capture raw attention
alignments = find_alignment(model, tokenizer, tokens, mel, num_frames=mel.shape[-1])

for word in alignments:
    print(f"{word.word}: {word.start:.2f}s - {word.end:.2f}s")

Summary

SDPA is the fundamental attention computation in Whisper's Transformer architecture, scaling query-key dot products by 1/sqrt(d_k) to keep the softmax out of saturation and gradients stable.
Whisper automatically detects and uses PyTorch's scaled_dot_product_attention fused kernel in whisper/model.py for optimal GPU performance.
The manual fallback path explicitly computes softmax((Q × K^T) / sqrt(d_k)) × V and is used when the optimized kernel is unavailable or has been disabled.
Word-level timestamp extraction requires raw attention scores, so whisper/timing.py temporarily disables SDPA via the disable_sdpa() context manager defined in whisper/model.py.
The MultiHeadAttention class controls this behavior through its class-level use_sdpa flag, allowing runtime switching between performance and introspection.

Frequently Asked Questions

What does SDPA stand for in the context of Whisper?

SDPA stands for Scaled Dot Product Attention, the standard attention mechanism introduced in the original "Attention Is All You Need" Transformer paper. In Whisper's implementation, it refers specifically to the computation that calculates compatibility between queries and keys, scales the result by the square root of the head dimension, applies softmax, and weights the values accordingly.

Why does Whisper disable SDPA during word-level timestamp alignment?

Whisper disables the fused SDPA kernel during alignment because the optimized torch.nn.functional.scaled_dot_product_attention operates as a monolithic kernel that does not return intermediate tensors. The alignment algorithm in whisper/timing.py requires the raw Q·K attention scores (before softmax normalization) from the decoder's cross-attention layers to align text tokens with audio frames. By using the disable_sdpa() context manager, Whisper forces the manual fallback path that exposes these raw matrices through forward hooks.

How can I verify if Whisper is using the SDPA kernel or the fallback?

You can check the global SDPA_AVAILABLE flag imported from whisper.model, which is set to True only when PyTorch's scaled_dot_product_attention is successfully imported. Additionally, you can inspect the MultiHeadAttention.use_sdpa attribute on any attention layer. To force the fallback for debugging or analysis, wrap your inference call in the disable_sdpa() context manager provided in the same module.
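A small helper along these lines can report both flags in one call. It is a sketch that assumes the openai-whisper package is installed and exposes SDPA_AVAILABLE and MultiHeadAttention.use_sdpa as described above; the function name whisper_sdpa_status and the dictionary keys are hypothetical. If the import fails, it returns None instead of raising.

```python
def whisper_sdpa_status():
    """Return SDPA flags from whisper.model, or None if they cannot be imported."""
    try:
        from whisper.model import SDPA_AVAILABLE, MultiHeadAttention
    except ImportError:
        # Package missing, or these names are not present in the installed version
        return None
    return {
        "kernel_importable": SDPA_AVAILABLE,
        "layer_flag": getattr(MultiHeadAttention, "use_sdpa", None),
    }


status = whisper_sdpa_status()
if status is None:
    print("whisper.model flags could not be imported")
else:
    print(status)
```

Both entries need to be true for the fused path to be taken, so checking only one of them can be misleading.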

What is the performance difference between SDPA and the manual fallback?

The fused SDPA kernel reduces memory bandwidth and kernel launch overhead by combining the scale, matmul, softmax, and final matmul operations into a single GPU kernel. In the manual fallback path implemented in whisper/model.py lines 30-38, each operation executes separately, which creates additional memory traffic and materializes the full attention matrix. The two paths are mathematically equivalent and agree up to floating-point rounding, so both maintain model accuracy. The size of the speedup depends on the GPU, data type, and sequence length; no benchmark figures are given here.
