# SDPA (Scaled Dot Product Attention) in OpenAI Whisper: Implementation and Usage

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Published: 2026-02-27

---

**SDPA (Scaled Dot Product Attention) is the core mathematical operation powering Whisper's Transformer attention layers, which OpenAI implements using PyTorch's fused `scaled_dot_product_attention` kernel for performance while maintaining a manual fallback for extracting raw attention scores during word-level timestamp alignment.**

Whisper is an open-source automatic speech recognition (ASR) system built on the Transformer architecture. At the heart of both its **AudioEncoder** and **TextDecoder** lies **SDPA (Scaled Dot Product Attention)**, the fundamental computation that allows the model to focus on relevant audio features and token relationships. According to the `openai/whisper` source code, this operation is implemented through a hybrid approach that leverages hardware acceleration when possible while preserving flexibility for analysis tasks.

## What is Scaled Dot Product Attention?

### The Mathematical Foundation

**Scaled Dot Product Attention** computes a weighted sum of values (**V**) based on the similarity between queries (**Q**) and keys (**K**). Given input tensors, the operation follows this formula:

```
Attention(Q, K, V) = softmax((Q × K^T) / sqrt(d_k)) × V
```

Here, **d_k** represents the dimensionality of each attention head. The division by `sqrt(d_k)`—the *scale*—prevents dot-product values from growing too large, which would otherwise push the **softmax** function into saturation and impede gradient flow during training.
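The effect of the scale can be checked numerically. The sketch below uses only the standard library; the head dimension, number of keys, and random seed are arbitrary assumptions chosen for illustration, not values taken from Whisper.

```python
import math
import random


def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
d_k = 64  # assumed head dimension
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

raw_scores = [dot(q, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# The unscaled weights concentrate on one key; the scaled ones are flatter
print(f"largest unscaled weight: {max(unscaled_weights):.3f}")
print(f"largest scaled weight:   {max(scaled_weights):.3f}")
```

Dividing the scores by a factor greater than one can only flatten the softmax output, so the largest scaled weight never exceeds the largest unscaled one.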

### PyTorch's Fused Kernel Implementation

PyTorch 2.0 and later expose an optimized implementation through `torch.nn.functional.scaled_dot_product_attention`. Where a fused backend is available (for example FlashAttention or memory-efficient attention on supported GPUs), the scaling, softmax, and weighted sum run in a single kernel, so the full attention-weight matrix does not have to be written to GPU memory and read back between steps. Whisper detects this function at import time and uses it automatically when available.
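A quick way to see that the fused call computes the same result as the step-by-step formula is to compare the two on small random tensors. This is a minimal sketch assuming PyTorch 2.0 or later; the tensor shapes are arbitrary and it runs on CPU.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed toy shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 2, 5, 16)
k = torch.randn(1, 2, 5, 16)
v = torch.randn(1, 2, 5, 16)

# Unfused reference: each step produces its own intermediate tensor
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
weights = scores.softmax(dim=-1)
reference = weights @ v

# Fused call: only the final output is returned, never the weights
fused = F.scaled_dot_product_attention(q, k, v)

max_abs_diff = (fused - reference).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The difference should be at the level of floating-point rounding. Note that the fused call gives no access to `scores` or `weights`, which is the limitation the manual path works around.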

## How Whisper Implements SDPA in model.py

### Architecture Overview

Whisper's architecture consists of two main components that both rely on SDPA:

- **AudioEncoder**: Processes mel-spectrogram inputs through a stack of `ResidualAttentionBlock` layers using **self-attention**.
- **TextDecoder**: Generates text tokens using `ResidualAttentionBlock` layers with both **self-attention** (for language modeling) and **cross-attention** (to attend over encoded audio features).

Both attention types are implemented in the **`MultiHeadAttention`** class defined in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py).

### SDPA Detection and Routing Logic

The codebase uses a guarded import to handle different PyTorch versions. In [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) lines 16-22, Whisper attempts to import the optimized kernel and sets a module-level flag:

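```python
# From whisper/model.py L16-22
try:
    from torch.nn.functional import scaled_dot_product_attention

    SDPA_AVAILABLE = True
except (ImportError, RuntimeError, OSError):
    # Older PyTorch builds lack the function; fall back to the manual path
    scaled_dot_product_attention = None
    SDPA_AVAILABLE = False
```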

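The `MultiHeadAttention` class (L123-133) delegates computation to its `qkv_attention` method, which chooses between the fused kernel and the manual implementation based on the module-level `SDPA_AVAILABLE` flag and the class-level attribute `MultiHeadAttention.use_sdpa`. On the fused path the method returns `None` in place of the attention scores, because the kernel does not expose them.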

### The Manual Fallback Path

When the SDPA kernel is unavailable or explicitly disabled, Whisper falls back to a manual implementation (L30-38) that explicitly executes each attention step:

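```python
# Conceptual excerpt from whisper/model.py L30-38
import torch.nn.functional as F


def manual_attention(q, k, v, n_state, n_head, mask=None):
    # manual_attention is an illustrative wrapper name, not Whisper's own
    scale = (n_state // n_head) ** -0.25
    q = q * scale
    k = k * scale

    # Compute raw attention scores (Q·K^T)
    qk = q @ k.transpose(-2, -1)

    # Apply the optional mask, then softmax
    if mask is not None:
        qk = qk + mask
    w = F.softmax(qk, dim=-1)

    # Weighted sum with values
    return w @ v
```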

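This fallback applies the scaling factor `(n_state // n_head) ** -0.25` to both the queries and the keys before the matrix multiplication. Because the factor appears twice in the product, the scores are effectively multiplied by `d_k ** -0.5`, that is, divided by `sqrt(d_k)` as in the formula above. The **softmax** is then computed explicitly over those scores.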
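The `** -0.25` exponent can look surprising next to the `sqrt(d_k)` in the formula. The following standard-library sketch checks that scaling the query and the key by `d_k ** -0.25` each gives the same score as dividing their dot product by `sqrt(d_k)`. The dimensions match the 512-state, 8-head example used later in this article; the vectors are random placeholders.

```python
import math
import random

n_state, n_head = 512, 8
d_k = n_state // n_head  # per-head dimension
scale = d_k ** -0.25

random.seed(1)
q = [random.gauss(0, 1) for _ in range(d_k)]
k = [random.gauss(0, 1) for _ in range(d_k)]

# Whisper-style: scale q and k separately, then take the dot product
split_score = sum((a * scale) * (b * scale) for a, b in zip(q, k))

# Textbook form: dot product first, then divide by sqrt(d_k)
standard_score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)

print(math.isclose(split_score, standard_score, rel_tol=1e-9))
```

A plausible motivation for splitting the factor, stated here as an assumption rather than something the source confirms, is that it keeps the intermediate products smaller, which helps in half precision.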

## Disabling SDPA for Raw Attention Extraction

### Why Disable the Fused Kernel?

The fused SDPA kernel returns only the final attention output and does not expose intermediate values. However, Whisper's word-level timestamp alignment algorithm requires access to the **raw Q·K matrix** (attention scores before softmax normalization).

To support this, Whisper provides the `disable_sdpa()` context manager (L71-78 in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py)). While active, it forces every `MultiHeadAttention` layer onto the manual fallback path, so forward hooks can capture the raw attention matrices.
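One way such a context manager can work is to flip a class-level flag and restore it on exit. The sketch below is a self-contained illustration of that pattern only: `ToyAttention`, `use_fused`, and `disable_fused` are hypothetical names, and this is not Whisper's code.

```python
from contextlib import contextmanager


class ToyAttention:
    """Hypothetical stand-in for an attention layer with a class-level switch."""

    use_fused = True  # shared by every instance

    def path(self):
        return "fused" if ToyAttention.use_fused else "manual"


@contextmanager
def disable_fused():
    previous = ToyAttention.use_fused
    ToyAttention.use_fused = False
    try:
        yield
    finally:
        # Restore the earlier value even if the body raised an exception
        ToyAttention.use_fused = previous


layer = ToyAttention()
before = layer.path()
with disable_fused():
    during = layer.path()
after = layer.path()

print(before, during, after)
```

Because the flag lives on the class, a single `with` block switches every layer at once, and the `finally` clause guarantees the fast path is re-enabled afterwards.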

### Usage in timing.py

In [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) lines 94-98, the alignment routine explicitly disables SDPA to extract raw attention weights:

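```python
# Simplified from whisper/timing.py L94-98
with torch.no_grad(), disable_sdpa():
    # Forward hooks registered beforehand on the decoder's cross-attention
    # layers capture the raw qk matrices during this pass; the model call
    # itself returns only the logits
    logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]
```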

Without this mechanism, the timing code could not access the pre-softmax attention scores necessary for forced alignment.
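The hook-based capture can be illustrated with a toy module. This is a minimal sketch assuming only PyTorch's standard `register_forward_hook` API; `ToyAttention` is a hypothetical stand-in that returns `(output, qk)` and is not Whisper's layer.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical layer returning (output, raw scores), like a manual path."""

    def forward(self, x):
        qk = x @ x.transpose(-2, -1)  # raw scores before softmax
        return qk.softmax(dim=-1) @ x, qk


layers = [ToyAttention() for _ in range(3)]
captured = [None] * len(layers)

# Each hook stores the last element of the layer's output tuple (the scores).
# list.__setitem__ returns None, so the layer's output is left unchanged.
hooks = [
    layer.register_forward_hook(
        lambda _module, _inputs, outputs, index=i: captured.__setitem__(
            index, outputs[-1]
        )
    )
    for i, layer in enumerate(layers)
]

x = torch.randn(1, 4, 8)
for layer in layers:
    x, _ = layer(x)

# Remove the hooks once the scores have been collected
for hook in hooks:
    hook.remove()

print([tuple(qk.shape) for qk in captured])
```

If the layers returned `None` for the scores, as happens on a fused path, the hooks would have nothing to capture, which is why the fallback has to be forced first.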

## Code Examples

### Using PyTorch's SDPA Kernel Directly

```python
import torch
from torch.nn.functional import scaled_dot_product_attention

# Use the GPU when present; the function also runs on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example dimensions: batch=2, heads=8, seq_len=100, head_dim=64
Q = torch.randn(2, 8, 100, 64, device=device)
K = torch.randn(2, 8, 100, 64, device=device)
V = torch.randn(2, 8, 100, 64, device=device)

# Compute with fused kernel (causal masking for autoregressive decoding)
output = scaled_dot_product_attention(Q, K, V, is_causal=True)
print(output.shape)  # torch.Size([2, 8, 100, 64])
```

### Controlling SDPA in Whisper's MultiHeadAttention

```python
import torch
from whisper.model import MultiHeadAttention, disable_sdpa

# Initialize attention layer (state dimension 512, 8 heads)
mha = MultiHeadAttention(n_state=512, n_head=8)

# Sample input: batch=2, sequence=150, features=512
x = torch.randn(2, 150, 512)

# Default path: uses SDPA if PyTorch supports it.
# forward() always returns (output, qk); qk is None on the fused path
output_sdpa, qk_none = mha(x)

# Force manual fallback to extract raw QK scores
with disable_sdpa():
    output_manual, qk_scores = mha(x)
    # qk_scores contains raw attention logits before softmax,
    # shaped (batch, heads, seq, seq) = (2, 8, 150, 150)
```

### Extracting Attention for Word-Level Timestamps

```python
from whisper import load_model
from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim
from whisper.timing import find_alignment
from whisper.tokenizer import get_tokenizer

# Load model and audio; the mel tensor must live on the model's device
model = load_model("base")
audio = load_audio("speech.wav")
mel = log_mel_spectrogram(pad_or_trim(audio)).to(model.device)

# The model object has no tokenizer attribute, so build one explicitly
tokenizer = get_tokenizer(model.is_multilingual, language="en", task="transcribe")

# Tokenize transcript
text = "This is a test transcription"
tokens = tokenizer.encode(text)

# Alignment internally disables SDPA to capture raw attention
alignments = find_alignment(model, tokenizer, tokens, mel, num_frames=mel.shape[-1])

for word in alignments:
    print(f"{word.word}: {word.start:.2f}s - {word.end:.2f}s")
```

## Summary

- **SDPA** is the fundamental attention computation in Whisper's Transformer architecture, dividing query-key dot products by `sqrt(d_k)` to maintain stable gradients.
- Whisper automatically detects and uses PyTorch's `scaled_dot_product_attention` fused kernel in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) for optimal GPU performance.
- The manual fallback path explicitly computes `softmax((Q × K^T) / sqrt(d_k)) × V` and is used when the optimized kernel is unavailable or has been disabled.
- Word-level timestamp extraction requires raw attention scores, so [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) temporarily **disables SDPA** via the `disable_sdpa()` context manager defined in `whisper/model.py`.
- The `MultiHeadAttention` class controls this behavior through its class-level `use_sdpa` flag, allowing runtime switching between performance and introspection.

## Frequently Asked Questions

### What does SDPA stand for in the context of Whisper?

**SDPA** stands for **Scaled Dot Product Attention**, the standard attention mechanism introduced in the original "Attention Is All You Need" Transformer paper. In Whisper's implementation, it refers specifically to the computation that calculates compatibility between queries and keys, divides the result by the square root of the head dimension, applies softmax, and weights the values accordingly.

### Why does Whisper disable SDPA during word-level timestamp alignment?

Whisper disables the fused SDPA kernel during alignment because the optimized `torch.nn.functional.scaled_dot_product_attention` returns only the attention output, not intermediate tensors. The alignment algorithm in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) requires the raw **Q·K** cross-attention scores (before softmax normalization) to work out which audio frames each text token attends to. By using the `disable_sdpa()` context manager, Whisper forces the manual fallback path that exposes these raw matrices through forward hooks.

### How can I verify if Whisper is using the SDPA kernel or the fallback?

You can check the global `SDPA_AVAILABLE` flag imported from `whisper.model`, which is set to `True` only when PyTorch's `scaled_dot_product_attention` is successfully imported. Additionally, you can inspect the `MultiHeadAttention.use_sdpa` attribute on any attention layer. To force the fallback for debugging or analysis, wrap your inference call in the `disable_sdpa()` context manager provided in the same module.

### What is the performance difference between SDPA and the manual fallback?

The fused SDPA kernel reduces memory traffic and kernel launch overhead by combining the scale, matmul, softmax, and final matmul operations into a single GPU kernel. In the manual fallback path implemented in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) lines 30-38, each operation executes separately and writes its intermediate result to memory before the next one reads it back. The two paths are mathematically equivalent, so their outputs agree up to floating-point rounding and model accuracy is preserved. The fused kernel is typically faster on modern GPUs, but the size of the gain depends on the hardware, data type, and sequence length.
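Since the gap varies by setup, it is worth measuring locally. The following is a rough timing sketch, not a rigorous benchmark: it assumes PyTorch 2.0 or later, uses arbitrary tensor shapes, and `manual_attention` and `time_ms` are hypothetical helper names written for this example.

```python
import math
import time

import torch
import torch.nn.functional as F


def manual_attention(q, k, v):
    # Unfused reference: each step materializes an intermediate tensor
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v


def time_ms(fn, *args, repeats=10):
    fn(*args)  # warm-up run, excluded from the timing
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work to finish
    return (time.perf_counter() - start) * 1000 / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed shapes: batch=1, heads=8, seq_len=512, head_dim=64
q, k, v = (torch.randn(1, 8, 512, 64, device=device) for _ in range(3))

with torch.no_grad():
    fused_ms = time_ms(F.scaled_dot_product_attention, q, k, v)
    manual_ms = time_ms(manual_attention, q, k, v)

print(f"fused: {fused_ms:.3f} ms, manual: {manual_ms:.3f} ms on {device}")
```

On CPU or with short sequences the two numbers may be close; the advantage of the fused path tends to show up on GPUs with longer sequences and half-precision inputs.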