Core Components of the Whisper Model: AudioEncoder and TextDecoder Explained
The Whisper model consists of two primary neural components: an AudioEncoder that converts mel-spectrograms into latent embeddings, and a TextDecoder that generates transcription tokens via cross-attention to those embeddings.
The openai/whisper repository implements a transformer-based automatic speech recognition (ASR) system. Understanding the core components of the Whisper model—specifically how the AudioEncoder and TextDecoder interact—is essential for customizing inference, debugging outputs, or fine-tuning the architecture.
AudioEncoder: Converting Sound to Latent Representations
The AudioEncoder class, defined in whisper/model.py, processes raw mel-spectrogram inputs and produces a sequence of dense audio embeddings. This module uses a convolutional front-end followed by a stack of transformer blocks.
Convolutional Front-End
The encoder begins with two 1-D convolutional layers that down-sample the temporal dimension while expanding the feature depth. According to the source code at lines 179-180, these are defined as:
- self.conv1 – initial convolution with kernel_size=3 and stride=1
- self.conv2 – strided convolution with kernel_size=3 and stride=2 that reduces the time steps by half
These layers extract local spectral patterns before the transformer layers model global temporal dependencies.
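The effect of the two convolutions on sequence length can be checked with the standard 1-D convolution length formula. The sketch below is plain Python, not code from the repository. `padding=1` and the 3000-frame input are assumptions made for illustration, since the text above only states kernel size and stride.

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution (dilation fixed at 1)."""
    return (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1


# Assumed values for illustration: padding=1 on both layers, 3000 input frames.
n_frames = 3000
after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)
print(after_conv1, after_conv2)  # 3000 1500
```

Under these assumptions the first layer preserves the number of time steps and only the strided second layer halves it.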
Positional Embeddings and Transformer Blocks
After the convolutional layers, the encoder adds fixed sinusoidal positional embeddings (line 181) to inject temporal position information. These are precomputed and stored as a non-trainable buffer rather than learned. The core computation occurs in a stack of ResidualAttentionBlock layers (lines 183-185), each containing:
- Self-attention – allows the model to attend to any position in the audio sequence
- MLP – a feed-forward network that processes each position independently
The number of blocks varies by model size (tiny, base, small, medium, large).
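The sinusoidal part of the positional embedding can be illustrated with a small, dependency-free sketch. It shows one common formulation (sines concatenated with cosines over geometrically spaced timescales). The function name, the `max_timescale` default, and the column layout are assumptions and may differ from the repository's implementation.

```python
import math


def sinusoids(length: int, channels: int, max_timescale: float = 10000.0) -> list:
    """Return a (length, channels) table: sines in the first half, cosines in the second."""
    assert channels % 2 == 0 and channels >= 4, "channels must be even and at least 4"
    half = channels // 2
    # Timescales are spaced geometrically from 1 up to max_timescale.
    log_increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_increment * i) for i in range(half)]
    table = []
    for pos in range(length):
        angles = [pos * inv for inv in inv_timescales]
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table


pe = sinusoids(length=4, channels=8)
print(len(pe), len(pe[0]))  # 4 8
```

Each row depends only on its position index, so the table can be computed once and added to the convolution output at every forward pass.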
Forward Pass Implementation
The forward method (lines 188-204) implements the complete encoding pipeline:
def forward(self, x: Tensor):
    # x: (batch, n_mels, n_ctx)
    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))
    x = x.permute(0, 2, 1)  # (batch, time, dim)
    x = (x + self.positional_embedding).to(x.dtype)
    for block in self.blocks:
        x = block(x)  # self-attention + MLP
    return self.ln_post(x)
The final LayerNorm (ln_post, line 186) stabilizes the output, producing a tensor of shape (batch, n_audio_ctx, n_audio_state) that serves as the input to the TextDecoder.
TextDecoder: Generating Tokens from Audio Embeddings
The TextDecoder class (also in whisper/model.py) is an autoregressive transformer that predicts the next text token given previous tokens and the audio embeddings from the encoder. It employs cross-attention layers to query the audio representation while maintaining causal self-attention over text positions.
Token and Positional Embeddings
The decoder begins by converting token IDs into dense vectors using self.token_embedding (line 213), sharing the same dimensionality as the encoder output (n_state). It adds learned positional embeddings (line 214) to encode sequence position, sliced according to the current offset when using key-value caching.
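The offset-based slicing can be sketched with toy lookup tables. This is an illustration in plain Python, not the repository's code. `embed_tokens`, both tables, and all values are hypothetical.

```python
def embed_tokens(token_ids, token_table, pos_table, offset=0):
    """Add each token vector to the positional row for its absolute position."""
    positions = pos_table[offset : offset + len(token_ids)]
    return [
        [t + p for t, p in zip(token_table[tok], pos)]
        for tok, pos in zip(token_ids, positions)
    ]


# Toy tables: vocabulary of 3 tokens, 4 positions, width 2 (all values hypothetical).
token_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pos_table = [[0.0, 0.1], [0.0, 0.2], [0.0, 0.3], [0.0, 0.4]]

full = embed_tokens([1, 2, 1], token_table, pos_table)      # no cache: positions 0..2
step = embed_tokens([1], token_table, pos_table, offset=2)  # cached step: position 2 only
print(step[0] == full[2])  # True
```

With a cache, only the newest token is embedded, and the offset ensures it receives the same positional row it would have had in the full sequence.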
Cross-Attention Mechanism
Unlike the encoder, each ResidualAttentionBlock in the decoder (lines 217-220) is initialized with cross_attention=True. This adds a multi-head attention layer where:
- Queries come from the decoder's hidden states
- Keys and Values come from the encoder's audio embeddings (the xa parameter)
This mechanism allows the model to align text generation with specific acoustic features.
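The query/key/value roles above can be made concrete with a single-head scaled dot-product sketch. It is a simplification in plain Python: the learned query, key, and value projections and the multi-head split are omitted (an assumption made for brevity), and all vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (text position) takes a weighted average of the values (audio positions)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]
    width = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in weights
    ]
    return outputs, weights


text_states = [[1.0, 0.0], [0.0, 1.0]]               # 2 text positions (queries)
audio_states = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # 3 audio positions (keys and values)
outputs, weights = cross_attention(text_states, audio_states, audio_states)
print(len(outputs), len(weights[0]))  # 2 3
```

The weight matrix has one row per text position and one column per audio position, so the output length follows the text sequence while the content is drawn from the audio sequence.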
Causal Masking and Output Projection
The decoder enforces autoregressive generation through a triangular causal mask registered as self.mask (lines 224-226). This mask prevents the model from attending to future token positions during self-attention.
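A triangular mask of this kind is usually additive: zero where attention is allowed and negative infinity where it is not, so that the softmax assigns zero weight to future positions. The sketch below shows the idea in plain Python. The helper names are hypothetical and the repository builds its mask with tensor operations instead.

```python
import math


def causal_mask(n_ctx):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else float("-inf") for j in range(n_ctx)] for i in range(n_ctx)]


def masked_softmax(scores, mask_row):
    """Softmax over scores after adding the mask row; masked entries get weight 0."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    peak = max(shifted)
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
uniform_scores = [0.0, 0.0, 0.0]
for i, row in enumerate(mask):
    # Position i can only distribute attention over positions 0..i.
    print(i, masked_softmax(uniform_scores, row))
```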
After processing through the transformer blocks and final layer normalization (self.ln, line 222), the model projects hidden states back to vocabulary logits using the transposed token embedding matrix:
logits = (x @ self.token_embedding.weight.t()).float()
This weight tying between input embeddings and output projection is a standard transformer optimization.
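Weight tying means the output score for each token is the dot product between the final hidden state and that token's input embedding. A minimal sketch in plain Python, using a hypothetical 4-token embedding matrix:

```python
def tied_logits(hidden, embedding):
    """Score one hidden vector against every embedding row (hidden @ E^T)."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


# Toy embedding matrix: 4 tokens, width 3 (values are hypothetical).
embedding = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

hidden = [0.1, 0.9, 0.0]  # final decoder state for one position
logits = tied_logits(hidden, embedding)
best = max(range(len(logits)), key=logits.__getitem__)
print(best)  # 1
```

Because the same matrix is used for lookup and projection, no separate output layer of size (vocab, n_state) has to be stored.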
The Whisper Class: Wiring Encoder and Decoder Together
The top-level Whisper class (lines 256+ in whisper/model.py) orchestrates the two core components. It exposes several key methods:
- embed_audio (line 87) – wraps AudioEncoder.forward to process mel-spectrograms
- logits (line 90) – wraps TextDecoder.forward to compute next-token logits
- forward (lines 94-96) – end-to-end pass that feeds the encoder output straight into the decoder
The class also imports utility functions, decode and detect_language from whisper/decoding.py and transcribe from whisper/transcribe.py, and binds them as methods (lines 43-45).
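The wiring pattern can be sketched with a toy class. `ToyWhisper`, `transcribe_function`, and the lambda stubs are hypothetical stand-ins that only mirror the delegation and method-binding described above, not the real model.

```python
def transcribe_function(model, audio):
    """Stand-in for a module-level function that takes the model as its first argument."""
    return model.forward(audio, tokens=[0])


class ToyWhisper:
    """Hypothetical stand-in showing the wiring only; encoder and decoder are stubs."""

    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def embed_audio(self, mel):
        return self.encoder(mel)

    def logits(self, tokens, audio_features):
        return self.decoder(tokens, audio_features)

    def forward(self, mel, tokens):
        return self.decoder(tokens, self.encoder(mel))

    # Assigning a plain function in the class body turns it into a method.
    transcribe = transcribe_function


model = ToyWhisper(
    encoder=lambda mel: [2 * v for v in mel],
    decoder=lambda tokens, feats: [sum(feats) + t for t in tokens],
)
print(model.transcribe([1, 2, 3]))  # [12]
```

Binding functions this way keeps the decoding logic in its own module while still allowing calls of the form `model.transcribe(...)`.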
Practical Code Examples
Loading the Model
from whisper import load_model
# Load the small English-only model (downloaded automatically)
model = load_model("small.en", device="cpu") # or "cuda" if available
load_model resides in whisper/__init__.py and constructs a Whisper object with pretrained weights.
Encoding Audio Features
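import torch
from whisper.audio import log_mel_spectrogram, pad_or_trim
# Assume `wav` is a raw 16 kHz mono waveform (np.ndarray, shape (samples,))
wav = pad_or_trim(wav)  # pad or cut to the 30-second window the encoder expects
mel = log_mel_spectrogram(wav, n_mels=model.dims.n_mels)  # → (n_mels, n_ctx)
mel = mel.unsqueeze(0).to(model.device)  # add the batch dimension → (1, n_mels, n_ctx)
with torch.no_grad():
    audio_features = model.embed_audio(mel)  # shape (1, n_audio_ctx, n_audio_state)
print("Audio embedding shape:", audio_features.shape)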
model.embed_audio forwards the mel-spectrogram through AudioEncoder (see lines 87-90 in whisper/model.py).
Generating Decoder Logits
import torch
from whisper.tokenizer import get_tokenizer
# Obtain a tokenizer for the model
tokenizer = get_tokenizer(
    model.is_multilingual,
    num_languages=model.num_languages,
    language="en",
    task="transcribe",
)
# Start with the start-of-transcript token (<|startoftranscript|> = tokenizer.sot)
tokens = torch.tensor([[tokenizer.sot]], device=model.device)  # shape (1, 1)
logits = model.logits(tokens, audio_features)  # shape (1, 1, vocab_size)
print("Logits shape:", logits.shape)
model.logits runs the TextDecoder (see lines 90-93 in whisper/model.py).
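Once next-token logits are available, a custom decoding strategy is a loop around that call. The sketch below is a generic greedy loop in plain Python. `logits_fn` is a hypothetical callable that would wrap something like `model.logits(...)` and return the scores for the last position as a list, and `stub_logits` exists only to make the example runnable without the model.

```python
def greedy_decode(logits_fn, start_tokens, eot_token, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until EOT or the length cap."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        next_logits = logits_fn(tokens)  # scores for the token after the last one
        next_token = max(range(len(next_logits)), key=next_logits.__getitem__)
        tokens.append(next_token)
        if next_token == eot_token:
            break
    return tokens


# Hypothetical stub: vocabulary of 5 tokens, always prefers (last token + 1).
def stub_logits(tokens):
    scores = [0.0] * 5
    scores[(tokens[-1] + 1) % 5] = 1.0
    return scores


print(greedy_decode(stub_logits, start_tokens=[0], eot_token=4))  # [0, 1, 2, 3, 4]
```

This version recomputes logits over the whole prefix at every step. It omits key-value caching, beam search, and temperature sampling.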
End-to-End Transcription
from whisper import transcribe
result = transcribe(model, "audio/example.wav")
print(result["text"]) # printed transcription
print(result["language"]) # detected language (if multilingual)
transcribe (implemented in whisper/transcribe.py) orchestrates chunked decoding, language detection, and optional word-level timestamps, all built on top of the encoder/decoder pair.
Key Source Files in openai/whisper
| File | Primary Role |
|---|---|
| whisper/model.py | Core model definitions: AudioEncoder, TextDecoder, Whisper (assembly) |
| whisper/audio.py | Audio preprocessing (log_mel_spectrogram, pad_or_trim, constants) |
| whisper/tokenizer.py | Vocabulary, token-ID mapping, language handling |
| whisper/decoding.py | Beam search, sampling, language detection, decode API |
| whisper/transcribe.py | High-level transcription pipeline that uses the encoder/decoder |
| whisper/__init__.py | Public entry points (load_model, convenience wrappers) |
These files together constitute the full Whisper inference stack, with the Audio Encoder and Text Decoder forming the heart of the neural model.
Summary
- AudioEncoder processes mel-spectrograms through convolutional layers and transformer blocks, outputting latent audio embeddings of shape (batch, n_audio_ctx, n_audio_state).
- TextDecoder is an autoregressive transformer that uses cross-attention to query audio embeddings while applying causal masking to generate text tokens.
- Both components are defined in whisper/model.py and orchestrated by the Whisper class, which exposes embed_audio and logits methods for direct access to encoder and decoder outputs.
- High-level APIs like transcribe and decode build upon these core components to provide end-to-end speech recognition.
Frequently Asked Questions
What is the difference between the AudioEncoder and TextDecoder in Whisper?
The AudioEncoder is a feed-forward convolutional-transformer hybrid that processes entire audio segments in parallel, converting mel-spectrograms into dense vector representations. The TextDecoder is an autoregressive transformer that generates tokens one at a time, using cross-attention to pull information from the encoder's output while maintaining causal self-attention over previously generated tokens.
How does the TextDecoder attend to audio features during transcription?
The TextDecoder uses cross-attention layers (enabled by cross_attention=True in ResidualAttentionBlock at lines 217-220 of whisper/model.py) where queries originate from the decoder's hidden states and keys/values come from the encoder's audio embeddings (xa parameter). This mechanism allows each text token to attend to relevant acoustic patterns in the input audio.
What input format does the AudioEncoder expect?
The AudioEncoder expects a log-mel spectrogram tensor of shape (batch, n_mels, n_ctx), typically generated by whisper.audio.log_mel_spectrogram. The n_mels dimension is usually 80 or 128 depending on the model configuration, and n_ctx is the number of mel frames after padding or trimming to the model's 30-second window (3000 frames). The stride-2 convolution then halves this to the encoder's context length, n_audio_ctx = 1500 positions.
Can I use the AudioEncoder or TextDecoder independently?
Yes. The Whisper class exposes embed_audio() (line 87 in whisper/model.py) to run only the encoder and obtain audio embeddings, and logits() (line 90) to run the decoder on specific token sequences given encoder outputs. This allows advanced users to implement custom decoding strategies or use the encoder for audio representation learning separate from the standard transcription pipeline.