Core Components of the Whisper Model: AudioEncoder and TextDecoder Explained

The Whisper model consists of two primary neural components: an AudioEncoder that converts mel-spectrograms into latent embeddings, and a TextDecoder that generates transcription tokens via cross-attention to those embeddings.

The openai/whisper repository implements a transformer-based automatic speech recognition (ASR) system. Understanding the core components of the Whisper model—specifically how the AudioEncoder and TextDecoder interact—is essential for customizing inference, debugging outputs, or fine-tuning the architecture.

AudioEncoder: Converting Sound to Latent Representations

The AudioEncoder class, defined in whisper/model.py, processes raw mel-spectrogram inputs and produces a sequence of dense audio embeddings. This module uses a convolutional front-end followed by a stack of transformer blocks.

Convolutional Front-End

The encoder begins with two 1-D convolutional layers that down-sample the temporal dimension while expanding the feature depth. According to the source code at lines 179-180, these are defined as:

  • self.conv1 – initial convolution with kernel_size=3 and stride=1
  • self.conv2 – strided convolution with kernel_size=3 and stride=2 that reduces the time steps by half

These layers extract local spectral patterns before the transformer layers model global temporal dependencies.

Positional Embeddings and Transformer Blocks

After the convolutional layers, the encoder adds learned sinusoidal positional embeddings (line 181) to inject temporal position information. The core computation occurs in a stack of ResidualAttentionBlock layers (lines 183-185), each containing:

  • Self-attention – allows the model to attend to any position in the audio sequence
  • MLP – a feed-forward network that processes each position independently

The number of blocks varies by model size (tiny, base, small, medium, large).

Forward Pass Implementation

The forward method (lines 188-204) implements the complete encoding pipeline:

def forward(self, x: Tensor):
    # x: (batch, n_mels, n_ctx)

    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))
    x = x.permute(0, 2, 1)                       # (batch, time, dim)

    x = (x + self.positional_embedding).to(x.dtype)
    for block in self.blocks:
        x = block(x)                              # self-attention + MLP

    return self.ln_post(x)

The final LayerNorm (ln_post, line 186) stabilizes the output, producing a tensor of shape (batch, n_audio_ctx, n_audio_state) that serves as the input to the TextDecoder.

TextDecoder: Generating Tokens from Audio Embeddings

The TextDecoder class (also in whisper/model.py) is an autoregressive transformer that predicts the next text token given previous tokens and the audio embeddings from the encoder. It employs cross-attention layers to query the audio representation while maintaining causal self-attention over text positions.

Token and Positional Embeddings

The decoder begins by converting token IDs into dense vectors using self.token_embedding (line 213), sharing the same dimensionality as the encoder output (n_state). It adds learned positional embeddings (line 214) to encode sequence position, sliced according to the current offset when using key-value caching.

Cross-Attention Mechanism

Unlike the encoder, each ResidualAttentionBlock in the decoder (lines 217-220) is initialized with cross_attention=True. This adds a multi-head attention layer where:

  • Queries come from the decoder's hidden states
  • Keys and Values come from the encoder's audio embeddings (xa parameter)

This mechanism allows the model to align text generation with specific acoustic features.

Causal Masking and Output Projection

The decoder enforces autoregressive generation through a triangular causal mask registered as self.mask (lines 224-226). This mask prevents the model from attending to future token positions during self-attention.

After processing through the transformer blocks and final layer normalization (self.ln, line 222), the model projects hidden states back to vocabulary logits using the transposed token embedding matrix:

logits = (x @ self.token_embedding.weight.t()).float()

This weight tying between input embeddings and output projection is a standard transformer optimization.

The Whisper Class: Wiring Encoder and Decoder Together

The top-level Whisper class (lines 256+ in whisper/model.py) orchestrates the two core components. It exposes several key methods:

  • embed_audio (line 87) – wraps AudioEncoder.forward to process mel-spectrograms
  • logits (line 90) – wraps TextDecoder.forward to compute next-token probabilities
  • forward (lines 94-96) – end-to-end inference used by high-level APIs

The class also imports utility functions from whisper/decoding.py, binding decode, transcribe, and detect_language as methods (lines 43-45).

Practical Code Examples

Loading the Model

from whisper import load_model

# Load the small English-only model (downloaded automatically)

model = load_model("small.en", device="cpu")   # or "cuda" if available

load_model resides in whisper/__init__.py and constructs a Whisper object with pretrained weights.

Encoding Audio Features

import numpy as np
from whisper.audio import log_mel_spectrogram

# Assume `wav` is a raw waveform (np.ndarray, shape (samples,))

mel = log_mel_spectrogram(wav, n_mels=model.dims.n_mels)  # → (n_mels, n_ctx)

audio_features = model.embed_audio(mel)          # shape (1, n_audio_ctx, n_audio_state)

print("Audio embedding shape:", audio_features.shape)

model.embed_audio forwards the mel-spectrogram through AudioEncoder (see lines 87-90 in whisper/model.py).

Generating Decoder Logits

from whisper.tokenizer import get_tokenizer

# Obtain a tokenizer for the model

tokenizer = get_tokenizer(model.is_multilingual,
                          num_languages=model.num_languages,
                          language="en",
                          task="transcribe")

# Start with the BOS token (<|start|> = tokenizer.sot)

tokens = torch.tensor([[tokenizer.sot]])          # shape (1, 1)

logits = model.logits(tokens, audio_features)   # shape (1, 1, vocab_size)

print("Logits shape:", logits.shape)

model.logits runs the TextDecoder (see lines 90-93 in whisper/model.py).

End-to-End Transcription

from whisper import transcribe

result = transcribe(model, "audio/example.wav")
print(result["text"])           # printed transcription

print(result["language"])       # detected language (if multilingual)

transcribe (implemented in whisper/transcribe.py) orchestrates chunked decoding, language detection, and optional word-level timestamps, all built on top of the encoder/decoder pair.

Key Source Files in openai/whisper

File Primary Role Link
whisper/model.py Core model definitions: AudioEncoder, TextDecoder, Whisper (assembly) view file
whisper/audio.py Audio preprocessing (log_mel_spectrogram, pad_or_trim, constants) view file
whisper/tokenizer.py Vocabulary, token-ID mapping, language handling view file
whisper/decoding.py Beam search, sampling, language detection, decode API view file
whisper/transcribe.py High-level transcription pipeline that uses the encoder/decoder view file
whisper/__init__.py Public entry points (load_model, convenience wrappers) view file

These files together constitute the full Whisper inference stack, with the Audio Encoder and Text Decoder forming the heart of the neural model.

Summary

  • AudioEncoder processes mel-spectrograms through convolutional layers and transformer blocks, outputting latent audio embeddings of shape (batch, n_audio_ctx, n_audio_state).
  • TextDecoder is an autoregressive transformer that uses cross-attention to query audio embeddings while applying causal masking to generate text tokens.
  • Both components are defined in whisper/model.py and orchestrated by the Whisper class, which exposes embed_audio and logits methods for direct access to encoder and decoder outputs.
  • High-level APIs like transcribe and decode build upon these core components to provide end-to-end speech recognition.

Frequently Asked Questions

What is the difference between the AudioEncoder and TextDecoder in Whisper?

The AudioEncoder is a feed-forward convolutional-transformer hybrid that processes entire audio segments in parallel, converting mel-spectrograms into dense vector representations. The TextDecoder is an autoregressive transformer that generates tokens one at a time, using cross-attention to pull information from the encoder's output while maintaining causal self-attention over previously generated tokens.

How does the TextDecoder attend to audio features during transcription?

The TextDecoder uses cross-attention layers (enabled by cross_attention=True in ResidualAttentionBlock at lines 217-220 of whisper/model.py) where queries originate from the decoder's hidden states and keys/values come from the encoder's audio embeddings (xa parameter). This mechanism allows each text token to attend to relevant acoustic patterns in the input audio.

What input format does the AudioEncoder expect?

The AudioEncoder expects a log-mel spectrogram tensor of shape (batch, n_mels, n_ctx), typically generated by whisper.audio.log_mel_spectrogram. The n_mels dimension is usually 80 or 128 depending on the model configuration, and n_ctx represents the time frames after padding or trimming to the model's context window (e.g., 1500 frames for 30 seconds of audio).

Can I use the AudioEncoder or TextDecoder independently?

Yes. The Whisper class exposes embed_audio() (line 87 in whisper/model.py) to run only the encoder and obtain audio embeddings, and logits() (line 90) to run the decoder on specific token sequences given encoder outputs. This allows advanced users to implement custom decoding strategies or use the encoder for audio representation learning separate from the standard transcription pipeline.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →