# Core Components of the Whisper Model: AudioEncoder and TextDecoder Explained

> **The Whisper model consists of two primary neural components: an AudioEncoder that converts mel-spectrograms into latent embeddings, and a TextDecoder that generates transcription tokens via cross-attention to those embeddings.**

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Published: 2026-02-27

---

The `openai/whisper` repository implements a transformer-based automatic speech recognition (ASR) system. Understanding the core components of the Whisper model—specifically how the AudioEncoder and TextDecoder interact—is essential for customizing inference, debugging outputs, or fine-tuning the architecture.

## AudioEncoder: Converting Sound to Latent Representations

The `AudioEncoder` class, defined in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py), takes log-mel spectrogram inputs and produces a sequence of dense audio embeddings. This module uses a convolutional front-end followed by a stack of transformer blocks.

### Convolutional Front-End

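The encoder begins with two 1-D convolutional layers, each followed by a GELU activation. Together they expand the feature depth from `n_mels` to `n_state` and halve the temporal dimension. According to the source code at lines 179-180, these are defined as:

- `self.conv1` – initial convolution with `kernel_size=3`, `padding=1`, and the default `stride=1`, which maps `n_mels` channels to `n_state` channels and keeps the number of time steps unchanged
- `self.conv2` – strided convolution with `kernel_size=3`, `stride=2`, and `padding=1`, which reduces the number of time steps by half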

These layers extract local spectral patterns before the transformer layers model global temporal dependencies.
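To make the effect of the two layers on sequence length concrete, here is a small sketch based on the standard output-length formula for a 1-D convolution. It assumes `padding=1` on both layers and a 3000-frame input (one 30-second window); the helper name `conv1d_out_len` is hypothetical and not part of Whisper.

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


n_frames = 3000  # assumed input length: one 30-second window of mel frames

after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)

print(after_conv1)  # 3000: the stride-1 layer keeps the length
print(after_conv2)  # 1500: the stride-2 layer halves it
```

Under these assumptions, every position seen by the transformer blocks summarizes two adjacent mel frames.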

### Positional Embeddings and Transformer Blocks

After the convolutional layers, the encoder adds fixed sinusoidal positional embeddings (line 181) to inject temporal position information. These are registered as a buffer rather than a trainable parameter, so they are not updated during training. The core computation occurs in a stack of `ResidualAttentionBlock` layers (lines 183-185), each containing:

- **Self-attention** – allows the model to attend to any position in the audio sequence
- **MLP** – a feed-forward network that processes each position independently

The number of blocks varies by model size: 4 for tiny, 6 for base, 12 for small, 24 for medium, and 32 for large.
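A sinusoidal position table can be computed in closed form, with no training involved. The pure-Python sketch below illustrates the general scheme (sine features in the first half of each row, cosine features in the second, with geometrically spaced timescales). It is an illustration under those assumptions, not the repository's tensor-based code, and the tiny sizes are chosen only for readability.

```python
import math


def sinusoids(length: int, channels: int, max_timescale: float = 10000.0) -> list[list[float]]:
    """Return a (length, channels) table: sine features first, cosine features second."""
    assert channels % 2 == 0 and channels >= 4
    half = channels // 2
    # Timescales form a geometric progression from 1 up to max_timescale.
    increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-increment * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        table.append([math.sin(s) for s in scaled] + [math.cos(s) for s in scaled])
    return table


pe = sinusoids(length=4, channels=8)
print(len(pe), len(pe[0]))  # 4 8
```

Each row depends only on its position index, so the same table can be reused for every input.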

### Forward Pass Implementation

The `forward` method (lines 188-204) implements the complete encoding pipeline:

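```python
def forward(self, x: Tensor):
    # x: (batch, n_mels, n_ctx)
    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))                    # halves the time dimension
    x = x.permute(0, 2, 1)                       # (batch, time, dim)

    # the input must match the positional table, i.e. one full audio window
    assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
    x = (x + self.positional_embedding).to(x.dtype)

    for block in self.blocks:
        x = block(x)                             # self-attention + MLP

    return self.ln_post(x)
```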

The final `LayerNorm` (`ln_post`, line 186) stabilizes the output, producing a tensor of shape `(batch, n_audio_ctx, n_audio_state)` that serves as the input to the TextDecoder.

## TextDecoder: Generating Tokens from Audio Embeddings

The `TextDecoder` class (also in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py)) is an autoregressive transformer that predicts the next text token given previous tokens and the audio embeddings from the encoder. It employs cross-attention layers to query the audio representation while maintaining causal self-attention over text positions.

### Token and Positional Embeddings

The decoder begins by converting token IDs into dense vectors using `self.token_embedding` (line 213), sharing the same dimensionality as the encoder output (`n_state`). It adds learned positional embeddings (line 214) to encode sequence position, sliced according to the current offset when using key-value caching.
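The offset-based slicing can be illustrated with plain lists. In the sketch below, the tables and the function name `embed_with_offset` are made up for the example. It shows that embedding a single new token at `offset=2` gives the same vector as position 2 of a full, uncached pass.

```python
def embed_with_offset(token_ids, token_table, pos_table, offset=0):
    """Sum token vectors with the positional rows for positions offset..offset+len-1."""
    positions = pos_table[offset : offset + len(token_ids)]
    return [
        [t + p for t, p in zip(token_table[tok], pos)]
        for tok, pos in zip(token_ids, positions)
    ]


# Toy tables: 3 tokens, 4 positions, 2-dimensional embeddings (all values made up).
token_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos_table = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

full = embed_with_offset([2, 0, 1], token_table, pos_table)      # whole sequence, no cache
step = embed_with_offset([1], token_table, pos_table, offset=2)  # only the newest token

print(step[0] == full[2])  # True
```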

### Cross-Attention Mechanism

Unlike the encoder, each `ResidualAttentionBlock` in the decoder (lines 217-220) is initialized with `cross_attention=True`. This adds a multi-head attention layer where:

- **Queries** come from the decoder's hidden states
- **Keys and Values** come from the encoder's audio embeddings (`xa` parameter)

This mechanism allows the model to align text generation with specific acoustic features.
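The query/key/value roles can be sketched with a single attention head in pure Python. This is a simplified illustration: it omits the learned projection matrices and the multiple heads of the real layer, and all names and numbers are made up for the example.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; returns (outputs, weights)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: 2 text positions attend over 3 audio positions.
text_hidden = [[1.0, 0.0], [0.0, 1.0]]              # source of queries
audio_embed = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # source of keys and values

out, weights = cross_attention(text_hidden, audio_embed, audio_embed)
```

The output has one row per text position, while each row of `weights` has one entry per audio position. That is why the decoder can run on sequences far shorter than the audio it attends to.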

### Causal Masking and Output Projection

The decoder enforces autoregressive generation through a triangular causal mask registered as `self.mask` (lines 224-226). This mask prevents the model from attending to future token positions during self-attention.
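As a minimal sketch, assuming the mask is additive (zeros where attention is allowed, negative infinity above the diagonal), the effect on the attention weights looks like this. The function names are hypothetical.

```python
import math


def causal_mask(n_ctx: int) -> list[list[float]]:
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else -math.inf for j in range(n_ctx)] for i in range(n_ctx)]


def masked_softmax(scores, mask_row):
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Position 1 may see positions 0 and 1, but not position 2.
weights = masked_softmax([0.5, 0.5, 9.0], mask[1])
print(weights)  # [0.5, 0.5, 0.0]
```

Even though the raw score for the future position is the largest, it receives zero weight after masking.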

After processing through the transformer blocks and final layer normalization (`self.ln`, line 222), the model projects hidden states back to vocabulary logits using the transposed token embedding matrix:

```python
logits = (x @ self.token_embedding.weight.t()).float()

```

This weight tying between input embeddings and output projection is a standard transformer optimization.
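The idea behind weight tying can be shown with a toy embedding table: the same matrix is read by row on the input side and used as a set of scoring vectors on the output side. The values and the helper name `tied_logits` below are made up for illustration.

```python
def tied_logits(hidden, embedding_table):
    """Score a hidden vector against every token embedding (hidden @ E^T)."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_table]


# Toy embedding table: 3 tokens, 2 dimensions.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

token_id = 1
input_vector = embedding_table[token_id]           # input side: row lookup
logits = tied_logits([0.2, 0.9], embedding_table)  # output side: same matrix, transposed

print(logits)  # [0.2, 0.9, -0.2]
```

A hidden state that points in the direction of a token's embedding yields the highest logit for that token, and no separate output matrix needs to be stored.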

## The Whisper Class: Wiring Encoder and Decoder Together

The top-level `Whisper` class (lines 256+ in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py)) orchestrates the two core components. It exposes several key methods:

- **`embed_audio`** – calls the `AudioEncoder` on a batch of mel-spectrograms and returns the audio embeddings
- **`logits`** – calls the `TextDecoder` on a token sequence and the audio embeddings, returning unnormalized next-token logits (apply a softmax to obtain probabilities)
- **`forward`** – runs the encoder and then the decoder in a single call, returning logits for the given mel-spectrogram and tokens

The class also binds three utility functions as methods: `decode` and `detect_language` from [`whisper/decoding.py`](https://github.com/openai/whisper/blob/main/whisper/decoding.py), and `transcribe` from [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py).
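One common Python idiom for exposing a module-level function as a method is to assign it in the class body. The toy sketch below shows that idiom; `ToyModel` and `decode_function` are hypothetical stand-ins and are not copied from the repository.

```python
def decode_function(model, tokens):
    """Stand-in for a module-level function whose first argument is the model."""
    return f"{model.name} decoded {len(tokens)} tokens"


class ToyModel:
    def __init__(self, name):
        self.name = name

    # Assigning the function in the class body makes it an ordinary method:
    # the instance is passed as the first argument (`model`) on each call.
    decode = decode_function


toy = ToyModel("toy")
print(toy.decode([1, 2, 3]))  # toy decoded 3 tokens
```

With this pattern, `toy.decode(tokens)` and `decode_function(toy, tokens)` are equivalent, which keeps the decoding logic in its own module while still offering a method-style API.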

## Practical Code Examples

### Loading the Model

```python
from whisper import load_model

# Load the small English-only model (downloaded automatically on first use)
model = load_model("small.en", device="cpu")   # or "cuda" if available
```

`load_model` resides in [`whisper/__init__.py`](https://github.com/openai/whisper/blob/main/whisper/__init__.py) and constructs a `Whisper` object with pretrained weights.

### Encoding Audio Features

```python
import torch
from whisper.audio import log_mel_spectrogram, pad_or_trim

# Assume `wav` is a 16 kHz mono waveform (np.ndarray, shape (samples,))
wav = pad_or_trim(wav)                                    # pad or trim to 30 seconds
mel = log_mel_spectrogram(wav, n_mels=model.dims.n_mels)  # (n_mels, 3000)
mel = mel.unsqueeze(0).to(model.device)                   # add batch dim: (1, n_mels, 3000)

with torch.no_grad():
    audio_features = model.embed_audio(mel)               # (1, n_audio_ctx, n_audio_state)

print("Audio embedding shape:", audio_features.shape)
```

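`model.embed_audio` forwards the mel-spectrogram through the `AudioEncoder` defined in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py). The encoder checks that its input covers exactly one 30-second window, so the waveform must be padded or trimmed with `pad_or_trim` and given a batch dimension before this call.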

### Generating Decoder Logits

```python
import torch
from whisper.tokenizer import get_tokenizer

# Obtain a tokenizer that matches the model
tokenizer = get_tokenizer(model.is_multilingual,
                          num_languages=model.num_languages,
                          language="en",
                          task="transcribe")

# Start with the start-of-transcript token (<|startoftranscript|> = tokenizer.sot)
tokens = torch.tensor([[tokenizer.sot]], device=model.device)   # shape (1, 1)

with torch.no_grad():
    logits = model.logits(tokens, audio_features)   # shape (1, 1, vocab_size)

print("Logits shape:", logits.shape)
```

`model.logits` runs the `TextDecoder` defined in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) and returns unnormalized scores for every vocabulary entry at each token position.

### End-to-End Transcription

```python
from whisper import transcribe

result = transcribe(model, "audio/example.wav")
print(result["text"])           # transcribed text
print(result["language"])       # language code (detected automatically for multilingual models)
```

`transcribe` (implemented in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py)) orchestrates chunked decoding, language detection, and optional word-level timestamps, all built on top of the encoder/decoder pair.

## Key Source Files in openai/whisper

| File | Primary Role | Link |
|------|--------------|------|
| [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) | Core model definitions: `AudioEncoder`, `TextDecoder`, `Whisper` (assembly) | [view file](https://github.com/openai/whisper/blob/main/whisper/model.py) |
| [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py) | Audio preprocessing (`log_mel_spectrogram`, `pad_or_trim`, constants) | [view file](https://github.com/openai/whisper/blob/main/whisper/audio.py) |
| [`whisper/tokenizer.py`](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) | Vocabulary, token-ID mapping, language handling | [view file](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) |
| [`whisper/decoding.py`](https://github.com/openai/whisper/blob/main/whisper/decoding.py) | Beam search, sampling, language detection, `decode` API | [view file](https://github.com/openai/whisper/blob/main/whisper/decoding.py) |
| [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) | High-level transcription pipeline that uses the encoder/decoder | [view file](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) |
| [`whisper/__init__.py`](https://github.com/openai/whisper/blob/main/whisper/__init__.py) | Public entry points (`load_model`, convenience wrappers) | [view file](https://github.com/openai/whisper/blob/main/whisper/__init__.py) |

These files together constitute the full Whisper inference stack, with the **Audio Encoder** and **Text Decoder** forming the heart of the neural model.

## Summary

- **AudioEncoder** processes mel-spectrograms through convolutional layers and transformer blocks, outputting latent audio embeddings of shape `(batch, n_audio_ctx, n_audio_state)`.
- **TextDecoder** is an autoregressive transformer that uses cross-attention to query audio embeddings while applying causal masking to generate text tokens.
- Both components are defined in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) and orchestrated by the `Whisper` class, which exposes `embed_audio` and `logits` methods for direct access to encoder and decoder outputs.
- High-level APIs like `transcribe` and `decode` build upon these core components to provide end-to-end speech recognition.

## Frequently Asked Questions

### What is the difference between the AudioEncoder and TextDecoder in Whisper?

The **AudioEncoder** is a feed-forward convolutional-transformer hybrid that processes entire audio segments in parallel, converting mel-spectrograms into dense vector representations. The **TextDecoder** is an autoregressive transformer that generates tokens one at a time, using cross-attention to pull information from the encoder's output while maintaining causal self-attention over previously generated tokens.

### How does the TextDecoder attend to audio features during transcription?

The `TextDecoder` uses **cross-attention layers** (enabled by `cross_attention=True` in `ResidualAttentionBlock` at lines 217-220 of [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py)) where queries originate from the decoder's hidden states and keys/values come from the encoder's audio embeddings (`xa` parameter). This mechanism allows each text token to attend to relevant acoustic patterns in the input audio.

### What input format does the AudioEncoder expect?

The `AudioEncoder` expects a **log-mel spectrogram** tensor of shape `(batch, n_mels, n_ctx)`, typically generated by `whisper.audio.log_mel_spectrogram`. The `n_mels` dimension is 80 or 128 depending on the model configuration, and `n_ctx` is the number of input time frames after padding or trimming to the 30-second window (3000 frames at a 10 ms hop). The strided convolution then halves this to 1500 encoder positions (`n_audio_ctx`).
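The relationship between audio duration, mel frames, and encoder positions can be worked out from the preprocessing constants. The values below (16 kHz sample rate, hop length of 160 samples, 30-second window) are stated here as assumptions about `whisper/audio.py`, so check them against the version you use.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed)
HOP_LENGTH = 160       # samples between successive spectrogram frames (assumed)
CHUNK_LENGTH = 30      # seconds per window (assumed)

n_samples = CHUNK_LENGTH * SAMPLE_RATE   # samples in one window
n_frames = n_samples // HOP_LENGTH       # mel frames fed to the encoder
n_audio_ctx = n_frames // 2              # positions after the stride-2 convolution
ms_per_position = 1000 * CHUNK_LENGTH / n_audio_ctx

print(n_samples, n_frames, n_audio_ctx, ms_per_position)  # 480000 3000 1500 20.0
```

Under these assumptions, each encoder output position corresponds to 20 ms of audio.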

### Can I use the AudioEncoder or TextDecoder independently?

Yes. The `Whisper` class in [`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py) exposes `embed_audio()` to run only the encoder and obtain audio embeddings, and `logits()` to run the decoder on specific token sequences given encoder outputs. This allows advanced users to implement custom decoding strategies or use the encoder for audio representation learning separate from the standard transcription pipeline.
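As an illustration of a custom decoding strategy, here is a minimal greedy loop written against a generic scoring callable, so it runs without a model. The function `greedy_decode` and the stub scorer are hypothetical; this is a sketch of the technique, not Whisper's own decoder.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_logits: Callable[[List[int]], Sequence[float]],
    start_tokens: List[int],
    eot: int,
    max_new_tokens: int = 32,
) -> List[int]:
    """Repeatedly append the highest-scoring token until `eot` or the length cap."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)  # scores for the token that follows `tokens`
        best = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(best)
        if best == eot:
            break
    return tokens


# Stub scorer for demonstration: always prefers (last token + 1), capped at 4.
def stub_next_logits(tokens: List[int]) -> List[float]:
    target = min(tokens[-1] + 1, 4)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode(stub_next_logits, start_tokens=[0], eot=4))  # [0, 1, 2, 3, 4]
```

With a loaded model, `next_logits` could be a small adapter that builds a token tensor, calls `model.logits(tokens, audio_features)`, and returns the scores at the last position. The library's own decoding path also applies logit filtering and key-value caching, which this sketch leaves out.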