# How OpenAI Whisper Handles Audio Processing and Mel Spectrogram Generation

> Discover how OpenAI Whisper processes audio and generates mel spectrograms using its four-stage pipeline: ffmpeg loading, fixed-length padding, STFT computation, and Mel filterbank projection.

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Tags: internals
- Published: 2026-02-27

---

**OpenAI Whisper converts raw audio into normalized log-Mel spectrograms through a four-stage pipeline implemented in [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py): it loads audio with ffmpeg, pads or trims it to 30-second chunks, computes an STFT with a 400-sample window (201 frequency bins), and projects the result onto 80 or 128 Mel bands.**

The `openai/whisper` repository processes arbitrary audio recordings by transforming them into fixed-size log-Mel representations that feed directly into the transformer encoder. Understanding how Whisper handles audio processing and mel spectrogram generation requires examining the core functions in [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py), which orchestrate resampling, windowing, and filterbank projection before the neural network consumes the data.

## The Four-Stage Audio Processing Pipeline

Whisper’s audio front-end lives entirely in **[`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py)** and transforms raw waveforms into encoder-ready tensors through four discrete stages.

### Stage 1: Loading and Resampling with ffmpeg

The **`load_audio`** function spawns an **ffmpeg** subprocess via `subprocess.run` to handle virtually any audio format (MP3, WAV, FLAC, etc.). The ffmpeg command down-mixes stereo inputs to mono and resamples to **16 kHz** (`SAMPLE_RATE = 16000`), emitting 16-bit signed PCM on stdout. `load_audio` then scales those samples by 1/32768 and returns them as a `float32` NumPy array with values in [-1, 1). This abstraction allows Whisper to ingest audio without format-specific Python dependencies beyond ffmpeg.
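The decode step can be sketched as follows. This is a simplified approximation rather than the library's code: `build_ffmpeg_cmd` and `pcm16_to_float32` are hypothetical helper names, and the flag list is an assumption about what `load_audio` passes to ffmpeg, so check it against the source before relying on it.

```python
import numpy as np

SAMPLE_RATE = 16000


def build_ffmpeg_cmd(path, sr=SAMPLE_RATE):
    """Hypothetical helper: ask ffmpeg for mono 16-bit PCM at `sr` Hz on stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "-ac", "1",              # down-mix to one channel
        "-acodec", "pcm_s16le",
        "-ar", str(sr),          # resample to the target rate
        "-",                     # write to stdout instead of a file
    ]


def pcm16_to_float32(raw):
    """Hypothetical helper: scale int16 samples into float32 in [-1.0, 1.0)."""
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


# Three hand-made samples stand in for ffmpeg's stdout here.
raw = np.array([0, 16384, -32768], dtype="<i2").tobytes()
print(pcm16_to_float32(raw).tolist())  # [0.0, 0.5, -1.0]
```

With ffmpeg installed, the two helpers would combine as `pcm16_to_float32(subprocess.run(build_ffmpeg_cmd(path), capture_output=True, check=True).stdout)`.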

### Stage 2: Padding and Trimming to Fixed Length

Whisper’s encoder expects exactly **30 seconds** of audio per inference chunk, defined by the constant **`CHUNK_LENGTH = 30`**. The **`pad_or_trim`** function enforces this by either truncating longer clips or zero-padding shorter ones to exactly **480,000 samples** (`N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE`). This utility handles both NumPy arrays and PyTorch tensors, ensuring consistent input dimensions regardless of source audio duration. It is a separate step that the caller must apply: `log_mel_spectrogram` does not call it, so an unpadded waveform yields a spectrogram whose frame count depends on the clip length.
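The padding and trimming behaviour can be illustrated with a NumPy-only sketch. `pad_or_trim_np` is a hypothetical stand-in written for this article, not the library function, and it covers only the NumPy case.

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000


def pad_or_trim_np(array, length=N_SAMPLES, axis=-1):
    """Hypothetical NumPy-only stand-in for `pad_or_trim`."""
    n = array.shape[axis]
    if n > length:
        # Keep the first `length` samples along `axis`
        array = array.take(indices=range(length), axis=axis)
    elif n < length:
        # Append zeros at the end of `axis` only
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - n)
        array = np.pad(array, pad_widths)
    return array


short_clip = np.ones(2 * SAMPLE_RATE, dtype=np.float32)   # 2 s clip
long_clip = np.ones(45 * SAMPLE_RATE, dtype=np.float32)   # 45 s clip
print(pad_or_trim_np(short_clip).shape)  # (480000,)
print(pad_or_trim_np(long_clip).shape)   # (480000,)
```

Note that trimming discards everything past the first 30 seconds, so longer recordings need to be split into chunks before this step if the remainder matters.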

### Stage 3: Short-Time Fourier Transform (STFT)

With the fixed-size waveform, **`log_mel_spectrogram`** constructs a Hann window using `torch.hann_window(N_FFT)` and executes the STFT via `torch.stft`. The transform uses **`N_FFT = 400`** (25 ms windows) and **`HOP_LENGTH = 160`** (10 ms stride), producing a complex spectrogram with `N_FFT // 2 + 1 = 201` frequency bins per frame that captures the time-frequency characteristics essential for speech recognition. The final frame is dropped, which is why a 480,000-sample chunk yields 3000 frames rather than 3001.
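The frame and bin counts can be checked with a small NumPy re-creation of the transform. This is a sketch under assumptions, not Whisper's implementation: `stft_np` is a hypothetical name, and it mimics the `torch.stft` defaults (centered frames with reflect padding) together with the periodic window that `torch.hann_window` returns by default.

```python
import numpy as np

SAMPLE_RATE, N_FFT, HOP_LENGTH = 16000, 400, 160


def stft_np(audio):
    """Hypothetical NumPy sketch of a centered, Hann-windowed STFT."""
    # Periodic Hann window of length N_FFT
    window = np.hanning(N_FFT + 1)[:-1]
    # Centered framing: reflect-pad half a window on each side
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    frames = np.lib.stride_tricks.sliding_window_view(padded, N_FFT)[::HOP_LENGTH]
    spectrum = np.fft.rfft(frames * window, axis=-1)  # (n_frames + 1, 201)
    # Transpose to (freq, time) and drop the final frame
    return spectrum.T[:, :-1]


t = np.arange(SAMPLE_RATE) / SAMPLE_RATE          # 1 second of sample times
spec = stft_np(np.sin(2 * np.pi * 1000 * t))      # 1 kHz tone
print(spec.shape)                    # (201, 100)
print(np.abs(spec[:, 50]).argmax())  # 25, since each bin spans 16000 / 400 = 40 Hz
```

One second of audio gives 100 frames at a 10 ms stride, and the 1 kHz tone peaks in bin 25 because the bins are 40 Hz apart.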

### Stage 4: Mel-Scale Projection and Log Compression

The power spectrogram (the squared STFT magnitude) is projected onto the Mel scale using a **pre-computed filterbank** stored in `whisper/assets/mel_filters.npz`. The **`mel_filters`** function loads these weights behind `@lru_cache` to avoid redundant disk I/O. The projection supports **80 or 128 Mel bins**, selected by the `n_mels` argument. After projection, values are clamped to a minimum of `1e-10`, converted with `log10`, floored at 8 log10 units (80 dB) below the spectrogram's maximum, and rescaled as `(log_spec + 4.0) / 4.0`. The rescaling keeps all values within a span of 2.0 but does not pin them to **[0, 1]**. The result has shape **`(n_mels, n_frames)`**, which is **`(80, 3000)`** for a 30-second chunk with the default settings.
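The projection and compression steps can be traced numerically with the sketch below. `log_compress` is a hypothetical name, and the random matrices are placeholders for the real filterbank and STFT output, so only the shapes and the arithmetic are meaningful.

```python
import numpy as np


def log_compress(mel_power):
    """Hypothetical NumPy sketch of the clamp, log10, floor, and rescale steps."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))       # avoid log(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)   # keep the top 8 log10 units
    return (log_spec + 4.0) / 4.0                           # shift and scale


rng = np.random.default_rng(0)
filters = rng.random((80, 201))     # placeholder for the real 80 x 201 filterbank
power = rng.random((201, 3000))     # placeholder for an STFT power spectrogram
mel = log_compress(filters @ power)
print(mel.shape)                    # (80, 3000)

# Worked example: log10 gives [-10, 0, 2]; the floor at 2 - 8 = -6 lifts the first value
print(log_compress(np.array([1e-12, 1.0, 100.0])).tolist())  # [-0.5, 1.0, 1.5]
```

The worked example shows outputs both below 0 and above 1, which is why the result is better described as a shifted and scaled log spectrum than as a [0, 1] normalization.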

## Key Hyperparameters and Tensor Dimensions

Whisper’s audio processing relies on precise constants defined at the top of [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py):

- **`SAMPLE_RATE = 16000`**: Target sampling rate in Hz
- **`N_FFT = 400`**: FFT size covering 25 ms windows
- **`HOP_LENGTH = 160`**: Frame stride of 10 ms
- **`CHUNK_LENGTH = 30`**: Chunk duration in seconds
- **`N_SAMPLES = 480000`**: Total samples per chunk (`30 * 16000`)
- **`N_FRAMES = 3000`**: Time frames in the output spectrogram (`N_SAMPLES / HOP_LENGTH`)
- **`n_mels = 80`** or **128**: Number of Mel bands, passed as a function argument rather than fixed as a module constant (80 for most checkpoints, 128 for large-v3)

The encoder expects input tensors of shape `(n_mels, N_FRAMES)`. `N_FRAMES` is derived with the `exact_div` helper from [`whisper/utils.py`](https://github.com/openai/whisper/blob/main/whisper/utils.py), which asserts that the division leaves no remainder.
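The relationships between these constants are plain arithmetic, reproduced below. The `exact_div` shown here is a re-implementation for illustration and may differ in detail from the one in `whisper/utils.py`.

```python
def exact_div(x, y):
    """Integer division that fails loudly when `y` does not divide `x`."""
    assert x % y == 0, f"{x} is not divisible by {y}"
    return x // y


SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
CHUNK_LENGTH = 30

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE
N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH)

print(N_SAMPLES)                        # 480000
print(N_FRAMES)                         # 3000
print(1000 * N_FFT / SAMPLE_RATE)       # 25.0 (window length in ms)
print(1000 * HOP_LENGTH / SAMPLE_RATE)  # 10.0 (stride in ms)
print(N_FFT // 2 + 1)                   # 201 (frequency bins per frame)
```

Using an asserting division means that an incompatible change to the sample rate or hop length fails at import time instead of silently producing a truncated frame count.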

## Caching and Device Placement Strategies

Whisper avoids repeated work through two mechanisms implemented in [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py). The **`mel_filters`** function uses **`@lru_cache`** to load the filterbank from `assets/mel_filters.npz` only once per `(device, n_mels)` combination in a process, so later calls skip the disk read and the device transfer. For hardware acceleration, the **`log_mel_spectrogram`** function accepts a `device` argument that moves the waveform tensor to that device **before** STFT computation, ensuring the Fourier transforms execute on CUDA when available rather than on the CPU.
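The caching behaviour is easy to observe with a toy loader. `load_filters` below is a hypothetical stand-in that counts simulated disk reads instead of opening the `.npz` file; the point is only how `functools.lru_cache` keys on the argument pair.

```python
from functools import lru_cache

load_count = 0  # counts simulated disk reads


@lru_cache(maxsize=None)
def load_filters(device, n_mels):
    """Hypothetical stand-in for a cached loader keyed on (device, n_mels)."""
    global load_count
    assert n_mels in {80, 128}, f"Unsupported n_mels: {n_mels}"
    load_count += 1          # a real loader would read the .npz file here
    return (n_mels, 201)     # placeholder for the filterbank shape


load_filters("cpu", 80)
load_filters("cpu", 80)    # cache hit: no second load
load_filters("cpu", 128)   # different key: loads again
print(load_count)                      # 2
print(load_filters.cache_info().hits)  # 1
```

Because the device is part of the cache key, a process that uses both CPU and GPU keeps a separate cached copy for each.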

## Practical Implementation Examples

### Generate a Standard 80-Mel Spectrogram

```python
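from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim

audio_path = "examples/jfk.wav"

# Decode to 16 kHz mono float32, then pad or trim to 30 s (480,000 samples)
audio = pad_or_trim(load_audio(audio_path))

# log_mel_spectrogram does not pad or trim on its own
mel = log_mel_spectrogram(audio)
print(mel.shape)  # torch.Size([80, 3000])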
```

### GPU Acceleration with 128 Mel Bins

```python
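import torch
from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pad or trim first so the output has exactly 3000 frames
audio = pad_or_trim(load_audio("examples/harvard.wav"))

mel_128 = log_mel_spectrogram(
    audio,
    n_mels=128,      # Required for Whisper large-v3
    device=device,   # Waveform is moved here before the STFT
)
print(mel_128.shape)  # torch.Size([128, 3000])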
```

### Processing Raw NumPy Waveforms

```python
import numpy as np
from whisper.audio import log_mel_spectrogram, pad_or_trim

# Simulate 2 seconds of a 440 Hz sine wave sampled at 16 kHz
sr = 16000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
waveform = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# Zero-pad to 30 s explicitly; log_mel_spectrogram does not do this itself
mel = log_mel_spectrogram(pad_or_trim(waveform))
print(mel.shape)  # torch.Size([80, 3000])

# The floor at max - 8 followed by the division by 4 bounds the span at 2.0
print(f"Min: {mel.min():.3f}, Max: {mel.max():.3f}")
```

## Summary

- **Four-stage pipeline**: Whisper uses ffmpeg loading (`load_audio`), fixed-length padding (`pad_or_trim`), STFT computation, and Mel-scale projection to transform audio.
- **Fixed dimensions**: After `pad_or_trim`, every input is a 30-second, 16 kHz mono waveform yielding an `(80, 3000)` or `(128, 3000)` log-Mel spectrogram.
- **Performance optimizations**: Filterbanks are LRU-cached from `assets/mel_filters.npz`, and GPU device placement occurs before Fourier transforms.
- **Core file**: All functionality resides in [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py), with helper utilities in [`whisper/utils.py`](https://github.com/openai/whisper/blob/main/whisper/utils.py).

## Frequently Asked Questions

### What audio file formats does Whisper support for mel spectrogram generation?

Whisper supports virtually any audio format—including MP3, WAV, FLAC, and AAC—through its **`load_audio`** function in [`whisper/audio.py`](https://github.com/openai/whisper/blob/main/whisper/audio.py), which delegates decoding to ffmpeg via `subprocess.run`. As long as ffmpeg is installed on the system, Whisper can extract the raw PCM stream and resample it to the required 16 kHz mono format.

### Why does Whisper enforce exactly 30 seconds of audio per chunk?

The 30-second constraint (`CHUNK_LENGTH = 30`) ensures consistent tensor dimensions for the transformer encoder, which expects fixed-size inputs of shape `(n_mels, 3000)`. The **`pad_or_trim`** function zero-pads shorter audio or truncates longer clips to exactly 480,000 samples. A fixed length makes batching straightforward and matches the encoder's positional embedding, which is sized for a fixed number of positions.

### What is the difference between 80-Mel and 128-Mel configurations in Whisper?

The **80-Mel** configuration (`n_mels=80`) is the default and is used by most checkpoints, including base and small, while the **128-Mel** configuration is required for the large-v3 model. The Mel filterbank weights are pre-computed and stored in `whisper/assets/mel_filters.npz`, with the **`mel_filters`** function caching the appropriate matrix (80x201 or 128x201) based on the `n_mels` parameter passed to `log_mel_spectrogram`.

### How does Whisper optimize repeated spectrogram generation?

Whisper uses **`@lru_cache`** on the **`mel_filters`** function to load the Mel filterbank from `assets/mel_filters.npz` only once per Python process. Additionally, when a `device` argument is provided to `log_mel_spectrogram`, the waveform tensor is moved to that device (e.g., CUDA) before calling `torch.stft`, ensuring that expensive FFT computations execute on GPU hardware rather than CPU.