How OpenAI Whisper Handles Audio Processing and Mel Spectrogram Generation

OpenAI Whisper converts raw audio into normalized log-Mel spectrograms through a four-stage pipeline implemented in whisper/audio.py: it loads audio with ffmpeg, pads or trims to 30-second chunks, computes an STFT with 400-point FFTs (201 frequency bins), and projects the result onto 80 or 128 Mel filterbanks.

The openai/whisper repository processes arbitrary audio recordings by transforming them into fixed-size log-Mel representations that feed directly into the transformer encoder. Understanding how Whisper handles audio processing and mel spectrogram generation requires examining the core functions in whisper/audio.py, which orchestrate resampling, windowing, and filterbank projection before the neural network consumes the data.

The Four-Stage Audio Processing Pipeline

Whisper’s audio front-end lives entirely in whisper/audio.py and transforms raw waveforms into encoder-ready tensors through four discrete stages.

Stage 1: Loading and Resampling with ffmpeg

The load_audio function spawns an ffmpeg subprocess via subprocess.run to handle virtually any audio format (MP3, WAV, FLAC, etc.). The ffmpeg command down-mixes stereo inputs to mono, resamples to 16 kHz (SAMPLE_RATE = 16000), and emits 16-bit PCM, which load_audio returns as a float32 NumPy array scaled to the range [-1.0, 1.0). This abstraction allows Whisper to ingest audio without format-specific Python dependencies; the only external requirement is an ffmpeg binary on the system.
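The decoding step can be sketched roughly as follows. This is a simplified illustration rather than the repository's exact code: build_ffmpeg_cmd, pcm16_to_float32, and load_audio_sketch are hypothetical names, and the flags shown are standard ffmpeg options for writing raw 16-bit mono PCM to stdout.

```python
import subprocess

import numpy as np

SAMPLE_RATE = 16000


def build_ffmpeg_cmd(path: str, sr: int = SAMPLE_RATE) -> list:
    """Hypothetical helper: ffmpeg arguments that decode `path` to raw PCM on stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-ac", "1",               # down-mix to mono
        "-acodec", "pcm_s16le",
        "-ar", str(sr),           # resample to the target rate
        "-",                      # write to stdout
    ]


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Hypothetical helper: scale int16 PCM bytes to float32 in [-1.0, 1.0)."""
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


def load_audio_sketch(path: str, sr: int = SAMPLE_RATE) -> np.ndarray:
    # Needs an ffmpeg binary on PATH; raises CalledProcessError if decoding fails
    result = subprocess.run(build_ffmpeg_cmd(path, sr), capture_output=True, check=True)
    return pcm16_to_float32(result.stdout)
```

Splitting command construction from byte decoding keeps the part that depends on ffmpeg separate from the part that can be checked without it.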

Stage 2: Padding and Trimming to Fixed Length

Whisper’s encoder expects exactly 30 seconds of audio per inference chunk, defined by the constant CHUNK_LENGTH = 30. The pad_or_trim function enforces this by either truncating longer clips or zero-padding shorter ones to exactly 480,000 samples (N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE). This utility handles both NumPy arrays and PyTorch tensors, ensuring consistent input dimensions regardless of source audio duration.
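A minimal NumPy sketch of the pad-or-trim idea is shown here. It only illustrates the behavior described above; pad_or_trim_sketch is a hypothetical name, and the real utility also accepts PyTorch tensors.

```python
import numpy as np

N_SAMPLES = 30 * 16000  # 480,000 samples in a 30-second chunk


def pad_or_trim_sketch(array: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Trim or zero-pad the last axis of `array` to exactly `length` samples."""
    n = array.shape[-1]
    if n > length:
        return array[..., :length]          # keep only the first `length` samples
    if n < length:
        pad_widths = [(0, 0)] * (array.ndim - 1) + [(0, length - n)]
        return np.pad(array, pad_widths)    # append zeros at the end
    return array


short = pad_or_trim_sketch(np.ones(16000, dtype=np.float32))    # 1 s of audio
long = pad_or_trim_sketch(np.ones(600000, dtype=np.float32))    # 37.5 s of audio
print(short.shape, long.shape)  # (480000,) (480000,)
```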

Stage 3: Short-Time Fourier Transform (STFT)

With the fixed-size waveform, log_mel_spectrogram constructs a Hann window using torch.hann_window(N_FFT) and executes the STFT via torch.stft. The transform uses N_FFT = 400 (25 ms windows) and HOP_LENGTH = 160 (10 ms stride), producing a complex spectrogram with 201 frequency bins per frame that captures the time-frequency characteristics essential for speech recognition.
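To make the framing concrete, here is a NumPy approximation of this stage. It is a sketch and not Whisper's code: power_spectrogram is a hypothetical name, and it assumes the torch.stft defaults (centered frames with reflect padding), a periodic Hann window, and that the final frame is dropped so the frame count is a multiple of the chunk length.

```python
import numpy as np

SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160


def power_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Return squared STFT magnitudes with shape (201, n_frames)."""
    # Periodic Hann window of length N_FFT
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N_FFT) / N_FFT)
    # Reflect-pad by half a window on each side so frames are centered
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    # One frame every HOP_LENGTH samples
    frames = np.lib.stride_tricks.sliding_window_view(padded, N_FFT)[::HOP_LENGTH]
    spectrum = np.fft.rfft(frames * window, axis=-1)   # (n_frames + 1, 201)
    power = np.abs(spectrum) ** 2
    return power.T[:, :-1]                             # drop the final frame


# One second of a 440 Hz tone: 16000 / 160 = 100 frames
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = power_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (201, 100)
```

Each frequency bin is 16000 / 400 = 40 Hz wide, so the 440 Hz tone peaks at bin 11.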

Stage 4: Mel-Scale Projection and Log Compression

The power spectrogram (the squared STFT magnitude) is projected onto the Mel scale using a pre-computed filterbank stored in whisper/assets/mel_filters.npz. The mel_filters function is decorated with @lru_cache, so the weights are read from disk once and reused. The projection supports 80 or 128 Mel bins (the n_mels argument). After projection, values are clamped to a minimum of 1e-10, converted with log10, floored at 8 log10 units (80 dB) below the spectrogram's maximum, and rescaled as (x + 4) / 4. The result therefore spans at most two units and is not strictly confined to [0, 1]. The output tensor has shape (n_mels, n_frames), which is (80, 3000) for a 30-second input with the default 80 bins.
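The projection and compression steps can be sketched in NumPy as below. This is an illustration under stated assumptions: log_compress is a hypothetical name, and the filterbank here is random placeholder data, not the real weights from mel_filters.npz. The final rescaling is an affine map, so the output range depends on the input's peak value.

```python
import numpy as np


def log_compress(mel_power: np.ndarray) -> np.ndarray:
    """Clamp, take log10, limit dynamic range to 8 log10 units, then rescale."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))        # avoid log(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)    # floor at max - 8
    return (log_spec + 4.0) / 4.0                            # affine rescaling


rng = np.random.default_rng(0)
filters = rng.random((80, 201))     # placeholder, NOT Whisper's real Mel weights
power = rng.random((201, 3000))     # placeholder power spectrogram
log_mel = log_compress(filters @ power)
print(log_mel.shape)  # (80, 3000)
```

The matrix product maps 201 linear-frequency bins to 80 Mel bins per frame, which is why the filterbank has shape (n_mels, 201).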

Key Hyperparameters and Tensor Dimensions

Whisper’s audio processing relies on precise constants defined at the top of whisper/audio.py:

  • SAMPLE_RATE = 16000: Target sampling rate in Hz
  • N_FFT = 400: FFT size, covering 25 ms windows and yielding 201 frequency bins
  • HOP_LENGTH = 160: Frame stride of 10 ms
  • CHUNK_LENGTH = 30: Audio duration per chunk in seconds
  • N_SAMPLES = 480000: Total samples per chunk (30 * 16000)
  • N_FRAMES = 3000: Time frames in the output spectrogram (N_SAMPLES / HOP_LENGTH)
  • n_mels = 80 or 128: Number of Mel filter banks, passed as an argument to log_mel_spectrogram (80 for most models, 128 for large-v3)

The encoder expects input tensors of shape (n_mels, N_FRAMES). N_FRAMES is computed with the exact_div helper from whisper/utils.py, which asserts that N_SAMPLES divides evenly by HOP_LENGTH.
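The relationships between these values are plain arithmetic, reproduced here as a standalone sketch. The exact_div function below is a small re-implementation written for illustration, assuming it does nothing more than an asserted integer division.

```python
def exact_div(x: int, y: int) -> int:
    """Integer division that fails loudly if there is a remainder."""
    assert x % y == 0, f"{x} is not divisible by {y}"
    return x // y


SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
CHUNK_LENGTH = 30

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE          # samples per 30-second chunk
N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH)     # spectrogram frames per chunk
window_ms = 1000 * N_FFT / SAMPLE_RATE          # window duration in milliseconds
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE        # stride duration in milliseconds
n_freq_bins = N_FFT // 2 + 1                    # bins from a real-input FFT

print(N_SAMPLES, N_FRAMES, window_ms, hop_ms, n_freq_bins)
# 480000 3000 25.0 10.0 201
```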

Caching and Device Placement Strategies

Whisper optimizes repeated inference through two mechanisms implemented in whisper/audio.py. The mel_filters function uses @lru_cache, so the filterbank is loaded from assets/mel_filters.npz only once per distinct set of arguments in a process, which avoids repeated disk reads during batch processing. For hardware acceleration, the log_mel_spectrogram function accepts a device argument that moves the waveform tensor to that device before STFT computation, so the Fourier transforms execute on CUDA when a GPU device is passed rather than on the CPU.
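The caching behavior can be demonstrated with a small standalone sketch. Here load_filters is a hypothetical stand-in for an expensive filterbank load, and the counter exists only to make cache hits visible.

```python
from functools import lru_cache

load_count = 0  # counts simulated disk reads


@lru_cache(maxsize=None)
def load_filters(device: str, n_mels: int) -> tuple:
    """Hypothetical stand-in for loading a filterbank from disk."""
    global load_count
    load_count += 1
    return (device, n_mels)  # placeholder for the real weight matrix


load_filters("cpu", 80)     # first call: performs the load
load_filters("cpu", 80)     # same arguments: served from the cache
load_filters("cpu", 128)    # different arguments: loads again
print(load_count)  # 2
```

Because lru_cache keys on the call arguments, each distinct argument combination is loaded once and then reused for the life of the process.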

Practical Implementation Examples

Generate a Standard 80-Mel Spectrogram

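from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim

audio_path = "examples/jfk.wav"

# Decode to 16 kHz mono, then pad or trim to exactly 30 s (480,000 samples)
audio = pad_or_trim(load_audio(audio_path))

# A 30-second input yields a tensor of shape (80, 3000)
mel = log_mel_spectrogram(audio)
print(mel.shape)  # torch.Size([80, 3000])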

GPU Acceleration with 128 Mel Bins

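import torch
from whisper.audio import load_audio, log_mel_spectrogram, pad_or_trim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pad or trim first so the output has the fixed 3000-frame length
audio = pad_or_trim(load_audio("examples/harvard.wav"))

mel_128 = log_mel_spectrogram(
    audio,
    n_mels=128,      # Required for Whisper large-v3
    device=device,
)
print(mel_128.shape)  # torch.Size([128, 3000])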

Processing Raw NumPy Waveforms

import numpy as np
from whisper.audio import log_mel_spectrogram

# Simulate 2 seconds of a 440 Hz sine wave sampled at 16 kHz
sr = 16000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
waveform = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# log_mel_spectrogram does not pad or trim by itself, so 2 s gives 200 frames
mel = log_mel_spectrogram(waveform)
print(mel.shape)  # torch.Size([80, 200])
print(f"Min: {mel.min():.3f}, Max: {mel.max():.3f}")  # Max - Min is at most 2.0

Summary

  • Four-stage pipeline: Whisper uses ffmpeg loading (load_audio), fixed-length padding (pad_or_trim), STFT computation, and Mel-scale projection to transform audio.
  • Fixed dimensions: All inputs become 30-second, 16 kHz mono waveforms yielding (80, 3000) or (128, 3000) log-Mel spectrograms.
  • Performance optimizations: Filterbanks are LRU-cached from assets/mel_filters.npz, and GPU device placement occurs before Fourier transforms.
  • Core file: All functionality resides in whisper/audio.py, with helper utilities in whisper/utils.py.

Frequently Asked Questions

What audio file formats does Whisper support for mel spectrogram generation?

Whisper supports virtually any audio format—including MP3, WAV, FLAC, and AAC—through its load_audio function in whisper/audio.py, which delegates decoding to ffmpeg via subprocess.run. As long as ffmpeg is installed on the system, Whisper can extract the raw PCM stream and resample it to the required 16 kHz mono format.

Why does Whisper enforce exactly 30 seconds of audio per chunk?

The 30-second constraint (CHUNK_LENGTH = 30) ensures consistent tensor dimensions for the transformer encoder, which expects fixed-size inputs of shape (n_mels, 3000). The pad_or_trim function zero-pads shorter audio or truncates longer clips to exactly 480,000 samples, which keeps every input the same size for batching and matches the fixed length of the encoder's positional embeddings.

What is the difference between 80-Mel and 128-Mel configurations in Whisper?

The 80-Mel configuration is the default (n_mels=80) and is used by most model variants, including base and small, while the 128-Mel configuration is required for the large-v3 model. The Mel filterbank weights are pre-computed and stored in whisper/assets/mel_filters.npz, and the mel_filters function loads and caches the appropriate matrix (80x201 or 128x201) based on the n_mels parameter passed to log_mel_spectrogram.

How does Whisper optimize repeated spectrogram generation?

Whisper uses @lru_cache on the mel_filters function to load the Mel filterbank from assets/mel_filters.npz only once per Python process. Additionally, when a device argument is provided to log_mel_spectrogram, the waveform tensor is moved to that device (e.g., CUDA) before calling torch.stft, ensuring that expensive FFT computations execute on GPU hardware rather than CPU.
