Whisper Model Sizes: VRAM Requirements and Speed Trade-offs Explained

OpenAI Whisper provides six model variants (tiny, base, small, medium, large, and turbo) that range from 39 million to 1.55 billion parameters, require between ~1 GB and ~10 GB of GPU VRAM, and run at relative speeds from 1× to roughly 10×, measured against the large model on an NVIDIA A100 GPU.

OpenAI's Whisper is an open-source automatic speech recognition (ASR) system that ships with multiple pre-trained checkpoints to accommodate different hardware constraints and latency requirements. Knowing the available model sizes, their VRAM requirements, and their speed trade-offs helps developers pick a suitable variant for edge devices, consumer GPUs, or data center hardware. The openai/whisper repository implements these variants with a single Transformer-based encoder-decoder architecture defined in whisper/model.py, where the sizes differ mainly in layer count, hidden width, and number of attention heads.

Available Whisper Model Sizes and Specifications

The Whisper ecosystem provides six official model sizes, four of which offer English-only variants (denoted by the .en suffix) optimized for monolingual speech recognition. According to the official documentation in README.md (lines 64-71), the specifications break down as follows:

  • tiny: 39 M parameters, ~1 GB VRAM, ~10× relative speed, available as tiny.en (English-only) and tiny (multilingual)
  • base: 74 M parameters, ~1 GB VRAM, ~7× relative speed, available as base.en and base
  • small: 244 M parameters, ~2 GB VRAM, ~4× relative speed, available as small.en and small
  • medium: 769 M parameters, ~5 GB VRAM, ~2× relative speed, available as medium.en and medium
  • large: 1,550 M parameters, ~10 GB VRAM, 1× relative speed (baseline), multilingual only (large)
  • turbo: 809 M parameters, ~6 GB VRAM, ~8× relative speed, multilingual only (turbo)

The large model serves as the accuracy baseline but requires the most memory, while turbo offers a compelling middle ground with near-large accuracy at significantly reduced latency.
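The specifications above can be captured as a small lookup structure. The sketch below is illustrative only: ModelSpec, MODEL_SPECS, and largest_model_that_fits are hypothetical names that are not part of the whisper package, the numbers are the approximate figures quoted above, and parameter count is used as a crude stand-in for accuracy.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    params_millions: int
    vram_gb: int         # approximate documented requirement
    relative_speed: int  # relative to large (1x)


# Approximate figures from the list above (hypothetical helper table).
MODEL_SPECS = {
    "tiny": ModelSpec(39, 1, 10),
    "base": ModelSpec(74, 1, 7),
    "small": ModelSpec(244, 2, 4),
    "medium": ModelSpec(769, 5, 2),
    "large": ModelSpec(1550, 10, 1),
    "turbo": ModelSpec(809, 6, 8),
}


def largest_model_that_fits(vram_gb: float) -> Optional[str]:
    """Return the highest-parameter model whose documented VRAM fits, or None."""
    candidates = [name for name, spec in MODEL_SPECS.items() if spec.vram_gb <= vram_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda name: MODEL_SPECS[name].params_millions)


print(largest_model_that_fits(8))  # turbo
```

Treat the result as a starting point rather than a guarantee, since the documented VRAM figures are approximate.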

Architecture and Memory Consumption

All Whisper variants share the same Transformer encoder-decoder architecture, implemented in the Whisper class within whisper/model.py. The model loads checkpoint metadata to construct a dimension specification object (model.dims) that defines the width (hidden units) and depth (number of layers) for that specific size. Larger models have more layers and wider weight matrices for self-attention and feed-forward blocks, which directly increases GPU memory consumption during the forward pass.
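To see how parameter count feeds into memory, a back-of-the-envelope estimate multiplies the number of parameters by the bytes stored per parameter. The sketch below covers the weights only and does not model activations, decoding caches, or the CUDA context; weight_memory_gib is a hypothetical helper, not a whisper or PyTorch function.

```python
def weight_memory_gib(params_millions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB (2 bytes = fp16, 4 = fp32)."""
    return params_millions * 1e6 * bytes_per_param / 1024**3


# Parameter counts (in millions) as quoted in this article.
for name, params in [("tiny", 39), ("small", 244), ("large", 1550)]:
    fp16 = weight_memory_gib(params, 2)
    fp32 = weight_memory_gib(params, 4)
    print(f"{name:>6}: {fp16:.2f} GiB in fp16, {fp32:.2f} GiB in fp32")
```

For large, the weights alone come to roughly 2.9 GiB in fp16, well under the documented ~10 GB, which suggests that a substantial part of the runtime footprint comes from something other than the stored weights.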

The turbo variant is an optimized version of large-v3. It keeps the large-v3 encoder architecture but reduces the decoder from 32 layers to 4, which brings the total to approximately 809 M parameters. Because autoregressive decoding runs the decoder once per output token, the much smaller decoder yields roughly 8× the speed of the standard large model with ~6 GB of VRAM, compared with the large model's ~10 GB, at the cost of a small degradation in accuracy.
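A rough parameter count illustrates how much of a large model sits in the decoder. The sketch below assumes standard Transformer blocks (about 12·d² weights per encoder layer and 16·d² per decoder layer, since decoder layers add cross-attention) and ignores biases, layer norms, and convolutions. The inputs are assumptions for illustration: a width of 1280, 32 encoder layers, a vocabulary of roughly 51,866 tokens, and 32 decoder layers for large versus 4 for turbo. Check model.dims on a loaded checkpoint for the real values.

```python
def approx_params_millions(width: int, enc_layers: int, dec_layers: int, vocab: int) -> float:
    """Rough weight count in millions, ignoring biases, norms, and conv layers."""
    per_encoder_layer = 12 * width**2  # self-attention (4*d^2) + MLP (8*d^2)
    per_decoder_layer = 16 * width**2  # adds cross-attention (4*d^2)
    embedding = vocab * width          # token embedding, reused for the output projection
    total = enc_layers * per_encoder_layer + dec_layers * per_decoder_layer + embedding
    return total / 1e6


# Assumed dimensions (see the lead-in); not read from a checkpoint.
large = approx_params_millions(width=1280, enc_layers=32, dec_layers=32, vocab=51866)
turbo = approx_params_millions(width=1280, enc_layers=32, dec_layers=4, vocab=51866)
print(f"large ~{large:.0f} M, turbo ~{turbo:.0f} M")
```

Under these assumptions the estimate lands near 1,534 M and 800 M, close to the documented 1,550 M and 809 M; the difference is presumably the terms the sketch leaves out.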

Loading Models and Inspecting Hardware Requirements

The load_model function in whisper/__init__.py (lines 17-23) resolves model names to download URLs via the internal _MODELS dictionary and handles checkpoint caching. You can programmatically inspect a model's architectural dimensions to verify resource requirements before transcription.

import whisper

# Load a lightweight model for low-VRAM environments (~1 GB)
model = whisper.load_model("base")  # 74 M parameters

# Load turbo, the faster large-v3 derivative (~6 GB VRAM)
turbo = whisper.load_model("turbo")  # 809 M parameters, ~8× speed

# Inspect architectural dimensions that determine memory usage
# (field names come from the ModelDimensions dataclass in whisper/model.py)
print(f"Encoder layers: {model.dims.n_audio_layer}")
print(f"Decoder layers: {model.dims.n_text_layer}")
print(f"Hidden dimension: {model.dims.n_audio_state}")

# Transcribe audio
result = model.transcribe("audio.wav")
print(result["text"])

Performance Benchmarks and Trade-offs

When benchmarked on an NVIDIA A100 GPU using English speech transcription, the relative speed differences between Whisper model sizes reveal significant trade-offs between accuracy and latency:

  • tiny: ~10× faster than large, suitable for real-time prototyping on resource-constrained devices
  • base: ~7× faster than large, ideal for batch processing on consumer GPUs
  • small: ~4× faster than large, balancing accuracy and speed for production deployments
  • medium: ~2× faster than large, offering near-large accuracy with half the latency
  • large: Baseline 1× speed, maximum accuracy, requires ~10 GB VRAM
  • turbo: ~8× faster than large, optimized for low-latency applications requiring multilingual support
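Because these figures are relative to large, they translate into a rough time estimate once a large-model timing is known. This is a minimal sketch that assumes the documented ratios hold on your hardware, which they may not, since real-world speed depends on the language, speaking rate, and GPU. RELATIVE_SPEED and estimate_seconds are hypothetical names.

```python
# Relative speeds as listed above (large = 1x).
RELATIVE_SPEED = {"tiny": 10, "base": 7, "small": 4, "medium": 2, "large": 1, "turbo": 8}


def estimate_seconds(large_seconds: float, model: str) -> float:
    """Scale a measured large-model runtime by the documented relative speed."""
    return large_seconds / RELATIVE_SPEED[model]


# Example: a job that takes 120 s with the large model.
for name in RELATIVE_SPEED:
    print(f"{name:>6}: ~{estimate_seconds(120, name):.0f} s")
```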

Summary

  • Whisper provides six model sizes ranging from 39 M to 1.55 B parameters, with VRAM requirements scaling from ~1 GB to ~10 GB.
  • English-only variants (.en) exist for tiny, base, small, and medium sizes, offering slightly better accuracy on monolingual English audio.
  • The turbo model delivers ~8× the speed of large with only ~6 GB VRAM; it is an optimized version of large-v3 with a much smaller decoder (4 layers instead of 32).
  • Model dimensions are accessible via model.dims after loading, exposing layer counts and hidden-state widths that correlate directly with memory consumption.
  • The load_model function in whisper/__init__.py automatically downloads and caches checkpoints based on the _MODELS registry.

Frequently Asked Questions

Which Whisper model size should I use for real-time transcription on limited hardware?

For devices with only 2 GB of VRAM, use the small model (244 M parameters), which runs at approximately 4× the speed of the large baseline while maintaining acceptable word-error rates. If VRAM is constrained to ~1 GB, the base or tiny models provide the fastest inference (~7× to ~10× speed) at the cost of reduced accuracy on noisy audio or accented speech.

What is the difference between .en models and multilingual models?

The .en variants (tiny.en, base.en, small.en, medium.en) are trained exclusively on English audio and typically achieve lower word-error rates on monolingual English speech than their multilingual counterparts. The large and turbo models are multilingual only. For non-English transcription or automatic language detection, use any multilingual checkpoint, that is, any name without the .en suffix, including the multilingual tiny, base, small, and medium models; these checkpoints cover roughly 100 languages.
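The naming rule can be expressed as a small helper. This is a sketch only: checkpoint_name and ENGLISH_ONLY_SIZES are hypothetical names, and the set of sizes with .en variants is taken from the list in this article.

```python
# Sizes that ship an English-only (.en) variant, per the list in this article.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def checkpoint_name(size: str, english_only: bool) -> str:
    """Return e.g. 'small.en' when an English-only variant exists, else the plain size."""
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


print(checkpoint_name("small", english_only=True))  # small.en
print(checkpoint_name("turbo", english_only=True))  # turbo
```

The returned string can be passed directly to whisper.load_model().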

How does the turbo model achieve faster speeds than large with similar parameter counts?

The turbo model is an optimized version of large-v3 in which the decoder is reduced from 32 layers to 4, which is also why its parameter count drops to about 809 M, roughly half that of large. Since the decoder runs once for every generated token, shrinking it lets turbo process audio at ~8× the speed of the standard large model while using ~6 GB of VRAM versus the large model's ~10 GB. The change retains most of the large model's accuracy while significantly reducing computational overhead.

How can I verify VRAM requirements before running transcription?

After loading a model with whisper.load_model(), inspect the model.dims object to view the encoder and decoder layer counts and hidden dimensions. While PyTorch does not provide a native VRAM calculator, you can estimate memory usage by monitoring GPU allocation with torch.cuda.memory_allocated() immediately after model instantiation, or reference the documented requirements (~1 GB for tiny/base, ~2 GB for small, ~5 GB for medium, ~6 GB for turbo, and ~10 GB for large).
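The measurement described above might look like the following sketch. It assumes a CUDA-capable GPU with the torch and openai-whisper packages installed; measure_weights_vram and bytes_to_gib are hypothetical helper names. Note that torch.cuda.memory_allocated() counts tensor allocations only, so the figure excludes the CUDA context and any memory used later during transcription.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def measure_weights_vram(model_name: str = "base") -> float:
    """Load a checkpoint onto the GPU and return the memory it allocated, in GiB."""
    # Imported lazily so bytes_to_gib stays usable without a GPU stack installed.
    import torch
    import whisper

    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA device is required for this measurement")

    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated()
    model = whisper.load_model(model_name, device="cuda")  # keep a reference while measuring
    after = torch.cuda.memory_allocated()
    return bytes_to_gib(after - before)
```

Calling measure_weights_vram("small") on a GPU machine returns the allocation for the weights; to capture the peak during a real run, torch.cuda.max_memory_allocated() can be read after model.transcribe() completes.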
