How to Use Whisper for Language Detection Without Transcription: 2 Methods Explained
You can detect the spoken language in an audio file with Whisper either by calling model.detect_language() directly, which runs the encoder and a single decoder step, or by setting task="lang_id" in DecodingOptions so that the high-level decoding API returns before generating any text tokens.
The OpenAI Whisper library provides dedicated pathways to identify the language of an audio clip without incurring the computational cost of full transcription. Whether you need to route audio to language-specific pipelines or simply log metadata, these methods allow you to use Whisper for language detection without transcription overhead.
Understanding Whisper's Language Detection Architecture
Whisper identifies languages through a single decoder step that predicts the language token immediately following the start-of-transcript token (<|startoftranscript|>). This requires only one forward pass through the encoder and one decoder step, bypassing the autoregressive text generation loop.
The Direct Approach (model.detect_language)
The detect_language function is implemented in whisper/decoding.py and attached to the Whisper class in whisper/model.py, so it can be called as model.detect_language. It accepts a mel-spectrogram (or already-encoded audio features) and a tokenizer, runs the encoder if needed, and performs a single decoder forward pass whose logits are masked so that only language tokens can be selected.
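The masking step described above can be illustrated without Whisper at all. The following is a minimal sketch in plain Python, assuming a toy six-entry vocabulary in which ids 2, 3, and 4 stand in for language tokens; the function name `pick_language` and the logit values are made up for illustration and are not part of the Whisper API.

```python
import math

def pick_language(logits, language_token_ids):
    """Mask every non-language token, then return (best_token_id, probs)."""
    allowed = set(language_token_ids)
    # Non-language tokens are set to -inf so they can never win the argmax
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    best = max(range(len(masked)), key=masked.__getitem__)
    # Softmax over the masked logits; exp(-inf) evaluates to 0.0
    peak = masked[best]
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    probs = {i: exps[i] / total for i in language_token_ids}
    return best, probs

# Toy logits: ids 0 and 5 score highest overall but are not language tokens
logits = [9.0, 1.0, 2.0, 4.0, 3.0, 8.0]
token_id, probs = pick_language(logits, [2, 3, 4])
print(token_id, probs)
```

Even though ids 0 and 5 carry the largest raw logits, the mask removes them from consideration, so the winner is chosen among the language tokens only and the probabilities sum to one over that subset.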
The Decoding Task Approach (task="lang_id")
The DecodingTask class in whisper/decoding.py supports a dedicated task mode. When DecodingOptions specifies task="lang_id", the decoding pipeline invokes model.detect_language and returns a DecodingResult carrying the detected language and its probability distribution right away, skipping all text token generation.
Method 1: Direct Language Detection with model.detect_language
This approach provides the most lightweight implementation, requiring only the model instance and audio preprocessing.
import whisper
from whisper.tokenizer import LANGUAGES, get_tokenizer
# Load a multilingual model (e.g., "base"); ".en" models cannot detect languages
model = whisper.load_model("base")
# Load audio and pad or trim it to the 30-second window the encoder expects
audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
# n_mels is 80 for most models and 128 for large-v3
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Get the multilingual tokenizer that matches the model
tokenizer = get_tokenizer(model.is_multilingual,
                          num_languages=model.num_languages)
# Detect language: a 2-D mel yields one token and one {code: probability} dict
language_token, language_probs = model.detect_language(mel, tokenizer)
# LANGUAGES is keyed by language code (e.g., "en"), not by token id
language_code = max(language_probs, key=language_probs.get)
print(f"Detected language: {LANGUAGES[language_code].title()}")
# Optional: print the full probability distribution
# print(language_probs)
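The detect_language function, implemented in whisper/decoding.py and exposed as a method on the Whisper class, handles the core logic: it encodes the mel-spectrogram, passes the <|startoftranscript|> token through the decoder, and masks out all non-language tokens to isolate the language prediction. It returns the most likely language token together with a dictionary that maps language codes to probabilities.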
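When the top guess alone is not enough, the probability distribution can be ranked to see the runners-up. The helper below is a small sketch, assuming language_probs is a dictionary mapping language codes to probabilities; `top_languages` is a hypothetical name, and the example values are invented rather than real model output.

```python
def top_languages(probs, k=3):
    """Return the k most likely (code, probability) pairs, highest first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up distribution for illustration; real values come from detect_language
example_probs = {"en": 0.91, "de": 0.05, "nl": 0.03, "fr": 0.01}
for code, p in top_languages(example_probs, k=2):
    print(f"{code}: {p:.2f}")
```

A small gap between the first and second entries is a hint that the clip may be ambiguous, for example with closely related languages.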
Method 2: Using the High-Level decode API with task="lang_id"
For applications already using the decoding pipeline, this method integrates seamlessly with existing Whisper workflows.
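import whisper
from whisper.decoding import DecodingOptions
from whisper.tokenizer import LANGUAGES
model = whisper.load_model("small")
# Load audio, fit it to the 30-second window, and compute the mel-spectrogram
audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Ask Whisper to run a language-identification task only (language=None by default)
# fp16 is enabled only on CUDA, since half precision may be unsupported on CPU
options = DecodingOptions(task="lang_id", fp16=model.device.type == "cuda")
result = whisper.decode(model, mel, options)
# result.language is a language code such as "en"
print(f"Detected language: {LANGUAGES[result.language].title()}")
# result.language_probs holds the full distribution (if needed)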
The DecodingTask._detect_language method in whisper/decoding.py calls model.detect_language whenever no language is specified or the task is lang_id. For the lang_id task, the decoding pipeline then short-circuits the remaining steps and returns a DecodingResult containing only the language information.
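One common use of the returned language is routing audio to a language-specific pipeline. The sketch below shows one possible way to do that with a confidence threshold. `LangResult` is a hypothetical stand-in that mimics only the language and language_probs fields of the decode result, and the pipeline names, `choose_pipeline`, and the 0.5 threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class LangResult:
    """Hypothetical stand-in exposing the two fields read from a decode result."""
    language: str
    language_probs: Optional[Dict[str, float]] = None

def choose_pipeline(result, pipelines, fallback="generic", min_confidence=0.5):
    """Pick a pipeline name for the detected language, or fall back when unsure."""
    probs = result.language_probs or {}
    confidence = probs.get(result.language, 0.0)
    if confidence < min_confidence:
        return fallback
    # Languages without a dedicated pipeline also use the fallback
    return pipelines.get(result.language, fallback)

pipelines = {"en": "english-asr", "es": "spanish-asr"}
confident = choose_pipeline(LangResult("es", {"es": 0.82, "pt": 0.15}), pipelines)
uncertain = choose_pipeline(LangResult("es", {"es": 0.41, "pt": 0.39}), pipelines)
print(confident, uncertain)
```

The threshold is a design choice rather than a Whisper setting: a higher value sends more clips to the fallback path in exchange for fewer misrouted ones.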
Batch Processing Multiple Audio Files
The detect_language method accepts batched inputs, so several audio clips can be processed together in one batched encoder pass and one decoder step.
import whisper
import torch
from whisper.tokenizer import LANGUAGES, get_tokenizer
model = whisper.load_model("medium")
tokenizer = get_tokenizer(model.is_multilingual,
                          num_languages=model.num_languages)
files = ["a.wav", "b.wav", "c.wav"]
# Pad or trim every clip to 30 s so all spectrograms share one shape
mels = [whisper.log_mel_spectrogram(whisper.pad_or_trim(whisper.load_audio(f)),
                                    n_mels=model.dims.n_mels) for f in files]
batch = torch.stack(mels).to(model.device)  # shape (batch, n_mels, 3000)
# For a batch, lang_probs is a list with one {code: probability} dict per clip
lang_tokens, lang_probs = model.detect_language(batch, tokenizer)
for f, probs in zip(files, lang_probs):
    code = max(probs, key=probs.get)
    print(f"{f}: {LANGUAGES[code].title()}")
This works because model.detect_language treats the first dimension as the batch: the encoder and the single decoder step each run once over the whole batch, and the method returns one language token and one probability dictionary for each item.
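For long file lists, a single huge batch may not fit in memory, so it can help to process the files in fixed-size groups. The following is a rough sketch of that idea: `chunked` and `detect_in_chunks` are hypothetical helpers, and `detect_batch` is an assumed callable that takes a list of file paths and returns one language code per path (for example, a wrapper around the batched detection code). A stub is used here so the sketch runs without a model.

```python
from typing import Callable, Dict, Iterator, List, Sequence

def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

def detect_in_chunks(
    files: Sequence[str],
    detect_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> Dict[str, str]:
    """Run detect_batch on each group and collect {file: language_code}."""
    results: Dict[str, str] = {}
    for group in chunked(files, batch_size):
        codes = detect_batch(group)
        results.update(zip(group, codes))
    return results

def fake_detect(group: List[str]) -> List[str]:
    """Stub detector for illustration: labels every file as English."""
    return ["en"] * len(group)

detected = detect_in_chunks(["a.wav", "b.wav", "c.wav"], fake_detect, batch_size=2)
print(detected)
```

The batch size is a tuning knob: larger groups make better use of the accelerator, while smaller ones keep peak memory lower.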
Key Source Files and Implementation Details
| File | Purpose | Key Components |
|---|---|---|
| whisper/model.py | Core model architecture | Whisper class (exposes detect_language as a method), encoder and decoder modules |
| whisper/decoding.py | Language detection logic, decoding pipeline, and task handling | detect_language, decode, DecodingOptions, DecodingTask._detect_language |
| whisper/transcribe.py | High-level transcription entry point | transcribe function, command-line interface |
| whisper/audio.py | Audio preprocessing utilities | load_audio, pad_or_trim, log_mel_spectrogram |
| whisper/tokenizer.py | Tokenizer creation and language tables | get_tokenizer, LANGUAGES, TO_LANGUAGE_CODE |
Both detection methods rely on the same mechanism: a single encoder pass followed by a single decoder step that predicts the language token immediately following the <|startoftranscript|> token. This bypasses the autoregressive text generation loop entirely, making language detection significantly faster than full transcription.
Summary
- Direct API: Call model.detect_language(mel, tokenizer) for the most lightweight path, which runs one encoder pass and one decoder step.
- High-level API: Set task="lang_id" in DecodingOptions when calling whisper.decode() to use the standard decoding pipeline while skipping text generation.
- Performance: Both methods require only one encoder pass and a single decoder step, avoiding the costly autoregressive loop used for transcription.
- Batch support: Pass a batch of mel-spectrograms to detect_language to process multiple audio files efficiently.
Frequently Asked Questions
Can Whisper detect language without transcribing the audio?
Yes. Whisper provides dedicated pathways to identify the spoken language without generating text tokens. You can use model.detect_language() for a direct model call, or set task="lang_id" in DecodingOptions when using the high-level decode function. Both methods stop after predicting the language token that follows the start-of-transcript token.
Which Whisper method is faster for language detection?
Both the direct model.detect_language approach and the task="lang_id" decoding approach offer identical performance characteristics. Each requires only a single forward pass through the encoder and one decoder step to predict the language token. This is significantly faster than full transcription, which requires an autoregressive loop generating hundreds of tokens.
Does language detection work with all Whisper model sizes?
Yes, for multilingual models. Language detection is available in every multilingual Whisper checkpoint (tiny, base, small, medium, large, large-v1, large-v2, large-v3). The English-only variants (tiny.en, base.en, small.en, medium.en) have no language tokens, so detect_language raises an error for them. Larger models generally provide more accurate language identification, particularly for similar languages or noisy audio, because they have more capacity to distinguish subtle acoustic and linguistic patterns.
Can I detect languages in multiple audio files simultaneously?
Yes. The model.detect_language method accepts batched mel-spectrograms. Pad or trim each audio file to 30 seconds, convert it to a mel-spectrogram, stack the spectrograms along the batch dimension (shape (batch, 80, 3000), or 128 mel bins for large-v3), and pass the batch to detect_language. The method returns a language token and a probability dictionary for each item in the batch, enabling efficient bulk processing.