Whisper DecodingOptions Sampling Parameters: A Complete Guide to Stochastic Generation
The key sampling parameters in whisper.DecodingOptions are temperature for controlling randomness, best_of for candidate selection, beam_size and patience for beam search, and length_penalty for ranking candidates. Quality-control thresholds such as compression_ratio_threshold, logprob_threshold, and no_speech_threshold are arguments of transcribe(), which wraps the decoder in a temperature-fallback loop.
OpenAI Whisper uses the DecodingOptions dataclass in whisper/decoding.py to hold the options that control how the model generates text from one 30-second window of encoded audio. In sampling mode (temperature above 0, stochastic generation rather than greedy or beam-search decoding), these parameters set the trade-off between transcription reliability and output diversity.
Core Sampling Parameters in DecodingOptions
Temperature and Stochastic Control
The temperature parameter scales logits before the softmax operation. Lower values sharpen the probability distribution, making the model more deterministic, while higher values inject randomness.
- 0.0 → Greedy decoding (most deterministic)
- 0.7-1.0 → Moderate diversity
- >1.0 → High randomness, more varied outputs
In whisper/decoding.py, the decoder takes the argmax of the logits when temperature is 0; otherwise it samples each token from a categorical distribution over logits / temperature.
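The effect of temperature on the token distribution can be illustrated without loading a model. The following is a minimal sketch in plain Python, not Whisper's own code; the function name softmax_with_temperature and the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing the logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

The top-scoring token receives more probability mass at temperature 0.5 than at 2.0, which is why low temperatures behave almost deterministically.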
Best-of Sampling and Candidate Selection
The best_of parameter draws multiple independent samples, so it only applies when temperature is above 0 and cannot be combined with beam_size. The decoder generates best_of candidate transcriptions and returns the one with the highest score, which by default is the average log-probability per token.
This trades compute for quality:
import whisper
model = whisper.load_model("base")
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)  # decode() expects a 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Draw 5 samples and pick the best
options = whisper.DecodingOptions(
    temperature=0.8,
    best_of=5,
)
result = model.decode(mel, options)
print(result.text)
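The selection step behind best_of can be sketched independently of the model. This is an illustration under assumptions, not Whisper's code: the candidate texts and log-probabilities are invented, and pick_best is a hypothetical helper that ranks by average log-probability per token.

```python
def pick_best(candidates):
    """Return the candidate with the highest average log-probability per token."""
    def avg_logprob(candidate):
        logprobs = candidate["token_logprobs"]
        return sum(logprobs) / len(logprobs)
    return max(candidates, key=avg_logprob)

# Invented samples standing in for best_of=3 decoding runs
candidates = [
    {"text": "hello word", "token_logprobs": [-0.9, -1.4]},
    {"text": "hello world", "token_logprobs": [-0.2, -0.3]},
    {"text": "hollow world", "token_logprobs": [-1.1, -0.4, -0.6]},
]

best = pick_best(candidates)
print(best["text"])
```

Averaging per token rather than summing keeps longer candidates from being penalized simply for containing more tokens.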
Patience and Length Penalty
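The patience parameter applies only to beam search, so it requires beam_size to be set; DecodingOptions rejects patience without it. It implements the patience factor from arxiv:2204.05424: the search keeps running until it has collected round(beam_size * patience) finished hypotheses. Values greater than 1.0 therefore let the search explore longer, while the default of 1.0 matches conventional beam search.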
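As a rough illustration of how patience changes the stopping point, the sketch below computes how many finished hypotheses a beam search would collect before stopping. The helper name max_finished_candidates is hypothetical, and the rounding rule is an assumption based on the reference implementation.

```python
def max_finished_candidates(beam_size, patience=1.0):
    """Number of finished hypotheses to collect before beam search stops."""
    return round(beam_size * patience)

for patience in (1.0, 1.5, 2.0):
    print(patience, max_finished_candidates(4, patience))
```

With a beam size of 4, raising patience from 1.0 to 2.0 doubles the number of finished hypotheses the search waits for, which costs time but does not add token-level randomness.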
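The length_penalty parameter sets the exponent (the "alpha" of the Google NMT paper) used when ranking finished candidates, whether they come from beams or best-of-N samples. Each candidate's summed log-probability is divided by ((5 + length) / 6) ** length_penalty, and the value must lie between 0 and 1:
- None (the default) divides by the token count instead, which is simple length normalization
- 0.0 applies no normalization, so the raw summed log-probability favors shorter candidates
- Values closer to 1.0 normalize more strongly and reduce the bias toward short candidates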
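The ranking rule can be tried on made-up numbers. The sketch below assumes the scoring used in the reference implementation: the summed log-probability is divided by the token count when length_penalty is None, and by the Google NMT penalty ((5 + length) / 6) ** length_penalty otherwise. The function name ranking_score and the candidate values are illustrative only.

```python
def ranking_score(sum_logprob, length, length_penalty=None):
    """Score a finished candidate; higher is better."""
    if length_penalty is None:
        penalty = length  # simple length normalization
    else:
        penalty = ((5 + length) / 6) ** length_penalty  # Google NMT penalty
    return sum_logprob / penalty

short = (-4.0, 5)   # (summed log-probability, token count), invented values
long = (-7.0, 12)

for alpha in (None, 0.0, 0.6, 1.0):
    s = ranking_score(*short, length_penalty=alpha)
    l = ranking_score(*long, length_penalty=alpha)
    print(alpha, "short" if s > l else "long")
```

For these values the longer candidate wins only under plain length normalization, while every tested alpha still prefers the shorter one, which shows that the choice of length_penalty can change which transcription is returned.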
Fallback and Quality Control Parameters
Compression Ratio and Log Probability Thresholds
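Whisper implements automatic fallback in transcribe() (whisper/transcribe.py) rather than in DecodingOptions. transcribe() accepts temperature as a single value or a tuple, by default (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), and decodes each segment at successive temperatures until a result passes the quality checks. temperature_increment_on_fallback is a command-line option that builds this tuple from a starting temperature; in Python, pass the tuple directly.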
The compression_ratio_threshold argument (default 2.4) is compared against the gzip compression ratio of the generated text. A ratio above the threshold indicates repetitive output, so the segment is decoded again at the next temperature.
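The compression-ratio check is easy to reproduce with the standard library. The sketch below uses zlib, which implements the same DEFLATE compression that gzip uses; the sample strings are invented, and 2.4 is the threshold discussed above.

```python
import zlib

def compression_ratio(text):
    """Ratio of raw UTF-8 size to compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "The quick brown fox jumps over the lazy dog near the river bank."
looping = "thank you " * 40  # the kind of loop a stuck decoder can emit

for label, text in (("normal", normal), ("looping", looping)):
    ratio = compression_ratio(text)
    print(label, round(ratio, 2), "retry" if ratio > 2.4 else "ok")
```

Ordinary sentences barely compress, while a repeated phrase shrinks to a small fraction of its size and lands well above the threshold.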
The logprob_threshold argument (default -1.0) sets the minimum average log-probability per token. A result scoring below this value is treated as failed, and the segment is decoded again at the next temperature.
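# Configure fallback behavior through transcribe()
result = model.transcribe(
    "audio.wav",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # start greedy, step up on failure
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    beam_size=5,   # used only at temperature 0
    patience=1.5,  # requires beam_size; dropped when temperature > 0
)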
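The retry logic can be sketched as a loop over temperatures. This is a simplified illustration, not the library's implementation: FakeResult and fake_decode are hypothetical stand-ins for a real decoding call, and the real loop also treats likely-silent segments specially.

```python
from dataclasses import dataclass

@dataclass
class FakeResult:
    """Hypothetical stand-in for a decoding result."""
    text: str
    avg_logprob: float
    compression_ratio: float
    temperature: float

def fake_decode(temperature):
    """Hypothetical decoder that loops until the temperature reaches 0.4."""
    if temperature < 0.4:
        return FakeResult("la la la la", -0.3, 3.1, temperature)
    return FakeResult("a plausible sentence", -0.5, 1.4, temperature)

def decode_with_fallback(temperatures, compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Try each temperature in order and stop at the first acceptable result."""
    result = None
    for t in temperatures:
        result = fake_decode(t)
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_unlikely = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break
    return result  # the last attempt is returned even if every check failed

final = decode_with_fallback((0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
print(final.temperature, final.text)
```

Starting at temperature 0 keeps the common case deterministic and only pays for extra decoding passes on segments that fail a check.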
No-Speech Detection
The no_speech_threshold argument of transcribe() (default 0.6) enables silence detection. When the probability of the <|nospeech|> token, reported as no_speech_prob, exceeds this threshold, the segment is skipped and contributes no text, unless its average log-probability is above logprob_threshold. This reduces hallucinated text during silent audio segments.
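The skip decision combines two signals, which a small function makes explicit. This is a sketch of the logic as described in the transcribe() documentation, with a hypothetical function name and invented input values.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Decide whether a segment should be treated as silence."""
    if no_speech_threshold is None:
        return False  # silence detection disabled
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False  # confident text overrides the no-speech signal
    return skip

print(should_skip_segment(0.9, -1.5))  # likely silence, low-confidence text
print(should_skip_segment(0.9, -0.2))  # high no-speech score, confident text
print(should_skip_segment(0.1, -1.5))  # speech detected, low-confidence text
```

Only the first case is skipped: a high no-speech probability alone is not enough when the decoded text is itself confident.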
Token Suppression
The suppress_tokens parameter accepts a list of token IDs (or a comma-separated string of IDs) whose logits are set to negative infinity at every decoding step, so those tokens cannot appear in the output. The default value "-1" suppresses the set of non-speech symbols defined by the tokenizer's non_speech_tokens. Common use cases include suppressing specific punctuation or formatting tokens.
The tokenizer in whisper/tokenizer.py can convert text to token IDs with its encode method. Including -1 in the list keeps the default non-speech set alongside your own IDs:
# Suppress specific tokens
tokenizer = whisper.tokenizer.get_tokenizer(multilingual=model.is_multilingual)
blocked = tokenizer.encode(" um")  # example string; may map to several token IDs
options = whisper.DecodingOptions(
    temperature=0.7,
    suppress_tokens=[-1, *blocked],  # -1 keeps the default non-speech set
)
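The masking step itself is simple enough to show on a toy logit vector. This sketch is not Whisper's code: the helper names and values are illustrative, and a real decoder applies the mask to a tensor over the whole vocabulary.

```python
import math

def suppress(logits, suppress_ids):
    """Return a copy of the logits with the suppressed token IDs set to -inf."""
    masked = list(logits)
    for token_id in suppress_ids:
        masked[token_id] = -math.inf
    return masked

def greedy_pick(logits):
    """Index of the highest logit, as greedy decoding would choose."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.2, 3.5, 1.1, 2.9]  # made-up scores for a four-token vocabulary
masked = suppress(logits, [1])

print(greedy_pick(logits), greedy_pick(masked))
```

Because exp(-inf) is 0, a suppressed token also receives zero probability under sampling, not only under greedy decoding.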
Practical Implementation Examples
When working with the high-level API in whisper/transcribe.py, you can pass DecodingOptions fields as keyword arguments to model.transcribe(), alongside its own threshold arguments:
import whisper
model = whisper.load_model("base")
# High-quality sampling configuration
# (patience is omitted: transcribe() drops it when temperature > 0)
result = model.transcribe(
    "audio.wav",
    temperature=0.8,
    best_of=5,
    length_penalty=0.2,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
)
print(result["text"])
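For direct decoder access through model.decode(), which is implemented in whisper/decoding.py, instantiate DecodingOptions explicitly:
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)  # decode() expects a 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Conservative decoding for accurate transcription
options = whisper.DecodingOptions(
    temperature=0.0,  # deterministic
    beam_size=5,  # patience is only valid together with beam_size
    patience=1.0,
    suppress_tokens=[],  # an empty list turns off the default non-speech suppression
)
result = model.decode(mel, options)
print(result.text)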
Summary
- temperature controls randomness: lower values produce deterministic output, higher values increase diversity.
- best_of draws multiple samples when temperature is above 0 and returns the highest-scoring candidate.
- patience controls how long beam search runs and requires beam_size; length_penalty controls how candidates of different lengths are ranked.
- Fallback settings in transcribe() (the temperature tuple, compression_ratio_threshold, logprob_threshold) automatically retry low-quality generations at higher temperatures.
- no_speech_threshold skips silent audio segments to reduce hallucinations.
- suppress_tokens blocks specific token IDs from appearing in the output.
Frequently Asked Questions
What is the difference between temperature and best_of in Whisper sampling?
Temperature scales the logits before softmax to control randomness within a single generation pass, while best_of runs multiple independent sampling passes and selects the candidate with the highest average log-probability. You can combine them by setting temperature=0.8 and best_of=5 to generate diverse candidates and keep the best one.
How does the compression_ratio_threshold prevent repetitive output?
The compression_ratio_threshold is compared against the gzip compression ratio of the generated text. Repetitive sequences compress extremely well, so if the ratio exceeds the threshold (default 2.4), Whisper treats the result as "stuck" and retries at the next, higher temperature. The retry loop lives in whisper/transcribe.py and helps the decoder escape repetitive loops.
When should I use patience versus temperature for controlling generation?
Use temperature when you want to adjust the fundamental randomness of token selection: lower for accurate transcription, higher for more varied output. Use patience (values above 1.0) when running beam search, that is, with beam_size set and temperature 0, to let the search collect more finished hypotheses before stopping. This can improve quality at the cost of speed without adding token-level randomness.
What happens when no_speech_threshold is exceeded during decoding?
When the probability of the <|nospeech|> token exceeds no_speech_threshold and the segment's average log-probability is not above logprob_threshold, transcribe() skips the segment so that it contributes no text. As implemented in whisper/transcribe.py, this reduces the chance that the model hallucinates text during silent audio segments.