Whisper model.transcribe() Advanced Parameters: Temperature, Thresholds, and Decoding Options

The model.transcribe() function in OpenAI Whisper exposes advanced parameters—including temperature scheduling, compression-ratio thresholds, and DecodingOptions—that control sampling strategies, quality validation, and fallback loops to optimize transcription accuracy.

The model.transcribe() method in the openai/whisper repository serves as the primary Python interface for speech-to-text inference. While basic usage requires only an audio path or tensor, the function signature in whisper/transcribe.py accepts over a dozen advanced parameters that govern decoding behavior, prompt conditioning, and timestamp granularity. Mastering these parameters allows developers to suppress hallucinations, handle noisy audio, and extract word-level alignments.

Temperature Schedules and Quality Thresholds

The transcription pipeline implements an automatic fallback mechanism that sequentially retries decoding with different temperatures until quality checks pass.

Configuring Temperature Sampling

The temperature parameter accepts either a single float or a tuple of floats. When a tuple is provided (the default is (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)), Whisper attempts each value in order until the output passes the validation thresholds; if every value fails, the result from the last temperature is used. In the fallback logic in whisper/transcribe.py, beam search is disabled when temperature > 0 (sampling mode), and best-of sampling is disabled when temperature == 0 (greedy or beam-search mode).

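import whisper

model = whisper.load_model("base")  # any available model size works
audio_path = "audio.mp3"            # placeholder path

result = model.transcribe(
    audio_path,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule, tried in order
    compression_ratio_threshold=2.5,  # slightly more lenient than the 2.4 default
    logprob_threshold=-1.2,           # slightly more lenient than the -1.0 default
)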
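
The per-temperature option handling can be sketched in plain Python. This is a simplified illustration rather than Whisper's own code: the helper name options_for_temperature is hypothetical, and treating patience as a beam-search-only option is an assumption made for the sketch.

```python
def options_for_temperature(decode_options: dict, temperature: float) -> dict:
    """Return a copy of decode_options with incompatible keys removed.

    Hypothetical helper that mimics the behaviour described in the text;
    it is not a function exported by Whisper.
    """
    options = dict(decode_options)
    if temperature > 0:
        # Sampling mode: beam search settings do not apply.
        options.pop("beam_size", None)
        options.pop("patience", None)  # assumption: dropped with beam_size
    else:
        # Greedy / beam-search mode: best_of only applies when sampling.
        options.pop("best_of", None)
    options["temperature"] = temperature
    return options


user_options = {"beam_size": 5, "patience": 1.0, "best_of": 5}
for t in (0.0, 0.2):
    print(t, options_for_temperature(user_options, t))
```

The practical consequence is that beam_size and best_of can both be supplied alongside a temperature tuple: each one only takes effect at the temperatures where it is meaningful.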

Compression Ratio and Log-Probability Thresholds

Two primary thresholds filter low-quality generations:

  • compression_ratio_threshold (default 2.4): Maximum allowed compression ratio of the decoded text (UTF-8 byte length divided by compressed length). Higher ratios indicate repetitive or "stuck" output and trigger a fallback to the next temperature.
  • logprob_threshold (default -1.0): Minimum average log-probability per token. If the average falls below this value, the decode is rejected and the next temperature is tried.
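
To make the compression-ratio check concrete, the sketch below computes the ratio with zlib from the standard library. The function is a stand-in written for this article (assumed to be equivalent in spirit to Whisper's internal helper), and the sample strings are made up.

```python
import zlib


def compression_ratio(text: str) -> float:
    """UTF-8 byte length divided by the zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# Made-up samples: a looping output versus an ordinary sentence.
stuck = "thank you " * 50
normal = "The quick brown fox jumps over the lazy dog."

for label, text in (("stuck", stuck), ("normal", normal)):
    ratio = compression_ratio(text)
    print(f"{label}: ratio={ratio:.2f} rejected={ratio > 2.4}")
```

Highly repetitive text compresses very well, so its ratio rises far above the 2.4 default, while ordinary prose stays well below it.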

No-Speech Detection

The no_speech_threshold (default 0.6) sets the probability of the special <|nospeech|> token above which a window may be treated as silence. The check does not act alone: it applies only when the average log-probability is also below logprob_threshold. In that case the fallback is cancelled, which prevents the loop from retrying actual silence.
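
The interaction between the three thresholds can be written as a single decision function. This is a minimal sketch under the behaviour described above; needs_fallback is a hypothetical name, and its inputs stand for the per-window statistics that the decoder reports.

```python
from typing import Optional


def needs_fallback(
    compression_ratio: float,
    avg_logprob: float,
    no_speech_prob: float,
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
) -> bool:
    """Decide whether to retry the window at the next temperature."""
    retry = False
    if (compression_ratio_threshold is not None
            and compression_ratio > compression_ratio_threshold):
        retry = True  # output is too repetitive
    if logprob_threshold is not None and avg_logprob < logprob_threshold:
        retry = True  # average log-probability is too low
    if (no_speech_threshold is not None
            and no_speech_prob > no_speech_threshold
            and logprob_threshold is not None
            and avg_logprob < logprob_threshold):
        retry = False  # probably silence: retrying would not help
    return retry


print(needs_fallback(3.0, -0.5, 0.1))  # repetitive output
print(needs_fallback(1.5, -1.5, 0.9))  # low confidence, but likely silence
```

Setting any threshold to None disables the corresponding check in this sketch.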

Prompt Conditioning and Context Management

Whisper maintains context across audio windows through prompt conditioning, which can be tuned or disabled depending on the use case.

Initial Prompts and Carry Behavior

The initial_prompt parameter injects domain-specific text at the start of the prompt, biasing the model toward specialized vocabulary (e.g., medical or legal terminology). When carry_initial_prompt=True, this text is prepended to the prompt of every internal decode() call rather than only the first window. This logic resides in whisper/transcribe.py.

result = model.transcribe(
    audio_path,
    initial_prompt="Medical terminology: ECG, arrhythmia, cardiology.",
    carry_initial_prompt=True,  # re-apply the prompt to every window
    temperature=0.0,            # beam search only runs at temperature 0
    beam_size=5,
)

Conditioning on Previous Text

By default, condition_on_previous_text=True feeds the transcription of the previous window back as a prompt for the next window. Disabling this parameter can prevent the model from getting "stuck" in repetitive loops, at the cost of less consistent text across windows.
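
How the two prompt settings interact is easier to see in a toy model. The sketch below uses word lists in place of token IDs and a tiny prompt budget; window_prompts and max_prompt_tokens are hypothetical names, and the truncation rule (keep the most recent tokens, reserving room for a carried initial prompt) is an assumption about the behaviour rather than a copy of Whisper's code.

```python
def window_prompts(initial_prompt, window_texts, *,
                   condition_on_previous_text=True,
                   carry_initial_prompt=False,
                   max_prompt_tokens=4):
    """Return the prompt that each window would be decoded with (toy model)."""
    history = list(initial_prompt)  # everything seen so far, prompt included
    reset_since = 0                 # history before this index is ignored
    prompts = []
    for text in window_texts:
        if carry_initial_prompt:
            ignored = max(len(initial_prompt), reset_since)
            room = max_prompt_tokens - len(initial_prompt)
            tail = history[ignored:][-room:] if room > 0 else []
            prompts.append(list(initial_prompt) + tail)
        else:
            prompts.append(history[reset_since:][-max_prompt_tokens:])
        history.extend(text)
        if not condition_on_previous_text:
            reset_since = len(history)  # forget this window's text
    return prompts


prompt = ["ECG", "arrhythmia"]
windows = [["the", "patient"], ["has", "an"], ["irregular", "rhythm"]]
for carry in (False, True):
    print(carry, window_prompts(prompt, windows, carry_initial_prompt=carry))
```

In this model the initial prompt scrolls out of the budget after a few windows unless it is carried, and disabling conditioning leaves later windows with an empty prompt.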

Timestamp and Segmentation Controls

For applications requiring precise alignment or processing of specific audio clips, several parameters control segmentation boundaries.

Word-Level Timestamps

Setting word_timestamps=True enables extraction of word-level timestamps using cross-attention weights and dynamic time-warping, with boundaries refined in whisper/timing.py. When enabled, the prepend_punctuations and append_punctuations parameters (defaulting to "\"'“¿([{-" and "\"'.。,,!!??::”)]}、" respectively) determine which punctuation characters merge with adjacent words.

result = model.transcribe(
    audio_path,
    word_timestamps=True,
    prepend_punctuations="\"'“([{-",
    append_punctuations="\"'.!?,;:)】}",
    hallucination_silence_threshold=0.5,
)

The hallucination_silence_threshold (default None) is specified in seconds and only takes effect when word_timestamps=True. It activates a filter that skips silent periods longer than the specified value when a potential hallucination is detected during word-level processing.
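
Once word timestamps are enabled, each segment in the result carries a list of word entries. The sketch below flattens them into tuples; it assumes the result layout of result["segments"][i]["words"] with "word", "start", and "end" keys, and the sample dictionary is made-up data standing in for a real transcription.

```python
def iter_words(result: dict):
    """Yield (word, start, end) for every word in a transcribe() result."""
    for segment in result.get("segments", []):
        # "words" is only present when word_timestamps=True was used.
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"]


# Made-up result fragment in the assumed shape.
sample_result = {
    "text": " Hello world.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5},
                {"word": " world.", "start": 0.6, "end": 1.2},
            ],
        }
    ],
}

for text, start, end in iter_words(sample_result):
    print(f"{start:5.2f} -> {end:5.2f}  {text}")
```

Punctuation appears attached to its neighbouring word (as in "world."), which is the effect of the prepend_punctuations and append_punctuations settings.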

Clip Timestamps

The clip_timestamps parameter accepts a comma-separated string of start,end,start,end,... times in seconds (e.g., "30,45"), or a list of floats, to restrict transcription to specific audio segments. If the final end time is omitted, it defaults to the end of the file. The parsing logic in whisper/transcribe.py converts these values into frame indices before inference.
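
The conversion from clip times to frame ranges can be sketched as follows. parse_clips is a hypothetical helper, the rate of 100 frames per second is an assumption (16 kHz audio with a hop of 160 samples), and the handling of an odd number of values (the last clip runs to the end of the audio) is also assumed.

```python
FRAMES_PER_SECOND = 100  # assumption: 16 kHz audio, hop length of 160 samples


def parse_clips(clip_timestamps, content_frames: int):
    """Turn clip times in seconds into (start_frame, end_frame) pairs."""
    if isinstance(clip_timestamps, str):
        parts = clip_timestamps.split(",") if clip_timestamps else []
        values = [float(ts) for ts in parts]
    else:
        values = [float(ts) for ts in clip_timestamps]
    seek_points = [round(ts * FRAMES_PER_SECOND) for ts in values]
    if not seek_points:
        seek_points.append(0)               # no clips given: start at 0
    if len(seek_points) % 2 == 1:
        seek_points.append(content_frames)  # open-ended final clip
    return list(zip(seek_points[::2], seek_points[1::2]))


one_minute = 60 * FRAMES_PER_SECOND
print(parse_clips("30,45", one_minute))
print(parse_clips([0, 5, 20], one_minute))
```

Passing several pairs, such as "0,5,20,25", transcribes multiple disjoint clips in a single call.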

Decoding Options

All additional keyword arguments passed to model.transcribe() are forwarded as **decode_options to the DecodingOptions dataclass defined in whisper/decoding.py. This provides low-level control over the inference strategy:

  • beam_size: Number of beams for beam search (active when temperature=0).
  • best_of: Number of candidates to sample when using non-zero temperature.
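  • patience: Beam search patience factor; with a value above 1.0, the search keeps running until more finished candidates have been collected before it stops.
  • length_penalty: Length-normalization coefficient (alpha, as in Google NMT) used when ranking candidates; None falls back to simple length normalization.
  • suppress_tokens: Comma-separated list of token IDs to suppress (e.g., "50259,50260"); the default "-1" suppresses a predefined set of non-speech symbols.

result = model.transcribe(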
    audio_path,
    temperature=0.0,
    beam_size=8,
    patience=1.5,
    length_penalty=0.6,
    suppress_tokens="-1,50258",
)
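
The effect of length_penalty on candidate ranking can be illustrated with the Google NMT formulation, which Whisper's ranking is assumed to follow here. hypothesis_score is a hypothetical helper, and the log-probabilities below are made-up numbers.

```python
from typing import Optional


def hypothesis_score(sum_logprob: float, length: int,
                     length_penalty: Optional[float] = None) -> float:
    """Length-normalised score of one candidate (higher is better)."""
    if length_penalty is None:
        penalty = float(length)  # simple length normalisation
    else:
        # Google NMT penalty: ((5 + length) / 6) ** alpha
        penalty = ((5 + length) / 6) ** length_penalty
    return sum_logprob / penalty


# Made-up candidates: (summed log-probability, token count).
candidates = {"short": (-4.0, 4), "long": (-7.0, 10)}
for alpha in (None, 0.6):
    ranked = sorted(
        candidates,
        key=lambda name: hypothesis_score(*candidates[name], alpha),
        reverse=True,
    )
    print(alpha, ranked)
```

Smaller values of alpha normalise less aggressively, so longer candidates receive less of a boost relative to plain length normalisation.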

Summary

  • Temperature scheduling: Pass a tuple like (0.0, 0.2, 0.4) to automatically retry with higher sampling temperatures if quality checks fail.
  • Quality thresholds: Pass compression_ratio_threshold, logprob_threshold, and no_speech_threshold to model.transcribe() to filter repetitive or low-confidence outputs.
  • Prompt control: Use initial_prompt with carry_initial_prompt=True to bias every window toward domain-specific vocabulary.
  • Temporal precision: Enable word_timestamps=True for sub-segment alignment and use clip_timestamps to process specific audio intervals.
  • Decoding strategies: Pass beam search parameters (beam_size, patience) or sampling parameters (best_of) via **decode_options to the underlying DecodingOptions class.

Frequently Asked Questions

What is the difference between temperature and beam_size in Whisper?

temperature controls the randomness of token sampling: 0.0 is deterministic, while 1.0 samples from the model's unmodified output distribution. beam_size activates beam search, which is only used when temperature=0; the fallback logic in whisper/transcribe.py disables beam search when temperature is greater than zero so that the two strategies are never combined.

How does the fallback mechanism work when multiple temperatures are provided?

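When temperature is a tuple, model.transcribe() iterates through each value in order. For each temperature, it decodes the current window and checks the result against compression_ratio_threshold and logprob_threshold. If either check fails, the loop proceeds to the next temperature, unless the window looks like silence (no-speech probability above no_speech_threshold together with a low average log-probability), in which case no retry happens. The first decode that passes is used; if none passes, the result from the last temperature is kept.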

Why would I disable condition_on_previous_text?

Setting condition_on_previous_text=False prevents the model from using prior transcription windows as context for the current window. This is useful when you want to avoid context contamination or "stuck" loops where the model repeats phrases across windows, though it may reduce coherence at segment boundaries.

Can I use word timestamps and beam search simultaneously?

Yes. word_timestamps=True runs as post-processing in whisper/timing.py regardless of the decoding strategy. Beam search (beam_size > 1) still requires temperature=0; if you specify a non-zero temperature, the implementation switches to sampling mode and disables beam search.
