OpenAI Whisper model.transcribe() Result Dictionary Structure Explained

The model.transcribe() method returns a Python dictionary containing three top-level keys: text (the full transcription string), segments (a list of per-chunk dictionaries with timestamps and metadata), and language (the detected ISO-639-1 language code).

When building applications with the openai/whisper repository for automatic speech recognition, understanding the exact structure of the result dictionary returned by model.transcribe() is essential for extracting timestamps, handling word-level alignment, and processing diagnostic scores. This guide breaks down every field in the output based on the actual implementation in whisper/transcribe.py.

Top-Level Keys in the Result Dictionary

The dictionary produced by transcribe() contains exactly three top-level entries that organize the transcription output.

text

The text field contains a single Python str representing the complete transcription (or translation) obtained by decoding all output tokens. In whisper/transcribe.py, the tokens of any initial_prompt supplied to guide the transcription are sliced off before decoding, so the prompt never appears in this string.

segments

The segments field is a list[dict] where each dictionary describes one timestamped segment of the audio. The model decodes the audio in windows of up to 30 seconds, and a single window can yield several segments. This list provides granular access to timing, token-level data, and quality metrics for every segment of the audio file.

language

The language field contains a str holding the language code (typically a two-letter ISO-639-1 code such as "en", "es", or "fr") that was either detected automatically or explicitly supplied via the language parameter.
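As a minimal sketch of how the three top-level keys fit together, the snippet below works on a hand-written dictionary shaped like a transcribe() result. The values are made up for illustration (they are not real model output), the segments are trimmed to the fields used here, and summarize_result is a hypothetical helper, not part of the whisper package.

```python
# Hand-written stand-in for a transcribe() result; values are illustrative only.
sample_result = {
    "text": " Hello world. This is a test.",
    "segments": [
        {"seek": 0, "start": 0.0, "end": 1.5, "text": " Hello world."},
        {"seek": 0, "start": 1.5, "end": 3.2, "text": " This is a test."},
    ],
    "language": "en",
}


def summarize_result(result: dict) -> dict:
    """Hypothetical helper: collect a few facts from the top-level keys."""
    segments = result["segments"]
    return {
        "language": result["language"],
        "text": result["text"].strip(),
        "segment_count": len(segments),
        # End time of the last segment, or 0.0 when nothing was transcribed.
        "last_end": segments[-1]["end"] if segments else 0.0,
    }


summary = summarize_result(sample_result)
print(summary)
```

Guarding against an empty segments list is a deliberate choice in this sketch, since silent or very short audio may produce no segments at all.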

Deep Dive: The segments List

Each dictionary in the segments list contains multiple fields that provide both the transcribed content and diagnostic information about the decoding process.

Core Timing and Content Fields

Every segment dictionary includes the following essential fields:

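  • id: An integer giving the zero-based position of the segment in the segments list.
  • seek: An integer giving the position, in mel spectrogram frames, of the decoding window in which the segment was produced.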
  • start: A float indicating the start timestamp in seconds.
  • end: A float indicating the end timestamp in seconds.
  • text: The decoded text string for this specific audio chunk.
  • tokens: A list of integer token IDs produced by the decoder for this segment.
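The timing and text fields listed above are enough to build subtitles. The following is a rough sketch, assuming segments shaped like the ones described here; format_srt_timestamp and segments_to_srt are hypothetical helper names introduced for this example, and the sample segments are hand-written.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segment dictionaries as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        # Segment text usually carries a leading space, so strip it.
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Hand-written sample segments; values are illustrative only.
sample_segments = [
    {"start": 0.0, "end": 1.5, "text": " Hello world."},
    {"start": 61.25, "end": 63.0, "text": " This is a test."},
]
srt_text = segments_to_srt(sample_segments)
print(srt_text)
```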

Diagnostic and Quality Metrics

The segment dictionaries also include several diagnostic scores used for fallback handling and quality assessment:

  • temperature: The sampling temperature used when decoding this segment (float).
  • avg_logprob: The average log probability of tokens in the segment, used to detect low-confidence transcriptions (float).
  • compression_ratio: The length of the segment text in UTF-8 bytes divided by its zlib-compressed length; high values indicate repetitive generation (float).
  • no_speech_prob: The probability assigned to the no-speech token for the decoding window, used together with no_speech_threshold to skip silent audio (float).
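To make these scores concrete, here is a small sketch that flags questionable segments after transcription. The threshold values match the defaults of the logprob_threshold, compression_ratio_threshold, and no_speech_threshold parameters of transcribe(), but the flagging rule itself is an illustration and not Whisper's internal fallback logic. flag_segment and text_compression_ratio are hypothetical names, and the sample scores are made up.

```python
import zlib


def text_compression_ratio(text: str) -> float:
    """Raw UTF-8 byte length divided by zlib-compressed byte length."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def flag_segment(
    segment: dict,
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
    no_speech_threshold: float = 0.6,
) -> list:
    """Return the reasons a segment looks unreliable (empty list if none)."""
    reasons = []
    if segment["avg_logprob"] < logprob_threshold:
        reasons.append("low_confidence")
    if segment["compression_ratio"] > compression_ratio_threshold:
        reasons.append("repetitive")
    if segment["no_speech_prob"] > no_speech_threshold:
        reasons.append("possible_silence")
    return reasons


# Hand-written sample segments; scores are illustrative only.
sample_segments = [
    {"text": " Hello world.", "avg_logprob": -0.25,
     "compression_ratio": 1.1, "no_speech_prob": 0.02},
    {"text": " la la la la la la", "avg_logprob": -1.4,
     "compression_ratio": 3.1, "no_speech_prob": 0.7},
]
flags = [flag_segment(segment) for segment in sample_segments]
print(flags)

# Highly repetitive text compresses far better than ordinary prose.
repetitive_ratio = text_compression_ratio("la " * 200)
varied_ratio = text_compression_ratio("The quick brown fox jumps over the lazy dog.")
print(repetitive_ratio > varied_ratio)
```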

Optional Word-Level Timestamps

When word_timestamps=True is passed to transcribe(), an additional words field is added to each segment dictionary. This field contains a list of word dictionaries generated by the add_word_timestamps function in whisper/timing.py.

Each word object includes:

  • word: The text of the word (string).
  • start: Start timestamp in seconds (float).
  • end: End timestamp in seconds (float).
  • probability: The average probability of the tokens that make up the word (float).
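One way to use these fields is to flatten all words into a single list and pick out the ones the model was unsure about. This is a sketch under stated assumptions: flatten_words and low_confidence_words are hypothetical helpers, the 0.5 cutoff is an arbitrary value chosen for the example, and the sample data is hand-written. Using segment.get("words", []) keeps the code safe for segments that carry no words key.

```python
def flatten_words(segments: list) -> list:
    """Collect every word dictionary across all segments, in order."""
    words = []
    for segment in segments:
        # Segments without a "words" key contribute nothing.
        words.extend(segment.get("words", []))
    return words


def low_confidence_words(segments: list, threshold: float = 0.5) -> list:
    """Return the stripped text of words whose probability is below threshold."""
    return [
        word_info["word"].strip()
        for word_info in flatten_words(segments)
        if word_info["probability"] < threshold
    ]


# Hand-written sample segments; values are illustrative only.
sample_segments = [
    {
        "start": 0.0, "end": 1.2, "text": " Hello world.",
        "words": [
            {"word": " Hello", "start": 0.0, "end": 0.5, "probability": 0.98},
            {"word": " world.", "start": 0.5, "end": 1.2, "probability": 0.41},
        ],
    },
    {"start": 1.2, "end": 2.0, "text": " Okay."},  # no "words" key
]
all_words = flatten_words(sample_segments)
uncertain = low_confidence_words(sample_segments)
print(len(all_words), uncertain)
```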

Working with the Transcription Output

Here are practical examples for accessing the data structure returned by model.transcribe():

from whisper import load_model

# Load model and transcribe with word timestamps
model = load_model("base")
result = model.transcribe("audio.wav", word_timestamps=True)

# Access the full transcription text
full_text = result["text"]
print(f"Transcription: {full_text}")

# Iterate through segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s → {end:.2f}s] {text}")

# Access per-word timestamps if enabled
if result["segments"] and "words" in result["segments"][0]:
    for segment in result["segments"]:
        for word_info in segment["words"]:
            word = word_info["word"]
            w_start = word_info["start"]
            w_end = word_info["end"]
            prob = word_info["probability"]
            print(f"{word} [{w_start:.2f}s–{w_end:.2f}s] (p={prob:.2f})")

Source Code Implementation Details

The structure of the result dictionary is defined across several files in the openai/whisper repository:

  • whisper/transcribe.py: Contains the main transcribe() function that constructs the return dictionary with text, segments, and language keys.
  • whisper/timing.py: Implements add_word_timestamps(), which adds the optional words field to segments when word-level timestamps are requested.
  • whisper/decoding.py: Defines the DecodingOptions and DecodingResult classes; DecodingResult carries the diagnostic scores (temperature, avg_logprob, compression_ratio, no_speech_prob) that are copied into each segment.
  • whisper/model.py: Defines the Whisper class, whose transcribe method is the transcribe() function from transcribe.py.

Summary

  • The model.transcribe() method returns a dictionary with three keys: text (full transcription), segments (list of per-chunk metadata), and language (ISO-639-1 code).
  • Each segment contains timing fields (seek, start, end), content fields (text, tokens), and diagnostic scores (temperature, avg_logprob, compression_ratio, no_speech_prob).
  • When word_timestamps=True is passed, each segment includes a words list with per-word timestamps and probabilities.
  • The structure is implemented in whisper/transcribe.py with optional word timestamps added by whisper/timing.py.

Frequently Asked Questions

What data type does model.transcribe() return?

The model.transcribe() method returns a standard Python dict (dictionary), not a custom class or JSON string. This dictionary contains three top-level string keys: "text", "segments", and "language", making it easy to serialize with json.dumps() if needed.
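A short sketch of that serialization round trip, using a hand-written dictionary shaped like a transcribe() result (the values are illustrative only). Passing ensure_ascii=False is a choice made here so that non-ASCII transcriptions stay readable in the output.

```python
import json

# Hand-written stand-in for a transcribe() result; values are illustrative only.
sample_result = {
    "text": " Hola, ¿qué tal?",
    "segments": [
        {
            "seek": 0,
            "start": 0.0,
            "end": 1.5,
            "text": " Hola, ¿qué tal?",
            "tokens": [101, 102, 103],
            "temperature": 0.0,
            "avg_logprob": -0.25,
            "compression_ratio": 1.1,
            "no_speech_prob": 0.02,
        }
    ],
    "language": "es",
}

# Serialize to a JSON string, keeping accented characters as-is.
encoded = json.dumps(sample_result, ensure_ascii=False, indent=2)

# Parse it back to confirm nothing was lost along the way.
decoded = json.loads(encoded)
print(decoded["language"], decoded["segments"][0]["end"])
```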

How do I extract per-word timestamps from the transcription result?

To access word-level timing, pass word_timestamps=True when calling transcribe(). Each dictionary in the segments list will then contain an additional "words" key mapping to a list of word objects. Each word object includes "word" (the text), "start" and "end" timestamps in seconds, and "probability" (the model's confidence score).

What is the difference between the seek and start fields in a segment?

The seek field is an integer giving the position, in mel spectrogram frames, of the decoding window in which the segment was produced. It is used internally for windowing, and several segments can share the same seek value. The start field is a float representing the actual timestamp in seconds when the spoken content begins, which is the value you should display to users or use for subtitle timing.
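The relationship between the two fields can be sketched with a little arithmetic. The constants below are redefined locally and are assumed to match the values in whisper/audio.py (16,000 samples per second and a hop of 160 samples, giving 100 frames per second). seek_to_seconds is a hypothetical helper, and the sample segment is hand-written.

```python
SAMPLE_RATE = 16000  # audio samples per second (assumed to match whisper/audio.py)
HOP_LENGTH = 160     # samples between consecutive mel frames (same assumption)
FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH  # 100 frames per second


def seek_to_seconds(seek: int) -> float:
    """Convert a seek value in mel frames to a time offset in seconds."""
    return seek / FRAMES_PER_SECOND


# Hand-written sample segment; values are illustrative only.
segment = {"seek": 3000, "start": 31.4, "end": 34.0}

# Where the decoding window began, in seconds.
window_start = seek_to_seconds(segment["seek"])

# How far into that window the speech in this segment starts.
offset_in_window = round(segment["start"] - window_start, 2)
print(window_start, offset_in_window)
```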

Where in the source code is the result dictionary constructed?

The result dictionary is assembled in the transcribe() function within whisper/transcribe.py. Segment dictionaries are accumulated while the audio is decoded, and the final dictionary with the text, segments, and language keys is built at the end of the function. When word timestamps are enabled, the add_word_timestamps() function in whisper/timing.py attaches the words field to each segment during decoding, before the result is returned.
