OpenAI Whisper model.transcribe() Result Dictionary Structure Explained
The model.transcribe() method returns a Python dictionary containing three top-level keys: text (the full transcription string), segments (a list of per-chunk dictionaries with timestamps and metadata), and language (the detected ISO-639-1 language code).
When building applications with the openai/whisper repository for automatic speech recognition, understanding the exact structure of the result dictionary returned by model.transcribe() is essential for extracting timestamps, handling word-level alignment, and processing diagnostic scores. This guide breaks down every field in the output based on the actual implementation in whisper/transcribe.py.
Top-Level Keys in the Result Dictionary
The dictionary produced by transcribe() contains exactly three top-level entries that organize the transcription output.
text
The text field contains a single Python str representing the complete transcription (or translation) obtained by decoding all output tokens. According to the source code in whisper/transcribe.py, this string explicitly excludes any tokens from the initial_prompt that may have been provided to guide the transcription.
segments
The segments field is a list[dict] where each dictionary represents a discrete audio chunk processed by the model. This list provides granular access to timing, token-level data, and quality metrics for every segment of the audio file.
language
The language field contains a str representing the ISO-639-1 language code (e.g., "en", "es", "fr") that was either detected automatically or explicitly supplied via the language parameter.
Deep Dive: The segments List
Each dictionary in the segments list contains multiple fields that provide both the transcribed content and diagnostic information about the decoding process.
Core Timing and Content Fields
Every segment dictionary includes the following essential fields:
- id: An integer index of the segment within the segments list.
- seek: An integer frame offset marking where the decoding window that produced this segment began.
- start: A float indicating the start timestamp in seconds.
- end: A float indicating the end timestamp in seconds.
- text: The decoded text string for this specific audio chunk.
- tokens: A list of integer token IDs produced by the decoder for this segment.
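The timing and content fields are enough to build subtitle cues. The following is a minimal sketch, assuming segments shaped as described above; format_srt_timestamp and segments_to_srt are hypothetical helper names, and the sample data is hand-written rather than real model output. The repository has its own output writers in whisper/utils.py, so treat this as an independent illustration of the data shape.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form used by SRT subtitles."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segment dictionaries as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        # Segment text usually starts with a space, so strip it for display.
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


# Hand-written sample data (assumption), not real model output.
sample_segments = [
    {"seek": 0, "start": 0.0, "end": 2.5, "text": " Hello there.", "tokens": [1, 2, 3]},
    {"seek": 0, "start": 2.5, "end": 5.04, "text": " How are you?", "tokens": [4, 5, 6]},
]

print(segments_to_srt(sample_segments))
```

Working in integer milliseconds avoids rounding a value such as 59.9996 seconds up to an invalid "60" in the seconds field.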
Diagnostic and Quality Metrics
The segment dictionaries also include several diagnostic scores used for fallback handling and quality assessment:
- temperature: The sampling temperature used when decoding this segment (float).
- avg_logprob: The average log probability of the tokens in the segment, used to detect low-confidence transcriptions (float).
- compression_ratio: The ratio of the text's raw UTF-8 byte length to its zlib-compressed length; high values indicate repetitive generation (float).
- no_speech_prob: The probability that the segment contains no speech, used to detect silent audio (float).
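These scores can also be used after the fact to flag segments worth reviewing. The sketch below is one possible approach, not Whisper's own fallback logic: suspect_reasons is a hypothetical helper, the sample segments are hand-written, and the thresholds reuse the default values of the logprob_threshold, no_speech_threshold, and compression_ratio_threshold parameters of transcribe() as starting points. The compression ratio is computed here as raw length divided by zlib-compressed length, so repetitive text scores high.

```python
import zlib

# Starting points borrowed from transcribe()'s default threshold parameters.
LOGPROB_THRESHOLD = -1.0
NO_SPEECH_THRESHOLD = 0.6
COMPRESSION_RATIO_THRESHOLD = 2.4


def text_compression_ratio(text: str) -> float:
    """Raw UTF-8 byte length divided by zlib-compressed length."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def suspect_reasons(segment: dict) -> list[str]:
    """Return labels describing why a segment may be unreliable (hypothetical helper)."""
    reasons = []
    if segment["avg_logprob"] < LOGPROB_THRESHOLD:
        reasons.append("low_confidence")
    if segment["compression_ratio"] > COMPRESSION_RATIO_THRESHOLD:
        reasons.append("repetitive")
    if segment["no_speech_prob"] > NO_SPEECH_THRESHOLD:
        reasons.append("possible_silence")
    return reasons


# Hand-written sample segments (assumption), not real model output.
good = {"avg_logprob": -0.25, "compression_ratio": 1.3, "no_speech_prob": 0.02}
bad = {"avg_logprob": -1.4, "compression_ratio": 3.1, "no_speech_prob": 0.8}

print(suspect_reasons(good))
print(suspect_reasons(bad))
print(text_compression_ratio("la " * 100) > COMPRESSION_RATIO_THRESHOLD)
```

Returning the reasons rather than a single boolean makes it easier to decide per application whether to drop, re-transcribe, or simply mark a segment.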
Optional Word-Level Timestamps
When the word_timestamps=True parameter is passed to transcribe(), an additional words field is added to each segment dictionary. This field contains a list of word objects generated by the add_word_timestamps function in whisper/timing.py.
Each word object includes:
- word: The text of the word (string).
- start: Start timestamp in seconds (float).
- end: End timestamp in seconds (float).
- probability: The probability score for the word token (float).
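For projects that use a type checker, the shape described so far can be written down as TypedDict classes. This is a sketch under stated assumptions: WordTiming, Segment, TranscriptionResult, and iter_words are hypothetical names chosen for illustration, not types exported by the whisper package, and the sample result is hand-written.

```python
from typing import Iterator, TypedDict


class WordTiming(TypedDict):
    word: str
    start: float
    end: float
    probability: float


class SegmentBase(TypedDict):
    id: int  # sequential segment index assigned by transcribe()
    seek: int
    start: float
    end: float
    text: str
    tokens: list[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float


class Segment(SegmentBase, total=False):
    words: list[WordTiming]  # only present with word_timestamps=True


class TranscriptionResult(TypedDict):
    text: str
    segments: list[Segment]
    language: str


def iter_words(result: TranscriptionResult) -> Iterator[WordTiming]:
    """Yield every word dict in order, skipping segments without a words list."""
    for segment in result["segments"]:
        yield from segment.get("words", [])


# Hand-written sample result (assumption), not real model output.
sample_result: TranscriptionResult = {
    "text": " Hello there.",
    "language": "en",
    "segments": [
        {
            "id": 0, "seek": 0, "start": 0.0, "end": 1.1,
            "text": " Hello there.", "tokens": [1, 2, 3],
            "temperature": 0.0, "avg_logprob": -0.2,
            "compression_ratio": 0.6, "no_speech_prob": 0.01,
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5, "probability": 0.98},
                {"word": " there.", "start": 0.5, "end": 1.1, "probability": 0.95},
            ],
        }
    ],
}

print([w["word"] for w in iter_words(sample_result)])
```

Declaring words on a total=False subclass keeps the key optional while every other segment field stays required.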
Working with the Transcription Output
Here are practical examples for accessing the data structure returned by model.transcribe():
from whisper import load_model

# Load model and transcribe with word timestamps
model = load_model("base")
result = model.transcribe("audio.wav", word_timestamps=True)

# Access the full transcription text
full_text = result["text"]
print(f"Transcription: {full_text}")

# Iterate through segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s → {end:.2f}s] {text}")

# Access per-word timestamps if enabled
if result["segments"] and "words" in result["segments"][0]:
    for segment in result["segments"]:
        for word_info in segment["words"]:
            word = word_info["word"]
            w_start = word_info["start"]
            w_end = word_info["end"]
            prob = word_info["probability"]
            print(f"{word} [{w_start:.2f}s–{w_end:.2f}s] (p={prob:.2f})")
Source Code Implementation Details
The structure of the result dictionary is defined across several files in the openai/whisper repository:
- whisper/transcribe.py: Contains the main transcribe() function that constructs the return dictionary with text, segments, and language keys.
- whisper/timing.py: Implements add_word_timestamps(), which adds the optional words field to segments when word-level timestamps are requested.
- whisper/decoding.py: Defines DecodingResult and DecodingOptions classes that provide the diagnostic scores (temperature, avg_logprob, compression_ratio, no_speech_prob) included in each segment.
- whisper/model.py: Defines the Whisper class whose transcribe method forwards to the helper function in transcribe.py.
Summary
- The model.transcribe() method returns a dictionary with three keys: text (full transcription), segments (list of per-chunk metadata), and language (ISO-639-1 code).
- Each segment contains timing fields (seek, start, end), content fields (text, tokens), and diagnostic scores (temperature, avg_logprob, compression_ratio, no_speech_prob).
- When word_timestamps=True is passed, each segment includes a words list with per-word timestamps and probabilities.
- The structure is implemented in whisper/transcribe.py with optional word timestamps added by whisper/timing.py.
Frequently Asked Questions
What data type does model.transcribe() return?
The model.transcribe() method returns a standard Python dict (dictionary), not a custom class or JSON string. This dictionary contains three top-level string keys: "text", "segments", and "language", making it easy to serialize with json.dumps() if needed.
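As an illustration of the serialization point, here is a minimal round trip through the standard json module. The sample_result dictionary is hand-written to mimic the documented shape and is an assumption, not real model output.

```python
import json

# Hand-written sample result (assumption), shaped like the documented output.
sample_result = {
    "text": " Bonjour à tous.",
    "language": "fr",
    "segments": [
        {
            "seek": 0,
            "start": 0.0,
            "end": 1.8,
            "text": " Bonjour à tous.",
            "tokens": [10, 11, 12],
            "temperature": 0.0,
            "avg_logprob": -0.3,
            "compression_ratio": 0.7,
            "no_speech_prob": 0.02,
        }
    ],
}

# ensure_ascii=False keeps accented characters readable in the output.
encoded = json.dumps(sample_result, ensure_ascii=False, indent=2)
restored = json.loads(encoded)

print(encoded)
print(restored == sample_result)
```

Because the structure is only strings, numbers, lists, and dictionaries, the decoded copy compares equal to the original.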
How do I extract per-word timestamps from the transcription result?
To access word-level timing, pass word_timestamps=True when calling transcribe(). Each dictionary in the segments list will then contain an additional "words" key mapping to a list of word objects. Each word object includes "word" (the text), "start" and "end" timestamps in seconds, and "probability" (the model's confidence score).
What is the difference between the seek and start fields in a segment?
The seek field is an integer representing the frame index in the audio tensor where processing for that segment began, used internally for windowing. The start field is a float representing the actual timestamp in seconds when the spoken content begins, which is the value you should display to users or use for subtitle timing.
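To make the difference concrete, seek can be converted to seconds and compared with start. This sketch assumes that seek counts spectrogram frames at the rate implied by the SAMPLE_RATE and HOP_LENGTH constants in whisper/audio.py (16,000 samples per second and 160 samples per frame, so 100 frames per second). seek_to_seconds is a hypothetical helper and the sample segments are hand-written.

```python
# Values mirroring the constants in whisper/audio.py (assumption stated above).
SAMPLE_RATE = 16_000  # audio samples per second
HOP_LENGTH = 160      # samples between successive frames


def seek_to_seconds(seek: int) -> float:
    """Convert a frame offset into seconds from the start of the audio."""
    return seek * HOP_LENGTH / SAMPLE_RATE


# Hand-written sample segments (assumption), not real model output.
sample_segments = [
    {"seek": 0, "start": 1.2, "end": 4.8, "text": " First sentence."},
    {"seek": 0, "start": 4.8, "end": 9.5, "text": " Second sentence."},
    {"seek": 3000, "start": 30.4, "end": 33.0, "text": " Third sentence."},
]

for seg in sample_segments:
    window_start = seek_to_seconds(seg["seek"])
    print(f"window at {window_start:.2f}s, speech at {seg['start']:.2f}s:{seg['text']}")
```

Several segments can share one seek value because they were decoded from the same window, while each keeps its own start time.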
Where in the source code is the result dictionary constructed?
The result dictionary is assembled in the transcribe() function within whisper/transcribe.py. The function builds it in its final return statement, where the text, segments, and language keys are populated. When word timestamps are enabled, the optional words field is attached to each segment by the add_word_timestamps() function in whisper/timing.py during decoding, before the dictionary is returned.