# OpenAI Whisper model.transcribe() Result Dictionary Structure Explained

> **The `model.transcribe()` method returns a Python dictionary containing three top-level keys: `text` (the full transcription string), `segments` (a list of per-chunk dictionaries with timestamps and metadata), and `language` (...

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Tags: 
- Published: 2026-02-27

---

**The `model.transcribe()` method returns a Python dictionary containing three top-level keys: `text` (the full transcription string), `segments` (a list of per-chunk dictionaries with timestamps and metadata), and `language` (the detected ISO-639-1 language code).**

When building applications with the `openai/whisper` repository for automatic speech recognition, understanding the exact structure of the result dictionary returned by `model.transcribe()` is essential for extracting timestamps, handling word-level alignment, and processing diagnostic scores. This guide breaks down every field in the output based on the actual implementation in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py).

## Top-Level Keys in the Result Dictionary

The dictionary produced by `transcribe()` contains exactly three top-level entries that organize the transcription output.

### text

The **`text`** field contains a single Python `str` representing the complete transcription (or translation) obtained by decoding all output tokens. According to the source code in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py), this string explicitly excludes any tokens from the `initial_prompt` that may have been provided to guide the transcription【10†L1110-L1112】.

### segments

The **`segments`** field is a `list[dict]` where each dictionary represents a discrete audio chunk processed by the model. This list provides granular access to timing, token-level data, and quality metrics for every segment of the audio file【10†L525-L561】.

### language

The **`language`** field contains a `str` representing the ISO-639-1 language code (e.g., `"en"`, `"es"`, `"fr"`) that was either detected automatically or explicitly supplied via the `language` parameter【10†L1250-L1254】.

## Deep Dive: The segments List

Each dictionary in the `segments` list contains multiple fields that provide both the transcribed content and diagnostic information about the decoding process.

### Core Timing and Content Fields

Every segment dictionary includes the following essential fields:

- **`seek`**: An integer representing the frame index where the segment starts in the audio tensor.
- **`start`**: A float indicating the start timestamp in seconds.
- **`end`**: A float indicating the end timestamp in seconds.
- **`text`**: The decoded text string for this specific audio chunk.
- **`tokens`**: A list of integer token IDs produced by the decoder for this segment.

### Diagnostic and Quality Metrics

The segment dictionaries also include several diagnostic scores used for fallback handling and quality assessment:

- **`temperature`**: The sampling temperature used when decoding this segment (float).
- **`avg_logprob`**: The average log probability of tokens in the segment, used to detect low-confidence transcriptions (float).
- **`compression_ratio`**: The ratio of gzip compressed length to raw length, used to detect repetitive generation (float).
- **`no_speech_prob`**: The probability that the segment contains no speech, used for voice activity detection (float).

### Optional Word-Level Timestamps

When the `word_timestamps=True` parameter is passed to `transcribe()`, an additional **`words`** field is added to each segment dictionary. This field contains a list of word objects generated by the `add_word_timestamps` function in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py).

Each word object includes:

- **`word`**: The text of the word (string).
- **`start`**: Start timestamp in seconds (float).
- **`end`**: End timestamp in seconds (float).
- **`probability`**: The probability score for the word token (float).

## Working with the Transcription Output

Here are practical examples for accessing the data structure returned by `model.transcribe()`:

```python
from whisper import load_model

# Load model and transcribe with word timestamps

model = load_model("base")
result = model.transcribe("audio.wav", word_timestamps=True)

# Access the full transcription text

full_text = result["text"]
print(f"Transcription: {full_text}")

# Iterate through segments with timestamps

for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.2f}s → {end:.2f}s] {text}")

# Access per-word timestamps if enabled

if result["segments"] and "words" in result["segments"][0]:
    for segment in result["segments"]:
        for word_info in segment["words"]:
            word = word_info["word"]
            w_start = word_info["start"]
            w_end = word_info["end"]
            prob = word_info["probability"]
            print(f"{word} [{w_start:.2f}s–{w_end:.2f}s] (p={prob:.2f})")

```

## Source Code Implementation Details

The structure of the result dictionary is defined across several files in the `openai/whisper` repository:

- **[`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py)**: Contains the main `transcribe()` function that constructs the return dictionary with `text`, `segments`, and `language` keys【10†L1110-L1112】【10†L1250-L1254】.
- **[`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)**: Implements `add_word_timestamps()`, which adds the optional `words` field to segments when word-level timestamps are requested.
- **[`whisper/decoding.py`](https://github.com/openai/whisper/blob/main/whisper/decoding.py)**: Defines `DecodingResult` and `DecodingOptions` classes that provide the diagnostic scores (temperature, avg_logprob, compression_ratio, no_speech_prob) included in each segment.
- **[`whisper/model.py`](https://github.com/openai/whisper/blob/main/whisper/model.py)**: Defines the `Whisper` class whose `transcribe` method forwards to the helper function in [`transcribe.py`](https://github.com/openai/whisper/blob/main/transcribe.py).

## Summary

- The **`model.transcribe()`** method returns a dictionary with three keys: **`text`** (full transcription), **`segments`** (list of per-chunk metadata), and **`language`** (ISO-639-1 code).
- Each **segment** contains timing fields (`seek`, `start`, `end`), content fields (`text`, `tokens`), and diagnostic scores (`temperature`, `avg_logprob`, `compression_ratio`, `no_speech_prob`).
- When **`word_timestamps=True`** is passed, each segment includes a **`words`** list with per-word timestamps and probabilities.
- The structure is implemented in **[`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py)** with optional word timestamps added by **[`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)**.

## Frequently Asked Questions

### What data type does model.transcribe() return?

The `model.transcribe()` method returns a standard Python **`dict`** (dictionary), not a custom class or JSON string. This dictionary contains three top-level string keys: `"text"`, `"segments"`, and `"language"`, making it easy to serialize with `json.dumps()` if needed.

### How do I extract per-word timestamps from the transcription result?

To access word-level timing, pass **`word_timestamps=True`** when calling `transcribe()`. Each dictionary in the `segments` list will then contain an additional `"words"` key mapping to a list of word objects. Each word object includes `"word"` (the text), `"start"` and `"end"` timestamps in seconds, and `"probability"` (the model's confidence score).

### What is the difference between the seek and start fields in a segment?

The **`seek`** field is an integer representing the frame index in the audio tensor where processing for that segment began, used internally for windowing. The **`start`** field is a float representing the actual timestamp in seconds when the spoken content begins, which is the value you should display to users or use for subtitle timing.

### Where in the source code is the result dictionary constructed?

The result dictionary is assembled in the **`transcribe()`** function within **[`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py)**. Specifically, the dictionary is built at the end of the function where the `text`, `segments`, and `language` keys are populated. The optional `words` field is added later by the `add_word_timestamps()` function in **[`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)** when word timestamps are enabled.