# How to Extract Word-Level Timing Information in Whisper: A Complete Guide

> Extract word level timing in Whisper using the word_timestamps flag or API setting. Learn how to precisely align audio with text for accurate timing information.

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Tags: how-to-guide
- Published: 2026-02-27

---

**Enable word-level timestamps in OpenAI Whisper by passing `--word_timestamps True` on the command line or setting `word_timestamps=True` in the Python API. Whisper then uses cross-attention weights and Dynamic Time Warping to align each word with its position in the audio.**

The openai/whisper repository provides a robust speech recognition system that transcribes audio with high accuracy. While standard transcription returns timestamps for entire segments, many applications (subtitle generation, video editing, phonetic analysis) require granular **word-level timing information**. This guide explains how to activate Whisper's experimental word timestamp feature and details the underlying alignment pipeline implemented across the codebase.

## How Word-Level Timestamps Work in Whisper

Whisper's word-level timing system uses the model's cross-attention mechanism to align decoder tokens with encoder audio representations. When enabled, the system runs an additional alignment pass that hooks the decoder's cross-attention layers, keeps the attention weights from the model's designated alignment heads, applies median filtering, and computes a Dynamic Time Warping (DTW) path that maps each word to a time span in the audio stream.
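To illustrate the median-filtering step mentioned above, here is a minimal pure-Python sketch that smooths one row of attention weights along the time axis. It is not Whisper's implementation: the function name `median_filter_1d`, the reflect-style padding, and the toy values are assumptions chosen for illustration.

```python
from statistics import median


def median_filter_1d(values, width=7):
    """Replace each value with the median of a window centered on it.

    Edges are padded by mirroring the neighboring values (an assumption
    of this sketch). Inputs too short to pad are returned unchanged.
    """
    if width % 2 != 1:
        raise ValueError("width must be odd")
    values = list(values)
    half = width // 2
    if len(values) <= half:
        return values
    left = values[half:0:-1]            # mirror of the first `half` neighbors
    right = values[-2:-half - 2:-1]     # mirror of the last `half` neighbors
    padded = left + values + right
    return [median(padded[i:i + width]) for i in range(len(values))]


# A single-frame spike (9) is removed, while the sustained step to 1 survives.
row = [0, 0, 9, 0, 0, 1, 1, 1]
smoothed = median_filter_1d(row, width=3)
print(smoothed)
```

The point of the filter is that isolated attention spikes, which would otherwise pull the alignment path off course, are suppressed before the path is computed.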

## Enabling Word-Level Timing in Whisper

You can extract word-level timestamps through both command-line and programmatic interfaces. Both methods trigger the same underlying alignment pipeline in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py).

### Command-Line Interface (CLI) Method

The quickest way to get word timestamps is the `whisper` command with `--word_timestamps True`. The option takes an explicit boolean value, and JSON output is selected with `--output_format json`, which writes `audio.json` into the output directory:

```bash
whisper path/to/audio.wav \
    --model medium \
    --language en \
    --word_timestamps True \
    --prepend_punctuations "\"'“¿([{-" \
    --append_punctuations "\"'.。,，!！?？:：”)]}、" \
    --output_format json \
    --output_dir .
```

Key parameters include:
- `--word_timestamps True`: Activates the alignment pipeline in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) (line 401)
- `--prepend_punctuations`: Characters merged with the following word (default: `"\"'“¿([{-`)
- `--append_punctuations`: Characters merged with the preceding word (default: `"\"'.。,，!！?？:：”)]}、"`)

### Python API Method

For programmatic control, pass `word_timestamps=True` to the `transcribe` function:

```python
from whisper import load_model, transcribe

model = load_model("medium")
audio = "path/to/audio.wav"

result = transcribe(
    model,
    audio,
    language="en",
    word_timestamps=True,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,，!！?？:：”)]}、",
)

# Access word-level data: each segment carries its own "words" list
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

```

## The Technical Pipeline: From Audio to Word Timestamps

When `word_timestamps=True`, Whisper executes a sophisticated alignment pipeline across several modules. The process transforms decoder tokens into temporally accurate word boundaries using cross-attention weights and Dynamic Time Warping.

### 1. Transcription Loop Initialization

In [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) (lines 401-410), the main transcription loop checks for the `word_timestamps` flag and invokes `add_word_timestamps` from [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) when enabled. This function serves as the entry point for the alignment process.

### 2. Cross-Attention Extraction

The `find_alignment` function in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) (lines 63-71) attaches forward hooks to every decoder cross-attention block to capture Query-Key attention maps (`QKs`). This occurs at lines 85-92, where the hooks extract attention weights between text tokens and audio frames during the forward pass.

### 3. Dynamic Time Warping Alignment

The captured attention weights undergo normalization and median filtering before computing a **Dynamic Time Warping** (DTW) alignment between token indices and audio frames (lines 141-147 in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)). This establishes the temporal mapping for each token by finding the optimal path through the attention matrix.
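The DTW step can be sketched in a few lines of pure Python. This is a simplified illustration, not the code in `whisper/timing.py`: the function name `dtw_path` and the toy score matrix are assumptions, and the scores are negated so that strong attention corresponds to low cost.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through a cost matrix.

    cost[i][j] is the cost of aligning token i with audio frame j.
    The path runs from (0, 0) to (last token, last frame); each step
    advances the token, the frame, or both.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of reaching cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Walk back from the final cell, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (acc[i - 1][j], i - 1, j),          # previous token, same frame
            (acc[i][j - 1], i, j - 1),          # same token, previous frame
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


# Toy attention scores: 3 tokens (rows) over 6 audio frames (columns).
scores = [
    [9, 8, 1, 0, 0, 0],
    [1, 2, 9, 7, 1, 0],
    [0, 1, 0, 3, 8, 9],
]
cost = [[-s for s in row] for row in scores]
for token_index, frame_index in dtw_path(cost):
    print(token_index, frame_index)
```

The frame at which the path first reaches a new token is what later becomes that token's start time.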

### 4. Token-to-Word Conversion

Using `split_to_word_tokens` from [`whisper/tokenizer.py`](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) (lines 77-85), the system groups tokens into words based on the tokenizer's word boundaries. The `split_tokens_on_spaces` helper (lines 111-126) handles space-delimited languages by identifying space tokens and splitting sequences accordingly.
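As a rough illustration of space-based grouping, the sketch below merges decoded subword strings into words. It is a simplification under stated assumptions: it works on plain strings rather than token IDs, the function name `group_subwords_into_words` is hypothetical, and the rule used (a piece starts a new word if it begins with a space or consists only of punctuation) is a simplified stand-in for the tokenizer's logic.

```python
import string


def group_subwords_into_words(subwords):
    """Group decoded subword strings into words.

    A piece starts a new word when it begins with a space or is made up
    entirely of ASCII punctuation; otherwise it continues the current word.
    """
    words = []
    for piece in subwords:
        stripped = piece.strip()
        is_punctuation = stripped != "" and all(
            ch in string.punctuation for ch in stripped
        )
        if not words or piece.startswith(" ") or is_punctuation:
            words.append(piece)
        else:
            words[-1] += piece
    return words


pieces = [" Hello", " wor", "ld", ",", " it", "'s", " fine", "."]
print(group_subwords_into_words(pieces))
```

Note that punctuation still stands alone at this stage; attaching it to neighboring words is a separate, later step.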

### 5. Time Extraction and Data Structure Creation

For each word, the system derives start/end frames from the DTW path, converts them to seconds using `TOKENS_PER_SECOND`, and instantiates `WordTiming` dataclasses (lines 154-162 in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)). The final output structure contains:

```json
{
  "word": "example",
  "start": 12.34,
  "end": 12.58,
  "probability": 0.93
}

```
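The conversion from a DTW path to word times can be sketched as follows. This is an illustration under assumptions rather than the repository's code: it assumes 50 alignment frames per second (20 ms per frame), assumes the path ends with one extra sentinel token (such as end-of-text) so the last word has an end time, and uses the hypothetical helper names `token_start_times` and `word_timings`.

```python
TOKENS_PER_SECOND = 50  # assumed: one alignment frame every 20 ms


def token_start_times(path):
    """Time (in seconds) at which the path first reaches each token."""
    starts = []
    previous_token = None
    for token_index, frame_index in path:
        if token_index != previous_token:
            starts.append(frame_index / TOKENS_PER_SECOND)
            previous_token = token_index
    return starts


def word_timings(path, words):
    """Build word entries from a path and a list of (text, n_tokens) pairs.

    A word starts when its first token starts and ends when the token
    following its last token starts.
    """
    starts = token_start_times(path)
    timings = []
    position = 0
    for text, n_tokens in words:
        timings.append(
            {"word": text, "start": starts[position], "end": starts[position + n_tokens]}
        )
        position += n_tokens
    return timings


# Tokens 0 and 1 form " hello", token 2 is " world", token 3 is the sentinel.
path = (
    [(0, f) for f in range(0, 10)]
    + [(1, f) for f in range(10, 25)]
    + [(2, f) for f in range(25, 50)]
    + [(3, 50)]
)
for entry in word_timings(path, [(" hello", 2), (" world", 1)]):
    print(entry)
```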

### 6. Punctuation Merging and Post-Processing

The `merge_punctuations` function (lines 45-62 in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)) handles `prepend_punctuations` and `append_punctuations`, attaching specified characters to adjacent words. Additional heuristics in `add_word_timestamps` truncate words exceeding median duration thresholds to prevent alignment errors caused by silence or noise.
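The effect of the two punctuation settings can be illustrated with a small sketch that works on plain word strings. It is a simplified stand-in, not the repository's `merge_punctuations`: the helper name `merge_punctuation` is hypothetical, and it ignores the token lists and timings that the real function also has to carry along.

```python
PREPEND = "\"'“¿([{-"
APPEND = "\"'.。,，!！?？:：”)]}、"


def merge_punctuation(words, prepended=PREPEND, appended=APPEND):
    """Attach standalone punctuation marks to their neighboring words."""
    # Pass 1: an opening mark that carries the leading space is glued
    # onto the word that follows it.
    step1 = []
    carry = ""
    for word in words:
        stripped = word.strip()
        if word.startswith(" ") and stripped and stripped in prepended:
            carry += word
        else:
            step1.append(carry + word)
            carry = ""
    if carry:
        step1.append(carry)

    # Pass 2: a closing mark is glued onto the word that precedes it.
    merged = []
    for word in step1:
        if merged and word and word in appended:
            merged[-1] += word
        else:
            merged.append(word)
    return merged


words = [" He", " said", ",", " (", "hello", ")", "."]
print(merge_punctuation(words))
```

Passing an empty string for either setting disables that direction of merging in this sketch, which leaves the marks as separate entries.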

### 7. Result Enrichment

Finally, the computed word entries attach to each segment under the `words` key (e.g., `segment["words"] = [...]`). This enrichment allows downstream consumers to access precise timing without reprocessing the audio.

## Processing Word Timestamps Programmatically

Once you have the transcription result, extract a flat list of word timestamps for export or analysis:

```python
import csv

# Flatten every segment's words into (word, start, end, probability) rows
word_times = [
    (w["word"], w["start"], w["end"], w["probability"])
    for seg in result["segments"]
    for w in seg.get("words", [])
]

# Export to CSV for further analysis (UTF-8 so non-ASCII words survive)
with open("word_times.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "start_sec", "end_sec", "probability"])
    writer.writerows(word_times)

```

For subtitle generation with word highlighting, pass `--highlight_words True` together with `--word_timestamps True` in the CLI. This uses the timing data in [`whisper/utils.py`](https://github.com/openai/whisper/blob/main/whisper/utils.py) (around line 190) to underline each word as it is spoken in `srt` and `vtt` output, creating karaoke-style subtitles.
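If you would rather build subtitles yourself from the word list, a minimal sketch might emit one SRT cue per word. The helper names `srt_timestamp` and `words_to_srt` are hypothetical, and the input is assumed to be a list of dictionaries with the `word`, `start`, and `end` keys described earlier.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def words_to_srt(words):
    """Render one numbered SRT cue per word, separated by blank lines."""
    cues = []
    for index, word in enumerate(words, start=1):
        start = srt_timestamp(word["start"])
        end = srt_timestamp(word["end"])
        cues.append(f"{index}\n{start} --> {end}\n{word['word'].strip()}\n")
    return "\n".join(cues)


sample_words = [
    {"word": " Hello", "start": 0.0, "end": 0.42},
    {"word": " world", "start": 0.42, "end": 1.0},
]
print(words_to_srt(sample_words))
```

One cue per word is the simplest layout; grouping several words per cue is usually easier to read on screen.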

## Summary

- **Enable word-level timestamps** by passing `--word_timestamps True` in the CLI or `word_timestamps=True` in the Python API.
- The feature uses **cross-attention weights** and **Dynamic Time Warping** (DTW) to align tokens with audio frames, implemented in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py).
- **Punctuation handling** is controlled via `prepend_punctuations` and `append_punctuations` parameters, which merge specified characters with adjacent words.
- Output includes **start time, end time, and probability** for each word, accessible via `segment["words"]` in the result dictionary.
- For visualization, add `--highlight_words True` to generate `srt` or `vtt` subtitles that underline the word currently being spoken.

## Frequently Asked Questions

### What is the accuracy of Whisper's word-level timestamps?

Word-level timestamps rely on attention-based alignment rather than forced alignment, making them accurate for clear speech but potentially less precise for rapid speech or overlapping dialogue. The system applies median filtering and duration heuristics in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) to correct obvious misalignments, but results should be validated against the audio for critical applications requiring frame-perfect accuracy.

### Can I use word timestamps with all Whisper model sizes?

Yes, word-level timestamps work with all model sizes (tiny, base, small, medium, large), though larger models generally produce more reliable alignments due to better attention patterns. The alignment computation in `find_alignment` is model-agnostic and operates on the cross-attention weights regardless of model depth or width, though processing time increases with model size due to larger attention matrices.

### How do I handle punctuation when extracting word timestamps?

Use the `prepend_punctuations` and `append_punctuations` arguments to control whether punctuation marks attach to words. By default, opening brackets and quotes prepend to the following word, while closing punctuation appends to the preceding word. This merging occurs in `merge_punctuations` within [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) before final time extraction, ensuring that punctuation inherits the timing of its associated word rather than standing alone.

### Why are some words missing timestamps or showing identical start/end times?

This typically occurs when the DTW alignment cannot confidently map a token to a specific audio frame, often due to silence, noise, or very short words. The `add_word_timestamps` function applies median-duration heuristics to truncate or adjust words that exceed reasonable length thresholds, which can result in zero-duration entries for uncertain alignments. The `probability` field reflects the model's confidence in the recognized word rather than in its timing, but low values are still a useful signal for words that may require manual review.
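A simple way to surface such entries is to scan the word list for very short durations or low probabilities. The sketch below is illustrative only: the function name `flag_suspicious_words` and both thresholds are arbitrary assumptions that you would tune for your own audio.

```python
def flag_suspicious_words(words, min_probability=0.5, min_duration=0.02):
    """Return (word, reasons) pairs for entries worth a manual check.

    Both thresholds are illustrative defaults, not values used by Whisper.
    """
    flagged = []
    for word in words:
        reasons = []
        if word["end"] - word["start"] < min_duration:
            reasons.append("near-zero duration")
        if word.get("probability", 1.0) < min_probability:
            reasons.append("low probability")
        if reasons:
            flagged.append((word["word"], reasons))
    return flagged


sample = [
    {"word": " example", "start": 12.34, "end": 12.58, "probability": 0.93},
    {"word": " uh", "start": 13.10, "end": 13.10, "probability": 0.31},
    {"word": " maybe", "start": 13.40, "end": 13.90, "probability": 0.42},
]
for text, reasons in flag_suspicious_words(sample):
    print(text, reasons)
```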