How to Extract Word-Level Timing Information in Whisper: A Complete Guide

Enable word-level timestamps in OpenAI Whisper by passing --word_timestamps True on the command line or setting word_timestamps=True in the Python API. Whisper then uses cross-attention weights and Dynamic Time Warping to align each word with its position in the audio.

The openai/whisper repository provides a robust speech recognition system that transcribes audio with high accuracy. While standard transcription returns timestamps for entire segments, many applications—such as subtitle generation, video editing, and phonetic analysis—require granular word-level timing information. This guide explains how to activate Whisper's experimental word timestamp feature and details the underlying alignment pipeline implemented across the codebase.

How Word-Level Timestamps Work in Whisper

Whisper's word-level timing system uses the model's cross-attention to align decoder tokens with encoder audio representations. When enabled, the system runs an additional alignment pass that hooks the cross-attention layer of each decoder block, keeps the attention weights from the model's designated alignment heads, applies median filtering, and computes a Dynamic Time Warping (DTW) path that maps each word to its position in the audio stream.
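
To make the DTW step concrete, here is a minimal pure-Python sketch that finds the cheapest monotonic path through a small token-by-frame matrix. The toy attention values, the dtw_path name, and the choice of cost = 1 - weight are assumptions made for illustration; the library's own implementation differs in details such as normalization and vectorization.

```python
import math

def dtw_path(cost):
    """Return the minimum-cost monotonic path through a cost matrix.

    cost[i][j] is the cost of matching token i to audio frame j.
    The result is a list of (token_index, frame_index) pairs.
    """
    n, m = len(cost), len(cost[0])
    # acc[i][j] is the cheapest cumulative cost of reaching cell (i-1, j-1).
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    # step[i][j] records the move used: 0 = diagonal, 1 = from above, 2 = from the left.
    step = [[-1] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            best = min(range(3), key=lambda k: candidates[k])
            acc[i][j] = cost[i - 1][j - 1] + candidates[best]
            step[i][j] = best

    # Walk back from the bottom-right corner to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        if step[i][j] == 0:
            i, j = i - 1, j - 1
        elif step[i][j] == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path

# Toy attention matrix (assumed values): 3 tokens x 6 frames.
attention = [
    [0.9, 0.8, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.2, 0.8, 0.9],
]
# DTW minimises cost, so strong attention must become low cost.
cost = [[1.0 - w for w in row] for row in attention]
path = dtw_path(cost)
print(path)
```

Each pair in the path links one token to one audio frame, so the first frame at which a token appears marks where that token starts.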

Enabling Word-Level Timing in Whisper

You can extract word-level timestamps through both command-line and programmatic interfaces. Both methods trigger the same underlying alignment pipeline in whisper/timing.py.

Command-Line Interface (CLI) Method

The fastest way to extract word timestamps is the whisper command with --word_timestamps True (the option expects an explicit boolean value):

# Writes audio.json to the current directory
whisper path/to/audio.wav \
    --model medium \
    --language en \
    --word_timestamps True \
    --prepend_punctuations "\"'“¿([{-" \
    --append_punctuations "\"'.。,,!!??::”)]}、" \
    --output_format json \
    --output_dir .

Key parameters include:

  • --word_timestamps True: Activates the alignment pipeline in whisper/transcribe.py (line 401)
  • --prepend_punctuations: Characters merged with the following word (default: "\"'“¿([{-")
  • --append_punctuations: Characters merged with the preceding word (default: "\"'.。,,!!??::”)]}、")

Python API Method

For programmatic control, pass word_timestamps=True to the transcribe function:

from whisper import load_model, transcribe

model = load_model("medium")
audio = "path/to/audio.wav"

result = transcribe(
    model,
    audio,
    language="en",
    word_timestamps=True,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,,!!??::”)]}、",
)

# Access word-level data

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

The Technical Pipeline: From Audio to Word Timestamps

When word_timestamps=True, Whisper executes a sophisticated alignment pipeline across several modules. The process transforms decoder tokens into temporally accurate word boundaries using cross-attention weights and Dynamic Time Warping.

1. Transcription Loop Initialization

In whisper/transcribe.py (lines 401-410), the main transcription loop checks for the word_timestamps flag and invokes add_word_timestamps from whisper/timing.py when enabled. This function serves as the entry point for the alignment process.

2. Cross-Attention Extraction

The find_alignment function in whisper/timing.py (lines 63-71) attaches forward hooks to every decoder cross-attention block to capture Query-Key attention maps (QKs). This occurs at lines 85-92, where the hooks extract attention weights between text tokens and audio frames during the forward pass.

3. Dynamic Time Warping Alignment

The captured attention weights undergo normalization and median filtering before computing a Dynamic Time Warping (DTW) alignment between token indices and audio frames (lines 141-147 in whisper/timing.py). This establishes the temporal mapping for each token by finding the optimal path through the attention matrix.
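
The smoothing part of this step can be sketched in a few lines. The example below is a one-dimensional median filter with reflected edges, written in pure Python for illustration; the function name, the toy attention row, and the window width are assumptions, and the library applies its filter to whole attention tensors rather than a single list.

```python
from statistics import median

def median_filter_1d(values, width=7):
    """Median-filter a sequence along time, reflecting values at both edges."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    pad = width // 2
    if pad == 0 or len(values) <= pad:
        return list(values)  # too short to filter, so return it unchanged
    # Reflect padding mirrors the signal without repeating the edge value.
    left = list(values[1:pad + 1])[::-1]
    right = list(values[-pad - 1:-1])[::-1]
    padded = left + list(values) + right
    return [median(padded[i:i + width]) for i in range(len(values))]

# One token's attention over eight frames (assumed values) with a spike at frame 2.
row = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.8]
smoothed = median_filter_1d(row, width=3)
print(smoothed)
```

The isolated spike at frame 2 disappears while the sustained peak at the end survives, which is the behaviour that keeps a single noisy frame from pulling the DTW path off course.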

4. Token-to-Word Conversion

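Using split_to_word_tokens from whisper/tokenizer.py (lines 77-85), the system groups tokens into words based on the tokenizer's word boundaries. The split_tokens_on_spaces helper (lines 111-126) handles space-delimited languages by starting a new word whenever a decoded token begins with a space or is a punctuation mark. Languages written without spaces, such as Chinese and Japanese, are split on Unicode boundaries by split_tokens_on_unicode instead.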
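
The grouping rule for space-delimited languages can be sketched as follows. This is a simplified illustration that works on already-decoded token strings; the function name and the sample tokens are assumptions, and the real helper also tracks token ids and special tokens.

```python
import string

def group_tokens_into_words(token_texts):
    """Group decoded subword strings into words.

    A new word starts at a token with a leading space, at a standalone
    punctuation mark, or at the very first token.
    """
    words = []
    for text in token_texts:
        with_space = text.startswith(" ")
        is_punctuation = text.strip() in string.punctuation
        if with_space or is_punctuation or not words:
            words.append(text)
        else:
            words[-1] += text  # continuation of the current word
    return words

# Hypothetical decoded tokens for "Timing works. Really"
tokens = [" Tim", "ing", " works", ".", " Really"]
words = group_tokens_into_words(tokens)
print(words)
```

Punctuation stays a separate entry at this stage, which is why a later merging pass is needed to attach it to a neighbouring word.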

5. Time Extraction and Data Structure Creation

For each word, the system derives start/end frames from the DTW path, converts them to seconds using TOKENS_PER_SECOND, and instantiates WordTiming dataclasses (lines 154-162 in whisper/timing.py). The final output structure contains:

{
  "word": "example",
  "start": 12.34,
  "end": 12.58,
  "probability": 0.93
}
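
A rough sketch of the frame-to-seconds conversion is shown below. It assumes TOKENS_PER_SECOND is 50 (20 ms per alignment frame), uses a hypothetical WordSpan dataclass as a stand-in for WordTiming, and treats the frame after the last path entry as the final boundary, which is a simplification.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 50  # assumption: one alignment frame covers 20 ms of audio

@dataclass
class WordSpan:
    """Hypothetical stand-in for the library's WordTiming dataclass."""
    word: str
    start: float
    end: float

def word_spans(words, tokens_per_word, path):
    """Turn a DTW path of (token_index, frame_index) pairs into per-word times."""
    # The first frame at which each token appears marks where it starts.
    first_frame = {}
    for token_index, frame_index in path:
        first_frame.setdefault(token_index, frame_index)
    final_boundary = path[-1][1] + 1  # simplification: end right after the last frame

    spans = []
    token_start = 0
    for word, count in zip(words, tokens_per_word):
        token_end = token_start + count
        start = first_frame[token_start]
        # A word ends where the next word's first token begins.
        end = first_frame.get(token_end, final_boundary)
        spans.append(WordSpan(word, start / TOKENS_PER_SECOND, end / TOKENS_PER_SECOND))
        token_start = token_end
    return spans

# Path for 3 tokens over 6 frames; " Timing" spans two tokens, " works" one.
path = [(0, 0), (0, 1), (1, 2), (1, 3), (2, 4), (2, 5)]
spans = word_spans([" Timing", " works"], [2, 1], path)
for span in spans:
    print(span)
```

Because each frame is 20 ms under this assumption, word boundaries can only fall on multiples of 0.02 seconds.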

6. Punctuation Merging and Post-Processing

The merge_punctuations function (lines 45-62 in whisper/timing.py) handles prepend_punctuations and append_punctuations, attaching specified characters to adjacent words. Additional heuristics in add_word_timestamps truncate words exceeding median duration thresholds to prevent alignment errors caused by silence or noise.
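
The merging idea can be illustrated with a text-only sketch. The function below is not the library's merge_punctuations: its name is hypothetical, it works on plain strings rather than timing objects, and its defaults are an ASCII-only subset of the real punctuation sets.

```python
def merge_punctuation_text(words, prepended="\"'([{-", appended="\"'.,!?:)]}"):
    """Fold standalone punctuation entries into the neighbouring word."""
    merged = []
    pending_prefix = ""
    for word in words:
        stripped = word.strip()
        if word.startswith(" ") and stripped and stripped in prepended:
            # Opening punctuation waits for the word that follows it.
            pending_prefix += word
        elif (merged and not pending_prefix and not word.startswith(" ")
              and stripped and stripped in appended):
            # Closing punctuation attaches to the word before it.
            merged[-1] += word
        else:
            merged.append(pending_prefix + word)
            pending_prefix = ""
    if pending_prefix:
        merged.append(pending_prefix)  # nothing followed the opening mark
    return merged

words = [" He", " said", ",", " \"", "hello", "\"", "."]
merged = merge_punctuation_text(words)
print(merged)
```

In the real pipeline the merged entry keeps the timing of the word it joins, so punctuation never appears as a separate timed item.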

7. Result Enrichment

Finally, the computed word entries attach to each segment under the words key (e.g., segment["words"] = [...]). This enrichment allows downstream consumers to access precise timing without reprocessing the audio.

Processing Word Timestamps Programmatically

Once you have the transcription result, extract a flat list of word timestamps for export or analysis:

word_times = [
    (w["word"], w["start"], w["end"], w["probability"])
    for seg in result["segments"]
    for w in seg.get("words", [])
]

# Export to CSV for further analysis

import csv
with open("word_times.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "start_sec", "end_sec", "probability"])
    writer.writerows(word_times)

For subtitle generation with word highlighting, pass --highlight_words True together with --word_timestamps True in the CLI. This uses the timing data in whisper/utils.py (around line 190) to underline each word as it is spoken, creating karaoke-style subtitles.
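
If you need the same effect outside the CLI, the idea can be approximated by hand. The sketch below builds one SRT cue per word and underlines the active word with a u tag; the helper names and sample words are assumptions, and unlike the built-in writer it emits no cues for the gaps between words.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"

def underline(text):
    """Wrap the word in <u> tags while keeping its leading whitespace outside."""
    stripped = text.lstrip()
    lead = text[: len(text) - len(stripped)]
    return f"{lead}<u>{stripped}</u>"

def highlighted_cues(words):
    """Yield one SRT cue per word, underlining the word being spoken."""
    texts = [w["word"] for w in words]
    for index, word in enumerate(words):
        line = "".join(
            underline(text) if i == index else text
            for i, text in enumerate(texts)
        ).strip()
        yield (
            f"{index + 1}\n"
            f"{format_timestamp(word['start'])} --> {format_timestamp(word['end'])}\n"
            f"{line}\n"
        )

# Hypothetical word entries in the shape found under segment["words"].
words = [
    {"word": " Hello", "start": 0.0, "end": 0.42},
    {"word": " world", "start": 0.42, "end": 1.0},
]
cues = list(highlighted_cues(words))
print("\n".join(cues))
```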

Summary

  • Enable word-level timestamps by passing --word_timestamps True in the CLI or word_timestamps=True in the Python API.
  • The feature uses cross-attention weights and Dynamic Time Warping (DTW) to align tokens with audio frames, implemented in whisper/timing.py.
  • Punctuation handling is controlled via prepend_punctuations and append_punctuations parameters, which merge specified characters with adjacent words.
  • Output includes start time, end time, and probability for each word, accessible via segment["words"] in the result dictionary.
  • For visualization, combine with --highlight_words True to generate subtitles that underline the currently spoken word.

Frequently Asked Questions

What is the accuracy of Whisper's word-level timestamps?

Word-level timestamps rely on attention-based alignment rather than forced alignment, making them accurate for clear speech but potentially less precise for rapid speech or overlapping dialogue. The system applies median filtering and duration heuristics in whisper/timing.py to correct obvious misalignments, but results should be validated against the audio for critical applications requiring frame-perfect accuracy.

Can I use word timestamps with all Whisper model sizes?

Yes, word-level timestamps work with all model sizes (tiny, base, small, medium, large), though larger models generally produce more reliable alignments due to better attention patterns. The alignment computation in find_alignment follows the same steps for every model and operates on the cross-attention weights of that model's alignment heads, regardless of model depth or width. Processing time increases with model size because the extra forward pass is more expensive.

How do I handle punctuation when extracting word timestamps?

Use the prepend_punctuations and append_punctuations arguments to control whether punctuation marks attach to words. By default, opening brackets and quotes prepend to the following word, while closing punctuation appends to the preceding word. This merging occurs in merge_punctuations within whisper/timing.py, after the alignment has been computed and before the words are attached to their segments, so punctuation shares the timing of its associated word rather than standing alone.

Why are some words missing timestamps or showing identical start/end times?

This typically occurs when the DTW alignment cannot confidently map a token to a specific audio frame, often due to silence, noise, or very short words. The add_word_timestamps function applies median-duration filtering to truncate or adjust words that exceed reasonable length thresholds, which can result in zero-duration entries for uncertain alignments. Checking the probability field helps identify low-confidence word timings that may require manual review.
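
A simple way to surface such entries is to scan the result for low probability or near-zero duration. The thresholds and the function name below are illustrative assumptions rather than values taken from the library, and the sample result is made up to show the shape of the data.

```python
def flag_suspect_words(result, min_probability=0.5, min_duration=0.02):
    """Return words whose confidence or duration suggests an unreliable alignment."""
    suspects = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            duration = word["end"] - word["start"]
            if word["probability"] < min_probability or duration < min_duration:
                suspects.append(word)
    return suspects

# Made-up result in the shape returned when word timestamps are enabled.
sample_result = {
    "segments": [
        {
            "words": [
                {"word": " Okay", "start": 0.0, "end": 0.3, "probability": 0.95},
                {"word": " uh", "start": 0.3, "end": 0.3, "probability": 0.9},
                {"word": " maybe", "start": 0.3, "end": 0.8, "probability": 0.2},
            ]
        }
    ]
}
suspects = flag_suspect_words(sample_result)
for word in suspects:
    print(word["word"], word["start"], word["end"], word["probability"])
```

Words flagged this way are good candidates for manual review or for re-alignment with a dedicated forced aligner.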
