How to Extract Word-Level Timing Information in Whisper: A Complete Guide
Enable word-level timestamps in OpenAI Whisper by passing the --word_timestamps flag in the CLI or setting word_timestamps=True in the Python API, which uses cross-attention weights and Dynamic Time Warping to align each word with its precise audio position.
The openai/whisper repository provides a robust speech recognition system that transcribes audio with high accuracy. While standard transcription returns timestamps for entire segments, many applications—such as subtitle generation, video editing, and phonetic analysis—require granular word-level timing information. This guide explains how to activate Whisper's experimental word timestamp feature and details the underlying alignment pipeline implemented across the codebase.
How Word-Level Timestamps Work in Whisper
Whisper's word-level timing system leverages the model's cross-attention mechanisms to align decoder tokens with encoder audio representations. When enabled, the system runs an additional alignment pass that captures attention weights from every decoder block, applies median filtering, and computes a Dynamic Time Warping (DTW) path to map each word to its precise temporal location in the audio stream.
Enabling Word-Level Timing in Whisper
You can extract word-level timestamps through both command-line and programmatic interfaces. Both methods trigger the same underlying alignment pipeline in whisper/timing.py.
Command-Line Interface (CLI) Method
The fastest way to extract word timestamps is using the built-in transcription module with the --word_timestamps flag:
python -m whisper.transcribe path/to/audio.wav \
--model medium \
--language en \
--word_timestamps \
--prepend_punctuations "\"'“¿([{-" \
--append_punctuations "\"'.。,,!!??::”)]}、" \
--output_json result.json
Key parameters include:
--word_timestamps: Activates the alignment pipeline inwhisper/transcribe.py(line 401)--prepend_punctuations: Characters merged with the following word (default:"\"'“¿([{-)--append_punctuations: Characters merged with the preceding word (default:"\"'.。,,!!??::”)]}、")
Python API Method
For programmatic control, pass word_timestamps=True to the transcribe function:
from whisper import load_model, transcribe
model = load_model("medium")
audio = "path/to/audio.wav"
result = transcribe(
model,
audio,
language="en",
word_timestamps=True,
prepend_punctuations="\"'“¿([{-",
append_punctuations="\"'.。,,!!??::”)]}、",
)
# Access word-level data
for segment in result["segments"]:
for word in segment.get("words", []):
print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
The Technical Pipeline: From Audio to Word Timestamps
When word_timestamps=True, Whisper executes a sophisticated alignment pipeline across several modules. The process transforms decoder tokens into temporally accurate word boundaries using cross-attention weights and Dynamic Time Warping.
1. Transcription Loop Initialization
In whisper/transcribe.py (lines 401-410), the main transcription loop checks for the word_timestamps flag and invokes add_word_timestamps from whisper/timing.py when enabled. This function serves as the entry point for the alignment process.
2. Cross-Attention Extraction
The find_alignment function in whisper/timing.py (lines 63-71) attaches forward hooks to every decoder cross-attention block to capture Query-Key attention maps (QKs). This occurs at lines 85-92, where the hooks extract attention weights between text tokens and audio frames during the forward pass.
3. Dynamic Time Warping Alignment
The captured attention weights undergo normalization and median filtering before computing a Dynamic Time Warping (DTW) alignment between token indices and audio frames (lines 141-147 in whisper/timing.py). This establishes the temporal mapping for each token by finding the optimal path through the attention matrix.
4. Token-to-Word Conversion
Using split_to_word_tokens from whisper/tokenizer.py (lines 77-85), the system groups tokens into words based on the tokenizer's word boundaries. The split_tokens_on_spaces helper (lines 111-126) handles space-delimited languages by identifying space tokens and splitting sequences accordingly.
5. Time Extraction and Data Structure Creation
For each word, the system derives start/end frames from the DTW path, converts them to seconds using TOKENS_PER_SECOND, and instantiates WordTiming dataclasses (lines 154-162 in whisper/timing.py). The final output structure contains:
{
"word": "example",
"start": 12.34,
"end": 12.58,
"probability": 0.93
}
6. Punctuation Merging and Post-Processing
The merge_punctuations function (lines 45-62 in whisper/timing.py) handles prepend_punctuations and append_punctuations, attaching specified characters to adjacent words. Additional heuristics in add_word_timestamps truncate words exceeding median duration thresholds to prevent alignment errors caused by silence or noise.
7. Result Enrichment
Finally, the computed word entries attach to each segment under the words key (e.g., segment["words"] = [...]). This enrichment allows downstream consumers to access precise timing without reprocessing the audio.
Processing Word Timestamps Programmatically
Once you have the transcription result, extract a flat list of word timestamps for export or analysis:
word_times = [
(w["word"], w["start"], w["end"], w["probability"])
for seg in result["segments"]
for w in seg.get("words", [])
]
# Export to CSV for further analysis
import csv
with open("word_times.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["word", "start_sec", "end_sec", "probability"])
writer.writerows(word_times)
For subtitle generation with word highlighting, use the --highlight_words flag in the CLI. This leverages the timing data in whisper/utils.py (around line 190) to underline each word as it is spoken, creating karaoke-style subtitles.
Summary
- Enable word-level timestamps by passing
--word_timestampsin the CLI orword_timestamps=Truein the Python API. - The feature uses cross-attention weights and Dynamic Time Warping (DTW) to align tokens with audio frames, implemented in
whisper/timing.py. - Punctuation handling is controlled via
prepend_punctuationsandappend_punctuationsparameters, which merge specified characters with adjacent words. - Output includes start time, end time, and probability for each word, accessible via
segment["words"]in the result dictionary. - For visualization, combine with
--highlight_wordsto generate subtitles that emphasize currently spoken words.
Frequently Asked Questions
What is the accuracy of Whisper's word-level timestamps?
Word-level timestamps rely on attention-based alignment rather than forced alignment, making them accurate for clear speech but potentially less precise for rapid speech or overlapping dialogue. The system applies median filtering and duration heuristics in whisper/timing.py to correct obvious misalignments, but results should be validated against the audio for critical applications requiring frame-perfect accuracy.
Can I use word timestamps with all Whisper model sizes?
Yes, word-level timestamps work with all model sizes (tiny, base, small, medium, large), though larger models generally produce more reliable alignments due to better attention patterns. The alignment computation in find_alignment is model-agnostic and operates on the cross-attention weights regardless of model depth or width, though processing time increases with model size due to larger attention matrices.
How do I handle punctuation when extracting word timestamps?
Use the prepend_punctuations and append_punctuations arguments to control whether punctuation marks attach to words. By default, opening brackets and quotes prepend to the following word, while closing punctuation appends to the preceding word. This merging occurs in merge_punctuations within whisper/timing.py before final time extraction, ensuring that punctuation inherits the timing of its associated word rather than standing alone.
Why are some words missing timestamps or showing identical start/end times?
This typically occurs when the DTW alignment cannot confidently map a token to a specific audio frame, often due to silence, noise, or very short words. The add_word_timestamps function applies median-duration filtering to truncate or adjust words that exceed reasonable length thresholds, which can result in zero-duration entries for uncertain alignments. Checking the probability field helps identify low-confidence word timings that may require manual review.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →