How to Enable and Customize Word-Level Timestamps in Whisper

Enable word-level timestamps by setting word_timestamps=True in the Python API or --word_timestamps true in the CLI, then customize behavior via prepend_punctuations, append_punctuations, and hallucination_silence_threshold parameters.

Whisper supports experimental word-level timestamps that estimate when each word is spoken, going beyond standard segment-level timing. This feature aligns decoded tokens to the mel-spectrogram using the model's cross-attention weights and dynamic time warping (DTW). According to the OpenAI Whisper source code, the entry point resides in whisper/transcribe.py while the core alignment logic lives in whisper/timing.py.

How Word-Level Timestamps Work

When activated, Whisper performs a post-processing alignment step that maps each token to its corresponding audio frame.

Token-to-Audio Alignment with DTW

The find_alignment function in whisper/timing.py captures cross-attention weights from the decoder layers using forward hooks. It normalizes these weights, applies a median filter to reduce noise, and runs a dynamic time warping (DTW) algorithm to align the token sequence against the mel-spectrogram frames. Each token receives a start and end time based on the DTW path, and the tokens are then grouped into words using the tokenizer's split_to_word_tokens method.
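
To make the DTW step concrete, here is a minimal pure-Python sketch. It is not Whisper's implementation: the attention matrix is hand-made (rows are tokens, columns are audio frames), the cost is taken as 1 - attention purely for illustration, and the conversion to seconds assumes 50 frames per second. The names dtw_path and token_start_times are hypothetical.

```python
from typing import Dict, List, Tuple

FRAMES_PER_SECOND = 50  # assumption: one attention frame per 20 ms of audio


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the cheapest monotonic (token, frame) path through a cost matrix."""
    n_tokens, n_frames = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest cumulative cost ending at token i-1, frame j-1
    acc = [[inf] * (n_frames + 1) for _ in range(n_tokens + 1)]
    # move[i][j]: 0 = diagonal, 1 = from previous token, 2 = from previous frame
    move = [[0] * (n_frames + 1) for _ in range(n_tokens + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            options = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            best = min(range(3), key=lambda k: options[k])
            acc[i][j] = cost[i - 1][j - 1] + options[best]
            move[i][j] = best

    # Walk back from the last cell to recover the path.
    path = []
    i, j = n_tokens, n_frames
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        if move[i][j] == 0:
            i, j = i - 1, j - 1
        elif move[i][j] == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_start_times(path: List[Tuple[int, int]]) -> Dict[int, float]:
    """Map each token index to the time of the first frame assigned to it."""
    starts: Dict[int, float] = {}
    for token, frame in path:
        starts.setdefault(token, frame / FRAMES_PER_SECOND)
    return starts


# Hand-made attention: each of 3 tokens attends mostly to 2 of 6 frames.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.9, 0.9],
]
cost = [[1.0 - a for a in row] for row in attention]
path = dtw_path(cost)
starts = token_start_times(path)
print(path)
print(starts)
```

In this toy case the path assigns two frames to each token, so the token start times fall at frames 0, 2, and 4.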

Punctuation Merging Logic

After alignment, the merge_punctuations function in whisper/timing.py attaches punctuation marks to adjacent words. Leading punctuation (opening quotes, brackets) attaches to the following word, while trailing punctuation (periods, commas, closing brackets) attaches to the preceding word. This prevents isolated punctuation marks from receiving separate timestamps.
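
The merging rule can be sketched in a few lines of plain Python. This is a simplified illustration rather than the library's code: it assumes word entries are dictionaries with word, start, and end keys, that a leading mark arrives with a leading space (as in " \"") and a trailing mark without one, and the function name merge_punctuation_sketch is hypothetical.

```python
from typing import Dict, List


def merge_punctuation_sketch(
    words: List[Dict], prepended: str, appended: str
) -> List[Dict]:
    """Fold standalone punctuation entries into the neighbouring word."""
    merged: List[Dict] = []
    pending: List[Dict] = []  # leading marks waiting for the next word
    for entry in words:
        text = entry["word"]
        mark = text.strip()
        if mark and text.startswith(" ") and mark in prepended:
            # Leading punctuation: hold it until the next word arrives.
            pending.append(entry)
        elif (mark and not text.startswith(" ") and mark in appended
              and merged and not pending):
            # Trailing punctuation: glue it onto the previous word,
            # which keeps its own start and end times.
            merged[-1]["word"] += text
        else:
            item = dict(entry)
            item["word"] = "".join(p["word"] for p in pending) + text
            pending = []
            merged.append(item)
    # Leading marks with no following word are kept unchanged.
    merged.extend(dict(p) for p in pending)
    return merged


sample = [
    {"word": ' "', "start": 0.00, "end": 0.05},
    {"word": "Hello", "start": 0.10, "end": 0.50},
    {"word": ",", "start": 0.50, "end": 0.55},
    {"word": " world", "start": 0.60, "end": 1.00},
    {"word": ".", "start": 1.00, "end": 1.05},
    {"word": '"', "start": 1.05, "end": 1.10},
]
result_words = merge_punctuation_sketch(
    sample,
    prepended="\"'“¿([{-",
    appended="\"'.。,，!！?？:：”)]}、",
)
for w in result_words:
    print(repr(w["word"]), w["start"], w["end"])
```

Six raw entries collapse into two words, and each merged word keeps the timing of the spoken word rather than that of the punctuation.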

Enabling Word-Level Timestamps via Python API

In whisper/transcribe.py (lines 50-57), the transcribe function accepts a word_timestamps boolean parameter that defaults to False. When set to True, the function invokes add_word_timestamps after decoding to populate the "words" field in each segment.

import whisper

model = whisper.load_model("base")
result = whisper.transcribe(
    model,
    "audio.wav",
    word_timestamps=True,                         # Enable word-level alignment
    prepend_punctuations="\"'‘“([{-",             # Custom leading punctuation
    append_punctuations="\"'.。,，!！?？:：”)]}、",  # Trailing punctuation to merge
    hallucination_silence_threshold=0.5,          # Skip silences > 0.5s near suspected hallucinations
)

# Each segment contains a "words" list with per-word timings
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

The resulting word objects contain word, start, end, and probability keys, with timestamps in seconds.
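
For export or further processing it is often handier to flatten the nested structure into one list of rows. The sketch below assumes a result dictionary shaped as described above; the sample data is hand-written for illustration, and flatten_words is a hypothetical helper, not part of Whisper.

```python
import json
from typing import Dict, List


def flatten_words(result: Dict) -> List[Dict]:
    """Collect every word from every segment into one flat list of rows."""
    rows = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            rows.append({
                "segment_id": segment.get("id"),
                "text": word["word"].strip(),
                "start": word["start"],
                "end": word["end"],
                "probability": word.get("probability"),
            })
    return rows


# Hand-written stand-in for a real transcription result.
sample_result = {
    "segments": [
        {"id": 0, "words": [
            {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
            {"word": " world.", "start": 0.5, "end": 0.9, "probability": 0.91},
        ]},
        {"id": 1, "words": [
            {"word": " Bye.", "start": 1.5, "end": 1.8, "probability": 0.87},
        ]},
    ]
}
rows = flatten_words(sample_result)
print(json.dumps(rows, indent=2))
```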

Command-Line Interface Usage

The CLI parser in whisper/transcribe.py exposes the --word_timestamps flag. You can combine this with punctuation customization options and the hallucination threshold:

whisper audio.wav --model medium \
    --word_timestamps true \
    --prepend_punctuations "\"'‘“" \
    --append_punctuations "!?,.;" \
    --hallucination_silence_threshold 0.5

To generate karaoke-style subtitles with highlighted active words, add the --highlight_words true flag, which underlines each word as it is spoken in the SRT and VTT outputs. This option requires --word_timestamps true; the CLI in whisper/transcribe.py checks for this.
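
The highlighting idea can be sketched as one subtitle cue per word, with the active word wrapped in underline tags. This is an illustration only, assuming word dictionaries with word, start, and end keys and the use of <u> tags; highlight_cues is a hypothetical helper, not Whisper's subtitle writer.

```python
from typing import Dict, List, Tuple


def highlight_cues(words: List[Dict]) -> List[Tuple[float, float, str]]:
    """Build one (start, end, text) cue per word, underlining the active word."""
    texts = [w["word"] for w in words]
    cues = []
    for active, word in enumerate(words):
        parts = []
        for index, text in enumerate(texts):
            if index == active:
                body = text.lstrip()
                lead = text[: len(text) - len(body)]  # keep leading spaces outside the tag
                parts.append(f"{lead}<u>{body}</u>")
            else:
                parts.append(text)
        cues.append((word["start"], word["end"], "".join(parts).strip()))
    return cues


demo_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.5, "end": 0.9},
]
cues = highlight_cues(demo_words)
for start, end, text in cues:
    print(f"{start:.2f} --> {end:.2f}  {text}")
```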

Customizing Timestamp Behavior

Configuring Punctuation Attachment

The default punctuation sets in whisper/transcribe.py (lines 51-53) define which characters merge with words:

  • Prepend: "\"'“¿([{-" (opening quotes, inverted question mark, opening brackets, hyphen)
  • Append: "\"'.。,，!！?？:：”)]}、" (closing punctuation, including full-width CJK forms)

Override these via the prepend_punctuations and append_punctuations arguments to handle language-specific conventions or exclude certain marks from merging.

Hallucination Silence Threshold

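The hallucination_silence_threshold parameter, handled by the transcribe function in whisper/transcribe.py, only takes effect when word_timestamps is True. When a possible hallucination is detected, Whisper skips silent periods longer than the threshold (in seconds). Setting it to 0.5, for example, skips silences longer than half a second around suspected hallucinations, which helps remove phantom words that often appear during musical interludes or background noise.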
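
When choosing a threshold, it can help to look at how long the pauses between words actually are in your output. The sketch below is a post-hoc inspection helper, not Whisper's internal hallucination logic; it assumes a flat list of word dictionaries with start and end keys, and find_long_gaps is a hypothetical name.

```python
from typing import Dict, List, Tuple


def find_long_gaps(words: List[Dict], threshold: float) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) for pauses between words longer than threshold."""
    gaps = []
    for previous, following in zip(words, words[1:]):
        gap = following["start"] - previous["end"]
        if gap > threshold:
            gaps.append((previous["end"], following["start"]))
    return gaps


timed_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.5, "end": 0.9},
    {"word": " again", "start": 2.0, "end": 2.3},
]
long_gaps = find_long_gaps(timed_words, threshold=0.5)
print(long_gaps)
```

Here only the pause between "world" and "again" exceeds half a second, so a threshold of 0.5 would treat that stretch as a candidate silent period.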

Summary

  • Enable the feature by setting word_timestamps=True in Python or --word_timestamps true via CLI.
  • Core implementation lives in whisper/timing.py (add_word_timestamps, find_alignment, merge_punctuations).
  • Customize punctuation using prepend_punctuations and append_punctuations to control how quotes and brackets attach to words.
  • Filter hallucinations by setting hallucination_silence_threshold to skip implausibly long silent gaps.
  • Output format appears as a "words" list within each segment dictionary, containing word, start, end, and probability fields.

Frequently Asked Questions

What is the performance impact of enabling word-level timestamps?

Enabling word-level timestamps adds computational overhead because Whisper must capture cross-attention weights from the decoder layers and run the DTW algorithm on every segment, which requires an additional forward pass. A slowdown of roughly 10-20% is only a rough guide; the actual cost depends on audio length, model size, and hardware, so measure it on your own workload.
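
A simple way to measure the overhead on your own audio is to time the same call with the feature on and off. The helper below is a generic sketch: timed is a hypothetical name, and the stand-in workload exists only so the snippet runs without a model loaded.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (its result, elapsed wall-clock seconds)."""
    begin = time.perf_counter()
    value = fn(*args, **kwargs)
    return value, time.perf_counter() - begin


def stand_in_workload(n: int) -> int:
    """Placeholder for a transcription call, so this example is self-contained."""
    return sum(range(n))


value, seconds = timed(stand_in_workload, 100_000)
print(f"result={value}, elapsed={seconds:.4f}s")

# With a loaded model, the same helper could wrap both variants, for example:
#   _, plain = timed(model.transcribe, "audio.wav")
#   _, with_words = timed(model.transcribe, "audio.wav", word_timestamps=True)
#   print(f"overhead: {(with_words / plain - 1) * 100:.1f}%")
```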

Can I use word-level timestamps with all Whisper model sizes?

Yes, word-level timestamps work with all Whisper model sizes (tiny, base, small, medium, large). The alignment quality generally improves with larger models because cross-attention weights more accurately correlate tokens with audio features, though the DTW algorithm in find_alignment functions identically across all variants.

How does the punctuation merging algorithm work?

The merge_punctuations function iterates through the aligned word entries and checks whether each one appears in the prepend_punctuations or append_punctuations string. If an entry matches a prepend character, it attaches to the next word's text and adopts its timestamp; if it matches an append character, it attaches to the previous word. This ensures punctuation inherits the timing of the word it modifies rather than floating between words with separate timestamps.

Why are my word-level timestamps inaccurate for certain audio segments?

Inaccuracies typically occur during sections with heavy background noise, music, or overlapping speech where cross-attention weights become diffuse. The median filter and DTW algorithm in find_alignment may misalign tokens when the audio-to-text attention is ambiguous. Enabling hallucination_silence_threshold can help remove phantom words during silent gaps, but highly distorted audio may require manual alignment correction.
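
One practical way to find the segments worth checking by hand is to flag words whose probability value is low. The sketch below assumes the result structure described earlier; the 0.5 cutoff is an arbitrary illustrative value, the sample data is hand-written, and low_confidence_words is a hypothetical helper.

```python
from typing import Dict, List


def low_confidence_words(result: Dict, min_probability: float = 0.5) -> List[Dict]:
    """Return words whose probability falls below min_probability."""
    flagged = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            # Words without a probability are treated as confident.
            if word.get("probability", 1.0) < min_probability:
                flagged.append(word)
    return flagged


noisy_result = {
    "segments": [
        {"words": [
            {"word": " The", "start": 0.0, "end": 0.2, "probability": 0.95},
            {"word": " band", "start": 0.2, "end": 0.6, "probability": 0.31},
            {"word": " played", "start": 0.6, "end": 1.1, "probability": 0.88},
        ]},
        {"words": [
            {"word": " loudly", "start": 3.0, "end": 3.6, "probability": 0.12},
        ]},
    ]
}
flagged = low_confidence_words(noisy_result)
for word in flagged:
    print(f"{word['word'].strip()}: {word['start']:.2f}s (p={word['probability']:.2f})")
```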
