How to Enable and Customize Word-Level Timestamps in Whisper
Enable word-level timestamps by setting word_timestamps=True in the Python API or --word_timestamps True in the CLI (the CLI's boolean options expect the capitalized values True or False), then customize behavior via the prepend_punctuations, append_punctuations, and hallucination_silence_threshold parameters.
Whisper supports experimental word-level timestamps that estimate when each word occurs in the audio, going beyond standard segment-level timing. This feature aligns decoded tokens to the mel-spectrogram using the model's cross-attention weights and dynamic time warping (DTW). According to the OpenAI Whisper source code, the entry point resides in whisper/transcribe.py while the core alignment logic lives in whisper/timing.py.
How Word-Level Timestamps Work
When activated, Whisper performs a post-processing alignment step that maps each token to its corresponding audio frame.
Token-to-Audio Alignment with DTW
The find_alignment function in whisper/timing.py (lines 63-71) registers a forward hook on the cross-attention module of every decoder block, reruns the model over the decoded tokens, and keeps the attention weights from the model's designated alignment heads. It normalizes these weights, applies a median filter to reduce noise, averages them across heads, and runs a dynamic time warping (DTW) algorithm to align the token sequence against the mel-spectrogram frames. Each token receives a start and end time from the DTW path, and the tokens are then grouped into words using the tokenizer's split_to_word_tokens method.
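To make the alignment step concrete, here is a minimal NumPy sketch of DTW over a small token-by-frame matrix. It is an illustration under simplifying assumptions, not Whisper's implementation: the function name dtw_path, the toy attention values, the cost of one minus attention, and the 0.02-second frame duration are all chosen for this example.

```python
import numpy as np


def dtw_path(cost):
    """Return the cheapest monotonic path through a (tokens x frames) cost matrix."""
    n_tokens, n_frames = cost.shape
    # acc[i, j] is the cheapest cumulative cost of any path ending at cell (i-1, j-1).
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Walk back from the last cell, always moving to the cheapest predecessor.
    i, j = n_tokens, n_frames
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


# Toy attention weights: 3 tokens (rows) over 6 audio frames (columns).
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.9, 0.9],
])
path = dtw_path(1.0 - attention)  # high attention means low cost

# The first frame assigned to each token marks where that token starts.
start_frames = {}
for token, frame in path:
    start_frames.setdefault(token, frame)

SECONDS_PER_FRAME = 0.02  # assumed frame duration for this illustration
for token, frame in start_frames.items():
    print(f"token {token} starts at {frame * SECONDS_PER_FRAME:.2f}s")
```

The path may only move right, down, or diagonally, so token order and time order are both preserved. That monotonic constraint is what makes DTW a natural fit for aligning text to audio.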
Punctuation Merging Logic
After alignment, the merge_punctuations function (lines 45-52 in whisper/timing.py) attaches punctuation marks to adjacent words. Leading punctuation (opening quotes, brackets) attaches to the following word, while trailing punctuation (periods, commas, closing brackets) attaches to the preceding word. This prevents isolated punctuation marks from receiving separate timestamps.
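The merging rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: the function merge_punctuation, the dict layout, and the sample timings are assumptions made for the example, and the default character sets are copied from the defaults this article lists for transcribe.

```python
DEFAULT_PREPEND = "\"'“¿([{-"
DEFAULT_APPEND = "\"'.。,,!!??::”)]}、"


def merge_punctuation(words, prepend=DEFAULT_PREPEND, append=DEFAULT_APPEND):
    """Attach standalone punctuation entries to neighbouring words.

    `words` is a list of dicts with "word", "start", and "end" keys.
    In this sketch the surviving word keeps its own timing.
    """
    merged = []
    pending = []  # leading punctuation entries waiting for the next word
    for item in words:
        text = item["word"]
        mark = text.strip()
        if mark and text.startswith(" ") and all(c in prepend for c in mark):
            # Leading punctuation (preceded by a space): hold it for the next word.
            pending.append(item)
        elif mark and merged and not pending and all(c in append for c in text):
            # Trailing punctuation (no leading space): glue onto the previous word.
            merged[-1]["word"] += text
        else:
            prefix = "".join(p["word"] for p in pending)
            merged.append({**item, "word": prefix + text})
            pending = []
    merged.extend(pending)  # unmatched leading punctuation stays as-is
    return merged


words = [
    {"word": " (", "start": 0.00, "end": 0.05},
    {"word": "Hello", "start": 0.05, "end": 0.40},
    {"word": ",", "start": 0.40, "end": 0.45},
    {"word": " world", "start": 0.50, "end": 0.90},
    {"word": ")", "start": 0.90, "end": 0.95},
    {"word": ".", "start": 0.95, "end": 1.00},
]
merged = merge_punctuation(words)
for w in merged:
    print(repr(w["word"]), w["start"], w["end"])
```

Quote characters appear in both sets, so the sketch uses the leading space to decide direction: a mark that follows a space is treated as opening punctuation, and a mark glued to the previous text is treated as closing punctuation.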
Enabling Word-Level Timestamps via Python API
In whisper/transcribe.py (lines 50-57), the transcribe function accepts a word_timestamps boolean parameter that defaults to False. When set to True, the function invokes add_word_timestamps after decoding to populate the "words" field in each segment.
import whisper

model = whisper.load_model("base")
result = whisper.transcribe(
    model,
    "audio.wav",
    word_timestamps=True,  # Enable the feature
    prepend_punctuations="\"'‘“([{-",  # Custom leading punctuation
    append_punctuations="\"'.。,,!!??::”)]}、",  # Custom trailing punctuation
    hallucination_silence_threshold=0.5,  # Skip silences > 0.5s
)

# Each segment contains a "words" list with per-word timings
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
The resulting word objects contain word, start, end, and probability keys, with timestamps in seconds.
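For downstream processing it is often convenient to flatten the nested structure into a single stream of words. The helper below is a small sketch: iter_words is a hypothetical name, and sample_result is hand-written data shaped like the output described above, not real Whisper output.

```python
def iter_words(result):
    """Yield (text, start, end, probability) for every word in a result dict."""
    for segment in result.get("segments", []):
        # The "words" key is only expected when word timestamps were enabled.
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"], word["probability"]


# Hand-written stand-in for the dict returned by transcribe().
sample_result = {
    "text": " Hello world.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.48, "probability": 0.97},
                {"word": " world.", "start": 0.48, "end": 1.2, "probability": 0.91},
            ],
        }
    ],
}

for text, start, end, probability in iter_words(sample_result):
    print(f"{start:6.2f} - {end:6.2f}  {text}  (p={probability:.2f})")
```

Using .get with an empty default means the helper also tolerates results produced without word timestamps: it yields nothing instead of raising a KeyError.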
Command-Line Interface Usage
The CLI parser in whisper/transcribe.py (lines 56-58) exposes the --word_timestamps flag. You can combine this with punctuation customization options and the hallucination threshold:
whisper audio.wav --model medium \
  --word_timestamps True \
  --prepend_punctuations "\"'‘“" \
  --append_punctuations '!?,.;' \
  --hallucination_silence_threshold 0.5
To generate karaoke-style subtitles with highlighted active words, add the --highlight_words True flag (validated in whisper/transcribe.py lines 104-108). This option requires --word_timestamps True; the CLI exits with an error if word timestamps are disabled.
Customizing Timestamp Behavior
Configuring Punctuation Attachment
The default punctuation sets in whisper/transcribe.py (lines 51-53) define which characters merge with words:
- Prepend: "\"'“¿([{-" (opening quotes, brackets)
- Append: "\"'.。,,!!??::”)]}、" (closing punctuation, CJK delimiters)
Override these via the prepend_punctuations and append_punctuations arguments to handle language-specific conventions or exclude certain marks from merging.
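One way to build custom sets is to start from the defaults and add or remove individual characters. The snippet below is a sketch: the constant names and the without helper are hypothetical, and the default strings are copied from the values listed in this article.

```python
# Defaults as listed in this article (assumed to match the installed version).
DEFAULT_PREPEND = "\"'“¿([{-"
DEFAULT_APPEND = "\"'.。,,!!??::”)]}、"


def without(chars, excluded):
    """Return `chars` with every character in `excluded` removed."""
    return "".join(c for c in chars if c not in excluded)


# Keep hyphens as separate entries instead of merging them into the next word.
custom_prepend = without(DEFAULT_PREPEND, "-")
# Also attach semicolons to the preceding word.
custom_append = DEFAULT_APPEND + ";"

print(custom_prepend)
print(custom_append)

# These strings would then be passed as
# prepend_punctuations=custom_prepend and append_punctuations=custom_append.
```

Deriving the custom sets from the defaults keeps the CJK delimiters and other marks intact, so only the characters you deliberately changed behave differently.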
Hallucination Silence Threshold
The hallucination_silence_threshold parameter is handled in the transcription loop of whisper/transcribe.py and only takes effect when word_timestamps is enabled. When a possible hallucination is detected, Whisper skips silent periods longer than the threshold (in seconds). Setting it to 0.5 therefore skips silences longer than half a second around suspect segments, which helps remove phantom words that often appear during musical interludes or background noise.
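When choosing a threshold, it can help to see where long silences fall in an existing transcript. The helper below is an illustration of the idea only: it does not reproduce Whisper's internal hallucination heuristics, and find_long_gaps and the sample words are hypothetical.

```python
def find_long_gaps(words, threshold=0.5):
    """Return (gap_start, gap_end) pairs where consecutive words are more than
    `threshold` seconds apart."""
    gaps = []
    for previous, current in zip(words, words[1:]):
        gap = current["start"] - previous["end"]
        if gap > threshold:
            gaps.append((previous["end"], current["start"]))
    return gaps


# Hand-written word timings for illustration.
sample_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " there", "start": 0.5, "end": 0.9},
    {"word": " again", "start": 2.0, "end": 2.3},
]

gaps = find_long_gaps(sample_words, threshold=0.5)
for gap_start, gap_end in gaps:
    print(f"silence from {gap_start:.2f}s to {gap_end:.2f}s")
```

If most of the reported gaps are shorter than the value you planned to use, the threshold is unlikely to change the output much.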
Summary
- Enable the feature by setting word_timestamps=True in Python or --word_timestamps True via CLI.
- Core implementation lives in whisper/timing.py (add_word_timestamps, find_alignment, merge_punctuations).
- Customize punctuation using prepend_punctuations and append_punctuations to control how quotes and brackets attach to words.
- Filter hallucinations by setting hallucination_silence_threshold to skip long silent periods when a possible hallucination is detected.
- Output format appears as a "words" list within each segment dictionary, containing word, start, end, and probability fields.
Frequently Asked Questions
What is the performance impact of enabling word-level timestamps?
Enabling word-level timestamps adds computational overhead because Whisper must rerun the model over the decoded tokens to capture cross-attention weights and then run the DTW algorithm on the result. As a rough rule of thumb, expect transcription to be about 10-20% slower; the actual cost depends on audio length, model size, and hardware.
Can I use word-level timestamps with all Whisper model sizes?
Yes, word-level timestamps work with all Whisper model sizes (tiny, base, small, medium, large). The alignment quality generally improves with larger models because cross-attention weights more accurately correlate tokens with audio features, though the DTW algorithm in find_alignment functions identically across all variants.
How does the punctuation merging algorithm work?
The merge_punctuations function walks the list of aligned words and checks whether an entry consists of characters from the prepend_punctuations or append_punctuations sets. If an entry matches a prepend character, its text and tokens are merged into the next word; if it matches an append character, they are merged into the previous word. The merged punctuation then shares the timing of the word it is attached to instead of appearing as a separate entry with its own timestamp.
Why are my word-level timestamps inaccurate for certain audio segments?
Inaccuracies typically occur during sections with heavy background noise, music, or overlapping speech where cross-attention weights become diffuse. The median filter and DTW algorithm in find_alignment may misalign tokens when the audio-to-text attention is ambiguous. Enabling hallucination_silence_threshold can help remove phantom words during silent gaps, but highly distorted audio may require manual alignment correction.
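A quick way to find the passages that need review is to flag words with low probability or implausibly long durations. This is a heuristic sketch rather than anything Whisper provides: flag_suspect_words is a hypothetical helper, and the 0.5 probability and 2.0-second cutoffs are arbitrary starting points to tune for your audio.

```python
def flag_suspect_words(words, min_probability=0.5, max_duration=2.0):
    """Return words whose probability is low or whose duration is unusually long."""
    suspects = []
    for word in words:
        duration = word["end"] - word["start"]
        if word["probability"] < min_probability or duration > max_duration:
            suspects.append(word)
    return suspects


# Hand-written word entries for illustration.
sample_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
    {"word": " um", "start": 0.4, "end": 0.5, "probability": 0.21},
    {"word": " thanks", "start": 3.0, "end": 6.5, "probability": 0.80},
]

suspects = flag_suspect_words(sample_words)
for word in suspects:
    print(f"check {word['word']!r} at {word['start']:.2f}s")
```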