# How to Enable and Customize Word-Level Timestamps in Whisper

> Get granular control over Whisper's output by learning how to enable and customize word-level timestamps using the Python API or CLI. Fine-tune punctuation and silence settings for precise transcriptions.

- Repository: [OpenAI/whisper](https://github.com/openai/whisper)
- Tags: how-to-guide
- Published: 2026-02-27

---

**Enable word-level timestamps by setting `word_timestamps=True` in the Python API or `--word_timestamps true` in the CLI, then customize behavior via `prepend_punctuations`, `append_punctuations`, and `hallucination_silence_threshold` parameters.**

Whisper supports experimental word-level timestamps that pinpoint exactly when each word occurs in audio, going beyond standard segment-level timing. This feature aligns decoded tokens to the mel-spectrogram using the model's cross-attention weights and dynamic time warping (DTW). According to the OpenAI Whisper source code, the entry point resides in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) while the core alignment logic lives in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py).

## How Word-Level Timestamps Work

When activated, Whisper performs a post-processing alignment step that maps each token to its corresponding audio frame.

### Token-to-Audio Alignment with DTW

The `find_alignment` function in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) (lines 63-71) installs forward hooks on the decoder's cross-attention layers to capture their attention weights during an extra forward pass over the decoded tokens, then keeps the heads the model designates as alignment heads. It normalizes these weights, applies a **median filter** to reduce noise, and runs **dynamic time warping (DTW)** to align the token sequence against the audio frames. The DTW path gives each token a start and end time, and the tokenizer's `split_to_word_tokens` method then groups the tokens into words.
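To make the alignment step concrete, here is a minimal pure-Python sketch of DTW over a small tokens-by-frames cost matrix. It is not Whisper's implementation: the names `dtw_path` and `token_start_frames`, the toy cost values, and the 50-frames-per-second conversion are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

FRAMES_PER_SECOND = 50  # assumed frame rate for this toy example


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the cheapest monotonic path through a tokens x frames cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the cheapest cumulative cost of reaching cell (i-1, j-1)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from the bottom-right corner to recover the path
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal: next token, next frame
            (acc[i - 1][j], i - 1, j),          # vertical: previous token, same frame
            (acc[i][j - 1], i, j - 1),          # horizontal: same token, previous frame
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


def token_start_frames(path: List[Tuple[int, int]], n_tokens: int) -> List[Optional[int]]:
    """Record the first frame at which each token appears on the path."""
    starts: List[Optional[int]] = [None] * n_tokens
    for token, frame in path:
        if starts[token] is None:
            starts[token] = frame
    return starts


# Toy cost matrix (3 tokens x 6 frames); a low cost stands for strong attention.
cost = [
    [1, 2, 9, 10, 10, 10],
    [9, 9, 1, 2, 9, 10],
    [10, 9, 10, 8, 1, 1],
]
path = dtw_path(cost)
starts = token_start_frames(path, n_tokens=3)
start_times = [frame / FRAMES_PER_SECOND for frame in starts]
print(starts)       # [0, 2, 4]
print(start_times)  # [0.0, 0.04, 0.08]
```

Each token claims the run of frames where its cost is lowest, and the first frame of that run becomes its start time.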

### Punctuation Merging Logic

After alignment, the `merge_punctuations` function (lines 45-52 in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py)) attaches punctuation marks to adjacent words. Leading punctuation (opening quotes, brackets) attaches to the following word, while trailing punctuation (periods, commas, closing brackets) attaches to the preceding word. This prevents isolated punctuation marks from receiving separate timestamps.
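The merging rule can be sketched on plain strings. This is a simplified illustration, not the code in `whisper/timing.py`: the function name `merge_punctuation` is hypothetical, it handles text only (no timings or token IDs), and it assumes words carry a leading space the way Whisper's decoded words typically do.

```python
from typing import List


def merge_punctuation(
    words: List[str],
    prepended: str = "\"'“¿([{-",
    appended: str = "\"'.。,，!！?？:：”)]}、",
) -> List[str]:
    """Attach standalone punctuation to the neighbouring word (text only)."""
    merged: List[str] = []
    prefix = ""  # leading punctuation waiting for the next word
    for word in words:
        stripped = word.strip()
        if word.startswith(" ") and stripped and stripped in prepended:
            # Opening mark: hold it and glue it onto the following word
            prefix += word
        elif merged and not prefix and word and word in appended and not merged[-1].endswith(" "):
            # Closing mark: glue it onto the preceding word
            merged[-1] += word
        else:
            merged.append(prefix + word)
            prefix = ""
    if prefix:
        merged.append(prefix)  # dangling opening mark with no word after it
    return merged


print(merge_punctuation([" (", "Hello", ",", " world", ")", "!"]))
# [' (Hello,', ' world)!']
```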

## Enabling Word-Level Timestamps via Python API

In [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) (lines 50-57), the `transcribe` function accepts a `word_timestamps` boolean parameter that defaults to `False`. When set to `True`, the function invokes `add_word_timestamps` after decoding to populate the `"words"` field in each segment.

```python
import whisper

model = whisper.load_model("base")
result = whisper.transcribe(
    model,
    "audio.wav",
    word_timestamps=True,                         # Enable the feature
    prepend_punctuations="\"'‘“([{-",             # Custom leading punctuation
    append_punctuations="\"'.。,，!！?？:：”)]}、",  # Trailing punctuation (Whisper's default set)
    hallucination_silence_threshold=0.5,          # Skip silent periods > 0.5 s near suspected hallucinations
)

# Each segment contains a "words" list with per-word timings
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
```

The resulting word objects contain `word`, `start`, `end`, and `probability` keys, with timestamps in seconds.
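Because words are nested inside segments, it is often handy to flatten them. The sketch below assumes only the result shape described above; the helper name `iter_words` and the sample data are made up for illustration and do not come from a real transcription.

```python
from typing import Dict, Iterator


def iter_words(result: Dict) -> Iterator[Dict]:
    """Yield every word dict across all segments, in order."""
    for segment in result.get("segments", []):
        # Segments produced without word_timestamps have no "words" key
        yield from segment.get("words", [])


# Hand-written sample mimicking the structure of a transcription result
sample_result = {
    "segments": [
        {"words": [
            {"word": " Hello", "start": 0.0, "end": 0.42, "probability": 0.98},
            {"word": " world.", "start": 0.42, "end": 0.9, "probability": 0.95},
        ]},
        {"words": [
            {"word": " Bye.", "start": 1.5, "end": 1.9, "probability": 0.91},
        ]},
    ]
}

all_words = list(iter_words(sample_result))
transcript = "".join(w["word"] for w in all_words).strip()
print(len(all_words), transcript)  # 3 Hello world. Bye.
```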

## Command-Line Interface Usage

The CLI parser in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) (lines 56-58) exposes the `--word_timestamps` flag. You can combine this with punctuation customization options and the hallucination threshold:

```bash
whisper audio.wav --model medium \
    --word_timestamps true \
    --prepend_punctuations "\"'‘“" \
    --append_punctuations "!?,.;" \
    --hallucination_silence_threshold 0.5
```

To generate karaoke-style subtitles with highlighted active words, add the `--highlight_words true` flag (validated in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) lines 104-108). This option requires word timestamps to function.
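Whisper's subtitle writers produce the highlighted output for you, so no extra code is needed. Purely to show the idea, the sketch below builds one cue per word from a word list and marks the active word. The function name `karaoke_cues`, the `<u>` markup, and the sample words are assumptions for illustration, not Whisper's writer code.

```python
from typing import Dict, List, Tuple


def karaoke_cues(words: List[Dict]) -> List[Tuple[float, float, str]]:
    """Build (start, end, text) cues, underlining the word spoken in each cue."""
    cues = []
    for i, active in enumerate(words):
        parts = []
        for j, w in enumerate(words):
            text = w["word"]
            if j == i:
                # Keep the leading space outside the markup
                body = text.lstrip()
                leading = text[: len(text) - len(body)]
                text = f"{leading}<u>{body}</u>"
            parts.append(text)
        cues.append((active["start"], active["end"], "".join(parts).strip()))
    return cues


sample_words = [
    {"word": " Hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.5, "end": 0.9},
]
cues = karaoke_cues(sample_words)
for start, end, text in cues:
    print(f"{start:.2f} --> {end:.2f}  {text}")
```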

## Customizing Timestamp Behavior

### Configuring Punctuation Attachment

The default punctuation sets in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) (lines 51-53) define which characters merge with words:

- **Prepend**: `"\"'“¿([{-"` (opening quotes, brackets)
- **Append**: `"\"'.。,，!！?？:：”)]}、"` (closing punctuation, CJK delimiters)

Override these via the `prepend_punctuations` and `append_punctuations` arguments to handle language-specific conventions or exclude certain marks from merging.
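One way to build a custom set is to start from the defaults listed above and drop the marks you want kept separate. The helper name `without` is hypothetical, and the commented-out call assumes a model loaded as in the Python API example.

```python
DEFAULT_PREPEND = "\"'“¿([{-"
DEFAULT_APPEND = "\"'.。,，!！?？:：”)]}、"


def without(chars: str, remove: str) -> str:
    """Return chars with every character found in remove dropped."""
    return "".join(c for c in chars if c not in remove)


# Example: stop colons (ASCII and full-width) from merging into the previous word
custom_append = without(DEFAULT_APPEND, ":：")
print(custom_append)

# Hypothetical usage, assuming `model` was loaded with whisper.load_model(...):
# result = model.transcribe(
#     "audio.wav",
#     word_timestamps=True,
#     append_punctuations=custom_append,
# )
```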

### Hallucination Silence Threshold

The `hallucination_silence_threshold` parameter is handled in the transcription loop in [`whisper/transcribe.py`](https://github.com/openai/whisper/blob/main/whisper/transcribe.py) and only takes effect when `word_timestamps` is enabled. When Whisper detects a possible hallucination, it skips silent periods longer than the threshold (in seconds). Setting it to `0.5` skips such silent stretches longer than half a second, which helps remove phantom words that often appear during musical interludes or background noise.
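To see which stretches of your own output contain long pauses, you can inspect the gaps between consecutive words after transcription. This is a post-processing sketch, not Whisper's internal heuristic; the function name `find_long_gaps` and the sample timings are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def find_long_gaps(words: List[Dict], threshold: float) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) pairs where the pause between words exceeds threshold."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] > threshold:
            gaps.append((prev["end"], nxt["start"]))
    return gaps


# Hand-written sample timings, not real transcription output
sample_words = [
    {"word": " Thanks", "start": 0.0, "end": 0.5},
    {"word": " for", "start": 0.5, "end": 0.75},
    {"word": " watching", "start": 3.0, "end": 3.5},
]
long_gaps = find_long_gaps(sample_words, threshold=0.5)
print(long_gaps)  # [(0.75, 3.0)]
```

Words that sit right next to a long gap are reasonable candidates for a manual check.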

## Summary

- **Enable the feature** by setting `word_timestamps=True` in Python or `--word_timestamps true` via CLI.
- **Core implementation** lives in [`whisper/timing.py`](https://github.com/openai/whisper/blob/main/whisper/timing.py) (`add_word_timestamps`, `find_alignment`, `merge_punctuations`).
- **Customize punctuation** using `prepend_punctuations` and `append_punctuations` to control how quotes and brackets attach to words.
- **Filter hallucinations** by setting `hallucination_silence_threshold` to skip long silent periods when a possible hallucination is detected.
- **Output format** appears as a `"words"` list within each segment dictionary, containing `word`, `start`, `end`, and `probability` fields.

## Frequently Asked Questions

### What is the performance impact of enabling word-level timestamps?

Enabling word-level timestamps adds computational overhead because Whisper runs an additional forward pass to capture cross-attention weights and then runs the DTW algorithm on the decoded segments. The exact slowdown depends on audio length, model size, and hardware, so benchmark with and without the option on representative audio.
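A small timing harness is enough to measure the overhead on your own audio. The helper names `timed` and `relative_overhead` are hypothetical, and the commented-out calls assume a model loaded with `whisper.load_model`.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    value = fn(*args, **kwargs)
    return value, time.perf_counter() - start


def relative_overhead(baseline_s: float, candidate_s: float) -> float:
    """Fractional slowdown of candidate relative to baseline (0.2 means 20% slower)."""
    return (candidate_s - baseline_s) / baseline_s


# Hypothetical usage, assuming `model` was loaded with whisper.load_model(...):
# _, plain_s = timed(model.transcribe, "audio.wav")
# _, words_s = timed(model.transcribe, "audio.wav", word_timestamps=True)
# print(f"overhead: {relative_overhead(plain_s, words_s):.0%}")
```

Run each variant more than once and discard the first run, since model warm-up can dominate a single measurement.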

### Can I use word-level timestamps with all Whisper model sizes?

Yes, word-level timestamps work with all Whisper model sizes (tiny, base, small, medium, large). The alignment quality generally improves with larger models because cross-attention weights more accurately correlate tokens with audio features, though the DTW algorithm in `find_alignment` functions identically across all variants.

### How does the punctuation merging algorithm work?

The `merge_punctuations` function walks the aligned words and checks whether each one is a standalone mark listed in `prepend_punctuations` or `append_punctuations`. A prepend mark is merged into the following word, and an append mark is merged into the preceding word. The merged entry keeps the host word's timing, so punctuation never floats between words with a timestamp of its own.

### Why are my word-level timestamps inaccurate for certain audio segments?

Inaccuracies typically occur during sections with heavy background noise, music, or overlapping speech where cross-attention weights become diffuse. The median filter and DTW algorithm in `find_alignment` may misalign tokens when the audio-to-text attention is ambiguous. Enabling `hallucination_silence_threshold` can help remove phantom words during silent gaps, but highly distorted audio may require manual alignment correction.
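The per-word `probability` field offers a quick way to find words worth reviewing. The sketch below is one possible triage step; the function name `low_confidence_words`, the `0.5` cutoff, and the sample data are assumptions for illustration, so tune the cutoff against your own audio.

```python
from typing import Dict, List


def low_confidence_words(words: List[Dict], min_probability: float = 0.5) -> List[Dict]:
    """Return the words whose probability falls below min_probability."""
    return [w for w in words if w.get("probability", 0.0) < min_probability]


# Hand-written sample, not real transcription output
sample_words = [
    {"word": " The", "start": 0.0, "end": 0.2, "probability": 0.97},
    {"word": " band", "start": 0.2, "end": 0.6, "probability": 0.31},
    {"word": " played", "start": 0.6, "end": 1.1, "probability": 0.88},
]
flagged = low_confidence_words(sample_words)
for w in flagged:
    print(f"review {w['word'].strip()!r} at {w['start']:.2f}s (p={w['probability']:.2f})")
```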