How to Enable and Configure Timestamp Generation in Whisper

Whisper generates segment-level timestamps for every transcription. For word-level precision, pass word_timestamps=True to the transcribe() function or --word_timestamps True on the command line.

The openai/whisper repository provides a comprehensive timestamp generation pipeline that supports both default segment boundaries and fine-grained word-level alignment. By configuring the available parameters in whisper/transcribe.py and the utility writers in whisper/utils.py, you can produce publication-ready subtitles and structured metadata for video editing workflows.

Understanding Whisper's Timestamp Architecture

The timestamp generation system consists of several coordinated components across the codebase:

| Component | Role | Source Location |
| --- | --- | --- |
| transcribe() | Core entry point that accepts the word_timestamps flag and orchestrates the pipeline. | [whisper/transcribe.py](https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L38-L52) |
| add_word_timestamps() | Computes per-word timing using cross-attention patterns and dynamic time warping when word-level timestamps are enabled. | [whisper/timing.py](https://github.com/openai/whisper/blob/main/whisper/timing.py#L279) |
| format_timestamp() | Converts floating-point seconds into formatted strings like hh:mm:ss.mmm for subtitle outputs. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L50-L68) |
| Subtitle Writers | Concrete implementations (WriteSRT, WriteVTT) that render timestamps with configurable formatting options. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L30-L66) |
| get_writer() | Factory function that returns the appropriate writer instance for formats including srt, vtt, tsv, json, or all. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L96-L104) |
| Tokenizer | Defines special timestamp tokens (e.g., `<\|0.00\|>`). | whisper/tokenizer.py |
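To give a feel for the dynamic time warping step attributed to add_word_timestamps(), here is a minimal, self-contained sketch. It is not Whisper's implementation: the helper name dtw_path and the toy cost matrix are assumptions made for illustration. Rows stand for text tokens, columns for audio frames, and a low cost means the token and frame align well.

```python
from typing import List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost monotonic path through a cost matrix.

    Hypothetical helper for illustration; rows are tokens, columns are frames.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] is the cheapest cumulative cost of reaching cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from the bottom-right corner to recover the alignment.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # step back one token and one frame
            (acc[i - 1][j], i - 1, j),          # step back one token only
            (acc[i][j - 1], i, j - 1),          # step back one frame only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


# Toy example: 2 tokens, 4 frames (values are made up).
toy_cost = [
    [0.1, 0.2, 0.9, 0.9],
    [0.9, 0.8, 0.1, 0.2],
]
alignment = dtw_path(toy_cost)
print(alignment)
```

In this toy case the first token is matched to frames 0 and 1 and the second to frames 2 and 3. Turning such frame spans into seconds is what yields per-word start and end times.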

Enabling Word-Level Timestamps

By default, Whisper returns segment-level timestamps: each segment (typically a phrase or sentence) carries start and end fields in seconds. To obtain word-level timestamps, you must explicitly enable the feature.

Python API Method

Pass word_timestamps=True to the transcribe() function. You can also control punctuation attachment using prepend_punctuations and append_punctuations:

from whisper import load_model, transcribe
from whisper.utils import get_writer  # get_writer lives in whisper.utils

model = load_model("base")
audio_path = "example.wav"

result = transcribe(
    model,
    audio_path,
    word_timestamps=True,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,，!！?？:：”)]}、",
)

When enabled, the result dictionary contains a words list within each segment. Each entry provides word, start, end, and probability fields computed by add_word_timestamps() in whisper/timing.py.
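As a small sketch of how the word list can be consumed, the snippet below walks a result-shaped dictionary and collects each word with its timing. The sample_result data is hand-written for illustration (it is not real model output), and iter_words is a hypothetical helper, not part of the whisper package. Each entry is assumed to carry word, start, end, and probability keys.

```python
from typing import Dict, Iterator, Tuple

# Hand-written stand-in for a transcribe() result; the values are made up.
sample_result: Dict = {
    "text": " Hello world.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5, "probability": 0.98},
                {"word": " world.", "start": 0.6, "end": 1.2, "probability": 0.95},
            ],
        }
    ],
}


def iter_words(result: Dict) -> Iterator[Tuple[str, float, float]]:
    """Yield (word, start, end) for every word in every segment."""
    for segment in result.get("segments", []):
        # "words" is assumed to be absent unless word timestamps were requested.
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"]


words = list(iter_words(sample_result))
for text, start, end in words:
    print(f"{start:6.2f} -> {end:6.2f}  {text}")
```

Using .get() with an empty default keeps the helper safe on results produced without word timestamps, where it simply yields nothing.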

Command-Line Interface

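Pass --word_timestamps True (the option expects an explicit boolean value). The cli() function in whisper/transcribe.py, which whisper/__main__.py invokes, rejects word-related formatting options such as --highlight_words or --max_line_width unless this flag is set: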

whisper audio.mp3 \
  --model base \
  --word_timestamps True \
  --output_format srt

Configuring Timestamp Output Formats

The subtitle writers expose several formatting knobs that control how timestamps appear in generated files:

  • always_include_hours – Forces the hh: prefix even when hours are zero (default for SRT).
  • decimal_marker – Specifies the separator between seconds and milliseconds (, for SRT, . for VTT).
  • highlight_words – Underlines each word as it is spoken in SRT or VTT outputs (requires word-level timestamps).
  • max_line_width, max_line_count, max_words_per_line – Controls line-breaking logic in SubtitlesWriter.iterate_result() (lines 32-40 and 200-221 in whisper/utils.py).

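always_include_hours and decimal_marker are attributes of the writer, so set them before calling it. The word-level options (highlight_words, max_line_width, max_line_count, max_words_per_line) are passed as keyword arguments when the writer is called, at least in recent releases of the package:

from whisper.utils import get_writer

srt_writer = get_writer("srt", output_dir="out")  # "out" must already exist
srt_writer.always_include_hours = True  # already the SRT default
srt_writer.decimal_marker = ","  # already the SRT default
# Word-level options would go here, e.g. highlight_words=True
srt_writer(result, audio_path)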
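To make the effect of always_include_hours and decimal_marker concrete, here is a standalone sketch of timestamp formatting. It mirrors the behavior described for format_timestamp() but is not copied from whisper/utils.py; the function name format_ts is an assumption for illustration.

```python
def format_ts(seconds: float, always_include_hours: bool = False,
              decimal_marker: str = ".") -> str:
    """Format seconds as [hh:]mm:ss<marker>mmm (illustrative helper)."""
    if seconds < 0:
        raise ValueError("timestamp must be non-negative")
    milliseconds = round(seconds * 1000.0)

    # Peel off hours, minutes, and seconds, leaving the millisecond remainder.
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)

    # The hour field is shown when forced or when it is non-zero.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


srt_style = format_ts(75.5, always_include_hours=True, decimal_marker=",")
vtt_style = format_ts(75.5)
print(srt_style)  # SRT-style: hours always shown, comma marker
print(vtt_style)  # VTT-style: hours omitted when zero, dot marker
```

For 75.5 seconds this gives 00:01:15,500 with the SRT-style settings and 01:15.500 with the VTT-style settings.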

Processing Specific Time Ranges

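To generate timestamps for only a portion of the audio, use the clip_timestamps parameter. In the Python API, pass either a comma-separated string of seconds (start,end, or start,end,start,end,... for several clips) or a list of floats; if the final end is omitted, the clip runs to the end of the file. In the CLI, use --clip_timestamps: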

whisper audio.wav \
  --model small \
  --output_format vtt \
  --clip_timestamps "30,45"

The transcribe() function parses this string (lines 68-78 in whisper/transcribe.py) and converts the values to frame indices, producing timestamps relative to the original audio timeline.
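The parsing step described above can be sketched as follows. This is an approximation written for illustration, not the code in whisper/transcribe.py: the helper name clip_seek_ranges is hypothetical, and FRAMES_PER_SECOND = 100 assumes 10 ms per mel frame.

```python
from typing import List, Tuple, Union

FRAMES_PER_SECOND = 100  # assumption: one mel frame per 10 ms of audio


def clip_seek_ranges(
    clip_timestamps: Union[str, List[float]], total_frames: int
) -> List[Tuple[int, int]]:
    """Turn "start,end,start,end,..." seconds into (start, end) frame pairs."""
    if isinstance(clip_timestamps, str):
        parts = clip_timestamps.split(",") if clip_timestamps else []
        clip_timestamps = [float(p) for p in parts]

    points = [round(ts * FRAMES_PER_SECOND) for ts in clip_timestamps]
    if not points:
        points = [0]  # no clips given: start from the beginning
    if len(points) % 2 == 1:
        # A missing final end defaults to the end of the audio.
        points.append(total_frames)
    return list(zip(points[::2], points[1::2]))


ranges = clip_seek_ranges("30,45", total_frames=6000)
open_ended = clip_seek_ranges("120", total_frames=30000)
print(ranges, open_ended)
```

Because the frame indices are measured from the start of the file, timestamps derived from them stay on the original audio timeline rather than restarting at zero for each clip.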

Practical Implementation Examples

Example 1: Python API with Full Word-Level Control

import os

from whisper import load_model, transcribe
from whisper.utils import get_writer  # get_writer lives in whisper.utils

model = load_model("base")
audio_path = "interview.wav"

# Enable word timestamps with custom punctuation handling
result = transcribe(
    model,
    audio_path,
    word_timestamps=True,
    prepend_punctuations="\"'“¿([{-",
    append_punctuations="\"'.。,，!！?？:：”)]}、",
)

# Configure SRT writer with comma decimal marker and hour field
os.makedirs("./subtitles", exist_ok=True)  # the writer does not create the directory
srt_writer = get_writer("srt", output_dir="./subtitles")
srt_writer.always_include_hours = True
srt_writer.decimal_marker = ","
srt_writer(result, audio_path)

Example 2: CLI with Highlighting and Line Constraints

whisper podcast.mp3 \
  --model medium \
  --output_dir ./output \
  --output_format srt \
  --word_timestamps True \
  --highlight_words True \
  --max_line_width 42 \
  --max_line_count 2

This command activates word-level timestamps, underlines each word in the SRT file as it is spoken, and restricts subtitles to two lines of at most 42 characters each.
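To illustrate what the width and line-count limits do, here is a deliberately simplified greedy wrapper. It is a sketch under stated assumptions, not the logic in SubtitlesWriter.iterate_result(), which also works with word timings; the helper name wrap_words is hypothetical.

```python
from typing import List


def wrap_words(words: List[str], max_line_width: int, max_line_count: int) -> List[str]:
    """Greedily pack words into cues of at most max_line_count lines."""
    cues: List[List[str]] = [[""]]
    for word in words:
        line = cues[-1][-1]
        candidate = f"{line} {word}".strip()
        if len(candidate) <= max_line_width or not line:
            cues[-1][-1] = candidate   # word fits (or the line is still empty)
        elif len(cues[-1]) < max_line_count:
            cues[-1].append(word)      # start a new line in the same cue
        else:
            cues.append([word])        # cue is full, start a new cue
    return ["\n".join(lines) for lines in cues if lines != [""]]


cues = wrap_words("the quick brown fox jumps over the lazy dog".split(), 10, 2)
for cue in cues:
    print(cue)
    print("--")
```

With a width of 10 and two lines per cue, the nine words split into three cues. A single word longer than the width is kept on its own line rather than being broken.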

Example 3: Extracting a Specific Clip with VTT Output

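import os

import whisper
from whisper.utils import get_writer  # get_writer lives in whisper.utils

model = whisper.load_model("base")
result = model.transcribe(
    "lecture.mp3",
    clip_timestamps="120,300",  # transcribe seconds 120 to 300 only
    word_timestamps=True,
)

os.makedirs("./clips", exist_ok=True)  # the writer does not create the directory
writer = get_writer("vtt", "./clips")
writer(result, "lecture.mp3")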

Summary

  • Whisper inherently produces segment-level timestamps (start and end) for every transcription without requiring special configuration.
  • Word-level timestamps require explicit activation via word_timestamps=True (API) or --word_timestamps True (CLI), triggering the add_word_timestamps() algorithm in whisper/timing.py.
  • Formatting control is handled by writer classes in whisper/utils.py, supporting customization of decimal markers, hour fields, and line-breaking rules.
  • Clip extraction uses clip_timestamps to process specific audio intervals while maintaining timestamp alignment with the original source.
  • The entire pipeline flows from token generation in whisper/tokenizer.py through transcription logic in whisper/transcribe.py to final output formatting via format_timestamp() and writer classes.

Frequently Asked Questions

Does Whisper generate timestamps by default?

Yes. According to the source code in whisper/transcribe.py, every transcription result automatically includes start and end timestamps for each segment. These segment-level timestamps require no special flags and are always available in the output dictionary.

How do I get word-level timestamps in Whisper?

Pass word_timestamps=True to the transcribe() function in Python, or pass --word_timestamps True on the command line. This invokes add_word_timestamps() in whisper/timing.py, which analyzes cross-attention weights to assign start and end times to individual words within each segment.

Can I customize the timestamp format in SRT files?

Yes. The WriteSRT class in whisper/utils.py exposes always_include_hours and decimal_marker attributes. For SRT they already default to always_include_hours=True, so hours always appear in the timestamp, and decimal_marker=",", the standard SRT separator between seconds and milliseconds. Override them on the writer instance only if you need a different layout.

What is the purpose of the clip_timestamps parameter?

The clip_timestamps parameter allows you to transcribe only a specific time range of an audio file (e.g., "30,45" for seconds 30 to 45). As implemented in whisper/transcribe.py (lines 68-78), this feature converts the provided seconds into frame indices, processes only that portion of the audio, and returns timestamps relative to the original file's timeline.
