How to Enable and Configure Timestamp Generation in Whisper
Whisper automatically generates segment-level timestamps for every transcription, and you can enable word-level precision by passing `word_timestamps=True` to the `transcribe()` function or `--word_timestamps True` on the command line.
The openai/whisper repository provides a comprehensive timestamp generation pipeline that supports both default segment boundaries and fine-grained word-level alignment. By configuring the available parameters in whisper/transcribe.py and the utility writers in whisper/utils.py, you can produce publication-ready subtitles and structured metadata for video editing workflows.
Understanding Whisper's Timestamp Architecture
The timestamp generation system consists of several coordinated components across the codebase:
| Component | Role | Source Location |
|---|---|---|
| `transcribe()` | Core entry point that accepts the `word_timestamps` flag and orchestrates the pipeline. | [whisper/transcribe.py](https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L38-L52) |
| `add_word_timestamps()` | Computes per-word timing using cross-attention patterns and dynamic time warping when word-level timestamps are enabled. | [whisper/timing.py](https://github.com/openai/whisper/blob/main/whisper/timing.py#L279) |
| `format_timestamp()` | Converts floating-point seconds into formatted strings like `hh:mm:ss.mmm` for subtitle outputs. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L50-L68) |
| Subtitle Writers | Concrete implementations (`WriteSRT`, `WriteVTT`) that render timestamps with configurable formatting options. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L30-L66) |
| `get_writer()` | Factory function that returns the appropriate writer instance for formats including `srt`, `vtt`, `tsv`, `json`, or `all`. | [whisper/utils.py](https://github.com/openai/whisper/blob/main/whisper/utils.py#L96-L104) |
| Tokenizer | Defines special timestamp tokens (e.g., `<\|0.00\|>`). | `whisper/tokenizer.py` |

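
To make the dynamic time warping step more concrete, here is a minimal pure-Python sketch. It is not the code in `whisper/timing.py`: the function names `dtw_path` and `token_start_frames` and the toy cost matrix are hypothetical, and the cost is assumed to be the negated attention weight between each text token (row) and each audio frame (column).

```python
from math import inf


def dtw_path(cost):
    """Return the cheapest monotonic alignment path through a cost matrix.

    cost[i][j] is the (assumed) cost of aligning token i with audio frame j.
    """
    n, m = len(cost), len(cost[0])
    acc = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated cost
    step = [[0] * (m + 1) for _ in range(n + 1)]    # which move led here
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0: diagonal, 1: advance token only, 2: advance frame only
            candidates = (acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            k = min(range(3), key=candidates.__getitem__)
            acc[i][j] = cost[i - 1][j - 1] + candidates[k]
            step[i][j] = k

    # Walk back from the bottom-right corner to recover the path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = step[i][j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def token_start_frames(path):
    """First frame index that each token is aligned to."""
    starts = {}
    for token, frame in path:
        starts.setdefault(token, frame)
    return [starts[t] for t in sorted(starts)]


# Toy example: token 0 "attends" to frames 0-1, token 1 to frames 2-3.
toy_cost = [
    [-1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, -1.0],
]
path = dtw_path(toy_cost)
print(path)
print(token_start_frames(path))
```

Multiplying each start frame by the duration of one frame would turn these indices into start times in seconds.
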
Enabling Word-Level Timestamps
By default, Whisper returns segment-level timestamps (`start` and `end` fields for each segment). To obtain word-level timestamps, you must explicitly enable the feature.
Python API Method
Pass word_timestamps=True to the transcribe() function. You can also control punctuation attachment using prepend_punctuations and append_punctuations:
```python
from whisper import load_model, transcribe
from whisper.utils import get_writer  # get_writer lives in whisper.utils

model = load_model("base")
audio_path = "example.wav"

result = transcribe(
    model,
    audio_path,
    word_timestamps=True,
    prepend_punctuations="\"'“([{-",
    append_punctuations="\"'.。,,!?!::”)]}、",
)
```
When enabled, the result dictionary contains a `words` list within each segment, with each entry providing `word`, `start`, `end`, and `probability` fields computed by `add_word_timestamps()` in `whisper/timing.py`.
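As a rough illustration of how that structure can be consumed, the sketch below walks a result-shaped dictionary and yields one tuple per word. The `sample_result` data is made up for the example, `iter_words` is a hypothetical helper, and the code assumes each word entry carries `word`, `start`, and `end` keys.

```python
def iter_words(result):
    """Yield (word, start, end) for every word in a transcription result."""
    for segment in result.get("segments", []):
        # "words" is only present when word-level timestamps were requested.
        for entry in segment.get("words", []):
            yield entry["word"].strip(), entry["start"], entry["end"]


# Hand-written stand-in for the dictionary returned by transcribe().
sample_result = {
    "text": " Hello world.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5, "probability": 0.98},
                {"word": " world.", "start": 0.5, "end": 1.2, "probability": 0.95},
            ],
        }
    ],
}

for word, start, end in iter_words(sample_result):
    print(f"{start:6.2f} -> {end:6.2f}  {word}")
```

Using `.get()` keeps the helper safe for results produced without word timestamps, in which case it simply yields nothing.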
Command-Line Interface
Pass `--word_timestamps True`; the option takes an explicit boolean value. The CLI parser (the `cli()` function in `whisper/transcribe.py`, which `whisper/__main__.py` invokes) rejects word-related formatting options unless this option is enabled:
```bash
whisper audio.mp3 \
  --model base \
  --word_timestamps True \
  --output_format srt
```
Configuring Timestamp Output Formats
The subtitle writers expose several formatting knobs that control how timestamps appear in generated files:
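- `always_include_hours` – Forces the `hh:` prefix even when hours are zero (default for SRT).
- `decimal_marker` – Specifies the separator between seconds and milliseconds (`,` for SRT, `.` for VTT).
- `highlight_words` – Wraps each word in `<u>` tags when generating SRT or VTT outputs.
- `max_line_width`, `max_line_count`, `max_words_per_line` – Control line-breaking logic in `SubtitlesWriter.iterate_result()` (lines 32-40 and 200-221 in `whisper/utils.py`).

`always_include_hours` and `decimal_marker` are attributes of the writer classes, so you can override them on the writer instance before calling it. The remaining options are supplied as options when the writer is called, which is what the CLI flags of the same names do. To set the attributes: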
```python
srt_writer = get_writer("srt", output_dir="out")
srt_writer.always_include_hours = True
srt_writer.decimal_marker = ","
srt_writer(result, audio_path)
```
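The effect of `always_include_hours` and `decimal_marker` can be illustrated with a small standalone formatter. This is a sketch written for this article, assuming millisecond precision; `format_ts_sketch` is a hypothetical name and not the library's `format_timestamp()`.

```python
def format_ts_sketch(seconds, always_include_hours=False, decimal_marker="."):
    """Format non-negative seconds as [hh:]mm:ss<marker>mmm."""
    if seconds < 0:
        raise ValueError("non-negative timestamp expected")
    milliseconds = round(seconds * 1000.0)

    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)

    # The hours field is shown when forced, or when it is non-zero.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


print(format_ts_sketch(3.5))                                   # 00:03.500
print(format_ts_sketch(3.5, always_include_hours=True))        # 00:00:03.500
print(format_ts_sketch(3661.25, True, decimal_marker=","))     # 01:01:01,250
```

The last call matches the SRT-style settings used above: hours always present and a comma before the milliseconds.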
Processing Specific Time Ranges
To generate timestamps for only a portion of the audio, use the clip_timestamps parameter. In the Python API, pass a string formatted as start,end. In the CLI, use --clip_timestamps:
```bash
whisper audio.wav \
  --model small \
  --output_format vtt \
  --clip_timestamps "30,45"
```
The transcribe() function parses this string (lines 68-78 in whisper/transcribe.py) and converts the values to frame indices, producing timestamps relative to the original audio timeline.
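The parsing step described above might look roughly like the following. This is a sketch, not a copy of `whisper/transcribe.py`: `clip_to_frame_ranges` is a hypothetical name, and the value of 100 frames per second is an assumption based on 16 kHz audio with a 160-sample hop.

```python
FRAMES_PER_SECOND = 100  # assumption: 16,000 samples/s divided by a 160-sample hop


def clip_to_frame_ranges(clip_timestamps, total_frames):
    """Turn "start,end,start,end,..." (seconds) into (start, end) frame pairs."""
    values = [float(ts) for ts in clip_timestamps.split(",")] if clip_timestamps else []
    seek_points = [round(ts * FRAMES_PER_SECOND) for ts in values]

    if not seek_points:
        seek_points.append(0)             # no clip given: start at the beginning
    if len(seek_points) % 2 == 1:
        seek_points.append(total_frames)  # open-ended clip: run to the end

    # Pair up consecutive values as (start_frame, end_frame).
    return list(zip(seek_points[::2], seek_points[1::2]))


print(clip_to_frame_ranges("30,45", total_frames=6000))  # [(3000, 4500)]
print(clip_to_frame_ranges("10", total_frames=6000))     # [(1000, 6000)]
```

Because the frame indices are measured from the start of the file, timestamps derived from them stay on the original audio timeline.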
Practical Implementation Examples
Example 1: Python API with Full Word-Level Control
```python
from whisper import load_model, transcribe
from whisper.utils import get_writer  # get_writer lives in whisper.utils

model = load_model("base")
audio_path = "interview.wav"

# Enable word timestamps with custom punctuation handling
result = transcribe(
    model,
    audio_path,
    word_timestamps=True,
    prepend_punctuations="\"'“([{-",
    append_punctuations="\"'.。,,!?!::”)]}、",
)

# Configure SRT writer with comma decimal marker and hour field
srt_writer = get_writer("srt", output_dir="./subtitles")
srt_writer.always_include_hours = True
srt_writer.decimal_marker = ","
srt_writer(result, audio_path)
```
Example 2: CLI with Highlighting and Line Constraints
```bash
whisper podcast.mp3 \
  --model medium \
  --output_dir ./output \
  --output_format srt \
  --word_timestamps True \
  --highlight_words True \
  --max_line_width 42 \
  --max_line_count 2
```
This command activates word-level timestamps, underlines each word in the SRT file, and restricts subtitles to two lines of at most 42 characters each.
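To give a feel for what a width and line-count limit does to the text, here is a deliberately simplified greedy wrapper. It is an illustration only: `wrap_words` is a hypothetical helper, and the library's own line breaking works from word timings and may split subtitles differently.

```python
def wrap_words(words, max_line_width=42, max_line_count=2):
    """Pack words into lines of limited width, then group lines into cues."""
    lines, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_line_width:
            lines.append(current)  # current line is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)

    # Each cue (subtitle block) holds at most max_line_count lines.
    return [lines[i:i + max_line_count] for i in range(0, len(lines), max_line_count)]


cues = wrap_words("the quick brown fox jumps".split(), max_line_width=10, max_line_count=2)
for cue in cues:
    print("\n".join(cue))
    print("---")
```

With a width of 10 and two lines per cue, the five words end up in two cues: one with two lines and one with the leftover word.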
Example 3: Extracting a Specific Clip with VTT Output
```python
import whisper
from whisper.utils import get_writer  # not exported from the top-level package

model = whisper.load_model("base")
result = model.transcribe(
    "lecture.mp3",
    clip_timestamps="120,300",
    word_timestamps=True,
)

writer = get_writer("vtt", "./clips")
writer(result, "lecture.mp3")
```
Summary
- Whisper inherently produces segment-level timestamps (`start` and `end`) for every transcription without requiring special configuration.
- Word-level timestamps require explicit activation via `word_timestamps=True` (API) or `--word_timestamps True` (CLI), triggering the `add_word_timestamps()` algorithm in `whisper/timing.py`.
- Formatting control is handled by writer classes in `whisper/utils.py`, supporting customization of decimal markers, hour fields, and line-breaking rules.
- Clip extraction uses `clip_timestamps` to process specific audio intervals while maintaining timestamp alignment with the original source.
- The entire pipeline flows from token generation in `whisper/tokenizer.py` through transcription logic in `whisper/transcribe.py` to final output formatting via `format_timestamp()` and writer classes.
Frequently Asked Questions
Does Whisper generate timestamps by default?
Yes. According to the source code in whisper/transcribe.py, every transcription result automatically includes start and end timestamps for each segment. These segment-level timestamps require no special flags and are always available in the output dictionary.
How do I get word-level timestamps in Whisper?
Pass `word_timestamps=True` to the `transcribe()` function in Python, or pass `--word_timestamps True` on the command line. This invokes `add_word_timestamps()` in `whisper/timing.py`, which analyzes cross-attention weights and applies dynamic time warping to assign start and end times to individual words within each segment.
Can I customize the timestamp format in SRT files?
Yes. The WriteSRT class in whisper/utils.py exposes always_include_hours and decimal_marker attributes. Set always_include_hours=True to ensure hours appear in the timestamp, and set decimal_marker="," to use the standard SRT comma separator between seconds and milliseconds.
What is the purpose of the clip_timestamps parameter?
The clip_timestamps parameter allows you to transcribe only a specific time range of an audio file (e.g., "30,45" for seconds 30 to 45). As implemented in whisper/transcribe.py (lines 68-78), this feature converts the provided seconds into frame indices, processes only that portion of the audio, and returns timestamps relative to the original file's timeline.