whisper
Robust Speech Recognition via Large-Scale Weak Supervision
Extract word-level timing in Whisper using the `word_timestamps` flag or API setting. Learn how to precisely align audio with text for accurate timing information.
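A minimal sketch of reading word-level timing out of a transcription result. It assumes the result was produced with `word_timestamps=True`, in which case each segment carries a `words` list whose entries have `word`, `start`, `end`, and `probability` keys. The sample dictionary and the `iter_word_timings` helper are hand-written for illustration and are not real model output or library code.

```python
from typing import Iterator, Tuple

def iter_word_timings(result: dict) -> Iterator[Tuple[str, float, float]]:
    """Yield (word, start, end) for every word in a Whisper-style result dict."""
    for segment in result.get("segments", []):
        # Segments only carry a "words" list when word_timestamps=True was used.
        for word in segment.get("words", []):
            # Words usually keep their leading space, so strip it for display.
            yield word["word"].strip(), word["start"], word["end"]

# Hand-written stand-in for model.transcribe(..., word_timestamps=True) output.
sample_result = {
    "text": " Hello world.",
    "language": "en",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.48, "probability": 0.97},
                {"word": " world.", "start": 0.52, "end": 1.2, "probability": 0.91},
            ],
        }
    ],
}

for text, start, end in iter_word_timings(sample_result):
    print(f"{start:6.2f}s - {end:6.2f}s  {text}")
```

With the real library, the input would come from something like `whisper.load_model("base").transcribe("audio.mp3", word_timestamps=True)`, where `audio.mp3` is a placeholder file name.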
How OpenAI Whisper Handles Audio Processing and Mel Spectrogram Generation
Discover how OpenAI Whisper processes audio and generates mel spectrograms using its four-stage pipeline, including ffmpeg loading, STFT computation, and mel filterbank projection.
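The STFT and filterbank stages can be sketched in NumPy. The constants (16 kHz audio, a 400-sample FFT window, a 160-sample hop, 80 mel bins) and the final log, clamp, and rescale step follow Whisper's `whisper/audio.py`. The triangular filterbank built here is a hand-rolled HTK-style approximation, whereas Whisper loads precomputed librosa filters from a bundled `mel_filters.npz` file, so the values will not match exactly. ffmpeg decoding is skipped: the input is assumed to already be a mono float array at 16 kHz.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Whisper resamples all audio to 16 kHz
N_FFT = 400           # 25 ms analysis window
HOP_LENGTH = 160      # 10 ms hop -> 100 frames per second
N_MELS = 80

def hz_to_mel(hz):
    # HTK mel formula (an approximation of the filters Whisper ships with)
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    filters = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        filters[m] = np.maximum(0.0, np.minimum(rising, falling))
    return filters

def log_mel_spectrogram(audio):
    """Mono samples at 16 kHz -> (N_MELS, n_frames) log-mel array."""
    # Centered STFT: reflect-pad half a window on each side, as torch.stft does.
    padded = np.pad(audio, N_FFT // 2, mode="reflect")
    n_frames = 1 + (padded.size - N_FFT) // HOP_LENGTH
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window
    frames = np.stack([padded[i * HOP_LENGTH : i * HOP_LENGTH + N_FFT] for i in range(n_frames)])
    stft = np.fft.rfft(frames * window, axis=1)  # (n_frames, N_FFT // 2 + 1)
    power = np.abs(stft[:-1]) ** 2               # drop the last frame, like Whisper
    mel = mel_filterbank() @ power.T             # project onto the mel bins
    # Log scaling, 8-decade dynamic-range clamp, and rescaling.
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0

spec = log_mel_spectrogram(np.zeros(SAMPLE_RATE, dtype=np.float32))
print(spec.shape)  # (80, 100): one second of audio gives 100 frames
```

In Whisper's lower-level API, `whisper.pad_or_trim()` fits the audio to 30 seconds (480,000 samples) before this step, which at 100 frames per second yields the 3,000-frame input the encoder expects.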
The Role of KV Caching in Whisper's Performance: Architecture and Implementation
KV caching reduces the per-step cost of Whisper's autoregressive decoding from O(T²) to O(T) by reusing previously computed key and value tensors across generation steps, eliminating redundant attention calculations during long audio t...
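The effect of caching can be shown with a toy single-head causal self-attention in NumPy: recomputing attention over the whole prefix at every step gives the same outputs as projecting only the newest token and appending its key and value to a cache. This is a generic sketch with random weights, not Whisper's code; in the library, the cache is filled by forward hooks registered on the decoder's key and value projection layers (`install_kv_cache_hooks` in `whisper/model.py`).

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_STEPS = 8, 6
# Hypothetical random weights for a single attention head.
W_q, W_k, W_v = (rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(3))
tokens = rng.standard_normal((N_STEPS, D_MODEL))  # stand-in token embeddings

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def causal_attention(x):
    """Full causal self-attention over every position in x."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    t = len(x)
    scores = q @ k.T / np.sqrt(D_MODEL)
    scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ v

def decode_without_cache(x):
    # Step t re-projects and re-attends the whole prefix: O(t^2) attention work.
    return np.stack([causal_attention(x[:t])[-1] for t in range(1, len(x) + 1)])

def decode_with_cache(x):
    k_cache, v_cache, outputs = [], [], []
    for token in x:
        # Only the newest token is projected; earlier keys and values are reused.
        k_cache.append(token @ W_k)
        v_cache.append(token @ W_v)
        q = token @ W_q
        scores = np.stack(k_cache) @ q / np.sqrt(D_MODEL)  # O(t) attention work
        outputs.append(softmax(scores) @ np.stack(v_cache))
    return np.stack(outputs)

print("outputs match:", np.allclose(decode_without_cache(tokens), decode_with_cache(tokens)))
```

The cached path does less work per step because the query of the newest token is the only new row; everything the older rows would have contributed is already stored.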
How to Enable and Customize Word-Level Timestamps in Whisper
Get granular control over Whisper's output by learning how to enable and customize word-level timestamps using the Python API or CLI. Fine-tune punctuation and silence settings for precise transcriptions.
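The punctuation settings matter because, when splitting tokens into words, Whisper treats punctuation as separate pieces and then merges them into neighboring words. The sketch below mimics that merging on plain strings; it is a simplified illustration rather than the library's implementation (which works on word-timing objects), and the two punctuation sets are assumed to mirror the defaults of `transcribe()`'s `prepend_punctuations` and `append_punctuations` parameters.

```python
# Assumed to mirror transcribe()'s defaults; check your installed version.
PREPENDED = "\"'“¿([{-"
APPENDED = "\"'.。,，!！?？:：”)]}、"

def merge_punctuations(words, prepended=PREPENDED, appended=APPENDED):
    """Glue standalone punctuation onto neighboring words (simplified sketch)."""
    words = list(words)
    # Opening punctuation (e.g. ' "' or ' (') attaches to the following word,
    # scanning right to left so runs of punctuation collapse onto one word.
    i, j = len(words) - 2, len(words) - 1
    while i >= 0:
        if words[i].startswith(" ") and words[i].strip() in prepended:
            words[j] = words[i] + words[j]
            words[i] = ""
        else:
            j = i
        i -= 1
    # Closing punctuation (e.g. ',' or '.') attaches to the preceding word.
    i, j = 0, 1
    while j < len(words):
        if not words[i].endswith(" ") and words[j] in appended:
            words[i] = words[i] + words[j]
            words[j] = ""
        else:
            i = j
        j += 1
    return [w for w in words if w]  # drop pieces that were merged away

print(merge_punctuations([" He", " said", ",", " \"", "hi", "\"", "."]))
```

In the real API the equivalent knobs are keyword arguments, for example `model.transcribe("audio.mp3", word_timestamps=True, append_punctuations=".,!?")`, with `audio.mp3` as a placeholder.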
How to Use Whisper for Language Detection Without Transcription: 2 Methods Explained
You can detect the spoken language in an audio file using Whisper by calling `model.detect_language()` for a lightweight check that runs the encoder plus a single decoder step, or by setting `task="lang_id"` in `DecodingOptions` to use the high-level decoding A...
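Conceptually, language detection looks only at the language tokens at the first decoding position and turns their logits into one probability per language. The NumPy sketch below shows that masking-plus-softmax step on made-up logits; the token ids and the three-language list are hypothetical. In the real library, the resulting per-language dictionary is what `model.detect_language(mel)` returns as its second value for a single input.

```python
import numpy as np

# Hypothetical vocabulary positions of a few language tokens.
LANGUAGE_TOKENS = {"en": 5, "de": 6, "fr": 7}

def language_probs(logits):
    """Mask every non-language token, then softmax over what remains."""
    masked = np.full_like(logits, -np.inf)
    ids = list(LANGUAGE_TOKENS.values())
    masked[ids] = logits[ids]
    exp = np.exp(masked - masked.max())  # exp(-inf) == 0 for masked tokens
    probs = exp / exp.sum()
    return {code: float(probs[i]) for code, i in LANGUAGE_TOKENS.items()}

# Made-up first-position logits over a 10-token vocabulary. Token 0 has the
# largest logit overall but is not a language token, so it is ignored.
logits = np.array([9.0, 0.1, 0.2, 0.3, 0.4, 2.0, 0.5, 1.0, 0.0, 3.0])
probs = language_probs(logits)
detected = max(probs, key=probs.get)
print(detected, round(probs[detected], 3))
```

With a real model, the same final step reads `_, probs = model.detect_language(mel)` followed by `max(probs, key=probs.get)`.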
OpenAI Whisper model.transcribe() Result Dictionary Structure Explained
The `model.transcribe()` method returns a Python dictionary containing three top-level keys: `text` (the full transcription string), `segments` (a list of per-chunk dictionaries with timestamps and metadata), and `language` (...
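A small example of walking the result dictionary. The sample below is hand-written and shows only some of the per-segment keys (`id`, `start`, `end`, `text`, `avg_logprob`, `no_speech_prob`). The `summarize` helper is hypothetical, and its cut-offs simply reuse the default values of `transcribe()`'s `logprob_threshold` (-1.0) and `no_speech_threshold` (0.6) as convenient flagging thresholds.

```python
def summarize(result, logprob_cutoff=-1.0, no_speech_cutoff=0.6):
    """Print the transcript and return the ids of segments that look unreliable."""
    print(f"language={result['language']!r}")
    print(f"text={result['text'].strip()!r}")
    flagged = []
    for seg in result["segments"]:
        suspicious = (
            seg["avg_logprob"] < logprob_cutoff or seg["no_speech_prob"] > no_speech_cutoff
        )
        if suspicious:
            flagged.append(seg["id"])
        marker = "?" if suspicious else " "
        print(f"{marker} [{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}")
    return flagged

# Hand-written stand-in for a model.transcribe() result.
sample_result = {
    "text": " Good morning. Let's begin.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.6, "text": " Good morning.",
         "avg_logprob": -0.21, "no_speech_prob": 0.02},
        {"id": 1, "start": 1.6, "end": 3.1, "text": " Let's begin.",
         "avg_logprob": -1.35, "no_speech_prob": 0.10},
    ],
}

flagged_ids = summarize(sample_result)
```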
How to Use `initial_prompt` and `condition_on_previous_text` for Context in OpenAI Whisper
Use `initial_prompt` to inject static text at the start of transcription, and enable `condition_on_previous_text` (default: True) to carry decoded output from previous audio windows into subsequent decoding steps for contextu...
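A rough model of how the prompt for each 30-second window can be assembled, operating on plain integer token ids. The 223-token limit is an assumption derived from the default 448-token text context (Whisper keeps only the last `n_text_ctx // 2 - 1` prompt tokens), and the reset rule roughly mirrors the behavior described above: with `condition_on_previous_text` disabled, earlier windows stop feeding the next one. The `build_prompts` helper is hypothetical.

```python
MAX_PROMPT_TOKENS = 448 // 2 - 1  # assumption: n_text_ctx=448 -> keep the last 223

def build_prompts(initial_prompt_tokens, window_outputs, condition_on_previous_text=True):
    """Return the prompt token list used for each window (hypothetical helper).

    window_outputs: token ids decoded for each window, in order.
    """
    all_tokens = list(initial_prompt_tokens)
    reset_since = 0  # the prompt is all_tokens[reset_since:]
    prompts = []
    for decoded in window_outputs:
        prompt = all_tokens[reset_since:]
        prompts.append(prompt[-MAX_PROMPT_TOKENS:])  # keep only the most recent tokens
        all_tokens.extend(decoded)
        if not condition_on_previous_text:
            # Forget everything decoded so far; later windows start without context.
            reset_since = len(all_tokens)
    return prompts

print(build_prompts([1, 2], [[10, 11], [12]]))
print(build_prompts([1, 2], [[10, 11], [12]], condition_on_previous_text=False))
```

Note that in this model the initial prompt only reaches the first window once conditioning is disabled. The corresponding real call would be along the lines of `model.transcribe("audio.mp3", initial_prompt="Glossary: PyTorch, Kubernetes.", condition_on_previous_text=False)`, with a placeholder file name and prompt.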
Whisper model.transcribe() Advanced Parameters: Temperature, Thresholds, and Decoding Options
Explore advanced Whisper model.transcribe() parameters like temperature and thresholds. Optimize transcription accuracy with detailed sampling and validation controls.
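The interplay of `temperature` and the thresholds can be sketched as a fallback loop. It follows the logic `transcribe()` appears to use with its defaults `temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)`, `compression_ratio_threshold=2.4`, `logprob_threshold=-1.0`, and `no_speech_threshold=0.6`: a window is retried at the next temperature if its text is too repetitive (high zlib compression ratio) or too unlikely (low average log-probability), unless it looks like silence. The `decode` callable and the `Attempt` record are hypothetical stand-ins for a real decoding pass.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Attempt:  # hypothetical stand-in for a decoding result
    text: str
    avg_logprob: float
    no_speech_prob: float
    temperature: float = 0.0

def compression_ratio(text: str) -> float:
    """Repetitive text compresses well, so it gets a high ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def decode_with_fallback(
    decode: Callable[[float], Attempt],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
    no_speech_threshold: float = 0.6,
) -> Attempt:
    result = None
    for t in temperatures:
        result = decode(t)
        needs_fallback = (
            compression_ratio(result.text) > compression_ratio_threshold  # too repetitive
            or result.avg_logprob < logprob_threshold                      # too unlikely
        )
        if result.no_speech_prob > no_speech_threshold:
            needs_fallback = False  # probably silence: retrying will not help
        if not needs_fallback:
            break
    return result  # the last attempt is kept even if every temperature failed

# Fake decoder: repetitive at T=0, sensible once sampling kicks in.
fake = lambda t: Attempt("la " * 50 if t == 0.0 else "a normal sentence", -0.3, 0.01, t)
print(decode_with_fallback(fake))
```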
How Whisper model.transcribe() Works: A Deep Dive into the Transcription Pipeline
The `model.transcribe()` function is a high-level wrapper that orchestrates audio preprocessing, language detection, windowed decoding with temperature fallback, and timestamp extraction to convert speech into structured text...
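The windowed part of the pipeline can be pictured as sliding a 30-second (3,000-frame) window over the log-mel spectrogram and converting frame offsets to seconds at 100 frames per second. This sketch assumes every window is consumed in full; the real `transcribe()` instead advances its `seek` position based on the timestamps the model predicts, so consecutive windows usually start where the last complete segment ended rather than exactly 30 seconds later. The `plan_windows` helper is hypothetical.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000
HOP_LENGTH = 160
N_FRAMES = 3_000  # 30 seconds of log-mel frames
FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH  # 100

def plan_windows(total_frames: int) -> List[Tuple[int, int, float]]:
    """Return (start_frame, end_frame, time_offset_seconds) for each window."""
    windows = []
    seek = 0
    while seek < total_frames:
        end = min(seek + N_FRAMES, total_frames)
        windows.append((seek, end, seek / FRAMES_PER_SECOND))
        # Simplification: real Whisper moves seek to the last predicted timestamp.
        seek = end
    return windows

# 75 seconds of audio -> 7,500 frames -> windows starting at 0 s, 30 s and 60 s.
for start, end, offset in plan_windows(75 * FRAMES_PER_SECOND):
    print(f"frames {start:>5}-{end:>5}  starts at {offset:5.1f}s")
```

Segment timestamps predicted inside a window are relative to that window, which is why the time offset is added back when building the final `segments` list.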
How to Enable and Configure Timestamp Generation in Whisper
Whisper automatically generates segment-level timestamps for every transcription, and you can activate word-level precision by passing `word_timestamps=True` to the `transcribe()` function or using the `--word_timestamps` CLI...
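Segment timestamps are plain floats in seconds, so turning them into subtitle cues is mostly formatting. The helpers below write SRT-style `HH:MM:SS,mmm` cues from a list of segments. They are a standalone sketch (the CLI can already write subtitle files with `--output_format srt`), and the sample segments are made up.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Number each segment and emit one SRT cue per segment."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues

sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 3661.25, "text": " A very long pause."},
]
print(segments_to_srt(sample_segments))
```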
How to Suppress Specific Tokens or Blank Outputs During Whisper Decoding
Learn how to suppress specific tokens or blank outputs in Whisper decoding. Configure DecodingOptions with suppress_blank and suppress_tokens for cleaner results.
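Token suppression amounts to masking logits before a token is chosen. In this NumPy sketch, ids in a suppression list always get `-inf`, while a blank-suppression rule also masks a space token and the end-of-text token, but only at the first sampled position, which is how `suppress_blank` appears to behave. All token ids and the `apply_suppression` helper are hypothetical.

```python
import numpy as np

# Hypothetical token ids for illustration only.
SPACE_TOKEN, EOT_TOKEN = 1, 2
SUPPRESS_TOKENS = [4, 7]  # e.g. symbols you never want in the output

def apply_suppression(logits, position, sample_begin=0, suppress_blank=True):
    """Return a copy of logits with suppressed tokens set to -inf."""
    logits = logits.astype(float)  # astype returns a copy, so the input is untouched
    logits[SUPPRESS_TOKENS] = -np.inf
    if suppress_blank and position == sample_begin:
        # The very first sampled token may be neither a blank nor an immediate end.
        logits[[SPACE_TOKEN, EOT_TOKEN]] = -np.inf
    return logits

logits = np.array([0.5, 3.0, 2.5, 1.0, 4.0, 0.2, 0.1, 3.5])
first = int(np.argmax(apply_suppression(logits, position=0)))  # greedy pick at step 0
later = int(np.argmax(apply_suppression(logits, position=3)))  # greedy pick later on
print(first, later)  # prints "3 1"
```

Because masking happens before the argmax (or sampling), a suppressed token can never be selected, no matter how large its raw logit is.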
How to Configure Whisper for Greedy Decoding vs. Beam Search
To configure OpenAI Whisper for greedy decoding, set `temperature=0.0` and leave `beam_size=None` in the `DecodingOptions` dataclass; for beam search, provide an integer value (e.g., `5`) to `beam_size` and set `temperature=0...
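The difference between the two strategies shows up even on a tiny, made-up next-token table: greedy decoding commits to the single most likely token at each step, while beam search keeps the `beam_size` best partial sequences and can recover a sequence whose first token looked worse. This is a generic textbook sketch over hypothetical probabilities; Whisper's own beam search additionally handles end-of-text tokens, `patience`, and length penalties.

```python
import math

# Hypothetical next-token probabilities keyed by the prefix decoded so far.
NEXT = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.55, "b": 0.45},
    ("b",): {"a": 0.9, "b": 0.1},
}
STEPS = 2

def greedy():
    seq = ()
    for _ in range(STEPS):
        probs = NEXT[seq]
        seq += (max(probs, key=probs.get),)  # take the single best token
    return seq

def beam_search(beam_size=2):
    beams = [((), 0.0)]  # (sequence, summed log-probability)
    for _ in range(STEPS):
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beams
            for tok, p in NEXT[seq].items()
        ]
        # Keep only the beam_size highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

print("greedy:", greedy())       # ('a', 'a'), probability 0.6 * 0.55 = 0.33
print("beam:  ", beam_search())  # ('b', 'a'), probability 0.4 * 0.9 = 0.36
```

In Whisper itself the choice is expressed through options such as `whisper.DecodingOptions(temperature=0.0, beam_size=5)`. As the CLI help for `--beam_size` notes, beam search only applies when the temperature is zero, so fallback attempts at higher temperatures sample instead.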