How Slide Extraction and OCR Work for YouTube Videos in Summarize
Summarize converts YouTube videos into searchable slide thumbnails using ffmpeg scene detection and optional Tesseract OCR, implemented across the src/slides/ package.
The steipete/summarize repository provides a complete pipeline for extracting visual slides from YouTube videos and performing optical character recognition (OCR) to make on-screen text searchable. This article breaks down the technical implementation, from video download to final JSON persistence.
The Slide Extraction Pipeline
The core orchestration happens in extractSlidesForSource within src/slides/extract.ts. This function coordinates ten distinct steps to transform a YouTube URL into a collection of processed slide images.
Resolving the Video Source
The process begins in src/slides/index.ts with resolveSlideSource. This function parses the input URL, extracts the YouTube video ID, and constructs a SlideSource object that carries metadata throughout the pipeline.
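The URL-parsing step can be illustrated with a minimal sketch. This is not the repository's resolveSlideSource; the helper names and the SlideSourceSketch shape below are hypothetical, and the sketch only assumes that the video ID comes from the v query parameter of a watch URL or the path of a youtu.be link.

```typescript
// Hypothetical shape for illustration; the real SlideSource type may differ.
interface SlideSourceSketch {
  url: string;
  kind: "youtube";
  sourceId: string;
}

// Returns the YouTube video ID, or null if the input is not a recognized YouTube URL.
function extractYoutubeId(input: string): string | null {
  let parsed: URL;
  try {
    parsed = new URL(input);
  } catch {
    return null; // not a valid absolute URL
  }
  const host = parsed.hostname.replace(/^www\./, "");
  if (host === "youtu.be") {
    // Short links carry the ID as the first path segment.
    const id = parsed.pathname.split("/")[1];
    return id ? id : null;
  }
  if (host === "youtube.com" || host === "m.youtube.com") {
    // Standard watch URLs carry the ID in the "v" query parameter.
    return parsed.searchParams.get("v");
  }
  return null;
}

// Builds the metadata object that would travel through the rest of the pipeline.
function resolveSlideSourceSketch(url: string): SlideSourceSketch | null {
  const id = extractYoutubeId(url);
  return id ? { url, kind: "youtube", sourceId: id } : null;
}
```

Returning null for unrecognized hosts keeps the caller in charge of deciding whether a non-YouTube URL is an error.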
Downloading YouTube Content
For slide extraction, the system must obtain the video file. The downloadYoutubeVideo function (line 331 in src/slides/extract.ts) spawns yt-dlp to fetch an MP4 file. If direct download is unavailable, it falls back to resolveYoutubeStreamUrl (line 348) to obtain a streamable URL.
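The download-then-fallback behavior described above might look roughly like the following. The two callbacks stand in for downloadYoutubeVideo and resolveYoutubeStreamUrl; their signatures here are assumptions, as is the VideoInput type, and the process spawning itself is omitted.

```typescript
// Hypothetical result type: either a local file or a streamable URL for ffmpeg.
type VideoInput =
  | { kind: "file"; path: string }
  | { kind: "stream"; url: string };

// Tries the direct download first; if it rejects, falls back to a stream URL.
// Both callbacks are placeholders for the real yt-dlp-backed functions.
async function acquireVideo(
  download: () => Promise<string>,
  resolveStream: () => Promise<string>,
): Promise<VideoInput> {
  try {
    return { kind: "file", path: await download() };
  } catch {
    // An error from the fallback propagates to the caller unchanged.
    return { kind: "stream", url: await resolveStream() };
  }
}
```

Tagging the result with kind lets later stages tell a local MP4 from a remote stream without inspecting the string.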
Scene Detection and Timestamp Calibration
Before extracting frames, the system calibrates detection parameters. The probeVideoInfo function uses ffprobe to obtain duration and bitrate. Then calibrateSceneThreshold (line 1009) runs a short ffmpeg pass to determine an optimal scene-change threshold. If --slides-auto-tune is enabled, this threshold is stored in the result metadata.
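The article does not spell out how the calibrated threshold is computed. As one plausible illustration only, the sketch below picks a high percentile of scene-change scores sampled from the video and clamps it to a sane range. The function name, the 0.9 percentile, the 0.3 fallback, and the 0.05 to 0.6 clamp are all assumptions, not values taken from the repository.

```typescript
// Illustrative heuristic, not the repository's calibrateSceneThreshold.
// `scores` are per-frame scene-change scores in the range 0..1.
function calibrateThresholdSketch(
  scores: number[],
  fallback = 0.3, // assumed default when no samples are available
  percentile = 0.9, // assumed: only the sharpest ~10% of changes count as cuts
): number {
  if (scores.length === 0) return fallback;
  const sorted = [...scores].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor(percentile * sorted.length));
  const picked = sorted[index] ?? fallback;
  // Clamp so a static or extremely busy sample cannot produce a useless threshold.
  return Math.min(0.6, Math.max(0.05, picked));
}
```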
The actual timestamp detection happens in detectSlideTimestamps (line 966), which calls detectSceneTimestamps (line 1085). This runs ffmpeg with -vf select='gt(scene,THRESH)' across parallel video segments to identify sharp visual changes that indicate slide boundaries.
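To make the ffmpeg invocation concrete, here is a hedged sketch of how the argument list could be built and how timestamps could be read back from the showinfo filter's log output. The function names, the use of -ss/-t for segmenting, and the offset handling are assumptions for illustration; only the select='gt(scene,THRESH)' filter comes from the description above.

```typescript
// Builds ffmpeg arguments for one segment of the video (hypothetical helper).
function buildSceneDetectArgs(
  input: string,
  threshold: number,
  startSec: number,
  durationSec: number,
): string[] {
  return [
    "-hide_banner",
    "-ss", String(startSec), // seek to the segment start before decoding
    "-t", String(durationSec), // limit decoding to this segment
    "-i", input,
    "-vf", `select='gt(scene,${threshold})',showinfo`,
    "-an", // ignore audio
    "-f", "null", "-", // discard output; only the log matters
  ];
}

// Extracts pts_time values from showinfo log lines and shifts them by the
// segment start, assuming timestamps restart at zero after input seeking.
function parseSceneTimestamps(stderr: string, offsetSec = 0): number[] {
  const times: number[] = [];
  const pattern = /pts_time:\s*([0-9]+(?:\.[0-9]+)?)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(stderr)) !== null) {
    const t = Number(match[1]);
    if (Number.isFinite(t)) times.push(t + offsetSec);
  }
  return times;
}
```

Running one such command per segment and concatenating the shifted timestamps is one way to parallelize detection across a long video.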
Frame Extraction and Quality Adjustment
Once timestamps are finalized through buildIntervalTimestamps, filterTimestampsByMinDuration, and applyMaxSlidesFilter, the system extracts frames. The extractFramesAtTimestamps function (line 1118) iterates through timestamps, calling extractFrame (lines 665-720) for each.
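The two filtering steps named above can be sketched as follows. These are illustrative stand-ins, not the repository's filterTimestampsByMinDuration and applyMaxSlidesFilter: the sketch assumes that "minimum duration" means a minimum gap between consecutive kept timestamps, and that the slide cap is enforced by sampling evenly across the list.

```typescript
// Drops timestamps that fall closer than minDurationSec to the previously kept one.
function filterByMinDurationSketch(timestamps: number[], minDurationSec: number): number[] {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const kept: number[] = [];
  for (const t of sorted) {
    const last = kept[kept.length - 1];
    if (last === undefined || t - last >= minDurationSec) kept.push(t);
  }
  return kept;
}

// Caps the number of slides by sampling evenly, always keeping first and last.
function applyMaxSlidesSketch(timestamps: number[], maxSlides: number): number[] {
  if (maxSlides <= 0) return [];
  if (timestamps.length <= maxSlides) return [...timestamps];
  if (maxSlides === 1) return timestamps.slice(0, 1);
  const step = (timestamps.length - 1) / (maxSlides - 1);
  const picked: number[] = [];
  for (let i = 0; i < maxSlides; i++) {
    const value = timestamps[Math.round(i * step)];
    if (value !== undefined) picked.push(value);
  }
  return picked;
}
```

Even sampling preserves coverage of the whole video; a real implementation might instead rank timestamps by scene score.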
The extractFrame function runs ffmpeg with -vframes 1 and applies signalstats,showinfo filters to capture brightness and contrast statistics. If a frame is too dark or low-contrast, the system executes a quality adjustment routine (starting at line 880) that searches a ±10 second window in 2-second steps, re-extracts alternative frames, and selects the best candidate based on a quality score.
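A small sketch can make the search window concrete. The ±10 second window and 2-second step come from the description above; the FrameStats shape, the normalization of brightness and contrast to 0..1, and the equal-weight quality score are assumptions made for illustration.

```typescript
// Hypothetical per-frame statistics, both values normalized to 0..1.
interface FrameStats {
  timestamp: number;
  brightness: number;
  contrast: number;
}

// Lists alternative timestamps around `center`, skipping the center itself
// and anything outside the video's duration.
function candidateTimestamps(
  center: number,
  durationSec: number,
  windowSec = 10,
  stepSec = 2,
): number[] {
  const out: number[] = [];
  for (let offset = -windowSec; offset <= windowSec; offset += stepSec) {
    if (offset === 0) continue;
    const t = center + offset;
    if (t >= 0 && t <= durationSec) out.push(t);
  }
  return out;
}

// Assumed scoring: brighter, higher-contrast frames score higher.
function qualityScore(stats: FrameStats): number {
  return 0.5 * stats.brightness + 0.5 * stats.contrast;
}

// Keeps the original frame unless a candidate scores strictly better.
function pickBestFrame(original: FrameStats, candidates: FrameStats[]): FrameStats {
  let best = original;
  for (const candidate of candidates) {
    if (qualityScore(candidate) > qualityScore(best)) best = candidate;
  }
  return best;
}
```

Requiring a strictly better score means the detected timestamp wins ties, so slides only move when there is a measurable gain.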
OCR Processing for Text Extraction
When the --slides-ocr flag is provided, the pipeline extends to perform optical character recognition on every extracted slide.
Running Tesseract on Extracted Frames
The runOcrOnSlides function (line 2000 in src/slides/extract.ts) spawns tesseract processes in parallel for each PNG file. This operation requires tesseract to be available on the system PATH.
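Running one tesseract process per slide benefits from a cap on how many run at once. The helper below is a generic, hypothetical sketch of bounded parallelism; whether the repository limits concurrency, and how, is not stated in this article. In practice the worker would spawn something like `tesseract <image> stdout` and resolve with the captured text.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in the same order as the inputs.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Because each lane claims its index synchronously before awaiting, no slide is processed twice even though the lanes share one counter.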
Confidence Estimation and Result Storage
After OCR completes, estimateOcrConfidence evaluates the text quality. The final SlideExtractionResult object, containing slide metadata, OCR text, confidence estimates, and any auto-tuning information, is persisted to slides.json via writeSlidesJson (line 1660).
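How estimateOcrConfidence scores text is not described here. One simple heuristic, shown purely as an assumption, is the share of recognized characters that are ASCII letters or digits: OCR noise tends to produce stray punctuation and symbols. The function name and the scoring rule below are hypothetical.

```typescript
// Illustrative heuristic only: fraction of non-whitespace characters that are
// ASCII letters or digits. Returns a value between 0 and 1.
function estimateOcrConfidenceSketch(text: string): number {
  const nonSpace = text.replace(/\s+/g, "");
  if (nonSpace.length === 0) return 0; // nothing recognized
  const alnumCount = (nonSpace.match(/[A-Za-z0-9]/g) ?? []).length;
  return alnumCount / nonSpace.length;
}
```

A score like this is cheap to compute from the text alone, though it would undervalue slides in non-Latin scripts; Tesseract's own per-word confidences are a more direct signal when available.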
Implementation Details and Key Files
The slide extraction system spans several modules within the src/slides/ directory:
| File | Responsibility |
|---|---|
| src/slides/index.ts | Public API entry points: resolveSlideSource, extractSlidesForSource. |
| src/slides/extract.ts | Core pipeline implementation including download, scene detection, frame extraction, thumbnail adjustment, and OCR orchestration. |
| src/slides/types.ts | TypeScript definitions for SlideImage, SlideExtractionResult, and related interfaces. |
| src/slides/store.ts | Cache handling, slide-directory ID generation, and slide-JSON persistence logic. |
| src/run/slides-cli.ts | Command-line interface parsing for --slides and --slides-ocr flags. |
| src/run/flows/url/slides-output.ts | Transforms extraction results for downstream UI rendering. |
Usage Examples
Extract slides from a YouTube video using the CLI:
# Basic slide extraction (thumbnails only)
summarize "https://www.youtube.com/watch?v=abcd1234" --slides
# Slide extraction + OCR (requires tesseract on $PATH)
summarize "https://youtu.be/abcd1234" --slides --slides-ocr
Programmatic usage via the core library:
import { extractSlidesForSource } from "@steipete/summarize-core";
import { resolveSlideSource } from "@steipete/summarize-core/slides";

// Parse the URL into a SlideSource that carries metadata through the pipeline.
const source = resolveSlideSource({
  url: "https://www.youtube.com/watch?v=abcd1234",
  extracted: {},
});

// Enable slide extraction and OCR; output is written under cwd.
const settings = {
  slides: true,
  slidesOcr: true,
  cwd: "/tmp/my-summaries",
};

// Run the full pipeline with explicit paths to the external binaries.
const result = await extractSlidesForSource({
  source,
  settings,
  env: process.env,
  timeoutMs: 300_000,
  ytDlpPath: "/usr/local/bin/yt-dlp",
  ffmpegPath: "/usr/bin/ffmpeg",
  tesseractPath: "/usr/bin/tesseract",
});

// Print one line per slide with its image path and OCR text, if any.
console.log(result.slides.map(s =>
  `${s.index}: ${s.imagePath} → ${s.ocrText ?? "no OCR"}`
));
Summary
- Summarize extracts slides from YouTube videos using a multi-stage pipeline implemented in src/slides/extract.ts.
- The system uses yt-dlp for video acquisition and ffmpeg with scene detection filters (select='gt(scene,THRESH)') to identify slide boundaries.
- Tesseract OCR processes extracted PNG frames when --slides-ocr is enabled, making on-screen text searchable.
- Quality adjustment algorithms automatically re-extract frames if initial captures are too dark or low-contrast.
- Results are persisted to slides.json with metadata including timestamps, image paths, and OCR confidence scores.
Frequently Asked Questions
What external tools does Summarize require for slide extraction?
Summarize relies on three external tools: yt-dlp for downloading YouTube videos, ffmpeg (together with ffprobe) for video processing and scene detection, and optionally tesseract for OCR. They are normally expected on your system PATH; the extractSlidesForSource function also accepts explicit paths to these binaries via the ytDlpPath, ffmpegPath, and tesseractPath parameters.
How does the scene detection threshold calibration work?
The calibrateSceneThreshold function in src/slides/extract.ts (line 1009) performs a preliminary ffmpeg pass on a sample of the video to statistically determine an optimal threshold for the select='gt(scene,THRESH)' filter. When users enable --slides-auto-tune, this calibrated threshold is stored in the final SlideExtractionResult and used instead of default values, improving detection accuracy for videos with varying transition speeds.
Can I extract slides without running OCR?
Yes. OCR is an optional stage on top of slide extraction, and each is controlled by its own flag. Using --slides alone runs the full pipeline through frame extraction and quality adjustment but skips the runOcrOnSlides function. Only when --slides-ocr is also specified does the system spawn tesseract processes on the extracted PNG files, so you can generate thumbnails quickly and add text recognition only when you need it.
Where are the extracted slide images and OCR results stored?
All extraction outputs are organized in a dedicated directory managed by src/slides/store.ts. The writeSlidesJson function (line 1660 in extract.ts) persists a slides.json file containing the SlideExtractionResult object, which includes metadata for each slide: timestamp, image file path, brightness statistics, and OCR text with confidence scores. The actual PNG images are stored alongside this JSON file in the configured working directory (cwd parameter).