# How Slide Extraction and OCR Work for YouTube Videos in Summarize

> Discover how Summarize's slide extraction and OCR process transforms YouTube videos into searchable thumbnails using ffmpeg and Tesseract OCR.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: internals
- Published: 2026-02-19

---

**Summarize converts YouTube videos into searchable slide thumbnails using ffmpeg scene detection and optional Tesseract OCR, implemented across the `src/slides/` package.**

The `steipete/summarize` repository provides a complete pipeline for extracting visual slides from YouTube videos and performing optical character recognition (OCR) to make on-screen text searchable. This article breaks down the technical implementation, from video download to final JSON persistence.

## The Slide Extraction Pipeline

The core orchestration happens in `extractSlidesForSource` within [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts). This function coordinates ten distinct steps to transform a YouTube URL into a collection of processed slide images.

### Resolving the Video Source

The process begins in [`src/slides/index.ts`](https://github.com/steipete/summarize/blob/main/src/slides/index.ts) with `resolveSlideSource`. This function parses the input URL, extracts the YouTube video ID, and constructs a `SlideSource` object that carries metadata throughout the pipeline.
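As a rough illustration of the URL-parsing step, the sketch below pulls a video ID out of common YouTube URL shapes. `extractYoutubeVideoId` is a hypothetical helper written for this article, not the repository's code; the real `resolveSlideSource` may accept more URL forms and returns a richer `SlideSource` object.

```typescript
// Hypothetical helper (an assumption for illustration, not the repo's API).
function extractYoutubeVideoId(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable URL
  }
  const host = url.hostname.replace(/^www\./, "");
  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  if (host === "youtu.be") {
    // Short links carry the ID as the first path segment.
    const id = segments.shift();
    return id === undefined ? null : id;
  }
  if (host === "youtube.com" || host === "m.youtube.com") {
    if (url.pathname === "/watch") return url.searchParams.get("v");
    // /embed/<id> and /shorts/<id> style paths.
    const kind = segments.shift();
    const id = segments.shift();
    if ((kind === "embed" || kind === "shorts") && id !== undefined) return id;
  }
  return null;
}
```

Returning `null` for unrecognized hosts lets a caller decide whether the input is a non-YouTube source rather than failing outright.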

### Downloading YouTube Content

For slide extraction, the system must obtain the video file. The `downloadYoutubeVideo` function (line 331 in [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts)) spawns **yt-dlp** to fetch an MP4 file. If direct download is unavailable, it falls back to `resolveYoutubeStreamUrl` (line 348) to obtain a streamable URL.
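The download-then-fallback behavior can be sketched as follows. This is a minimal sketch under assumptions: `obtainVideo`, `CommandRunner`, and `VideoInput` are hypothetical names, the process runner is injected so the logic stays testable, and the yt-dlp arguments shown (`-f mp4`, `-o`, `-g`) are standard yt-dlp options that may differ from what the repository actually passes.

```typescript
// Hypothetical types and function names, for illustration only.
type CommandRunner = (binary: string, args: string[]) => Promise<string>;

type VideoInput =
  | { kind: "file"; path: string }
  | { kind: "stream"; url: string };

async function obtainVideo(
  videoUrl: string,
  outputPath: string,
  run: CommandRunner,
  ytDlpPath = "yt-dlp",
): Promise<VideoInput> {
  try {
    // First choice: download an MP4 file to disk.
    await run(ytDlpPath, ["-f", "mp4", "-o", outputPath, videoUrl]);
    return { kind: "file", path: outputPath };
  } catch {
    // Fallback: ask yt-dlp to print a direct media URL instead of downloading.
    const stdout = await run(ytDlpPath, ["-g", "-f", "mp4", videoUrl]);
    const url = stdout
      .split("\n")
      .map((line) => line.trim())
      .find((line) => line.length > 0);
    if (url === undefined) throw new Error("yt-dlp returned no stream URL");
    return { kind: "stream", url };
  }
}
```

In a Node.js program the `run` argument could wrap `execFile` from `node:child_process`; either result can then be handed to ffmpeg as its input.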

### Scene Detection and Timestamp Calibration

Before extracting frames, the system calibrates detection parameters. The `probeVideoInfo` function uses **ffprobe** to obtain duration and bitrate. Then `calibrateSceneThreshold` (line 1009) runs a short ffmpeg pass to determine an optimal scene-change threshold. If `--slides-auto-tune` is enabled, this threshold is stored in the result metadata.
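To make the probing step concrete, here is one way to ask ffprobe for duration and bitrate as JSON and parse the answer. `buildFfprobeArgs` and `parseFfprobeJson` are hypothetical names; the ffprobe options are standard, but the exact invocation used by `probeVideoInfo` is an assumption.

```typescript
// Hypothetical shape; the repo's own type may carry more fields.
interface VideoInfo {
  durationSeconds: number | null;
  bitrate: number | null;
}

function buildFfprobeArgs(inputPath: string): string[] {
  return [
    "-v", "error",
    "-show_entries", "format=duration,bit_rate",
    "-of", "json",
    inputPath,
  ];
}

function parseFfprobeJson(stdout: string): VideoInfo {
  // ffprobe reports numeric fields as strings in its JSON output.
  const parsed = JSON.parse(stdout) as {
    format?: { duration?: string; bit_rate?: string };
  };
  const toNumber = (value: string | undefined): number | null => {
    const n = Number(value);
    return value !== undefined && Number.isFinite(n) ? n : null;
  };
  const format = parsed.format;
  return {
    durationSeconds: toNumber(format ? format.duration : undefined),
    bitrate: toNumber(format ? format.bit_rate : undefined),
  };
}
```

Treating missing or non-numeric fields as `null` matters for streamed inputs, where ffprobe may not be able to report a bitrate.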

The actual timestamp detection happens in `detectSlideTimestamps` (line 966), which calls `detectSceneTimestamps` (line 1085). This runs ffmpeg with `-vf select='gt(scene,THRESH)'` across parallel video segments to identify sharp visual changes that indicate slide boundaries.
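A minimal sketch of scene detection on one segment might look like this. It assumes the `showinfo` filter is appended after `select` so that each selected frame logs a `pts_time` value to stderr, and that seeking with `-ss` before `-i` restarts timestamps at zero, so the segment start is added back. Function names are hypothetical and the repository's actual arguments may differ.

```typescript
// Hypothetical helper: ffmpeg arguments for scene detection on one segment.
function buildSceneDetectArgs(
  inputPath: string,
  threshold: number,
  startSeconds: number,
  durationSeconds: number,
): string[] {
  return [
    "-hide_banner",
    "-ss", String(startSeconds),
    "-t", String(durationSeconds),
    "-i", inputPath,
    "-vf", `select='gt(scene,${threshold})',showinfo`,
    "-an",
    "-f", "null",
    "-",
  ];
}

// Pull pts_time values out of showinfo log lines and shift them by the
// segment start so they refer to positions in the full video.
function parseShowinfoTimestamps(stderr: string, offsetSeconds = 0): number[] {
  const timestamps: number[] = [];
  const pattern = /pts_time:\s*([0-9]+(?:\.[0-9]+)?)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(stderr)) !== null) {
    timestamps.push(Number(match[1]) + offsetSeconds);
  }
  return timestamps;
}
```

Running one such command per segment and concatenating the shifted timestamps gives the parallel behavior described above.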

### Frame Extraction and Quality Adjustment

Once timestamps are finalized through `buildIntervalTimestamps`, `filterTimestampsByMinDuration`, and `applyMaxSlidesFilter`, the system extracts frames. The `extractFramesAtTimestamps` function (line 1118) iterates through timestamps, calling `extractFrame` (lines 665-720) for each.
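The two filtering steps can be illustrated with simplified stand-ins. `filterByMinGap` and `capSlides` below are hypothetical functions that only approximate what `filterTimestampsByMinDuration` and `applyMaxSlidesFilter` are described as doing; the spacing rule and the even-sampling strategy are assumptions.

```typescript
// Keep a timestamp only if it is at least minGapSeconds after the last kept one.
function filterByMinGap(timestamps: number[], minGapSeconds: number): number[] {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const kept: number[] = [];
  let last: number | null = null;
  for (const t of sorted) {
    if (last === null || t - last >= minGapSeconds) {
      kept.push(t);
      last = t;
    }
  }
  return kept;
}

// Reduce to at most maxSlides entries by sampling evenly across the list.
function capSlides(timestamps: number[], maxSlides: number): number[] {
  if (maxSlides <= 0) return [];
  if (timestamps.length <= maxSlides) return timestamps;
  if (maxSlides === 1) return timestamps.slice(0, 1);
  const step = (timestamps.length - 1) / (maxSlides - 1);
  const picked = new Set<number>();
  for (let i = 0; i < maxSlides; i++) picked.add(Math.round(i * step));
  return timestamps.filter((_, index) => picked.has(index));
}
```

Even sampling keeps the first and last slide, which tends to preserve title and closing frames when a long video is capped.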

The `extractFrame` function runs ffmpeg with `-vframes 1` and applies `signalstats,showinfo` filters to capture brightness and contrast statistics. If a frame is too dark or low-contrast, the system executes a quality adjustment routine (starting at line 880) that searches a ±10 second window in 2-second steps, re-extracts alternative frames, and selects the best candidate based on a quality score.
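The search window described above can be sketched as follows. The ±10 second window and 2-second step come from the description; the `FrameStats` shape, the normalized 0 to 1 ranges, and the scoring formula are assumptions made for illustration and are not taken from the repository.

```typescript
// Hypothetical stats, both normalized to the range 0..1.
interface FrameStats {
  brightness: number;
  contrast: number;
}

interface FrameCandidate {
  timestamp: number;
  stats: FrameStats;
}

// Alternative timestamps around `center`, excluding the original position
// and anything outside the video.
function candidateTimestamps(
  center: number,
  windowSeconds = 10,
  stepSeconds = 2,
  durationSeconds = Number.POSITIVE_INFINITY,
): number[] {
  const out: number[] = [];
  for (let offset = -windowSeconds; offset <= windowSeconds; offset += stepSeconds) {
    const t = center + offset;
    if (offset !== 0 && t >= 0 && t <= durationSeconds) out.push(t);
  }
  return out;
}

// Assumed scoring: prefer mid-range brightness and high contrast.
function qualityScore(stats: FrameStats): number {
  const brightnessScore = 1 - Math.abs(stats.brightness - 0.5) * 2;
  return 0.5 * brightnessScore + 0.5 * stats.contrast;
}

function pickBestFrame(candidates: FrameCandidate[]): FrameCandidate | null {
  let best: FrameCandidate | null = null;
  for (const candidate of candidates) {
    if (best === null || qualityScore(candidate.stats) > qualityScore(best.stats)) {
      best = candidate;
    }
  }
  return best;
}
```

Each candidate timestamp would be re-extracted and measured before `pickBestFrame` chooses among them, so the window size directly bounds the extra ffmpeg work per slide.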

## OCR Processing for Text Extraction

When the `--slides-ocr` flag is provided, the pipeline extends to perform optical character recognition on every extracted slide.

### Running Tesseract on Extracted Frames

The `runOcrOnSlides` function (line 2000 in [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts)) spawns **tesseract** processes in parallel for each PNG file. This operation requires tesseract to be available on the system PATH.
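Parallel OCR with a bounded number of concurrent processes can be sketched like this. `ocrSlides`, `OcrRunner`, and the default concurrency of 4 are hypothetical; the argument form `tesseract <image> stdout -l <lang>` is standard Tesseract CLI usage, though the repository may pass different options.

```typescript
// Hypothetical runner type: receives tesseract arguments, resolves with stdout.
type OcrRunner = (tesseractArgs: string[]) => Promise<string>;

function buildTesseractArgs(imagePath: string, language = "eng"): string[] {
  // "stdout" as the output base makes tesseract print the text instead of
  // writing a .txt file next to the image.
  return [imagePath, "stdout", "-l", language];
}

async function ocrSlides(
  imagePaths: string[],
  run: OcrRunner,
  concurrency = 4,
): Promise<string[]> {
  const results: string[] = imagePaths.map(() => "");
  const queue = imagePaths.map((path, index) => ({ path, index }));
  const worker = async (): Promise<void> => {
    for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
      try {
        results[job.index] = (await run(buildTesseractArgs(job.path))).trim();
      } catch {
        results[job.index] = ""; // a failed slide should not abort the batch
      }
    }
  };
  const workerCount = Math.max(1, Math.min(concurrency, imagePaths.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Capping the worker count keeps a video with many slides from launching one tesseract process per image at the same time, while results stay in slide order because each job writes to its own index.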

### Confidence Estimation and Result Storage

After OCR completes, `estimateOcrConfidence` evaluates the text quality. The final `SlideExtractionResult` object, containing slide metadata, OCR text, confidence estimates, and any auto-tuning information, is persisted to a `slides.json` file in the slide output directory via `writeSlidesJson` (line 1660).
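The article does not spell out how `estimateOcrConfidence` scores text, so the following is only one plausible heuristic: the share of visible characters that are letters or digits. `estimateTextConfidence` is a hypothetical function and should not be read as the repository's algorithm.

```typescript
// Assumed heuristic: OCR noise tends to show up as stray punctuation and
// symbols, so a high ratio of alphanumeric characters suggests cleaner text.
function estimateTextConfidence(text: string): number {
  const visible = text.replace(/\s+/g, "");
  if (visible.length === 0) return 0;
  const alphanumeric = visible.replace(/[^A-Za-z0-9]/g, "").length;
  return alphanumeric / visible.length;
}
```

A score like this could be stored next to each slide's OCR text so that consumers of the JSON output can skip or down-rank noisy slides.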

## Implementation Details and Key Files

The slide extraction system spans several modules within the `src/slides/` directory:

| File | Responsibility |
|------|----------------|
| [`src/slides/index.ts`](https://github.com/steipete/summarize/blob/main/src/slides/index.ts) | Public API entry points: `resolveSlideSource`, `extractSlidesForSource`. |
| [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts) | Core pipeline implementation including download, scene detection, frame extraction, thumbnail adjustment, and OCR orchestration. |
| [`src/slides/types.ts`](https://github.com/steipete/summarize/blob/main/src/slides/types.ts) | TypeScript definitions for `SlideImage`, `SlideExtractionResult`, and related interfaces. |
| [`src/slides/store.ts`](https://github.com/steipete/summarize/blob/main/src/slides/store.ts) | Cache handling, slide-directory ID generation, and slide-JSON persistence logic. |
| [`src/run/slides-cli.ts`](https://github.com/steipete/summarize/blob/main/src/run/slides-cli.ts) | Command-line interface parsing for `--slides` and `--slides-ocr` flags. |
| [`src/run/flows/url/slides-output.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/url/slides-output.ts) | Transforms extraction results for downstream UI rendering. |

## Usage Examples

Extract slides from a YouTube video using the CLI:

```bash
# Basic slide extraction (thumbnails only)
summarize "https://www.youtube.com/watch?v=abcd1234" --slides

# Slide extraction + OCR (requires tesseract on $PATH)
summarize "https://youtu.be/abcd1234" --slides --slides-ocr
```

Programmatic usage via the core library:

```typescript
import { extractSlidesForSource } from "@steipete/summarize-core";
import { resolveSlideSource } from "@steipete/summarize-core/slides";

const source = resolveSlideSource({ 
  url: "https://www.youtube.com/watch?v=abcd1234", 
  extracted: {} 
});

const settings = { 
  slides: true, 
  slidesOcr: true, 
  cwd: "/tmp/my-summaries" 
};

const result = await extractSlidesForSource({
  source,
  settings,
  env: process.env,
  timeoutMs: 300_000,
  ytDlpPath: "/usr/local/bin/yt-dlp",
  ffmpegPath: "/usr/bin/ffmpeg",
  tesseractPath: "/usr/bin/tesseract",
});

console.log(result.slides.map(s => 
  `${s.index}: ${s.imagePath} → ${s.ocrText ?? "no OCR"}`
));
```

## Summary

- **Summarize** extracts slides from YouTube videos using a multi-stage pipeline implemented in [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts).
- The system uses **yt-dlp** for video acquisition and **ffmpeg** with scene detection filters (`select='gt(scene,THRESH)'`) to identify slide boundaries.
- **Tesseract OCR** processes extracted PNG frames when `--slides-ocr` is enabled, making on-screen text searchable.
- Quality adjustment algorithms automatically re-extract frames if initial captures are too dark or low-contrast.
- Results are persisted to a `slides.json` file with metadata including timestamps, image paths, and OCR confidence scores.

## Frequently Asked Questions

### What external tools does Summarize require for slide extraction?

Summarize relies on three external binaries: **yt-dlp** for downloading YouTube videos, **ffmpeg** (and ffprobe) for video processing and scene detection, and optionally **tesseract** for OCR. The CLI expects them to be available on your system PATH. When calling the library directly, the `extractSlidesForSource` function accepts explicit paths to these binaries via the `ytDlpPath`, `ffmpegPath`, and `tesseractPath` parameters.

### How does the scene detection threshold calibration work?

The `calibrateSceneThreshold` function in [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts) (line 1009) performs a preliminary ffmpeg pass on a sample of the video to statistically determine an optimal threshold for the `select='gt(scene,THRESH)'` filter. When users enable `--slides-auto-tune`, this calibrated threshold is stored in the final `SlideExtractionResult` and used instead of default values, improving detection accuracy for videos with varying transition speeds.

### Can I extract slides without running OCR?

Yes. OCR is an optional stage layered on top of slide extraction, and each is controlled by its own flag. Using `--slides` alone runs the full pipeline through frame extraction and quality adjustment but skips the `runOcrOnSlides` function. Only when `--slides-ocr` is also specified does the system spawn tesseract processes on the extracted PNG files. This lets users generate visual summaries quickly and add text recognition only when they need it.

### Where are the extracted slide images and OCR results stored?

All extraction outputs are organized in a dedicated directory managed by [`src/slides/store.ts`](https://github.com/steipete/summarize/blob/main/src/slides/store.ts). The `writeSlidesJson` function (line 1660 in [`src/slides/extract.ts`](https://github.com/steipete/summarize/blob/main/src/slides/extract.ts)) persists a `slides.json` file containing the `SlideExtractionResult` object, which includes metadata for each slide: timestamp, image file path, brightness statistics, and OCR text with confidence scores. The PNG images are stored alongside this JSON file in the configured working directory (`cwd` parameter).