# Summarize Video Mode Options: Auto, Transcript, and Understand Explained

> Explore the `auto`, `transcript`, and `understand` video modes in steipete/summarize. Learn when to extract text from audio and when to send video to LLMs for efficient content processing.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: deep-dive
- Published: 2026-02-19

---

**The `summarize` CLI offers three video processing modes—`auto` (default), `transcript`, and `understand`—that determine whether the tool extracts text from audio or sends raw video files to vision-capable LLMs.**

When processing video URLs with the `steipete/summarize` repository, the **`--video-mode`** flag controls how the pipeline handles multimedia input. The option type is defined in [`src/flags.ts`](https://github.com/steipete/summarize/blob/main/src/flags.ts) and integrates with the runner configuration to decide between transcription and native video understanding.

## The Three Video Processing Modes

### Auto Mode (Default)

**`auto`** is the default behavior. It prefers *video understanding* when the selected model supports video attachments. If the model cannot process video files (as is the case for most OpenAI or Anthropic models), the runner silently falls back to *transcript* mode.

Use this mode when you want the best possible result without manually checking model capabilities. The pipeline automatically switches strategies based on the `model.apiStatus.googleConfigured` check and transport availability in [`src/run/flows/url/flow.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/url/flow.ts).

### Transcript Mode

**`transcript`** forces the pipeline to extract or fetch a text transcript from the video source and passes only that text to the LLM. In this mode, no video data is sent to the model.

Choose this option when you know your target model lacks vision capabilities, when you need deterministic text-only summarization for cost control, or when processing long videos where only the spoken content matters. This mode sets `mediaTranscript: "prefer"` in the request configuration.

### Understand Mode

**`understand`** strictly requires the model to process the raw video file directly and aborts with an error if the model cannot handle video inputs. Unlike `auto`, this mode never falls back to transcription.

Use this when visual information—such as slides, on-screen text, or visual cues—is critical to the summary. Currently, this requires Google Gemini models (e.g., Gemini-Pro-Vision) that support native video transport. The runner verifies this capability through the `wantsVideoUnderstanding` boolean flag combined with model kind checks before calling `loadRemoteAsset`.

## How the Runner Decides

The decision logic flows through three key stages in the codebase:

1. **Parsing** – The `parseVideoMode` function in [`src/flags.ts`](https://github.com/steipete/summarize/blob/main/src/flags.ts) validates the CLI input against the three allowed strings and throws on invalid values.

2. **Configuration Merging** – The resolved value lands in `RunConfig.videoMode` via [`src/run/run-config.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-config.ts), merging CLI flags, config file settings (`media.videoMode`), or defaulting to `"auto"`.

3. **Execution Branching** – In [`src/run/flows/url/flow.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/url/flow.ts), the runner evaluates `flags.videoMode` alongside model capabilities:
   - If `videoMode` is either `"understand"` or `"auto"`, and the model supports native video transport, the runner downloads the video and attaches it to the LLM request.
   - If the model lacks video support, `auto` silently falls back to transcript extraction, while `understand` throws an error.
   - The `transcript` mode bypasses video download entirely, relying on caption extraction or `yt-dlp` fallback.
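The first two stages can be sketched roughly as follows. This is a minimal illustration, not the repository's code: only `parseVideoMode`, `videoMode`, and `media.videoMode` come from the description above, while `resolveVideoMode`, `ConfigFile`, the trimming and lowercasing of input, and the assumption that a CLI flag takes precedence over the config file are hypothetical.

```typescript
// Hypothetical sketch: apart from parseVideoMode and media.videoMode,
// the names and the exact precedence rules here are assumptions.
const VIDEO_MODES = ["auto", "transcript", "understand"] as const;
type VideoMode = (typeof VIDEO_MODES)[number];

// Stage 1: validate raw input against the three allowed strings.
// (Normalizing case and whitespace is an assumption of this sketch.)
function parseVideoMode(raw: string): VideoMode {
  const normalized = raw.trim().toLowerCase();
  const match = VIDEO_MODES.find((mode) => mode === normalized);
  if (match === undefined) {
    throw new Error(
      `Unsupported --video-mode "${raw}" (expected one of: ${VIDEO_MODES.join(", ")})`,
    );
  }
  return match;
}

// Shape of the relevant slice of a config file (illustrative only).
interface ConfigFile {
  media?: { videoMode?: string };
}

// Stage 2: prefer the CLI flag, then the config file, then "auto".
function resolveVideoMode(
  cliValue: string | undefined,
  config: ConfigFile,
): VideoMode {
  const raw = cliValue ?? config.media?.videoMode;
  return raw === undefined ? "auto" : parseVideoMode(raw);
}
```

Deriving the `VideoMode` type from a single `as const` array keeps the validator and the type from drifting apart when a mode is added or renamed.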

Additional checks in [`src/run/flows/asset/summary.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/asset/summary.ts) verify `ctx.videoMode !== "transcript"` before enabling video-understanding code paths.
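The branching described in stage 3, including the `!== "transcript"` guard, reduces to a small decision function. The sketch below is an assumption-laden simplification: `chooseStrategy`, `Strategy`, and the single `modelSupportsVideo` boolean are hypothetical stand-ins for the runner's actual model kind and transport checks.

```typescript
type VideoMode = "auto" | "transcript" | "understand";
type Strategy = "attach-video" | "transcript";

// Hypothetical helper: `modelSupportsVideo` stands in for whatever
// capability checks the real runner performs on the selected model.
function chooseStrategy(mode: VideoMode, modelSupportsVideo: boolean): Strategy {
  // Both "auto" and "understand" ask for video understanding.
  const wantsVideoUnderstanding = mode !== "transcript";

  if (wantsVideoUnderstanding && modelSupportsVideo) {
    return "attach-video";
  }

  // "understand" never falls back: fail fast instead.
  if (mode === "understand") {
    throw new Error(
      "--video-mode understand requires a model with native video support",
    );
  }

  // "transcript" always lands here; "auto" lands here as a fallback.
  return "transcript";
}
```

Under these assumptions, `chooseStrategy("auto", false)` returns `"transcript"`, while `chooseStrategy("understand", false)` throws, which mirrors the difference between the two modes.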

## Usage Examples

### Command Line Interface

```bash
# Let the runner pick the best available method (default)
summarize "https://www.youtube.com/watch?v=abc123" --video-mode auto

# Force text extraction only, no video upload
summarize "https://example.com/video.mp4" --video-mode transcript

# Require native video understanding (Gemini only)
summarize "https://example.com/video.mp4" --video-mode understand
```

### Programmatic API

```typescript
import { Summarizer } from "@steipete/summarize-core";

await Summarizer.run({
  input: "https://example.com/video.mp4",
  videoMode: "understand", // "auto" | "transcript" | "understand"
});
```

## Summary

- **`auto`** selects between video understanding and transcription based on model capabilities, so you do not need to check them yourself.
- **`transcript`** guarantees text-only processing by extracting audio transcripts, making it compatible with any LLM and reducing token costs.
- **`understand`** mandates raw video processing and fails fast if the model lacks vision support, ensuring access to visual content like slides or diagrams.
- The logic is centralized in [`src/run/flows/url/flow.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/url/flow.ts) and depends on the `wantsVideoUnderstanding` boolean and model-specific capability checks.

## Frequently Asked Questions

### What is the default video mode in summarize?

The default mode is **`auto`**, which attempts video understanding for capable models (currently Google Gemini) and automatically falls back to transcript extraction for all other models. This default is defined in the configuration schema and [`src/flags.ts`](https://github.com/steipete/summarize/blob/main/src/flags.ts).

### Which models support the understand video mode?

Currently, only **Google Gemini models** (such as Gemini-Pro-Vision) support the `understand` mode because they provide native video transport capabilities. The runner checks for `model.apiStatus.googleConfigured` and specific model kind flags before attempting to send raw video data.

### Can I use understand mode with OpenAI or Anthropic models?

No. If you attempt to use `--video-mode understand` with models that lack video capabilities, the command will abort with an error. For these models, use `--video-mode transcript` to extract spoken content, or use `--video-mode auto` to let the system default to transcription automatically.

### How does transcript mode handle videos without captions?

When operating in `transcript` mode, the pipeline attempts to extract audio using tools like `yt-dlp` if no native captions are available. The extracted audio is then transcribed into text before being sent to the LLM, ensuring you receive a text summary even for uncaptioned content.
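The captions-first, audio-second order can be sketched as below. This is only an illustration of the fallback pattern, not the tool's implementation: `getTranscript`, `TranscriptProviders`, `fetchCaptions`, and `transcribeAudio` are hypothetical names, and the providers are injected so the sketch does not presume how `yt-dlp` or the transcription step is actually invoked.

```typescript
type TranscriptSource = "captions" | "audio";

interface TranscriptResult {
  text: string;
  source: TranscriptSource;
}

// Hypothetical provider interface; real implementations might wrap a
// caption fetcher and a yt-dlp download followed by speech-to-text.
interface TranscriptProviders {
  fetchCaptions(url: string): Promise<string | null>;
  transcribeAudio(url: string): Promise<string>;
}

async function getTranscript(
  url: string,
  providers: TranscriptProviders,
): Promise<TranscriptResult> {
  // Prefer existing captions: they are cheap and need no audio download.
  const captions = await providers.fetchCaptions(url);
  if (captions !== null && captions.trim() !== "") {
    return { text: captions, source: "captions" };
  }

  // No usable captions: fall back to extracting and transcribing audio.
  const text = await providers.transcribeAudio(url);
  return { text, source: "audio" };
}
```

Returning the `source` alongside the text makes it easy to log which path produced the transcript for a given video.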