Summarize Video Mode Options: Auto, Transcript, and Understand Explained

The summarize CLI offers three video processing modes—auto (default), transcript, and understand—that determine whether the tool extracts text from audio or sends raw video files to vision-capable LLMs.

When processing video URLs with the steipete/summarize repository, the --video-mode flag controls how the pipeline handles multimedia input. The option type is defined in src/flags.ts and integrates with the runner configuration to decide between transcription and native video understanding.

The Three Video Processing Modes

Auto Mode (Default)

auto is the default behavior that prefers video understanding when the selected LLM model supports video attachments. If the model cannot process video files—such as most OpenAI or Anthropic models—the runner quietly falls back to transcript mode.

Use this mode when you want the best possible result without manually checking model capabilities. The pipeline automatically switches strategies based on the model.apiStatus.googleConfigured check and transport availability in src/run/flows/url/flow.ts.

Transcript Mode

transcript forces the pipeline to extract or fetch a text transcript from the video source and passes only that text to the LLM. In this mode, no video data is sent to the model.

Choose this option when you know your target model lacks vision capabilities, when you need deterministic text-only summarization for cost control, or when processing long videos where only the spoken content matters. This mode sets mediaTranscript: "prefer" in the request configuration.
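To make the effect of transcript mode concrete, here is a minimal sketch of how a mode could be translated into request options. Only the mediaTranscript: "prefer" setting comes from the description above; the RequestOptions shape, the allowVideoAttachment field, and the buildRequestOptions helper are hypothetical names used for illustration, not the repository's actual types.

```typescript
type VideoMode = "auto" | "transcript" | "understand";

// Hypothetical request shape. Only `mediaTranscript: "prefer"` is taken from
// the article; `allowVideoAttachment` is an illustrative assumption.
interface RequestOptions {
  mediaTranscript?: "prefer";
  allowVideoAttachment: boolean;
}

// Transcript mode prefers a text transcript and rules out video attachments.
// The other modes merely permit video paths; whether video is actually sent
// still depends on the model's capabilities.
function buildRequestOptions(mode: VideoMode): RequestOptions {
  if (mode === "transcript") {
    return { mediaTranscript: "prefer", allowVideoAttachment: false };
  }
  return { allowVideoAttachment: true };
}
```

In this sketch, allowVideoAttachment only says that video code paths are permitted, which is why auto and understand share the same value.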

Understand Mode

understand strictly requires the model to process the raw video file directly and aborts with an error if the model cannot handle video inputs. Unlike auto, this mode never falls back to transcription.

Use this when visual information—such as slides, on-screen text, or visual cues—is critical to the summary. Currently, this requires Google Gemini models (e.g., Gemini-Pro-Vision) that support native video transport. The runner verifies this capability through the wantsVideoUnderstanding boolean flag combined with model kind checks before calling loadRemoteAsset.

How the Runner Decides

The decision logic flows through three key stages in the codebase:

  1. Parsing – The parseVideoMode function in src/flags.ts validates the CLI input against the three allowed strings and throws on invalid values.

  2. Configuration Merging – The resolved value lands in RunConfig.videoMode via src/run/run-config.ts, merging CLI flags, config file settings (media.videoMode), or defaulting to "auto".

  3. Execution Branching – In src/run/flows/url/flow.ts, the runner evaluates flags.videoMode alongside model capabilities:

    • If videoMode is "understand" or "auto" and the model supports native video transport, the runner downloads the video and attaches it to the LLM request.
    • If the model lacks video support, auto silently falls back to transcript extraction, while understand throws an error.
    • The transcript mode bypasses video download entirely, relying on caption extraction or yt-dlp fallback.

Additional checks in src/run/flows/asset/summary.ts verify ctx.videoMode !== "transcript" before enabling video-understanding code paths.
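The three stages above can be sketched in a few self-contained functions. This is a simplified illustration, not the code from the repository: the function bodies, the Strategy type, the error messages, and the assumption that a CLI flag takes precedence over the config file value are all made up for the example. Only the three mode strings, the "auto" default, and the fallback and fail-fast behavior come from the description above.

```typescript
type VideoMode = "auto" | "transcript" | "understand";
type Strategy = "attach-video" | "transcript"; // hypothetical labels

const VIDEO_MODES: readonly VideoMode[] = ["auto", "transcript", "understand"];

function isVideoMode(value: string): value is VideoMode {
  return (VIDEO_MODES as readonly string[]).indexOf(value) !== -1;
}

// Stage 1: validate raw input against the three allowed strings.
function parseVideoMode(raw: string): VideoMode {
  if (isVideoMode(raw)) {
    return raw;
  }
  throw new Error(`Unsupported video mode: ${raw}`);
}

// Stage 2: merge sources. Assumed precedence: CLI flag, then config file
// value (media.videoMode), then the "auto" default.
function resolveVideoMode(cliValue?: string, configValue?: string): VideoMode {
  if (cliValue !== undefined) return parseVideoMode(cliValue);
  if (configValue !== undefined) return parseVideoMode(configValue);
  return "auto";
}

// Stage 3: branch on the mode and the model's video capability.
function chooseStrategy(mode: VideoMode, modelSupportsVideo: boolean): Strategy {
  if (mode === "transcript") return "transcript"; // never touches video
  if (modelSupportsVideo) return "attach-video";  // auto or understand
  if (mode === "understand") {
    // understand never falls back to transcription
    throw new Error("Model cannot process video; understand mode requires it");
  }
  return "transcript"; // auto falls back silently
}
```

For example, chooseStrategy(resolveVideoMode(undefined, undefined), false) resolves to the "auto" default and then falls back to "transcript", while chooseStrategy("understand", false) throws.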

Usage Examples

Command Line Interface


# Let the runner pick the best available method (default)
summarize "https://www.youtube.com/watch?v=abc123" --video-mode auto

# Force text extraction only, no video upload
summarize "https://example.com/video.mp4" --video-mode transcript

# Require native video understanding (Gemini only)
summarize "https://example.com/video.mp4" --video-mode understand

Programmatic API

import { Summarizer } from "@steipete/summarize-core";

await Summarizer.run({
  input: "https://example.com/video.mp4",
  videoMode: "understand", // "auto" | "transcript" | "understand"
});

Summary

  • auto selects between video understanding and transcription based on model capabilities, so you do not need to check them manually.
  • transcript guarantees text-only processing by extracting audio transcripts, making it compatible with any LLM and reducing token costs.
  • understand mandates raw video processing and fails fast if the model lacks vision support, ensuring access to visual content like slides or diagrams.
  • The logic is centralized in src/run/flows/url/flow.ts and depends on the wantsVideoUnderstanding boolean and model-specific capability checks.

Frequently Asked Questions

What is the default video mode in summarize?

The default mode is auto, which attempts video understanding for capable models (currently Google Gemini) and automatically falls back to transcript extraction for all other models. This default is defined in the configuration schema and src/flags.ts.

Which models support the understand video mode?

Currently, only Google Gemini models (such as Gemini-Pro-Vision) support the understand mode because they provide native video transport capabilities. The runner checks for model.apiStatus.googleConfigured and specific model kind flags before attempting to send raw video data.

Can I use understand mode with OpenAI or Anthropic models?

No. If you attempt to use --video-mode understand with models that lack video capabilities, the command will abort with an error. For these models, use --video-mode transcript to extract spoken content, or use --video-mode auto to let the system default to transcription automatically.

How does transcript mode handle videos without captions?

When operating in transcript mode, the pipeline attempts to extract audio using tools like yt-dlp if no native captions are available. The extracted audio is then transcribed into text before being sent to the LLM, ensuring you receive a text summary even for uncaptioned content.
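The captions-first, audio-second order described here can be sketched as a small fallback chain. The TranscriptSource type, the fetchTranscript function, and the two injected providers are hypothetical; in the real tool the audio path would involve yt-dlp and a transcription step, which are represented here only by a caller-supplied function.

```typescript
// Hypothetical provider signature: resolves to transcript text, or null when
// this source has nothing for the given URL.
type TranscriptSource = (url: string) => Promise<string | null>;

function hasText(value: string | null): value is string {
  return value !== null && value.trim() !== "";
}

// Try native captions first; only when they are missing, fall back to
// extracting and transcribing the audio track.
async function fetchTranscript(
  url: string,
  fetchCaptions: TranscriptSource,
  transcribeAudio: TranscriptSource,
): Promise<string> {
  const captions = await fetchCaptions(url);
  if (hasText(captions)) return captions;

  const transcribed = await transcribeAudio(url);
  if (hasText(transcribed)) return transcribed;

  throw new Error(`No transcript available for ${url}`);
}
```

Passing the providers in as arguments keeps the sketch independent of any particular caption or transcription backend and makes the fallback order easy to test with stubs.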
