How YouTube Transcript Fetching Works in steipete/summarize: Caption Tracks, yt-dlp, and Apify

The steipete/summarize library fetches YouTube transcripts through a cascading fallback system: it tries the youtubei API first, then caption tracks, then yt-dlp with Whisper transcription, and finally Apify cloud scraping, and it returns the first transcript any strategy produces. The configured mode determines which of these strategies are attempted.

YouTube transcript fetching in the steipete/summarize repository is designed as a resilient, multi-strategy pipeline that maximizes availability across different video types, regional restrictions, and caption configurations. The core orchestration logic resides in packages/core/src/content/transcript/providers/youtube.ts, which coordinates four independent mechanisms: the internal youtubei API, direct caption track extraction, local audio processing via yt-dlp and Whisper, and remote scraping via Apify.

Transcript Provider Architecture

The orchestrator in packages/core/src/content/transcript/providers/youtube.ts implements a priority-based fallback chain. When fetchTranscript is invoked, the system first extracts the video ID using extractYouTubeVideoId from packages/core/src/content/transcript/utils.ts, then determines the video duration through a sequence of metadata checks: extractYoutubeDurationSeconds (HTML parsing), fetchYoutubeDurationSecondsViaPlayer (youtubei player API), and finally fetchDurationSecondsWithYtDlp as a last resort.
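
The duration lookup described above is a first-success chain: each step is tried only if the previous one produced nothing. A minimal sketch of that pattern follows, assuming each step resolves to a number of seconds or null. The firstDuration helper and the DurationSource type are hypothetical names used for illustration, not the library's actual code.

```typescript
// A duration source resolves to seconds, or null when it cannot determine them.
type DurationSource = () => Promise<number | null>;

// Try each source in order and return the first positive, finite duration.
// A source that throws is treated the same as one that returns null.
async function firstDuration(sources: DurationSource[]): Promise<number | null> {
  for (const source of sources) {
    try {
      const seconds = await source();
      if (seconds !== null && Number.isFinite(seconds) && seconds > 0) {
        return seconds;
      }
    } catch {
      // Ignore the failure and fall through to the next source.
    }
  }
  return null;
}
```

In the real orchestrator the three entries would correspond to the HTML parser, the player API request, and the yt-dlp probe, in that order, so the cheapest check runs first.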

Based on the youtubeTranscriptMode option (set via the --youtube CLI flag), the orchestrator pushes a UI hint via pushHint and attempts providers in the following order:

| Mode | Primary Source | Fallback Chain |
|---|---|---|
| auto (default) | youtubei API | caption tracks → yt-dlp → Apify |
| web | youtubei API | caption tracks → yt-dlp → Apify |
| no-auto | caption tracks (manual only) | yt-dlp |
| yt-dlp | yt-dlp (audio download) | none |
| apify | Apify (cloud scraper) | none |
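
The mode table can be read as a lookup from mode to an ordered provider list. The sketch below encodes it that way and walks the list until one provider returns text. The PROVIDER_CHAINS constant, the Attempt type, and runChain are hypothetical names for illustration; the provider labels reuse the source values shown later in this article, and the real orchestrator may be structured differently.

```typescript
type YoutubeMode = "auto" | "web" | "no-auto" | "yt-dlp" | "apify";
type Provider = "youtubei" | "captionTracks" | "yt-dlp" | "apify";

// Ordered provider list per mode, transcribed from the table.
// In no-auto mode the captionTracks step additionally skips auto-generated tracks.
const PROVIDER_CHAINS: Record<YoutubeMode, readonly Provider[]> = {
  auto: ["youtubei", "captionTracks", "yt-dlp", "apify"],
  web: ["youtubei", "captionTracks", "yt-dlp", "apify"],
  "no-auto": ["captionTracks", "yt-dlp"],
  "yt-dlp": ["yt-dlp"],
  apify: ["apify"],
};

// Hypothetical callback: resolves to transcript text, or null when unavailable.
type Attempt = (provider: Provider) => Promise<string | null>;

// Run providers in order and stop at the first non-empty transcript.
async function runChain(
  mode: YoutubeMode,
  attempt: Attempt,
): Promise<{ source: Provider; text: string } | null> {
  for (const provider of PROVIDER_CHAINS[mode]) {
    try {
      const text = await attempt(provider);
      if (text !== null && text.trim() !== "") {
        return { source: provider, text };
      }
    } catch {
      // A failing provider does not abort the chain.
    }
  }
  return null;
}
```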

The Four Transcript Fetching Strategies

youtubei API Extraction

The youtubei provider in packages/core/src/content/transcript/providers/youtube/api.ts extracts a transcript configuration from the bootstrap JSON embedded in the YouTube page using extractYoutubeiTranscriptConfig. It then POSTs to https://www.youtube.com/youtubei/v1/get_transcript via fetchTranscriptFromTranscriptEndpoint and parses the nested response with extractTranscriptFromTranscriptEndpoint. This method is fastest when available but fails on videos with restricted API access.
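
To make the request step concrete, here is a sketch of how the POST could be assembled once a configuration has been extracted from the page. The YoutubeiTranscriptConfig shape, the optional key query parameter, and the buildTranscriptRequest helper are assumptions for illustration; the body layout follows the commonly observed context/client/params structure of youtubei requests and is not copied from the library.

```typescript
// Hypothetical shape for the values pulled from the page's bootstrap JSON.
interface YoutubeiTranscriptConfig {
  apiKey: string | null;
  clientName: string;
  clientVersion: string;
  params: string; // opaque token identifying which transcript to load
}

interface PostRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const TRANSCRIPT_ENDPOINT = "https://www.youtube.com/youtubei/v1/get_transcript";

// Build the request description without sending it, so it is easy to inspect.
function buildTranscriptRequest(config: YoutubeiTranscriptConfig): PostRequest {
  const url = config.apiKey
    ? `${TRANSCRIPT_ENDPOINT}?key=${encodeURIComponent(config.apiKey)}`
    : TRANSCRIPT_ENDPOINT;
  return {
    url,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      context: {
        client: {
          clientName: config.clientName,
          clientVersion: config.clientVersion,
        },
      },
      params: config.params,
    }),
  };
}
```

A caller would pass the result to fetch, for example with const { url, ...init } = buildTranscriptRequest(config) followed by fetch(url, init), and treat any non-2xx response as a signal to move on to the next provider.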

Caption Track Parsing

The caption tracks provider in packages/core/src/content/transcript/providers/youtube/captions.ts parses ytInitialPlayerResponse from the page HTML using extractInitialPlayerResponse to locate captionTracks and automaticCaptions. It orders tracks to prioritize manual English captions, deduplicates languages, and respects the skipAutoGenerated flag. Each track is fetched via downloadCaptionTrack, which attempts JSON-3 format first (preferred) and falls back to XML via downloadXmlTranscript if necessary.
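
The ordering rules above can be sketched as a small pure function. This version assumes the track fields YouTube exposes in captionTracks (baseUrl, languageCode, and kind set to "asr" for auto-generated tracks) and one plausible ranking: manual English, other manual languages, auto-generated English, then other auto-generated languages. The orderCaptionTracks name and the exact ranking are illustrative assumptions, not the library's implementation.

```typescript
interface CaptionTrack {
  baseUrl: string;
  languageCode: string;
  kind?: string; // "asr" marks an auto-generated (speech recognition) track
}

const isAuto = (t: CaptionTrack): boolean => t.kind === "asr";
const isEnglish = (t: CaptionTrack): boolean =>
  t.languageCode === "en" || t.languageCode.startsWith("en-");

// Lower rank is tried first: manual en = 0, manual other = 1, auto en = 2, auto other = 3.
const rank = (t: CaptionTrack): number => (isAuto(t) ? 2 : 0) + (isEnglish(t) ? 0 : 1);

function orderCaptionTracks(
  tracks: CaptionTrack[],
  skipAutoGenerated: boolean,
): CaptionTrack[] {
  const candidates = skipAutoGenerated ? tracks.filter((t) => !isAuto(t)) : tracks;
  // Array.prototype.sort is stable, so ties keep their original order.
  const sorted = [...candidates].sort((a, b) => rank(a) - rank(b));
  // Keep only the best-ranked track per language.
  const seen = new Set<string>();
  return sorted.filter((t) => {
    if (seen.has(t.languageCode)) return false;
    seen.add(t.languageCode);
    return true;
  });
}
```

Because deduplication runs after sorting, a manual track always wins over an auto-generated track in the same language.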

yt-dlp with Whisper Transcription

The yt-dlp provider in packages/core/src/content/transcript/providers/youtube/yt-dlp.ts handles cases where no captions exist. It runs the external yt-dlp binary via downloadAudio to extract MP3 audio, using a progress template parsed by emitProgressFromLine to report download status. The audio is cached via mediaCache.put, its duration probed with ffprobe, and then transcribed using transcribeMediaFileWithWhisper with the configured provider (whisper-cpp, OpenAI, Groq, or Fal). This approach requires the yt-dlp binary and a transcription provider configured in packages/core/src/content/transcript/providers/transcription-start.ts.
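
Progress reporting of this kind usually works by asking yt-dlp to print machine-readable lines and parsing them as they arrive. The sketch below assumes a template such as --progress-template "download:progress:%(progress.downloaded_bytes)s|%(progress.total_bytes)s" combined with --newline; the progress: prefix, the pipe separator, and the parseProgressLine helper are hypothetical and may differ from what emitProgressFromLine expects.

```typescript
interface DownloadProgress {
  downloadedBytes: number;
  totalBytes: number | null; // null when yt-dlp reports the total as unknown ("NA")
}

const PROGRESS_PREFIX = "progress:";

// Parse one stdout line; returns null for lines that are not progress reports.
function parseProgressLine(line: string): DownloadProgress | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith(PROGRESS_PREFIX)) return null;

  const payload = trimmed.slice(PROGRESS_PREFIX.length);
  const [downloadedRaw = "", totalRaw = ""] = payload.split("|");

  // Empty strings would coerce to 0, so treat them as missing values instead.
  const downloaded = downloadedRaw === "" ? NaN : Number(downloadedRaw);
  if (!Number.isFinite(downloaded) || downloaded < 0) return null;

  const total = totalRaw === "" ? NaN : Number(totalRaw);
  return {
    downloadedBytes: downloaded,
    totalBytes: Number.isFinite(total) && total > 0 ? total : null,
  };
}
```

Ordinary yt-dlp log lines fall through as null, so the same handler can be attached to the whole output stream without filtering first.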

Apify Cloud Scraping

The Apify provider in packages/core/src/content/transcript/providers/youtube/apify.ts serves as the final fallback when local extraction fails. It calls the public Apify actor faVsWy9VTSNVIhWpR (pinto-youtube-transcript-scraper) via fetchTranscriptWithApify, passing the YouTube URL and optional language preferences. The response is normalized via normalizePintoTranscript to match the internal transcript format. This mode requires APIFY_API_TOKEN to be set and incurs cloud processing costs but handles geo-restricted or heavily protected videos.
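
As a rough illustration of the call and normalization steps, the sketch below builds the URL for Apify's synchronous run endpoint and flattens the returned dataset items into timed segments. The item layout (each item holding a data array of rows with start and text fields) is an assumption about the actor's output, and apifyRunUrl, normalizeSegments, and TranscriptSegment are hypothetical names rather than the library's normalizePintoTranscript.

```typescript
interface TranscriptSegment {
  startSeconds: number;
  text: string;
}

// URL for Apify's "run actor synchronously and return dataset items" endpoint.
function apifyRunUrl(actorId: string, token: string): string {
  return (
    `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}` +
    `/run-sync-get-dataset-items?token=${encodeURIComponent(token)}`
  );
}

// Flatten dataset items into segments, skipping anything malformed.
// Assumed item shape: { data: [{ start: "0.5", text: "..." }, ...] }.
function normalizeSegments(items: unknown): TranscriptSegment[] {
  if (!Array.isArray(items)) return [];
  const segments: TranscriptSegment[] = [];
  for (const item of items) {
    const rows = (item as { data?: unknown } | null | undefined)?.data;
    if (!Array.isArray(rows)) continue;
    for (const row of rows) {
      const { start, text } = (row ?? {}) as { start?: unknown; text?: unknown };
      if (typeof text !== "string" || text.trim() === "") continue;
      const startSeconds = Number(start);
      segments.push({
        // Fall back to 0 when the start time is missing or not numeric.
        startSeconds: Number.isFinite(startSeconds) ? startSeconds : 0,
        text: text.trim(),
      });
    }
  }
  return segments;
}
```

Treating the response as unknown and validating each field keeps a change in the actor's output from crashing the fallback chain; a malformed response simply yields no segments.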

Programmatic Usage Examples

Basic API Usage

import { fetchTranscript } from "@steipete/summarize-core";

const result = await fetchTranscript(
  { url: "https://www.youtube.com/watch?v=jNQXAC9IVRw", html: null },
  {
    fetch,
    youtubeTranscriptMode: "auto",
    ytDlpPath: "/usr/local/bin/yt-dlp",
    apifyApiToken: process.env.APIFY_API_TOKEN,
    transcription: { provider: "openai" },
    onProgress: (ev) => console.log(ev),
  },
);

console.log(result.text);
console.log(result.source); // "youtubei" | "captionTracks" | "yt-dlp" | "apify"

CLI Usage


# Automatic mode (default): tries youtubei → caption tracks → yt-dlp → Apify
summarize "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Force yt-dlp audio extraction and Whisper transcription
summarize --youtube yt-dlp "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Skip auto-generated captions; fall back to yt-dlp if no manual captions exist
summarize --youtube no-auto "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Use the Apify cloud scraper (requires APIFY_API_TOKEN)
summarize --youtube apify "https://www.youtube.com/watch?v=jNQXAC9IVRw"

Progress Tracking

import { fetchTranscript } from "@steipete/summarize-core";

await fetchTranscript(
  { url: "https://www.youtube.com/watch?v=example", html: null },
  {
    onProgress: (ev) => {
      if (ev.kind === "transcript-start") {
        console.log(`Starting: ${ev.hint}`);
      } else if (ev.kind === "TranscriptMediaDownloadProgress") {
        console.log(`Downloaded ${ev.downloadedBytes}/${ev.totalBytes}`);
      } else if (ev.kind === "TranscriptWhisperProgress") {
        console.log(`Transcribed ${ev.processedDurationSeconds}s / ${ev.totalDurationSeconds}s`);
      }
    },
  }
);

Summary

  • Multi-strategy orchestration: The youtube.ts provider implements a cascading fallback system that attempts the youtubei API, caption tracks, yt-dlp, and Apify in sequence, depending on the selected mode.

  • Four extraction mechanisms: Transcripts are retrieved via the youtubei internal API (api.ts), direct caption track parsing (captions.ts), audio download with Whisper transcription (yt-dlp.ts), or cloud-based Apify scraping (apify.ts).

  • Flexible configuration: The --youtube CLI flag and youtubeTranscriptMode API option control which strategies are attempted, allowing users to prioritize speed (auto), accuracy (no-auto), or specific backends (yt-dlp, apify).

  • Robust metadata handling: Duration extraction follows its own fallback chain (extractYoutubeDurationSeconds → fetchYoutubeDurationSecondsViaPlayer → fetchDurationSecondsWithYtDlp) to ensure accurate timing information regardless of transcript source.

Frequently Asked Questions

What happens if all transcript providers fail?

If the youtubei API, caption tracks, yt-dlp, and Apify providers all fail to return a transcript, the fetchTranscript function returns a ProviderResult with text set to null and source indicating the last attempted provider. The error details are typically logged through the onProgress callback, allowing calling code to detect failure and handle it appropriately, such as by prompting the user or skipping the video.
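
Calling code can treat a null or empty text field as the failure signal. A minimal sketch, assuming only the text and source fields described above; the TranscriptOutcome type and hasTranscript guard are hypothetical names, not exports of the library:

```typescript
// Only the two fields this article describes; the real result may carry more.
interface TranscriptOutcome {
  text: string | null;
  source: string | null;
}

// Type guard: narrows text to string when a usable transcript is present.
function hasTranscript(
  result: TranscriptOutcome,
): result is TranscriptOutcome & { text: string } {
  return typeof result.text === "string" && result.text.trim().length > 0;
}

// Example of branching on the outcome without throwing.
function describeOutcome(result: TranscriptOutcome): string {
  return hasTranscript(result)
    ? `transcript from ${result.source ?? "unknown source"} (${result.text.length} chars)`
    : `no transcript (last provider: ${result.source ?? "none"})`;
}
```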

How does the library handle auto-generated versus manual captions?

The caption track provider in packages/core/src/content/transcript/providers/youtube/captions.ts prioritizes manual captions over auto-generated ones by ordering the captionTracks array so that manual English captions appear first. When using no-auto mode, the provider sets skipAutoGenerated to true, filtering out any tracks marked as automatic. In auto or web modes, auto-generated captions are accepted as valid fallbacks if no manual captions exist.

When should I use yt-dlp mode instead of the default auto mode?

Use yt-dlp mode when you specifically need audio-based transcription rather than text captions, such as when processing videos where captions are disabled, geo-restricted, or known to be inaccurate. This mode skips the faster API and caption track attempts, downloading the audio stream directly and processing it through Whisper. Note that yt-dlp mode requires the yt-dlp binary to be installed and accessible, plus a configured Whisper transcription provider (OpenAI, Groq, Fal, or whisper-cpp).

What is required to use the Apify provider?

The Apify provider requires a valid APIFY_API_TOKEN environment variable or configuration option to authenticate with the Apify platform. When using apify mode or when the provider is reached as a fallback in auto mode, the library calls the public actor faVsWy9VTSNVIhWpR (pinto-youtube-transcript-scraper) via fetchTranscriptWithApify in packages/core/src/content/transcript/providers/youtube/apify.ts. This approach incurs Apify platform costs but handles cases where local extraction is blocked by IP restrictions or CAPTCHA challenges.
