# How YouTube Transcript Fetching Works in steipete/summarize: Caption Tracks, yt-dlp, and Apify

> Learn how steipete/summarize fetches YouTube transcripts using the youtubei API, caption tracks, yt-dlp, and Apify, and how its fallback modes are configured.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: internals
- Published: 2026-02-19

---

**The steipete/summarize library implements a cascading fallback for YouTube transcript fetching: it tries the youtubei API first, then caption tracks, then yt-dlp with Whisper transcription, and finally Apify cloud scraping, stopping at the first strategy that returns a transcript. The configured mode determines which of these strategies are attempted.**

YouTube transcript fetching in the steipete/summarize repository is designed as a resilient, multi-strategy pipeline that maximizes availability across different video types, regional restrictions, and caption configurations. The core orchestration logic resides in [`packages/core/src/content/transcript/providers/youtube.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube.ts), which coordinates four independent mechanisms: the internal youtubei API, direct caption track extraction, local audio processing via yt-dlp and Whisper, and remote scraping via Apify.

## Transcript Provider Architecture

The orchestrator in [`packages/core/src/content/transcript/providers/youtube.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube.ts) implements a priority-based fallback chain. When `fetchTranscript` is invoked, the system first extracts the video ID using `extractYouTubeVideoId` from [`packages/core/src/content/transcript/utils.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/utils.ts). It then determines the video duration through a sequence of metadata checks: `extractYoutubeDurationSeconds` (HTML parsing), `fetchYoutubeDurationSecondsViaPlayer` (youtubei player API), and finally `fetchDurationSecondsWithYtDlp` as a last resort.
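The two preparatory steps can be illustrated with a minimal sketch. Both functions below are hypothetical stand-ins written for this article, not the repository's `extractYouTubeVideoId` or its duration helpers; the set of URL shapes handled and the "first positive number wins" rule are assumptions.

```typescript
// Hypothetical sketch: NOT the repository's extractYouTubeVideoId.
// Assumes video IDs are 11 characters of [A-Za-z0-9_-].
function extractVideoId(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable URL
  }
  const host = url.hostname.replace(/^(www|m)\./, "");
  let candidate: string | null = null;
  if (host === "youtu.be") {
    // Short links carry the ID as the first path segment.
    candidate = url.pathname.split("/")[1] ?? null;
  } else if (host === "youtube.com") {
    if (url.pathname === "/watch") {
      candidate = url.searchParams.get("v");
    } else {
      const match = url.pathname.match(/^\/(?:embed|shorts|live)\/([^/]+)/);
      candidate = match?.[1] ?? null;
    }
  }
  return candidate !== null && /^[\w-]{11}$/.test(candidate) ? candidate : null;
}

// A duration source either yields seconds, yields null, or throws.
type DurationResolver = () => Promise<number | null>;

// Tries each source in order and returns the first usable duration.
async function resolveDurationSeconds(
  resolvers: DurationResolver[],
): Promise<number | null> {
  for (const resolve of resolvers) {
    try {
      const seconds = await resolve();
      if (seconds !== null && Number.isFinite(seconds) && seconds > 0) {
        return seconds;
      }
    } catch {
      // A failing source is not fatal; fall through to the next one.
    }
  }
  return null;
}
```

In this sketch the caller would pass the cheapest source first (HTML parsing), so that the more expensive lookups only run when needed.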

Based on the `youtubeTranscriptMode` option (set via the `--youtube` CLI flag), the orchestrator pushes a UI hint via `pushHint` and attempts providers in the following order:

| Mode | Primary Source | Fallback Chain |
|------|---------------|----------------|
| `auto` (default) | youtubei API | caption tracks → yt-dlp → Apify |
| `web` | youtubei API | caption tracks → yt-dlp → Apify |
| `no-auto` | caption tracks (manual only) | yt-dlp |
| `yt-dlp` | yt-dlp (audio download) | none |
| `apify` | Apify (cloud scraper) | none |
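The table can be read as a lookup from mode to an ordered provider list. The following is a minimal sketch of that idea, not the orchestrator's actual code: `PROVIDER_ORDER`, `runChain`, and the `Provider` type are hypothetical names, and skipping unconfigured providers is an assumption. The source names match the values shown for `result.source` in the usage example further down.

```typescript
type YoutubeMode = "auto" | "web" | "no-auto" | "yt-dlp" | "apify";
type ProviderName = "youtubei" | "captionTracks" | "yt-dlp" | "apify";

// Ordered provider list per mode, transcribed from the table above.
const PROVIDER_ORDER: Record<YoutubeMode, ProviderName[]> = {
  auto: ["youtubei", "captionTracks", "yt-dlp", "apify"],
  web: ["youtubei", "captionTracks", "yt-dlp", "apify"],
  "no-auto": ["captionTracks", "yt-dlp"],
  "yt-dlp": ["yt-dlp"],
  apify: ["apify"],
};

// Hypothetical provider shape: resolves to transcript text or null.
type Provider = () => Promise<string | null>;

interface ChainResult {
  text: string | null;
  source: ProviderName | null;
}

// Runs providers in mode order and stops at the first non-empty transcript.
async function runChain(
  mode: YoutubeMode,
  providers: Partial<Record<ProviderName, Provider>>,
): Promise<ChainResult> {
  let lastAttempted: ProviderName | null = null;
  for (const name of PROVIDER_ORDER[mode]) {
    const provider = providers[name];
    if (!provider) continue; // assumption: unconfigured providers are skipped
    lastAttempted = name;
    try {
      const text = await provider();
      if (text !== null && text.trim().length > 0) {
        return { text, source: name };
      }
    } catch {
      // Treat a thrown error like an empty result and keep going.
    }
  }
  return { text: null, source: lastAttempted };
}
```

Keeping the order in a plain data structure makes each mode easy to audit against the table and keeps the loop itself free of mode-specific branches.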

## The Four Transcript Fetching Strategies

### youtubei API Extraction

The youtubei provider in [`packages/core/src/content/transcript/providers/youtube/api.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/api.ts) extracts a transcript configuration from the bootstrap JSON embedded in the YouTube page using `extractYoutubeiTranscriptConfig`. It then POSTs to `https://www.youtube.com/youtubei/v1/get_transcript` via `fetchTranscriptFromTranscriptEndpoint` and parses the nested response with `extractTranscriptFromTranscriptEndpoint`. This method is fastest when available but fails on videos with restricted API access.
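As an illustration of the POST step, the sketch below assembles a request for the `get_transcript` endpoint. It is written under assumptions rather than taken from `api.ts`: the `TranscriptConfig` shape (an API key, a client `context` object, and an opaque `params` token), the `key` query parameter, and the `buildGetTranscriptRequest` name are all hypothetical.

```typescript
// Hypothetical shape of what a function like extractYoutubeiTranscriptConfig
// might return; the real field names may differ.
interface TranscriptConfig {
  apiKey: string;
  context: Record<string, unknown>;
  params: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const GET_TRANSCRIPT_URL = "https://www.youtube.com/youtubei/v1/get_transcript";

// Builds the request without sending it, so it can be inspected or tested.
function buildGetTranscriptRequest(config: TranscriptConfig): PreparedRequest {
  return {
    // Assumption: the API key travels as a `key` query parameter.
    url: `${GET_TRANSCRIPT_URL}?key=${encodeURIComponent(config.apiKey)}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Assumption: the body carries the client context and the params token.
    body: JSON.stringify({ context: config.context, params: config.params }),
  };
}
```

A caller could then send it with `fetch(request.url, { method: request.method, headers: request.headers, body: request.body })` and hand the JSON response to a parser.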

### Caption Track Parsing

The caption tracks provider in [`packages/core/src/content/transcript/providers/youtube/captions.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/captions.ts) parses `ytInitialPlayerResponse` from the page HTML using `extractInitialPlayerResponse` to locate `captionTracks` and `automaticCaptions`. It orders tracks to prioritize manual English captions, deduplicates languages, and respects the `skipAutoGenerated` flag. Each track is fetched via `downloadCaptionTrack`, which attempts **JSON-3** format first (preferred) and falls back to **XML** via `downloadXmlTranscript` if necessary.
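The ordering, deduplication, and `skipAutoGenerated` behaviour can be sketched as follows. This is an illustration, not the code in `captions.ts`: the `CaptionTrack` fields mirror what YouTube's player response exposes (`languageCode`, `baseUrl`, and `kind: "asr"` for auto-generated tracks), while the exact ranking (manual English, other manual, auto English, other auto) and the helper names are assumptions.

```typescript
interface CaptionTrack {
  languageCode: string;
  baseUrl: string;
  kind?: string; // "asr" marks an auto-generated track
}

const isAutoGenerated = (t: CaptionTrack): boolean => t.kind === "asr";
const isEnglish = (t: CaptionTrack): boolean =>
  t.languageCode.toLowerCase().startsWith("en");

// Lower rank is tried first. Assumed order:
// manual English (0), manual other (1), auto English (2), auto other (3).
const rank = (t: CaptionTrack): number =>
  (isAutoGenerated(t) ? 2 : 0) + (isEnglish(t) ? 0 : 1);

function orderCaptionTracks(
  tracks: CaptionTrack[],
  skipAutoGenerated: boolean,
): CaptionTrack[] {
  const candidates = tracks.filter(
    (t) => !(skipAutoGenerated && isAutoGenerated(t)),
  );
  // Array.prototype.sort is stable, so ties keep their original order.
  const sorted = [...candidates].sort((a, b) => rank(a) - rank(b));
  // Keep only the best-ranked track per language.
  const seen = new Set<string>();
  return sorted.filter((t) => {
    const key = t.languageCode.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Requests the JSON-3 variant of a track by setting the `fmt` parameter.
function withJson3Format(baseUrl: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set("fmt", "json3");
  return url.toString();
}
```

Because deduplication runs after sorting, a manual track always wins over an auto-generated track in the same language.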

### yt-dlp with Whisper Transcription

The yt-dlp provider in [`packages/core/src/content/transcript/providers/youtube/yt-dlp.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/yt-dlp.ts) handles cases where no captions exist. It runs the external `yt-dlp` binary via `downloadAudio` to extract MP3 audio, using a progress template parsed by `emitProgressFromLine` to report download status. The audio is cached via `mediaCache.put`, its duration probed with `ffprobe`, and then transcribed using `transcribeMediaFileWithWhisper` with the configured provider (whisper-cpp, OpenAI, Groq, or Fal). This approach requires the `yt-dlp` binary and a transcription provider configured in [`packages/core/src/content/transcript/providers/transcription-start.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/transcription-start.ts).
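To make the progress-reporting step concrete, here is a minimal sketch of parsing one line of yt-dlp output. It is not the repository's `emitProgressFromLine`. It assumes yt-dlp was started with a template such as `--progress-template "download:progress:%(progress.downloaded_bytes)s|%(progress.total_bytes)s"`, so that each progress line looks like `progress:1024|4096`; the `progress:` prefix and the `|` separator are choices made for this example.

```typescript
interface DownloadProgress {
  downloadedBytes: number;
  totalBytes: number | null; // null when yt-dlp does not know the total
}

// Prefix chosen for this sketch; it must match the progress template.
const PROGRESS_PREFIX = "progress:";

// Returns parsed progress, or null for any line that is not a progress line.
function parseProgressLine(line: string): DownloadProgress | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith(PROGRESS_PREFIX)) return null;

  const [downloadedRaw, totalRaw] = trimmed
    .slice(PROGRESS_PREFIX.length)
    .split("|");
  if (downloadedRaw === undefined || downloadedRaw === "") return null;

  const downloadedBytes = Number(downloadedRaw);
  if (!Number.isFinite(downloadedBytes) || downloadedBytes < 0) return null;

  // Unknown fields render as non-numeric text (for example "NA").
  const total = totalRaw === undefined || totalRaw === "" ? NaN : Number(totalRaw);
  return {
    downloadedBytes,
    totalBytes: Number.isFinite(total) && total > 0 ? total : null,
  };
}
```

A caller would feed each line of the child process's output through this function and forward non-null results to `onProgress`.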

### Apify Cloud Scraping

The Apify provider in [`packages/core/src/content/transcript/providers/youtube/apify.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/apify.ts) serves as the final fallback when local extraction fails. It calls the public Apify actor `faVsWy9VTSNVIhWpR` (pinto-youtube-transcript-scraper) via `fetchTranscriptWithApify`, passing the YouTube URL and optional language preferences. The response is normalized via `normalizePintoTranscript` to match the internal transcript format. This mode requires `APIFY_API_TOKEN` to be set and incurs cloud processing costs but handles geo-restricted or heavily protected videos.
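For orientation, the sketch below builds a request against Apify's synchronous "run actor and return dataset items" endpoint for the actor named above. It is not the code in `apify.ts`. The input payload field names (`videoUrl`, `languages`) are hypothetical, since the actor's input schema is not described here, and `buildApifyRequest` is an illustrative name.

```typescript
const APIFY_ACTOR_ID = "faVsWy9VTSNVIhWpR";

interface ApifyRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request without sending it. Throws early when the token is
// missing, since the call cannot succeed without authentication.
function buildApifyRequest(
  token: string,
  videoUrl: string,
  languages: string[] = [],
): ApifyRequest {
  if (token.trim() === "") {
    throw new Error("APIFY_API_TOKEN is required for the Apify provider");
  }
  // Hypothetical input shape: the real actor may expect different keys.
  const input: Record<string, unknown> = { videoUrl };
  if (languages.length > 0) input.languages = languages;

  return {
    url: `https://api.apify.com/v2/acts/${APIFY_ACTOR_ID}/run-sync-get-dataset-items`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(input),
  };
}
```

Sending the token in an `Authorization` header rather than in the URL keeps it out of request logs.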

## Programmatic Usage Examples

### Basic API Usage

```typescript
import { fetchTranscript } from "@steipete/summarize-core";

const result = await fetchTranscript(
  { url: "https://www.youtube.com/watch?v=jNQXAC9IVRw", html: null },
  {
    fetch,
    youtubeTranscriptMode: "auto",
    ytDlpPath: "/usr/local/bin/yt-dlp",
    apifyApiToken: process.env.APIFY_API_TOKEN,
    transcription: { provider: "openai" },
    onProgress: (ev) => console.log(ev),
  },
);

console.log(result.text);
console.log(result.source); // "youtubei" | "captionTracks" | "yt-dlp" | "apify"
```

### CLI Usage

```bash
# Automatic mode (default): tries youtubei → caption tracks → yt-dlp → Apify
summarize "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Force yt-dlp audio extraction and Whisper transcription
summarize --youtube yt-dlp "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Skip auto-generated captions; fall back to yt-dlp if no manual captions exist
summarize --youtube no-auto "https://www.youtube.com/watch?v=jNQXAC9IVRw"

# Use the Apify cloud scraper (requires APIFY_API_TOKEN)
summarize --youtube apify "https://www.youtube.com/watch?v=jNQXAC9IVRw"
```

### Progress Tracking

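```typescript
import { fetchTranscript } from "@steipete/summarize-core";

await fetchTranscript(
  { url: "https://www.youtube.com/watch?v=example", html: null },
  {
    fetch,
    // Called for each progress event; branch on `kind` to pick the fields to read.
    onProgress: (ev) => {
      if (ev.kind === "transcript-start") {
        // A provider attempt is starting; `hint` describes which one.
        console.log(`Starting: ${ev.hint}`);
      } else if (ev.kind === "TranscriptMediaDownloadProgress") {
        // Audio download progress (yt-dlp path).
        console.log(`Downloaded ${ev.downloadedBytes}/${ev.totalBytes}`);
      } else if (ev.kind === "TranscriptWhisperProgress") {
        // Whisper transcription progress, measured in seconds of audio.
        console.log(`Transcribed ${ev.processedDurationSeconds}s / ${ev.totalDurationSeconds}s`);
      }
    },
  },
);
```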

## Summary

- **Multi-strategy orchestration**: The [`youtube.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube.ts) provider implements a cascading fallback system that attempts the youtubei API, caption tracks, yt-dlp, and Apify in sequence based on the selected mode.

- **Four extraction mechanisms**: Transcripts are retrieved via the youtubei internal API ([`api.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/api.ts)), direct caption track parsing ([`captions.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/captions.ts)), audio download with Whisper transcription ([`yt-dlp.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/yt-dlp.ts)), or cloud-based Apify scraping ([`apify.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/apify.ts)).

- **Flexible configuration**: The `--youtube` CLI flag and `youtubeTranscriptMode` API option control which strategies are attempted, allowing users to prioritize speed (`auto`), accuracy (`no-auto`), or specific backends (`yt-dlp`, `apify`).

- **Robust metadata handling**: Duration extraction follows its own fallback chain (`extractYoutubeDurationSeconds` → `fetchYoutubeDurationSecondsViaPlayer` → `fetchDurationSecondsWithYtDlp`) to ensure accurate timing information regardless of transcript source.

## Frequently Asked Questions

### What happens if all transcript providers fail?

If the youtubei API, caption tracks, yt-dlp, and Apify providers all fail to return a transcript, the `fetchTranscript` function returns a `ProviderResult` with `text` set to `null` and `source` indicating the last attempted provider. The error details are typically logged through the `onProgress` callback, allowing calling code to detect failure and handle it appropriately, such as by prompting the user or skipping the video.
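Calling code can turn that result into an explicit success or failure value. The sketch below assumes only the two fields mentioned above (`text` and `source`); the `TranscriptResult` and `Outcome` types and the `toOutcome` helper are hypothetical names for this example, not exports of the library.

```typescript
// Minimal local stand-in for the fields described above.
interface TranscriptResult {
  text: string | null;
  source: string | null;
}

type Outcome =
  | { ok: true; text: string; source: string }
  | { ok: false; reason: string };

// Converts a provider result into a discriminated union that callers
// can branch on without repeating null checks.
function toOutcome(result: TranscriptResult): Outcome {
  if (result.text !== null && result.text.trim().length > 0) {
    return { ok: true, text: result.text, source: result.source ?? "unknown" };
  }
  const last = result.source ?? "no provider";
  return {
    ok: false,
    reason: `No transcript available (last attempted: ${last})`,
  };
}
```

With this in place, a batch job can skip videos where `ok` is `false` and report the `reason`, while an interactive tool can prompt the user to try another mode.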

### How does the library handle auto-generated versus manual captions?

The caption track provider in [`packages/core/src/content/transcript/providers/youtube/captions.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/captions.ts) prioritizes manual captions over auto-generated ones by ordering the `captionTracks` array so that manual English captions appear first. When using `no-auto` mode, the provider sets `skipAutoGenerated` to `true`, filtering out any tracks marked as automatic. In `auto` or `web` modes, auto-generated captions are accepted as valid fallbacks if no manual captions exist.

### When should I use yt-dlp mode instead of the default auto mode?

Use `yt-dlp` mode when you specifically need audio-based transcription rather than text captions, for example when a video's captions are disabled, unavailable in your region, or known to be inaccurate. This mode skips the faster API and caption track attempts, downloads the audio stream directly, and processes it through Whisper. Note that `yt-dlp` mode requires the `yt-dlp` binary to be installed and accessible, plus a configured Whisper transcription provider (OpenAI, Groq, Fal, or whisper-cpp).

### What is required to use the Apify provider?

The Apify provider requires a valid `APIFY_API_TOKEN` environment variable or configuration option to authenticate with the Apify platform. When using `apify` mode or when the provider is reached as a fallback in `auto` mode, the library calls the public actor `faVsWy9VTSNVIhWpR` (pinto-youtube-transcript-scraper) via `fetchTranscriptWithApify` in [`packages/core/src/content/transcript/providers/youtube/apify.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/transcript/providers/youtube/apify.ts). This approach incurs Apify platform costs but handles cases where local extraction is blocked by IP restrictions or CAPTCHA challenges.