Performance Implications of Different Transcriber Options in Summarize

Choosing the optimal transcriber in steipete/summarize can reduce transcription latency from 30 seconds to under 3 seconds per minute of audio, with local ONNX models offering the fastest CPU-only performance while cloud providers eliminate local resource contention.

The steipete/summarize repository provides four distinct transcriber modes that trade speed, cost, and hardware requirements differently. Knowing how each mode performs lets you optimize for latency, accuracy, and infrastructure constraints in your own environment.

The Four Transcriber Modes

Summarize obtains audio transcripts through four transcriber modes, each with distinct performance characteristics.

Auto Mode

The auto mode picks the first available option at runtime according to a strict priority order. If a local ONNX command is configured, it runs first; otherwise auto attempts whisper.cpp (local), and finally falls back to a cloud provider (OpenAI/Groq).

Latency depends on your configuration. Local ONNX is usually the fastest because it has no network round-trip, while cloud providers add network latency but may benefit from GPU acceleration. The ONNX path needs no GPU (it is CPU-only), cloud providers run on their own GPUs, and whisper.cpp can be compiled with CUDA for GPU speed.
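The priority order can be sketched as a small pure function. This is an illustration only, not the project's code: the `Backend` type, the `Availability` shape, and `pickAutoBackend` are hypothetical names, and how Summarize actually detects each option is not shown in this article.

```typescript
// Hypothetical names throughout; a sketch of the documented order only.
type Backend = "onnx" | "whisper.cpp" | "cloud";

interface Availability {
  onnxCommandConfigured: boolean; // e.g. an ONNX command env var is set
  whisperBinaryFound: boolean; // e.g. the whisper.cpp binary can be located
}

// Walk the priority order and return the first usable backend.
function pickAutoBackend(a: Availability): Backend {
  if (a.onnxCommandConfigured) return "onnx";
  if (a.whisperBinaryFound) return "whisper.cpp";
  return "cloud"; // last resort when nothing local is available
}
```

Note that the order is fixed: the function never measures speed, it simply prefers local options over the cloud.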

Whisper.cpp Local

The whisper mode executes the binary defined by the SUMMARIZE_WHISPER_CPP_BINARY environment variable (defaulting to whisper-cli). This binary runs the Whisper model locally without network dependencies.

Typical latency ranges from 5–30 seconds per minute of audio depending on CPU speed, and drops significantly if the binary is compiled with GPU support. A GPU is optional; without one, expect CPU-only timings in the range above.

Parakeet ONNX

The parakeet mode calls a user-supplied CLI via SUMMARIZE_ONNX_PARAKEET_CMD that runs the NVIDIA Parakeet-TDT 0.6B-v3 ONNX model. Model files download once into ~/.cache/summarize/onnx/parakeet.

This delivers 1–3 seconds per minute of audio on a modern CPU because the 0.6B-parameter model is small and inference involves no network round-trip. The inference runs purely on CPU through the external CLI with no GPU required.

Canary ONNX

The canary mode uses the same mechanism as parakeet but runs the larger Canary 1B-v2 model via SUMMARIZE_ONNX_CANARY_CMD.

Expect 2–5 seconds per minute of audio on a typical laptop CPU. While still CPU-only, the larger model size requires more compute cycles than Parakeet but delivers higher transcription accuracy.
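To turn the per-minute figures above into a rough wall-clock estimate for a given file, multiply by the audio length. The sketch below does only that; the rates are copied from this article, while `LocalMode`, `SECONDS_PER_AUDIO_MINUTE`, and `estimateLatencySeconds` are hypothetical names and real timings will depend on your hardware.

```typescript
type LocalMode = "whisper" | "parakeet" | "canary";

// [best case, worst case] seconds of processing per minute of audio,
// taken from the ranges quoted in this article (CPU-only).
const SECONDS_PER_AUDIO_MINUTE: Record<LocalMode, [number, number]> = {
  whisper: [5, 30],
  parakeet: [1, 3],
  canary: [2, 5],
};

// Scale the per-minute range linearly by the length of the audio.
function estimateLatencySeconds(
  mode: LocalMode,
  audioMinutes: number,
): { min: number; max: number } {
  const [lo, hi] = SECONDS_PER_AUDIO_MINUTE[mode];
  return { min: lo * audioMinutes, max: hi * audioMinutes };
}
```

For a 45-minute recording, this gives 45 to 135 seconds for parakeet and 225 to 1350 seconds for CPU-only whisper.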

Critical Performance Factors

Network Round-Trip

Cloud transcribers (OpenAI, Groq, etc.) add at least a few hundred milliseconds of latency per request plus bandwidth delays. Local ONNX modes avoid this entirely by processing audio on your machine.
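The overhead a cloud request adds before any transcription happens can be approximated as upload time plus one round trip. This is a back-of-the-envelope sketch: `cloudOverheadSeconds` is a hypothetical helper, and the file size, uplink speed, and round-trip time are inputs you would measure yourself, not values taken from Summarize.

```typescript
// All three parameters are illustrative inputs, not measured values.
function cloudOverheadSeconds(
  uploadBytes: number, // size of the audio sent to the provider
  uplinkMbps: number, // upload bandwidth in megabits per second
  roundTripMs: number, // network round-trip time in milliseconds
): number {
  const uploadSeconds = (uploadBytes * 8) / (uplinkMbps * 1_000_000);
  return uploadSeconds + roundTripMs / 1000;
}
```

As an example, a 10 MB file on a 20 Mbps uplink with a 300 ms round trip costs about 4.3 seconds before the provider starts working, which matters most for short clips.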

Model Size and Compute

Parakeet’s 0.6B model runs quickly on a single CPU core, while Canary’s 1B model needs more CPU cycles. Whisper’s cost depends on the model variant you load: the largest variants exceed both ONNX models in parameter count and run slower unless you use a GPU-accelerated binary. According to the source code in packages/core/src/transcription/onnx-cli.ts (lines 243-261), the ONNX pathway streams audio to the external CLI and captures stdout, with ffmpeg fallback handling.

Cost Considerations

Cloud transcription incurs per-token or per-second charges. ONNX and whisper.cpp are free after the initial binary and model download, making them cost-effective for high-volume transcription.

Resource Contention

Running a CPU-bound ONNX transcriber shares resources with the rest of Summarize’s pipeline (e.g., downloading the video). In heavily loaded environments, you may prefer the off-loaded cloud option to prevent local resource starvation.

Source Code Implementation

Configuration and Validation

Parsing and validation of the CLI flag occur in src/run/run-settings.ts. The resolveRunOverrides function normalizes the --transcriber flag and rejects anything other than the four supported values (lines 326-358).
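A minimal sketch of this kind of normalize-then-validate step is shown below. It is not the body of resolveRunOverrides: `normalizeTranscriber` is a hypothetical name, and the trimming, lowercasing, and error message are assumptions about what "normalizes" might mean.

```typescript
// The four modes described in this article.
const TRANSCRIBERS = ["auto", "whisper", "parakeet", "canary"] as const;
type Transcriber = (typeof TRANSCRIBERS)[number];

// Assumed normalization: trim whitespace and lowercase before checking.
function normalizeTranscriber(raw: string): Transcriber {
  const value = raw.trim().toLowerCase();
  if ((TRANSCRIBERS as readonly string[]).indexOf(value) !== -1) {
    return value as Transcriber;
  }
  throw new Error(
    `Unsupported --transcriber "${raw}"; expected one of: ${TRANSCRIBERS.join(", ")}`,
  );
}
```

Failing early on an unknown value keeps a typo such as `--transcriber parakit` from silently falling through to a different mode.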

Runtime Selection Logic

In src/run/runner.ts, the environment variable SUMMARIZE_TRANSCRIBER (defaulting to auto) is resolved and stored in envForRun.SUMMARIZE_TRANSCRIBER (lines 286-296). This determines which transcriber is actually invoked at runtime based on your configuration.
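The resolution step might look roughly like the following. This is a sketch under stated assumptions: `resolveTranscriberEnv` and the `Env` type are hypothetical, and the precedence shown here (explicit flag, then existing environment value, then auto) is an assumption, since the article only states that the variable defaults to auto.

```typescript
type Env = Record<string, string | undefined>;

// Assumed precedence: --transcriber flag, then SUMMARIZE_TRANSCRIBER, then "auto".
function resolveTranscriberEnv(flagValue: string | undefined, env: Env): Env {
  const fromFlag = flagValue?.trim();
  const fromEnv = env.SUMMARIZE_TRANSCRIBER?.trim();
  const chosen = fromFlag || fromEnv || "auto"; // empty strings count as unset
  // Return a copy so the caller's environment object is left untouched.
  return { ...env, SUMMARIZE_TRANSCRIBER: chosen };
}
```

Returning a fresh object mirrors the idea of a per-run environment such as envForRun, without mutating the process-wide one.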

Execution Pipeline

The handleTranscriberCliRequest function in src/run/transcriber-cli.ts (lines 45-78) prints the current configuration and shows which binaries and models are present. The actual ONNX transcription logic lives in packages/core/src/transcription/onnx-cli.ts, which handles the external CLI call, ffmpeg transcoding, and fallback handling.

Practical Configuration Examples

Use the Default Auto Mode

Let Summarize pick the fastest available option automatically:

summarize "https://example.com/video.mp4" --slides

If you have configured an ONNX command, Summarize runs it first; otherwise it falls back to Whisper or a cloud provider.

Force a Specific ONNX Model

Set the transcriber via environment variable:

SUMMARIZE_TRANSCRIBER=parakeet summarize "https://example.com/podcast.wav"

Or use the CLI flag:

summarize "https://example.com/podcast.wav" --transcriber parakeet

Configure the ONNX CLI Command

Set up the external CLI once by defining the command structure:

export SUMMARIZE_ONNX_PARAKEET_CMD='["sherpa-onnx", "--tokens", "{vocab}", "--offline-ctc-model", "{model}", "--input-wav", "{input}"]'

Verify your configuration:

summarize transcriber setup

The setup command prints the cache directory (~/.cache/summarize/onnx/), model download status, and binary reachability.
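To make the command template concrete, here is one way such a value could be expanded into an argument list. It is a sketch, not the logic in onnx-cli.ts: `buildOnnxArgv` and `OnnxPaths` are hypothetical names, and it assumes the JSON-array format and the three placeholders ({model}, {vocab}, {input}) seen in the example above.

```typescript
interface OnnxPaths {
  model: string; // path to the downloaded ONNX model
  vocab: string; // path to the tokens/vocabulary file
  input: string; // path to the audio file to transcribe
}

// Parse the JSON array and substitute each placeholder with a real path.
function buildOnnxArgv(cmdJson: string, paths: OnnxPaths): string[] {
  const parsed: unknown = JSON.parse(cmdJson);
  if (
    !Array.isArray(parsed) ||
    parsed.length === 0 ||
    !parsed.every((part) => typeof part === "string")
  ) {
    throw new Error("Expected a non-empty JSON array of strings");
  }
  // split/join replaces every occurrence and treats the path literally.
  return (parsed as string[]).map((part) =>
    part
      .split("{model}").join(paths.model)
      .split("{vocab}").join(paths.vocab)
      .split("{input}").join(paths.input),
  );
}
```

Keeping the command as an array rather than a single shell string means paths containing spaces are passed through as single arguments, with no quoting required.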

Use a Local Whisper-cpp Binary

Point to your compiled binary for local GPU-accelerated transcription:

export SUMMARIZE_WHISPER_CPP_BINARY="whisper-cli"
summarize "https://example.com/lecture.mp4" --transcriber whisper

Allow Cloud Provider Fallback

Auto mode reaches a cloud provider only when no local option is available. To permit that fallback, leave the transcriber on auto (the default) or set it explicitly:

SUMMARIZE_TRANSCRIBER=auto summarize "https://example.com/meeting.mp4"

The cloud provider selection respects your model configuration via SUMMARIZE_MODEL and SUMMARIZE_PROVIDER.

Summary

  • ONNX (parakeet/canary) provides the lowest latency (1–5 seconds per minute) and zero monetary cost, but requires CPU resources and external CLI installation.
  • Whisper-cpp offers a middle ground with local execution and optional GPU acceleration, though it takes 5–30 seconds per minute unless GPU-accelerated.
  • Cloud providers need no local setup but add network latency and per-usage fees, and may process faster on remote GPUs.
  • Auto mode checks ONNX, then Whisper, then cloud, and uses the first one available, giving a sensible default without manual tuning.

Frequently Asked Questions

Which transcriber option provides the fastest transcription speed?

The parakeet ONNX mode delivers the fastest transcription at 1–3 seconds per minute of audio on modern CPUs. This outperforms Whisper.cpp (5–30 seconds) and cloud providers (which add network latency). The speed comes from Parakeet’s small 0.6B parameter model and local CPU execution without network round-trips.

Does using a cloud transcriber always guarantee better performance than local options?

No. While cloud providers run on powerful GPUs, they introduce network round-trip latency of several hundred milliseconds plus upload bandwidth constraints. For short audio files or when network conditions are poor, local ONNX models often complete faster. Cloud options excel when local CPU resources are constrained or when processing extremely long files that would monopolize your machine.

How does the auto mode decide which transcriber to use?

The auto mode follows a strict priority hierarchy defined in src/run/runner.ts. It first checks for a configured ONNX command (SUMMARIZE_ONNX_PARAKEET_CMD or SUMMARIZE_ONNX_CANARY_CMD), then attempts to locate the Whisper.cpp binary (SUMMARIZE_WHISPER_CPP_BINARY), and finally falls back to cloud providers if no local options are available. Because local options are tried first, you usually get the fastest available option without setting a flag.

What hardware requirements are necessary for GPU-accelerated transcription?

Among the local options, only Whisper.cpp supports GPU acceleration, and only when the binary is built with a GPU backend such as CUDA. Set SUMMARIZE_WHISPER_CPP_BINARY to point to your GPU-enabled binary. The ONNX modes (parakeet and canary) run exclusively on CPU and do not use GPUs. Cloud providers handle GPU acceleration on their own infrastructure, so they need no GPU on your machine.
