Firecrawl Integration Options for Website Extraction in Summarize

February 19, 2026 steipete/summarize ↗

Summarize provides three Firecrawl integration modes—off, auto, and always—that control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.

The steipete/summarize repository is an open-source tool that extracts main article text from any non-YouTube URL. When the native HTML scraper encounters paywalls, thin content, or blocked pages, the tool can delegate extraction to Firecrawl, a hosted service that returns clean Markdown representations of web pages.

Firecrawl Integration Modes

The integration behavior is governed by three distinct modes parsed from the --firecrawl CLI flag by parseFirecrawlMode in [src/flags.ts (lines 33‑36)](https://github.com/steipete/summarize/blob/main/src/flags.ts#L33-L36). The selected mode is stored in RunSettings.firecrawlMode ([src/run/run-settings.ts lines 27, 347](https://github.com/steipete/summarize/blob/main/src/run/run-settings.ts#L27)) and consulted by the extraction flows.

Off Mode

When set to off, Firecrawl is never invoked. Only the native HTML scraper runs, regardless of content quality or extraction failures.

Auto Mode (Default)

auto is the default behavior when no flag is provided. The HTML scraper executes first; if the extracted content is considered insufficient (below the minimum content budget), the tool automatically retries via Firecrawl and uses the Markdown result. The default is resolved by resolveFirecrawlMode in the core package ([packages/core/src/content/link-preview/content/utils.ts lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).
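The fallback decision in auto mode can be pictured as a simple length check. The sketch below is illustrative only: the helper name needsFirecrawlFallback and the 200-character threshold are assumptions, not values taken from the repository, whose budget logic may weigh other signals.

```typescript
// Hypothetical threshold; the real minimum content budget lives in the repository.
const ASSUMED_MIN_CHARS = 200;

// Returns true when the HTML result looks too thin to summarize.
function needsFirecrawlFallback(
  htmlText: string | null,
  minChars: number = ASSUMED_MIN_CHARS,
): boolean {
  if (htmlText === null) return true; // extraction failed outright
  // Collapse whitespace so padding does not inflate the length.
  const normalized = htmlText.replace(/\s+/g, " ").trim();
  return normalized.length < minChars;
}

console.log(needsFirecrawlFallback("Subscribe to continue reading.")); // true
console.log(needsFirecrawlFallback("word ".repeat(100))); // false
```

Normalizing whitespace before measuring keeps pages that are mostly blank lines from passing the check.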

Always Mode

In always mode, Firecrawl is called first and its result is used even if the HTML scraper could have succeeded. This ensures consistent Markdown formatting and bypasses potential scraping blocks at the cost of API usage.

Configuring Firecrawl via CLI

Control the integration mode using the --firecrawl flag:


# Default (auto) – try Firecrawl only if HTML extraction is thin
summarize https://example.com/article

# Disable Firecrawl completely
summarize --firecrawl off https://example.com/article

# Force Firecrawl to be the primary extractor
summarize --firecrawl always https://example.com/article

The flag accepts off|auto|always. When always is used, you must export FIRECRAWL_API_KEY in your environment (see docs/website.md lines 52‑55).

Programmatic Firecrawl Integration

For library consumers, the core package exposes createFirecrawlScraper to instantiate the scraper directly.

Direct Scraper Usage

import { createFirecrawlScraper } from '@steipete/summarize-core';
import fetch from 'node-fetch'; // default export; any fetch implementation works

// Build a scraper with your API key
const firecrawl = createFirecrawlScraper({
  apiKey: process.env.FIRECRAWL_API_KEY!,
  fetchImpl: fetch,
});

// Use it directly – returns Markdown or null if Firecrawl could not extract anything
const result = await firecrawl('https://example.com/article', { timeoutMs: 30_000 });

if (result) {
  console.log('Firecrawl Markdown:', result.markdown);
}
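To make the idea of a thin wrapper around the Firecrawl HTTP API concrete, here is a minimal sketch. It is not the repository's createFirecrawlScraper: the name createScraperSketch, the option names, and the return shape are assumptions. The endpoint and payload follow Firecrawl's public v1 scrape API and should be checked against its current documentation.

```typescript
// Minimal structural type so any fetch implementation (or a test stub) fits.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal?: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

interface ScrapeResult {
  markdown: string;
}

// Hypothetical factory; names and shapes are illustrative, not the library's API.
function createScraperSketch(opts: { apiKey: string; fetchImpl: FetchLike }) {
  return async (
    url: string,
    { timeoutMs = 30_000 }: { timeoutMs?: number } = {},
  ): Promise<ScrapeResult | null> => {
    // Abort the request if it runs longer than the timeout.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await opts.fetchImpl("https://api.firecrawl.dev/v1/scrape", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${opts.apiKey}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ url, formats: ["markdown"] }),
        signal: controller.signal,
      });
      if (!res.ok) return null;
      const payload = (await res.json()) as { data?: { markdown?: string } };
      const markdown = payload.data?.markdown?.trim();
      // Treat an empty document the same as a failed extraction.
      return markdown ? { markdown } : null;
    } finally {
      clearTimeout(timer);
    }
  };
}
```

Injecting the fetch implementation, as the documented fetchImpl option does, lets tests pass a stub instead of calling the hosted service.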

Core Extraction Flow Integration

When building custom extraction pipelines, use resolveFirecrawlMode to handle mode resolution:

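import { resolveFirecrawlMode } from '@steipete/summarize-core/content/link-preview/content/utils';
import { createFirecrawlScraper } from './firecrawl';
import fetch from 'node-fetch'; // default export; any fetch implementation works

const MIN_CONTENT = 200; // illustrative threshold, not the repository's budget

// Minimal stand-in for the native HTML scraper (illustrative only)
async function htmlExtractor(url: string): Promise<string | null> {
  const res = await fetch(url);
  if (!res.ok) return null;
  const html = await res.text();
  return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim() || null;
}

async function extractWebsite(url: string, options = {}) {
  const mode = resolveFirecrawlMode(options); // 'off' | 'auto' | 'always'

  // off: native HTML extraction only
  if (mode === 'off') {
    return htmlExtractor(url);
  }

  // auto: try HTML first and keep it when it is long enough
  let html: string | null = null;
  if (mode === 'auto') {
    html = await htmlExtractor(url);
    if (html && html.length > MIN_CONTENT) return html;
  }

  // always, or auto with thin HTML: ask Firecrawl
  const firecrawl = createFirecrawlScraper({
    apiKey: process.env.FIRECRAWL_API_KEY!,
    fetchImpl: fetch,
  });
  const fireResult = await firecrawl(url);

  // Keep the thin HTML result rather than returning nothing
  return fireResult?.markdown ?? html;
}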

API Key Requirements and Validation

Firecrawl integration requires an API key supplied via the environment variable FIRECRAWL_API_KEY. The runner validates the presence of the key when the mode is set to always, aborting with a clear error if missing ([src/run/runner.ts lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).
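A guard of this kind might look like the sketch below. The function name requireFirecrawlKey and the error text are hypothetical, and treating the key as optional in auto mode is an assumption; the text above only states that always mode is validated.

```typescript
type FirecrawlMode = "off" | "auto" | "always";

// Hypothetical guard; not the runner's actual implementation.
function requireFirecrawlKey(
  mode: FirecrawlMode,
  env: Record<string, string | undefined>,
): string | null {
  // Treat an empty or whitespace-only value as missing.
  const key = env.FIRECRAWL_API_KEY?.trim() || null;
  if (mode === "always" && key === null) {
    throw new Error("--firecrawl always requires FIRECRAWL_API_KEY to be set");
  }
  // off never needs the key; auto returns it when present (assumption).
  return mode === "off" ? null : key;
}
```

Calling requireFirecrawlKey(mode, process.env) before any network request means a missing key surfaces immediately instead of partway through extraction.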

Runtime Execution Flow

Mode Resolution Pipeline

The CLI flag is processed through a specific pipeline:

CLI Parsing: parseFirecrawlMode (defined in src/flags.ts) converts the string flag to a typed value when the runner reads the CLI options ([src/run/runner.ts lines 322‑346](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L322-L346)).
Default Resolution: resolveFirecrawlMode in the core package defaults to "auto" when no valid mode is specified ([packages/core/src/content/link-preview/content/utils.ts lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).
Storage: The resolved mode is stored in RunSettings.firecrawlMode for access by extraction flows.
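The three steps above can be sketched as follows. All names ending in Sketch are hypothetical stand-ins for parseFirecrawlMode, resolveFirecrawlMode, and RunSettings. Rejecting invalid CLI values and accepting them case-insensitively are assumptions about the parser, not documented behavior.

```typescript
type FirecrawlMode = "off" | "auto" | "always";
const MODES: readonly FirecrawlMode[] = ["off", "auto", "always"];

// Step 1 (sketch): strict parsing of the raw CLI string.
function parseModeSketch(raw: string): FirecrawlMode {
  const value = raw.trim().toLowerCase();
  const match = MODES.find((m) => m === value);
  if (!match) throw new Error(`Unsupported --firecrawl value: ${raw}`);
  return match;
}

// Step 2 (sketch): default to "auto" when no valid mode is supplied.
function resolveModeSketch(candidate: unknown): FirecrawlMode {
  const match = MODES.find((m) => m === candidate);
  return match ?? "auto";
}

// Step 3 (sketch): keep the resolved mode on a settings object.
interface RunSettingsSketch {
  firecrawlMode: FirecrawlMode;
}

function buildSettings(flag?: string): RunSettingsSketch {
  const parsed = flag === undefined ? undefined : parseModeSketch(flag);
  return { firecrawlMode: resolveModeSketch(parsed) };
}

console.log(buildSettings().firecrawlMode); // auto
console.log(buildSettings("always").firecrawlMode); // always
```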

Execution Paths by Mode

auto – The HTML scraper runs first. If the extracted content is below the minimum budget (see applyContentBudget), a firecrawl-start event is emitted, the scraper calls the Firecrawl API, and the result replaces the HTML output.
always – The runner creates the Firecrawl scraper up-front (createFirecrawlScraper) and uses it as the primary source, bypassing the HTML scraper entirely.
off – The Firecrawl-related code path is never reached; only native HTML extraction runs.
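The three paths can be expressed as one small orchestrator with injected dependencies. This is a sketch under stated assumptions: extractSketch, the Deps interface, and the event payload shapes are hypothetical, and keeping the thin HTML text when Firecrawl returns nothing is a design choice of this example rather than documented behavior.

```typescript
type FirecrawlMode = "off" | "auto" | "always";
type ProgressEvent =
  | { kind: "firecrawl-start" }
  | { kind: "firecrawl-done"; ok: boolean };

interface Deps {
  html: (url: string) => Promise<string | null>; // stand-in for the native scraper
  firecrawl: (url: string) => Promise<string | null>; // stand-in for the Firecrawl scraper
  isThin: (text: string | null) => boolean; // stand-in for the budget check
  onProgress: (event: ProgressEvent) => void;
}

async function extractSketch(
  url: string,
  mode: FirecrawlMode,
  deps: Deps,
): Promise<string | null> {
  let htmlText: string | null = null;

  // off and auto start with the HTML scraper; always skips it.
  if (mode !== "always") {
    htmlText = await deps.html(url);
    if (mode === "off" || !deps.isThin(htmlText)) return htmlText;
  }

  // always, or auto with thin content: call Firecrawl and report progress.
  deps.onProgress({ kind: "firecrawl-start" });
  const markdown = await deps.firecrawl(url);
  deps.onProgress({ kind: "firecrawl-done", ok: markdown !== null });

  // Assumption: fall back to the thin HTML text if Firecrawl yields nothing.
  return markdown ?? htmlText;
}
```

Passing the scrapers in as functions keeps the mode logic testable without network access.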

Progress Reporting

During execution, the UI prints two debug events:

firecrawl-start – Emitted before the HTTP request.
firecrawl-done – Emitted after the request finishes, with ok indicating success.

These events are rendered by the TTY progress component ([src/tty/website-progress.ts lines 83‑95](https://github.com/steipete/summarize/blob/main/src/tty/website-progress.ts#L83-L95)) and appear in --verbose output.
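As a rough illustration of how such events could be turned into status lines, consider the sketch below. The event shapes, the describeProgress helper, and the message wording are all assumptions; the actual rendering lives in the progress component linked above.

```typescript
// Hypothetical event shapes; only the event names and the ok field come from the text above.
type ProgressEvent =
  | { kind: "firecrawl-start" }
  | { kind: "firecrawl-done"; ok: boolean };

// Maps each event to a short human-readable status line.
function describeProgress(event: ProgressEvent): string {
  switch (event.kind) {
    case "firecrawl-start":
      return "Firecrawl: requesting page";
    case "firecrawl-done":
      return event.ok ? "Firecrawl: done" : "Firecrawl: failed";
  }
}

console.log(describeProgress({ kind: "firecrawl-start" }));
console.log(describeProgress({ kind: "firecrawl-done", ok: true }));
```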

Key Implementation Files

| File | Purpose |
| --- | --- |
| `src/firecrawl.ts` | Implements `createFirecrawlScraper` – the thin wrapper around the Firecrawl HTTP API. |
| `src/flags.ts` | Parses the `--firecrawl` CLI flag (`parseFirecrawlMode`). |
| `src/run/run-settings.ts` | Stores the resolved Firecrawl mode in `RunSettings.firecrawlMode`. |
| `src/run/runner.ts` | Reads CLI options, validates the API key, and propagates the mode to the extraction flows. |
| `src/run/flows/url/flow.ts` | Coordinates website extraction; decides when to instantiate the Firecrawl scraper. |
| `src/run/flows/asset/media.ts` | Handles "video-only" pages and also respects the Firecrawl mode. |
| `src/tty/website-progress.ts` | Emits `firecrawl-start` / `firecrawl-done` progress events. |
| `packages/core/src/content/link-preview/content/utils.ts` | Core utility `resolveFirecrawlMode` used by the library API. |
| `docs/website.md` | Human-readable documentation of the website extraction pipeline and Firecrawl flags. |

Summary

Summarize provides three Firecrawl integration modes (off, auto, always) to control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.
The default auto mode attempts native extraction first, falling back to Firecrawl only when content is insufficient.
always mode bypasses the HTML scraper entirely, using Firecrawl as the primary extractor (requires FIRECRAWL_API_KEY).
off disables Firecrawl completely, relying solely on native HTML extraction.
Configuration is handled via the --firecrawl CLI flag or programmatically through resolveFirecrawlMode and createFirecrawlScraper in the core package.

Frequently Asked Questions

What happens if I use `--firecrawl always` without setting FIRECRAWL_API_KEY?

The runner validates the API key presence when the mode is set to always and aborts with a clear error message before any extraction begins ([src/run/runner.ts lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).

How does the auto mode decide when to fall back to Firecrawl?

In auto mode, the HTML scraper runs first. If the extracted content length falls below the internal minimum content budget (determined by applyContentBudget), the tool emits a firecrawl-start event and retries extraction via the Firecrawl API, replacing the HTML output with the Markdown result.

Can I use Firecrawl integration when using Summarize as a library instead of CLI?

Yes. Import createFirecrawlScraper from @steipete/summarize-core to instantiate the scraper directly with your API key, or use resolveFirecrawlMode from the core utilities to handle mode resolution in custom extraction pipelines.

Does Firecrawl mode affect YouTube URL processing?

No. Firecrawl integration applies only to non-YouTube URLs. YouTube content goes through a separate extraction pipeline, so the Firecrawl modes (off, auto, always) govern only how standard web pages are extracted.
