# Firecrawl Integration Options for Website Extraction in Summarize

> Explore Firecrawl integration options in Summarize for website extraction. Choose between off, auto, and always modes to control how your content is scraped.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: how-to-guide
- Published: 2026-02-19

---

**Summarize provides three Firecrawl integration modes—`off`, `auto`, and `always`—that control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.**

The `steipete/summarize` repository is an open-source tool that extracts main article text from any non-YouTube URL. When the native HTML scraper encounters paywalls, thin content, or blocked pages, the tool can delegate extraction to **Firecrawl**, a hosted service that returns clean Markdown representations of web pages.

## Firecrawl Integration Modes

The integration behavior is governed by three distinct modes, parsed from the `--firecrawl` CLI flag by `parseFirecrawlMode` in [`src/flags.ts` (lines 33‑36)](https://github.com/steipete/summarize/blob/main/src/flags.ts#L33-L36). The selected mode is stored in `RunSettings.firecrawlMode` ([`src/run/run-settings.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-settings.ts), lines 27 and 347) and consulted by the extraction flows.
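To make the flag-to-mode step concrete, here is a minimal sketch of how a string flag could be narrowed to a typed mode. It is not the actual `parseFirecrawlMode` implementation: the name `parseModeSketch`, the case-insensitive matching, and the error message are all assumptions for illustration.

```ts
// Hypothetical sketch; the real parser lives in src/flags.ts and may differ.
type FirecrawlMode = 'off' | 'auto' | 'always';

const FIRECRAWL_MODES: readonly FirecrawlMode[] = ['off', 'auto', 'always'];

function parseModeSketch(raw: string): FirecrawlMode {
  // Assumption: surrounding whitespace and letter case are ignored.
  const normalized = raw.trim().toLowerCase();
  const match = FIRECRAWL_MODES.find((mode) => mode === normalized);
  if (match === undefined) {
    throw new Error(`Unsupported --firecrawl value "${raw}" (expected off|auto|always)`);
  }
  return match;
}
```

Rejecting unknown values at parse time means a typo such as `--firecrawl alway` fails immediately rather than silently behaving like another mode.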

### Off Mode

When set to `off`, Firecrawl is never invoked. Only the native HTML scraper runs, regardless of content quality or extraction failures.

### Auto Mode (Default)

`auto` is the default behavior when no flag is provided. The HTML scraper executes first; if the extracted content is considered insufficient (below the minimum content budget), the tool automatically retries via Firecrawl and uses the Markdown result. This mode is resolved by `resolveFirecrawlMode` in the core package ([`packages/core/src/content/link-preview/content/utils.ts`, lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).

### Always Mode

In `always` mode, Firecrawl is called first and its result is used even if the HTML scraper could have succeeded. This ensures consistent Markdown formatting and bypasses potential scraping blocks at the cost of API usage.

## Configuring Firecrawl via CLI

Control the integration mode using the `--firecrawl` flag:

```bash
# Default (auto): try Firecrawl only if HTML extraction is thin
summarize https://example.com/article

# Disable Firecrawl completely
summarize --firecrawl off https://example.com/article

# Force Firecrawl to be the primary extractor
summarize --firecrawl always https://example.com/article
```

The flag accepts `off|auto|always`. When `always` is used, you must export `FIRECRAWL_API_KEY` in your environment (see [docs/website.md lines 52‑55](https://github.com/steipete/summarize/blob/main/docs/website.md#L52-L55)).

## Programmatic Firecrawl Integration

For library consumers, the core package exposes `createFirecrawlScraper` to instantiate the scraper directly.

### Direct Scraper Usage

```ts
import { createFirecrawlScraper } from '@steipete/summarize-core';
import fetch from 'node-fetch'; // default export; any fetch implementation works

// Build a scraper with your API key
const firecrawl = createFirecrawlScraper({
  apiKey: process.env.FIRECRAWL_API_KEY!,
  fetchImpl: fetch,
});

// Use it directly: resolves to the Markdown result, or null if Firecrawl could not extract anything
const result = await firecrawl('https://example.com/article', { timeoutMs: 30_000 });

if (result) {
  console.log('Firecrawl Markdown:', result.markdown);
}
```

### Core Extraction Flow Integration

When building custom extraction pipelines, use `resolveFirecrawlMode` to handle mode resolution:

```ts
import { resolveFirecrawlMode } from '@steipete/summarize-core/content/link-preview/content/utils';
import { createFirecrawlScraper } from '@steipete/summarize-core';
import fetch from 'node-fetch';

// Placeholder threshold and HTML extractor: swap in your own implementations.
const MIN_CONTENT = 200; // assumed minimum length, in characters

async function htmlExtractor(url: string): Promise<string | null> {
  const response = await fetch(url);
  return response.ok ? await response.text() : null;
}

async function extractWebsite(url: string, options = {}) {
  const mode = resolveFirecrawlMode(options); // 'off' | 'auto' | 'always'

  // off: native HTML extraction only
  if (mode === 'off') {
    return htmlExtractor(url);
  }

  const firecrawl = createFirecrawlScraper({
    apiKey: process.env.FIRECRAWL_API_KEY!,
    fetchImpl: fetch,
  });

  // auto: try HTML first, then fall back to Firecrawl
  if (mode === 'auto') {
    const html = await htmlExtractor(url);
    if (html && html.length > MIN_CONTENT) return html;
  }

  // always, or auto when the HTML result was too thin
  const fireResult = await firecrawl(url);
  return fireResult?.markdown ?? null;
}
```

## API Key Requirements and Validation

Firecrawl integration requires an API key supplied via the environment variable **`FIRECRAWL_API_KEY`**. The runner validates the presence of the key when the mode is set to `always`, aborting with a clear error if missing ([`src/run/runner.ts`, lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).
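The validation described above can be sketched as a small guard. This is an illustration only, not the code in `src/run/runner.ts`: the function name `requireFirecrawlKey`, the error text, and the choice to return `null` (rather than fail) when the key is absent in `off` or `auto` mode are assumptions.

```ts
// Hypothetical sketch of the key check; the runner's real logic may differ.
type FirecrawlMode = 'off' | 'auto' | 'always';

function requireFirecrawlKey(
  mode: FirecrawlMode,
  env: Record<string, string | undefined>,
): string | null {
  // Treat an unset or whitespace-only variable as missing.
  const key = env['FIRECRAWL_API_KEY']?.trim() ?? '';
  if (key.length > 0) return key;

  if (mode === 'always') {
    // Fail before any extraction starts, as the text above describes.
    throw new Error('--firecrawl always requires FIRECRAWL_API_KEY to be set');
  }

  // Assumption: without a key, off/auto continue with HTML extraction only.
  return null;
}
```

In a Node.js program the guard could be called as `requireFirecrawlKey(mode, process.env)` right after the mode is resolved.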

## Runtime Execution Flow

### Mode Resolution Pipeline

The CLI flag is processed through a specific pipeline:

1. **CLI Parsing**: `parseFirecrawlMode` converts the string flag to a typed value ([`src/run/runner.ts`, lines 322‑346](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L322-L346)).
2. **Default Resolution**: `resolveFirecrawlMode` in the core package defaults to `"auto"` when no valid mode is specified ([`packages/core/src/content/link-preview/content/utils.ts`, lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).
3. **Storage**: The resolved mode is stored in `RunSettings.firecrawlMode` for access by extraction flows.
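
The default-resolution and storage steps above can be sketched as follows. The names `resolveModeSketch`, `RunSettingsSketch`, and `buildSettingsSketch` are hypothetical, and the real `resolveFirecrawlMode` may accept a different argument shape; only the fall-back-to-`auto` behavior and the `firecrawlMode` field name come from the description above.

```ts
// Hypothetical sketch of default resolution and storage; not the library's code.
type FirecrawlMode = 'off' | 'auto' | 'always';

interface RunSettingsSketch {
  firecrawlMode: FirecrawlMode;
}

function resolveModeSketch(requested?: string | null): FirecrawlMode {
  // Anything other than a known mode (including undefined) falls back to 'auto'.
  return requested === 'off' || requested === 'auto' || requested === 'always'
    ? requested
    : 'auto';
}

function buildSettingsSketch(flagValue?: string): RunSettingsSketch {
  // The resolved mode is stored once so extraction flows can read it later.
  return { firecrawlMode: resolveModeSketch(flagValue) };
}
```

Resolving the default in one place keeps every downstream flow free of its own "was the flag set?" checks.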

### Execution Paths by Mode

- **`auto`** – The HTML scraper runs first. If the extracted content is below the minimum budget (see `applyContentBudget`), a `firecrawl-start` event is emitted, the scraper calls the Firecrawl API, and the result replaces the HTML output.
- **`always`** – The runner creates the Firecrawl scraper up-front (`createFirecrawlScraper`) and uses it as the primary source, bypassing the HTML scraper entirely.
- **`off`** – The Firecrawl-related code path is never reached; only native HTML extraction runs.

### Progress Reporting

During execution, the UI prints two debug events:

- `firecrawl-start` – Emitted before the HTTP request.
- `firecrawl-done` – Emitted after the request finishes, with `ok` indicating success.

These events are rendered by the TTY progress component ([`src/tty/website-progress.ts`, lines 83‑95](https://github.com/steipete/summarize/blob/main/src/tty/website-progress.ts#L83-L95)) and appear in `--verbose` output.
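
As an illustration of how a consumer might turn these two events into status lines, here is a minimal sketch. The event names and the `ok` flag come from the list above; the `kind` and `url` field names, the `FirecrawlEvent` type, and the message wording are assumptions and do not reflect the actual shapes in `src/tty/website-progress.ts`.

```ts
// Hypothetical event shapes; only the event names and `ok` are from the docs above.
type FirecrawlEvent =
  | { kind: 'firecrawl-start'; url: string }
  | { kind: 'firecrawl-done'; url: string; ok: boolean };

function describeFirecrawlEvent(event: FirecrawlEvent): string {
  switch (event.kind) {
    case 'firecrawl-start':
      // Emitted before the HTTP request is sent.
      return `Firecrawl: fetching ${event.url}`;
    case 'firecrawl-done':
      // Emitted after the request finishes; `ok` signals success.
      return event.ok
        ? `Firecrawl: extracted ${event.url}`
        : `Firecrawl: failed for ${event.url}`;
  }
}
```

Modeling the events as a discriminated union lets the compiler check that `ok` is only read on `firecrawl-done`.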

## Key Implementation Files

| File | Purpose |
|------|---------|
| **[`src/firecrawl.ts`](https://github.com/steipete/summarize/blob/main/src/firecrawl.ts)** | Implements `createFirecrawlScraper` – the thin wrapper around the Firecrawl HTTP API. |
| **[`src/flags.ts`](https://github.com/steipete/summarize/blob/main/src/flags.ts)** | Parses the `--firecrawl` CLI flag (`parseFirecrawlMode`). |
| **[`src/run/run-settings.ts`](https://github.com/steipete/summarize/blob/main/src/run/run-settings.ts)** | Stores the resolved Firecrawl mode in `RunSettings.firecrawlMode`. |
| **[`src/run/runner.ts`](https://github.com/steipete/summarize/blob/main/src/run/runner.ts)** | Reads CLI options, validates the API key, and propagates the mode to the extraction flows. |
| **[`src/run/flows/url/flow.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/url/flow.ts)** | Coordinates website extraction; decides when to instantiate the Firecrawl scraper. |
| **[`src/run/flows/asset/media.ts`](https://github.com/steipete/summarize/blob/main/src/run/flows/asset/media.ts)** | Handles "video-only" pages and also respects the Firecrawl mode. |
| **[`src/tty/website-progress.ts`](https://github.com/steipete/summarize/blob/main/src/tty/website-progress.ts)** | Renders the `firecrawl-start` / `firecrawl-done` progress events. |
| **[`packages/core/src/content/link-preview/content/utils.ts`](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts)** | Core utility `resolveFirecrawlMode` used by the library API. |
| **[`docs/website.md`](https://github.com/steipete/summarize/blob/main/docs/website.md)** | Human-readable documentation of the website extraction pipeline and Firecrawl flags. |

## Summary

- Summarize provides **three Firecrawl integration modes** (`off`, `auto`, `always`) to control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.
- The default **`auto`** mode attempts native extraction first, falling back to Firecrawl only when content is insufficient.
- **`always`** mode bypasses the HTML scraper entirely, using Firecrawl as the primary extractor (requires `FIRECRAWL_API_KEY`).
- **`off`** disables Firecrawl completely, relying solely on native HTML extraction.
- Configuration is handled via the `--firecrawl` CLI flag or programmatically through `resolveFirecrawlMode` and `createFirecrawlScraper` in the core package.

## Frequently Asked Questions

### What happens if I use `--firecrawl always` without setting FIRECRAWL_API_KEY?

The runner validates the API key presence when the mode is set to `always` and aborts with a clear error message before any extraction begins ([`src/run/runner.ts`, lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).

### How does the auto mode decide when to fall back to Firecrawl?

In `auto` mode, the HTML scraper runs first. If the extracted content length falls below the internal minimum content budget (determined by `applyContentBudget`), the tool emits a `firecrawl-start` event and retries extraction via the Firecrawl API, replacing the HTML output with the Markdown result.

### Can I use Firecrawl integration when using Summarize as a library instead of CLI?

Yes. Import `createFirecrawlScraper` from `@steipete/summarize-core` to instantiate the scraper directly with your API key, or use `resolveFirecrawlMode` from the core utilities to handle mode resolution in custom extraction pipelines.

### Does Firecrawl mode affect YouTube URL processing?

No. Firecrawl integration applies specifically to non-YouTube URLs. The tool handles YouTube content through a separate extraction pipeline, while the Firecrawl modes (`off`, `auto`, `always`) only govern how standard web pages are extracted: never with Firecrawl, with Firecrawl as a fallback when the built-in HTML scraper returns insufficient content, or with Firecrawl as the primary extractor.