Firecrawl Integration Options for Website Extraction in Summarize
Summarize provides three Firecrawl integration modes—off, auto, and always—that control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.
The steipete/summarize repository is an open-source tool that extracts main article text from any non-YouTube URL. When the native HTML scraper encounters paywalls, thin content, or blocked pages, the tool can delegate extraction to Firecrawl, a hosted service that returns clean Markdown representations of web pages.
Firecrawl Integration Modes
The integration behavior is governed by three distinct modes parsed from the --firecrawl CLI flag by parseFirecrawlMode in [src/flags.ts (lines 33‑36)](https://github.com/steipete/summarize/blob/main/src/flags.ts#L33-L36). The selected mode is stored in RunSettings.firecrawlMode ([src/run/run-settings.ts lines 27, 347](https://github.com/steipete/summarize/blob/main/src/run/run-settings.ts#L27-L28#L347)) and consulted by the extraction flows.
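As a rough illustration of what parsing the flag involves, the sketch below turns a raw string into one of the three modes. It is not the actual parseFirecrawlMode from src/flags.ts: the function name parseFirecrawlModeSketch, the trimming and lower-casing, and the error message are all assumptions made for this example.

```typescript
// Hypothetical sketch; the real parser in src/flags.ts may behave differently.
type FirecrawlMode = "off" | "auto" | "always";

const FIRECRAWL_MODES: readonly FirecrawlMode[] = ["off", "auto", "always"];

function parseFirecrawlModeSketch(raw: string): FirecrawlMode {
  // Assumption: surrounding whitespace and letter case are ignored.
  const normalized = raw.trim().toLowerCase();
  const match = FIRECRAWL_MODES.find((mode) => mode === normalized);
  if (match === undefined) {
    // Assumption: unknown values are rejected rather than silently defaulted.
    throw new Error(`Unsupported --firecrawl value: ${raw} (expected off|auto|always)`);
  }
  return match;
}
```

Returning a union type rather than a plain string lets the compiler check that later code handles every mode.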
Off Mode
When set to off, Firecrawl is never invoked. Only the native HTML scraper runs, regardless of content quality or extraction failures.
Auto Mode (Default)
auto is the default behavior when no flag is provided. The HTML scraper executes first; if the extracted content is considered insufficient (below the minimum content budget), the tool automatically retries via Firecrawl and uses the Markdown result. This mode is resolved by resolveFirecrawlMode in the core package ([packages/core/src/content/link-preview/content/utils.ts lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).
Always Mode
In always mode, Firecrawl is called first and its result is used even if the HTML scraper could have succeeded. This ensures consistent Markdown formatting and bypasses potential scraping blocks at the cost of API usage.
Configuring Firecrawl via CLI
Control the integration mode using the --firecrawl flag:
# Default (auto) – try Firecrawl only if HTML extraction is thin
summarize https://example.com/article
# Disable Firecrawl completely
summarize --firecrawl off https://example.com/article
# Force Firecrawl to be the primary extractor
summarize --firecrawl always https://example.com/article
The flag accepts off|auto|always. When always is used, you must export FIRECRAWL_API_KEY in your environment (see docs/website.md lines 52‑55).
Programmatic Firecrawl Integration
For library consumers, the core package exposes createFirecrawlScraper to instantiate the scraper directly.
Direct Scraper Usage
import { createFirecrawlScraper } from '@steipete/summarize-core';
import fetch from 'node-fetch'; // node-fetch exposes fetch as its default export; any fetch implementation works
// Build a scraper with your API key
const firecrawl = createFirecrawlScraper({
apiKey: process.env.FIRECRAWL_API_KEY!,
fetchImpl: fetch,
});
// Use it directly – returns Markdown or null if Firecrawl could not extract anything
const result = await firecrawl('https://example.com/article', { timeoutMs: 30_000 });
if (result) {
console.log('Firecrawl Markdown:', result.markdown);
}
Core Extraction Flow Integration
When building custom extraction pipelines, use resolveFirecrawlMode to handle mode resolution:
import { resolveFirecrawlMode } from '@steipete/summarize-core/content/link-preview/content/utils';
import { createFirecrawlScraper } from './firecrawl';
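import fetch from 'node-fetch';
// htmlExtractor and MIN_CONTENT are placeholders for your own HTML extractor and minimum-length threshold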
async function extractWebsite(url: string, options = {}) {
const mode = resolveFirecrawlMode(options); // 'off' | 'auto' | 'always'
if (mode === 'off') {
return htmlExtractor(url);
}
const firecrawl = createFirecrawlScraper({
apiKey: process.env.FIRECRAWL_API_KEY!,
fetchImpl: fetch,
});
// auto: try HTML first, then fall back to Firecrawl
if (mode === 'auto') {
const html = await htmlExtractor(url);
if (html && html.length > MIN_CONTENT) return html;
}
// always or fallback case
const fireResult = await firecrawl(url);
return fireResult?.markdown ?? null;
}
API Key Requirements and Validation
Firecrawl integration requires an API key supplied via the environment variable FIRECRAWL_API_KEY. The runner validates the presence of the key when the mode is set to always, aborting with a clear error if missing ([src/run/runner.ts lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).
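The check described above could look roughly like the following. This is a minimal sketch, not the code in src/run/runner.ts: the helper name assertFirecrawlKey and the error text are hypothetical, and returning null for off and auto when the key is missing is an assumption about how those modes behave without a key.

```typescript
type KeyCheckMode = "off" | "auto" | "always";

// Hypothetical helper; the real validation lives in src/run/runner.ts and may differ.
function assertFirecrawlKey(
  mode: KeyCheckMode,
  env: Record<string, string | undefined>,
): string | null {
  const key = env["FIRECRAWL_API_KEY"]?.trim() ?? "";
  if (key.length > 0) return key;
  if (mode === "always") {
    // Fail fast: always mode cannot work without a key.
    throw new Error("--firecrawl always requires FIRECRAWL_API_KEY to be set");
  }
  // Assumption: off and auto continue without Firecrawl when no key is present.
  return null;
}
```

In a Node.js program the second argument would typically be process.env; passing it in explicitly keeps the helper easy to test.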
Runtime Execution Flow
Mode Resolution Pipeline
The CLI flag is processed through a specific pipeline:
- CLI Parsing: parseFirecrawlMode converts the string flag to a typed value ([src/run/runner.ts lines 322‑346](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L322-L346)).
- Default Resolution: resolveFirecrawlMode in the core package defaults to "auto" when no valid mode is specified ([packages/core/src/content/link-preview/content/utils.ts lines 64‑70](https://github.com/steipete/summarize/blob/main/packages/core/src/content/link-preview/content/utils.ts#L64-L70)).
- Storage: The resolved mode is stored in RunSettings.firecrawlMode for access by extraction flows.
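The default-resolution and storage steps can be pictured with the sketch below. The names resolveModeSketch, RunSettingsSketch, and buildRunSettingsSketch are hypothetical stand-ins; the real resolveFirecrawlMode and RunSettings carry more than is shown here, and their exact signatures are not reproduced.

```typescript
type PipelineMode = "off" | "auto" | "always";

// Hypothetical, trimmed-down stand-in for RunSettings.
interface RunSettingsSketch {
  firecrawlMode: PipelineMode;
}

function isPipelineMode(value: unknown): value is PipelineMode {
  return value === "off" || value === "auto" || value === "always";
}

// Mirrors the documented behavior: anything that is not a valid mode becomes "auto".
function resolveModeSketch(requested?: unknown): PipelineMode {
  return isPipelineMode(requested) ? requested : "auto";
}

// Storage step: the resolved mode is kept on a settings object for later flows.
function buildRunSettingsSketch(flagValue?: string): RunSettingsSketch {
  return { firecrawlMode: resolveModeSketch(flagValue) };
}
```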
Execution Paths by Mode
- auto: The HTML scraper runs first. If the extracted content is below the minimum budget (see applyContentBudget), a firecrawl-start event is emitted, the scraper calls the Firecrawl API, and the result replaces the HTML output.
- always: The runner creates the Firecrawl scraper up-front (createFirecrawlScraper) and uses it as the primary source, bypassing the HTML scraper entirely.
- off: The Firecrawl-related code path is never reached; only native HTML extraction runs.
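The three paths can be sketched as a single function with the two extractors injected, which makes the ordering easy to test without network access. Everything here is illustrative: extractWithMode, the Extractor type, the 200-character default threshold, and the choice to keep thin HTML when Firecrawl returns nothing are assumptions, not the repository's implementation.

```typescript
type FlowMode = "off" | "auto" | "always";
type Extractor = (url: string) => Promise<string | null>;

interface FlowResult {
  source: "html" | "firecrawl";
  text: string | null;
}

async function extractWithMode(
  url: string,
  mode: FlowMode,
  extractHtml: Extractor,
  extractFirecrawl: Extractor,
  minChars = 200, // illustrative threshold, not the tool's real content budget
): Promise<FlowResult> {
  // always: Firecrawl is the primary source and the HTML scraper is skipped.
  if (mode === "always") {
    return { source: "firecrawl", text: await extractFirecrawl(url) };
  }
  const html = await extractHtml(url);
  // off: never touch Firecrawl, whatever the HTML result looks like.
  if (mode === "off") return { source: "html", text: html };
  // auto: accept the HTML result when it is long enough.
  if (html !== null && html.length >= minChars) return { source: "html", text: html };
  const markdown = await extractFirecrawl(url);
  // Assumption: if Firecrawl also yields nothing, keep the thin HTML text.
  return markdown !== null
    ? { source: "firecrawl", text: markdown }
    : { source: "html", text: html };
}
```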
Progress Reporting
During execution, the UI prints two debug events:
- firecrawl-start: Emitted before the HTTP request.
- firecrawl-done: Emitted after the request finishes, with ok indicating success.
These events are rendered by the TTY progress component ([src/tty/website-progress.ts lines 83‑95](https://github.com/steipete/summarize/blob/main/src/tty/website-progress.ts#L83-L95)) and appear in --verbose output.
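To make the event ordering concrete, here is a small sketch that wraps a scraper call and reports the two events through a callback. Only the event names firecrawl-start and firecrawl-done and the ok field come from the text above; the event shape, the scrapeWithProgress helper, and the decision to report a thrown error as ok: false are assumptions.

```typescript
// Hypothetical event shape; the real events may carry different fields.
type FirecrawlProgressEvent =
  | { kind: "firecrawl-start"; url: string }
  | { kind: "firecrawl-done"; url: string; ok: boolean };

async function scrapeWithProgress(
  url: string,
  scrape: (url: string) => Promise<string | null>,
  onProgress: (event: FirecrawlProgressEvent) => void,
): Promise<string | null> {
  // Emitted before the request is made.
  onProgress({ kind: "firecrawl-start", url });
  let markdown: string | null = null;
  try {
    markdown = await scrape(url);
  } catch {
    // Assumption: a failed request is reported as ok: false rather than rethrown.
    markdown = null;
  }
  // Emitted after the request finishes, whether or not it produced content.
  onProgress({ kind: "firecrawl-done", url, ok: markdown !== null });
  return markdown;
}
```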
Key Implementation Files
| File | Purpose |
|---|---|
| src/firecrawl.ts | Implements createFirecrawlScraper, the thin wrapper around the Firecrawl HTTP API. |
| src/flags.ts | Parses the --firecrawl CLI flag (parseFirecrawlMode). |
| src/run/run-settings.ts | Stores the resolved Firecrawl mode in RunSettings.firecrawlMode. |
| src/run/runner.ts | Reads CLI options, validates the API key, and propagates the mode to the extraction flows. |
| src/run/flows/url/flow.ts | Coordinates website extraction; decides when to instantiate the Firecrawl scraper. |
| src/run/flows/asset/media.ts | Handles "video-only" pages and also respects the Firecrawl mode. |
| src/tty/website-progress.ts | Emits firecrawl-start / firecrawl-done progress events. |
| packages/core/src/content/link-preview/content/utils.ts | Core utility resolveFirecrawlMode used by the library API. |
| docs/website.md | Human-readable documentation of the website extraction pipeline and Firecrawl flags. |
Summary
- Summarize provides three Firecrawl integration modes (off, auto, always) to control when the hosted Firecrawl service handles website extraction versus the built-in HTML scraper.
- The default auto mode attempts native extraction first, falling back to Firecrawl only when content is insufficient.
- always mode bypasses the HTML scraper entirely, using Firecrawl as the primary extractor (requires FIRECRAWL_API_KEY).
- off disables Firecrawl completely, relying solely on native HTML extraction.
- Configuration is handled via the --firecrawl CLI flag or programmatically through resolveFirecrawlMode and createFirecrawlScraper in the core package.
Frequently Asked Questions
What happens if I use --firecrawl always without setting FIRECRAWL_API_KEY?
The runner validates the API key presence when the mode is set to always and aborts with a clear error message before any extraction begins ([src/run/runner.ts lines 411‑416](https://github.com/steipete/summarize/blob/main/src/run/runner.ts#L411-L416)).
How does the auto mode decide when to fall back to Firecrawl?
In auto mode, the HTML scraper runs first. If the extracted content length falls below the internal minimum content budget (determined by applyContentBudget), the tool emits a firecrawl-start event and retries extraction via the Firecrawl API, replacing the HTML output with the Markdown result.
Can I use Firecrawl integration when using Summarize as a library instead of CLI?
Yes. Import createFirecrawlScraper from @steipete/summarize-core to instantiate the scraper directly with your API key, or use resolveFirecrawlMode from the core utilities to handle mode resolution in custom extraction pipelines.
Does Firecrawl mode affect YouTube URL processing?
No. Firecrawl integration applies specifically to non-YouTube URLs. The tool handles YouTube content through a separate extraction pipeline, while the Firecrawl modes (off, auto, always) only govern how standard web pages are extracted.