Handling Rate Limit Errors and Robots.txt Compliance in Firecrawl: A Complete Guide
Firecrawl handles rate limit errors through a Redis-backed rate limiter that returns HTTP 429 responses with retry-after headers, while enforcing robots.txt compliance via the isUrlAllowedByRobots function in apps/api/src/lib/robots-txt.ts that blocks disallowed URLs with HTTP 403 responses.
The Firecrawl repository implements a dual-layer protection system to ensure polite, reliable web crawling. By combining strict rate limiting with automatic robots.txt validation, the platform protects both its own infrastructure and the target websites it scrapes. This guide examines the source code implementation of these mechanisms and provides practical examples for handling rate limit errors and robots.txt compliance in your Firecrawl integrations.
How Firecrawl Enforces Rate Limits
Firecrawl’s rate limiting architecture centers on a Redis-backed token bucket implementation that tracks request consumption across different API modes. The system ensures fair resource allocation while preventing abuse through dynamic quota management.
Redis-Backed Rate Limiting Architecture
The core rate limiting logic resides in apps/api/src/services/rate-limiter.ts, which instantiates RateLimiterRedis from the rate-limiter-flexible package. This implementation stores counters in Redis using the connection string defined in config.REDIS_RATE_LIMIT_URL, ensuring that rate limit state persists across API node restarts and remains consistent in distributed deployments.
The limiter creates separate Redis keys for each API mode—crawl, scrape, search, and map—allowing granular control over resource consumption patterns. Each mode maintains independent counters that track requests consumed within a 60-second sliding window.
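The per-mode counters described above can be pictured with a small in-memory stand-in. This is only a sketch: Firecrawl keeps its counters in Redis through `RateLimiterRedis`, whereas the `WindowedLimiter` class, its field names, and the `clientId` key below are hypothetical. It also uses a simple window that starts at the first request, not a true sliding window.

```typescript
// Illustrative in-memory stand-in; NOT Firecrawl's Redis-backed implementation.
type Mode = "crawl" | "scrape" | "search" | "map";

interface ConsumeResult {
  allowed: boolean;
  consumedPoints: number;
  remainingPoints: number;
  msBeforeNext: number; // time until the current window resets
}

class WindowedLimiter {
  private readonly windows = new Map<string, { count: number; resetAt: number }>();
  private readonly points: number;
  private readonly durationMs: number;

  constructor(points: number, durationMs: number = 60000) {
    this.points = points;
    this.durationMs = durationMs;
  }

  // Record one request for a mode/client pair and report the window state.
  consume(mode: Mode, clientId: string, now: number = Date.now()): ConsumeResult {
    const key = `${mode}:${clientId}`; // independent counter per mode and client
    let w = this.windows.get(key);
    if (!w || now >= w.resetAt) {
      // Start a fresh window once the previous one has expired
      w = { count: 0, resetAt: now + this.durationMs };
      this.windows.set(key, w);
    }
    w.count += 1;
    return {
      allowed: w.count <= this.points,
      consumedPoints: w.count,
      remainingPoints: Math.max(this.points - w.count, 0),
      msBeforeNext: w.resetAt - now,
    };
  }
}
```

Because the key combines the mode and the client, exhausting the scrape quota in this sketch leaves the crawl quota untouched, which mirrors the independent counters described above.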
Mode-Specific Rate Limit Defaults
When a client plan does not specify custom quotas, the fallbackRateLimits object supplies default protection thresholds. According to the source in apps/api/src/services/rate-limiter.ts (lines 19-33), these defaults include:
- 100 scrape requests per minute for single-page extraction operations
- 15 crawl requests per minute for multi-page crawling jobs
- Dynamic minimums of 100 req/min for high-impact modes like search and scrape (lines 40-44)
The getRateLimiter function applies Math.max(rateLimit, 100) for these intensive operations, preventing accidental throttling that could degrade user experience during bulk operations.
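The fallback and floor logic can be sketched as follows. The function name `resolveLimit`, the `PlanLimits` type, and the `flooredModes` set are hypothetical. Only the numbers (100 for scrape, 15 for crawl, and the `Math.max(rateLimit, 100)` floor for search and scrape) come from the description above, and the article states no default for search, so the sketch lets the floor decide in that case.

```typescript
type Mode = "crawl" | "scrape" | "search";
type PlanLimits = Partial<Record<Mode, number>>;

// Defaults quoted in the text; no default for search is stated there.
const fallbackRateLimits: PlanLimits = { scrape: 100, crawl: 15 };

// Modes that, per the text, receive a 100 req/min floor.
const flooredModes: ReadonlySet<Mode> = new Set<Mode>(["search", "scrape"]);

// Hypothetical helper: pick the plan limit, else the fallback, then apply the floor.
function resolveLimit(mode: Mode, plan?: PlanLimits | null): number {
  const base = plan?.[mode] ?? fallbackRateLimits[mode] ?? 0;
  return flooredModes.has(mode) ? Math.max(base, 100) : base;
}
```

Under these assumptions, a plan that sets scrape to 20 requests per minute still resolves to 100, while a crawl limit of 5 is kept as-is because crawl is not a floored mode.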
Handling 429 Rate Limit Errors
When the limiter blocks a request, the auth controller in apps/api/src/controllers/auth.ts returns a structured HTTP 429 response. The error payload includes:
- A descriptive message indicating "Rate limit exceeded"
- Consumed points and remaining quota for the current window
- A retry-after timestamp indicating when the client can safely resume requests
Client implementations should parse the retry-after header (returned in seconds) and implement exponential back-off strategies to handle these gracefully.
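One way to combine the retry-after value with exponential back-off is sketched below. The helper name `backoffDelayMs`, its options, and the "equal jitter" scheme are illustrative choices and are not taken from Firecrawl; the only assumption carried over from the text is that retry-after is expressed in seconds.

```typescript
interface BackoffOptions {
  baseMs?: number; // delay for the first retry
  capMs?: number; // upper bound on the exponential term
  random?: () => number; // injectable for deterministic tests
}

// Hypothetical helper: how long to wait before retry number `attempt` (0-based).
function backoffDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  opts: BackoffOptions = {},
): number {
  const { baseMs = 1000, capMs = 60000, random = Math.random } = opts;

  // Exponential growth: base, 2x base, 4x base, ... capped at capMs
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);

  // "Equal jitter": keep half of the delay fixed and randomize the other half
  const jittered = exponential / 2 + random() * (exponential / 2);

  // Server hint in seconds; malformed or missing values are ignored
  const seconds = retryAfterHeader === null ? NaN : Number.parseInt(retryAfterHeader, 10);
  const serverMs = Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : 0;

  // Never retry sooner than the server asked
  return Math.max(serverMs, Math.round(jittered));
}
```

Taking the maximum of the two values means the server hint always wins when it is longer, while the jitter spreads out retries from clients that were throttled at the same moment.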
Robots.txt Compliance in Firecrawl
Firecrawl implements comprehensive robots.txt validation to ensure ethical crawling practices. The system fetches, parses, and enforces crawler directives before initiating any network request to a target domain.
Fetching and Parsing robots.txt
The robots.txt handling logic lives in apps/api/src/lib/robots-txt.ts. The fetchRobotsTxt function constructs the robots.txt URL by appending /robots.txt to the target origin, then retrieves the file using the standard Firecrawl scrape pipeline with an 8-second timeout.
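Deriving the robots.txt location from a target URL can be sketched with the standard `URL` class. The helper name `robotsTxtUrl` is hypothetical, and the sketch covers only the URL construction, not the fetch through the scrape pipeline or its timeout.

```typescript
// Hypothetical helper: build the robots.txt URL for the origin of a target URL.
function robotsTxtUrl(targetUrl: string): string {
  // origin is scheme + host + non-default port, with path, query and fragment dropped
  const origin = new URL(targetUrl).origin;
  return new URL("/robots.txt", origin).href;
}
```

For example, `robotsTxtUrl("https://example.com/blog/post?id=1")` yields `https://example.com/robots.txt`.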
Once fetched, the createRobotsChecker function (line 27) utilizes the robots-parser library to parse the raw text into a structured Robot object. This object exposes the isAllowed(url, userAgent) method used for subsequent validation checks.
URL Validation with isUrlAllowedByRobots
The critical compliance check occurs in isUrlAllowedByRobots (lines 35-76 of apps/api/src/lib/robots-txt.ts). This function validates URLs against the parsed robots.txt rules using Firecrawl's standard user-agents: FireCrawlAgent and FirecrawlAgent.
The implementation includes a trailing-slash fallback mechanism to handle "Disallow: /path/" rules that might otherwise miss variations of the same path. If the primary check returns ambiguous results, the function appends a trailing slash and re-evaluates to ensure comprehensive coverage of directory-level disallow directives.
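The trailing-slash fallback can be illustrated with a minimal sketch. It relies on several assumptions that the text does not confirm: the `RobotsChecker` interface only mirrors the `isAllowed(url, userAgent)` method mentioned above, an undefined result is treated as allowed, both user-agent names must pass, and the slash variant is checked whenever the original URL passes. The function name `isAllowedWithSlashFallback` is hypothetical.

```typescript
// Minimal shape of the parsed robots.txt object used by this sketch.
interface RobotsChecker {
  isAllowed(url: string, userAgent?: string): boolean | undefined;
}

const USER_AGENTS = ["FireCrawlAgent", "FirecrawlAgent"];

// Hypothetical helper: allow a URL only if it and its trailing-slash variant pass.
function isAllowedWithSlashFallback(url: string, robots: RobotsChecker): boolean {
  // Assumption: every user-agent must be allowed; undefined counts as allowed
  const passes = (candidate: string): boolean =>
    USER_AGENTS.every((ua) => robots.isAllowed(candidate, ua) ?? true);

  if (!passes(url)) return false;

  // Re-check with a trailing slash so "Disallow: /path/" also covers "/path"
  const parsed = new URL(url);
  if (!parsed.pathname.endsWith("/")) {
    parsed.pathname += "/";
    return passes(parsed.href);
  }
  return true;
}
```

With a rule such as "Disallow: /private/", a plain prefix match would let `https://example.com/private` through; the second check against `https://example.com/private/` is what blocks it in this sketch.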
Integration Points in the Crawler
Robots.txt enforcement occurs at two critical integration points:
- Single-page scraper: apps/api/src/scraper/scrapeURL/index.ts invokes isUrlAllowedByRobots before executing the scrape request. If the URL violates robots.txt directives, the scraper returns an HTTP 403 Forbidden response immediately, preventing unnecessary network traffic.
- Multi-page crawler: apps/api/src/WebScraper/crawler.ts performs the same validation during the crawling loop. When encountering disallowed URLs, the crawler skips them and logs the restriction, maintaining compliance throughout the entire crawl job.
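The skip-and-log behavior of the crawl loop can be pictured with a small filter. This is a sketch under assumptions and does not reproduce the code in crawler.ts: the names `filterByRobots` and `FilterResult`, the injected `isAllowed` predicate, and the log message are all hypothetical.

```typescript
interface FilterResult {
  allowed: string[]; // URLs that may be fetched
  skipped: string[]; // URLs dropped because robots.txt disallows them
}

// Hypothetical helper: split discovered URLs into allowed and skipped sets.
function filterByRobots(
  urls: string[],
  isAllowed: (url: string) => boolean,
  log: (message: string) => void = console.debug,
): FilterResult {
  const result: FilterResult = { allowed: [], skipped: [] };
  for (const url of urls) {
    if (isAllowed(url)) {
      result.allowed.push(url);
    } else {
      // Skip instead of failing the whole job, and leave a trace in the logs
      result.skipped.push(url);
      log(`Skipping ${url}: disallowed by robots.txt`);
    }
  }
  return result;
}
```

Passing the predicate in as a parameter keeps the sketch independent of any particular robots.txt parser.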
Practical Implementation Examples
Handling 429 Rate Limit Errors in Client Code
When integrating with the Firecrawl API, implement retry logic that respects the retry-after header:
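import fetch from "node-fetch";

// Scrape one page, retrying on HTTP 429 at most maxRetries times.
async function scrapePage(url: string, maxRetries = 3): Promise<unknown> {
  // The v1 scrape endpoint takes a POST request with a JSON body
  const resp = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });
  if (resp.status === 429 && maxRetries > 0) {
    const body = (await resp.json()) as { error?: string };
    console.warn("Rate limit hit:", body.error);
    // Back off for the suggested retry-after time (seconds); fall back to 60 s
    const seconds = parseInt(resp.headers.get("retry-after") ?? "60", 10);
    const retryAfterMs = (Number.isNaN(seconds) ? 60 : seconds) * 1000;
    await new Promise((r) => setTimeout(r, retryAfterMs));
    return scrapePage(url, maxRetries - 1); // retry with one fewer attempt left
  }
  return resp.json();
}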
Checking Robots.txt Compliance Programmatically
For custom implementations requiring pre-flight checks, utilize the internal validation functions:
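// These helpers are internal to apps/api/src/lib/robots-txt.ts; the "firecrawl"
// import path is a placeholder, so adjust it to wherever the module lives in your checkout.
import { fetchRobotsTxt, createRobotsChecker, isUrlAllowedByRobots } from "firecrawl";

async function canScrape(url: string) {
  // Fetch the raw robots.txt content for the target's origin
  const { content, url: robotsUrl } = await fetchRobotsTxt(
    { url, zeroDataRetention: true },
    "demo-scrape",
    console,
  );
  // Parse the rules, then check the URL against them
  const checker = createRobotsChecker(url, content);
  const allowed = isUrlAllowedByRobots(url, checker.robots);
  console.log(`URL ${url} is ${allowed ? "allowed" : "blocked"} by ${robotsUrl}`);
  return allowed;
}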
Customizing Rate Limit Quotas Server-Side
When self-hosting Firecrawl or extending the API, adjust rate limits based on subscription tiers:
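// In a request handler (e.g., inside auth middleware).
// getRateLimiter and RateLimiterMode are internal to the API; the "firecrawl"
// import path is a placeholder, so adjust it to your checkout.
import { getRateLimiter, RateLimiterMode } from "firecrawl";

function getLimiterForPlan(plan) {
  const limits = plan?.rate_limits ?? null; // pulled from DB
  return getRateLimiter(RateLimiterMode.Scrape, limits);
}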
Summary
- Rate limiting in Firecrawl relies on a Redis-backed RateLimiterRedis instance that tracks consumption per API mode (crawl, scrape, search) with 60-second windows, defaulting to 100 req/min for scrapes and 15 req/min for crawls.
- 429 errors include detailed payloads with consumed points, remaining quota, and retry-after timestamps, enabling clients to implement intelligent back-off strategies.
- Robots.txt compliance is enforced via apps/api/src/lib/robots-txt.ts, which fetches, parses, and validates URLs against FireCrawlAgent and FirecrawlAgent user-agents before any network request.
- 403 responses are returned immediately when isUrlAllowedByRobots detects a violation, preventing unnecessary traffic and ensuring ethical crawling practices across both single-page scrapes and multi-page crawls.
Frequently Asked Questions
What HTTP status code does Firecrawl return when rate limits are exceeded?
Firecrawl returns HTTP 429 Too Many Requests when the Redis-backed rate limiter blocks a request. The response body includes a descriptive error message, the number of consumed points, remaining quota for the current 60-second window, and a retry-after timestamp indicating when the client can safely resume requests.
How does Firecrawl handle robots.txt directives for different user agents?
Firecrawl validates URLs against robots.txt rules using the standard user-agents FireCrawlAgent and FirecrawlAgent as defined in apps/api/src/lib/robots-txt.ts. The isUrlAllowedByRobots function checks both user-agent strings against the parsed robots.txt rules and includes a trailing-slash fallback mechanism to ensure comprehensive coverage of directory-level disallow directives.
Can I customize rate limit quotas for different API plans in Firecrawl?
Yes, when self-hosting or extending Firecrawl, you can customize rate limits by passing custom quota objects to the getRateLimiter function in apps/api/src/services/rate-limiter.ts. The system supports plan-specific rate limits pulled from a database, with fallback defaults of 100 requests per minute for scrape operations and 15 requests per minute for crawl operations when no custom limits are provided.
What happens when Firecrawl encounters a URL blocked by robots.txt?
When Firecrawl detects a robots.txt violation through the isUrlAllowedByRobots function, it immediately returns an HTTP 403 Forbidden response without making any network request to the target URL. This enforcement occurs in both the single-page scraper (apps/api/src/scraper/scrapeURL/index.ts) and the multi-page crawler (apps/api/src/WebScraper/crawler.ts), ensuring consistent compliance across all crawling operations.