# How to Design a Scalable Web Crawler: Architecture & Implementation Guide

> Learn to design a scalable web crawler with our architecture and implementation guide. Discover how to handle billions of pages using a distributed queue-based system.

- Repository: [Donne Martin/system-design-primer](https://github.com/donnemartin/system-design-primer)
- Tags: how-to-guide
- Published: 2026-02-24

---

**A scalable web crawler uses a distributed queue-based architecture with Redis-backed priority scheduling, signature-based deduplication, and separate services for crawling, reverse indexing, and document storage to handle billions of pages and high-throughput search queries.**

Designing a scalable web crawler requires balancing throughput, storage efficiency, and politeness while processing billions of URLs monthly. According to the `donnemartin/system-design-primer` repository, a production-grade crawler must generate reverse indexes, serve static content snippets, and maintain high availability under massive load. This guide breaks down the architecture, core components, and implementation patterns found in the `solutions/system_design/web_crawler/` directory.

## Core Requirements and Constraints

A scalable web crawler design must satisfy strict operational constraints. The [`README.md`](https://github.com/donnemartin/system-design-primer/blob/main/README.md) in the web crawler solution specifies handling approximately **4 billion links per month** (generating ~100 TB of raw content) and serving **100 billion search queries monthly**.

Key requirements include:

*   **High Availability**: No single point of failure across crawler workers, queues, and storage layers.
*   **Priority Scheduling**: URLs must be crawled by priority score, not FIFO order.
*   **Duplicate Detection**: Prevent redundant processing of identical content using content-derived signatures.
*   **Politeness**: Respect [`robots.txt`](https://github.com/donnemartin/system-design-primer/blob/main/robots.txt) and per-host rate limits to avoid overwhelming source servers.

## High-Level Architecture Overview

The architecture separates concerns into distinct services communicating via message queues. According to the `web_crawler` solution, the system comprises four primary layers:

1.  **Crawler Service**: Distributed workers that fetch pages, extract links, and enqueue indexing tasks.
2.  **Reverse Index Service**: Consumes crawled content to build word-to-page mappings.
3.  **Document Service**: Generates and stores static titles and snippets for fast search retrieval.
4.  **Query Service**: Handles user search requests against the pre-computed indexes.

The data flow relies on **Redis** for priority queue management and **distributed message queues** (SQS, Kafka, or RabbitMQ) for decoupling components. The `PagesDataStore` class abstracts Redis interactions, utilizing sorted sets for priority scheduling and hash sets for URL deduplication.

## Deep Dive: The PagesDataStore Implementation

The `PagesDataStore` class in [`web_crawler_snippets.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_snippets.py) serves as the abstraction layer for crawl state management. It wraps Redis commands to maintain two critical data structures: `links_to_crawl` (a sorted set for priority) and `crawled_links` (a hash set for deduplication).

### Priority Queue Management

The implementation uses Redis sorted sets (`ZADD`, `ZREM`, `ZREVRANGE`) to ensure the crawler always processes the highest-priority URL available:

```python
import redis

class PagesDataStore:
    def __init__(self, host='localhost'):
        self.r = redis.StrictRedis(host=host)

    def add_link_to_crawl(self, url, priority=0):
        """Add URL with priority score using ZADD"""
        self.r.zadd('links_to_crawl', {url: priority})

    def extract_max_priority_page(self):
        """Pop highest priority URL using ZREVRANGE"""
        result = self.r.zrevrange('links_to_crawl', 0, 0, withscores=True)
        if not result:
            return None
        url, score = result[0]
        return url

    def remove_link_to_crawl(self, url):
        """Remove processed URL using ZREM"""
        self.r.zrem('links_to_crawl', url)

```

This design ensures O(log N) operations for insertions and extractions, critical when managing millions of pending URLs.

### Deduplication via Content Signatures

To prevent infinite loops and redundant crawling, the store maintains content signatures (SHA-256 hashes of URL + HTML content):

```python
    def insert_crawled_link(self, url, signature):
        """Store signature in hash set"""
        self.r.hset('crawled_links', url, signature)

    def crawled_similar(self, signature):
        """Check if signature exists in any crawled link"""
        return any(sig == signature for sig in self.r.hvals('crawled_links'))

```

## Crawler Logic and Worker Implementation

The `Crawler` class orchestrates the crawling workflow. It continuously polls `PagesDataStore` for high-priority pages, respects politeness rules, and fans out indexing tasks.

### Main Crawl Loop

As implemented in [`web_crawler_snippets.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_snippets.py), the core logic follows this pattern:

```python
def run_crawler(seed_urls):
    store = PagesDataStore()
    for url in seed_urls:
        store.add_link_to_crawl(url, priority=0)
    
    crawler = Crawler(
        data_store=store,
        reverse_index_queue=ReverseIndexQueue(),
        doc_index_queue=DocumentQueue(),
    )
    crawler.crawl()

class Crawler:
    def crawl(self):
        while True:
            url = self.data_store.extract_max_priority_page()
            if not url:
                break
                
            page = self.download_page(url)
            
            # Duplicate detection

            if self.data_store.crawled_similar(page.signature):
                self.data_store.reduce_priority(url)
                continue
                
            # Process valid page

            self.data_store.insert_crawled_link(url, page.signature)
            self.extract_links(page)
            self.enqueue_for_indexing(page)

```

Each worker maintains connection-pooled HTTP clients and respects per-host crawl delays to prevent overwhelming target servers.

## Handling Massive URL Deduplication with MapReduce

When ingesting billions of seed URLs, in-memory deduplication becomes prohibitive. The repository provides a `RemoveDuplicateUrls` MapReduce job in [`web_crawler_mapreduce.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_mapreduce.py) using the `mrjob` library.

This batch job emits each URL with a count of 1, aggregates counts across the dataset, and retains only unique entries:

```bash
python web_crawler_mapreduce.py urls.txt > unique_urls.txt

```

The Mapper emits `(url, 1)` for each input line. The Reducer sums counts and outputs URLs where total count equals 1. Running this nightly on seed lists prevents crawler workers from wasting bandwidth on duplicate targets.

## Scaling Strategies for Production

The [`README.md`](https://github.com/donnemartin/system-design-primer/blob/main/README.md) outlines specific strategies to handle the target scale of 4 billion links and 40,000 RPS search traffic:

**Horizontal Scaling of Crawler Workers**

Deploy crawler workers across multiple availability zones behind a load balancer. Use connection pooling and async I/O to maximize throughput for the ~1 million requests per second required to meet the 4B links/month target.

**Storage Sharding**

Shard the reverse index and document store horizontally. Store raw HTML blobs in object storage (S3) while keeping indexes in distributed NoSQL databases like Cassandra or DynamoDB.

**Caching Hot Queries**

Deploy Redis or Memcached in front of the Reverse Index Service to cache top-N search results. This mitigates the 40,000 RPS read requirement and reduces latency for common queries.

**Distributed Queue Resilience**

Use Kafka or SQS with multiple partitions to ensure the message queues are not a single point of failure. This provides back-pressure control when the indexing pipeline slows.

**Robots.txt Compliance**

Maintain a fast key-value store (Redis) caching [`robots.txt`](https://github.com/donnemartin/system-design-primer/blob/main/robots.txt) rules per domain. Enforce per-host rate limiting to ensure politeness and avoid IP bans.

## Summary

*   **Architecture**: Separate concerns into Crawler, Reverse Index, Document, and Query services connected by distributed queues.
*   **Priority Management**: Use Redis sorted sets via `PagesDataStore` to ensure high-priority URLs are processed first using `ZADD` and `ZREVRANGE`.
*   **Deduplication**: Implement content signatures in `crawled_links` hash sets to prevent redundant crawling, supplemented by MapReduce batch jobs for initial seed cleaning.
*   **Scaling**: Achieve 4B links/month throughput via horizontal worker scaling, storage sharding, and aggressive caching for 40k RPS query loads.
*   **Implementation**: Reference [`web_crawler_snippets.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_snippets.py) for the core Python classes and [`web_crawler_mapreduce.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_mapreduce.py) for large-scale URL deduplication.

## Frequently Asked Questions

### What data structure is best for prioritizing URLs in a scalable web crawler?

**Redis sorted sets (ZSETs) provide the optimal data structure for URL prioritization.** As implemented in [`web_crawler_snippets.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_snippets.py), the `PagesDataStore` class uses `ZADD` to insert URLs with priority scores and `ZREVRANGE` to extract the highest-priority item in O(log N) time. This allows the crawler to prioritize fresh or high-authority content over stale links.

### How do you prevent a distributed web crawler from processing the same URL multiple times?

**Implement signature-based deduplication using content hashing.** The crawler computes a SHA-256 signature of the URL and HTML content, storing it in a Redis hash set (`crawled_links`). Before processing, it checks `crawled_similar()` to detect duplicates. For initial seed lists containing billions of URLs, run the `RemoveDuplicateUrls` MapReduce job from [`web_crawler_mapreduce.py`](https://github.com/donnemartin/system-design-primer/blob/main/web_crawler_mapreduce.py) to filter duplicates before crawling begins.

### What throughput is required to crawl 4 billion links per month?

**Approximately 1,600 requests per second sustained, or higher with burst capacity.** The repository calculates 4 billion links per month as roughly 100 TB of raw content. To handle this volume, the architecture requires horizontal scaling of crawler workers, connection pooling, and async I/O. The design targets ~1 million requests per second capacity during peak operations to accommodate retries and politeness delays.

### How do you ensure a web crawler respects robots.txt while maintaining speed?

**Cache robots.txt rules in a fast key-value store and enforce per-host rate limiting.** The crawler maintains a Redis cache of domain-specific crawl rules and implements token bucket or sliding window rate limiters per host. This prevents overwhelming target servers while allowing maximum throughput for diverse domains, ensuring compliance with the politeness requirements specified in the system design.