# How to Design Amazon's Sales Ranking by Category Feature: A Complete System Design Guide

> Design Amazon's sales ranking by category feature using MapReduce and a REST API. Learn to handle 100 billion monthly reads efficiently with a complete system design guide.

- Repository: [Donne Martin/system-design-primer](https://github.com/donnemartin/system-design-primer)
- Tags: how-to-guide
- Published: 2026-02-24

---

**To design Amazon's sales ranking by category feature, implement a batch-processing pipeline using MapReduce to aggregate weekly sales data into a relational database, then serve rankings through a REST API with a memory cache layer to handle 100 billion monthly reads.**

Designing Amazon's sales ranking by category feature requires handling approximately 100 billion reads per month while processing only 400 writes per second. The donnemartin/system-design-primer repository provides a proven architecture that separates heavy computation from high-traffic serving to achieve this scale efficiently.

## Architecture Overview

The design follows a **batch-processing plus online-serving** pattern detailed in [`solutions/system_design/sales_rank/README.md`](https://github.com/donnemartin/system-design-primer/blob/main/solutions/system_design/sales_rank/README.md). Raw sales data flows through MapReduce jobs that pre-compute rankings, which are then served via cached API endpoints to millions of users.

## Data Ingestion and Batch Processing

### Storing Raw Sales Events

Raw sales logs are written to an **object store** such as Amazon S3 rather than a custom file system. This decouples data ingestion from processing and provides durability for historical analysis while keeping costs low for high-volume log storage.

### MapReduce Aggregation

A **MapReduce** job processes the S3 logs to aggregate quantities sold per category and product pair for the past week. According to the source code analysis, the implementation uses Python with the MRJob library:

```python
class SalesRanker(MRJob):
    def within_past_week(self, timestamp):
        # implement time-window check

        ...

    def mapper(self, _, line):
        # tab-delimited log entry

        timestamp, product_id, category_id, quantity, *_ = line.split('\t')
        if self.within_past_week(timestamp):
            yield (category_id, product_id), int(quantity)

    def reducer(self, key, values):
        # sum quantities for each (category, product)

        yield key, sum(values)

    def mapper_sort(self, key, value):
        # create sort key (category, total_sold)

        category_id, product_id = key
        yield (category_id, value), product_id

    def reducer_identity(self, key, values):
        # final output after distributed sort

        yield key, list(values)[0]

    def steps(self):
        return [
            self.mr(mapper=self.mapper, reducer=self.reducer),
            self.mr(mapper=self.mapper_sort,
                    reducer=self.reducer_identity)
        ]

```

The mapper extracts timestamps and yields key-value pairs of `(category_id, product_id)` and quantity. The reducer sums these values, and a secondary sort organizes results by category and total sold before writing to the database.

## Online Serving Layer

### REST API Endpoint

The system exposes a **REST API** to deliver pre-computed rankings. The endpoint follows this specification:

```http
GET https://amazon.com/api/v1/popular?category_id=1234

```

```json
[
  {"id":"100","category_id":"1234","total_sold":"100000","product_id":"50"},
  {"id":"53","category_id":"1234","total_sold":"90000","product_id":"200"},
  {"id":"75","category_id":"1234","total_sold":"80000","product_id":"3"}
]

```

### Relational Database Schema

Aggregated rankings are stored in a relational **sales_rank** table optimized for read access:

```sql
CREATE TABLE sales_rank (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    category_id INT NOT NULL,
    total_sold  INT NOT NULL,
    product_id  INT NOT NULL,
    INDEX idx_category (category_id),
    INDEX idx_product  (product_id)
);

```

The composite indexes on `category_id` and `product_id` support efficient lookup and join operations for the API layer.

## Scaling for High Traffic

### Memory Caching Layer

Because reads dominate the workload, a **memory cache** such as Redis or Memcached stores hot ranking data. The system uses a **cache-aside** pattern where reads first check the cache, falling back to database read replicas on misses. Cache entries use TTL or refresh-ahead strategies to balance freshness with performance.

### Database Scaling Strategies

To survive traffic spikes, the architecture employs:

- **SQL read replicas** to distribute query load away from the master database
- **Sharding** for the sales_rank table to improve write throughput, though this adds complexity for routing and rebalancing across database nodes

### Infrastructure Components

A **load balancer** distributes requests across multiple web servers acting as reverse proxies. This horizontal scaling ensures the system handles peak traffic without single points of failure while maintaining the modest 400 writes per second capacity.

## Key Design Trade-offs

The implementation in [`solutions/system_design/sales_rank/README.md`](https://github.com/donnemartin/system-design-primer/blob/main/solutions/system_design/sales_rank/README.md) evaluates several critical trade-offs:

**Write Scalability**: A single master-slave setup offers operational simplicity, while sharding improves write throughput but requires complex routing logic and rebalancing operations when data grows.

**Read Latency**: Direct database queries provide strong consistency but higher latency; adding a cache achieves sub-millisecond response times at the cost of managing cache invalidation through TTL or refresh-ahead mechanisms.

**Data Freshness**: Hourly batch processing fits the "past week" ranking requirement with simple, fault-tolerant operations; real-time streaming would increase operational cost and complexity without significant business benefit for this use case.

**Consistency**: Eventual consistency from read replicas is acceptable for displaying sales rankings, whereas strong consistency would require reading from the master database and create insurmountable bottlenecks under 100 billion monthly reads.

## Summary

- **Batch processing** using MapReduce on S3 logs handles computational heavy lifting outside the critical serving path
- **Pre-computed rankings** stored in relational databases with proper indexing enable fast category lookups via `/api/v1/popular`
- **Cache-aside pattern** with Redis or Memcached protects the database from read traffic and delivers sub-millisecond latency
- **Read replicas and sharding** provide horizontal scalability for the serving layer while managing the modest write load
- **Eventual consistency** is architecturally acceptable for this feature, allowing optimized read performance over strict consistency

## Frequently Asked Questions

### Why use batch processing instead of real-time streaming for Amazon's sales ranking?

Hourly batch processing is chosen because sales rankings aggregate data over the past week, making minute-by-minute updates unnecessary. Batch jobs are simpler to implement, debug, and operate than real-time streaming pipelines, and they naturally fit the MapReduce paradigm for large-scale aggregation as implemented in the system-design-primer solution.

### How does the system handle hot categories with extremely high read traffic?

Hot categories are managed through the memory cache layer using Redis or Memcached. The cache-aside pattern ensures that popular category rankings are served directly from memory rather than hitting the database. Multiple cache servers can be deployed to distribute load, with cache misses falling back to read replicas rather than the master database.

### What triggers updates to the sales_rank table and cache invalidation?

The MapReduce job runs on a scheduled basis (e.g., hourly) to recalculate weekly sales totals. Upon completion, it writes new rankings to the sales_rank table. Cache invalidation occurs through TTL (time-to-live) expiration or refresh-ahead strategies where the system updates cache entries before they expire, ensuring users see current data without cache stampedes.

### Why is a relational database used instead of NoSQL for storing rankings?

A relational database is chosen for the sales_rank table because the data is naturally structured with fixed schemas (category_id, product_id, total_sold) and requires ACID compliance for batch updates. While NoSQL options like Redshift or BigQuery are mentioned for long-term analytics warehousing, the online serving layer benefits from the indexing capabilities and transaction support of SQL databases for point lookups by category.