# How to Design a Social Network Data Structure for Efficient Querying: A Complete System Design Guide

> Design an efficient social network data structure using sharded graph storage and BFS traversal for sub-second user connection path querying across distributed servers.

- Repository: [Donne Martin/system-design-primer](https://github.com/donnemartin/system-design-primer)
- Tags: tutorial
- Published: 2026-02-24

---

**Combine sharded graph storage with breadth-first search (BFS) traversal across distributed Person Servers to find shortest connection paths between users in sub-second latency.**

Designing a social network data structure for efficient querying requires handling massive unweighted graphs—hundreds of millions of users and billions of friendship edges—while maintaining low-latency pathfinding operations. According to the `donnemartin/system-design-primer` repository, a production-grade solution partitions the graph across sharded storage nodes and employs a microservices architecture to orchestrate distributed BFS queries. This design pattern supports horizontal scalability and high availability for read-heavy workloads exceeding 400 requests per second.

## Clarifying Use Cases and Scale Constraints

Before implementing the data structure, define the operational boundaries. The System Design Primer assumes a graph with **~100 million users** and an average of **50 friends per user**, generating approximately **5 billion unweighted edges**. The primary query requirement is finding the shortest path between two user IDs, while non-functional goals demand sub-second response times, high availability, and horizontal scalability across distributed infrastructure.

## Core Architecture for Distributed Graph Queries

The solution separates concerns into discrete microservices to isolate bottlenecks and enable independent scaling. The request flows through the following components:

- **Client → Web Server / Reverse Proxy**: Accepts HTTP requests (e.g., `GET /api/v1/friend_search?person_id=123`).
- **Search API**: Validates input parameters and forwards requests to the graph service.
- **User Graph Service**: Orchestrates the BFS traversal algorithm and manages query state.
- **Lookup Service**: Maintains a directory mapping `person_id` to the specific **Person Server** holding that user's data.
- **Person Server(s)**: Sharded key-value stores that persist user profiles and adjacency lists.
- **Memory Cache (Redis/Memcached)**: Stores hot `Person` objects and partial BFS results to reduce storage lookup latency.

This architecture is documented in [`solutions/system_design/social_graph/README.md`](https://github.com/donnemartin/system-design-primer/blob/main/solutions/system_design/social_graph/README.md) within the repository.

## Data Model and Sharding Strategy

### The Person Entity

The fundamental data unit is the `Person` object, which contains an immutable identifier, display name, and a list of friend IDs representing graph edges.

```python
class Person(object):
    def __init__(self, id, name, friend_ids):
        self.id = id
        self.name = name
        self.friend_ids = friend_ids

```

### Sharded Storage with Person Servers

To ensure the graph fits in memory across a cluster, the design employs **range-based or hash-based sharding**. Each `PersonServer` instance manages a subset of the user base in a simple key-value structure:

```python
class PersonServer(object):
    def __init__(self):
        self.people = {}  # id → Person

    def add_person(self, person):
        self.people[person.id] = person

    def get_people(self, ids):
        return [self.people[i] for i in ids if i in self.people]

```

### Lookup Service for Shard Resolution

The `LookupService` acts as a directory that resolves any `person_id` to its responsible `PersonServer`. This adds a single extra hop to queries but enables the graph to scale horizontally across hundreds of nodes.

```python
class LookupService(object):
    def __init__(self):
        self.lookup = self._init_lookup()  # id → PersonServer

    def lookup_person_server(self, person_id):
        return self.lookup[person_id]

```

## Implementing the Shortest Path Query Algorithm

### Breadth-First Search (BFS) Traversal

Because friendship edges are unweighted, the shortest path between two users is found using **BFS traversal**. The `UserGraphService` implements this algorithm across distributed shards by fetching adjacency lists from `PersonServer` instances via the `LookupService`.

```python
import collections

class UserGraphService(object):
    def __init__(self, lookup_service):
        self.lookup_service = lookup_service

    def shortest_path(self, source_key, dest_key):
        if source_key is None or dest_key is None:
            return None
        if source_key == dest_key:
            return [source_key]

        prev = {}
        visited = {source_key}
        queue = collections.deque([source_key])

        while queue:
            cur = queue.popleft()
            if cur == dest_key:
                # Reconstruct path by backtracking through predecessors

                path = [dest_key]
                while path[-1] != source_key:
                    path.append(prev[path[-1]])
                return list(reversed(path))

            # Fetch current person and their friends from sharded storage

            cur_server = self.lookup_service.lookup_person_server(cur)
            cur_person = cur_server.get_people([cur])[0]
            
            for friend_id in cur_person.friend_ids:
                if friend_id not in visited:
                    visited.add(friend_id)
                    prev[friend_id] = cur
                    queue.append(friend_id)
        return None

```

### Optimizing with Bidirectional BFS

For large graph depths, the design supports **bidirectional BFS**—simultaneously searching from both the source and destination users. This technique halves the number of explored nodes and significantly reduces memory consumption and query latency under heavy load.

## Scaling the Design for Production Traffic

The [`solutions/system_design/social_graph/README.md`](https://github.com/donnemartin/system-design-primer/blob/main/solutions/system_design/social_graph/README.md) file outlines specific strategies to address performance bottlenecks:

| Bottleneck | Mitigation Strategy |
|------------|---------------------|
| **Lookup latency** | Cache the `person_id → server` mapping in an in-memory store; batch lookups for multiple IDs. |
| **Hot user data** | Deploy Redis or Memcached to store frequently accessed `Person` objects and pre-computed BFS results. |
| **Cross-network traffic** | Co-locate users with geographic or social locality on the same shard; implement shard-by-location strategies. |
| **Read-heavy load** | Add read replicas of `PersonServer` instances and employ read-through caching patterns. |
| **Deep BFS traversal** | Implement bidirectional BFS to reduce explored node count by orders of magnitude. |
| **Traffic spikes** | Deploy auto-scaling groups behind load balancers with rate limiting and graceful degradation policies. |

## Key Trade-offs in Social Graph Design

**Complexity versus Performance**: Introducing sharding, caching layers, and bidirectional search algorithms increases operational complexity but is necessary to achieve sub-second latency at scale.

**Consistency Model**: The architecture favors **eventual consistency** for read operations, serving data from cache. Write operations—such as new friendship creations—must update the underlying shard and invalidate relevant cache entries to maintain data accuracy.

**Infrastructure Cost**: Running hundreds of sharded `PersonServer` nodes with replica sets and dedicated cache clusters increases infrastructure overhead. However, this cost is required to meet strict availability and latency SLAs for global user bases.

## Summary

- **Shard the graph** across `PersonServer` nodes using range or hash-based partitioning to handle billions of edges.
- **Implement a `LookupService`** to map user IDs to their respective shards, enabling distributed query execution.
- **Use BFS traversal** for unweighted shortest-path queries, with bidirectional optimization for deep graph searches.
- **Cache aggressively** at the lookup layer and data layer to minimize cross-service latency.
- **Design for horizontal scalability** using microservices that can independently scale based on read/write patterns.

## Frequently Asked Questions

### Why use BFS instead of DFS for social network queries?

BFS guarantees the shortest path in unweighted graphs by exploring all nodes at the present depth before moving deeper. DFS could explore extremely long paths first, potentially missing shorter connections and wasting computational resources on large social graphs.

### How do you handle hot spots or celebrity users in the sharded design?

Celebrity users with millions of followers are mitigated through **aggressive caching** in the memory layer (Redis/Memcached). Additionally, read replicas of the `PersonServer` holding high-profile users distribute query load, while the `LookupService` can implement special routing logic for known hot keys.

### What is the purpose of the Lookup Service in this architecture?

The `LookupService` maintains a directory mapping `person_id` to the specific `PersonServer` shard containing that user's data. This indirection enables the graph to scale horizontally beyond single-machine memory limits while providing a consistent interface for the `UserGraphService` to locate distributed adjacency lists during BFS traversal.

### How does bidirectional BFS improve query performance?

Bidirectional BFS runs two simultaneous searches—one from the source user and one from the destination—stopping when they intersect. This approach reduces the number of visited nodes from O(b^d) to O(b^(d/2)), where *b* is the branching factor and *d* is the path depth, dramatically decreasing memory usage and query latency for distant connections.