How to Design a Social Network Data Structure for Efficient Querying: A Complete System Design Guide
Combine sharded graph storage with breadth-first search (BFS) traversal across distributed Person Servers to find shortest connection paths between users in sub-second latency.
Designing a social network data structure for efficient querying requires handling massive unweighted graphs—hundreds of millions of users and billions of friendship edges—while maintaining low-latency pathfinding operations. According to the donnemartin/system-design-primer repository, a production-grade solution partitions the graph across sharded storage nodes and employs a microservices architecture to orchestrate distributed BFS queries. This design pattern supports horizontal scalability and high availability for read-heavy workloads exceeding 400 requests per second.
Clarifying Use Cases and Scale Constraints
Before implementing the data structure, define the operational boundaries. The System Design Primer assumes a graph with ~100 million users and an average of 50 friends per user, generating approximately 5 billion unweighted edges. The primary query requirement is finding the shortest path between two user IDs, while non-functional goals demand sub-second response times, high availability, and horizontal scalability across distributed infrastructure.
Core Architecture for Distributed Graph Queries
The solution separates concerns into discrete microservices to isolate bottlenecks and enable independent scaling. The request flows through the following components:
- Client → Web Server / Reverse Proxy: Accepts HTTP requests (e.g.,
GET /api/v1/friend_search?person_id=123). - Search API: Validates input parameters and forwards requests to the graph service.
- User Graph Service: Orchestrates the BFS traversal algorithm and manages query state.
- Lookup Service: Maintains a directory mapping
person_idto the specific Person Server holding that user's data. - Person Server(s): Sharded key-value stores that persist user profiles and adjacency lists.
- Memory Cache (Redis/Memcached): Stores hot
Personobjects and partial BFS results to reduce storage lookup latency.
This architecture is documented in solutions/system_design/social_graph/README.md within the repository.
Data Model and Sharding Strategy
The Person Entity
The fundamental data unit is the Person object, which contains an immutable identifier, display name, and a list of friend IDs representing graph edges.
class Person(object):
def __init__(self, id, name, friend_ids):
self.id = id
self.name = name
self.friend_ids = friend_ids
Sharded Storage with Person Servers
To ensure the graph fits in memory across a cluster, the design employs range-based or hash-based sharding. Each PersonServer instance manages a subset of the user base in a simple key-value structure:
class PersonServer(object):
def __init__(self):
self.people = {} # id → Person
def add_person(self, person):
self.people[person.id] = person
def get_people(self, ids):
return [self.people[i] for i in ids if i in self.people]
Lookup Service for Shard Resolution
The LookupService acts as a directory that resolves any person_id to its responsible PersonServer. This adds a single extra hop to queries but enables the graph to scale horizontally across hundreds of nodes.
class LookupService(object):
def __init__(self):
self.lookup = self._init_lookup() # id → PersonServer
def lookup_person_server(self, person_id):
return self.lookup[person_id]
Implementing the Shortest Path Query Algorithm
Breadth-First Search (BFS) Traversal
Because friendship edges are unweighted, the shortest path between two users is found using BFS traversal. The UserGraphService implements this algorithm across distributed shards by fetching adjacency lists from PersonServer instances via the LookupService.
import collections
class UserGraphService(object):
def __init__(self, lookup_service):
self.lookup_service = lookup_service
def shortest_path(self, source_key, dest_key):
if source_key is None or dest_key is None:
return None
if source_key == dest_key:
return [source_key]
prev = {}
visited = {source_key}
queue = collections.deque([source_key])
while queue:
cur = queue.popleft()
if cur == dest_key:
# Reconstruct path by backtracking through predecessors
path = [dest_key]
while path[-1] != source_key:
path.append(prev[path[-1]])
return list(reversed(path))
# Fetch current person and their friends from sharded storage
cur_server = self.lookup_service.lookup_person_server(cur)
cur_person = cur_server.get_people([cur])[0]
for friend_id in cur_person.friend_ids:
if friend_id not in visited:
visited.add(friend_id)
prev[friend_id] = cur
queue.append(friend_id)
return None
Optimizing with Bidirectional BFS
For large graph depths, the design supports bidirectional BFS—simultaneously searching from both the source and destination users. This technique halves the number of explored nodes and significantly reduces memory consumption and query latency under heavy load.
Scaling the Design for Production Traffic
The solutions/system_design/social_graph/README.md file outlines specific strategies to address performance bottlenecks:
| Bottleneck | Mitigation Strategy |
|---|---|
| Lookup latency | Cache the person_id → server mapping in an in-memory store; batch lookups for multiple IDs. |
| Hot user data | Deploy Redis or Memcached to store frequently accessed Person objects and pre-computed BFS results. |
| Cross-network traffic | Co-locate users with geographic or social locality on the same shard; implement shard-by-location strategies. |
| Read-heavy load | Add read replicas of PersonServer instances and employ read-through caching patterns. |
| Deep BFS traversal | Implement bidirectional BFS to reduce explored node count by orders of magnitude. |
| Traffic spikes | Deploy auto-scaling groups behind load balancers with rate limiting and graceful degradation policies. |
Key Trade-offs in Social Graph Design
Complexity versus Performance: Introducing sharding, caching layers, and bidirectional search algorithms increases operational complexity but is necessary to achieve sub-second latency at scale.
Consistency Model: The architecture favors eventual consistency for read operations, serving data from cache. Write operations—such as new friendship creations—must update the underlying shard and invalidate relevant cache entries to maintain data accuracy.
Infrastructure Cost: Running hundreds of sharded PersonServer nodes with replica sets and dedicated cache clusters increases infrastructure overhead. However, this cost is required to meet strict availability and latency SLAs for global user bases.
Summary
- Shard the graph across
PersonServernodes using range or hash-based partitioning to handle billions of edges. - Implement a
LookupServiceto map user IDs to their respective shards, enabling distributed query execution. - Use BFS traversal for unweighted shortest-path queries, with bidirectional optimization for deep graph searches.
- Cache aggressively at the lookup layer and data layer to minimize cross-service latency.
- Design for horizontal scalability using microservices that can independently scale based on read/write patterns.
Frequently Asked Questions
Why use BFS instead of DFS for social network queries?
BFS guarantees the shortest path in unweighted graphs by exploring all nodes at the present depth before moving deeper. DFS could explore extremely long paths first, potentially missing shorter connections and wasting computational resources on large social graphs.
How do you handle hot spots or celebrity users in the sharded design?
Celebrity users with millions of followers are mitigated through aggressive caching in the memory layer (Redis/Memcached). Additionally, read replicas of the PersonServer holding high-profile users distribute query load, while the LookupService can implement special routing logic for known hot keys.
What is the purpose of the Lookup Service in this architecture?
The LookupService maintains a directory mapping person_id to the specific PersonServer shard containing that user's data. This indirection enables the graph to scale horizontally beyond single-machine memory limits while providing a consistent interface for the UserGraphService to locate distributed adjacency lists during BFS traversal.
How does bidirectional BFS improve query performance?
Bidirectional BFS runs two simultaneous searches—one from the source user and one from the destination—stopping when they intersect. This approach reduces the number of visited nodes from O(b^d) to O(b^(d/2)), where b is the branching factor and d is the path depth, dramatically decreasing memory usage and query latency for distant connections.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →