# How Caching is Implemented in GraphRAG Agent for Performance Optimization

> Discover how GraphRAG Agent optimizes performance with a dual-layer caching strategy, reducing LLM calls and transformer downloads for faster results.

- Repository: [GLK/graph-rag-agent](https://github.com/1517005260/graph-rag-agent)
- Tags: performance
- Published: 2026-02-23

---

**GraphRAG Agent employs a dual-layer caching strategy that combines an in-memory LRU/quality-aware request cache with a disk-based model file cache to eliminate redundant LLM calls and prevent repeated transformer downloads.**

The `1517005260/graph-rag-agent` repository caches at two levels to keep retrieval-augmented generation responsive. Storing expensive query results in memory cuts latency for repeat requests, and persisting embedding models to local storage avoids re-downloading them on later startups.

## In-Memory Request Cache

The primary caching layer resides in **[`server/utils/cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/utils/cache.py)**, where the `CacheManager` class provides a lightweight, thread-safe store for query results.

### Cache Architecture and Configuration

The cache initializes with configurable bounds to prevent memory bloat. Each entry persists for a default **TTL of 3600 seconds (1 hour)** and the store caps at **1000 entries** via the `max_size` parameter.

Cache keys follow a composite structure incorporating both the query string and an optional `thread_id`, formatted as `thread_id:query`. This design enables per-conversation isolation, ensuring that identical queries in different threads maintain separate cache entries.
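The key format described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the helper name `make_cache_key` and the fallback to the bare query when no `thread_id` is given are assumptions.

```python
from typing import Optional


def make_cache_key(query: str, thread_id: Optional[str] = None) -> str:
    """Build a `thread_id:query` key (hypothetical helper)."""
    # Assumption: queries without a thread share one un-prefixed namespace.
    if thread_id:
        return f"{thread_id}:{query}"
    return query


# The same query in two threads yields two distinct keys.
key_a = make_cache_key("What is GraphRAG?", "thread-1")
key_b = make_cache_key("What is GraphRAG?", "thread-2")
```

Because the thread identifier is part of the key, a lookup from one conversation cannot hit an entry stored by another.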

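```python
# server/utils/cache.py
from typing import Any, Dict


class CacheManager:
    def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
        # Maps each cache key to its entry (value plus bookkeeping metadata)
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.max_size = max_size        # maximum number of entries
        self.ttl_seconds = ttl_seconds  # entry lifetime in seconds
```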
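To make the TTL bound concrete, here is one way an expiry check could look. It is a sketch under stated assumptions: the function `is_expired` and the `created_at` field are hypothetical and may not match how the repository tracks entry age.

```python
import time
from typing import Any, Dict, Optional


def is_expired(entry: Dict[str, Any], ttl_seconds: int, now: Optional[float] = None) -> bool:
    """Return True when the entry is older than the TTL (hypothetical helper)."""
    # Assumption: each entry records its creation time under "created_at".
    current = time.time() if now is None else now
    return current - entry["created_at"] > ttl_seconds


entry = {"value": "cached answer", "created_at": 1_000.0}
fresh = is_expired(entry, ttl_seconds=3600, now=1_000.0 + 3599)  # still within the hour
stale = is_expired(entry, ttl_seconds=3600, now=1_000.0 + 3601)  # just past the hour
```

Passing `now` explicitly keeps the check deterministic in tests; production code would rely on the `time.time()` default.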

### Quality-Aware Eviction Policy

When the cache reaches capacity, the `_evict_cache` method removes entries based on a hybrid scoring system. The algorithm prioritizes **quality scores** (0.0 to 1.0) over recency, deleting the lowest-quality entry first. If quality scores are equivalent, the least-recently-used entry is evicted.

This quality metric lets the system retain high-value responses, such as those validated by user feedback, while purging less reliable cached data.

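```python
def _evict_cache(self) -> None:
    # Collect (key, last_access, quality) for every entry
    entries = [(k, v["last_access"], v["quality"]) for k, v in self.cache.items()]
    # Sort by quality first, then by last access: lowest quality, then oldest
    entries.sort(key=lambda x: (x[2], x[1]))
    # Remove the single worst-ranked entry
    del self.cache[entries[0][0]]
```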
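A small worked example shows the ordering in action. The cache contents and the `pick_victim` helper are made up for illustration; only the `(quality, last_access)` sort key comes from the method above.

```python
from typing import Any, Dict

cache: Dict[str, Dict[str, Any]] = {
    "t1:alpha": {"quality": 0.9, "last_access": 100.0},
    "t1:beta": {"quality": 0.2, "last_access": 300.0},   # lowest quality, most recent
    "t2:gamma": {"quality": 0.9, "last_access": 50.0},   # ties with alpha, but older
}


def pick_victim(store: Dict[str, Dict[str, Any]]) -> str:
    """Return the key with the smallest (quality, last_access) pair."""
    return min(store, key=lambda k: (store[k]["quality"], store[k]["last_access"]))


first = pick_victim(cache)   # "t1:beta": quality outranks recency
del cache[first]
second = pick_victim(cache)  # "t2:gamma": equal quality, so the older entry goes
```

Note that the most recently used entry is evicted first here because its quality is lowest, which is exactly where this policy departs from plain LRU.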

### Global Singleton Pattern

A module-level singleton instance ensures cache consistency across the FastAPI service. The line `cache_manager = CacheManager()` creates a single shared state that is imported wherever caching is required, preventing cache fragmentation across different modules.

## Model File Cache

The secondary caching layer addresses the cold-start cost of embedding models by persisting transformer files to disk. Implemented in **[`graphrag_agent/cache_manager/model_cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/cache_manager/model_cache.py)**, it ensures that each embedding model is downloaded only once for as long as the cache directory persists.

### Pre-loading Transformer Models

The `preload_sentence_transformer_models` function iterates over configured model names and instantiates each `SentenceTransformer` with a dedicated `cache_folder`. This directs the underlying Hugging Face libraries to write model weights to `MODEL_CACHE_DIR` rather than to their default cache location.

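```python
# graphrag_agent/cache_manager/model_cache.py
from typing import List, Optional


def preload_sentence_transformer_models(models: Optional[List[str]] = None) -> None:
    # Imported lazily so the dependency is only loaded when pre-loading runs
    from sentence_transformers import SentenceTransformer

    cache_dir = ensure_model_cache_dir()  # helper not shown in this excerpt
    # Guard against the default of None, which cannot be iterated
    for model_name in models or []:
        # Instantiating the model downloads its files into cache_dir if missing
        SentenceTransformer(model_name, cache_folder=cache_dir)
```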

The system distinguishes between provider types: OpenAI embeddings bypass local caching, while SentenceTransformer models trigger the pre-loading routine.
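The provider check might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function `maybe_preload`, the provider strings, and the injected `preload` callback are not taken from the repository, and the model names are only sample values.

```python
from typing import Callable, List

# Assumption: providers are identified by plain strings like these.
LOCAL_PROVIDERS = {"sentence_transformer"}


def maybe_preload(provider: str, models: List[str], preload: Callable[[List[str]], None]) -> bool:
    """Run `preload` only for providers whose weights live on local disk."""
    if provider.lower() not in LOCAL_PROVIDERS:
        return False  # hosted API such as OpenAI: nothing to download
    preload(models)
    return True


# A list stands in for the real pre-loading routine so the sketch runs offline.
downloaded: List[str] = []
skipped = maybe_preload("openai", ["text-embedding-3-small"], downloaded.extend)
loaded = maybe_preload("sentence_transformer", ["all-MiniLM-L6-v2"], downloaded.extend)
```

Injecting the pre-loading routine as a callback keeps the dispatch logic testable without network access.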

### Initialization at Startup

The `initialize_model_cache` function executes during the FastAPI startup event, ensuring all heavy models reside on local storage before the first request arrives. This eliminates download latency during active user sessions and prevents timeout errors when multiple concurrent requests attempt to access uncached models simultaneously.

```python
# In the FastAPI application module
from fastapi import FastAPI

from graphrag_agent.cache_manager.model_cache import initialize_model_cache

app = FastAPI()  # stands in for the service's existing application instance


@app.on_event("startup")
def on_startup() -> None:
    # Pre-download SentenceTransformer models and create the cache dir
    initialize_model_cache()
```

## Implementation Examples

### Caching Query Results

To leverage the in-memory cache, import the singleton `cache_manager` and wrap expensive operations:

```python
from typing import Optional

from server.utils.cache import cache_manager


def get_answer(query: str, thread_id: Optional[str] = None):
    # Try the cached result first
    cached = cache_manager.get(query, thread_id)
    if cached is not None:
        return cached

    # Expensive operation (e.g., an LLM call); call_llm is a placeholder
    answer = call_llm(query)

    # Store with a quality rating (0-1)
    cache_manager.set(query, answer, thread_id, quality=0.9)
    return answer
```

### Updating Cache Quality

After receiving user feedback, adjust the quality score to influence retention:

```python
from typing import Optional

from server.utils.cache import cache_manager


def user_feedback(query: str, rating: float, thread_id: Optional[str] = None) -> None:
    # rating is a float between 0.0 and 1.0
    cache_manager.update_quality(query, rating, thread_id)
```
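Feedback often arrives on a different scale than the 0.0 to 1.0 range the cache expects. The helper below is a hypothetical example of mapping a star rating onto that range before calling `update_quality`; the name `normalize_rating` and the five-star default are assumptions, not part of the repository.

```python
def normalize_rating(stars: int, max_stars: int = 5) -> float:
    """Map a 1..max_stars rating onto [0.0, 1.0], clamping out-of-range input."""
    if max_stars < 2:
        raise ValueError("max_stars must be at least 2")
    clamped = min(max(stars, 1), max_stars)
    return (clamped - 1) / (max_stars - 1)


lowest = normalize_rating(1)    # 0.0
middle = normalize_rating(3)    # 0.5
highest = normalize_rating(9)   # clamped to 5 stars, so 1.0
```

Clamping keeps a malformed rating from pushing an entry outside the range the eviction policy sorts on.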

## Summary

- **Dual-layer architecture**: Combines in-memory request caching ([`server/utils/cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/utils/cache.py)) with disk-based model caching ([`graphrag_agent/cache_manager/model_cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/cache_manager/model_cache.py)).
- **Quality-aware eviction**: The `CacheManager` prioritizes high-quality entries during cleanup, ensuring valuable responses persist longer than low-quality ones.
- **Conversation isolation**: Cache keys incorporate `thread_id` to prevent cross-contamination between different user sessions.
- **Startup optimization**: Model pre-loading occurs during FastAPI initialization, preventing download delays during active request processing.
- **Configurable limits**: Default settings cap the memory cache at 1000 entries with a 1-hour TTL, balancing performance against resource consumption.

## Frequently Asked Questions

### How does the cache key handle concurrent conversations?

The cache key format `thread_id:query` ensures per-conversation isolation. When `thread_id` is provided, identical queries from different threads generate distinct cache entries, preventing users from accessing cached responses intended for other sessions.

### What happens when the in-memory cache reaches its size limit?

When the cache reaches its `max_size` limit (default 1000), the `_evict_cache` method sorts entries by the composite key `(quality_score, last_access_time)` and deletes the first one. The lowest-quality entry goes first, and LRU (Least Recently Used) ordering applies only when quality scores are tied.

### Why does GraphRAG Agent use a disk cache for models instead of relying on HuggingFace's default cache?

The custom `MODEL_CACHE_DIR` implementation in [`graphrag_agent/cache_manager/model_cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/cache_manager/model_cache.py) ensures deterministic storage locations and explicit pre-loading during application startup. This prevents race conditions where multiple concurrent requests might simultaneously trigger redundant downloads, and it guarantees that models are available offline after the first successful download.

### Can the TTL and cache size be customized without modifying source code?

Partly. The `CacheManager` accepts `max_size` and `ttl_seconds` parameters during instantiation, so your own code can create a custom `CacheManager` with alternate values for specific components without touching the repository. The global singleton, however, uses the defaults (1000 entries, 3600 seconds); changing those requires editing the module-level initialization in [`server/utils/cache.py`](https://github.com/1517005260/graph-rag-agent/blob/main/server/utils/cache.py).
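One way to make the bounds configurable at deploy time is to read them from environment variables and pass them to the constructor. The sketch below assumes this approach; the variable names `CACHE_MAX_SIZE` and `CACHE_TTL_SECONDS` and the helper `cache_settings_from_env` are illustrative and do not exist in the repository.

```python
import os
from typing import Dict, Mapping


def cache_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read cache bounds from the environment, falling back to the documented defaults."""
    # Assumption: these variable names are illustrative only.
    return {
        "max_size": int(env.get("CACHE_MAX_SIZE", "1000")),
        "ttl_seconds": int(env.get("CACHE_TTL_SECONDS", "3600")),
    }


# Only the TTL is overridden here, so max_size keeps its default.
settings = cache_settings_from_env({"CACHE_TTL_SECONDS": "600"})
# The result could then be unpacked into the constructor: CacheManager(**settings)
```

The dictionary keys match the constructor's parameter names, so the settings can be unpacked directly into `CacheManager`.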