How Caching is Implemented in GraphRAG Agent for Performance Optimization
GraphRAG Agent employs a dual-layer caching strategy that combines an in-memory LRU/quality-aware request cache with a disk-based model file cache to eliminate redundant LLM calls and prevent repeated transformer downloads.
The 1517005260/graph-rag-agent repository implements sophisticated caching mechanisms to ensure high-throughput retrieval-augmented generation. By storing expensive query results in memory and pre-loading embedding models to local storage, the system minimizes latency for repeat requests while reducing network overhead during startup.
In-Memory Request Cache
The primary caching layer resides in server/utils/cache.py, where the CacheManager class provides a lightweight, thread-safe store for query results.
Cache Architecture and Configuration
The cache initializes with configurable bounds to prevent memory bloat. Each entry persists for a default TTL of 3600 seconds (1 hour) and the store caps at 1000 entries via the max_size parameter.
Cache keys follow a composite structure incorporating both the query string and an optional thread_id, formatted as thread_id:query. This design enables per-conversation isolation, ensuring that identical queries in different threads maintain separate cache entries.
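# server/utils/cache.py
from typing import Any, Dict

class CacheManager:
    def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
        # Maps each cache key to its entry (cached value plus metadata)
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.max_size = max_size        # maximum number of entries
        self.ttl_seconds = ttl_seconds  # entry lifetime in seconds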
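The composite key and TTL check described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names make_cache_key and is_expired are hypothetical, and the fallback to the bare query when no thread_id is given is an assumption.

```python
import time
from typing import Optional


def make_cache_key(query: str, thread_id: Optional[str] = None) -> str:
    """Build a `thread_id:query` key (hypothetical helper).

    Assumption: without a thread_id the bare query is used as the key.
    """
    return f"{thread_id}:{query}" if thread_id else query


def is_expired(created_at: float, ttl_seconds: int = 3600,
               now: Optional[float] = None) -> bool:
    """Return True once an entry is older than ttl_seconds."""
    now = time.time() if now is None else now
    return now - created_at > ttl_seconds


# Identical queries in different threads map to different keys.
print(make_cache_key("what is GraphRAG?", "thread-a"))
print(make_cache_key("what is GraphRAG?", "thread-b"))
# An entry created at t=0 is stale at t=3601 with the default one-hour TTL.
print(is_expired(created_at=0.0, ttl_seconds=3600, now=3601.0))
```

Passing now explicitly keeps the expiry check deterministic in tests; in normal use it defaults to the current time.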
Quality-Aware Eviction Policy
When the cache reaches capacity, the _evict_cache method removes entries based on a hybrid scoring system. The algorithm prioritizes quality scores (0.0 to 1.0) over recency, deleting the lowest-quality entry first. If quality scores are equivalent, the least-recently-used entry is evicted.
This quality metric allows the system to retain high-value responses—such as those validated by user feedback—while purging less reliable cached data.
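def _evict_cache(self) -> None:
    # Collect (key, last_access, quality) for every entry
    entries = [(k, v["last_access"], v["quality"]) for k, v in self.cache.items()]
    # Sort by quality first, then by last access: lowest quality, then oldest
    entries.sort(key=lambda x: (x[2], x[1]))
    del self.cache[entries[0][0]]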
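To see the ordering in action, the sketch below applies the same (quality, last_access) sort to a small hand-made cache. The function name pick_eviction_victim and the sample entries are made up for illustration; only the sort key comes from the snippet above.

```python
from typing import Any, Dict


def pick_eviction_victim(cache: Dict[str, Dict[str, Any]]) -> str:
    """Return the key that the (quality, last_access) ordering evicts first."""
    entries = [(k, v["last_access"], v["quality"]) for k, v in cache.items()]
    entries.sort(key=lambda x: (x[2], x[1]))  # lowest quality, then oldest access
    return entries[0][0]


# Hypothetical entries: two share the lowest quality, so recency breaks the tie.
cache = {
    "t1:a": {"quality": 0.9, "last_access": 100.0},
    "t1:b": {"quality": 0.2, "last_access": 300.0},
    "t1:c": {"quality": 0.2, "last_access": 200.0},
}

victim = pick_eviction_victim(cache)
print(victim)  # t1:c (tied on quality with t1:b, but accessed earlier)
```

Note that "t1:a" survives even though it has the oldest access time, because quality is compared before recency.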
Global Singleton Pattern
A module-level singleton instance ensures cache consistency across the FastAPI service. The line cache_manager = CacheManager() creates a single shared state that is imported wherever caching is required, preventing cache fragmentation across different modules.
Model File Cache
The secondary caching layer addresses the cold-start cost of embedding models by persisting transformer files to disk. Implemented in graphrag_agent/cache_manager/model_cache.py, it is designed so that each embedding model is downloaded once and then reused from local storage.
Pre-loading Transformer Models
The preload_sentence_transformer_models function iterates over configured model names and instantiates each SentenceTransformer with a dedicated cache_folder. This forces the underlying HuggingFace libraries to write model weights to MODEL_CACHE_DIR rather than temporary storage.
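# graphrag_agent/cache_manager/model_cache.py
from typing import List, Optional

def preload_sentence_transformer_models(models: Optional[List[str]] = None) -> None:
    from sentence_transformers import SentenceTransformer
    cache_dir = ensure_model_cache_dir()
    # Guard against the None default; this excerpt does not show how
    # the configured model names are resolved when none are passed in
    for model_name in models or []:
        # Instantiating the model downloads its weights into cache_dir if absent
        _ = SentenceTransformer(model_name, cache_folder=cache_dir)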
The system distinguishes between provider types: OpenAI embeddings bypass local caching, while SentenceTransformer models trigger the pre-loading routine.
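The provider check and the cache directory setup might look roughly like the sketch below. Everything here is an assumption for illustration: the function names ensure_cache_dir and models_to_preload, the provider strings, and the model names are hypothetical and are not taken from the repository.

```python
import tempfile
from pathlib import Path
from typing import List


def ensure_cache_dir(path: Path) -> Path:
    """Create the cache directory if it does not exist yet and return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path


def models_to_preload(provider: str, model_names: List[str]) -> List[str]:
    """Select the models that need a local download (hypothetical helper).

    Assumption: hosted OpenAI embeddings have no local weights, so nothing
    is pre-loaded for them; any other provider is treated as local.
    """
    if provider.strip().lower() == "openai":
        return []
    return list(model_names)


# A temporary directory stands in for MODEL_CACHE_DIR in this sketch.
demo_dir = ensure_cache_dir(Path(tempfile.mkdtemp()) / "models")
print(demo_dir.is_dir())
print(models_to_preload("openai", ["text-embedding-3-small"]))
print(models_to_preload("sentence_transformers", ["all-MiniLM-L6-v2"]))
```

The returned list would then be handed to a pre-loading routine such as preload_sentence_transformer_models.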
Initialization at Startup
The initialize_model_cache function executes during the FastAPI startup event, ensuring all heavy models reside on local storage before the first request arrives. This eliminates download latency during active user sessions and prevents timeout errors when multiple concurrent requests attempt to access uncached models simultaneously.
# In the FastAPI startup event
from graphrag_agent.cache_manager.model_cache import initialize_model_cache

@app.on_event("startup")
def on_startup():
    # Pre-download SentenceTransformer models and create cache dir
    initialize_model_cache()
Implementation Examples
Caching Query Results
To leverage the in-memory cache, import the singleton cache_manager and wrap expensive operations:
from typing import Optional

from server.utils.cache import cache_manager

def get_answer(query: str, thread_id: Optional[str] = None):
    # Try cached result first
    cached = cache_manager.get(query, thread_id)
    if cached is not None:
        return cached
    # Expensive operation (e.g., LLM call)
    answer = call_llm(query)
    # Store with a quality rating (0-1)
    cache_manager.set(query, answer, thread_id, quality=0.9)
    return answer
Updating Cache Quality
After receiving user feedback, adjust the quality score to influence retention:
from typing import Optional

def user_feedback(query: str, rating: float, thread_id: Optional[str] = None):
    # rating is a float between 0.0 and 1.0
    cache_manager.update_quality(query, rating, thread_id)
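The get, set, and update_quality calls used above imply an interface along the following lines. This is a self-contained sketch under stated assumptions, not the repository's implementation: the class name SketchCacheManager, the entry field names, the default quality of 0.5, and the clamping of ratings are all hypothetical, and locking is omitted for brevity.

```python
import time
from typing import Any, Dict, Optional


class SketchCacheManager:
    """Hypothetical stand-in for CacheManager (no locking, illustration only)."""

    def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds

    def _key(self, query: str, thread_id: Optional[str]) -> str:
        return f"{thread_id}:{query}" if thread_id else query

    def get(self, query: str, thread_id: Optional[str] = None) -> Optional[Any]:
        key = self._key(query, thread_id)
        entry = self.cache.get(key)
        if entry is None:
            return None
        if time.time() - entry["created"] > self.ttl_seconds:
            del self.cache[key]  # drop expired entries lazily on read
            return None
        entry["last_access"] = time.time()
        return entry["value"]

    def set(self, query: str, value: Any, thread_id: Optional[str] = None,
            quality: float = 0.5) -> None:
        key = self._key(query, thread_id)
        # Make room only when adding a new key to a full cache
        if key not in self.cache and len(self.cache) >= self.max_size:
            self._evict_cache()
        now = time.time()
        self.cache[key] = {"value": value, "created": now,
                           "last_access": now, "quality": quality}

    def update_quality(self, query: str, quality: float,
                       thread_id: Optional[str] = None) -> bool:
        entry = self.cache.get(self._key(query, thread_id))
        if entry is None:
            return False
        entry["quality"] = max(0.0, min(1.0, quality))  # clamp to [0, 1]
        return True

    def _evict_cache(self) -> None:
        # Lowest quality goes first; ties fall back to least recently used
        victim = min(self.cache, key=lambda k: (self.cache[k]["quality"],
                                                self.cache[k]["last_access"]))
        del self.cache[victim]


# A custom instance with a tiny capacity makes the eviction visible.
cache = SketchCacheManager(max_size=2, ttl_seconds=600)
cache.set("q1", "a1", "t", quality=0.9)
cache.set("q2", "a2", "t", quality=0.3)
cache.update_quality("q2", 1.0, "t")     # positive feedback raises q2 above q1
cache.set("q3", "a3", "t", quality=0.5)  # cache is full, so q1 is evicted
print(cache.get("q1", "t"), cache.get("q2", "t"), cache.get("q3", "t"))
```

Without the update_quality call, q2 would have been the lowest-quality entry and would have been evicted instead of q1, which is how feedback influences retention.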
Summary
- Dual-layer architecture: Combines in-memory request caching (server/utils/cache.py) with disk-based model caching (graphrag_agent/cache_manager/model_cache.py).
- Quality-aware eviction: The CacheManager prioritizes high-quality entries during cleanup, ensuring valuable responses persist longer than low-quality ones.
- Conversation isolation: Cache keys incorporate thread_id to prevent cross-contamination between different user sessions.
- Startup optimization: Model pre-loading occurs during FastAPI initialization, preventing download delays during active request processing.
- Configurable limits: Default settings cap the memory cache at 1000 entries with a 1-hour TTL, balancing performance against resource consumption.
Frequently Asked Questions
How does the cache key handle concurrent conversations?
The cache key format thread_id:query ensures per-conversation isolation. When thread_id is provided, identical queries from different threads generate distinct cache entries, preventing users from accessing cached responses intended for other sessions.
What happens when the in-memory cache reaches its size limit?
When the cache exceeds max_size (default 1000), the _evict_cache method triggers a sorted deletion. It calculates a composite key of (quality_score, last_access_time), removes the entry with the lowest quality first, and falls back to LRU (Least Recently Used) ordering only when quality scores are tied.
Why does GraphRAG Agent use a disk cache for models instead of relying on HuggingFace's default cache?
The custom MODEL_CACHE_DIR implementation in graphrag_agent/cache_manager/model_cache.py ensures deterministic storage locations and explicit pre-loading during application startup. This prevents race conditions where multiple concurrent requests might simultaneously trigger redundant downloads, and it guarantees that models are available offline after the first successful download.
Can the TTL and cache size be customized without modifying source code?
Yes. The CacheManager accepts max_size and ttl_seconds parameters during instantiation. While the global singleton uses defaults (1000 entries, 3600 seconds), you can instantiate a custom CacheManager with alternate values for specific components, or modify the module-level initialization in server/utils/cache.py to adjust global defaults.