How to Configure Semantic Caching with the SemanticCacheLayer in Headroom
To configure semantic caching in Headroom, wrap your provider optimizer with SemanticCacheLayer from headroom/cache/semantic.py, configure a similarity_threshold (default 0.95), and call store_response() after each LLM request to populate the cache for future similar queries.
The Headroom library provides a pluggable Semantic Cache Layer that intercepts LLM requests and returns cached responses for semantically similar queries, reducing token costs and latency. By wrapping a provider-specific optimizer in the SemanticCacheLayer class defined in headroom/cache/semantic.py, you can add semantic caching with minimal configuration overhead.
How the Semantic Cache Layer Works
The SemanticCacheLayer operates in three distinct phases when processing a request:
- Extract the user query from the message list, specifically targeting the last entry with role: "user".
- Compute a cache key using SHA-256 hashing of the entire message payload to enable an optional exact-match check.
- Search the in-memory semantic cache by embedding the query via a configurable embedding_fn, then scanning stored embeddings to compute cosine similarity against the configured similarity_threshold.
If an exact hash match exists, the layer returns the cached response immediately. Otherwise, a stored entry whose cosine similarity meets the threshold counts as a semantic hit. If neither an exact nor a semantic match is found, the request is delegated to the underlying provider optimizer. After receiving the LLM response, you must explicitly call store_response() to insert the new entry into the cache for future hits.
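The lookup order described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Headroom's implementation: the function names (extract_user_query, cache_key, cosine_similarity, lookup), the JSON serialization used for hashing, and the two store arguments are all hypothetical.

```python
import hashlib
import json
import math


def extract_user_query(messages):
    """Return the content of the last message with role "user", or None."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content")
    return None


def cache_key(messages):
    """SHA-256 hex digest of the whole payload (serialization is an assumption)."""
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(messages, exact_store, semantic_store, embedding_fn,
           similarity_threshold=0.95):
    """Return (response, kind), where kind is "exact", "semantic", or "miss".

    exact_store: dict mapping cache key -> response (hypothetical).
    semantic_store: list of (embedding, response) pairs (hypothetical).
    """
    # Phase 1: the exact hash check comes first because it is cheap.
    key = cache_key(messages)
    if key in exact_store:
        return exact_store[key], "exact"

    # Phase 2: embed the last user message and scan the stored embeddings.
    query = extract_user_query(messages)
    if query is None:
        return None, "miss"
    query_embedding = embedding_fn(query)
    best_score, best_response = -1.0, None
    for stored_embedding, response in semantic_store:
        score = cosine_similarity(query_embedding, stored_embedding)
        if score > best_score:
            best_score, best_response = score, response

    # Phase 3: accept the best candidate only if it meets the threshold.
    if best_score >= similarity_threshold:
        return best_response, "semantic"
    return None, "miss"
```

A linear scan like this costs one similarity computation per stored entry, which is reasonable for an in-memory cache capped at a few thousand entries.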
SemanticCacheLayer Configuration Parameters
All configuration options reside in the SemanticCacheConfig dataclass within headroom/cache/semantic.py. The cache implementation uses an LRU store based on OrderedDict that respects maximum size limits and optional TTL expiration.
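To make the LRU and TTL behavior concrete, here is a small sketch of an OrderedDict-based store. It illustrates the general technique only; the class name LRUTTLStore, its methods, and the injectable clock are assumptions and do not describe Headroom's internal class.

```python
import time
from collections import OrderedDict


class LRUTTLStore:
    """Hypothetical LRU store with optional TTL, built on OrderedDict."""

    def __init__(self, max_entries=1000, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds  # 0 disables expiry
        self._clock = clock             # injectable so tests can control time
        self._entries = OrderedDict()   # key -> (stored_at, value)
        self.evictions = 0

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        # Drop the entry lazily if it has outlived its TTL.
        if self.ttl_seconds > 0 and self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]
            return None
        # A successful read marks the entry as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        # Evict from the least recently used end until within the size limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
            self.evictions += 1

    def __len__(self):
        return len(self._entries)
```

Expiry here is lazy: an expired entry is removed only when it is next read, which keeps writes cheap at the cost of stale entries occupying slots until they are touched or evicted.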
| Parameter | Description | Default |
|---|---|---|
| similarity_threshold | Cosine similarity required for a cache hit (range 0.0–1.0). | 0.95 |
| max_entries | Maximum number of entries to retain in the LRU cache. | 1000 |
| ttl_seconds | Time-to-live for entries; set to 0 to disable expiry. | 300 (5 minutes) |
| embedding_model | Sentence-transformer model name when no custom function is provided. | sentence-transformers/all-MiniLM-L6-v2 |
| use_exact_matching | Whether to check SHA-256 hash matches before semantic search. | True |
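For orientation, a dataclass carrying these five parameters and their documented defaults might look roughly like the sketch below. The class name SemanticCacheConfigSketch and the validation in __post_init__ are assumptions added for illustration; the actual SemanticCacheConfig in headroom/cache/semantic.py may differ in field types and checks.

```python
from dataclasses import dataclass


@dataclass
class SemanticCacheConfigSketch:
    """Illustrative stand-in mirroring the documented parameters and defaults."""

    similarity_threshold: float = 0.95
    max_entries: int = 1000
    ttl_seconds: int = 300  # 0 disables expiry
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    use_exact_matching: bool = True

    def __post_init__(self):
        # Validation is an assumption, added to catch obvious misconfiguration.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be within 0.0 and 1.0")
        if self.max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        if self.ttl_seconds < 0:
            raise ValueError("ttl_seconds must be 0 (disabled) or positive")
```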
Setting Up Semantic Caching
Basic Configuration with Default Embeddings
The simplest setup uses the default sentence-transformer model for embeddings. Instantiate SemanticCacheLayer by wrapping an existing provider optimizer from the CacheOptimizerRegistry:
from headroom.cache import SemanticCacheLayer, CacheOptimizerRegistry
# Retrieve the provider-specific optimizer (e.g., Anthropic)
provider_opt = CacheOptimizerRegistry.get("anthropic")
# Wrap with semantic caching layer
semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.95,  # Trigger hit when similarity >= 95%
    max_entries=2000,
    ttl_seconds=600,  # Retain entries for 10 minutes
) # Source: lines 33-40 of headroom/cache/semantic.py
Using a Custom Embedding Function
For production environments requiring specific embedding models, supply a custom embedding_fn that converts text to a list of floats:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
def embed(text: str) -> list[float]:
    # Convert numpy array to plain Python list
    return embedder.encode(text).tolist()

semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.90,
    embedding_fn=embed,
)
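When writing unit tests, it can help to avoid loading a real model. The sketch below is a deterministic, dependency-free function that satisfies the same contract described above (a string in, a list of floats out). The name hashed_bow_embedding and the 64-dimension default are hypothetical, and the vectors capture word overlap only, not meaning, so this is unsuitable for production matching.

```python
import hashlib
import math
import re


def hashed_bow_embedding(text, dimensions=64):
    """Deterministic bag-of-words vector for tests; not a semantic embedding."""
    vector = [0.0] * dimensions
    # Lowercase and split on non-alphanumerics so punctuation is ignored.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # Map each token to a fixed bucket using the first 4 digest bytes.
        index = int.from_bytes(digest[:4], "big") % dimensions
        vector[index] += 1.0
    # L2-normalize so the dot product of two vectors is their cosine similarity.
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return vector
    return [v / norm for v in vector]
```

Because the output depends only on the input text, two test runs always produce the same vectors, which makes threshold-related assertions reproducible.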
Processing Requests and Storing Responses
Intercepting LLM Calls
Use the process() method to check the cache before invoking the LLM. The method returns a CacheResult object containing semantic_cache_hit status and either cached_response or the fresh response:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the difference between LRU and FIFO."},
]
# OptimizationContext carries the original query; can be None
result = semantic_opt.process(messages, context=None)
if result.semantic_cache_hit:
    # Cache hit - no LLM call performed
    answer = result.cached_response
else:
    # Cache miss - provider optimizer performed the LLM request
    answer = result.response
    # Store the fresh response for future semantic matches
    semantic_opt.store_response(messages, answer)
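The check-then-store pattern above is easy to forget at individual call sites, so one option is to fold it into a helper. The sketch below relies only on the method and attribute names shown in this article (process, store_response, semantic_cache_hit, cached_response, response). The helper name answer_with_cache and the StubLayer class are hypothetical; the stub exists only so the helper can be exercised without Headroom or a network call.

```python
from types import SimpleNamespace


def answer_with_cache(layer, messages, context=None):
    """Return (answer, was_cache_hit), storing fresh answers on a miss."""
    result = layer.process(messages, context=context)
    if result.semantic_cache_hit:
        return result.cached_response, True
    answer = result.response
    # Only fresh responses are stored; cached ones are already in the cache.
    layer.store_response(messages, answer)
    return answer, False


class StubLayer:
    """Offline stand-in exposing the same method names (exact match only)."""

    def __init__(self):
        self.stored = {}
        self.llm_calls = 0

    def process(self, messages, context=None):
        key = repr(messages)
        if key in self.stored:
            return SimpleNamespace(semantic_cache_hit=True,
                                   cached_response=self.stored[key],
                                   response=None)
        self.llm_calls += 1  # stands in for the delegated LLM request
        return SimpleNamespace(semantic_cache_hit=False,
                               cached_response=None,
                               response="fresh answer")

    def store_response(self, messages, response):
        self.stored[repr(messages)] = response
```

Calling answer_with_cache(StubLayer(), messages) twice with the same messages exercises both branches: the first call is a miss that stores the answer, and the second is served from the stub's store.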
Monitoring Cache Performance
The get_stats() method exposes real-time metrics including hit rate, eviction count, and current entry count:
stats = semantic_opt.get_stats()
print("Semantic cache entries:", stats["semantic_cache"]["entries"])
print("Hit rate:", stats["semantic_cache"]["hit_rate"])
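For periodic logging, the nested dictionary can be flattened into a single line. This sketch assumes the keys named in this article (entries, hit_rate, and evictions under semantic_cache) and further assumes that hit_rate is a fraction between 0 and 1. The function name summarize_cache_stats and the min_hit_rate warning level are hypothetical.

```python
def summarize_cache_stats(stats, min_hit_rate=0.2):
    """Format a get_stats()-style dict as one log line, flagging low hit rates."""
    cache = stats.get("semantic_cache", {})
    # Missing keys fall back to zero so a partial dict never raises.
    hit_rate = float(cache.get("hit_rate", 0.0))
    entries = int(cache.get("entries", 0))
    evictions = int(cache.get("evictions", 0))
    line = f"entries={entries} hit_rate={hit_rate:.1%} evictions={evictions}"
    if hit_rate < min_hit_rate:
        line += " (low hit rate)"
    return line
```

A persistently low hit rate alongside a high eviction count suggests raising max_entries, while a low hit rate with few evictions points toward the similarity_threshold or the TTL instead.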
Summary
- Wrap any optimizer: The SemanticCacheLayer in headroom/cache/semantic.py wraps provider-specific optimizers to add semantic caching capabilities.
- Configure similarity: Set similarity_threshold between 0.9 and 0.99 to balance cache hit rate against response accuracy.
- Explicit storage: Unlike traditional caches, you must call store_response() after successful LLM calls to populate the semantic cache.
- LRU eviction: The cache automatically evicts the least recently used entries when max_entries is exceeded and respects ttl_seconds for expiration.
- Dual matching: Enable use_exact_matching for SHA-256 hash checks before expensive cosine similarity computations.
Frequently Asked Questions
What is the default similarity threshold in SemanticCacheLayer?
The default similarity_threshold is 0.95, meaning a cache hit requires a cosine similarity of at least 0.95 between the embedded query and a stored embedding. Lower the value to accept looser paraphrases, or raise it to require closer matches.
How does Headroom handle exact matches versus semantic similarity?
When use_exact_matching is enabled (the default), Headroom first computes a SHA-256 hash of the message payload and checks for an exact match before performing semantic embedding comparisons. This provides immediate returns for identical queries while allowing paraphrased questions to hit the cache via cosine similarity.
Can I use a custom embedding model with SemanticCacheLayer?
Yes. Provide a custom embedding_fn parameter that accepts a string and returns a list of floats. If you omit this parameter, Headroom defaults to using sentence-transformers/all-MiniLM-L6-v2 via the internal embedding configuration defined in headroom/models/config.py.
How do I monitor cache hit rates in Headroom?
Call the get_stats() method on your SemanticCacheLayer instance to retrieve a dictionary containing hit_rate, entries, and evictions under the semantic_cache key. This data helps you tune similarity_threshold and max_entries for your specific traffic patterns.