How to Configure Semantic Caching with the SemanticCacheLayer in Headroom

To configure semantic caching in Headroom, wrap your provider optimizer with SemanticCacheLayer from headroom/cache/semantic.py, set a similarity_threshold (default 0.95), and call store_response() after each LLM request so that future similar queries can be served from the cache.

The Headroom library provides a pluggable Semantic Cache Layer that intercepts LLM requests and returns cached responses for semantically similar queries, reducing token costs and latency. By using the SemanticCacheLayer class defined in headroom/cache/semantic.py, you can add semantic caching to any provider-specific optimizer with minimal configuration overhead.

How the Semantic Cache Layer Works

The SemanticCacheLayer operates in three distinct phases when processing a request:

  1. Extract the user query from the message list, specifically targeting the last entry with role: "user".
  2. Compute a cache key using SHA-256 hashing of the entire message payload to enable an optional exact-match fallback.
  3. Search the in-memory semantic cache by embedding the query via a configurable embedding_fn, then scanning stored embeddings to compute cosine similarity against the configured similarity_threshold.

If an exact hash match exists, the layer returns the cached response immediately. Otherwise, if a stored embedding meets the similarity_threshold, the layer returns the response cached for that entry. If neither an exact nor a semantic match is found, the request is delegated to the underlying provider optimizer. After receiving the LLM response, you must explicitly call store_response() to insert the new entry into the cache for future hits.
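The lookup flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Headroom's actual implementation: the helper names (last_user_query, payload_hash, cosine_similarity, lookup) and the layout of the entries dictionary are hypothetical, and the canonical JSON dump used for hashing is only one plausible way to serialize the message payload.

```python
import hashlib
import json
import math


def last_user_query(messages):
    """Return the content of the last message with role 'user', or None."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content")
    return None


def payload_hash(messages):
    """SHA-256 over a canonical JSON dump of the whole message list (assumed format)."""
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(messages, entries, embedding_fn, similarity_threshold=0.95):
    """Return a cached response or None.

    `entries` maps payload hash -> (embedding, response); this layout is hypothetical.
    """
    key = payload_hash(messages)
    if key in entries:
        # Exact-match fast path: identical payload, no embedding needed.
        return entries[key][1]

    query = last_user_query(messages)
    if query is None:
        return None

    query_embedding = embedding_fn(query)
    best_score, best_response = 0.0, None
    for embedding, response in entries.values():
        # Linear scan over stored embeddings, keeping the best match.
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= similarity_threshold else None
```

A None return value corresponds to a cache miss, at which point the request would go to the provider optimizer and the caller would store the fresh response.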

SemanticCacheLayer Configuration Parameters

All configuration options reside in the SemanticCacheConfig dataclass within headroom/cache/semantic.py. The cache implementation uses an LRU store based on OrderedDict that respects maximum size limits and optional TTL expiration.

| Parameter | Description | Default |
| --- | --- | --- |
| similarity_threshold | Cosine similarity required for a cache hit (range 0.0–1.0). | 0.95 |
| max_entries | Maximum number of entries to retain in the LRU cache. | 1000 |
| ttl_seconds | Time-to-live for entries; set to 0 to disable expiry. | 300 (5 minutes) |
| embedding_model | Sentence-transformer model name when no custom function is provided. | sentence-transformers/all-MiniLM-L6-v2 |
| use_exact_matching | Whether to check SHA-256 hash matches before semantic search. | True |
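To make the interaction between max_entries and ttl_seconds concrete, here is a small sketch of an OrderedDict-based LRU store with optional expiry. The class name LruTtlStore, its methods, and the injectable clock are assumptions made for illustration; Headroom's internal store may be organized differently.

```python
import time
from collections import OrderedDict


class LruTtlStore:
    """Hypothetical LRU store with optional TTL; not Headroom's actual class."""

    def __init__(self, max_entries=1000, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds      # 0 disables expiry
        self._clock = clock                 # injectable so tests can control time
        self._data = OrderedDict()          # key -> (stored_at, value)
        self.evictions = 0

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.ttl_seconds > 0 and self._clock() - stored_at > self.ttl_seconds:
            del self._data[key]             # entry has expired
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1

    def __len__(self):
        return len(self._data)
```

In this sketch, expired entries are removed lazily when they are read, so an entry past its TTL can still count toward max_entries until it is touched or evicted.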

Setting Up Semantic Caching

Basic Configuration with Default Embeddings

The simplest setup uses the default sentence-transformer model for embeddings. Instantiate SemanticCacheLayer by wrapping an existing provider optimizer from the CacheOptimizerRegistry:

from headroom.cache import SemanticCacheLayer, CacheOptimizerRegistry

# Retrieve the provider-specific optimizer (e.g., Anthropic)
provider_opt = CacheOptimizerRegistry.get("anthropic")

# Wrap with semantic caching layer
semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.95,   # Trigger hit when similarity >= 95%
    max_entries=2000,
    ttl_seconds=600,             # Retain entries for 10 minutes
)  # Source: lines 33-40 of headroom/cache/semantic.py

Using a Custom Embedding Function

For production environments requiring specific embedding models, supply a custom embedding_fn that converts text to a list of floats:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    # Convert numpy array to plain Python list
    return embedder.encode(text).tolist()

semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.90,
    embedding_fn=embed,
)
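When writing unit tests, it can help to avoid downloading a model at all. The following sketch shows a deterministic, dependency-free function that satisfies the same contract (a string in, a list of floats out). The name hashed_bow_embed and the 64-dimension default are illustrative choices, and this stand-in only captures word overlap, not meaning, so it is unsuitable for production traffic.

```python
import hashlib
import math
import re


def hashed_bow_embed(text: str, dim: int = 64) -> list[float]:
    """Hash each lowercase word into one of `dim` buckets, then L2-normalize.

    Hypothetical test double for an embedding function; not a real semantic model.
    """
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0

    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return vector  # empty or punctuation-only input stays all zeros
    return [x / norm for x in vector]
```

Because the output is normalized and depends only on the words present, two inputs with the same words in the same order always produce identical vectors, which keeps cache tests reproducible.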

Processing Requests and Storing Responses

Intercepting LLM Calls

Use the process() method to check the cache before invoking the LLM. The method returns a CacheResult object containing semantic_cache_hit status and either cached_response or the fresh response:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between LRU and FIFO."},
]

# OptimizationContext carries the original query; can be None
result = semantic_opt.process(messages, context=None)

if result.semantic_cache_hit:
    # Cache hit - no LLM call performed
    answer = result.cached_response
else:
    # Cache miss - provider optimizer performed the LLM request
    answer = result.response
    # Store the fresh response for future semantic matches
    semantic_opt.store_response(messages, answer)
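Because storing is a separate step, it is easy to forget on one code path. One way to avoid that is to fold the check-then-store pattern into a single helper, sketched below. The function cached_completion and the StubLayer class are hypothetical: the stub only mimics the process(), store_response(), semantic_cache_hit, cached_response, and response names used in this article so that the helper can be exercised without Headroom installed.

```python
from types import SimpleNamespace


def cached_completion(layer, messages, context=None):
    """Return (answer, was_hit) and store fresh answers on a miss."""
    result = layer.process(messages, context=context)
    if result.semantic_cache_hit:
        return result.cached_response, True

    answer = result.response
    if answer is not None:
        # Only populate the cache when a response actually came back.
        layer.store_response(messages, answer)
    return answer, False


class StubLayer:
    """Hypothetical stand-in that mimics the attributes used above."""

    def __init__(self):
        self.stored = {}

    def process(self, messages, context=None):
        key = messages[-1]["content"]
        if key in self.stored:
            return SimpleNamespace(
                semantic_cache_hit=True,
                cached_response=self.stored[key],
                response=None,
            )
        return SimpleNamespace(
            semantic_cache_hit=False,
            cached_response=None,
            response="fresh answer",
        )

    def store_response(self, messages, response):
        self.stored[messages[-1]["content"]] = response
```

With a real layer, the call would look like cached_completion(semantic_opt, messages), and the boolean in the returned tuple tells the caller whether the answer came from the cache.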

Monitoring Cache Performance

The get_stats() method exposes real-time metrics including hit rate, eviction count, and current entry count:

stats = semantic_opt.get_stats()
print("Semantic cache entries:", stats["semantic_cache"]["entries"])
print("Hit rate:", stats["semantic_cache"]["hit_rate"])
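For logging, the nested dictionary can be condensed into a single line. The sketch below assumes the keys named in this article (entries, hit_rate, and evictions under semantic_cache) and further assumes that hit_rate is a fraction between 0 and 1; the helper name summarize_cache_stats is hypothetical, and missing keys fall back to zero rather than raising.

```python
def summarize_cache_stats(stats: dict) -> str:
    """Format semantic cache stats as one log-friendly line.

    Assumes hit_rate is a fraction in [0, 1]; adjust if it is already a percentage.
    """
    section = stats.get("semantic_cache", {})
    entries = section.get("entries", 0)
    hit_rate = section.get("hit_rate", 0.0)
    evictions = section.get("evictions", 0)
    return f"entries={entries} hit_rate={hit_rate:.1%} evictions={evictions}"
```

A persistently low hit rate combined with a high eviction count would suggest raising max_entries before loosening similarity_threshold.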

Summary

  • Wrap any optimizer: The SemanticCacheLayer in headroom/cache/semantic.py wraps provider-specific optimizers to add semantic caching capabilities.
  • Configure similarity: Set similarity_threshold between 0.9 and 0.99 to balance cache hit rate against response accuracy.
  • Explicit storage: The layer does not populate itself on a miss; you must call store_response() after each successful LLM call to add the response to the semantic cache.
  • LRU eviction: The cache automatically evicts the least recently used entries when max_entries is exceeded and respects ttl_seconds for expiration.
  • Dual matching: Enable use_exact_matching for SHA-256 hash checks before expensive cosine similarity computations.

Frequently Asked Questions

What is the default similarity threshold in SemanticCacheLayer?

The default similarity_threshold is 0.95, meaning a cache hit requires a cosine similarity of at least 0.95 between the embedded query and a stored embedding. Lower the value to accept looser paraphrases and get more hits; raise it to return cached responses only for near-identical queries.

How does Headroom handle exact matches versus semantic similarity?

When use_exact_matching is enabled (the default), Headroom first computes a SHA-256 hash of the message payload and checks for an exact match before performing semantic embedding comparisons. This provides immediate returns for identical queries while allowing paraphrased questions to hit the cache via cosine similarity.

Can I use a custom embedding model with SemanticCacheLayer?

Yes. Provide a custom embedding_fn parameter that accepts a string and returns a list of floats. If you omit this parameter, Headroom defaults to using sentence-transformers/all-MiniLM-L6-v2 via the internal embedding configuration defined in headroom/models/config.py.

How do I monitor cache hit rates in Headroom?

Call the get_stats() method on your SemanticCacheLayer instance to retrieve a dictionary containing hit_rate, entries, and evictions under the semantic_cache key. This data helps you tune similarity_threshold and max_entries for your specific traffic patterns.
