# How to Configure Semantic Caching with the SemanticCacheLayer in Headroom

> Learn to configure semantic caching with Headroom's SemanticCacheLayer. Optimize LLM requests by storing responses for similar queries, setting a similarity threshold, and calling store_response().

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-09

---

**To configure semantic caching in Headroom, wrap your provider optimizer with `SemanticCacheLayer` from [`headroom/cache/semantic.py`](https://github.com/chopratejas/headroom/blob/main/headroom/cache/semantic.py), set a `similarity_threshold` (default `0.95`), and call `store_response()` after each LLM request so that later, similar queries can be served from the cache.**

The Headroom library provides a pluggable **Semantic Cache Layer** that intercepts LLM requests and returns cached responses for semantically similar queries, reducing token costs and latency. By wrapping a provider-specific optimizer in the `SemanticCacheLayer` class defined in [`headroom/cache/semantic.py`](https://github.com/chopratejas/headroom/blob/main/headroom/cache/semantic.py), you can add semantic caching with minimal configuration.

## How the Semantic Cache Layer Works

The `SemanticCacheLayer` operates in three distinct phases when processing a request:

1. **Extract the user query** from the message list, specifically targeting the last entry with `role: "user"`.
2. **Compute a cache key** by hashing the entire message payload with SHA-256, which enables the optional exact-match check that runs before the semantic search.
3. **Search the in-memory semantic cache** by embedding the query via a configurable `embedding_fn`, then scanning stored embeddings to compute cosine similarity against the configured `similarity_threshold`.

If an exact hash match exists, the layer returns the cached response immediately. Otherwise, a stored embedding whose cosine similarity meets the `similarity_threshold` counts as a semantic hit. If neither an exact nor a semantic match is found, the layer delegates the request to the underlying provider optimizer. After receiving the LLM response, you must call `store_response()` explicitly to insert the new entry into the cache for future hits.
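The three phases can be sketched with the standard library alone. This is an illustrative approximation, not Headroom's implementation: the helper names (`last_user_query`, `payload_key`, `cosine_similarity`, `best_match`) and the JSON serialization used before hashing are assumptions made for the example.

```python
import hashlib
import json
import math


def last_user_query(messages):
    """Return the content of the last message with role "user", or None."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content")
    return None


def payload_key(messages):
    """SHA-256 hex digest of the whole message list (assumed JSON serialization)."""
    serialized = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_embedding, stored, threshold):
    """Scan (embedding, response) pairs; return the best response at or above threshold."""
    best_score, best_response = -1.0, None
    for embedding, response in stored:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

The linear scan in `best_match` costs one similarity computation per stored entry, which is one reason a bounded cache size keeps lookups cheap.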

## SemanticCacheLayer Configuration Parameters

All configuration options reside in the **`SemanticCacheConfig`** dataclass within [`headroom/cache/semantic.py`](https://github.com/chopratejas/headroom/blob/main/headroom/cache/semantic.py). The cache implementation uses an LRU store based on `OrderedDict` that respects maximum size limits and optional TTL expiration.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `similarity_threshold` | Cosine similarity required for a cache hit (range 0.0–1.0). | `0.95` |
| `max_entries` | Maximum number of entries to retain in the LRU cache. | `1000` |
| `ttl_seconds` | Time-to-live for entries; set to `0` to disable expiry. | `300` (5 minutes) |
| `embedding_model` | Sentence-transformer model name when no custom function is provided. | `sentence-transformers/all-MiniLM-L6-v2` |
| `use_exact_matching` | Whether to check SHA-256 hash matches before semantic search. | `True` |
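To make the interaction between `max_entries` and `ttl_seconds` concrete, here is a minimal sketch of an `OrderedDict`-based LRU store with optional expiry. The class name `LruTtlStore`, its methods, and the injectable `clock` argument are hypothetical and chosen for illustration; Headroom's actual store may differ in its details.

```python
import time
from collections import OrderedDict


class LruTtlStore:
    """Toy LRU store with optional TTL; ttl_seconds=0 disables expiry."""

    def __init__(self, max_entries=1000, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, value)
        self.evictions = 0

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        # Drop the entry lazily if it has outlived its TTL.
        if self.ttl_seconds > 0 and self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]
            return None
        # A successful read marks the entry as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        # Evict from the least recently used end until within the size limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
            self.evictions += 1

    def __len__(self):
        return len(self._entries)
```

Passing a fake `clock` makes expiry easy to exercise in tests without sleeping.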

## Setting Up Semantic Caching

### Basic Configuration with Default Embeddings

The simplest setup uses the default sentence-transformer model for embeddings. Instantiate `SemanticCacheLayer` by wrapping an existing provider optimizer from the `CacheOptimizerRegistry`:

```python
from headroom.cache import SemanticCacheLayer, CacheOptimizerRegistry

# Retrieve the provider-specific optimizer (e.g., Anthropic)
provider_opt = CacheOptimizerRegistry.get("anthropic")

# Wrap it with the semantic caching layer
semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.95,   # Hit when cosine similarity >= 0.95
    max_entries=2000,
    ttl_seconds=600,             # Retain entries for 10 minutes
)  # Source: lines 33-40 of headroom/cache/semantic.py
```

### Using a Custom Embedding Function

For production environments requiring specific embedding models, supply a custom `embedding_fn` that converts text to a list of floats:

```python
from sentence_transformers import SentenceTransformer

from headroom.cache import SemanticCacheLayer, CacheOptimizerRegistry

provider_opt = CacheOptimizerRegistry.get("anthropic")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    # encode() returns a numpy array; convert it to a plain Python list
    return embedder.encode(text).tolist()

semantic_opt = SemanticCacheLayer(
    provider_opt,
    similarity_threshold=0.90,
    embedding_fn=embed,
)
```
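When you want to exercise the cache in unit tests without downloading a model, a deterministic stand-in can fill the same `str -> list[float]` contract. The function below is a hypothetical hashed bag-of-words embedding written for this guide, not part of Headroom. It only captures token overlap, not meaning, so paraphrases will not score as similar the way they would with a sentence-transformer.

```python
import hashlib
import math
import re


def hashed_bow_embed(text: str, dims: int = 64) -> list[float]:
    """Map text to a unit-length vector by hashing lowercase tokens into buckets."""
    vector = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dims
        vector[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in vector] if norm else vector
```

Assuming `embedding_fn` accepts any callable with this shape, as the custom-function example suggests, the stand-in could be passed as `embedding_fn=hashed_bow_embed` in a test setup.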

## Processing Requests and Storing Responses

### Intercepting LLM Calls

Use the `process()` method to check the cache before invoking the LLM. The method returns a `CacheResult` object containing `semantic_cache_hit` status and either `cached_response` or the fresh `response`:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between LRU and FIFO."},
]

# OptimizationContext carries the original query; can be None
result = semantic_opt.process(messages, context=None)

if result.semantic_cache_hit:
    # Cache hit - no LLM call performed
    answer = result.cached_response
else:
    # Cache miss - provider optimizer performed the LLM request
    answer = result.response
    # Store the fresh response for future semantic matches
    semantic_opt.store_response(messages, answer)
```

### Monitoring Cache Performance

The `get_stats()` method exposes real-time metrics including hit rate, eviction count, and current entry count:

```python
stats = semantic_opt.get_stats()
print("Semantic cache entries:", stats["semantic_cache"]["entries"])
print("Hit rate:", stats["semantic_cache"]["hit_rate"])
```

## Summary

- **Wrap any optimizer**: The `SemanticCacheLayer` in [`headroom/cache/semantic.py`](https://github.com/chopratejas/headroom/blob/main/headroom/cache/semantic.py) wraps provider-specific optimizers to add semantic caching capabilities.
- **Configure similarity**: Set `similarity_threshold` between `0.9` and `0.99` to balance cache hit rate against response accuracy. Lower values produce more hits but risk returning answers to questions that only loosely match.
- **Explicit storage**: The layer does not populate itself on a cache miss; you must call `store_response()` after each successful LLM call to add the response to the semantic cache.
- **LRU eviction**: The cache automatically evicts the least recently used entries when `max_entries` is exceeded and expires entries after `ttl_seconds` (unless it is set to `0`).
- **Dual matching**: Enable `use_exact_matching` for SHA-256 hash checks before expensive cosine similarity computations.

## Frequently Asked Questions

### What is the default similarity threshold in SemanticCacheLayer?

The default `similarity_threshold` is `0.95`, meaning a cache hit requires a cosine similarity of at least 0.95 between the embedded query and a stored embedding. Lower the value to accept looser paraphrases, or raise it to return cached responses only for near-identical queries.

### How does Headroom handle exact matches versus semantic similarity?

When `use_exact_matching` is enabled (the default), Headroom first computes a SHA-256 hash of the message payload and checks for an exact match before performing semantic embedding comparisons. This provides immediate returns for identical queries while allowing paraphrased questions to hit the cache via cosine similarity.

### Can I use a custom embedding model with SemanticCacheLayer?

Yes. Provide a custom `embedding_fn` parameter that accepts a string and returns a list of floats. If you omit this parameter, Headroom defaults to using `sentence-transformers/all-MiniLM-L6-v2` via the internal embedding configuration defined in [`headroom/models/config.py`](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py).

### How do I monitor cache hit rates in Headroom?

Call the `get_stats()` method on your `SemanticCacheLayer` instance to retrieve a dictionary containing `hit_rate`, `entries`, and `evictions` under the `semantic_cache` key. This data helps you tune `similarity_threshold` and `max_entries` for your specific traffic patterns.