# How Streaming Works with CCR Response Handling in Headroom

> Understand streaming CCR response handling in Headroom. Discover how it buffers chunks, retrieves cache content, and injects it seamlessly for efficient stream management.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-10

---

**Headroom's streaming CCR response handler buffers incoming chunks to detect CCR tool calls, retrieves original content from the cache, and injects it seamlessly into the stream without client-side intervention.**

The Headroom library implements a **Compress-Cache-Retrieve (CCR)** layer that makes aggressive text compression reversible by caching original content and exposing a retrieval tool. When working with streaming LLM responses, the system must intercept CCR tool calls and inject the full content without breaking the chunk-by-chunk delivery. This article explains how streaming CCR response handling operates under the hood in `chopratejas/headroom`, referencing the actual source implementation.

## The Streaming CCR Handler Architecture

### Chunk Buffering and Detection

In [`ccr/response_handler.py`](https://github.com/chopratejas/headroom/blob/main/ccr/response_handler.py), the `StreamingCCRHandler` wraps the standard response handler to manage chunked LLM output. The handler uses a `StreamingCCRBuffer` to accumulate incoming chunks while scanning for the CCR tool call pattern (`{"name":"headroom_retrieve",…}`).

This buffering occurs transparently, allowing the handler to identify when the LLM emits a request for cached content. As documented in [`wiki/ARCHITECTURE.md`](https://github.com/chopratejas/headroom/blob/main/wiki/ARCHITECTURE.md), the handler monitors the stream for JSON payloads containing CCR instructions without blocking the underlying connection.
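The buffering idea can be illustrated with a small, self-contained sketch. This is not the library's `StreamingCCRBuffer`: the class name `ToyCCRBuffer`, the compact marker string, and the `arguments`/`hash_key` payload shape are all assumptions made for illustration, and the real wire format may differ. The sketch shows how a tool call split across several chunks is only recognized once its JSON is complete.

```python
import json

# Hypothetical marker: assumes the tool call is emitted as compact JSON.
MARKER = '{"name":"headroom_retrieve"'


class ToyCCRBuffer:
    """Illustrative stand-in for a streaming buffer (not Headroom's class)."""

    def __init__(self):
        self.text = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk):
        """Append a chunk; return the parsed tool call once complete, else None."""
        self.text += chunk
        start = self.text.find(MARKER)
        if start == -1:
            return None  # no tool call seen so far
        try:
            # raw_decode parses one JSON value starting at `start`
            call, _end = self._decoder.raw_decode(self.text, start)
        except json.JSONDecodeError:
            return None  # the JSON is still incomplete; wait for more chunks
        return call
```

Feeding the chunks one at a time returns `None` until the closing braces arrive, at which point the parsed call (including the assumed `hash_key` argument) becomes available.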

### Content Retrieval and Injection

Once the handler detects a CCR tool call, it pauses the normal stream and switches to **buffered mode**. The handler extracts the hash key (`CCRToolCall.hash_key`) from the tool call and queries the `CCRStore` (implemented in [`ccr/store.py`](https://github.com/chopratejas/headroom/blob/main/ccr/store.py)) to fetch the original payload.

The retrieved text is then injected into the output stream as a new chunk, and the pipeline resumes yielding combined content to the client. This injection happens internally, ensuring the client sees a seamless flow where compressed markers are transparently replaced with full content.
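As a rough sketch of the injection step, the function below splices cached content into buffered text in place of the tool-call JSON. It assumes a plain `dict` as a stand-in for `CCRStore` and a hypothetical payload shape with an `arguments.hash_key` field; the function name `inject_original` and the fallback placeholder text are illustrative only and are not taken from the Headroom source.

```python
import json

# Hypothetical marker: assumes the tool call is emitted as compact JSON.
MARKER = '{"name":"headroom_retrieve"'


def inject_original(buffered, store):
    """Replace the first complete tool call in `buffered` with cached content."""
    start = buffered.find(MARKER)
    if start == -1:
        return buffered  # nothing to resolve
    # `end` is the absolute index in `buffered` where the JSON value stops
    call, end = json.JSONDecoder().raw_decode(buffered, start)
    key = call["arguments"]["hash_key"]  # assumed field name
    original = store.get(key, f"[missing CCR entry {key}]")
    return buffered[:start] + original + buffered[end:]
```

Text before and after the tool call is preserved, so the client-facing output reads as continuous prose with the original content in place of the call.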

### Iterative Processing

The handler loops through the stream continuously, checking for additional CCR tool calls. If the LLM emits multiple retrieval requests in a single interaction, the process repeats:

1. **Buffer** incoming chunks in `StreamingCCRBuffer`
2. **Detect** the CCR tool call pattern
3. **Retrieve** the original content from `CCRStore` using the hash key
4. **Inject** the retrieved text into the stream
5. **Resume** until the next CCR call or stream end

This ensures **all** compressed markers are resolved before the final response reaches the client.
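The five steps above can be sketched as a single generator. This is a simplified model under stated assumptions, not the `StreamingCCRHandler` implementation: the store is a plain `dict`, the tool call is assumed to be compact JSON with an `arguments.hash_key` field, and the helper names `_hold_from` and `resolve_stream` are hypothetical. Ordinary text passes through immediately, text that might be the start of a tool call is held back, and each complete call is swapped for its cached content.

```python
import json

MARKER = '{"name":"headroom_retrieve"'  # hypothetical compact wire format
_decoder = json.JSONDecoder()


def _hold_from(text):
    """Index from which `text` may be (part of) a tool call and must be held."""
    i = text.find("{")
    while i != -1:
        tail = text[i:]
        if tail.startswith(MARKER) or MARKER.startswith(tail):
            return i
        i = text.find("{", i + 1)
    return len(text)


def resolve_stream(chunks, store):
    """Yield text with every complete tool call replaced by cached content."""
    pending = ""
    for chunk in chunks:
        pending += chunk                      # 1. buffer
        while pending:
            cut = _hold_from(pending)         # 2. detect
            if cut:
                yield pending[:cut]           # safe text before any call
                pending = pending[cut:]
                continue
            try:
                call, end = _decoder.raw_decode(pending)
            except json.JSONDecodeError:
                break                         # call incomplete: wait for more
            key = call["arguments"]["hash_key"]
            yield store[key]                  # 3. retrieve and 4. inject
            pending = pending[end:]           # 5. resume with the remainder
    if pending:
        yield pending                         # flush leftovers at stream end
```

Because the inner loop keeps running until no complete call remains in the buffer, two retrieval requests arriving in the same chunk are both resolved before the next chunk is read.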

## Implementation Details

The streaming pathway is entirely internal. According to the Headroom source code, the client never sees the CCR tool calls; the response handler resolves them automatically. The design is intended to avoid data loss while remaining compatible with any async or chunked LLM API, including OpenAI's `stream=True` or Anthropic's streaming endpoints.

The core classes reside in [`ccr/response_handler.py`](https://github.com/chopratejas/headroom/blob/main/ccr/response_handler.py):

- `StreamingCCRHandler`: Orchestrates the buffering and retrieval process
- `StreamingCCRBuffer`: Accumulates chunks and scans for JSON tool-call payloads
- `CCRResponseHandler`: Base handler for non-streaming CCR operations

## Working with Streaming CCR

### Enabling CCR with Streaming

By default, CCR is enabled in Headroom. When you set `stream=True` on a chat completion, the `StreamingCCRHandler` automatically manages the flow:

```python
from headroom import Headroom, Config

cfg = Config(ccr_enabled=True)          # CCR on by default
client = Headroom(cfg)

# Ask the model a question that will trigger compression + retrieval
response = client.chat(
    messages=[{"role": "user", "content": "Explain the full pipeline for processing a 10kB log file"}],
    stream=True,                          # <-- streaming mode
)

# Consume the streaming generator
for chunk in response:
    print(chunk["content"], end="")
```

In this flow, the model initially works from the compressed summary. When it emits a CCR tool call, the `StreamingCCRHandler` detects it, fetches the original log content from the CCR cache, and the full explanation continues streaming without interruption.

### Understanding CCR Keys

For debugging purposes, you can inspect the CCR key that the response handler searches for:

```python
# Assumes `client` is the Headroom instance created in the previous example.
# Compress content and get the reference key
result = client.compress("some large text")
print("CCR key:", result.ccr_key)         # ← key used by the response handler
```

The `ccr_key` corresponds to the hash stored in `CCRStore` and is what the streaming handler looks for inside the LLM's JSON tool call payload.
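To make the relationship between a key and its cached content concrete, here is a minimal content-addressed store sketch. It is an assumption-laden illustration, not the `CCRStore` implementation: the class name `ToyCCRStore`, the use of SHA-256, and the 16-character truncation are all hypothetical choices, and Headroom's actual hashing scheme may differ.

```python
import hashlib


class ToyCCRStore:
    """Illustrative in-memory store keyed by a content hash (not Headroom's)."""

    def __init__(self):
        self._items = {}

    def put(self, original):
        """Cache `original` and return a short key derived from its content."""
        # Assumed scheme: first 16 hex characters of the SHA-256 digest.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._items[key] = original
        return key

    def get(self, key):
        """Return the cached original, or None if the key is unknown."""
        return self._items.get(key)
```

Deriving the key from the content means that caching the same text twice yields the same key, which keeps the reference embedded in the compressed marker stable.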

### Disabling CCR for Pure Compression

If you want compression without the retrieval mechanism, disable CCR in the configuration:

```python
from headroom import Headroom, Config

cfg = Config(ccr_enabled=False)           # Turn off CCR
client = Headroom(cfg)

# No CCR tool calls will be generated; the LLM only receives compressed summaries.
```

## Key Source Files

| File | Role |
|------|------|
| [`ccr/response_handler.py`](https://github.com/chopratejas/headroom/blob/main/ccr/response_handler.py) | Core CCR response handling, including `CCRResponseHandler`, `StreamingCCRHandler`, and `StreamingCCRBuffer` |
| [`ccr/store.py`](https://github.com/chopratejas/headroom/blob/main/ccr/store.py) | Implements `CCRStore` for caching and retrieving original content by hash key |
| [`wiki/ARCHITECTURE.md`](https://github.com/chopratejas/headroom/blob/main/wiki/ARCHITECTURE.md) | Architectural diagram and detailed phase description, including streaming handling |
| [`wiki/ccr.md`](https://github.com/chopratejas/headroom/blob/main/wiki/ccr.md) | High-level CCR concept overview and usage guide |

## Summary

- **StreamingCCRHandler** in [`ccr/response_handler.py`](https://github.com/chopratejas/headroom/blob/main/ccr/response_handler.py) manages the entire streaming CCR response handling lifecycle.
- The handler uses **buffering** to detect CCR tool calls (`headroom_retrieve`) in incoming chunks without blocking the stream.
- Original content is retrieved from `CCRStore` using the `CCRToolCall.hash_key` and injected seamlessly into the output.
- The process is **iterative**, handling multiple CCR calls in a single stream until all compressed content is resolved.
- This implementation is compatible with standard LLM streaming APIs like OpenAI and Anthropic, requiring no client-side changes.

## Frequently Asked Questions

### How does the StreamingCCRBuffer detect CCR tool calls in chunks?

The `StreamingCCRBuffer` accumulates incoming stream chunks and scans the accumulated text for a specific JSON pattern containing `{"name":"headroom_retrieve",…}`. Detection happens in real time as chunks arrive, allowing the handler to intercept tool calls before they reach the client.

### What happens if multiple CCR tool calls appear in one stream?

The `StreamingCCRHandler` processes CCR calls iteratively. After injecting the first retrieved content, it continues buffering the remaining stream. If additional CCR tool calls appear, the handler repeats the detection, retrieval, and injection cycle until the entire response contains no unresolved CCR references.

### Is CCR response handling compatible with all LLM streaming APIs?

Yes, the streaming CCR response handling is designed to be API-agnostic. Because the handler operates on the raw chunk stream and manages the `headroom_retrieve` tool calls internally, it works with any async or chunked LLM API, including OpenAI's `stream=True` parameter and Anthropic's streaming endpoints.

### Where is the original (uncompressed) content stored during streaming?

The original content is stored in the `CCRStore`, implemented in [`ccr/store.py`](https://github.com/chopratejas/headroom/blob/main/ccr/store.py). When the handler detects a CCR tool call, it extracts the hash key from the call and queries this store to retrieve the full text that was replaced by the compression marker.