How Streaming Works with CCR Response Handling in Headroom
Headroom's streaming CCR response handler buffers incoming chunks to detect CCR tool calls, retrieves original content from the cache, and injects it seamlessly into the stream without client-side intervention.
The Headroom library implements a Compress-Cache-Retrieve (CCR) layer that makes aggressive text compression reversible by caching original content and exposing a retrieval tool. When working with streaming LLM responses, the system must intercept CCR tool calls and inject the full content without breaking the chunk-by-chunk delivery. This article explains how streaming CCR response handling operates under the hood in chopratejas/headroom, referencing the actual source implementation.
The Streaming CCR Handler Architecture
Chunk Buffering and Detection
In ccr/response_handler.py, the StreamingCCRHandler wraps the standard response handler to manage chunked LLM output. The handler uses a StreamingCCRBuffer to accumulate incoming chunks while scanning for the CCR tool call pattern ({"name":"headroom_retrieve",…}).
This buffering occurs transparently, allowing the handler to identify when the LLM emits a request for cached content. As documented in wiki/ARCHITECTURE.md, the handler monitors the stream for JSON payloads containing CCR instructions without blocking the underlying connection.
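To make the buffering step concrete, here is a minimal sketch of how a buffer might accumulate chunks and recognize a complete tool call. It is not the library's StreamingCCRBuffer: the class name `SketchBuffer`, the `feed` method, and the `arguments`/`hash_key` JSON layout are all assumptions made for illustration, and the sketch assumes `"name"` is the first key of the tool-call object.

```python
import json
from typing import Optional


class SketchBuffer:
    """Hypothetical buffer: accumulates text and spots a headroom_retrieve call."""

    MARKER = '"headroom_retrieve"'

    def __init__(self) -> None:
        self._text = ""

    def feed(self, chunk: str) -> Optional[dict]:
        """Append a chunk; return the parsed tool call once its JSON is complete."""
        self._text += chunk
        marker_at = self._text.find(self.MARKER)
        if marker_at == -1:
            return None  # tool name not seen yet (or still split across chunks)
        # Assumption: the call object opens at the nearest "{" before the name.
        start = self._text.rfind("{", 0, marker_at)
        if start == -1:
            return None
        try:
            # raw_decode parses one JSON value and ignores any trailing text.
            call, _end = json.JSONDecoder().raw_decode(self._text, start)
        except json.JSONDecodeError:
            return None  # JSON is still incomplete; wait for more chunks
        return call


buf = SketchBuffer()
buf.feed('Here is text {"name":"head')
buf.feed('room_retrieve","arguments":{"hash_')
call = buf.feed('key":"abc123"}} more')
print(call)
```

The point of the design is that detection is attempted on the accumulated text rather than on individual chunks, so a tool call split across chunk boundaries is still found.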
Content Retrieval and Injection
Once the handler detects a CCR tool call, it pauses the normal stream and switches to buffered mode. The handler extracts the hash key (CCRToolCall.hash_key) from the tool call and queries the CCRStore (implemented in ccr/store.py) to fetch the original payload.
The retrieved text is then injected into the output stream as a new chunk, and the pipeline resumes yielding combined content to the client. This injection happens internally, ensuring the client sees a seamless flow where compressed markers are transparently replaced with full content.
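The retrieve-and-inject step can be sketched as a small generator. Everything here is illustrative: `ToolCallSketch` stands in for CCRToolCall (only the `hash_key` field comes from the description above), a plain dict stands in for CCRStore, and raising `LookupError` on a cache miss is an assumption, since the article does not say how the real handler treats a missing key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass
class ToolCallSketch:
    """Hypothetical stand-in for CCRToolCall; only hash_key is taken from the article."""
    hash_key: str


def stream_with_injection(
    before: Iterable[str],
    call: ToolCallSketch,
    store: Dict[str, str],  # plain dict standing in for CCRStore
    after: Iterable[str],
) -> Iterator[str]:
    """Yield the text before the call, then the cached original, then the rest."""
    yield from before
    original = store.get(call.hash_key)
    if original is None:
        # Assumed behaviour for a cache miss; the real handler may differ.
        raise LookupError(f"no cached content for key {call.hash_key!r}")
    yield original  # injected as a new chunk in place of the tool call
    yield from after


store = {"abc123": "FULL LOG"}
chunks = stream_with_injection(["Summary: "], ToolCallSketch("abc123"), store, [" Done."])
print("".join(chunks))
```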
Iterative Processing
The handler loops through the stream continuously, checking for additional CCR tool calls. If the LLM emits multiple retrieval requests in a single interaction, the process repeats:
- Buffer incoming chunks in StreamingCCRBuffer
- Detect the CCR tool call pattern
- Retrieve the original content from CCRStore using the hash key
- Inject the retrieved text into the stream
- Resume until the next CCR call or stream end
This ensures all compressed markers are resolved before the final response reaches the client.
Implementation Details
The streaming pathway is completely internal. According to the Headroom source code, the client never sees the CCR tool calls; they are handled automatically by the response handler. This design guarantees no data loss while maintaining compatibility with any async or chunked LLM API, including OpenAI's stream=True or Anthropic's streaming endpoints.
The core classes reside in ccr/response_handler.py:
- StreamingCCRHandler: Orchestrates the buffering and retrieval process
- StreamingCCRBuffer: Accumulates chunks and scans for JSON tool-call payloads
- CCRResponseHandler: Base handler for non-streaming CCR operations
Working with Streaming CCR
Enabling CCR with Streaming
By default, CCR is enabled in Headroom. When you set stream=True on a chat completion, the StreamingCCRHandler automatically manages the flow:
from headroom import Headroom, Config

cfg = Config(ccr_enabled=True)  # CCR on by default
client = Headroom(cfg)

# Ask the model a question that will trigger compression + retrieval
response = client.chat(
    messages=[{"role": "user", "content": "Explain the full pipeline for processing a 10kB log file"}],
    stream=True,  # <-- streaming mode
)

# Consume the streaming generator
for chunk in response:
    print(chunk["content"], end="")
In this flow, the LLM first sends a short summary (compressed). The StreamingCCRHandler detects the CCR tool call, fetches the original log file from the CCR cache, and continues streaming the full explanation without interruption.
Understanding CCR Keys
For debugging purposes, you can inspect the CCR key that the response handler searches for:
# Compress content and get the reference key
result = client.compress("some large text")
print("CCR key:", result.ccr_key) # ← key used by the response handler
The ccr_key corresponds to the hash stored in CCRStore and is what the streaming handler looks for inside the LLM's JSON tool call payload.
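As a rough mental model of how a hash key can tie compressed output back to its original, consider the toy store below. It is purely illustrative: `HashKeyedStore`, its `put`/`get` methods, and the choice of a truncated SHA-256 digest are assumptions, and the actual key derivation in ccr/store.py may be entirely different.

```python
import hashlib
from typing import Dict, Optional


class HashKeyedStore:
    """Toy stand-in for a CCR cache: original text is filed under a content hash."""

    def __init__(self) -> None:
        self._originals: Dict[str, str] = {}

    def put(self, original: str) -> str:
        """Cache the original text and return the key that refers to it."""
        # Assumption: key = first 16 hex characters of SHA-256 over the UTF-8 text.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = original
        return key

    def get(self, key: str) -> Optional[str]:
        """Return the cached original, or None if the key is unknown."""
        return self._originals.get(key)


store = HashKeyedStore()
key = store.put("some large text")
print("CCR key:", key)
print("Round trip ok:", store.get(key) == "some large text")
```

Because the key in this sketch is derived from the content itself, caching the same text twice yields the same key, which is one reason content hashes are a common choice for this kind of lookup.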
Disabling CCR for Pure Compression
If you want compression without the retrieval mechanism, disable CCR in the configuration:
cfg = Config(ccr_enabled=False) # Turn off CCR
client = Headroom(cfg)
# No CCR tool calls will be generated; the LLM only receives compressed summaries.
Key Source Files
| File | Role |
|---|---|
| ccr/response_handler.py | Core CCR response handling, including CCRResponseHandler, StreamingCCRHandler, and StreamingCCRBuffer |
| ccr/store.py | Implements CCRStore for caching and retrieving original content by hash key |
| wiki/ARCHITECTURE.md | Architectural diagram and detailed phase description, including streaming handling |
| wiki/ccr.md | High-level CCR concept overview and usage guide |
Summary
- StreamingCCRHandler in ccr/response_handler.py manages the entire streaming CCR response handling lifecycle.
- The handler uses buffering to detect CCR tool calls (headroom_retrieve) in incoming chunks without blocking the stream.
- Original content is retrieved from CCRStore using the CCRToolCall.hash_key and injected seamlessly into the output.
- The process is iterative, handling multiple CCR calls in a single stream until all compressed content is resolved.
- This implementation is compatible with standard LLM streaming APIs like OpenAI and Anthropic, requiring no client-side changes.
Frequently Asked Questions
How does the StreamingCCRBuffer detect CCR tool calls in chunks?
The StreamingCCRBuffer accumulates incoming stream chunks and scans the accumulated text for a JSON payload of the form {"name":"headroom_retrieve",…}. Detection runs as each chunk arrives, so the handler can intercept a tool call before it reaches the client, even when the call is split across several chunks.
What happens if multiple CCR tool calls appear in one stream?
The StreamingCCRHandler processes CCR calls iteratively. After injecting the first retrieved content, it continues buffering the remaining stream. If additional CCR tool calls appear, the handler repeats the detection, retrieval, and injection cycle until the entire response contains no unresolved CCR references.
Is CCR response handling compatible with all LLM streaming APIs?
Yes, the streaming CCR response handling is designed to be API-agnostic. Because the handler operates on the raw chunk stream and manages the headroom_retrieve tool calls internally, it works with any async or chunked LLM API, including OpenAI's stream=True parameter and Anthropic's streaming endpoints.
Where is the original (uncompressed) content stored during streaming?
The original content is stored in the CCRStore, implemented in ccr/store.py. When the handler detects a CCR tool call, it extracts the hash key from the call and queries this store to retrieve the full text that was replaced by the compression marker.