
How Streaming Works with CCR Response Handling in Headroom

June 10, 2026 | chopratejas/headroom

Headroom's streaming CCR response handler buffers incoming chunks to detect CCR tool calls, retrieves original content from the cache, and injects it seamlessly into the stream without client-side intervention.

The Headroom library implements a Compress-Cache-Retrieve (CCR) layer that makes aggressive text compression reversible by caching original content and exposing a retrieval tool. When working with streaming LLM responses, the system must intercept CCR tool calls and inject the full content without breaking the chunk-by-chunk delivery. This article explains how streaming CCR response handling operates under the hood in chopratejas/headroom, referencing the actual source implementation.

The Streaming CCR Handler Architecture

Chunk Buffering and Detection

In ccr/response_handler.py, the StreamingCCRHandler wraps the standard response handler to manage chunked LLM output. The handler uses a StreamingCCRBuffer to accumulate incoming chunks while scanning for the CCR tool call pattern ({"name":"headroom_retrieve",…}).

This buffering occurs transparently, allowing the handler to identify when the LLM emits a request for cached content. As documented in wiki/ARCHITECTURE.md, the handler monitors the stream for JSON payloads containing CCR instructions without blocking the underlying connection.
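The buffering idea can be sketched in a few lines of Python. This is a minimal illustration, not Headroom's StreamingCCRBuffer: the class name SketchBuffer, the feed method, and the "input"/"hash_key" payload layout are all assumptions made for the example. Only the tool name headroom_retrieve comes from the text above.

```python
import json

TOOL_NAME = "headroom_retrieve"  # tool name as quoted in the article


class SketchBuffer:
    """Hypothetical buffer for illustration; not Headroom's actual class."""

    def __init__(self):
        self.text = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk):
        """Append a chunk. Return the parsed tool call once it is complete, else None."""
        self.text += chunk
        start = self.text.find("{")
        while start != -1:
            try:
                # raw_decode parses one JSON value starting at `start`
                obj, end = self._decoder.raw_decode(self.text, start)
            except json.JSONDecodeError:
                # Incomplete or non-JSON text: try the next brace, if any
                start = self.text.find("{", start + 1)
                continue
            if isinstance(obj, dict) and obj.get("name") == TOOL_NAME:
                # Remove the tool call payload from the buffered text
                self.text = self.text[:start] + self.text[end:]
                return obj
            start = self.text.find("{", end)
        return None


buf = SketchBuffer()
for piece in ["Here is ", '{"name":"headroom_', 'retrieve","input":{"hash_key":"abc123"}}']:
    call = buf.feed(piece)
    if call is not None:
        print("detected tool call:", call)
```

The point of the sketch is that a tool call split across chunk boundaries is only recognized once enough text has accumulated to parse as complete JSON.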

Content Retrieval and Injection

Once the handler detects a CCR tool call, it pauses the normal stream and switches to buffered mode. The handler extracts the hash key (CCRToolCall.hash_key) from the tool call and queries the CCRStore (implemented in ccr/store.py) to fetch the original payload.

The retrieved text is then injected into the output stream as a new chunk, and the pipeline resumes yielding combined content to the client. This injection happens internally, ensuring the client sees a seamless flow where compressed markers are transparently replaced with full content.
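As a rough sketch of the retrieve-and-inject step, the example below swaps each detected tool call for the cached original before it reaches the consumer. ToolCallSketch, StoreSketch, and inject are hypothetical names invented for this illustration. The real CCRToolCall and CCRStore interfaces may differ, and the placeholder emitted for a missing key is an assumption about one reasonable fallback.

```python
from dataclasses import dataclass


@dataclass
class ToolCallSketch:
    """Hypothetical stand-in for a detected tool call carrying a hash key."""
    hash_key: str


class StoreSketch:
    """Hypothetical in-memory cache mapping hash keys to original content."""

    def __init__(self):
        self._items = {}

    def put(self, key, content):
        self._items[key] = content

    def get(self, key):
        return self._items.get(key)


def inject(parts, store):
    """Yield text parts unchanged; replace tool calls with cached originals."""
    for part in parts:
        if isinstance(part, ToolCallSketch):
            original = store.get(part.hash_key)
            if original is None:
                # Assumed fallback: surface the miss instead of dropping it silently
                yield f"[missing CCR entry {part.hash_key}]"
            else:
                yield original
        else:
            yield part


store = StoreSketch()
store.put("abc123", "ORIGINAL LOG CONTENT")
stream = ["Summary: ", ToolCallSketch("abc123"), " (end)"]
print("".join(inject(stream, store)))
```

Because the substitution happens inside the generator, the consumer only ever iterates over text.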

Iterative Processing

The handler loops through the stream continuously, checking for additional CCR tool calls. If the LLM emits multiple retrieval requests in a single interaction, the process repeats:

Buffer incoming chunks in StreamingCCRBuffer
Detect the CCR tool call pattern
Retrieve the original content from CCRStore using the hash key
Inject the retrieved text into the stream
Resume until the next CCR call or stream end

This ensures all compressed markers are resolved before the final response reaches the client.
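The loop above can be sketched as a single generator that handles any number of retrieval calls in one stream. This is an illustrative approximation under stated assumptions: the function name stream_with_retrieval, the plain dict used as the store, and the "input"/"hash_key" payload shape are hypothetical and do not reflect Headroom's actual code.

```python
import json


def stream_with_retrieval(chunks, store, tool_name="headroom_retrieve"):
    """Yield client-facing text, replacing each complete tool call with cached content."""
    decoder = json.JSONDecoder()
    pending = ""
    for chunk in chunks:
        pending += chunk
        while pending:
            start = pending.find("{")
            if start == -1:
                # No possible tool call in the buffer: pass the text through
                yield pending
                pending = ""
                break
            if start > 0:
                # Emit plain text that precedes the candidate JSON
                yield pending[:start]
                pending = pending[start:]
            try:
                obj, end = decoder.raw_decode(pending)
            except json.JSONDecodeError:
                break  # JSON not complete yet: wait for more chunks
            if isinstance(obj, dict) and obj.get("name") == tool_name:
                # Assumed payload layout: {"name": ..., "input": {"hash_key": ...}}
                yield store.get(obj["input"]["hash_key"], "")
            else:
                yield pending[:end]  # unrelated JSON passes through untouched
            pending = pending[end:]
    if pending:
        yield pending  # flush whatever never became a complete tool call


cache = {"a1": "FULL-A", "b2": "FULL-B"}
incoming = [
    "Intro ",
    '{"name":"headroom_retrieve","input":{"hash_key":"a1"}}',
    " mid ",
    '{"name":"headroom_',
    'retrieve","input":{"hash_key":"b2"}}',
    " end",
]
print("".join(stream_with_retrieval(incoming, cache)))
```

One limitation of this simplified version is that text containing a stray opening brace is held back until the stream ends. A production handler would need a tighter detection rule.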

Implementation Details

The streaming pathway is internal to the library. According to the Headroom source code, the client never sees the CCR tool calls; the response handler intercepts and resolves them automatically. The design aims to resolve every retrieval request before content reaches the client, while staying compatible with async or chunked LLM APIs such as OpenAI's stream=True or Anthropic's streaming endpoints.

The core classes reside in ccr/response_handler.py:

StreamingCCRHandler: Orchestrates the buffering and retrieval process
StreamingCCRBuffer: Accumulates chunks and scans for JSON tool-call payloads
CCRResponseHandler: Base handler for non-streaming CCR operations

Working with Streaming CCR

Enabling CCR with Streaming

By default, CCR is enabled in Headroom. When you set stream=True on a chat completion, the StreamingCCRHandler automatically manages the flow:

from headroom import Headroom, Config

# CCR is on by default; it is set explicitly here for clarity
cfg = Config(ccr_enabled=True)
client = Headroom(cfg)

# Ask the model a question that will trigger compression + retrieval
response = client.chat(
    messages=[{"role": "user", "content": "Explain the full pipeline for processing a 10kB log file"}],
    stream=True,  # streaming mode
)

# Consume the streaming generator
for chunk in response:
    print(chunk["content"], end="")

In this flow, the model initially works from the compressed summary. When it emits a headroom_retrieve tool call, the StreamingCCRHandler detects it, fetches the original log content from the CCR cache, and continues streaming the full explanation without interruption.

Understanding CCR Keys

For debugging purposes, you can inspect the CCR key that the response handler searches for:


# Compress content and get the reference key
result = client.compress("some large text")
print("CCR key:", result.ccr_key)  # key used by the response handler

The ccr_key corresponds to the hash stored in CCRStore and is what the streaming handler looks for inside the LLM's JSON tool call payload.
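To make the key-to-content relationship concrete, here is a small sketch of a content-addressed cache. The article does not say how Headroom derives its hash, so the SHA-256 digest, the 16-character truncation, and the make_key helper are assumptions chosen purely for illustration.

```python
import hashlib


def make_key(text, length=16):
    """Derive a short, deterministic key from content (assumed scheme, not Headroom's)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:length]


store = {}  # plain dict standing in for a cache
original = "some large text"
key = make_key(original)
store[key] = original

# The same content always maps to the same key, so a later tool call
# that carries this key can recover the original text.
print("key:", key)
print("round trip ok:", store[make_key(original)] == original)
```

With a scheme like this, the key alone is enough to look up the original, which is why the handler only needs the hash from the tool call payload.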

Disabling CCR for Pure Compression

If you want compression without the retrieval mechanism, disable CCR in the configuration:

# Turn off CCR
cfg = Config(ccr_enabled=False)
client = Headroom(cfg)

# No CCR tool calls will be generated; the LLM only receives compressed summaries.

Key Source Files

File	Role
`ccr/response_handler.py`	Core CCR response handling, including `CCRResponseHandler`, `StreamingCCRHandler`, and `StreamingCCRBuffer`
`ccr/store.py`	Implements `CCRStore` for caching and retrieving original content by hash key
`wiki/ARCHITECTURE.md`	Architectural diagram and detailed phase description, including streaming handling
`wiki/ccr.md`	High-level CCR concept overview and usage guide

Summary

StreamingCCRHandler in ccr/response_handler.py manages the entire streaming CCR response handling lifecycle.
The handler uses buffering to detect CCR tool calls (headroom_retrieve) in incoming chunks without blocking the stream.
Original content is retrieved from CCRStore using the CCRToolCall.hash_key and injected seamlessly into the output.
The process is iterative, handling multiple CCR calls in a single stream until all compressed content is resolved.
The handler is compatible with standard LLM streaming APIs, such as those from OpenAI and Anthropic, and requires no client-side changes.

Frequently Asked Questions

How does the StreamingCCRBuffer detect CCR tool calls in chunks?

The StreamingCCRBuffer accumulates incoming stream chunks and scans the accumulated text for a specific JSON pattern containing {"name":"headroom_retrieve",…}. This detection happens in real-time as chunks arrive, allowing the handler to intercept tool calls before they reach the client.

What happens if multiple CCR tool calls appear in one stream?

The StreamingCCRHandler processes CCR calls iteratively. After injecting the first retrieved content, it continues buffering the remaining stream. If additional CCR tool calls appear, the handler repeats the detection, retrieval, and injection cycle until the entire response contains no unresolved CCR references.

Is CCR response handling compatible with all LLM streaming APIs?

Yes, the streaming CCR response handling is designed to be API-agnostic. Because the handler operates on the raw chunk stream and manages the headroom_retrieve tool calls internally, it works with any async or chunked LLM API, including OpenAI's stream=True parameter and Anthropic's streaming endpoints.

Where is the original (uncompressed) content stored during streaming?

The original content is stored in the CCRStore, implemented in ccr/store.py. When the handler detects a CCR tool call, it extracts the hash key from the call and queries this store to retrieve the full text that was replaced by the compression marker.
