
How to Use the Headroom compress() API: Single-Function LLM Message Compression

June 3, 2026 · chopratejas/headroom

The Headroom compress() API provides a single entry point that shrinks LLM message lists through an internal TransformPipeline without requiring proxy configuration or boilerplate code.

The chopratejas/headroom repository delivers an open-source Python library designed to reduce token costs in LLM applications. The compress() function in headroom/compress.py acts as the sole public interface, automatically handling content routing, cache alignment, and model-specific compression strategies through a lazily-initialized singleton pipeline.

How the compress() API Works Internally

When you invoke compress(), the function builds a singleton TransformPipeline via the internal _get_pipeline() helper (lines 27‑42 in headroom/compress.py). This pipeline wires together three core transformation stages:

Stage	Purpose	Implementation Location
CacheAligner	Aligns token prefixes to ensure KV‑cache hits remain stable across multiple calls	Lines 40‑42 in `headroom/compress.py`
ContentRouter	Detects message types (JSON, code, or plain text) and routes each to the appropriate compressor	Lines 41‑44 in `headroom/compress.py`
Kompress / SmartCrusher / CodeCompressor	Perform actual token‑saving compression tailored to text, structured data, or source code	Invoked via `pipeline.apply()` at line 35

The pipeline runs as a single pass: your input messages go through alignment, routing, and compression, and the call then returns a structured result containing the optimized message list and token statistics.
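The lazy singleton pattern described in this section can be sketched as follows. This is a minimal illustration, not Headroom's code: `SketchPipeline`, `get_pipeline`, `align`, and `route_and_compress` are hypothetical names, and the two stages are trivial placeholders standing in for the real transforms.

```python
import threading
from typing import Any, Callable, Dict, List, Optional

Message = Dict[str, Any]
Stage = Callable[[List[Message]], List[Message]]


def align(messages: List[Message]) -> List[Message]:
    # Placeholder stage: a real aligner would stabilize the prompt prefix.
    return list(messages)


def route_and_compress(messages: List[Message]) -> List[Message]:
    # Placeholder stage: collapse runs of whitespace instead of real compression.
    return [dict(m, content=" ".join(m["content"].split())) for m in messages]


class SketchPipeline:
    """Hypothetical stand-in for a transform pipeline."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def apply(self, messages: List[Message]) -> List[Message]:
        # Feed the output of each stage into the next one.
        for stage in self.stages:
            messages = stage(messages)
        return messages


_pipeline: Optional[SketchPipeline] = None
_pipeline_lock = threading.Lock()


def get_pipeline() -> SketchPipeline:
    """Build the pipeline on first use, then reuse the same instance."""
    global _pipeline
    if _pipeline is None:
        with _pipeline_lock:
            # Re-check inside the lock so two threads cannot both build it.
            if _pipeline is None:
                _pipeline = SketchPipeline([align, route_and_compress])
    return _pipeline
```

Building the pipeline only on first use keeps import time low, and every later call pays only for `apply()`.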

Input Validation and Configuration

The API checks its inputs before doing any work. If the messages parameter is empty or the optimize flag is set to False, the function short-circuits and returns the original list unchanged (lines 198‑199).
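A minimal sketch of that early return, assuming nothing about Headroom's internals beyond the behavior described above. `should_skip` and `compress_sketch` are hypothetical names, and the "compression" step is a placeholder that only strips whitespace.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def should_skip(messages: List[Message], optimize: bool = True) -> bool:
    """True when there is nothing to do: no messages, or optimization is off."""
    return not messages or not optimize


def compress_sketch(messages: List[Message], optimize: bool = True) -> List[Message]:
    if should_skip(messages, optimize):
        # Hand back the caller's own list, untouched.
        return messages
    # Placeholder transform: trim surrounding whitespace from each message.
    return [{**m, "content": m["content"].strip()} for m in messages]
```

Checking this first means a disabled or empty call never has to build the pipeline at all.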

Configuration flows through the CompressConfig dataclass, which supplies sensible defaults such as skipping user messages and protecting the most recent four conversation turns. You can override any configuration field at call time using keyword arguments like compress_user_messages, target_ratio, or protect_recent. These values merge into the config object at lines 202‑207.
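One way to picture the merge of keyword overrides into a config dataclass is shown below. `SketchConfig` and `merge_overrides` are hypothetical. The field names follow the keyword arguments named in this guide, but the default for `target_ratio` and the rejection of unknown keys are assumptions of this sketch.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass(frozen=True)
class SketchConfig:
    compress_user_messages: bool = False   # user messages are skipped by default
    protect_recent: int = 4                # most recent turns left untouched
    target_ratio: Optional[float] = None   # assumed default: no explicit target


def merge_overrides(config: SketchConfig, **overrides: Any) -> SketchConfig:
    """Return a copy of config with the given fields replaced."""
    known = {f.name for f in fields(config)}
    unknown = set(overrides) - known
    if unknown:
        # Assumption of this sketch: unknown keys are an error, not ignored.
        raise TypeError(f"unknown config field(s): {sorted(unknown)}")
    return replace(config, **overrides)
```

For example, `merge_overrides(SketchConfig(), target_ratio=0.5)` leaves the other defaults intact and changes only the one field.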

Pipeline Execution and Event Hooks

The compression workflow follows a deterministic sequence:

Hook Execution – If you supply a hooks object, the pipeline triggers pre_compress, compute_biases, and post_compress callbacks at the appropriate stages (starting at line 214).
Query Extraction – The helper _extract_user_query() from headroom/utils.py (line 33) isolates the latest user query, allowing compressors to prioritize content relevant to the current task.
Transform Application – The pipeline’s apply() method receives messages, model metadata, context limits, the extracted query, bias maps, and configuration flags, returning a CompressionResult with transformed content.
Event Emission – After routing and compression, the system fires pipeline extension events (INPUT_ROUTED, INPUT_COMPRESSED), enabling integrations to inspect or modify outputs mid-flow.
Result Packaging – The final CompressResult dataclass aggregates the compressed messages, token counts, the calculated compression_ratio, and a list of applied transforms.
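The sequence above can be sketched in a few lines. The hook method names come from the list above, but their signatures, the exact call order, and the helper names `RecordingHooks`, `extract_user_query`, and `run_sketch` are assumptions made for illustration. The transform step is a whitespace-collapsing placeholder.

```python
from typing import Any, Dict, List, Optional

Message = Dict[str, Any]


class RecordingHooks:
    """Hypothetical hooks object that records which callbacks fired."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def pre_compress(self, messages: List[Message]) -> List[Message]:
        self.calls.append("pre_compress")
        return messages

    def compute_biases(self, messages: List[Message]) -> Dict[int, float]:
        self.calls.append("compute_biases")
        return {}

    def post_compress(self, messages: List[Message]) -> None:
        self.calls.append("post_compress")


def extract_user_query(messages: List[Message]) -> str:
    """Return the text of the latest user message, or an empty string."""
    for message in reversed(messages):
        if message.get("role") == "user" and isinstance(message.get("content"), str):
            return message["content"]
    return ""


def run_sketch(messages: List[Message], hooks: Optional[RecordingHooks] = None) -> Dict[str, Any]:
    if hooks:
        messages = hooks.pre_compress(messages)
    query = extract_user_query(messages)
    biases = hooks.compute_biases(messages) if hooks else {}
    # Placeholder transform: collapse whitespace instead of real compression.
    compressed = [{**m, "content": " ".join(str(m["content"]).split())} for m in messages]
    if hooks:
        hooks.post_compress(compressed)
    return {"messages": compressed, "query": query, "biases": biases}
```

Walking the list in reverse picks up the most recent user message, which is what lets a compressor favor content related to the current request.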

Error Handling Behavior

If any exception occurs during processing, Headroom logs the failure, records a metric via get_otel_metrics() from headroom/observability.py, and safely returns the original unmodified messages (lines 311‑324). This fail‑safe design ensures that production LLM calls continue uninterrupted even if compression encounters an edge case.
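The fail-safe behavior can be approximated with a small wrapper. This is a sketch of the pattern only: `fail_safe` and its `on_failure` callback are hypothetical, with the callback standing in for whatever metric recording the library performs.

```python
import logging
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("compress_sketch")
Message = Dict[str, Any]


def fail_safe(
    transform: Callable[[List[Message]], List[Message]],
    messages: List[Message],
    on_failure: Optional[Callable[[], None]] = None,
) -> List[Message]:
    """Run transform; on any exception, log it and return the input unchanged."""
    try:
        return transform(messages)
    except Exception:
        logger.exception("compression failed; returning original messages")
        if on_failure is not None:
            on_failure()  # e.g. increment a failure counter
        return messages
```

The trade-off is that a failure costs tokens rather than availability: the request still goes out, just uncompressed.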

Practical Code Examples

Basic Usage with Any Provider

from headroom import compress

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."},
    {"role": "assistant", "content": "… very long answer …"},
]

# Compress for Claude Sonnet (default model)
result = compress(messages, model="claude-sonnet-4-5-20250929")

print("Compressed messages:", result.messages)
print("Tokens saved:", result.tokens_saved)
print("Compression ratio:", result.compression_ratio)
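How the printed statistics relate to each other can be illustrated with simple arithmetic. The sketch below assumes the ratio is the fraction of tokens removed. The library may define compression_ratio differently (for example, as the fraction kept), so check the CompressResult definition before relying on it. `SketchStats` and its two count fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SketchStats:
    tokens_before: int
    tokens_after: int

    @property
    def tokens_saved(self) -> int:
        # Never report a negative saving if a transform made things longer.
        return max(self.tokens_before - self.tokens_after, 0)

    @property
    def compression_ratio(self) -> float:
        # Assumed definition: share of the original tokens that were removed.
        if self.tokens_before == 0:
            return 0.0
        return self.tokens_saved / self.tokens_before
```

Under this assumption, shrinking 1,000 tokens to 400 saves 600 tokens, a ratio of 0.6.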

Integration with Anthropic SDK

from anthropic import Anthropic
from headroom import compress

client = Anthropic()
messages = [{"role": "user", "content": "Huge tool output ..."}]

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,  # required by the Messages API
    messages=compressed.messages,
)

Integration with OpenAI SDK

from openai import OpenAI
from headroom import compress

client = OpenAI()
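messages = [
    {"role": "user", "content": "Analyze this data"},
    # A tool message must answer a preceding assistant tool call; the id and
    # function name below are placeholders.
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "fetch_data", "arguments": "{}"}},
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "Very large JSON payload …"},
]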

compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)

Using LiteLLM for Bedrock Models

import litellm
from headroom import compress

messages = [...]  # your list of dicts

compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)

Direct HTTP Implementation

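import os

import httpx
from headroom import compress

messages = [...]  # your messages

compressed = compress(messages, model="claude-sonnet-4-5-20250929")

response = httpx.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 1024,  # required by the Messages API
        "messages": compressed.messages,
    },
)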

Advanced Configuration

from headroom import compress

messages = [...]  # your messages

result = compress(
    messages,
    model="claude-opus-4-20250514",
    compress_user_messages=True,   # also shrink user turns
    target_ratio=0.5,              # keep roughly 50% of tokens
    protect_recent=0,              # compress everything, even the last turn
)

Key Source Files

Understanding these files deepens your ability to debug and extend the compress() API:

headroom/compress.py – Contains the public compress() function, CompressConfig, CompressResult, and the lazy _get_pipeline() singleton factory.
headroom/transforms/__init__.py – Exports TransformPipeline, the orchestrator that sequences CacheAligner, ContentRouter, and concrete compressors.
headroom/transforms/content_router.py – Implements the logic that determines whether to invoke Kompress, SmartCrusher, or CodeCompressor based on content type detection.
headroom/transforms/kompress_compressor.py – Houses the ML‑based text compression engine used as the default for plain‑text messages.
headroom/utils.py – Provides _extract_user_query, the utility that extracts user intent to guide relevance‑aware compression.
headroom/observability.py – Supplies get_otel_metrics(), enabling instrumentation of compression success rates and failure modes.

Summary

The compress() function in headroom/compress.py provides a zero‑configuration entry point for LLM message compression.
Internally, it constructs a singleton TransformPipeline that sequences cache alignment, content routing, and model‑specific compression.
The API accepts standard message dictionaries and returns a CompressResult containing optimized messages plus token statistics.
Configuration occurs through CompressConfig or direct keyword arguments, with sensible defaults protecting recent conversation turns.
Fail‑safe error handling ensures that exceptions return the original message list, maintaining application stability.

Frequently Asked Questions

What happens if the compress() API fails during execution?

If any exception occurs during the compression pipeline, Headroom catches the error at lines 311‑324 in headroom/compress.py, logs the failure, records telemetry via get_otel_metrics(), and returns the original unmodified messages. This design ensures your LLM calls remain functional even when compression encounters unexpected inputs.

Can I compress user messages or only assistant/tool content?

By default, Headroom skips user messages to preserve query intent, but you can override this behavior by passing compress_user_messages=True as a keyword argument to compress(). Protection of recent turns is a separate setting: set protect_recent=0 if you also want the most recent turns compressed. Both parameters merge into the underlying CompressConfig at lines 202‑207.
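The interaction between these two settings can be sketched as a selection step. This is an illustration, not the library's logic: `compressible_indices` is a hypothetical helper, and it counts each message as one "turn", which may not match how Headroom counts turns.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def compressible_indices(
    messages: List[Message],
    compress_user_messages: bool = False,
    protect_recent: int = 4,
) -> List[int]:
    """Return indices of messages that are eligible for compression."""
    # Everything at or after the cutoff is protected as "recent".
    cutoff = max(len(messages) - max(protect_recent, 0), 0)
    eligible = []
    for index, message in enumerate(messages[:cutoff]):
        if message.get("role") == "user" and not compress_user_messages:
            continue  # user messages are skipped unless explicitly enabled
        eligible.append(index)
    return eligible
```

With the defaults, only older assistant and tool messages are candidates. Enabling `compress_user_messages` and setting `protect_recent=0` makes every message a candidate.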

How does Headroom decide which compression algorithm to apply?

The ContentRouter stage (implemented in headroom/transforms/content_router.py) analyzes each message’s structure to detect JSON payloads, source code blocks, or plain text. Based on this classification, it delegates to Kompress for general text, SmartCrusher for structured data, or CodeCompressor for programming languages, ensuring format‑appropriate token reduction.
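A rough sketch of this kind of content-type routing follows. The detection heuristics here are assumptions for illustration and are far simpler than what a production router would need. Only the three compressor names come from this guide, and they appear as plain strings rather than real classes.

```python
import json
import re

# Heuristic line prefixes that suggest source code (an assumption of this sketch).
_CODE_HINTS = re.compile(
    r"^\s*(def |class |import |from \S+ import |function |#include|fn )",
    re.MULTILINE,
)

ROUTES = {"json": "SmartCrusher", "code": "CodeCompressor", "text": "Kompress"}


def detect_content_type(text: str) -> str:
    """Classify text as 'json', 'code', or 'text' using cheap checks."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; keep checking
    if "```" in text or _CODE_HINTS.search(text):
        return "code"
    return "text"


def route(text: str) -> str:
    """Name the compressor that would handle this content."""
    return ROUTES[detect_content_type(text)]
```

Parsing is tried before the code heuristics because valid JSON is cheap to confirm, while code detection is inherently fuzzy.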

Is proxy configuration required to use the compress() function?

No. The compress() API is a plain Python function that processes message dictionaries locally. Unlike approaches that route traffic through an HTTP proxy or middleware, Headroom’s one‑function API performs all transformations in‑process using the TransformPipeline, which makes it compatible with serverless environments and direct SDK integrations.
