How to Use the Headroom compress() API: Single-Function LLM Message Compression
The Headroom compress() API provides a single entry point that shrinks LLM message lists through an internal TransformPipeline without requiring proxy configuration or boilerplate code.
The chopratejas/headroom repository delivers an open-source Python library designed to reduce token costs in LLM applications. The compress() function in headroom/compress.py acts as the sole public interface, automatically handling content routing, cache alignment, and model-specific compression strategies through a lazily-initialized singleton pipeline.
How the compress() API Works Internally
The first time you invoke compress(), the function builds a singleton TransformPipeline via the internal _get_pipeline() helper (lines 27‑42 in headroom/compress.py); later calls reuse that same instance. This pipeline wires together three core transformation stages:
| Stage | Purpose | Implementation Location |
|---|---|---|
| CacheAligner | Aligns token prefixes to ensure KV‑cache hits remain stable across multiple calls | Lines 40‑42 in headroom/compress.py |
| ContentRouter | Detects message types (JSON, code, or plain text) and routes each to the appropriate compressor | Lines 41‑44 in headroom/compress.py |
| Kompress / SmartCrusher / CodeCompressor | Perform actual token‑saving compression tailored to text, structured data, or source code | Invoked via pipeline.apply() at line 35 |
The pipeline executes atomically: your input messages pass through alignment, routing, and compression before returning a structured result containing the optimized message list and token statistics.
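The lazy singleton pattern described above can be illustrated with a small, self-contained sketch. It does not reproduce Headroom's code: `SketchPipeline`, `align_cache`, `route_and_compress`, and `get_pipeline` are hypothetical stand-ins, and the "compression" here only collapses whitespace.

```python
from typing import Callable, List, Optional

Message = dict
Stage = Callable[[List[Message]], List[Message]]


class SketchPipeline:
    """Hypothetical stand-in for TransformPipeline: runs stages in order."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def apply(self, messages: List[Message]) -> List[Message]:
        for stage in self.stages:
            messages = stage(messages)
        return messages


def align_cache(messages: List[Message]) -> List[Message]:
    # Placeholder for cache alignment; returns the messages unchanged.
    return messages


def route_and_compress(messages: List[Message]) -> List[Message]:
    # Placeholder for routing + compression: collapse runs of whitespace.
    return [{**m, "content": " ".join(str(m["content"]).split())} for m in messages]


_pipeline: Optional[SketchPipeline] = None


def get_pipeline() -> SketchPipeline:
    """Build the pipeline on first use, then reuse the same instance."""
    global _pipeline
    if _pipeline is None:
        _pipeline = SketchPipeline([align_cache, route_and_compress])
    return _pipeline
```

Deferring construction to the first call keeps import time cheap, which matters if building the real stages involves loading models or other heavy resources.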
Input Validation and Configuration
The API performs strict input validation before processing. If the messages parameter is empty or the optimize flag is set to False, the function returns the original list unchanged (lines 198‑199).
Configuration flows through the CompressConfig dataclass, which supplies sensible defaults such as skipping user messages and protecting the most recent four conversation turns. You can override any configuration field at call time using keyword arguments like compress_user_messages, target_ratio, or protect_recent. These values merge into the config object at lines 202‑207.
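One common way to merge call-time keyword arguments into a dataclass of defaults is `dataclasses.replace`. The sketch below assumes that approach; `SketchConfig` and `merge_overrides` are hypothetical names, the `target_ratio` default is a guess, and rejecting unknown keys is a choice made for this example rather than documented Headroom behavior.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical mirror of CompressConfig; defaults follow the text above."""

    compress_user_messages: bool = False  # user messages are skipped by default
    protect_recent: int = 4               # most recent four turns are protected
    target_ratio: Optional[float] = None  # assumed default: no explicit target


def merge_overrides(config: SketchConfig, **overrides: Any) -> SketchConfig:
    """Return a copy of config with known keyword overrides applied."""
    known = {f.name for f in fields(config)}
    unknown = set(overrides) - known
    if unknown:
        raise TypeError(f"unknown config fields: {sorted(unknown)}")
    return replace(config, **overrides)
```

Because the dataclass is frozen and `replace` returns a copy, per-call overrides never leak into the shared defaults.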
Pipeline Execution and Event Hooks
The compression workflow follows a deterministic sequence:
- Hook Execution – If you supply a hooks object, the pipeline triggers pre_compress, compute_biases, and post_compress callbacks at the appropriate stages (starting at line 214).
- Query Extraction – The helper _extract_user_query() from headroom/utils.py (line 33) isolates the latest user query, allowing compressors to prioritize content relevant to the current task.
- Transform Application – The pipeline’s apply() method receives messages, model metadata, context limits, the extracted query, bias maps, and configuration flags, returning a CompressionResult with transformed content.
- Event Emission – After routing and compression, the system fires pipeline extension events (INPUT_ROUTED, INPUT_COMPRESSED), enabling integrations to inspect or modify outputs mid-flow.
- Result Packaging – The final CompressResult dataclass aggregates the compressed messages, token counts, the calculated compression_ratio, and a list of applied transforms.
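The sequence above can be approximated in plain Python to show how the steps fit together. This is a simplified sketch, not Headroom's implementation: every name (`run_sketch`, `SketchResult`, `extract_user_query`, `keep_first_words`) is hypothetical, tokens are counted as whitespace-separated words, and the compute_biases hook and event emission steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def count_tokens(messages: List[Message]) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return sum(len(m["content"].split()) for m in messages)


def extract_user_query(messages: List[Message]) -> str:
    """Return the content of the most recent user message, or ''."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content", "")
    return ""


def keep_first_words(messages: List[Message], query: str, limit: int = 3) -> List[Message]:
    """Toy transform: truncate non-user messages to their first `limit` words.

    `query` is unused here; a real compressor could use it to rank content.
    """
    return [
        m if m["role"] == "user"
        else {**m, "content": " ".join(m["content"].split()[:limit])}
        for m in messages
    ]


@dataclass
class SketchResult:
    messages: List[Message]
    tokens_before: int
    tokens_after: int
    transforms_applied: List[str] = field(default_factory=list)

    @property
    def tokens_saved(self) -> int:
        return self.tokens_before - self.tokens_after


def run_sketch(
    messages: List[Message],
    transform: Callable[[List[Message], str], List[Message]],
    pre_compress: Optional[Callable[[List[Message]], List[Message]]] = None,
    post_compress: Optional[Callable[[SketchResult], None]] = None,
) -> SketchResult:
    if pre_compress is not None:              # hook before compression
        messages = pre_compress(messages)
    query = extract_user_query(messages)      # query extraction
    compressed = transform(messages, query)   # transform application
    result = SketchResult(                    # result packaging
        messages=compressed,
        tokens_before=count_tokens(messages),
        tokens_after=count_tokens(compressed),
        transforms_applied=[getattr(transform, "__name__", "transform")],
    )
    if post_compress is not None:             # hook after compression
        post_compress(result)
    return result
```

Calling `run_sketch(messages, keep_first_words)` on a two-message conversation leaves the user turn intact and shortens the assistant turn, with the savings reported on the returned object.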
Error Handling Behavior
If any exception occurs during processing, Headroom logs the failure, records a metric via get_otel_metrics() from headroom/observability.py, and safely returns the original unmodified messages (lines 311‑324). This fail‑safe design ensures that production LLM calls continue uninterrupted even if compression encounters an edge case.
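The fail-safe behavior follows a familiar pattern: wrap the risky step, report the failure, and fall back to the input. A minimal sketch of that pattern is shown here, with the hypothetical `compress_safely` wrapper and an `on_failure` callback standing in for the metrics call; it is not the code at the cited lines.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("compress_sketch")
Message = Dict[str, str]


def compress_safely(
    messages: List[Message],
    compressor: Callable[[List[Message]], List[Message]],
    on_failure: Callable[[Exception], None] = lambda exc: None,
) -> List[Message]:
    """Run compressor; on any exception, log it and return the input unchanged."""
    try:
        return compressor(messages)
    except Exception as exc:  # deliberately broad: compression must not break the call
        logger.warning("compression failed, passing messages through: %s", exc)
        on_failure(exc)  # e.g. increment a failure counter
        return messages
```

The broad `except Exception` is intentional in this sketch: compression is an optimization, so any failure degrades to sending the uncompressed messages rather than raising to the caller.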
Practical Code Examples
Basic Usage with Any Provider
```python
from headroom import compress

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."},
    {"role": "assistant", "content": "… very long answer …"},
]

# Compress for Claude Sonnet (default model)
result = compress(messages, model="claude-sonnet-4-5-20250929")

print("Compressed messages:", result.messages)
print("Tokens saved:", result.tokens_saved)
print("Compression ratio:", result.compression_ratio)
```
Integration with Anthropic SDK
```python
from anthropic import Anthropic
from headroom import compress

client = Anthropic()
messages = [{"role": "user", "content": "Huge tool output ..."}]

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,  # required by the Anthropic Messages API
    messages=compressed.messages,
)
```
Integration with OpenAI SDK
```python
from openai import OpenAI
from headroom import compress

client = OpenAI()
# Illustrative payload: a real "tool" message also needs a tool_call_id and
# must follow an assistant message that contains the matching tool_calls.
messages = [
    {"role": "user", "content": "Analyze this data"},
    {"role": "tool", "content": "Very large JSON payload …"},
]

compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)
```
Using LiteLLM for Bedrock Models
```python
import litellm
from headroom import compress

messages = [...]  # your list of dicts

compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)
```
Direct HTTP Implementation
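```python
import os
import httpx
from headroom import compress

messages = [...]  # your messages
compressed = compress(messages, model="claude-sonnet-4-5-20250929")
httpx.post(
    "https://api.anthropic.com/v1/messages",
    # The Messages API also needs auth and version headers, plus max_tokens.
    headers={"x-api-key": os.environ["ANTHROPIC_API_KEY"], "anthropic-version": "2023-06-01"},
    json={"model": "claude-sonnet-4-5-20250929", "max_tokens": 1024, "messages": compressed.messages},
)
```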
Advanced Configuration
```python
from headroom import compress

result = compress(
    messages,
    model="claude-opus-4-20250514",
    compress_user_messages=True,  # also shrink user turns
    target_ratio=0.5,             # keep roughly 50% of tokens
    protect_recent=0,             # compress everything, even the last turn
)
```
Key Source Files
Understanding these files deepens your ability to debug and extend the compress() API:
- headroom/compress.py – Contains the public compress() function, CompressConfig, CompressResult, and the lazy _get_pipeline() singleton factory.
- headroom/transforms/__init__.py – Exports TransformPipeline, the orchestrator that sequences CacheAligner, ContentRouter, and concrete compressors.
- headroom/transforms/content_router.py – Implements the logic that determines whether to invoke Kompress, SmartCrusher, or CodeCompressor based on content type detection.
- headroom/transforms/kompress_compressor.py – Houses the ML‑based text compression engine used as the default for plain‑text messages.
- headroom/utils.py – Provides _extract_user_query, the utility that extracts user intent to guide relevance‑aware compression.
- headroom/observability.py – Supplies get_otel_metrics(), enabling instrumentation of compression success rates and failure modes.
Summary
- The compress() function in headroom/compress.py provides a zero‑configuration entry point for LLM message compression.
- Internally, it constructs a singleton TransformPipeline that sequences cache alignment, content routing, and model‑specific compression.
- The API accepts standard message dictionaries and returns a CompressResult containing optimized messages plus token statistics.
- Configuration occurs through CompressConfig or direct keyword arguments, with sensible defaults protecting recent conversation turns.
- Fail‑safe error handling ensures that exceptions return the original message list, maintaining application stability.
Frequently Asked Questions
What happens if the compress() API fails during execution?
If any exception occurs during the compression pipeline, Headroom catches the error at lines 311‑324 in headroom/compress.py, logs the failure, records telemetry via get_otel_metrics(), and returns the original unmodified messages. This design ensures your LLM calls remain functional even when compression encounters unexpected inputs.
Can I compress user messages or only assistant/tool content?
By default, Headroom skips user messages to preserve query intent, but you can override this behavior. Pass compress_user_messages=True as a keyword argument to the compress() function, or set protect_recent=0 to compress all turns including the most recent ones. These parameters merge into the underlying CompressConfig at lines 202‑207.
How does Headroom decide which compression algorithm to apply?
The ContentRouter stage (implemented in headroom/transforms/content_router.py) analyzes each message’s structure to detect JSON payloads, source code blocks, or plain text. Based on this classification, it delegates to Kompress for general text, SmartCrusher for structured data, or CodeCompressor for programming languages, ensuring format‑appropriate token reduction.
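To make the routing idea concrete, here is a rough sketch of content-type detection using only the standard library. The heuristics, the `detect_content_type` and `route` functions, and the `ROUTES` table are assumptions for illustration; the actual detection logic in content_router.py may differ substantially.

```python
import json
import re

# Heuristic only: a few tokens that commonly open lines of source code.
_CODE_HINT = re.compile(
    r"^\s*(def |class |import |from \S+ import |function |#include|public |const )",
    re.MULTILINE,
)


def detect_content_type(content: str) -> str:
    """Classify content as 'json', 'code', or 'text' with simple heuristics."""
    stripped = content.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; fall through
    if _CODE_HINT.search(content):
        return "code"
    return "text"


# Hypothetical routing table; the labels mirror the compressors named above.
ROUTES = {"json": "SmartCrusher", "code": "CodeCompressor", "text": "Kompress"}


def route(content: str) -> str:
    """Return the name of the compressor this sketch would pick."""
    return ROUTES[detect_content_type(content)]
```

Parsing with `json.loads` rather than pattern matching keeps the JSON check exact, while the code check stays a best-effort guess that falls back to plain text.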
Is proxy configuration required to use the compress() function?
No. The compress() API operates as a pure Python function that processes message dictionaries locally. Unlike enterprise compression solutions that require HTTP proxies or middleware, Headroom’s one‑function API performs all transformations in‑process using the TransformPipeline, making it compatible with serverless environments and direct SDK integrations.