How to Use the Headroom LiteLLM Callback for Automatic Compression
Import the HeadroomCallback class from headroom.integrations.litellm_callback and append it to litellm.callbacks to enable zero-configuration compression on every LiteLLM request.
The chopratejas/headroom repository provides a drop-in LiteLLM callback for compression that intercepts API calls before they reach the provider, compresses the message payload to reduce token costs, and preserves the original request semantics. Implemented in headroom/integrations/litellm_callback.py, the callback integrates with LiteLLM’s CustomLogger interface and operates in both local (in-process) and cloud-managed modes.
Architecture of the Headroom LiteLLM Callback
The callback architecture follows LiteLLM’s async_pre_call_hook pattern, ensuring compression happens transparently without modifying your application logic.
Callback Class and Initialization
The HeadroomCallback class stores configuration parameters including min_tokens, model_limit, optional custom hooks, and cloud credentials (api_key, api_url). When HEADROOM_API_KEY is detected in the environment or passed explicitly to the constructor, the callback automatically enables cloud mode at lines 51–71 of headroom/integrations/litellm_callback.py.
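The mode selection described above can be sketched roughly as follows. This is an illustration, not the repository's code: `CallbackConfig`, `resolve_config`, and `DEFAULT_API_URL` are hypothetical names, and the precedence of an explicit argument over the environment variable is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: used here only as a placeholder default. The article shows this
# URL as an example value for HEADROOM_API_URL, not as a documented default.
DEFAULT_API_URL = "https://api.headroomlabs.ai"


@dataclass
class CallbackConfig:
    """Hypothetical stand-in for the settings HeadroomCallback stores."""
    api_key: Optional[str]
    api_url: str
    cloud_mode: bool


def resolve_config(
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> CallbackConfig:
    """Pick cloud or local mode from explicit arguments and the environment."""
    env = os.environ if env is None else env
    # Assumed precedence: constructor argument first, then environment variable.
    key = api_key or env.get("HEADROOM_API_KEY")
    url = api_url or env.get("HEADROOM_API_URL") or DEFAULT_API_URL
    # Cloud mode is enabled whenever a key is available from either source.
    return CallbackConfig(api_key=key, api_url=url, cloud_mode=bool(key))
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.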
Pre-Call Hook Execution
The async_pre_call_hook method (lines 84–108) executes before every LiteLLM API call. It validates that the call type is completion or acompletion, extracts the messages payload and target model, then delegates to either _cloud_compress or _local_compress. Upon successful compression, it replaces data["messages"] with the compressed version and increments an internal total_tokens_saved counter.
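A simplified sketch of that control flow is shown below. It deliberately avoids importing LiteLLM so it runs on its own; `PreCallHookSketch` and `drop_empty` are hypothetical, the hook signature is reduced to two arguments (LiteLLM's real hook receives more), and the "tokens saved" figure produced by the toy compressor is just a count of dropped messages.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple

Messages = List[Dict[str, Any]]
# A compressor returns (compressed_messages, tokens_saved), or None on failure.
Compressor = Callable[[Messages, str], Awaitable[Optional[Tuple[Messages, int]]]]


class PreCallHookSketch:
    """Hypothetical stand-in that mirrors the hook flow described above."""

    SUPPORTED_CALL_TYPES = {"completion", "acompletion"}

    def __init__(self, compressor: Compressor) -> None:
        self._compressor = compressor
        self.total_tokens_saved = 0

    async def async_pre_call_hook(self, data: Dict[str, Any], call_type: str) -> Dict[str, Any]:
        # Only chat-completion style calls are compressed.
        if call_type not in self.SUPPORTED_CALL_TYPES:
            return data
        messages = data.get("messages")
        if not messages:
            return data
        result = await self._compressor(messages, data.get("model", ""))
        if result is not None:
            compressed, saved = result
            data["messages"] = compressed      # swap in the compressed payload
            self.total_tokens_saved += saved   # accumulate savings across calls
        return data


async def drop_empty(messages: Messages, model: str) -> Tuple[Messages, int]:
    """Toy compressor (not Headroom's algorithm): drops messages with empty content."""
    kept = [m for m in messages if m.get("content")]
    return kept, len(messages) - len(kept)
```

Injecting the compressor as a callable makes the local and cloud paths interchangeable from the hook's point of view, which matches the delegation the article describes.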
Local Compression Path
In local mode, the callback invokes headroom.compress.compress (referenced at lines 24–41) to process messages in-process. The compressor receives the original messages, model name, and configured model_limit, returning a CompressionResult containing the compressed payload and token statistics.
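To make the shape of that result concrete, here is a minimal sketch of a result object and a `min_tokens` gate. The field names on `CompressionResultSketch`, the four-characters-per-token estimate, and `compress_if_worthwhile` are all assumptions for illustration; the actual `CompressionResult` fields and token counting in Headroom may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Messages = List[Dict[str, Any]]


@dataclass(frozen=True)
class CompressionResultSketch:
    """Hypothetical result shape: compressed payload plus token statistics."""
    messages: Messages
    tokens_before: int
    tokens_after: int

    @property
    def tokens_saved(self) -> int:
        return max(0, self.tokens_before - self.tokens_after)


def estimate_tokens(messages: Messages) -> int:
    # Crude heuristic (about 4 characters per token), not a real tokenizer.
    return sum(len(str(m.get("content", ""))) for m in messages) // 4


def compress_if_worthwhile(
    messages: Messages,
    min_tokens: int,
    compressor: Callable[[Messages], Messages],
) -> CompressionResultSketch:
    """Skip compression for small payloads, otherwise report before/after counts."""
    before = estimate_tokens(messages)
    if before < min_tokens:
        # Below the threshold: return the original messages untouched.
        return CompressionResultSketch(messages, before, before)
    compressed = compressor(messages)
    return CompressionResultSketch(compressed, before, estimate_tokens(compressed))
```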
Cloud Compression Path
When an API key is present, the callback lazily initializes an httpx.AsyncClient and POSTs the payload to Headroom Cloud’s /v1/saas/compress endpoint (lines 42–74). The cloud response must include messages, tokens_before, and related fields. Errors are logged but do not abort the request, maintaining fail-open behavior.
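The fail-open behavior can be illustrated with a small wrapper. In this sketch `post` is a hypothetical injected coroutine standing in for the `httpx.AsyncClient` POST to `/v1/saas/compress`; the request body shape, the timeout value, and the exact set of required fields beyond `messages` and `tokens_before` are assumptions.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("headroom.sketch")

Messages = List[Dict[str, Any]]
# Only the two fields the article names explicitly are checked here.
REQUIRED_FIELDS = ("messages", "tokens_before")


async def compress_fail_open(
    messages: Messages,
    post: Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]],
    timeout: float = 5.0,
) -> Messages:
    """Return compressed messages, or the originals if anything goes wrong."""
    try:
        response = await asyncio.wait_for(post({"messages": messages}), timeout)
        missing = [field for field in REQUIRED_FIELDS if field not in response]
        if missing:
            raise ValueError(f"response missing fields: {missing}")
        return response["messages"]
    except Exception as exc:
        # Fail open: log the problem and let the request proceed uncompressed.
        logger.warning("compression failed, sending original messages: %s", exc)
        return messages
```

Catching `Exception` rather than `BaseException` means task cancellation still propagates, while network errors, timeouts, and malformed responses all fall back to the original payload.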
Setting Up the LiteLLM Callback
Local-Only Compression
For development or air-gapped environments, use the local compressor without external dependencies:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
# Register the callback
litellm.callbacks = [HeadroomCallback()]
# All subsequent LiteLLM calls are automatically compressed
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
)
print(response.choices[0].message.content)
Cloud-Managed Compression
For production deployments requiring centralized analytics, provide a Headroom Cloud API key:
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
# Option A: Pass the key explicitly
litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]
# Option B: Use environment variables (recommended); use this instead of Option A, not in addition to it
# export HEADROOM_API_KEY=hdr_XXXXXXXXXXXXXXXX
# export HEADROOM_API_URL=https://api.headroomlabs.ai  # optional
litellm.callbacks = [HeadroomCallback()]  # Reads HEADROOM_API_KEY automatically
response = litellm.completion(
    model="anthropic/claude-3-sonnet-20240229",
    messages=[{"role": "user", "content": "Summarize the last 10 years of AI research."}],
)
YAML-Based Configuration
When using the litellm CLI or proxy server, configure the callback via YAML:
litellm_settings:
  callbacks:
    - headroom.integrations.litellm_callback.HeadroomCallback
environment_variables:
  HEADROOM_API_KEY: "hdr_XXXXXXXXXXXXXXXX"
Configuration Options and Observability
The HeadroomCallback constructor accepts several key parameters:
- min_tokens: Threshold below which compression is skipped (default varies by model)
- model_limit: Maximum context window for the target model (auto-detected for known models)
- api_key / api_url: Cloud service credentials
- hooks: Optional custom preprocessing functions
Monitor compression effectiveness via the total_tokens_saved property, which accumulates token savings across all invocations (lines 75–78).
Summary
- Zero-code integration: Append HeadroomCallback() to litellm.callbacks to enable automatic compression on all completion requests.
- Dual-mode operation: Runs locally using headroom.compress.compress or remotely via Headroom Cloud’s /v1/saas/compress endpoint when HEADROOM_API_KEY is set.
- Fail-safe design: Compression errors are logged but never block the original API request, ensuring reliability.
- Observable metrics: Access cumulative token savings through the total_tokens_saved property.
- Flexible configuration: Supports Python instantiation, environment variables, and YAML-based proxy configuration.
Frequently Asked Questions
How does the callback handle unsupported model names?
The HeadroomCallback attempts to resolve model limits through the internal headroom/providers/litellm.py logic. If the model is unknown, it falls back to a conservative default or the explicit model_limit parameter you provide. Compression still occurs, but the token savings calculation uses estimated context window sizes.
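One plausible reading of that fallback order is sketched below. The lookup table, the default value, and the function name are hypothetical and chosen only for illustration; the real resolution logic lives in headroom/providers/litellm.py and may order or compute things differently.

```python
from typing import Mapping, Optional

# Hypothetical table and default, for illustration only.
KNOWN_LIMITS: Mapping[str, int] = {"gpt-4o-mini": 128_000}
CONSERVATIVE_DEFAULT = 8_192


def resolve_model_limit(
    model: str,
    explicit_limit: Optional[int] = None,
    known: Mapping[str, int] = KNOWN_LIMITS,
    default: int = CONSERVATIVE_DEFAULT,
) -> int:
    """Assumed order: explicit model_limit, then known models, then a default."""
    if explicit_limit is not None:
        return explicit_limit
    return known.get(model, default)
```

For a custom or self-hosted model, passing `model_limit` explicitly avoids relying on whatever default the library picks.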
Can I use the callback with LiteLLM’s async methods?
Yes. The async_pre_call_hook implementation in headroom/integrations/litellm_callback.py handles both synchronous completion and asynchronous acompletion calls. The callback uses httpx.AsyncClient for cloud requests, ensuring non-blocking I/O during the compression phase.
Does compression add latency to my requests?
Local compression adds minimal overhead (typically milliseconds) as it runs in-process using the core headroom.compress module. Cloud compression introduces network latency equivalent to one HTTP POST to the Headroom API endpoint, though this is often offset by reduced token processing time on the LLM provider side.
What happens if the Headroom Cloud service is unavailable?
The callback implements fail-open behavior. If the cloud endpoint returns an error or times out, the original messages payload is preserved and the request continues to the LLM provider. Errors are logged via standard Python logging, and the total_tokens_saved counter remains unchanged for that request.