How to Use the Headroom LiteLLM Callback for Automatic Compression

Import the HeadroomCallback class from headroom.integrations.litellm_callback and add an instance of it to litellm.callbacks to enable zero-configuration compression on every LiteLLM completion request.

The chopratejas/headroom repository provides a drop-in LiteLLM callback for compression that intercepts API calls before they reach the provider, compresses the message payload to reduce token costs, and preserves the original request semantics. Implemented in headroom/integrations/litellm_callback.py, the callback integrates with LiteLLM’s CustomLogger interface and operates in both local (in-process) and cloud-managed modes.

Architecture of the Headroom LiteLLM Callback

The callback architecture follows LiteLLM’s async_pre_call_hook pattern, ensuring compression happens transparently without modifying your application logic.

Callback Class and Initialization

The HeadroomCallback class stores configuration parameters including min_tokens, model_limit, optional custom hooks, and cloud credentials (api_key, api_url). When HEADROOM_API_KEY is set in the environment or an api_key is passed explicitly to the constructor, the callback automatically enables cloud mode (lines 51–71 of headroom/integrations/litellm_callback.py).
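The mode selection described above can be sketched roughly as follows. This is a minimal stand-in, not the repository's code: the class name CallbackConfig and the cloud_mode property are hypothetical, and the rule that an explicit argument takes precedence over the environment is an assumption. Only the parameter names and the HEADROOM_API_KEY / HEADROOM_API_URL variables come from the text.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallbackConfig:
    """Hypothetical stand-in for the settings HeadroomCallback stores."""

    min_tokens: Optional[int] = None
    model_limit: Optional[int] = None
    api_key: Optional[str] = None
    api_url: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: an explicit argument wins; otherwise read the environment.
        self.api_key = self.api_key or os.environ.get("HEADROOM_API_KEY")
        self.api_url = self.api_url or os.environ.get("HEADROOM_API_URL")

    @property
    def cloud_mode(self) -> bool:
        # Cloud mode is on only when some API key is available.
        return bool(self.api_key)
```

With no key in the environment and none passed in, cloud_mode is False and the callback would stay on the local path.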

Pre-Call Hook Execution

The async_pre_call_hook method (lines 84–108) executes before every LiteLLM API call. It validates that the call type is completion or acompletion, extracts the messages payload and target model, then delegates to either _cloud_compress or _local_compress. Upon successful compression, it replaces data["messages"] with the compressed version and increments an internal total_tokens_saved counter.
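The flow of the hook can be illustrated with a small stand-in that does not import LiteLLM. The parameter list (user_api_key_dict, cache, data, call_type) is modeled on LiteLLM's custom-callback pre-call hook; the class name PreCallHookSketch, the injected compressor signature, and the toy drop_empty_messages function are all hypothetical and exist only to make the sketch runnable.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Messages = List[Dict[str, Any]]
# Hypothetical compressor signature: returns (compressed messages, tokens saved).
Compressor = Callable[[Messages, str], Awaitable[Tuple[Messages, int]]]


class PreCallHookSketch:
    """Stand-in mirroring the described flow; not the real HeadroomCallback."""

    def __init__(self, compressor: Compressor) -> None:
        self._compress = compressor
        self.total_tokens_saved = 0

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type: str) -> dict:
        # Only completion-style calls are compressed; everything else passes through.
        if call_type not in ("completion", "acompletion"):
            return data
        messages = data.get("messages")
        if not messages:
            return data
        compressed, saved = await self._compress(messages, data.get("model", ""))
        # Swap in the compressed payload and accumulate the savings.
        data["messages"] = compressed
        self.total_tokens_saved += saved
        return data


async def drop_empty_messages(messages: Messages, model: str) -> Tuple[Messages, int]:
    # Toy compressor: drops messages with empty content, counting one "token" each.
    kept = [m for m in messages if m.get("content")]
    return kept, len(messages) - len(kept)
```

Calling asyncio.run(PreCallHookSketch(drop_empty_messages).async_pre_call_hook(None, None, data, "completion")) returns the same data dict with its messages replaced.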

Local Compression Path

In local mode, the callback invokes headroom.compress.compress (referenced at lines 24–41) to process messages in-process. The compressor receives the original messages, model name, and configured model_limit, returning a CompressionResult containing the compressed payload and token statistics.
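To make the shape of that return value concrete, here is a toy sketch. The field names on CompressionResultSketch are assumptions, the whitespace-based token count stands in for a real tokenizer, and toy_compress simply trims the oldest non-system messages until the payload fits model_limit. The real headroom.compress.compress is not shown in the text and presumably does something more sophisticated.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Messages = List[Dict[str, Any]]


@dataclass
class CompressionResultSketch:
    """Hypothetical result shape; field names are assumptions."""

    messages: Messages
    tokens_before: int
    tokens_after: int

    @property
    def tokens_saved(self) -> int:
        return max(self.tokens_before - self.tokens_after, 0)


def count_tokens(messages: Messages) -> int:
    # Crude whitespace count, standing in for a real tokenizer.
    return sum(len(str(m.get("content", "")).split()) for m in messages)


def toy_compress(messages: Messages, model: str, model_limit: int) -> CompressionResultSketch:
    """Toy stand-in: trims old messages rather than truly compressing them."""
    before = count_tokens(messages)
    kept = list(messages)
    # Drop the oldest non-system message until the payload fits the limit.
    while count_tokens(kept) > model_limit:
        idx = next((i for i, m in enumerate(kept) if m.get("role") != "system"), None)
        if idx is None:
            break  # only system messages remain; nothing more to trim
        kept.pop(idx)
    return CompressionResultSketch(kept, before, count_tokens(kept))
```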

Cloud Compression Path

When an API key is present, the callback lazily initializes an httpx.AsyncClient and POSTs the payload to Headroom Cloud’s /v1/saas/compress endpoint (lines 42–74). The cloud response must include messages, tokens_before, and related fields. Errors are logged but do not abort the request, maintaining fail-open behavior.
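The lazy initialization and fail-open behavior can be sketched without a network dependency. In this stand-in the HTTP client is replaced by an injected async sender so the example runs with the standard library only; the class name CloudCompressSketch, the make_sender factory, and the request body shape are assumptions, while the messages field in the response follows the text.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List, Optional

logger = logging.getLogger("headroom_sketch")

Messages = List[Dict[str, Any]]
# Hypothetical transport: takes a JSON-like payload, returns the decoded response body.
Sender = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


class CloudCompressSketch:
    """Stand-in for the cloud path; the real code uses httpx.AsyncClient."""

    def __init__(self, make_sender: Callable[[], Sender]) -> None:
        self._make_sender = make_sender
        self._sender: Optional[Sender] = None  # created on first use

    async def compress(self, messages: Messages, model: str) -> Messages:
        try:
            if self._sender is None:
                self._sender = self._make_sender()
            # Assumed request shape; the text does not spell out the request body.
            body = await self._sender({"messages": messages, "model": model})
            return body["messages"]
        except Exception:
            # Fail open: log the error and fall back to the uncompressed payload.
            logger.exception("compression failed; sending original messages")
            return messages
```

Catching a broad Exception here is deliberate in the sketch: a missing response field and a transport failure both end with the original messages going to the provider.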

Setting Up the LiteLLM Callback

Local-Only Compression

For development or air-gapped environments, use the local compressor without external dependencies:

import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Register the callback
litellm.callbacks = [HeadroomCallback()]

# All subsequent LiteLLM calls are automatically compressed
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}]
)
print(response.choices[0].message.content)

Cloud-Managed Compression

For production deployments requiring centralized analytics, provide a Headroom Cloud API key:

import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Option A: Pass the key explicitly
litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]

# Option B: Use environment variables (recommended)
# export HEADROOM_API_KEY=hdr_XXXXXXXXXXXXXXXX
# export HEADROOM_API_URL=https://api.headroomlabs.ai  # optional
litellm.callbacks = [HeadroomCallback()]  # Reads HEADROOM_API_KEY automatically

response = litellm.completion(
    model="anthropic/claude-3-sonnet-20240229",
    messages=[{"role": "user", "content": "Summarize the last 10 years of AI research."}]
)

YAML-Based Configuration

When running the LiteLLM proxy server or CLI, configure the callback via YAML:

litellm_settings:
  callbacks:
    - headroom.integrations.litellm_callback.HeadroomCallback

environment_variables:
  HEADROOM_API_KEY: "hdr_XXXXXXXXXXXXXXXX"

Configuration Options and Observability

The HeadroomCallback constructor accepts several key parameters:

  • min_tokens: Threshold below which compression is skipped (default varies by model)
  • model_limit: Maximum context window for the target model (auto-detected for known models)
  • api_key / api_url: Cloud service credentials
  • hooks: Optional custom preprocessing functions

Monitor compression effectiveness via the total_tokens_saved property, which accumulates token savings across all invocations (lines 75–78).
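How a min_tokens gate and a cumulative counter interact can be shown with a small helper. SavingsTracker is hypothetical and is not part of Headroom; the default of 500 is an arbitrary placeholder, since the text only says the real default varies by model.

```python
class SavingsTracker:
    """Hypothetical helper: a min_tokens gate plus a running savings counter."""

    def __init__(self, min_tokens: int = 500) -> None:  # placeholder default
        self.min_tokens = min_tokens
        self.total_tokens_saved = 0

    def should_compress(self, tokens_before: int) -> bool:
        # Payloads below the threshold are skipped entirely.
        return tokens_before >= self.min_tokens

    def record(self, tokens_before: int, tokens_after: int) -> None:
        # Never count a negative saving if compression made the payload larger.
        self.total_tokens_saved += max(tokens_before - tokens_after, 0)
```

A caller would check should_compress before doing any work and call record only after a successful compression, so skipped and failed requests leave the counter unchanged.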

Summary

  • Zero-configuration integration: Add HeadroomCallback() to litellm.callbacks to enable automatic compression on all completion requests.
  • Dual-mode operation: Runs locally using headroom.compress.compress or remotely via Headroom Cloud’s /v1/saas/compress endpoint when HEADROOM_API_KEY is set.
  • Fail-safe design: Compression errors are logged but never block the original API request, ensuring reliability.
  • Observable metrics: Access cumulative token savings through the total_tokens_saved property.
  • Flexible configuration: Supports Python instantiation, environment variables, and YAML-based proxy configuration.

Frequently Asked Questions

How does the callback handle unsupported model names?

The HeadroomCallback attempts to resolve model limits through the internal headroom/providers/litellm.py logic. If the model is unknown, it falls back to a conservative default or the explicit model_limit parameter you provide. Compression still occurs, but the token savings calculation uses estimated context window sizes.

Can I use the callback with LiteLLM’s async methods?

Yes. The async_pre_call_hook implementation in headroom/integrations/litellm_callback.py handles both synchronous completion and asynchronous acompletion calls. The callback uses httpx.AsyncClient for cloud requests, ensuring non-blocking I/O during the compression phase.

Does compression add latency to my requests?

Local compression adds minimal overhead (typically milliseconds) as it runs in-process using the core headroom.compress module. Cloud compression introduces network latency equivalent to one HTTP POST to the Headroom API endpoint, though this is often offset by reduced token processing time on the LLM provider side.

What happens if the Headroom Cloud service is unavailable?

The callback implements fail-open behavior. If the cloud endpoint returns an error or times out, the original messages payload is preserved and the request continues to the LLM provider. Errors are logged via standard Python logging, and the total_tokens_saved counter remains unchanged for that request.
