
How to Integrate Headroom with LiteLLM Using HeadroomCallback

June 5, 2026 chopratejas/headroom

TLDR: Assign HeadroomCallback to litellm.callbacks to automatically compress every LiteLLM completion request via the async_pre_call_hook, choosing between in‑process local compression and managed cloud compression while preserving the original payload on any failure.

If you want to reduce token usage in your LiteLLM pipelines, the chopratejas/headroom repository provides a drop‑in HeadroomCallback that integrates directly with LiteLLM’s CustomLogger interface. This callback intercepts outgoing requests and compresses the messages array before it reaches the model, supporting both in‑process local compression and managed cloud compression via Headroom’s API.

How HeadroomCallback Intercepts LiteLLM Requests

The async_pre_call_hook Implementation

In headroom/integrations/litellm_callback.py, the HeadroomCallback class implements the asynchronous async_pre_call_hook method. This hook fires before every LiteLLM completion or acompletion call, giving the callback access to the original messages and model parameters. If the call type is not a completion, the callback forwards the request unchanged so other LiteLLM workflows remain unaffected.
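The pass‑through behavior described above can be sketched in a few lines. This is a minimal illustration, not Headroom's code: the class name PassThroughSketch, the COMPLETION_CALL_TYPES set, and the whitespace‑collapsing _compress placeholder are all assumptions made for the example. Only the hook name and its four parameters follow LiteLLM's CustomLogger convention.

```python
import asyncio

# Assumed set of call types treated as completions (illustrative only).
COMPLETION_CALL_TYPES = {"completion", "acompletion"}


class PassThroughSketch:
    """Hypothetical stand-in that mimics the hook's control flow."""

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Anything that is not a completion is forwarded untouched.
        if call_type not in COMPLETION_CALL_TYPES:
            return data
        messages = data.get("messages")
        if not messages:
            return data
        # Swap the messages for their compressed form and hand the request back.
        data["messages"] = self._compress(messages)
        return data

    def _compress(self, messages):
        # Placeholder "compression": collapse runs of whitespace in string content.
        out = []
        for message in messages:
            content = message.get("content")
            if isinstance(content, str):
                message = {**message, "content": " ".join(content.split())}
            out.append(message)
        return out
```

Calling the hook with call_type="embeddings" returns the request dictionary as it was, while "completion" returns it with rewritten messages.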

Local vs. Cloud Compression Modes

The callback selects its compression path based on the presence of an API key:

Local mode — This is the default. The callback invokes _local_compress, which calls headroom.compress.compress in‑process to shrink the payload and return token statistics.
Cloud mode — If you provide an api_key argument or set the HEADROOM_API_KEY environment variable, the callback uses _cloud_compress. This method lazily instantiates an httpx.AsyncClient and POSTs the payload to https://api.headroomlabs.ai/v1/saas/compress.

After either path succeeds, the callback replaces the original messages field with the compressed version, adds the tokens saved to its internal _total_saved counter, and logs a concise summary. The running total is exposed through the total_tokens_saved property.
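The mode selection and the savings counter can be pictured with the following sketch. It is an assumption‑laden illustration rather than the actual class: ModeSelectionSketch, its mode property, and its record method are hypothetical names, and the rule that an explicit api_key argument takes precedence over HEADROOM_API_KEY is assumed, not confirmed by the source.

```python
import os


class ModeSelectionSketch:
    """Hypothetical model of how a key decides the compression path."""

    def __init__(self, api_key=None):
        # Assumption: an explicit argument wins over the environment variable.
        self.api_key = api_key or os.environ.get("HEADROOM_API_KEY")
        self._total_saved = 0

    @property
    def mode(self):
        # A key of any origin selects Cloud mode; no key means Local mode.
        return "cloud" if self.api_key else "local"

    @property
    def total_tokens_saved(self):
        # Read-only view of the cumulative counter.
        return self._total_saved

    def record(self, tokens_before, tokens_after):
        # Accumulate savings after a successful compression; never go negative.
        self._total_saved += max(tokens_before - tokens_after, 0)
```

With no key available, ModeSelectionSketch().mode evaluates to "local"; passing api_key or exporting HEADROOM_API_KEY switches it to "cloud".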

Fail‑Safe Behavior

The callback is designed to fail safe. If compression raises an exception, such as a missing httpx library in Cloud mode or a non‑200 API response, the callback logs a warning and returns the original request data unchanged. A LiteLLM call therefore never fails because of Headroom.
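The pattern behind this guarantee is a broad try/except around the compression step. The sketch below shows the idea under stated assumptions: compress_or_passthrough and the logger name headroom_sketch are hypothetical, and the compress argument stands in for whichever compression coroutine is active.

```python
import asyncio
import logging

logger = logging.getLogger("headroom_sketch")  # hypothetical logger name


async def compress_or_passthrough(data, compress):
    """Return data with compressed messages, or data itself on any error."""
    try:
        compressed = await compress(data["messages"])
    except Exception as exc:
        # Any failure (missing dependency, bad HTTP status, malformed payload)
        # is logged and swallowed so the LLM call can still proceed.
        logger.warning("Compression failed, sending original request: %s", exc)
        return data
    # Success: build a new request dict carrying the compressed messages.
    return {**data, "messages": compressed}
```

Catching Exception broadly is a deliberate trade‑off here: a compression outage costs extra tokens on that request instead of failing it.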

Integrating Headroom with LiteLLM

Install Dependencies

Install LiteLLM and Headroom. The httpx library is only required if you plan to use Cloud mode.

pip install litellm headroom

Local Mode Setup

For in‑process compression, instantiate HeadroomCallback without an API key and assign it to litellm.callbacks.

import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Register the callback (no API key, so compression runs in-process)
litellm.callbacks = [HeadroomCallback()]

# Use LiteLLM as usual
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
)

print(response.choices[0].message.content)

In this configuration, every completion request is compressed locally before being sent to the LLM provider.

Cloud Mode Setup

To route compression through Headroom’s managed API—enabling features like CCR, TOIN, and analytics—supply an API key.

import os
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Option 1: set the key in the environment (placeholder value shown)
os.environ["HEADROOM_API_KEY"] = "hdr_XXXXXXXXXXXXXXXX"
litellm.callbacks = [HeadroomCallback()]

# Option 2: pass the key directly instead
# litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]

cloud_resp = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the latest AI research."}],
)

print(cloud_resp.choices[0].message.content)

When Cloud mode is active, _cloud_compress sends the payload to https://api.headroomlabs.ai/v1/saas/compress, and the callback hands the compressed messages returned by the API to LiteLLM in place of the originals.

Tracking Token Savings

The callback exposes a total_tokens_saved property that accumulates savings across all successful compressions.

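import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Keep a reference to the instance so the counter can be read later
callback = HeadroomCallback()
litellm.callbacks = [callback]

# After running several completions ...
print(f"Total tokens saved: {callback.total_tokens_saved}")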

Architecture and Source Code Overview

Callback Implementation

The HeadroomCallback class is defined in headroom/integrations/litellm_callback.py. It deliberately omits heavy observability logic: async_success_handler and async_failure_handler are no‑ops because Headroom’s telemetry hooks—such as those in headroom.integrations.strands—are attached elsewhere. The class focuses solely on pre‑call payload transformation.

Core Compression Engine

Local compression relies on headroom/compress.py, which contains the core compress function used by _local_compress. This file handles the actual message rewriting and token accounting that the callback reports.

Test Coverage

The tests/test_compress_api.py file validates that the callback imports correctly, compresses payloads as expected, and passes non‑completion call types through without modification.

Summary

Drop‑in integration: Assign HeadroomCallback to litellm.callbacks to compress every LiteLLM completion automatically via async_pre_call_hook.
Two modes: Use Local mode for in‑process compression, or Cloud mode by setting HEADROOM_API_KEY or passing api_key.
Fail‑safe design: If compression raises an exception, the callback logs a warning and returns the original payload so LiteLLM calls never fail.
Observable savings: Access callback.total_tokens_saved to monitor cumulative token reductions across requests.
Key source files: headroom/integrations/litellm_callback.py orchestrates the hook, headroom/compress.py powers local compression, and tests/test_compress_api.py guards regressions.

Frequently Asked Questions

How do I register HeadroomCallback with LiteLLM?

Import HeadroomCallback from headroom.integrations.litellm_callback and append an instance to the litellm.callbacks list. LiteLLM will then invoke async_pre_call_hook before every completion or acompletion request.

What is the difference between Local and Cloud mode?

Local mode runs headroom.compress inside your process via _local_compress and requires no external API key. Cloud mode uses _cloud_compress to POST messages to https://api.headroomlabs.ai/v1/saas/compress using an httpx.AsyncClient, which enables managed analytics and advanced compression strategies.

Does HeadroomCallback break LiteLLM if compression fails?

No. According to the Headroom source code, the callback wraps all compression logic in exception handling. If any error occurs, it logs a warning and returns the original request data unchanged, ensuring the downstream LiteLLM call proceeds unaffected.

How can I view the total tokens saved by the callback?

After attaching a HeadroomCallback instance to litellm.callbacks, read the total_tokens_saved property on that instance. This counter updates after every successful compression and reflects the cumulative tokens saved across all intercepted calls.
