# How to Integrate Headroom with LiteLLM Using HeadroomCallback

> Integrate Headroom with LiteLLM using `HeadroomCallback` to compress LiteLLM completion requests automatically, with either local or cloud compression.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-05

---

**TLDR:** Assign `HeadroomCallback` to `litellm.callbacks` to automatically compress every LiteLLM `completion` request via the `async_pre_call_hook`, choosing between in‑process local compression and managed cloud compression while preserving the original payload on any failure.

If you want to reduce token usage in your LiteLLM pipelines, the `chopratejas/headroom` repository provides a drop‑in `HeadroomCallback` that integrates directly with LiteLLM’s `CustomLogger` interface. This callback intercepts outgoing requests and compresses the `messages` array before it reaches the model, supporting both in‑process local compression and managed cloud compression via Headroom’s API.

## How HeadroomCallback Intercepts LiteLLM Requests

### The async_pre_call_hook Implementation

In [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py), the `HeadroomCallback` class implements the asynchronous `async_pre_call_hook` method. This hook fires before every LiteLLM `completion` or `acompletion` call, giving the callback access to the original `messages` and `model` parameters. If the call type is not a completion, the callback forwards the request unchanged so other LiteLLM workflows remain unaffected.
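
The pass-through behavior described above can be sketched as follows. This is a minimal illustration, not the repository's code: `PassthroughSketch`, `COMPLETION_CALL_TYPES`, and the whitespace-collapsing placeholder are hypothetical, and the parameter list is assumed to mirror the `async_pre_call_hook` signature documented for LiteLLM's `CustomLogger`.

```python
import asyncio
from typing import Any, Dict, List

# Assumption: the call types treated as completions are illustrative only.
COMPLETION_CALL_TYPES = {"completion", "acompletion"}


class PassthroughSketch:
    """Hypothetical stand-in for HeadroomCallback, for illustration only."""

    async def async_pre_call_hook(
        self,
        user_api_key_dict: Any,
        cache: Any,
        data: Dict[str, Any],
        call_type: str,
    ) -> Dict[str, Any]:
        # Anything that is not a completion is forwarded untouched.
        if call_type not in COMPLETION_CALL_TYPES:
            return data
        messages = data.get("messages")
        if not messages:
            return data
        # Return a copy so the caller's original request is left intact.
        return {**data, "messages": self._placeholder_compress(messages)}

    @staticmethod
    def _placeholder_compress(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Placeholder for real compression: collapse runs of whitespace
        # in string content and leave every other message as-is.
        out = []
        for message in messages:
            content = message.get("content")
            if isinstance(content, str):
                message = {**message, "content": " ".join(content.split())}
            out.append(message)
        return out
```

Returning a new dictionary rather than mutating `data` in place is a design choice of this sketch; it keeps the original request available if a later step needs to fall back to it.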

### Local vs. Cloud Compression Modes

The callback selects its compression path based on the presence of an API key:

- **Local mode** — This is the default. The callback invokes `_local_compress`, which calls `headroom.compress.compress` in‑process to shrink the payload and return token statistics.
- **Cloud mode** — If you provide an `api_key` argument or set the `HEADROOM_API_KEY` environment variable, the callback uses `_cloud_compress`. This method lazily instantiates an `httpx.AsyncClient` and POSTs the payload to `https://api.headroomlabs.ai/v1/saas/compress`.
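
As a rough sketch of the selection rule in the list above, the helper below picks a mode from an explicit key or the `HEADROOM_API_KEY` environment variable. The function name `select_mode` is hypothetical, and giving the explicit argument precedence over the environment variable is an assumption, not something the source confirms.

```python
import os
from typing import Optional


def select_mode(api_key: Optional[str] = None) -> str:
    """Return "cloud" when a key is available, otherwise "local".

    Hypothetical helper: the real callback may resolve the key differently.
    """
    # Assumption: an explicit argument wins over the environment variable.
    key = api_key or os.environ.get("HEADROOM_API_KEY")
    return "cloud" if key else "local"
```

With no key in either place the helper falls back to local mode, which matches the default described above.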

After either path succeeds, the callback replaces the original `messages` field with the compressed version, adds the tokens saved to its internal `_total_saved` counter, and logs a concise summary. The running total is exposed through the `total_tokens_saved` property.
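
The bookkeeping in that last step might look roughly like this. Only `_total_saved` and `total_tokens_saved` come from the text above; the class name `SavingsSketch`, the `apply` method, and its `tokens_before` / `tokens_after` parameters are hypothetical, as is clamping negative savings to zero.

```python
from typing import Any, Dict, List


class SavingsSketch:
    """Hypothetical illustration of the counter behind total_tokens_saved."""

    def __init__(self) -> None:
        self._total_saved = 0

    @property
    def total_tokens_saved(self) -> int:
        # Read-only view of the cumulative savings.
        return self._total_saved

    def apply(
        self,
        data: Dict[str, Any],
        compressed_messages: List[Dict[str, Any]],
        tokens_before: int,
        tokens_after: int,
    ) -> Dict[str, Any]:
        # Assumption: never count a negative saving if compression grew the payload.
        self._total_saved += max(0, tokens_before - tokens_after)
        # Swap in the compressed messages and keep every other field unchanged.
        return {**data, "messages": compressed_messages}
```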

### Fail‑Safe Behavior

The callback is designed to fail open. If compression raises an exception, for example because `httpx` is not installed in Cloud mode or the API returns a non‑200 response, the callback logs a warning and returns the original request data unchanged. A Headroom error therefore never causes the LiteLLM call itself to fail.
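
A minimal sketch of this pattern, assuming the compression step is an awaitable that takes and returns a list of messages. The function `compress_or_passthrough` and the logger name are hypothetical and are not taken from the Headroom source.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("headroom.sketch")  # hypothetical logger name

Messages = List[Dict[str, Any]]


async def compress_or_passthrough(
    data: Dict[str, Any],
    compress: Callable[[Messages], Awaitable[Messages]],
) -> Dict[str, Any]:
    """Try to compress; on any error, log a warning and return data untouched."""
    try:
        compressed = await compress(data["messages"])
    except Exception as exc:  # broad on purpose: no failure may reach the caller
        logger.warning("Compression failed, sending original request: %s", exc)
        return data
    return {**data, "messages": compressed}
```

Catching `Exception` broadly is usually discouraged, but here it is the point of the design: the compression step is an optimization, so any failure in it should degrade to the uncompressed request rather than surface to the caller.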

## Integrating Headroom with LiteLLM

### Install Dependencies

Install LiteLLM and Headroom. The `httpx` library is only required if you plan to use Cloud mode.

```bash
pip install litellm headroom
```

### Local Mode Setup

For in‑process compression, instantiate `HeadroomCallback` without an API key and assign it to `litellm.callbacks`.

```python
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Register the callback
litellm.callbacks = [HeadroomCallback()]

# Use LiteLLM as usual
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
)

print(response.choices[0].message.content)
```

In this configuration, every completion request is compressed locally before being sent to the LLM provider.

### Cloud Mode Setup

To route compression through Headroom’s managed API—enabling features like CCR, TOIN, and analytics—supply an API key.

```python
import os
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

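# Option 1: set the key in the environment before creating the callback
# (replace the placeholder with your own key)
os.environ["HEADROOM_API_KEY"] = "hdr_XXXXXXXXXXXXXXXX"
litellm.callbacks = [HeadroomCallback()]

# Option 2: pass the key directly instead
# litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]

cloud_resp = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the latest AI research."}],
)

print(cloud_resp.choices[0].message.content)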
```

When Cloud mode is active, `_cloud_compress` sends the payload to `https://api.headroomlabs.ai/v1/saas/compress`, and the callback hands the compressed messages from the API response back to LiteLLM.

### Tracking Token Savings

The callback exposes a `total_tokens_saved` property that accumulates savings across all successful compressions.

```python
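import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Keep a reference to the instance so the counter can be read later
callback = HeadroomCallback()
litellm.callbacks = [callback]

# After running several completions ...
print(f"Total tokens saved: {callback.total_tokens_saved}")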
```

## Architecture and Source Code Overview

### Callback Implementation

The `HeadroomCallback` class is defined in [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py). It deliberately omits heavy observability logic: `async_success_handler` and `async_failure_handler` are no‑ops because Headroom’s telemetry hooks—such as those in `headroom.integrations.strands`—are attached elsewhere. The class focuses solely on pre‑call payload transformation.

### Core Compression Engine

Local compression relies on [`headroom/compress.py`](https://github.com/chopratejas/headroom/blob/main/headroom/compress.py), which contains the core `compress` function used by `_local_compress`. This file handles the actual message rewriting and token accounting that the callback reports.

### Test Coverage

The [`tests/test_compress_api.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_compress_api.py) file validates that the callback imports correctly, compresses payloads as expected, and passes non‑completion call types through without modification.

## Summary

- **Drop‑in integration:** Assign `HeadroomCallback` to `litellm.callbacks` to compress every LiteLLM completion automatically via `async_pre_call_hook`.
- **Two modes:** Use Local mode for in‑process compression, or Cloud mode by setting `HEADROOM_API_KEY` or passing `api_key`.
- **Fail‑safe design:** If compression raises an exception, the callback logs a warning and returns the original payload, so a LiteLLM call never fails because of Headroom.
- **Observable savings:** Access `callback.total_tokens_saved` to monitor cumulative token reductions across requests.
- **Key source files:** [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py) orchestrates the hook, [`headroom/compress.py`](https://github.com/chopratejas/headroom/blob/main/headroom/compress.py) powers local compression, and [`tests/test_compress_api.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_compress_api.py) guards regressions.

## Frequently Asked Questions

### How do I register HeadroomCallback with LiteLLM?

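Import `HeadroomCallback` from `headroom.integrations.litellm_callback` and add an instance to `litellm.callbacks`, either by assigning a new list as in the examples above or by appending to the existing list to keep callbacks that are already registered. LiteLLM will then invoke `async_pre_call_hook` before every `completion` or `acompletion` request.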

### What is the difference between Local and Cloud mode?

Local mode runs `headroom.compress.compress` inside your process via `_local_compress` and requires no API key. Cloud mode uses `_cloud_compress` to POST the messages to `https://api.headroomlabs.ai/v1/saas/compress` through an `httpx.AsyncClient`, which enables managed features such as CCR, TOIN, and analytics.

### Does HeadroomCallback break LiteLLM if compression fails?

No. According to the Headroom source code, the callback wraps all compression logic in exception handling. If any error occurs, it logs a warning and returns the original request data unchanged, ensuring the downstream LiteLLM call proceeds unaffected.

### How can I view the total tokens saved by the callback?

After attaching a `HeadroomCallback` instance to `litellm.callbacks`, read the `total_tokens_saved` property on that instance. This counter updates after every successful compression and reflects the cumulative tokens saved across all intercepted calls.