# How to Use the Headroom LiteLLM Callback for Automatic Compression

> Learn how to use the Headroom LiteLLM callback to automatically compress your requests. Integrate seamlessly for zero-configuration compression with every LiteLLM call.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-10

---

**Import the `HeadroomCallback` class from `headroom.integrations.litellm_callback` and append it to `litellm.callbacks` to enable zero-configuration compression on every LiteLLM request.**

The `chopratejas/headroom` repository provides a drop-in **LiteLLM callback for compression** that intercepts API calls before they reach the provider, compresses the message payload to reduce token costs, and preserves the original request semantics. Implemented in [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py), the callback integrates with LiteLLM’s `CustomLogger` interface and operates in both local (in-process) and cloud-managed modes.

## Architecture of the Headroom LiteLLM Callback

The callback architecture follows LiteLLM’s `async_pre_call_hook` pattern, ensuring compression happens transparently without modifying your application logic.

### Callback Class and Initialization

The `HeadroomCallback` class stores its configuration: `min_tokens`, `model_limit`, optional custom `hooks`, and cloud credentials (`api_key`, `api_url`). When an API key is passed explicitly to the constructor or `HEADROOM_API_KEY` is set in the environment, the callback enables cloud mode automatically (lines 51–71 of [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py)).
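The mode selection described above can be pictured with a small sketch. This is not the repository's code: `resolve_mode` and `DEFAULT_API_URL` are hypothetical names, the default URL is borrowed from the environment-variable example later in this guide, and the precedence of an explicit argument over the environment is an assumption.

```python
import os
from typing import Optional, Tuple

# Assumed default, taken from this guide's HEADROOM_API_URL example.
DEFAULT_API_URL = "https://api.headroomlabs.ai"


def resolve_mode(
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (mode, api_key, api_url), preferring explicit arguments."""
    key = api_key or os.environ.get("HEADROOM_API_KEY")
    if not key:
        # No key in the arguments or the environment: stay in-process.
        return "local", None, None
    url = api_url or os.environ.get("HEADROOM_API_URL") or DEFAULT_API_URL
    return "cloud", key, url
```

Resolving the mode once at construction time keeps the per-request hook free of environment lookups.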

### Pre-Call Hook Execution

The `async_pre_call_hook` method (lines 84–108) executes before every LiteLLM API call. It validates that the call type is `completion` or `acompletion`, extracts the `messages` payload and target `model`, then delegates to either `_cloud_compress` or `_local_compress`. Upon successful compression, it replaces `data["messages"]` with the compressed version and increments an internal `total_tokens_saved` counter.
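The control flow of the hook can be sketched without LiteLLM or Headroom installed. `PreCallSketch` and `drop_blank_messages` are hypothetical stand-ins, the compressor is a toy, and the hook signature is simplified: LiteLLM's real `async_pre_call_hook` receives additional arguments that are omitted here.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Messages = List[Dict[str, Any]]
# A compressor returns (compressed_messages, tokens_saved).
Compressor = Callable[[Messages, str], Awaitable[Tuple[Messages, int]]]


class PreCallSketch:
    """Simplified stand-in for the hook flow; not the real HeadroomCallback."""

    def __init__(self, compressor: Compressor) -> None:
        self._compressor = compressor
        self.total_tokens_saved = 0

    async def async_pre_call_hook(
        self, data: Dict[str, Any], call_type: str
    ) -> Dict[str, Any]:
        # Only completion calls are touched; everything else passes through.
        if call_type not in ("completion", "acompletion"):
            return data
        messages = data.get("messages")
        if not messages:
            return data
        compressed, saved = await self._compressor(messages, data.get("model", ""))
        data["messages"] = compressed
        self.total_tokens_saved += saved
        return data


async def drop_blank_messages(messages: Messages, model: str) -> Tuple[Messages, int]:
    """Toy compressor: drops empty messages and counts each as one token saved."""
    kept = [m for m in messages if m.get("content")]
    return kept, len(messages) - len(kept)
```

Calling `asyncio.run(PreCallSketch(drop_blank_messages).async_pre_call_hook(data, "completion"))` returns the same `data` dictionary with its `messages` entry replaced, which mirrors the in-place substitution described above.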

### Local Compression Path

In local mode, the callback invokes `headroom.compress.compress` (referenced at lines 24–41) to process messages in-process. The compressor receives the original messages, model name, and configured `model_limit`, returning a `CompressionResult` containing the compressed payload and token statistics.

### Cloud Compression Path

When an API key is present, the callback lazily initializes an `httpx.AsyncClient` and POSTs the payload to Headroom Cloud’s `/v1/saas/compress` endpoint (lines 42–74). The cloud response must include `messages`, `tokens_before`, and related fields. Errors are logged but do not abort the request, maintaining fail-open behavior.
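Fail-open behavior is easiest to see in isolation. The sketch below assumes a hypothetical `send` coroutine in place of the real `httpx` call, and the timeout value and logger name are illustrative; it shows only the pattern, not the repository's implementation.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List

logger = logging.getLogger("headroom.sketch")

Messages = List[Dict[str, Any]]
# Hypothetical transport: sends messages to the service and returns its JSON body.
Transport = Callable[[Messages, str], Awaitable[Dict[str, Any]]]


async def compress_fail_open(
    messages: Messages,
    model: str,
    send: Transport,
    timeout: float = 5.0,
) -> Messages:
    """Return compressed messages, or the originals if anything goes wrong."""
    try:
        body = await asyncio.wait_for(send(messages, model), timeout=timeout)
        compressed = body["messages"]  # a missing field counts as a failure
        if not isinstance(compressed, list):
            raise TypeError("'messages' must be a list")
        return compressed
    except Exception as exc:
        # Log and fall back: the LLM request proceeds with the original payload.
        logger.warning("compression failed, sending original messages: %s", exc)
        return messages
```

Catching a broad `Exception` is deliberate in this pattern: a compression outage should never turn into a failed LLM request.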

## Setting Up the LiteLLM Callback

### Local-Only Compression

For development or air-gapped environments, use the local compressor without external dependencies:

```python
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Register the callback. Assignment replaces any callbacks already registered;
# use litellm.callbacks.append(HeadroomCallback()) to keep existing ones.
litellm.callbacks = [HeadroomCallback()]

# All subsequent LiteLLM calls are automatically compressed
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
)
print(response.choices[0].message.content)
```

### Cloud-Managed Compression

For production deployments requiring centralized analytics, provide a Headroom Cloud API key:

```python
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Option A: Pass the key explicitly
litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]

# Option B: Use environment variables (recommended)
#   export HEADROOM_API_KEY=hdr_XXXXXXXXXXXXXXXX
#   export HEADROOM_API_URL=https://api.headroomlabs.ai  # optional
litellm.callbacks = [HeadroomCallback()]  # Reads HEADROOM_API_KEY automatically

response = litellm.completion(
    model="anthropic/claude-3-sonnet-20240229",  # LiteLLM's provider/model form
    messages=[{"role": "user", "content": "Summarize the last 10 years of AI research."}],
)
```

### YAML-Based Configuration

When using the `litellm` CLI or proxy server, configure the callback via YAML:

```yaml
litellm_settings:
  callbacks:
    - headroom.integrations.litellm_callback.HeadroomCallback

environment_variables:
  HEADROOM_API_KEY: "hdr_XXXXXXXXXXXXXXXX"
```

## Configuration Options and Observability

The `HeadroomCallback` constructor accepts several key parameters:

- **`min_tokens`**: Threshold below which compression is skipped (default varies by model)
- **`model_limit`**: Maximum context window for the target model (auto-detected for known models)
- **`api_key`** / **`api_url`**: Cloud service credentials
- **`hooks`**: Optional custom preprocessing functions

Monitor compression effectiveness via the `total_tokens_saved` property, which accumulates token savings across all invocations (lines 75–78).
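One way to make the counter actionable is to convert it into an approximate cost figure. The helper below is a hypothetical sketch: `estimate_savings_usd` is not part of Headroom, the per-token price is a placeholder you would replace with your provider's input rate, and a stand-in object is used so the example runs without Headroom installed.

```python
from types import SimpleNamespace


def estimate_savings_usd(callback, usd_per_million_input_tokens: float) -> float:
    """Convert the cumulative token counter into an approximate dollar figure."""
    tokens = getattr(callback, "total_tokens_saved", 0)
    return tokens / 1_000_000 * usd_per_million_input_tokens


# Stand-in for a registered HeadroomCallback instance.
fake_callback = SimpleNamespace(total_tokens_saved=250_000)
print(f"{estimate_savings_usd(fake_callback, 2.0):.2f}")  # prints 0.50
```

In a real application you would keep a reference to the `HeadroomCallback` instance you registered and pass that object in place of `fake_callback`.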

## Summary

- **Zero-configuration integration**: Append `HeadroomCallback()` to `litellm.callbacks` to enable automatic compression on all completion requests.
- **Dual-mode operation**: Runs locally using `headroom.compress.compress` or remotely via Headroom Cloud’s `/v1/saas/compress` endpoint when `HEADROOM_API_KEY` is set.
- **Fail-safe design**: Compression errors are logged but never block the original API request, ensuring reliability.
- **Observable metrics**: Access cumulative token savings through the `total_tokens_saved` property.
- **Flexible configuration**: Supports Python instantiation, environment variables, and YAML-based proxy configuration.

## Frequently Asked Questions

### How does the callback handle unsupported model names?

The `HeadroomCallback` attempts to resolve model limits through the internal [`headroom/providers/litellm.py`](https://github.com/chopratejas/headroom/blob/main/headroom/providers/litellm.py) logic. If the model is unknown, it falls back to a conservative default or the explicit `model_limit` parameter you provide. Compression still occurs, but the token savings calculation uses estimated context window sizes.

### Can I use the callback with LiteLLM’s async methods?

Yes. The `async_pre_call_hook` implementation in [`headroom/integrations/litellm_callback.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/litellm_callback.py) handles both synchronous `completion` and asynchronous `acompletion` calls. The callback uses `httpx.AsyncClient` for cloud requests, ensuring non-blocking I/O during the compression phase.

### Does compression add latency to my requests?

Local compression adds minimal overhead (typically milliseconds) as it runs in-process using the core `headroom.compress` module. Cloud compression introduces network latency equivalent to one HTTP POST to the Headroom API endpoint, though this is often offset by reduced token processing time on the LLM provider side.

### What happens if the Headroom Cloud service is unavailable?

The callback implements fail-open behavior. If the cloud endpoint returns an error or times out, the original `messages` payload is preserved and the request continues to the LLM provider. Errors are logged via standard Python logging, and the `total_tokens_saved` counter remains unchanged for that request.