How to Integrate Headroom with LiteLLM Using HeadroomCallback
TLDR: Assign HeadroomCallback to litellm.callbacks to automatically compress every LiteLLM completion request via the async_pre_call_hook, choosing between in‑process local compression and managed cloud compression while preserving the original payload on any failure.
If you want to reduce token usage in your LiteLLM pipelines, the chopratejas/headroom repository provides a drop‑in HeadroomCallback that integrates directly with LiteLLM’s CustomLogger interface. This callback intercepts outgoing requests and compresses the messages array before it reaches the model, supporting both in‑process local compression and managed cloud compression via Headroom’s API.
How HeadroomCallback Intercepts LiteLLM Requests
The async_pre_call_hook Implementation
In headroom/integrations/litellm_callback.py, the HeadroomCallback class implements the asynchronous async_pre_call_hook method. This hook fires before every LiteLLM completion or acompletion call, giving the callback access to the original messages and model parameters. If the call type is not a completion, the callback forwards the request unchanged so other LiteLLM workflows remain unaffected.
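The pass-through behavior can be sketched without LiteLLM installed. The class below is a hypothetical stand-in, not Headroom's code: its method signature mirrors LiteLLM's CustomLogger.async_pre_call_hook, while the set of call types treated as completions and the fake_compress helper are assumptions made for illustration.

```python
import asyncio

# Assumption: which call types count as "completion" requests.
COMPLETION_CALL_TYPES = {"completion", "acompletion"}


def fake_compress(messages):
    """Hypothetical stand-in for a compressor: collapses whitespace in string content."""
    return [
        {**m, "content": " ".join(m["content"].split())}
        if isinstance(m.get("content"), str) else m
        for m in messages
    ]


class PassThroughHook:
    """Stand-in exposing the same hook signature as LiteLLM's CustomLogger."""

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Non-completion calls (embeddings, moderation, ...) are returned untouched.
        if call_type not in COMPLETION_CALL_TYPES or "messages" not in data:
            return data
        return {**data, "messages": fake_compress(data["messages"])}


async def demo():
    hook = PassThroughHook()
    chat = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi   there"}]}
    embed = {"model": "text-embedding-3-small", "input": ["hi   there"]}
    new_chat = await hook.async_pre_call_hook(None, None, chat, "completion")
    new_embed = await hook.async_pre_call_hook(None, None, embed, "aembedding")
    return new_chat, new_embed


new_chat, new_embed = asyncio.run(demo())
```

Only the chat request has its messages rewritten; the embedding request comes back exactly as it went in.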
Local vs. Cloud Compression Modes
The callback selects its compression path based on the presence of an API key:
- Local mode: This is the default. The callback invokes _local_compress, which calls headroom.compress.compress in‑process to shrink the payload and return token statistics.
- Cloud mode: If you provide an api_key argument or set the HEADROOM_API_KEY environment variable, the callback uses _cloud_compress. This method lazily instantiates an httpx.AsyncClient and POSTs the payload to https://api.headroomlabs.ai/v1/saas/compress.
After either path succeeds, the callback replaces the original messages field with the compressed version, adds the tokens saved to its internal _total_saved counter (exposed through the total_tokens_saved property), and logs a concise summary.
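The selection rule can be approximated in a few lines. This is a sketch under the assumption that an explicit api_key argument takes precedence over HEADROOM_API_KEY; the precedence is not stated above, and select_mode is a hypothetical helper rather than part of Headroom.

```python
import os


def select_mode(api_key=None, env=None):
    """Return ("cloud", key) when a key is available, else ("local", None)."""
    env = os.environ if env is None else env
    # Assumption: an explicit argument wins over the environment variable.
    key = api_key or env.get("HEADROOM_API_KEY")
    return ("cloud", key) if key else ("local", None)


no_key = select_mode(env={})
from_env = select_mode(env={"HEADROOM_API_KEY": "hdr_example"})
from_arg = select_mode(api_key="hdr_arg", env={"HEADROOM_API_KEY": "hdr_example"})
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.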
Fail‑Safe Behavior
The callback is designed to fail open. If compression raises an exception, such as a missing httpx library in Cloud mode or a non‑200 API response, the callback logs a warning and returns the original request data unchanged. This ensures that a LiteLLM call never fails because of Headroom.
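A wrapper with this behavior might look like the following. It is a minimal sketch, not Headroom's implementation: safe_compress, broken_compressor, and working_compressor are hypothetical names, and the logger name is an assumption.

```python
import asyncio
import logging

logger = logging.getLogger("headroom_sketch")  # hypothetical logger name


async def safe_compress(data, compressor):
    """Apply an async compressor; on any error, return the request untouched."""
    try:
        compressed = await compressor(data["messages"])
    except Exception as exc:  # deliberately broad: compression must never break a call
        logger.warning("compression failed, sending original request: %s", exc)
        return data
    return {**data, "messages": compressed}


async def broken_compressor(messages):
    # Simulates a failure such as a non-200 response from a compression API.
    raise RuntimeError("HTTP 503")


async def working_compressor(messages):
    return messages[-1:]  # toy strategy: keep only the latest message


request = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "first"},
        {"role": "user", "content": "second"},
    ],
}
after_failure = asyncio.run(safe_compress(request, broken_compressor))
after_success = asyncio.run(safe_compress(request, working_compressor))
```

On failure the very same request object is returned, and on success a new dictionary is built, so the caller's original request is never mutated.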
Integrating Headroom with LiteLLM
Install Dependencies
Install LiteLLM and Headroom. The httpx library is only required if you plan to use Cloud mode.
pip install litellm headroom
Local Mode Setup
For in‑process compression, instantiate HeadroomCallback without an API key and assign it to litellm.callbacks.
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
# Register the callback
litellm.callbacks = [HeadroomCallback()]
# Use LiteLLM as usual
response = litellm.completion(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain quantum entanglement in simple terms."}
    ],
)
print(response.choices[0].message["content"])
In this configuration, every completion request is compressed locally before being sent to the LLM provider.
Cloud Mode Setup
To route compression through Headroom’s managed API—enabling features like CCR, TOIN, and analytics—supply an API key.
import os
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback
os.environ["HEADROOM_API_KEY"] = "hdr_XXXXXXXXXXXXXXXX"
# Or pass the key directly
litellm.callbacks = [HeadroomCallback(api_key="hdr_XXXXXXXXXXXXXXXX")]
cloud_resp = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the latest AI research."}],
)
print(cloud_resp.choices[0].message["content"])
When Cloud mode is active, _cloud_compress sends the payload to https://api.headroomlabs.ai/v1/saas/compress, and the callback hands the compressed messages it receives to LiteLLM in place of the originals.
Tracking Token Savings
The callback exposes a total_tokens_saved property that accumulates savings across all successful compressions.
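import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

# Keep a reference to the instance so the counter can be read later
callback = HeadroomCallback()
litellm.callbacks = [callback]
# After running several completions ...
print(f"Total tokens saved: {callback.total_tokens_saved}")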
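How such a counter could accumulate is sketched below. The SavingsTracker class and its record method are hypothetical; only the total_tokens_saved and _total_saved names come from the description of the callback.

```python
class SavingsTracker:
    """Toy model of a read-only running total of saved tokens."""

    def __init__(self):
        self._total_saved = 0

    def record(self, tokens_before, tokens_after):
        # Count only real reductions so a no-op compression never lowers the total.
        self._total_saved += max(tokens_before - tokens_after, 0)

    @property
    def total_tokens_saved(self):
        return self._total_saved


tracker = SavingsTracker()
tracker.record(tokens_before=1200, tokens_after=800)
tracker.record(tokens_before=300, tokens_after=300)
print(f"Total tokens saved: {tracker.total_tokens_saved}")  # prints "Total tokens saved: 400"
```

Exposing the total as a property without a setter keeps callers from resetting or overwriting it by accident.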
Architecture and Source Code Overview
Callback Implementation
The HeadroomCallback class is defined in headroom/integrations/litellm_callback.py. It deliberately omits heavy observability logic: async_success_handler and async_failure_handler are no‑ops because Headroom’s telemetry hooks—such as those in headroom.integrations.strands—are attached elsewhere. The class focuses solely on pre‑call payload transformation.
Core Compression Engine
Local compression relies on headroom/compress.py, which contains the core compress function used by _local_compress. This file handles the actual message rewriting and token accounting that the callback reports.
Test Coverage
The tests/test_compress_api.py file validates that the callback imports correctly, compresses payloads as expected, and passes non‑completion call types through without modification.
Summary
- Drop‑in integration: Assign HeadroomCallback to litellm.callbacks to compress every LiteLLM completion automatically via async_pre_call_hook.
- Two modes: Use Local mode for in‑process compression, or Cloud mode by setting HEADROOM_API_KEY or passing api_key.
- Fail‑safe design: If compression raises an exception, the callback logs a warning and returns the original payload so LiteLLM calls never fail.
- Observable savings: Access callback.total_tokens_saved to monitor cumulative token reductions across requests.
- Key source files: headroom/integrations/litellm_callback.py orchestrates the hook, headroom/compress.py powers local compression, and tests/test_compress_api.py guards regressions.
Frequently Asked Questions
How do I register HeadroomCallback with LiteLLM?
Import HeadroomCallback from headroom.integrations.litellm_callback and append an instance to the litellm.callbacks list. LiteLLM will then invoke async_pre_call_hook before every completion or acompletion request.
What is the difference between Local and Cloud mode?
Local mode runs headroom.compress inside your process via _local_compress and requires no external API key. Cloud mode uses _cloud_compress to POST messages to https://api.headroomlabs.ai/v1/saas/compress using an httpx.AsyncClient, which enables managed analytics and advanced compression strategies.
Does HeadroomCallback break LiteLLM if compression fails?
No. According to the Headroom source code, the callback wraps all compression logic in exception handling. If any error occurs, it logs a warning and returns the original request data unchanged, ensuring the downstream LiteLLM call proceeds unaffected.
How can I view the total tokens saved by the callback?
After attaching a HeadroomCallback instance to litellm.callbacks, read the total_tokens_saved property on that instance. This counter updates after every successful compression and reflects the cumulative tokens saved across all intercepted calls.