# How to Integrate Headroom with LangChain for Context Compression

> Easily integrate Headroom with LangChain for efficient context compression. Wrap your chat models to automatically trim prompts, save costs, and maintain essential information within context limits.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-08

---

**Wrap any LangChain `BaseChatModel` in `HeadroomChatModel` to automatically compress, cache, and trim prompts before they reach the LLM, preserving essential information while staying inside the model's context window.**

Headroom is a plug-in-style SDK that optimizes LLM prompts through automatic context compression. When combined with LangChain, the `HeadroomChatModel` wrapper intercepts outbound messages, applies a transformation pipeline, and forwards the optimized payload to your underlying chat model. According to the `chopratejas/headroom` source code, this integration requires no changes to existing chain logic and supports sync, async, streaming, and tool-bound workflows.

## Why Wrap Instead of Using a Callback?

LangChain callbacks cannot mutate the message list due to a framework design limitation. Because of this constraint, Headroom implements a wrapper rather than a callback, ensuring that **all** outbound messages are processed regardless of whether the user calls `invoke`, `batch`, `stream`, or `ainvoke`.

## How HeadroomChatModel Works

The core integration lives in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py). The wrapper fully implements the LangChain `BaseChatModel` API and runs every request through a multi-stage optimization pass.

### Message Conversion

LangChain messages—including `SystemMessage`, `HumanMessage`, `AIMessage`, and `ToolMessage`—are translated to an OpenAI-compatible JSON format and back. This bidirectional mapping is handled by `_convert_messages_to_openai` and `_convert_messages_from_openai` inside [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py) (lines 47–86).

### Pipeline Creation and Auto-Detection

On first use, the wrapper lazily builds a `TransformPipeline` (`headroom.transforms.pipeline.TransformPipeline`) that holds all compression transforms such as `SmartCrusher` and `CacheAligner`. The provider—OpenAI, Anthropic, or another—is auto-detected from the wrapped model via `get_headroom_provider` in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py) (lines 24–33).

### Optimization Pass

Before each LLM call, messages are passed through `self.pipeline.apply(...)`. The pipeline returns a `Result` object containing the optimized message list and token statistics (`tokens_before`, `tokens_after`, `transforms_applied`). These values are stored in an `OptimizationMetrics` dataclass and aggregated on the model instance through `_metrics_history` and `_total_tokens_saved` (lines 120–138 and 48–75).

### Delegation and Observability

After optimization, the reduced message set is handed to the original LangChain model’s `_generate`, `_stream`, `_agenerate`, or `_astream` methods. You can inspect savings via `HeadroomChatModel.get_savings_summary()`, and the companion `HeadroomCallbackHandler` exposes token-saving statistics for cost-control dashboards (lines 104–112).

## Basic Integration Examples

### Minimal Sync Wrapper

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

base = ChatOpenAI(model="gpt-4o-mini")
optimized = HeadroomChatModel(base)  # ← adds compression

resp = optimized.invoke([HumanMessage("Summarize the latest news.")])
print(resp.content)

```

### Async Usage

```python
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

async def main():
    base = ChatOpenAI(model="gpt-4o")
    optimized = HeadroomChatModel(base)
    result = await optimized.ainvoke([HumanMessage("What is the meaning of life?")])
    print(result.content)

asyncio.run(main())

```

### Streaming with Token Savings

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

base = ChatOpenAI(model="gpt-4o", streaming=True)
optimized = HeadroomChatModel(base)

for chunk in optimized.stream([HumanMessage("Write a poem about AI.")]):
    print(chunk.content, end="", flush=True)

print("\nTokens saved:", optimized.total_tokens_saved)

```

## Advanced Patterns

### Custom Configuration

Supply a `HeadroomConfig` to tweak thresholds, enable rolling windows, or adjust compression behavior:

```python
from headroom import HeadroomConfig, HeadroomMode
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

config = HeadroomConfig(
    smart_crusher_threshold=400,   # compress tool output > 400 tokens

    smart_crusher_max_items=15,    # keep only the top 15 items

    cache_alignment=True,
    rolling_window=True,
)

base = ChatOpenAI(model="gpt-4o")
headroom_llm = HeadroomChatModel(base, config=config, mode=HeadroomMode.OPTIMIZE)

```

### Tool Binding

Because `HeadroomChatModel` implements the full `BaseChatModel` interface, tool binding works exactly like a normal LangChain model:

```python
tools = [search_tool, docs_tool]  # LangChain @tool functions

headroom_llm = headroom_llm.bind_tools(tools)  # type: ignore[arg-type]

```

### LCEL Composition

For LangChain Expression Language (LCEL) chains, insert the `HeadroomRunnable` between your prompt and LLM. The runnable is defined in [`headroom/integrations/langchain/runnable.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/runnable.py):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from headroom.integrations import HeadroomRunnable

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("user", "{input}"),
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | HeadroomRunnable() | llm

print(chain.invoke({"input": "List the steps to brew coffee."}))

```

## Inspecting Token Savings

After running inference, retrieve aggregated statistics with `get_savings_summary()`:

```python
summary = optimized.get_savings_summary()
print("Total tokens saved:", summary["total_tokens_saved"])
print("Average savings %:", summary["average_savings_percent"])

```

The underlying data is tracked in `_metrics_history` and `_total_tokens_saved` on the `HeadroomChatModel` instance (lines 120–138).

## Real-World Demo Reference

The repository ships an end-to-end demo at [`examples/langchain_demo/run_comparison.py`](https://github.com/chopratejas/headroom/blob/main/examples/langchain_demo/run_comparison.py) (lines 41–68). The script creates a `HeadroomChatModel`, runs realistic support-scenario queries with and without compression, and prints token-saving statistics. It is the best reference for wiring together multi-tool agents and observing before/after behavior.

## Summary

- **Wrap, don't callback:** Because LangChain callbacks cannot mutate messages, Headroom uses `HeadroomChatModel` in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py) to intercept and optimize every request.
- **Auto-detection:** The provider is auto-detected via `get_headroom_provider` in [`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py), and messages are converted using `_convert_messages_to_openai` and `_convert_messages_from_openai`.
- **Pipeline architecture:** A lazy `TransformPipeline` applies compression transforms such as `SmartCrusher` and `CacheAligner` before the call reaches `_generate`, `_stream`, `_agenerate`, or `_astream`.
- **Full API coverage:** The wrapper supports sync, async, streaming, tool binding, and LCEL composition through `HeadroomRunnable`.
- **Observability built-in:** Token savings are aggregated in `OptimizationMetrics` and exposed via `get_savings_summary()` and `HeadroomCallbackHandler`.

## Frequently Asked Questions

### Does Headroom work with any LangChain chat model?

Yes. `HeadroomChatModel` wraps any class that inherits from LangChain's `BaseChatModel`, including models from `langchain-openai`, `langchain-anthropic`, and `langchain-google-genai`. The provider is auto-detected in [`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py) so the correct compression profile is selected automatically.

### Can I use Headroom with async and streaming methods?

Yes. The wrapper delegates to the underlying model's `_generate`, `_stream`, `_agenerate`, and `_astream` methods after optimization. You can call `invoke`, `stream`, `ainvoke`, or `astream` on a `HeadroomChatModel` instance exactly as you would on the base model.

### Why can't I just use a LangChain callback for context compression?

LangChain callbacks are designed for observation and logging, not mutation. Due to a framework limitation, callbacks cannot modify the message list before it is sent to the LLM. Headroom solves this by wrapping the model itself, guaranteeing that all outbound traffic passes through the compression pipeline.

### Where are the compression transforms implemented?

Individual transforms such as `SmartCrusher` and `CacheAligner` live in `headroom/transforms/*.py`. The wrapper assembles them into a `TransformPipeline` (`headroom.transforms.pipeline.TransformPipeline`) that is lazily instantiated when the first request is made.