# How to Integrate Headroom with LangChain for Chat Model Compression

> Integrate Headroom with LangChain for chat model compression. Wrap your LangChain chat model with HeadroomChatModel for optimized responses and token savings.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-05

---

**You can integrate Headroom with LangChain for chat model compression by wrapping any LangChain `BaseChatModel` with `HeadroomChatModel`, which intercepts every LLM call, applies the `TransformPipeline`, and returns optimized responses with built-in token-saving metrics.**

Headroom is an open-source compression SDK that ships a first-class LangChain integration under `chopratejas/headroom`. By subclassing LangChain's `BaseChatModel` in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py), the library lets you drop compression into existing chains, agents, and retrievers without rewriting your application logic.

## How Headroom Intercepts LangChain Chat Models

### The HeadroomChatModel Wrapper

Inside [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py), the `HeadroomChatModel` class subclasses LangChain's `BaseChatModel`. On every invocation, it performs three steps:

1. Converts LangChain message objects into the OpenAI-style format required by Headroom.
2. Applies the core `TransformPipeline` to compress, cache-bust, and filter for relevance.
3. Converts the optimized messages back to LangChain format and forwards them to the wrapped LLM.

The heavy lifting stays in the core `headroom` SDK; the integration layer only handles message translation and metric bookkeeping.
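The three steps above can be sketched with the standard library alone. This is a minimal illustration, not Headroom's actual code: `Message`, `collapse_whitespace`, and `optimize` are hypothetical stand-ins, and the toy transform only collapses whitespace where the real `TransformPipeline` does far more.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for a LangChain message (hypothetical, not the real class)."""
    role: str      # "system", "human", or "ai"
    content: str


# LangChain-style role names mapped to OpenAI-style role names.
ROLE_TO_OPENAI = {"system": "system", "human": "user", "ai": "assistant"}
ROLE_FROM_OPENAI = {v: k for k, v in ROLE_TO_OPENAI.items()}


def to_openai(messages):
    """Step 1: convert message objects into OpenAI-style dicts."""
    return [{"role": ROLE_TO_OPENAI[m.role], "content": m.content} for m in messages]


def collapse_whitespace(messages):
    """Step 2 (toy transform): collapse runs of whitespace in each message."""
    return [{**m, "content": " ".join(m["content"].split())} for m in messages]


def from_openai(messages):
    """Step 3: convert the optimized dicts back into message objects."""
    return [Message(ROLE_FROM_OPENAI[m["role"]], m["content"]) for m in messages]


def optimize(messages, transforms=(collapse_whitespace,)):
    """Run the full round trip: convert, transform, convert back."""
    payload = to_openai(messages)
    for transform in transforms:
        payload = transform(payload)
    return from_openai(payload)


original = [
    Message("system", "You are   helpful."),
    Message("human", "Explain\n\n quantum   tunnelling."),
]
optimized = optimize(original)
print(optimized[1].content)
```

Keeping the conversion functions separate from the transforms mirrors the split described above: the integration layer translates formats, while the pipeline decides what to change.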

### Provider Auto-Detection

When you instantiate `HeadroomChatModel` with `auto_detect_provider=True` (the default), the wrapper inspects the wrapped model class—such as `ChatOpenAI` or `ChatAnthropic`—and selects the matching Headroom provider (e.g., `OpenAIProvider`, `AnthropicProvider`). According to the source code in [`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py), this ensures accurate token counting and enables provider-specific caching optimizations.
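One plausible way to implement this kind of detection is a lookup on the wrapped model's class name. The sketch below is an assumption about the general approach, not a copy of `providers.py`: the mapping, the `detect_provider` helper, and the fallback default are all hypothetical, and the classes are empty stand-ins for the real LangChain models.

```python
# Hypothetical mapping from wrapped-model class names to provider names.
PROVIDER_BY_CLASS = {
    "ChatOpenAI": "OpenAIProvider",
    "ChatAnthropic": "AnthropicProvider",
}


def detect_provider(model, default="OpenAIProvider"):
    """Pick a provider name by walking the model's class hierarchy."""
    for cls in type(model).__mro__:
        if cls.__name__ in PROVIDER_BY_CLASS:
            return PROVIDER_BY_CLASS[cls.__name__]
    return default  # the fallback choice is an assumption of this sketch


class ChatAnthropic:  # stand-in class, not the real LangChain model
    pass


class MyTunedAnthropic(ChatAnthropic):  # subclasses resolve through the MRO
    pass


print(detect_provider(MyTunedAnthropic()))
```

Walking the method resolution order means custom subclasses of a known model still resolve to the right provider.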

### Optimization Metrics

Each pass through the pipeline produces an `OptimizationMetrics` record that tracks tokens before and after, savings percentage, and applied transforms. The wrapper aggregates these into `total_tokens_saved` and exposes `get_savings_summary()` for quick reporting.
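The bookkeeping described above amounts to simple arithmetic over per-call records. The following sketch shows that arithmetic under assumed names: `CallMetrics`, its fields, and `summarize` are hypothetical and may not match the real `OptimizationMetrics` attributes or the output of `get_savings_summary()`.

```python
from dataclasses import dataclass, field


@dataclass
class CallMetrics:
    """One optimization pass (field names are assumptions of this sketch)."""
    tokens_before: int
    tokens_after: int
    transforms_applied: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        # Guard against division by zero for empty prompts.
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before


def summarize(history):
    """Aggregate per-call records into a single savings report."""
    before = sum(m.tokens_before for m in history)
    saved = sum(m.tokens_saved for m in history)
    return {
        "calls": len(history),
        "total_tokens_saved": saved,
        "overall_savings_percent": 100.0 * saved / before if before else 0.0,
    }


history = [CallMetrics(1000, 600, ["compress"]), CallMetrics(500, 450)]
print(summarize(history))
```

Here the overall percentage is weighted by prompt size (total saved over total input), so one large call counts for more than several small ones.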

### Async and Streaming Support

The wrapper implements `_stream`, `_agenerate`, and `_astream`, so both synchronous and asynchronous LangChain APIs work out of the box. You can call `invoke`, `stream`, `ainvoke`, and `astream` on a `HeadroomChatModel` exactly like the underlying model.

## Install the LangChain Extra

Install Headroom with the LangChain extra to pull in the dependencies the integration requires:

```bash
pip install "headroom-ai[langchain]"
```

## Wrap Any LangChain Chat Model

The following example shows the basic usage of `HeadroomChatModel` to compress chat interactions:

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

# Wrap the base model; the wrapper exposes the same chat-model interface
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)

# View token-saving stats
print(llm.get_savings_summary())
```

## Advanced LangChain Integration Examples

### Compress Memory with HeadroomChatMessageHistory

For long conversations, `HeadroomChatMessageHistory` in [`headroom/integrations/langchain/memory.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/memory.py) wraps any `ChatMessageHistory` to compress old turns automatically:

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatMessageHistory, HeadroomChatModel

# Wrap a plain history so older turns are compressed past the token threshold
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,
    keep_recent_turns=5,
)

memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)
```
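To make the two parameters concrete, here is a standard-library sketch of a threshold-plus-recency policy. It is an assumption about the general idea, not Headroom's algorithm: `estimate_tokens` uses a rough four-characters-per-token guess, `summary_chars` is a hypothetical parameter, and truncation stands in for real compression.

```python
def estimate_tokens(text):
    """Crude token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def compress_history(turns, compress_threshold_tokens=4000, keep_recent_turns=5,
                     summary_chars=40):
    """Shorten older turns once the history exceeds the token threshold."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= compress_threshold_tokens:
        return list(turns)  # under budget: leave everything untouched

    # Everything before `split` is "old"; the rest is kept verbatim.
    split = max(0, len(turns) - keep_recent_turns)
    old, recent = turns[:split], turns[split:]
    # Toy "compression": keep only the head of each old turn.
    shortened = [t if len(t) <= summary_chars else t[:summary_chars] + "..." for t in old]
    return shortened + recent


turns = [f"turn {i}: " + "details " * 50 for i in range(10)]
compacted = compress_history(turns, compress_threshold_tokens=500, keep_recent_turns=2)
print(len(compacted[0]), len(compacted[-1]))
```

The most recent turns pass through unchanged, which is the property `keep_recent_turns` is meant to guarantee.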

### Compress Retrieved Documents

Use `HeadroomDocumentCompressor` from [`headroom/integrations/langchain/retriever.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/retriever.py) as a `BaseDocumentCompressor` inside a `ContextualCompressionRetriever`:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

# `docs` (a list of Documents) and `embeddings` (an Embeddings instance)
# are assumed to be defined elsewhere in your application.
vectorstore = FAISS.from_documents(docs, embeddings)

# Over-fetch candidates, then let the compressor narrow them down
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
```
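The effect of `max_documents` and `min_relevance` can be illustrated with a toy filter. This sketch is an assumption about the general shape of such a compressor: the word-overlap `relevance` score is hypothetical, plain strings stand in for `Document` objects, and `prefer_diverse` is left out because its behavior is not described here.

```python
def relevance(query, text):
    """Toy relevance score: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def compress_documents(query, documents, max_documents=10, min_relevance=0.3):
    """Drop low-scoring documents, then keep the best `max_documents`."""
    scored = [(relevance(query, doc), doc) for doc in documents]
    kept = [pair for pair in scored if pair[0] >= min_relevance]
    # Stable sort: documents with equal scores keep their input order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept[:max_documents]]


sample_docs = [
    "python packaging tools overview",
    "gardening tips for spring",
    "new python libraries released this year",
]
print(compress_documents("new python libraries", sample_docs, max_documents=2))
```

Over-fetching with a large `k` and filtering afterwards trades a slightly more expensive vector search for a much smaller prompt.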

### Wrap Agent Tools to Compress Large Outputs

The `wrap_tools_with_headroom` helper in [`headroom/integrations/langchain/agents.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/agents.py) decorates LangChain tools so large tool outputs are compressed before the next LLM turn:

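```python
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom

@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...

# Compress any tool output longer than 1,000 characters
tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)

# `prompt` is assumed to be a chat prompt with an `agent_scratchpad`
# placeholder, defined elsewhere in your application.
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Find recent Python libraries"})
```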
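Conceptually, wrapping a tool this way resembles a decorator that inspects the return value and shortens it when it is too large. The sketch below illustrates that pattern with the standard library only. `compress_large_output` and `keep_chars` are hypothetical names, and head-and-tail truncation is a stand-in for whatever compression Headroom actually applies.

```python
import functools


def compress_large_output(min_chars_to_compress=1000, keep_chars=200):
    """Decorator factory: shorten string results above a size threshold."""
    def decorator(func):
        @functools.wraps(func)  # preserve the tool's name and docstring
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not isinstance(result, str) or len(result) < min_chars_to_compress:
                return result  # small or non-text outputs pass through
            omitted = len(result) - 2 * keep_chars
            # Toy "compression": keep the head and tail, note what was dropped.
            return (
                f"{result[:keep_chars]}\n"
                f"[... {omitted} chars omitted ...]\n"
                f"{result[-keep_chars:]}"
            )
        return wrapper
    return decorator


@compress_large_output(min_chars_to_compress=1000)
def fake_search(query: str) -> str:
    """Return a large payload (fake data for this sketch)."""
    return "result " * 500


print(len(fake_search("python libraries")))
```

Preserving the function's name and docstring matters for agent tools, because the model typically selects tools based on that metadata.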

### Async Streaming

Because `HeadroomChatModel` implements `_agenerate` and `_astream`, async patterns work the same way as they do on the wrapped model:

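```python
import asyncio

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

    # Single async call
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)

    # Async streaming, printing chunks as they arrive
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(async_chat())
```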

### Track Streaming Metrics in Real Time

`StreamingMetricsTracker` in [`headroom/integrations/langchain/streaming.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/streaming.py) records token usage while streaming:

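```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel, StreamingMetricsTracker

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Feed each streamed chunk to the tracker as it arrives
tracker = StreamingMetricsTracker(model="gpt-4o")
for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)
    print(chunk.content, end="")

metrics = tracker.finish()
print(f"Output tokens: {metrics.output_tokens}, duration: {metrics.duration_ms:.0f} ms")
```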
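The add-then-finish pattern used by the tracker can be sketched in a few lines. This is a simplified, hypothetical stand-in (`SimpleStreamTracker` is not part of Headroom): it counts chunks and characters rather than tokens, and it consumes plain strings instead of LangChain message chunks.

```python
import time


class SimpleStreamTracker:
    """Toy tracker: counts chunks and characters and times the stream."""

    def __init__(self):
        self.started = time.perf_counter()
        self.chunks = 0
        self.chars = 0

    def add_chunk(self, text):
        """Record one streamed piece of output."""
        self.chunks += 1
        self.chars += len(text)

    def finish(self):
        """Stop timing and return the collected totals."""
        duration_ms = (time.perf_counter() - self.started) * 1000
        return {"chunks": self.chunks, "chars": self.chars, "duration_ms": duration_ms}


def fake_stream():
    """Stand-in for llm.stream(...); yields plain strings."""
    yield from ["Roses ", "are ", "red."]


simple_tracker = SimpleStreamTracker()
for piece in fake_stream():
    simple_tracker.add_chunk(piece)
summary = simple_tracker.finish()
print(summary["chunks"], summary["chars"])
```

Accumulating per chunk and computing the totals once at the end keeps the overhead inside the streaming loop to a couple of additions.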

## Key Integration Files

All LangChain-specific components live under `headroom/integrations/langchain/` in the `chopratejas/headroom` repository:

- [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py) — `HeadroomChatModel` wrapper and optimization logic.
- [`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py) — Provider auto-detection utilities.
- [`headroom/integrations/langchain/memory.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/memory.py) — `HeadroomChatMessageHistory` for automatic memory compression.
- [`headroom/integrations/langchain/retriever.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/retriever.py) — `HeadroomDocumentCompressor` for retrieval pipelines.
- [`headroom/integrations/langchain/agents.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/agents.py) — `wrap_tools_with_headroom` for tool output compression.
- [`headroom/integrations/langchain/streaming.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/streaming.py) — `StreamingMetricsTracker` and `StreamingMetricsCallback`.
- [`headroom/integrations/langchain/langsmith.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/langsmith.py) — LangChain-specific observability callbacks.
- [`wiki/langchain.md`](https://github.com/chopratejas/headroom/blob/main/wiki/langchain.md) — High-level integration documentation.

## Summary

- **Wrap any chat model** with `HeadroomChatModel` in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py) to apply Headroom compression transparently.
- **Provider auto-detection** in [`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py) matches your LangChain model to the correct Headroom provider for accurate token counting.
- **Memory, retriever, and tool integrations** compress conversation history, retrieved documents, and agent tool outputs respectively.
- **Async and streaming** APIs are fully supported through `_stream`, `_agenerate`, and `_astream`.
- **Built-in metrics** via `OptimizationMetrics` and `get_savings_summary()` let you measure token savings on every call.

## Frequently Asked Questions

### Which LangChain chat models are compatible with Headroom?

Any model that implements LangChain's `BaseChatModel` interface is compatible. When `auto_detect_provider=True`, `HeadroomChatModel` inspects the wrapped class—such as `ChatOpenAI` or `ChatAnthropic`—and selects the corresponding Headroom provider for accurate tokenization. If your model is not auto-detected, you can configure the provider manually.

### Does Headroom support async and streaming in LangChain?

Yes. The `HeadroomChatModel` wrapper implements `_stream`, `_agenerate`, and `_astream`, so both synchronous and asynchronous LangChain APIs work without modification. You can call `invoke`, `stream`, `ainvoke`, and `astream` exactly as you would on the underlying model.

### How do I view token savings after a compressed chat call?

Each optimization pass generates an `OptimizationMetrics` record tracking tokens before and after compression. The wrapper aggregates these into `total_tokens_saved` and exposes `get_savings_summary()` for a quick report. For streaming scenarios, use `StreamingMetricsTracker` from [`headroom/integrations/langchain/streaming.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/streaming.py) to collect real-time usage statistics.

### Can I compress old conversation turns automatically?

Yes. Use `HeadroomChatMessageHistory` from [`headroom/integrations/langchain/memory.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/memory.py) to wrap any LangChain `ChatMessageHistory`. Set `compress_threshold_tokens` to define when compression triggers and `keep_recent_turns` to preserve the most recent exchanges uncompressed. This is ideal for long-running conversation chains that would otherwise exceed context limits.