# How to Use Headroom with LangChain's ChatModel for Context Compression

> Compress chat contexts using Headroom with LangChain's ChatModel. Reduce token usage while maintaining conversation quality with this simple wrapper for LLM calls.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-06

---

**Wrap any LangChain `BaseChatModel` with `HeadroomChatModel` to automatically compress chat contexts before each LLM call, reducing token usage while preserving conversation quality.**

The Headroom library ([chopratejas/headroom](https://github.com/chopratejas/headroom)) ships a LangChain integration that sits between LangChain and your LLM and compresses messages with its `TransformPipeline`. Because the integration subclasses LangChain's `BaseChatModel`, the wrapped model remains compatible with existing chains and agents.

## How the Integration Works

The LangChain integration centers on `HeadroomChatModel`, located in [`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py). This wrapper intercepts every call to the wrapped LLM, converts LangChain message objects to the OpenAI-style format required by Headroom, runs the `TransformPipeline`, and converts the optimized messages back to LangChain format before sending to the underlying model.
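
The round trip described above can be sketched with the standard library alone. This is an illustration only: `Msg`, `to_openai_format`, `from_openai_format`, `drop_empty_messages`, and `optimize` are hypothetical names, not Headroom or LangChain APIs, and the single transform shown is a placeholder for whatever the real `TransformPipeline` applies.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a LangChain message object (not the real class).
@dataclass
class Msg:
    type: str      # "system", "human", or "ai"
    content: str


# Role names used by OpenAI-style chat payloads.
_TYPE_TO_ROLE = {"system": "system", "human": "user", "ai": "assistant"}
_ROLE_TO_TYPE = {role: type_ for type_, role in _TYPE_TO_ROLE.items()}


def to_openai_format(messages):
    """Convert message objects into OpenAI-style role/content dicts."""
    return [{"role": _TYPE_TO_ROLE[m.type], "content": m.content} for m in messages]


def from_openai_format(payload):
    """Convert OpenAI-style dicts back into message objects."""
    return [Msg(_ROLE_TO_TYPE[p["role"]], p["content"]) for p in payload]


def drop_empty_messages(payload):
    """Placeholder transform: remove messages whose content is blank."""
    return [p for p in payload if p["content"].strip()]


def optimize(messages, transforms):
    """Convert, run each transform in order, then convert back."""
    payload = to_openai_format(messages)
    for transform in transforms:
        payload = transform(payload)
    return from_openai_format(payload)


history = [Msg("system", "Be brief."), Msg("ai", "   "), Msg("human", "Hi")]
optimized = optimize(history, [drop_empty_messages])
print([m.type for m in optimized])  # ['system', 'human']
```

Keeping the conversion at the edges means each transform only ever sees one message shape, which is presumably why the wrapper converts before running the pipeline.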

### Provider Auto-Detection

When `auto_detect_provider=True` (the default), the wrapper inspects the wrapped model class (e.g., `ChatOpenAI`, `ChatAnthropic`) and automatically selects the matching Headroom provider (`OpenAIProvider`, `AnthropicProvider`, etc.). This ensures accurate token counting and enables provider-specific caching optimizations.
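
As a rough illustration of class-based detection, the sketch below maps a model's class name to a provider name. The lookup table, the `detect_provider_name` helper, the fallback default, and the dummy model classes are all assumptions made for this example; Headroom's actual detection logic may differ.

```python
# Hypothetical lookup table; the real mapping lives inside Headroom.
PROVIDER_BY_CLASS_NAME = {
    "ChatOpenAI": "OpenAIProvider",
    "ChatAnthropic": "AnthropicProvider",
}


def detect_provider_name(model, default="OpenAIProvider"):
    """Return a provider name for `model`, checking parent classes too.

    Walking the MRO lets a user-defined subclass of a known chat model
    still match. The fallback `default` is an assumption of this sketch.
    """
    for cls in type(model).__mro__:
        if cls.__name__ in PROVIDER_BY_CLASS_NAME:
            return PROVIDER_BY_CLASS_NAME[cls.__name__]
    return default


# Dummy classes standing in for real chat models.
class ChatAnthropic:
    pass


class MyTunedAnthropic(ChatAnthropic):
    pass


class SomeOtherModel:
    pass


print(detect_provider_name(MyTunedAnthropic()))  # AnthropicProvider
print(detect_provider_name(SomeOtherModel()))    # OpenAIProvider
```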

### Metrics and Observability

Each optimization pass produces an `OptimizationMetrics` record containing tokens before/after, savings percentage, and applied transforms. The wrapper aggregates these into `total_tokens_saved` and exposes `get_savings_summary()` for quick reporting.
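
To make the aggregation concrete, here is a minimal sketch of per-pass records rolled up into a summary. `PassMetrics` and `summarize` are hypothetical stand-ins; the field names mirror those mentioned in this guide, but the actual shape of `OptimizationMetrics` and the return value of `get_savings_summary()` are not shown here and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PassMetrics:
    """Hypothetical stand-in for one optimization record."""
    tokens_before: int
    tokens_after: int
    transforms_applied: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before


def summarize(passes):
    """Aggregate per-pass records into totals (summary keys are illustrative)."""
    total_before = sum(p.tokens_before for p in passes)
    total_saved = sum(p.tokens_saved for p in passes)
    overall = 100.0 * total_saved / total_before if total_before else 0.0
    return {
        "passes": len(passes),
        "total_tokens_saved": total_saved,
        "overall_savings_percent": overall,
    }


passes = [PassMetrics(1000, 600, ["dedupe"]), PassMetrics(500, 500)]
summary = summarize(passes)
print(summary["total_tokens_saved"])  # 400
```

The overall percentage is computed from the token totals rather than by averaging per-pass percentages, so a large request weighs more than a small one.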

## Installation

Install Headroom with the LangChain extra to access all integration components:

```bash
pip install "headroom-ai[langchain]"
```

This installs the `headroom.integrations.langchain` subpackage, which includes chat model wrappers, memory helpers, retriever components, and streaming utilities.

## Basic Usage: Wrapping Chat Models

To enable compression, wrap your existing LangChain chat model with `HeadroomChatModel`:

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

# Wrap the model
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)

# View token-saving stats
print(llm.get_savings_summary())
```

The wrapped model supports all standard LangChain invocation methods (`invoke`, `batch`, `stream`) while transparently compressing context windows.

## Advanced Integration Patterns

### Async and Streaming Support

The wrapper implements `_stream`, `_agenerate`, and `_astream`, enabling asynchronous and streaming APIs:

```python
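import asyncio

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel


async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

    # Async invoke
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)

    # Async streaming: print each chunk as it arrives
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(async_chat())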
```

### Memory Compression with HeadroomChatMessageHistory

For long-running conversations, use `HeadroomChatMessageHistory` to compress old turns automatically while preserving recent context:

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatMessageHistory, HeadroomChatModel

base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,  # compress when history exceeds 4K tokens
    keep_recent_turns=5,             # always keep the last 5 turns
)

memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)
```

This implementation, found in [`headroom/integrations/langchain/memory.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/memory.py), ensures that only the most recent and salient conversation history reaches the LLM.
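
The interaction between a token threshold and a "keep recent turns" window can be sketched as follows. Everything here is an assumption for illustration: `estimate_tokens` is a crude character heuristic, a "turn" is taken to be one user/assistant pair, and `compact_history` replaces older messages with a placeholder line instead of performing real compression as Headroom does.

```python
def estimate_tokens(text):
    # Crude heuristic (about four characters per token); real counting
    # would use the provider's tokenizer.
    return max(1, len(text) // 4)


def compact_history(messages, compress_threshold_tokens, keep_recent_turns):
    """Condense older messages once the history exceeds the token threshold.

    `messages` is a list of (role, content) tuples. The most recent
    `keep_recent_turns` user/assistant pairs are always kept verbatim.
    """
    total = sum(estimate_tokens(content) for _, content in messages)
    if total <= compress_threshold_tokens:
        return list(messages)

    keep = keep_recent_turns * 2               # two messages per turn
    cut = max(0, len(messages) - keep)
    older, recent = messages[:cut], messages[cut:]
    if not older:
        return list(messages)

    # Placeholder for real compression: one line standing in for old turns.
    summary = ("system", f"[{len(older)} earlier messages condensed]")
    return [summary] + list(recent)


# Ten turns of 400-character messages: roughly 2,000 estimated tokens.
history = []
for i in range(10):
    history.append(("user", f"q{i}".ljust(400, ".")))
    history.append(("assistant", f"a{i}".ljust(400, ".")))

compacted = compact_history(history, compress_threshold_tokens=1000, keep_recent_turns=2)
print(len(compacted))   # 5
print(compacted[0][1])  # [16 earlier messages condensed]
```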

### Document Retrieval Compression

Use `HeadroomDocumentCompressor` inside LangChain's `ContextualCompressionRetriever` to filter retrieved documents:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

# `docs` (a list of LangChain Documents) and `embeddings` (an Embeddings
# instance) are assumed to be defined elsewhere in your application.
vectorstore = FAISS.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
```

Found in [`headroom/integrations/langchain/retriever.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/retriever.py), this component intelligently selects the most relevant documents from large retrieval sets.
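
One plausible reading of the three parameters is "filter by score, prefer non-redundant results, cap the count". The sketch below implements that reading with precomputed relevance scores and a word-overlap (Jaccard) check for diversity. The `select_documents` helper, the `max_overlap` cutoff, and the scoring inputs are assumptions for illustration and are not taken from Headroom's implementation.

```python
def word_set(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_documents(scored_docs, max_documents, min_relevance,
                     prefer_diverse=True, max_overlap=0.8):
    """Pick up to `max_documents` from (text, score) pairs.

    Documents below `min_relevance` are dropped. With `prefer_diverse`,
    a candidate is skipped when it overlaps too heavily with a document
    that has already been selected.
    """
    candidates = sorted(
        (d for d in scored_docs if d[1] >= min_relevance),
        key=lambda d: d[1],
        reverse=True,
    )
    selected = []
    for text, score in candidates:
        if len(selected) == max_documents:
            break
        if prefer_diverse and any(
            jaccard(word_set(text), word_set(chosen)) > max_overlap
            for chosen, _ in selected
        ):
            continue
        selected.append((text, score))
    return selected


scored = [
    ("redis cache eviction policy", 0.90),
    ("redis cache eviction policy", 0.85),   # exact duplicate
    ("postgres index tuning", 0.60),
    ("unrelated cooking recipe", 0.10),      # below min_relevance
]
picked = select_documents(scored, max_documents=2, min_relevance=0.3)
print([text for text, _ in picked])
# ['redis cache eviction policy', 'postgres index tuning']
```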

### Tool Output Compression

For agents that generate large tool outputs, use `wrap_tools_with_headroom` (from [`headroom/integrations/langchain/agents.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/agents.py)) to compress results before they hit the LLM context window:

```python
from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom

@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...

tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)
```
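
The general pattern behind a size-gated tool wrapper can be sketched with a plain function decorator. This is not Headroom's implementation: `with_output_compression` and `compress_text` are hypothetical names, the head-and-tail truncation is a placeholder for real compression, and treating the threshold as inclusive is an assumption.

```python
import functools


def compress_text(text, keep_chars=200):
    """Placeholder compression: keep the head and tail, note what was dropped."""
    half = keep_chars // 2
    dropped = len(text) - 2 * half
    return f"{text[:half]}\n[... {dropped} characters omitted ...]\n{text[-half:]}"


def with_output_compression(func, min_chars_to_compress=1000):
    """Wrap `func` so that long string results are compressed before returning."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, str) and len(result) >= min_chars_to_compress:
            return compress_text(result)
        return result  # short or non-string outputs pass through unchanged
    return wrapper


def search_web(query: str) -> str:
    """Hypothetical tool that sometimes returns a large payload."""
    return "x" * 5000 if query == "big" else "short answer"


wrapped = with_output_compression(search_web, min_chars_to_compress=1000)
print(wrapped("small"))     # short answer
print(len(wrapped("big")))  # far fewer than the original 5000 characters
```

Gating on a minimum size avoids spending effort on outputs that are already small, where compression could only lose detail.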

## Monitoring Token Savings

Track real-time compression metrics during streaming with `StreamingMetricsTracker`:

```python
from langchain_core.messages import HumanMessage
from headroom.integrations import StreamingMetricsTracker

# `llm` is the HeadroomChatModel instance created in the basic usage example.
tracker = StreamingMetricsTracker(model="gpt-4o")
for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)
    print(chunk.content, end="")

metrics = tracker.finish()
print(f"\nOutput tokens: {metrics.output_tokens}, Saved: {tracker.total_saved}")
```

Alternatively, use `StreamingMetricsCallback` for LangChain callback-based observability.

## Key Source Files

The integration is modularized across the following locations:

- **[`headroom/integrations/langchain/chat_model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/chat_model.py)** – Core `HeadroomChatModel` wrapper and optimization logic
- **[`headroom/integrations/langchain/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/providers.py)** – Provider auto-detection utilities
- **[`headroom/integrations/langchain/memory.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/memory.py)** – `HeadroomChatMessageHistory` for conversation compression
- **[`headroom/integrations/langchain/retriever.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/retriever.py)** – `HeadroomDocumentCompressor` for RAG pipelines
- **[`headroom/integrations/langchain/agents.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/agents.py)** – Tool-wrapping helpers (`wrap_tools_with_headroom`)
- **[`headroom/integrations/langchain/streaming.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/streaming.py)** – `StreamingMetricsTracker` and async streaming utilities
- **[`headroom/integrations/langchain/langsmith.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/langchain/langsmith.py)** – LangSmith-specific callbacks for observability

## Summary

- **Wrap existing models**: Use `HeadroomChatModel` to add compression to any LangChain `BaseChatModel` without changing your invocation code.
- **Automatic provider detection**: The wrapper identifies OpenAI, Anthropic, and other providers to ensure accurate token counting.
- **Sync, async, and streaming support**: `invoke`, `stream`, `ainvoke`, and `astream` all work with compression enabled.
- **Memory and retrieval**: Specialized classes handle conversation history and document compression automatically.
- **Built-in metrics**: Access `get_savings_summary()` and `OptimizationMetrics` to quantify token reductions.

## Frequently Asked Questions

### Does Headroom support streaming with LangChain?

Yes. `HeadroomChatModel` implements `_stream` and `_astream`, so both synchronous `llm.stream()` and asynchronous `llm.astream()` work. Compression runs on the full context before streaming begins, so the wrapped LLM receives optimized input and still streams output tokens incrementally.

### How does the provider auto-detection work?

When `auto_detect_provider=True` (default), the wrapper inspects the class name of the wrapped model (e.g., `ChatOpenAI`, `ChatAnthropic`) and maps it to the corresponding Headroom provider class (`OpenAIProvider`, `AnthropicProvider`). This mapping ensures correct token limits, pricing calculations, and provider-specific caching strategies are applied without manual configuration.

### Can I use Headroom with existing LangChain memory classes?

Yes. Instead of replacing your memory implementation, wrap the underlying `ChatMessageHistory` with `HeadroomChatMessageHistory`. Standard LangChain memory classes such as `ConversationBufferMemory` then work as before, while older turns are compressed once the history exceeds your `compress_threshold_tokens` limit and the most recent `keep_recent_turns` turns stay uncompressed for immediate context.

### What metrics does Headroom provide for tracking compression?

Each optimization generates an `OptimizationMetrics` object containing `tokens_before`, `tokens_after`, `savings_percent`, and applied transforms. The `HeadroomChatModel` aggregates these into `total_tokens_saved`, accessible via `get_savings_summary()`. For streaming scenarios, `StreamingMetricsTracker` provides real-time token accounting including output token counts and generation duration.