
How to Use Headroom with LangChain's ChatModel for Context Compression

June 6, 2026 · chopratejas/headroom

Wrap any LangChain BaseChatModel with HeadroomChatModel to automatically compress chat contexts before each LLM call, reducing token usage while preserving conversation quality.

The Headroom library (chopratejas/headroom) provides a first-class integration that intercepts messages between LangChain and your LLM, applying intelligent compression via the TransformPipeline. This integration subclasses LangChain's BaseChatModel to ensure full compatibility with existing chains and agents.

How the Integration Works

The LangChain integration centers on HeadroomChatModel, located in headroom/integrations/langchain/chat_model.py. This wrapper intercepts every call to the wrapped LLM, converts LangChain message objects to the OpenAI-style format required by Headroom, runs the TransformPipeline, and converts the optimized messages back to LangChain format before sending to the underlying model.
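The round trip described above can be sketched without either library installed. This is only an illustration: the Message dataclass stands in for LangChain's message classes, drop_empty is a placeholder for whatever the TransformPipeline actually does, and all function names here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a LangChain message. Real code would use
# langchain_core.messages (HumanMessage, AIMessage, SystemMessage).
@dataclass
class Message:
    type: str      # "human", "ai", or "system"
    content: str

# LangChain message types mapped to OpenAI-style roles, and back.
ROLE_BY_TYPE = {"human": "user", "ai": "assistant", "system": "system"}
TYPE_BY_ROLE = {role: type_ for type_, role in ROLE_BY_TYPE.items()}

def to_openai(messages):
    """Convert message objects to OpenAI-style role/content dicts."""
    return [{"role": ROLE_BY_TYPE[m.type], "content": m.content} for m in messages]

def from_openai(dicts):
    """Convert OpenAI-style dicts back to message objects."""
    return [Message(TYPE_BY_ROLE[d["role"]], d["content"]) for d in dicts]

def drop_empty(dicts):
    """Placeholder transform: remove messages whose content is blank."""
    return [d for d in dicts if d["content"].strip()]

def optimize(messages):
    """Convert, transform, and convert back, mirroring the wrapper's flow."""
    return from_openai(drop_empty(to_openai(messages)))

history = [
    Message("system", "Be brief."),
    Message("human", "  "),        # blank message, removed by the transform
    Message("human", "Hi"),
]
optimized = optimize(history)
print(optimized)
```

Keeping the transform in the dict format means the same pipeline can serve any framework that can convert to and from role/content pairs.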

Provider Auto-Detection

When auto_detect_provider=True (the default), the wrapper inspects the wrapped model class (e.g., ChatOpenAI, ChatAnthropic) and automatically selects the matching Headroom provider (OpenAIProvider, AnthropicProvider, etc.). This ensures accurate token counting and enables provider-specific caching optimizations.
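A class-name lookup of this kind might look like the sketch below. The mapping table, the detect_provider helper, and the fallback default are assumptions made for illustration; they are not taken from Headroom's providers.py. The classes are empty stand-ins so the example runs without LangChain.

```python
# Hypothetical mapping from wrapped-model class names to provider labels.
PROVIDER_BY_CLASS = {
    "ChatOpenAI": "OpenAIProvider",
    "ChatAnthropic": "AnthropicProvider",
}

def detect_provider(model, default="OpenAIProvider"):
    """Walk the model's class hierarchy and return the first known provider.

    Checking the whole MRO means subclasses of a known chat model still match.
    The fallback default is an assumption of this sketch.
    """
    for cls in type(model).__mro__:
        if cls.__name__ in PROVIDER_BY_CLASS:
            return PROVIDER_BY_CLASS[cls.__name__]
    return default

# Stand-in classes so the sketch runs without LangChain installed.
class ChatAnthropic:
    pass

class MyTunedClaude(ChatAnthropic):
    pass

class UnknownModel:
    pass

print(detect_provider(MyTunedClaude()))  # AnthropicProvider
print(detect_provider(UnknownModel()))   # OpenAIProvider (fallback)
```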

Metrics and Observability

Each optimization pass produces an OptimizationMetrics record containing tokens before/after, savings percentage, and applied transforms. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for quick reporting.
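To make the aggregation concrete, here is a minimal sketch of per-call records being rolled up into a summary. The dataclass below only borrows the name OptimizationMetrics; its fields, the tokens_saved property, and the summarize helper are assumptions, and the real class may be shaped differently.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizationMetrics:
    # Illustrative fields only; the real Headroom class may differ.
    tokens_before: int
    tokens_after: int
    transforms: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before

def summarize(records):
    """Aggregate per-call records into totals (hypothetical helper)."""
    before = sum(r.tokens_before for r in records)
    saved = sum(r.tokens_saved for r in records)
    return {
        "calls": len(records),
        "total_tokens_saved": saved,
        "overall_savings_percent": 100.0 * saved / before if before else 0.0,
    }

records = [
    OptimizationMetrics(1000, 600, ["dedupe"]),
    OptimizationMetrics(500, 450),
]
summary = summarize(records)
print(summary)
```

The overall percentage is computed from summed token counts rather than by averaging per-call percentages, so large calls weigh more than small ones.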

Installation

Install Headroom with the LangChain extra to access all integration components:

pip install "headroom-ai[langchain]"

This installs the headroom.integrations.langchain subpackage, which includes chat model wrappers, memory helpers, retriever components, and streaming utilities.

Basic Usage: Wrapping Chat Models

To enable compression, wrap your existing LangChain chat model with HeadroomChatModel:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

# Wrap the model
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)

# View token-saving stats
print(llm.get_savings_summary())

The wrapped model supports all standard LangChain invocation methods (invoke, batch, stream) while transparently compressing context windows.

Advanced Integration Patterns

Async and Streaming Support

The wrapper implements _stream, _agenerate, and _astream, enabling asynchronous and streaming APIs:

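import asyncio

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

    # Async invoke
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)

    # Async streaming: chunks are printed as the model generates them
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)

# Uncomment to run (requires an OpenAI API key in the environment):
# asyncio.run(async_chat())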

Memory Compression with HeadroomChatMessageHistory

For long-running conversations, use HeadroomChatMessageHistory to compress old turns automatically while preserving recent context:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatMessageHistory, HeadroomChatModel

# Note: ConversationBufferMemory and ConversationChain are legacy LangChain
# APIs; recent LangChain releases mark them as deprecated.

base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,   # compress when >4K tokens
    keep_recent_turns=5,              # always keep the last 5 turns
)

memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)

This implementation, found in headroom/integrations/langchain/memory.py, ensures that only the most recent and salient conversation history reaches the LLM.
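The threshold-plus-recent-turns behavior can be pictured with a small standalone sketch. Everything here is an assumption made for illustration: the four-characters-per-token estimate, the truncation used as "compression", and the compress_history function itself. It does not reproduce what memory.py does internally.

```python
def estimate_tokens(text):
    """Crude token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)

def compress_history(messages, compress_threshold_tokens, keep_recent_turns):
    """Collapse older turns into one summary message once history grows too long.

    `messages` is a list of (role, content) tuples. One turn is treated as one
    human message plus one AI message, so the last 2 * keep_recent_turns
    messages are always kept verbatim.
    """
    total = sum(estimate_tokens(content) for _, content in messages)
    keep = keep_recent_turns * 2
    if total <= compress_threshold_tokens or len(messages) <= keep:
        return messages  # under budget, or nothing old enough to compress

    split = len(messages) - keep
    old, recent = messages[:split], messages[split:]
    # Placeholder "compression": keep the first 40 characters of each old message.
    digest = " | ".join(content[:40] for _, content in old)
    return [("system", f"Earlier conversation (condensed): {digest}")] + recent

messages = []
for i in range(6):
    messages.append(("human", f"Question {i}: " + "x" * 200))
    messages.append(("ai", f"Answer {i}: " + "y" * 200))

compressed = compress_history(messages, compress_threshold_tokens=100, keep_recent_turns=2)
print(len(messages), "->", len(compressed), "messages")
```

A production implementation would likely summarize the old turns rather than truncate them, but the split between a compressed prefix and an untouched recent window is the same idea.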

Document Retrieval Compression

Use HeadroomDocumentCompressor inside LangChain's ContextualCompressionRetriever to filter retrieved documents:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

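# `docs` (a list of Document objects) and `embeddings` (an embeddings model)
# are assumed to be defined already.
vectorstore = FAISS.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})  # over-fetch, then filter

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)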

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

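Defined in headroom/integrations/langchain/retriever.py, this component narrows a large retrieval set to the most relevant documents. In the configuration above, the 50 candidates returned by the base retriever are reduced to at most 10 documents (max_documents) that meet the 0.3 relevance cutoff (min_relevance), with prefer_diverse enabled.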
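One plausible reading of those three parameters is a filter, then a best-first pick that skips near-duplicates. The sketch below implements that reading with a simple word-overlap (Jaccard) measure; the scoring input, the max_similarity cutoff, and the select_documents function are all hypothetical and are not drawn from Headroom's source.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def select_documents(scored_docs, max_documents=10, min_relevance=0.3,
                     prefer_diverse=True, max_similarity=0.8):
    """Pick the top documents from (text, score) pairs.

    Drops low scorers, then walks the rest best-first, optionally skipping
    near-duplicates of documents that were already chosen.
    """
    candidates = sorted(
        (d for d in scored_docs if d[1] >= min_relevance),
        key=lambda d: d[1],
        reverse=True,
    )
    selected = []
    for text, score in candidates:
        if len(selected) >= max_documents:
            break
        if prefer_diverse and any(jaccard(text, s) > max_similarity for s, _ in selected):
            continue
        selected.append((text, score))
    return selected

docs = [
    ("headroom compresses chat context", 0.9),
    ("headroom compresses chat context", 0.85),   # exact duplicate of the first
    ("langchain retrievers return documents", 0.6),
    ("unrelated cooking recipe", 0.1),            # below min_relevance
]
picked = select_documents(docs, max_documents=2)
print(picked)
```

With prefer_diverse turned off, the same call would return the two duplicate documents, which is the redundancy the diversity option is meant to avoid.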

Tool Output Compression

For agents that generate large tool outputs, use wrap_tools_with_headroom (from headroom/integrations/langchain/agents.py) to compress results before they hit the LLM context window:

from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom

@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...

tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)
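The idea of wrapping a tool so that only oversized results are shortened can be sketched with a plain decorator. This is an illustration under stated assumptions: compress_output and keep_chars are hypothetical names, and head-and-tail truncation is a placeholder for whatever compression Headroom applies to tool results.

```python
import functools

def compress_output(min_chars_to_compress=1000, keep_chars=200):
    """Decorator: shorten string results of at least `min_chars_to_compress` chars.

    Placeholder strategy: keep the head and tail and note how much was dropped.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not isinstance(result, str) or len(result) < min_chars_to_compress:
                return result  # small or non-text results pass through untouched
            omitted = len(result) - 2 * keep_chars
            return (
                f"{result[:keep_chars]}\n"
                f"[... {omitted} characters omitted ...]\n"
                f"{result[-keep_chars:]}"
            )
        return wrapper
    return decorator

@compress_output(min_chars_to_compress=1000)
def search_web(query: str) -> str:
    """Return a large payload (simulated)."""
    return "{" + "a" * 5000 + "}"

long_result = search_web("anything")
short_result = compress_output()(lambda q: "ok")("x")
print(len(long_result), short_result)
```

Because functools.wraps preserves the tool's name and docstring, an agent framework that reads those attributes to describe the tool still sees the original metadata.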

Monitoring Token Savings

Track real-time compression metrics during streaming with StreamingMetricsTracker:

from headroom.integrations import StreamingMetricsTracker

tracker = StreamingMetricsTracker(model="gpt-4o")
for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)
    print(chunk.content, end="")

metrics = tracker.finish()
print(f"Output tokens: {metrics.output_tokens}, Saved: {tracker.total_saved}")

Alternatively, use StreamingMetricsCallback for LangChain callback-based observability.
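The bookkeeping a streaming tracker performs can be illustrated with a small offline sketch. SimpleStreamTracker is a hypothetical stand-in, not the StreamingMetricsTracker class; it counts characters with a rough four-characters-per-token heuristic instead of a real tokenizer and consumes a simulated stream.

```python
import time

class SimpleStreamTracker:
    """Hypothetical stand-in for a streaming metrics tracker."""

    def __init__(self, chars_per_token=4):
        self.chars_per_token = chars_per_token  # rough heuristic, not a tokenizer
        self.started_at = time.perf_counter()
        self.chunk_count = 0
        self.output_chars = 0

    def add_chunk(self, text):
        """Record one streamed chunk of text."""
        self.chunk_count += 1
        self.output_chars += len(text)

    def finish(self):
        """Return totals once the stream is exhausted."""
        return {
            "chunks": self.chunk_count,
            "output_tokens": self.output_chars // self.chars_per_token,
            "duration_seconds": time.perf_counter() - self.started_at,
        }

def fake_stream():
    """Simulated model output so the sketch runs offline."""
    yield from ["Roses ", "are ", "red, ", "violets ", "are ", "blue."]

tracker = SimpleStreamTracker()
for piece in fake_stream():
    tracker.add_chunk(piece)
    print(piece, end="")
print()

stats = tracker.finish()
print(stats["chunks"], stats["output_tokens"])
```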

Key Source Files

The integration is modularized across the following locations:

headroom/integrations/langchain/chat_model.py – Core HeadroomChatModel wrapper and optimization logic
headroom/integrations/langchain/providers.py – Provider auto-detection utilities
headroom/integrations/langchain/memory.py – HeadroomChatMessageHistory for conversation compression
headroom/integrations/langchain/retriever.py – HeadroomDocumentCompressor for RAG pipelines
headroom/integrations/langchain/agents.py – Tool-wrapping helpers (wrap_tools_with_headroom)
headroom/integrations/langchain/streaming.py – StreamingMetricsTracker and async streaming utilities
headroom/integrations/langchain/langsmith.py – LangSmith-specific callbacks for observability

Summary

Wrap existing models: Use HeadroomChatModel to add compression to any LangChain BaseChatModel without changing your invocation code.
Automatic provider detection: The wrapper identifies OpenAI, Anthropic, and other providers to ensure accurate token counting.
Full async support: All methods (invoke, stream, ainvoke, astream) work with compression enabled.
Memory and retrieval: Specialized classes handle conversation history and document compression automatically.
Built-in metrics: Access get_savings_summary() and OptimizationMetrics to quantify token reductions.

Frequently Asked Questions

Does Headroom support streaming with LangChain?

Yes. HeadroomChatModel implements _stream and _astream methods, allowing you to use both synchronous llm.stream() and asynchronous llm.astream() methods. The compression occurs on the full context before streaming begins, ensuring the wrapped LLM receives optimized input while still streaming output tokens incrementally.

How does the provider auto-detection work?

When auto_detect_provider=True (default), the wrapper inspects the class name of the wrapped model (e.g., ChatOpenAI, ChatAnthropic) and maps it to the corresponding Headroom provider class (OpenAIProvider, AnthropicProvider). This mapping ensures correct token limits, pricing calculations, and provider-specific caching strategies are applied without manual configuration.

Can I use Headroom with existing LangChain memory classes?

Yes. Instead of replacing your memory implementation, wrap the underlying ChatMessageHistory with HeadroomChatMessageHistory. This allows you to use standard LangChain memory classes like ConversationBufferMemory while automatically compressing historical turns that exceed your compress_threshold_tokens limit, keeping the most recent keep_recent_turns uncompressed for immediate context.

What metrics does Headroom provide for tracking compression?

Each optimization generates an OptimizationMetrics object containing tokens_before, tokens_after, savings_percent, and applied transforms. The HeadroomChatModel aggregates these into total_tokens_saved, accessible via get_savings_summary(). For streaming scenarios, StreamingMetricsTracker provides real-time token accounting including output token counts and generation duration.
