How to Integrate Headroom with LangChain for Chat Model Compression

To integrate Headroom with LangChain for chat model compression, wrap any LangChain BaseChatModel in HeadroomChatModel. The wrapper intercepts every LLM call, runs the messages through the TransformPipeline before they reach the model, and records token-saving metrics for each call.

Headroom is an open-source compression SDK that ships a first-class LangChain integration under chopratejas/headroom. By subclassing LangChain's BaseChatModel in headroom/integrations/langchain/chat_model.py, the library lets you drop compression into existing chains, agents, and retrievers without rewriting your application logic.

How Headroom Intercepts LangChain Chat Models

The HeadroomChatModel Wrapper

Inside headroom/integrations/langchain/chat_model.py, the HeadroomChatModel class subclasses LangChain's BaseChatModel. On every invocation, it performs three steps:

  1. Converts LangChain message objects into the OpenAI-style format required by Headroom.
  2. Applies the core TransformPipeline to compress the messages, optimize them for provider caching, and filter for relevance.
  3. Converts the optimized messages back to LangChain format and forwards them to the wrapped LLM.

The heavy lifting stays in the core headroom SDK; the integration layer only handles message translation and metric bookkeeping.
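
The message translation in steps 1 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not Headroom's actual code: the Msg dataclass is a stand-in for LangChain message objects (which also expose type and content), and to_openai and from_openai are hypothetical helper names.

```python
from dataclasses import dataclass

# Stand-in for a LangChain message; real messages also expose .type and .content.
@dataclass
class Msg:
    type: str      # "human", "ai", or "system"
    content: str

# LangChain message types mapped to OpenAI-style roles.
TYPE_TO_ROLE = {"human": "user", "ai": "assistant", "system": "system"}
ROLE_TO_TYPE = {role: type_ for type_, role in TYPE_TO_ROLE.items()}

def to_openai(messages):
    """Step 1: LangChain-style messages -> OpenAI-style dicts."""
    return [{"role": TYPE_TO_ROLE[m.type], "content": m.content} for m in messages]

def from_openai(payload):
    """Step 3: OpenAI-style dicts -> LangChain-style messages."""
    return [Msg(ROLE_TO_TYPE[d["role"]], d["content"]) for d in payload]

original = [Msg("system", "Be brief."), Msg("human", "Explain quantum tunnelling.")]
payload = to_openai(original)      # what the pipeline would receive
round_trip = from_openai(payload)  # what the wrapped LLM would receive
```

In the real wrapper, step 2 would transform the payload between the two conversions; here the round trip only shows that the translation itself is lossless for simple text messages.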

Provider Auto-Detection

When you instantiate HeadroomChatModel with auto_detect_provider=True (the default), the wrapper inspects the wrapped model class—such as ChatOpenAI or ChatAnthropic—and selects the matching Headroom provider (e.g., OpenAIProvider, AnthropicProvider). According to the source code in headroom/integrations/langchain/providers.py, this ensures accurate token counting and enables provider-specific caching optimizations.
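
One plausible way to implement this kind of detection is a lookup keyed on the wrapped model's class name. The sketch below is an assumption about the approach, not the contents of providers.py: the table, the detect_provider_name function, and the fallback default are all hypothetical, and the stand-in classes exist only so the example runs without LangChain installed.

```python
# Hypothetical lookup table: wrapped class name -> Headroom provider name.
PROVIDER_BY_CLASS = {
    "ChatOpenAI": "OpenAIProvider",
    "ChatAnthropic": "AnthropicProvider",
}

def detect_provider_name(model, default="OpenAIProvider"):
    """Walk the model's class hierarchy and return the first known provider.

    The fallback default is an assumption made for this sketch.
    """
    for cls in type(model).__mro__:
        if cls.__name__ in PROVIDER_BY_CLASS:
            return PROVIDER_BY_CLASS[cls.__name__]
    return default

# Stand-in classes so the sketch runs without LangChain installed.
class ChatAnthropic:
    pass

class MyTunedClaude(ChatAnthropic):
    pass

class UnknownModel:
    pass

detected = detect_provider_name(MyTunedClaude())
fallback = detect_provider_name(UnknownModel())
```

Walking the method resolution order means subclasses of a known chat model resolve to the same provider as their parent.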

Optimization Metrics

Each pass through the pipeline produces an OptimizationMetrics record that tracks tokens before and after, savings percentage, and applied transforms. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for quick reporting.
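
To make the bookkeeping concrete, here is a small sketch of a per-call metrics record and an aggregate summary. The Metrics class, its field names, and the summarize function are illustrative assumptions; the real OptimizationMetrics and get_savings_summary() may be shaped differently.

```python
from dataclasses import dataclass, field

@dataclass
class Metrics:
    """Hypothetical per-call record; field names are assumptions."""
    tokens_before: int
    tokens_after: int
    transforms_applied: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before

def summarize(history):
    """Aggregate per-call records into one report."""
    total_before = sum(m.tokens_before for m in history)
    total_saved = sum(m.tokens_saved for m in history)
    overall = 100.0 * total_saved / total_before if total_before else 0.0
    return {
        "calls": len(history),
        "total_tokens_saved": total_saved,
        "overall_savings_percent": overall,
    }

history = [Metrics(1000, 600, ["compress"]), Metrics(500, 450)]
summary = summarize(history)
```

The overall percentage is computed from the token totals rather than by averaging per-call percentages, so large calls weigh more than small ones.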

Async and Streaming Support

The wrapper implements _stream, _agenerate, and _astream, so both synchronous and asynchronous LangChain APIs work out of the box. You can call invoke, stream, ainvoke, and astream on a HeadroomChatModel exactly like the underlying model.

Install the LangChain Extra

Install Headroom with the LangChain extra to pull in the required integration files:

pip install "headroom-ai[langchain]"

Wrap Any LangChain Chat Model

The following example shows the basic usage of HeadroomChatModel to compress chat interactions:

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

# Wrap the base model; compression is applied on every call
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)

# View token-saving stats
print(llm.get_savings_summary())

Advanced LangChain Integration Examples

Compress Memory with HeadroomChatMessageHistory

For long conversations, HeadroomChatMessageHistory in headroom/integrations/langchain/memory.py wraps any ChatMessageHistory to compress old turns automatically:

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatMessageHistory, HeadroomChatModel

# Wrap a plain history so older turns are compressed past the token threshold
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,
    keep_recent_turns=5,
)

# Plug the compressed history into a standard conversation chain
memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)
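
The interaction between compress_threshold_tokens and keep_recent_turns can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: each string counts as one turn, estimate_tokens is a crude character-based estimate, and truncation stands in for whatever compression Headroom actually applies to older turns.

```python
def estimate_tokens(text):
    """Crude estimate (about 4 characters per token); a real tokenizer differs."""
    return max(1, len(text) // 4)

def compress_history(turns, compress_threshold_tokens, keep_recent_turns):
    """Leave short histories alone; otherwise fold older turns into one entry."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= compress_threshold_tokens or len(turns) <= keep_recent_turns:
        return list(turns)
    split = len(turns) - keep_recent_turns
    old, recent = turns[:split], turns[split:]
    # Placeholder "compression": keep only the first 40 characters of each old turn.
    summary = " | ".join(t[:40] for t in old)
    return [f"[summary of {len(old)} earlier turns] {summary}"] + list(recent)

turns = [f"turn {i}: " + "x" * 200 for i in range(10)]
compressed = compress_history(turns, compress_threshold_tokens=300, keep_recent_turns=5)
untouched = compress_history(turns, compress_threshold_tokens=10_000, keep_recent_turns=5)
```

The most recent turns pass through verbatim, which keeps the immediate context exact while the older material shrinks.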

Compress Retrieved Documents

Use HeadroomDocumentCompressor from headroom/integrations/langchain/retriever.py as a BaseDocumentCompressor inside a ContextualCompressionRetriever:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

# Assumes `docs` (a list of Documents) and `embeddings` (an Embeddings
# instance) are already defined elsewhere in your application.
vectorstore = FAISS.from_documents(docs, embeddings)

# Over-fetch candidates, then let the compressor narrow them down
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
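
As a rough illustration of what max_documents and min_relevance imply, the sketch below filters and ranks pre-scored candidates. It is an assumption about the selection logic, not the code in retriever.py: select_documents is a hypothetical helper, the relevance scores are made up, and the prefer_diverse behavior is deliberately left out.

```python
def select_documents(scored_docs, max_documents, min_relevance):
    """Keep the highest-scoring documents at or above the relevance floor.

    scored_docs is a list of (text, relevance) pairs with relevance in [0, 1].
    """
    relevant = [d for d in scored_docs if d[1] >= min_relevance]
    relevant.sort(key=lambda d: d[1], reverse=True)
    return relevant[:max_documents]

# Made-up candidates standing in for the 50 documents fetched by the base retriever.
candidates = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.35), ("e", 0.7)]
kept = select_documents(candidates, max_documents=3, min_relevance=0.3)
```

Over-fetching with a large k and then trimming lets the relevance floor, rather than the vector store's cut-off, decide what reaches the prompt.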

Wrap Agent Tools to Compress Large Outputs

The wrap_tools_with_headroom helper in headroom/integrations/langchain/agents.py decorates LangChain tools so large tool outputs are compressed before the next LLM turn:

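from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom

@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...

# Outputs of at least 1000 characters are compressed before the next LLM turn
tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)

# Example prompt; tools agents need an agent_scratchpad placeholder
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Find recent Python libraries"})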
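
The idea of wrapping a tool so that only large outputs are shrunk can be sketched with a plain decorator. This is a simplified stand-in, not the wrap_tools_with_headroom implementation: compress_large_output and keep_chars are hypothetical names, and head-and-tail truncation substitutes for Headroom's actual compression.

```python
import functools

def compress_large_output(min_chars_to_compress, keep_chars=200):
    """Decorator: shrink string outputs that reach the size threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            output = func(*args, **kwargs)
            # Small outputs pass through untouched.
            if len(output) < min_chars_to_compress or len(output) <= 2 * keep_chars:
                return output
            # Placeholder compression: keep the head and tail, drop the middle.
            omitted = len(output) - 2 * keep_chars
            return (
                f"{output[:keep_chars]}\n"
                f"...[{omitted} chars omitted]...\n"
                f"{output[-keep_chars:]}"
            )
        return wrapper
    return decorator

@compress_large_output(min_chars_to_compress=1000)
def search_web(query: str) -> str:
    """Return a large payload (simulated)."""
    return "x" * 5000

@compress_large_output(min_chars_to_compress=1000)
def small_tool(query: str) -> str:
    """Return a short payload (simulated)."""
    return "ok"

result = search_web("python libraries")
```

Because functools.wraps preserves the function's name and docstring, an agent framework that reads tool metadata would still see the original description.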

Async Streaming

Because HeadroomChatModel implements _agenerate and _astream, async patterns work identically to the wrapped model:

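import asyncio

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

    # Single async call
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)

    # Async streaming, printed chunk by chunk
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)

# asyncio.run(async_chat())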

Track Streaming Metrics in Real Time

StreamingMetricsTracker in headroom/integrations/langchain/streaming.py records token usage while streaming:

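from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel, StreamingMetricsTracker

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
tracker = StreamingMetricsTracker(model="gpt-4o")

for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)  # record each chunk as it arrives
    print(chunk.content, end="", flush=True)

# Finalize the tracker and report the totals
metrics = tracker.finish()
print(f"\nOutput tokens: {metrics.output_tokens}, duration: {metrics.duration_ms:.0f} ms")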
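
For intuition, the add_chunk and finish pattern can be mimicked with a tiny stand-alone tracker. SimpleStreamTracker is a hypothetical class written for this sketch; it counts whitespace-separated words as a rough proxy for tokens, whereas a real tracker would presumably use a model-specific tokenizer.

```python
import time

class SimpleStreamTracker:
    """Hypothetical stand-in that accumulates chunks and times the stream."""

    def __init__(self):
        self._start = time.perf_counter()
        self._parts = []

    def add_chunk(self, text):
        self._parts.append(text)

    def finish(self):
        duration_ms = (time.perf_counter() - self._start) * 1000
        content = "".join(self._parts)
        return {
            "content": content,
            # Word count is only a rough proxy for a real token count.
            "output_tokens": len(content.split()),
            "duration_ms": duration_ms,
        }

tracker = SimpleStreamTracker()
for piece in ["Roses ", "are ", "red"]:
    tracker.add_chunk(piece)
metrics = tracker.finish()
```

Counting on the joined text rather than per chunk avoids miscounting words that are split across chunk boundaries.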

Key Integration Files

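All LangChain-specific components live under headroom/integrations/langchain/ in the chopratejas/headroom repository:

  • chat_model.py: the HeadroomChatModel wrapper.
  • providers.py: provider auto-detection.
  • memory.py: HeadroomChatMessageHistory for compressing conversation history.
  • retriever.py: HeadroomDocumentCompressor for retrieved documents.
  • agents.py: the wrap_tools_with_headroom helper for agent tools.
  • streaming.py: StreamingMetricsTracker for streaming metrics.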

Summary

  • Wrap any chat model with HeadroomChatModel in headroom/integrations/langchain/chat_model.py to apply Headroom compression transparently.
  • Provider auto-detection in providers.py matches your LangChain model to the correct Headroom provider for accurate token counting.
  • Memory, retriever, and tool integrations compress conversation history, retrieved documents, and agent tool outputs respectively.
  • Async and streaming APIs are fully supported through _stream, _agenerate, and _astream.
  • Built-in metrics via OptimizationMetrics and get_savings_summary() let you measure token savings on every call.

Frequently Asked Questions

Which LangChain chat models are compatible with Headroom?

Any model that implements LangChain's BaseChatModel interface is compatible. When auto_detect_provider=True, HeadroomChatModel inspects the wrapped class—such as ChatOpenAI or ChatAnthropic—and selects the corresponding Headroom provider for accurate tokenization. If your model is not auto-detected, you can configure the provider manually.

Does Headroom support async and streaming in LangChain?

Yes. The HeadroomChatModel wrapper implements _stream, _agenerate, and _astream, so both synchronous and asynchronous LangChain APIs work without modification. You can call invoke, stream, ainvoke, and astream exactly as you would on the underlying model.

How do I view token savings after a compressed chat call?

Each optimization pass generates an OptimizationMetrics record tracking tokens before and after compression. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for a quick report. For streaming scenarios, use StreamingMetricsTracker from headroom/integrations/langchain/streaming.py to collect real-time usage statistics.

Can I compress old conversation turns automatically?

Yes. Use HeadroomChatMessageHistory from headroom/integrations/langchain/memory.py to wrap any LangChain ChatMessageHistory. Set compress_threshold_tokens to define when compression triggers and keep_recent_turns to preserve the most recent exchanges uncompressed. This is ideal for long-running conversation chains that would otherwise exceed context limits.
