How to Integrate Headroom with LangChain for Chat Model Compression
To integrate Headroom with LangChain, wrap any LangChain BaseChatModel in HeadroomChatModel. The wrapper intercepts every LLM call, runs the outgoing messages through the TransformPipeline, forwards the optimized messages to the wrapped model, and records token-saving metrics for each call.
Headroom is an open-source compression SDK whose repository, chopratejas/headroom, ships a first-class LangChain integration. By subclassing LangChain's BaseChatModel in headroom/integrations/langchain/chat_model.py, the library lets you drop compression into existing chains, agents, and retrievers without rewriting your application logic.
How Headroom Intercepts LangChain Chat Models
The HeadroomChatModel Wrapper
Inside headroom/integrations/langchain/chat_model.py, the HeadroomChatModel class subclasses LangChain's BaseChatModel. On every invocation, it performs three steps:
- Converts LangChain message objects into the OpenAI-style format required by Headroom.
- Applies the core TransformPipeline to compress, cache-bust, and filter for relevance.
- Converts the optimized messages back to LangChain format and forwards them to the wrapped LLM.
The heavy lifting stays in the core headroom SDK; the integration layer only handles message translation and metric bookkeeping.
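The three steps above can be sketched as a round trip through role/content dictionaries. This is a minimal illustration, not Headroom's code: the message classes are hypothetical stand-ins so the sketch runs without LangChain installed, and collapse_whitespace is a toy transform standing in for the real TransformPipeline.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for LangChain's message classes (assumption).
@dataclass
class SystemMessage:
    content: str


@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


ROLE_BY_TYPE = {SystemMessage: "system", HumanMessage: "user", AIMessage: "assistant"}
TYPE_BY_ROLE = {role: cls for cls, role in ROLE_BY_TYPE.items()}


def to_openai_format(messages):
    """Step 1: message objects -> OpenAI-style role/content dicts."""
    return [{"role": ROLE_BY_TYPE[type(m)], "content": m.content} for m in messages]


def collapse_whitespace(dicts):
    """Step 2: toy transform that only normalizes whitespace."""
    return [{**d, "content": " ".join(d["content"].split())} for d in dicts]


def from_openai_format(dicts):
    """Step 3: dicts -> message objects, ready for the wrapped model."""
    return [TYPE_BY_ROLE[d["role"]](content=d["content"]) for d in dicts]


original = [SystemMessage("You are  terse."), HumanMessage("Explain   quantum\ttunnelling.")]
optimized = from_openai_format(collapse_whitespace(to_openai_format(original)))
```

Keeping the transform on plain dictionaries means the same pipeline can serve other frameworks, with only the two conversion functions being framework-specific.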
Provider Auto-Detection
When you instantiate HeadroomChatModel with auto_detect_provider=True (the default), the wrapper inspects the wrapped model class—such as ChatOpenAI or ChatAnthropic—and selects the matching Headroom provider (e.g., OpenAIProvider, AnthropicProvider). According to the source code in headroom/integrations/langchain/providers.py, this ensures accurate token counting and enables provider-specific caching optimizations.
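One way such class-based detection could work is a lookup on the wrapped model's class name. The sketch below is an assumption for illustration only: the registry, the detect_provider function, the string labels, and the fallback default are all hypothetical and are not taken from providers.py.

```python
# Hypothetical registry mapping LangChain class names to provider labels.
PROVIDER_BY_CLASS_NAME = {
    "ChatOpenAI": "openai",
    "ChatAnthropic": "anthropic",
}


def detect_provider(model, default="openai"):
    """Return a provider label for the model, checking its base classes too."""
    # Walking the MRO lets subclasses of a known chat model be detected.
    for cls in type(model).__mro__:
        label = PROVIDER_BY_CLASS_NAME.get(cls.__name__)
        if label is not None:
            return label
    # Assumed fallback when nothing matches; a real library may raise instead.
    return default


# Stand-in classes so the sketch runs without LangChain installed.
class ChatAnthropic:
    pass


class MyTunedAnthropic(ChatAnthropic):
    pass


class UnknownModel:
    pass


detected = detect_provider(MyTunedAnthropic())
fallback = detect_provider(UnknownModel())
```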
Optimization Metrics
Each pass through the pipeline produces an OptimizationMetrics record that tracks tokens before and after, savings percentage, and applied transforms. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for quick reporting.
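The arithmetic behind such a record is simple enough to sketch. The CallMetrics class and summarize function below are hypothetical names chosen for illustration; they mirror the fields described above but are not Headroom's OptimizationMetrics or get_savings_summary().

```python
from dataclasses import dataclass, field


@dataclass
class CallMetrics:
    """Hypothetical per-call record: token counts and applied transforms."""
    tokens_before: int
    tokens_after: int
    transforms: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before


def summarize(history):
    """Aggregate per-call records into a single savings report."""
    total_before = sum(m.tokens_before for m in history)
    total_saved = sum(m.tokens_saved for m in history)
    overall = 100.0 * total_saved / total_before if total_before else 0.0
    return {
        "calls": len(history),
        "total_tokens_saved": total_saved,
        "overall_savings_percent": overall,
    }


history = [CallMetrics(1000, 600, ["compress"]), CallMetrics(500, 400)]
summary = summarize(history)
```

Note that the overall percentage is computed from the totals rather than by averaging per-call percentages, so large calls weigh more than small ones.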
Async and Streaming Support
The wrapper implements _stream, _agenerate, and _astream, so both synchronous and asynchronous LangChain APIs work out of the box. You can call invoke, stream, ainvoke, and astream on a HeadroomChatModel exactly like the underlying model.
Install the LangChain Extra
Install Headroom with the LangChain extra to pull in the required integration files:
pip install "headroom-ai[langchain]"
Wrap Any LangChain Chat Model
The following example shows the basic usage of HeadroomChatModel to compress chat interactions:
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)

# View token-saving stats
print(llm.get_savings_summary())
Advanced LangChain Integration Examples
Compress Memory with HeadroomChatMessageHistory
For long conversations, HeadroomChatMessageHistory in headroom/integrations/langchain/memory.py wraps any ChatMessageHistory to compress old turns automatically:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatMessageHistory, HeadroomChatModel

# Wrap a plain history so older turns are compressed past the token threshold
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,
    keep_recent_turns=5,
)

memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)
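To make the roles of compress_threshold_tokens and keep_recent_turns concrete, here is a minimal sketch of threshold-triggered compression. It is an assumption about the general technique, not Headroom's implementation: estimate_tokens uses a crude four-characters-per-token guess, and the older turns are replaced by a placeholder string rather than a real summary.

```python
def estimate_tokens(text):
    """Crude assumption: roughly four characters per token."""
    return max(1, len(text) // 4)


def compress_history(turns, compress_threshold_tokens, keep_recent_turns):
    """Collapse older turns once the history exceeds the token threshold."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= compress_threshold_tokens or len(turns) <= keep_recent_turns:
        return list(turns)  # under budget, or nothing old enough to compress

    split = len(turns) - keep_recent_turns
    old, recent = turns[:split], turns[split:]
    # Placeholder standing in for whatever compression the real library applies.
    placeholder = f"[summary of {len(old)} earlier turns]"
    return [placeholder] + recent


turns = ["x" * 400 for _ in range(10)]  # about 100 estimated tokens each
compressed = compress_history(turns, compress_threshold_tokens=500, keep_recent_turns=3)
untouched = compress_history(turns, compress_threshold_tokens=5000, keep_recent_turns=3)
```

The most recent turns pass through verbatim, which is why keep_recent_turns matters: it protects the context the model is most likely to need word for word.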
Compress Retrieved Documents
Use HeadroomDocumentCompressor from headroom/integrations/langchain/retriever.py as a BaseDocumentCompressor inside a ContextualCompressionRetriever:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor

# `docs` (a list of Documents) and `embeddings` (an Embeddings instance)
# are assumed to be defined elsewhere in your application.
vectorstore = FAISS.from_documents(docs, embeddings)

# Over-fetch candidates, then let the compressor narrow them down
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
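The effect of max_documents and min_relevance can be illustrated with a small filter over pre-scored documents. This is a sketch under assumptions: select_documents is a hypothetical function, the scores are supplied by hand, and the prefer_diverse option is not modelled here.

```python
def select_documents(scored_docs, max_documents, min_relevance):
    """Keep the highest-scoring documents that clear the relevance floor.

    scored_docs is a list of (document, score) pairs with scores in [0, 1].
    """
    relevant = [(doc, score) for doc, score in scored_docs if score >= min_relevance]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in relevant[:max_documents]]


scored = [("intro", 0.9), ("footer", 0.1), ("api", 0.5), ("faq", 0.35)]
selected = select_documents(scored, max_documents=2, min_relevance=0.3)
```

Fetching many candidates (k=50 in the example above) and then filtering down trades a larger vector search for a smaller, more relevant prompt.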
Wrap Agent Tools to Compress Large Outputs
The wrap_tools_with_headroom helper in headroom/integrations/langchain/agents.py decorates LangChain tools so large tool outputs are compressed before the next LLM turn:
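from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel, wrap_tools_with_headroom

@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...

# Outputs shorter than min_chars_to_compress pass through unchanged
tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))

# `prompt` is assumed to be an agent prompt template defined elsewhere
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Find recent Python libraries"})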
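The idea of compressing only large tool outputs can be sketched with an ordinary decorator. The code below is an illustration, not the wrap_tools_with_headroom implementation: compress_large_output and keep_chars are hypothetical names, and simple head-and-tail truncation stands in for Headroom's actual compression.

```python
import functools


def compress_large_output(min_chars_to_compress, keep_chars=200):
    """Decorator: shorten string results at or above a size threshold."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            output = fn(*args, **kwargs)
            # Small outputs, or ones too short to trim, pass through unchanged.
            if len(output) < min_chars_to_compress or len(output) <= 2 * keep_chars:
                return output
            omitted = len(output) - 2 * keep_chars
            # Stand-in strategy: keep the head and tail, drop the middle.
            return (
                output[:keep_chars]
                + f"\n...[{omitted} chars omitted]...\n"
                + output[-keep_chars:]
            )
        return wrapper
    return decorator


@compress_large_output(min_chars_to_compress=1000, keep_chars=100)
def fake_search(size):
    """Pretend tool that returns a payload of the requested size."""
    return "x" * size


small = fake_search(50)
large = fake_search(5000)
```

Because the wrapper preserves the function's name and docstring through functools.wraps, the agent still sees the same tool description after wrapping.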
Async Streaming
Because HeadroomChatModel implements _agenerate and _astream, async patterns work identically to the wrapped model:
import asyncio
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)

# asyncio.run(async_chat())
Track Streaming Metrics in Real Time
StreamingMetricsTracker in headroom/integrations/langchain/streaming.py records token usage while streaming:
from langchain_core.messages import HumanMessage
from headroom.integrations import StreamingMetricsTracker

# `llm` is the HeadroomChatModel created in the earlier examples
tracker = StreamingMetricsTracker(model="gpt-4o")
for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)
    print(chunk.content, end="")

metrics = tracker.finish()
print(f"Output tokens: {metrics.output_tokens}, duration: {metrics.duration_ms:.0f} ms")
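The add_chunk and finish pattern above can be illustrated with a small stand-alone tracker. This is a sketch under assumptions, not the StreamingMetricsTracker source: ChunkTimer is a hypothetical class that counts chunks and characters instead of tokens, and it takes plain strings rather than LangChain chunk objects.

```python
import time


class ChunkTimer:
    """Hypothetical tracker: counts streamed chunks and measures elapsed time."""

    def __init__(self):
        self._start = time.perf_counter()
        self._chunks = 0
        self._chars = 0

    def add_chunk(self, text):
        """Record one streamed piece of text."""
        self._chunks += 1
        self._chars += len(text)

    def finish(self):
        """Stop timing and return the collected statistics."""
        duration_ms = (time.perf_counter() - self._start) * 1000
        return {"chunks": self._chunks, "chars": self._chars, "duration_ms": duration_ms}


timer = ChunkTimer()
for piece in ["Roses ", "are ", "red."]:
    timer.add_chunk(piece)
stats = timer.finish()
```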
Key Integration Files
All LangChain-specific components live under headroom/integrations/langchain/ in the chopratejas/headroom repository:
- headroom/integrations/langchain/chat_model.py: HeadroomChatModel wrapper and optimization logic.
- headroom/integrations/langchain/providers.py: Provider auto-detection utilities.
- headroom/integrations/langchain/memory.py: HeadroomChatMessageHistory for automatic memory compression.
- headroom/integrations/langchain/retriever.py: HeadroomDocumentCompressor for retrieval pipelines.
- headroom/integrations/langchain/agents.py: wrap_tools_with_headroom for tool output compression.
- headroom/integrations/langchain/streaming.py: StreamingMetricsTracker and StreamingMetricsCallback.
- headroom/integrations/langchain/langsmith.py: LangChain-specific observability callbacks.
- wiki/langchain.md: High-level integration documentation.
Summary
- Wrap any chat model with HeadroomChatModel in headroom/integrations/langchain/chat_model.py to apply Headroom compression transparently.
- Provider auto-detection in providers.py matches your LangChain model to the correct Headroom provider for accurate token counting.
- Memory, retriever, and tool integrations compress conversation history, retrieved documents, and agent tool outputs respectively.
- Async and streaming APIs are fully supported through _stream, _agenerate, and _astream.
- Built-in metrics via OptimizationMetrics and get_savings_summary() let you measure token savings on every call.
Frequently Asked Questions
Which LangChain chat models are compatible with Headroom?
Any model that implements LangChain's BaseChatModel interface is compatible. When auto_detect_provider=True, HeadroomChatModel inspects the wrapped class—such as ChatOpenAI or ChatAnthropic—and selects the corresponding Headroom provider for accurate tokenization. If your model is not auto-detected, you can configure the provider manually.
Does Headroom support async and streaming in LangChain?
Yes. The HeadroomChatModel wrapper implements _stream, _agenerate, and _astream, so both synchronous and asynchronous LangChain APIs work without modification. You can call invoke, stream, ainvoke, and astream exactly as you would on the underlying model.
How do I view token savings after a compressed chat call?
Each optimization pass generates an OptimizationMetrics record tracking tokens before and after compression. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for a quick report. For streaming scenarios, use StreamingMetricsTracker from headroom/integrations/langchain/streaming.py to collect real-time usage statistics.
Can I compress old conversation turns automatically?
Yes. Use HeadroomChatMessageHistory from headroom/integrations/langchain/memory.py to wrap any LangChain ChatMessageHistory. Set compress_threshold_tokens to define when compression triggers and keep_recent_turns to preserve the most recent exchanges uncompressed. This is ideal for long-running conversation chains that would otherwise exceed context limits.