How to Use Headroom with LangChain's ChatModel for Context Compression
Wrap any LangChain BaseChatModel with HeadroomChatModel to automatically compress chat contexts before each LLM call, reducing token usage while preserving conversation quality.
The Headroom library (chopratejas/headroom) provides a first-class integration that intercepts messages between LangChain and your LLM, applying intelligent compression via the TransformPipeline. This integration subclasses LangChain's BaseChatModel to ensure full compatibility with existing chains and agents.
How the Integration Works
The LangChain integration centers on HeadroomChatModel, located in headroom/integrations/langchain/chat_model.py. This wrapper intercepts every call to the wrapped LLM, converts LangChain message objects to the OpenAI-style format required by Headroom, runs the TransformPipeline, and converts the optimized messages back to LangChain format before sending to the underlying model.
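The round trip described above can be sketched without any dependencies. This is a minimal illustration, not Headroom's code: the three message classes are stand-ins for LangChain's, and `to_openai_format`, `from_openai_format`, `collapse_whitespace`, and `optimize` are hypothetical names, with a trivial whitespace transform taking the place of the real TransformPipeline.

```python
from dataclasses import dataclass


# Stand-ins for LangChain's message classes so the sketch runs on its own.
@dataclass
class SystemMessage:
    content: str


@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


# OpenAI-style role names keyed by message class, plus the reverse lookup.
ROLE_BY_TYPE = {SystemMessage: "system", HumanMessage: "user", AIMessage: "assistant"}
TYPE_BY_ROLE = {role: cls for cls, role in ROLE_BY_TYPE.items()}


def to_openai_format(messages):
    """Convert message objects to OpenAI-style role/content dicts."""
    return [{"role": ROLE_BY_TYPE[type(m)], "content": m.content} for m in messages]


def from_openai_format(dicts):
    """Convert role/content dicts back to message objects."""
    return [TYPE_BY_ROLE[d["role"]](content=d["content"]) for d in dicts]


def collapse_whitespace(dicts):
    """Placeholder transform (hypothetical): squeeze runs of whitespace."""
    return [{**d, "content": " ".join(d["content"].split())} for d in dicts]


def optimize(messages):
    """Convert, transform, and convert back, mirroring the flow described above."""
    return from_openai_format(collapse_whitespace(to_openai_format(messages)))
```

Keeping the conversion functions separate from the transform means the transform only ever sees plain dicts, which is presumably why an OpenAI-style intermediate format is convenient here.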
Provider Auto-Detection
When auto_detect_provider=True (the default), the wrapper inspects the wrapped model class (e.g., ChatOpenAI, ChatAnthropic) and automatically selects the matching Headroom provider (OpenAIProvider, AnthropicProvider, etc.). This ensures accurate token counting and enables provider-specific caching optimizations.
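As a rough illustration of class-based detection, the sketch below maps a wrapped model's class name to a provider name. The table, the `detect_provider` helper, and the choice to walk the class hierarchy so subclasses also match are assumptions made for illustration; the source only states that the wrapped model class is inspected.

```python
# Hypothetical lookup table: chat model class name -> Headroom provider name.
PROVIDER_BY_CLASS_NAME = {
    "ChatOpenAI": "OpenAIProvider",
    "ChatAnthropic": "AnthropicProvider",
}


def detect_provider(model, default=None):
    """Return the provider name matching the model's class, or `default`.

    Walking the MRO (an assumption of this sketch) lets a subclass of a
    known chat model class resolve to the same provider as its parent.
    """
    for cls in type(model).__mro__:
        provider = PROVIDER_BY_CLASS_NAME.get(cls.__name__)
        if provider is not None:
            return provider
    return default
```

Returning `default` rather than raising leaves the caller free to fall back to a generic token counter when the model class is not recognized.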
Metrics and Observability
Each optimization pass produces an OptimizationMetrics record containing tokens before/after, savings percentage, and applied transforms. The wrapper aggregates these into total_tokens_saved and exposes get_savings_summary() for quick reporting.
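To make the aggregation concrete, here is a small sketch of how per-pass records could roll up into a summary. `PassMetrics` and `summarize` are hypothetical stand-ins, not the library's OptimizationMetrics class or get_savings_summary() method, and the field names are assumed from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class PassMetrics:
    """Hypothetical stand-in for one optimization record."""
    tokens_before: int
    tokens_after: int
    transforms_applied: list = field(default_factory=list)

    @property
    def tokens_saved(self):
        return self.tokens_before - self.tokens_after

    @property
    def savings_percent(self):
        # Guard against empty inputs to avoid dividing by zero.
        if self.tokens_before == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.tokens_before


def summarize(history):
    """Aggregate a list of PassMetrics into totals across all passes."""
    total_before = sum(m.tokens_before for m in history)
    total_saved = sum(m.tokens_saved for m in history)
    overall = 100.0 * total_saved / total_before if total_before else 0.0
    return {
        "passes": len(history),
        "total_tokens_saved": total_saved,
        "overall_savings_percent": overall,
    }
```

Note that the overall percentage is computed from the token totals rather than by averaging per-pass percentages, so large requests weigh more than small ones.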
Installation
Install Headroom with the LangChain extra to access all integration components:
pip install "headroom-ai[langchain]"
This installs the headroom.integrations.langchain subpackage, which includes chat model wrappers, memory helpers, retriever components, and streaming utilities.
Basic Usage: Wrapping Chat Models
To enable compression, wrap your existing LangChain chat model with HeadroomChatModel:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel
# Wrap the model
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like the original model
response = llm.invoke([HumanMessage(content="Explain quantum tunnelling.")])
print(response.content)
# View token-saving stats
print(llm.get_savings_summary())
The wrapped model supports all standard LangChain invocation methods (invoke, batch, stream) while transparently compressing context windows.
Advanced Integration Patterns
Async and Streaming Support
The wrapper implements _stream, _agenerate, and _astream, enabling asynchronous and streaming APIs:
import asyncio
async def async_chat():
    llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
    # Async invoke
    result = await llm.ainvoke([HumanMessage(content="What's the weather in Paris?")])
    print(result.content)
    # Async streaming
    async for chunk in llm.astream([HumanMessage(content="Tell a joke.")]):
        print(chunk.content, end="", flush=True)
# Uncomment to run:
# asyncio.run(async_chat())
Memory Compression with HeadroomChatMessageHistory
For long-running conversations, use HeadroomChatMessageHistory to compress old turns automatically while preserving recent context:
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from headroom.integrations import HeadroomChatMessageHistory
from langchain.chains import ConversationChain
base_history = ChatMessageHistory()
compressed_history = HeadroomChatMessageHistory(
    base_history,
    compress_threshold_tokens=4000,  # compress when >4K tokens
    keep_recent_turns=5,  # always keep the last 5 turns
)
memory = ConversationBufferMemory(chat_memory=compressed_history)
chain = ConversationChain(
    llm=HeadroomChatModel(ChatOpenAI(model="gpt-4o")),
    memory=memory,
)
This implementation, found in headroom/integrations/langchain/memory.py, ensures that only the most recent and salient conversation history reaches the LLM.
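The interplay of the two parameters can be sketched as follows. This is an illustrative approximation under stated assumptions, not the logic in memory.py: `estimate_tokens` uses a crude four-characters-per-token guess, a turn is assumed to be one user message plus one assistant reply, and older turns are replaced with a naive truncated digest rather than whatever compression Headroom actually applies.

```python
def estimate_tokens(text):
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def compress_history(messages, compress_threshold_tokens=4000, keep_recent_turns=5):
    """Compress older messages once the history exceeds the token threshold.

    `messages` is an ordered list of (role, content) tuples. The most recent
    `keep_recent_turns` turns (two messages each) are always kept verbatim.
    """
    total = sum(estimate_tokens(content) for _, content in messages)
    if total <= compress_threshold_tokens:
        return list(messages)

    # Split into older messages to compress and recent messages to keep.
    split = max(0, len(messages) - keep_recent_turns * 2)
    older, recent = messages[:split], messages[split:]
    if not older:
        return list(messages)

    # Naive placeholder for real compression: keep a short prefix of each message.
    digest = " | ".join(content[:40] for _, content in older)
    return [("system", "Earlier conversation (truncated): " + digest)] + list(recent)
```

Checking the threshold first means short conversations pass through untouched, so compression cost is only paid once the history is actually large.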
Document Retrieval Compression
Use HeadroomDocumentCompressor inside LangChain's ContextualCompressionRetriever to filter retrieved documents:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import FAISS
from headroom.integrations import HeadroomDocumentCompressor
# `docs` (a list of Documents) and `embeddings` (an embeddings model) are assumed to be defined elsewhere
vectorstore = FAISS.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
compressor = HeadroomDocumentCompressor(
    max_documents=10,
    min_relevance=0.3,
    prefer_diverse=True,
)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
Found in headroom/integrations/langchain/retriever.py, this component intelligently selects the most relevant documents from large retrieval sets.
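One plausible reading of the three parameters is a relevance cutoff followed by a greedy, diversity-aware selection. The sketch below illustrates that reading only; `select_documents`, the word-overlap `jaccard` measure, and the `max_similarity` cutoff are all assumptions and may differ from what retriever.py actually does.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_documents(scored_docs, max_documents=10, min_relevance=0.3,
                     prefer_diverse=True, max_similarity=0.8):
    """Pick up to `max_documents` from (text, score) pairs.

    Documents below `min_relevance` are dropped. With `prefer_diverse`,
    a candidate is skipped when it is a near-duplicate of one already chosen.
    """
    candidates = sorted(
        (doc for doc in scored_docs if doc[1] >= min_relevance),
        key=lambda doc: doc[1],
        reverse=True,
    )
    selected = []
    for text, score in candidates:
        if len(selected) >= max_documents:
            break
        if prefer_diverse and any(jaccard(text, kept) > max_similarity for kept, _ in selected):
            continue
        selected.append((text, score))
    return selected
```

Retrieving a wide set (k=50 above) and then narrowing it this way trades a little extra retrieval work for a smaller, less repetitive context.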
Tool Output Compression
For agents that generate large tool outputs, use wrap_tools_with_headroom (from headroom/integrations/langchain/agents.py) to compress results before they hit the LLM context window:
from langchain_core.tools import tool
from headroom.integrations import wrap_tools_with_headroom
@tool
def search_web(query: str) -> str:
    """Return a large JSON payload."""
    ...
tools = wrap_tools_with_headroom([search_web], min_chars_to_compress=1000)
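The idea of a size threshold on tool results can be illustrated with a plain decorator. This sketch is not how wrap_tools_with_headroom works internally: `compress_large_output` and `keep_chars` are hypothetical, and simple head-and-tail truncation stands in for Headroom's compression.

```python
import functools


def compress_large_output(min_chars_to_compress=1000, keep_chars=400):
    """Decorator: shorten string results at or above the size threshold.

    Shorter results, and non-string results, pass through unchanged.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not isinstance(result, str) or len(result) < min_chars_to_compress:
                return result
            omitted = len(result) - 2 * keep_chars
            if omitted <= 0:
                return result
            # Keep the head and tail, and say how much was dropped in between.
            return (
                f"{result[:keep_chars]}\n"
                f"[... {omitted} characters omitted ...]\n"
                f"{result[-keep_chars:]}"
            )
        return wrapper
    return decorator
```

Stating the number of omitted characters in the output tells the model that the result was shortened, so it can ask for a narrower query instead of treating the truncated text as complete.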
Monitoring Token Savings
Track real-time compression metrics during streaming with StreamingMetricsTracker:
from headroom.integrations import StreamingMetricsTracker
tracker = StreamingMetricsTracker(model="gpt-4o")
for chunk in llm.stream([HumanMessage(content="Write a poem.")]):
    tracker.add_chunk(chunk)
    print(chunk.content, end="")
metrics = tracker.finish()
print(f"Output tokens: {metrics.output_tokens}, Saved: {tracker.total_saved}")
Alternatively, use StreamingMetricsCallback for LangChain callback-based observability.
Key Source Files
The integration is modularized across the following locations:
- headroom/integrations/langchain/chat_model.py: Core HeadroomChatModel wrapper and optimization logic
- headroom/integrations/langchain/providers.py: Provider auto-detection utilities
- headroom/integrations/langchain/memory.py: HeadroomChatMessageHistory for conversation compression
- headroom/integrations/langchain/retriever.py: HeadroomDocumentCompressor for RAG pipelines
- headroom/integrations/langchain/agents.py: Tool-wrapping helpers (wrap_tools_with_headroom)
- headroom/integrations/langchain/streaming.py: StreamingMetricsTracker and async streaming utilities
- headroom/integrations/langchain/langsmith.py: LangSmith-specific callbacks for observability
Summary
- Wrap existing models: Use HeadroomChatModel to add compression to any LangChain BaseChatModel without changing your invocation code.
- Automatic provider detection: The wrapper identifies OpenAI, Anthropic, and other providers to ensure accurate token counting.
- Full async support: All methods (invoke, stream, ainvoke, astream) work with compression enabled.
- Memory and retrieval: Specialized classes handle conversation history and document compression automatically.
- Built-in metrics: Access get_savings_summary() and OptimizationMetrics to quantify token reductions.
Frequently Asked Questions
Does Headroom support streaming with LangChain?
Yes. HeadroomChatModel implements _stream and _astream methods, allowing you to use both synchronous llm.stream() and asynchronous llm.astream() methods. The compression occurs on the full context before streaming begins, ensuring the wrapped LLM receives optimized input while still streaming output tokens incrementally.
How does the provider auto-detection work?
When auto_detect_provider=True (default), the wrapper inspects the class name of the wrapped model (e.g., ChatOpenAI, ChatAnthropic) and maps it to the corresponding Headroom provider class (OpenAIProvider, AnthropicProvider). This mapping ensures correct token limits, pricing calculations, and provider-specific caching strategies are applied without manual configuration.
Can I use Headroom with existing LangChain memory classes?
Yes. Instead of replacing your memory implementation, wrap the underlying ChatMessageHistory with HeadroomChatMessageHistory. This allows you to use standard LangChain memory classes like ConversationBufferMemory while automatically compressing historical turns that exceed your compress_threshold_tokens limit, keeping the most recent keep_recent_turns uncompressed for immediate context.
What metrics does Headroom provide for tracking compression?
Each optimization generates an OptimizationMetrics object containing tokens_before, tokens_after, savings_percent, and applied transforms. The HeadroomChatModel aggregates these into total_tokens_saved, accessible via get_savings_summary(). For streaming scenarios, StreamingMetricsTracker provides real-time token accounting including output token counts and generation duration.