How to Integrate Headroom with LangChain for Context Compression
Wrap any LangChain BaseChatModel in HeadroomChatModel to automatically compress, cache, and trim prompts before they reach the LLM, preserving essential information while staying inside the model's context window.
Headroom is a plug-in-style SDK that optimizes LLM prompts through automatic context compression. When combined with LangChain, the HeadroomChatModel wrapper intercepts outbound messages, applies a transformation pipeline, and forwards the optimized payload to your underlying chat model. According to the chopratejas/headroom source code, this integration requires no changes to existing chain logic and supports sync, async, streaming, and tool-bound workflows.
Why Wrap Instead of Using a Callback?
LangChain callbacks cannot mutate the message list due to a framework design limitation. Because of this constraint, Headroom implements a wrapper rather than a callback, ensuring that all outbound messages are processed regardless of whether the user calls invoke, batch, stream, or ainvoke.
How HeadroomChatModel Works
The core integration lives in headroom/integrations/langchain/chat_model.py. The wrapper fully implements the LangChain BaseChatModel API and runs every request through a multi-stage optimization pass.
Message Conversion
LangChain messages—including SystemMessage, HumanMessage, AIMessage, and ToolMessage—are translated to an OpenAI-compatible JSON format and back. This bidirectional mapping is handled by _convert_messages_to_openai and _convert_messages_from_openai inside headroom/integrations/langchain/chat_model.py (lines 47–86).
Pipeline Creation and Auto-Detection
On first use, the wrapper lazily builds a TransformPipeline (headroom.transforms.pipeline.TransformPipeline) that holds all compression transforms such as SmartCrusher and CacheAligner. The provider—OpenAI, Anthropic, or another—is auto-detected from the wrapped model via get_headroom_provider in headroom/integrations/langchain/chat_model.py (lines 24–33).
Optimization Pass
Before each LLM call, messages are passed through self.pipeline.apply(...). The pipeline returns a Result object containing the optimized message list and token statistics (tokens_before, tokens_after, transforms_applied). These values are stored in an OptimizationMetrics dataclass and aggregated on the model instance through _metrics_history and _total_tokens_saved (lines 120–138 and 48–75).
Delegation and Observability
After optimization, the reduced message set is handed to the original LangChain model’s _generate, _stream, _agenerate, or _astream methods. You can inspect savings via HeadroomChatModel.get_savings_summary(), and the companion HeadroomCallbackHandler exposes token-saving statistics for cost-control dashboards (lines 104–112).
Basic Integration Examples
Minimal Sync Wrapper
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel
base = ChatOpenAI(model="gpt-4o-mini")
optimized = HeadroomChatModel(base) # ← adds compression
resp = optimized.invoke([HumanMessage("Summarize the latest news.")])
print(resp.content)
Async Usage
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel
async def main():
base = ChatOpenAI(model="gpt-4o")
optimized = HeadroomChatModel(base)
result = await optimized.ainvoke([HumanMessage("What is the meaning of life?")])
print(result.content)
asyncio.run(main())
Streaming with Token Savings
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel
base = ChatOpenAI(model="gpt-4o", streaming=True)
optimized = HeadroomChatModel(base)
for chunk in optimized.stream([HumanMessage("Write a poem about AI.")]):
print(chunk.content, end="", flush=True)
print("\nTokens saved:", optimized.total_tokens_saved)
Advanced Patterns
Custom Configuration
Supply a HeadroomConfig to tweak thresholds, enable rolling windows, or adjust compression behavior:
from headroom import HeadroomConfig, HeadroomMode
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
config = HeadroomConfig(
smart_crusher_threshold=400, # compress tool output > 400 tokens
smart_crusher_max_items=15, # keep only the top 15 items
cache_alignment=True,
rolling_window=True,
)
base = ChatOpenAI(model="gpt-4o")
headroom_llm = HeadroomChatModel(base, config=config, mode=HeadroomMode.OPTIMIZE)
Tool Binding
Because HeadroomChatModel implements the full BaseChatModel interface, tool binding works exactly like a normal LangChain model:
tools = [search_tool, docs_tool] # LangChain @tool functions
headroom_llm = headroom_llm.bind_tools(tools) # type: ignore[arg-type]
LCEL Composition
For LangChain Expression Language (LCEL) chains, insert the HeadroomRunnable between your prompt and LLM. The runnable is defined in headroom/integrations/langchain/runnable.py:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from headroom.integrations import HeadroomRunnable
prompt = ChatPromptTemplate.from_messages([
("system", "You are a concise assistant."),
("user", "{input}"),
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | HeadroomRunnable() | llm
print(chain.invoke({"input": "List the steps to brew coffee."}))
Inspecting Token Savings
After running inference, retrieve aggregated statistics with get_savings_summary():
summary = optimized.get_savings_summary()
print("Total tokens saved:", summary["total_tokens_saved"])
print("Average savings %:", summary["average_savings_percent"])
The underlying data is tracked in _metrics_history and _total_tokens_saved on the HeadroomChatModel instance (lines 120–138).
Real-World Demo Reference
The repository ships an end-to-end demo at examples/langchain_demo/run_comparison.py (lines 41–68). The script creates a HeadroomChatModel, runs realistic support-scenario queries with and without compression, and prints token-saving statistics. It is the best reference for wiring together multi-tool agents and observing before/after behavior.
Summary
- Wrap, don't callback: Because LangChain callbacks cannot mutate messages, Headroom uses
HeadroomChatModelinheadroom/integrations/langchain/chat_model.pyto intercept and optimize every request. - Auto-detection: The provider is auto-detected via
get_headroom_providerinheadroom/integrations/langchain/providers.py, and messages are converted using_convert_messages_to_openaiand_convert_messages_from_openai. - Pipeline architecture: A lazy
TransformPipelineapplies compression transforms such asSmartCrusherandCacheAlignerbefore the call reaches_generate,_stream,_agenerate, or_astream. - Full API coverage: The wrapper supports sync, async, streaming, tool binding, and LCEL composition through
HeadroomRunnable. - Observability built-in: Token savings are aggregated in
OptimizationMetricsand exposed viaget_savings_summary()andHeadroomCallbackHandler.
Frequently Asked Questions
Does Headroom work with any LangChain chat model?
Yes. HeadroomChatModel wraps any class that inherits from LangChain's BaseChatModel, including models from langchain-openai, langchain-anthropic, and langchain-google-genai. The provider is auto-detected in headroom/integrations/langchain/providers.py so the correct compression profile is selected automatically.
Can I use Headroom with async and streaming methods?
Yes. The wrapper delegates to the underlying model's _generate, _stream, _agenerate, and _astream methods after optimization. You can call invoke, stream, ainvoke, or astream on a HeadroomChatModel instance exactly as you would on the base model.
Why can't I just use a LangChain callback for context compression?
LangChain callbacks are designed for observation and logging, not mutation. Due to a framework limitation, callbacks cannot modify the message list before it is sent to the LLM. Headroom solves this by wrapping the model itself, guaranteeing that all outbound traffic passes through the compression pipeline.
Where are the compression transforms implemented?
Individual transforms such as SmartCrusher and CacheAligner live in headroom/transforms/*.py. The wrapper assembles them into a TransformPipeline (headroom.transforms.pipeline.TransformPipeline) that is lazily instantiated when the first request is made.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →