How to Integrate Headroom with LangChain for Context Compression

Wrap any LangChain BaseChatModel in HeadroomChatModel to automatically compress, cache, and trim prompts before they reach the LLM, preserving essential information while staying inside the model's context window.

Headroom is a plug-in-style SDK that optimizes LLM prompts through automatic context compression. When combined with LangChain, the HeadroomChatModel wrapper intercepts outbound messages, applies a transformation pipeline, and forwards the optimized payload to your underlying chat model. According to the chopratejas/headroom source code, this integration requires no changes to existing chain logic and supports sync, async, streaming, and tool-bound workflows.

Why Wrap Instead of Using a Callback?

LangChain callbacks cannot mutate the message list due to a framework design limitation. Because of this constraint, Headroom implements a wrapper rather than a callback, ensuring that all outbound messages are processed regardless of whether the user calls invoke, batch, stream, or ainvoke.

How HeadroomChatModel Works

The core integration lives in headroom/integrations/langchain/chat_model.py. The wrapper fully implements the LangChain BaseChatModel API and runs every request through a multi-stage optimization pass.

Message Conversion

LangChain messages—including SystemMessage, HumanMessage, AIMessage, and ToolMessage—are translated to an OpenAI-compatible JSON format and back. This bidirectional mapping is handled by _convert_messages_to_openai and _convert_messages_from_openai inside headroom/integrations/langchain/chat_model.py (lines 47–86).

Pipeline Creation and Auto-Detection

On first use, the wrapper lazily builds a TransformPipeline (headroom.transforms.pipeline.TransformPipeline) that holds all compression transforms such as SmartCrusher and CacheAligner. The provider—OpenAI, Anthropic, or another—is auto-detected from the wrapped model via get_headroom_provider in headroom/integrations/langchain/chat_model.py (lines 24–33).

Optimization Pass

Before each LLM call, messages are passed through self.pipeline.apply(...). The pipeline returns a Result object containing the optimized message list and token statistics (tokens_before, tokens_after, transforms_applied). These values are stored in an OptimizationMetrics dataclass and aggregated on the model instance through _metrics_history and _total_tokens_saved (lines 120–138 and 48–75).

Delegation and Observability

After optimization, the reduced message set is handed to the original LangChain model’s _generate, _stream, _agenerate, or _astream methods. You can inspect savings via HeadroomChatModel.get_savings_summary(), and the companion HeadroomCallbackHandler exposes token-saving statistics for cost-control dashboards (lines 104–112).

Basic Integration Examples

Minimal Sync Wrapper

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

base = ChatOpenAI(model="gpt-4o-mini")
optimized = HeadroomChatModel(base)  # ← adds compression

resp = optimized.invoke([HumanMessage("Summarize the latest news.")])
print(resp.content)

Async Usage

import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

async def main():
    base = ChatOpenAI(model="gpt-4o")
    optimized = HeadroomChatModel(base)
    result = await optimized.ainvoke([HumanMessage("What is the meaning of life?")])
    print(result.content)

asyncio.run(main())

Streaming with Token Savings

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from headroom.integrations import HeadroomChatModel

base = ChatOpenAI(model="gpt-4o", streaming=True)
optimized = HeadroomChatModel(base)

for chunk in optimized.stream([HumanMessage("Write a poem about AI.")]):
    print(chunk.content, end="", flush=True)

print("\nTokens saved:", optimized.total_tokens_saved)

Advanced Patterns

Custom Configuration

Supply a HeadroomConfig to tweak thresholds, enable rolling windows, or adjust compression behavior:

from headroom import HeadroomConfig, HeadroomMode
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

config = HeadroomConfig(
    smart_crusher_threshold=400,   # compress tool output > 400 tokens

    smart_crusher_max_items=15,    # keep only the top 15 items

    cache_alignment=True,
    rolling_window=True,
)

base = ChatOpenAI(model="gpt-4o")
headroom_llm = HeadroomChatModel(base, config=config, mode=HeadroomMode.OPTIMIZE)

Tool Binding

Because HeadroomChatModel implements the full BaseChatModel interface, tool binding works exactly like a normal LangChain model:

tools = [search_tool, docs_tool]  # LangChain @tool functions

headroom_llm = headroom_llm.bind_tools(tools)  # type: ignore[arg-type]

LCEL Composition

For LangChain Expression Language (LCEL) chains, insert the HeadroomRunnable between your prompt and LLM. The runnable is defined in headroom/integrations/langchain/runnable.py:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from headroom.integrations import HeadroomRunnable

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("user", "{input}"),
])
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | HeadroomRunnable() | llm

print(chain.invoke({"input": "List the steps to brew coffee."}))

Inspecting Token Savings

After running inference, retrieve aggregated statistics with get_savings_summary():

summary = optimized.get_savings_summary()
print("Total tokens saved:", summary["total_tokens_saved"])
print("Average savings %:", summary["average_savings_percent"])

The underlying data is tracked in _metrics_history and _total_tokens_saved on the HeadroomChatModel instance (lines 120–138).

Real-World Demo Reference

The repository ships an end-to-end demo at examples/langchain_demo/run_comparison.py (lines 41–68). The script creates a HeadroomChatModel, runs realistic support-scenario queries with and without compression, and prints token-saving statistics. It is the best reference for wiring together multi-tool agents and observing before/after behavior.

Summary

  • Wrap, don't callback: Because LangChain callbacks cannot mutate messages, Headroom uses HeadroomChatModel in headroom/integrations/langchain/chat_model.py to intercept and optimize every request.
  • Auto-detection: The provider is auto-detected via get_headroom_provider in headroom/integrations/langchain/providers.py, and messages are converted using _convert_messages_to_openai and _convert_messages_from_openai.
  • Pipeline architecture: A lazy TransformPipeline applies compression transforms such as SmartCrusher and CacheAligner before the call reaches _generate, _stream, _agenerate, or _astream.
  • Full API coverage: The wrapper supports sync, async, streaming, tool binding, and LCEL composition through HeadroomRunnable.
  • Observability built-in: Token savings are aggregated in OptimizationMetrics and exposed via get_savings_summary() and HeadroomCallbackHandler.

Frequently Asked Questions

Does Headroom work with any LangChain chat model?

Yes. HeadroomChatModel wraps any class that inherits from LangChain's BaseChatModel, including models from langchain-openai, langchain-anthropic, and langchain-google-genai. The provider is auto-detected in headroom/integrations/langchain/providers.py so the correct compression profile is selected automatically.

Can I use Headroom with async and streaming methods?

Yes. The wrapper delegates to the underlying model's _generate, _stream, _agenerate, and _astream methods after optimization. You can call invoke, stream, ainvoke, or astream on a HeadroomChatModel instance exactly as you would on the base model.

Why can't I just use a LangChain callback for context compression?

LangChain callbacks are designed for observation and logging, not mutation. Due to a framework limitation, callbacks cannot modify the message list before it is sent to the LLM. Headroom solves this by wrapping the model itself, guaranteeing that all outbound traffic passes through the compression pipeline.

Where are the compression transforms implemented?

Individual transforms such as SmartCrusher and CacheAligner live in headroom/transforms/*.py. The wrapper assembles them into a TransformPipeline (headroom.transforms.pipeline.TransformPipeline) that is lazily instantiated when the first request is made.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →