How Headroom Works with Agno Models for Agentic Workflows
Headroom wraps any Agno model in a HeadroomAgnoModel that intercepts every LLM call to compress message history via a token-saving pipeline before delegating back to the underlying model.
The open-source Headroom library from chopratejas/headroom gives Agno agents built-in context optimisation by acting as a transparent proxy layer between the agent and the model. When you integrate Headroom with Agno models for agentic workflows, every normal request, tool call, and streamed response passes through a compression pipeline that reduces token usage without breaking reasoning chains. This integration lives in the headroom/integrations/agno/ package and requires no changes to existing agent logic.
HeadroomAgnoModel Wrapper Architecture
The core of the integration is the HeadroomAgnoModel dataclass defined in headroom/integrations/agno/model.py. It inherits from agno.models.base.Model and stores three critical attributes: the original Agno model, a HeadroomConfig, and a lazily initialised TransformPipeline.
Because the wrapper subclasses Agno’s base model, it overrides every response* and invoke* method. Each override first calls _optimize_messages to compress the message history, then delegates to the wrapped model’s matching method. This design means Headroom’s optimisation happens right at the model layer, ensuring no LLM call bypasses the pipeline. A threading.Lock protects the shared metrics state so concurrent agent requests remain safe.
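The wrap-then-delegate pattern described above can be sketched in a few lines. This is a minimal illustration, not Headroom's code: EchoModel, drop_empty, and OptimizingWrapper are hypothetical stand-ins, and the "compression" is a toy filter that drops blank messages.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]


class EchoModel:
    """Hypothetical stand-in for a wrapped model; not an Agno class."""

    def invoke(self, messages: List[Message]) -> str:
        return f"received {len(messages)} messages"


def drop_empty(messages: List[Message]) -> List[Message]:
    """Toy 'compression': remove messages whose content is blank."""
    return [m for m in messages if m["content"].strip()]


@dataclass
class OptimizingWrapper:
    """Hypothetical wrapper: optimise first, then delegate."""

    wrapped: EchoModel
    optimize: Callable[[List[Message]], List[Message]] = drop_empty
    messages_dropped: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def invoke(self, messages: List[Message]) -> str:
        optimized = self.optimize(messages)
        # Metrics are shared across threads, so update them under the lock.
        with self._lock:
            self.messages_dropped += len(messages) - len(optimized)
        # Delegate to the wrapped model's matching method.
        return self.wrapped.invoke(optimized)


wrapper = OptimizingWrapper(EchoModel())
reply = wrapper.invoke([
    {"role": "user", "content": "hello"},
    {"role": "user", "content": "   "},
])
print(reply, wrapper.messages_dropped)
```

Because the caller only ever talks to the wrapper, the optimisation step cannot be skipped by accident, which is the property the section attributes to the model-layer design.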
Provider Detection and Token Accounting
Accurate token counting requires knowing which provider tokenizer to use. Headroom solves this in headroom/integrations/agno/providers.py through the get_headroom_provider helper.
This function inspects the Agno model’s class name, module path, or model ID to map it to the correct Headroom token-counter—whether the backend is OpenAI, Anthropic, Google, Cohere, or another supported provider. If detection fails, the system falls back to OpenAIProvider and emits a runtime warning so the agent keeps running with a best-effort count.
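A detection routine of this kind might look roughly like the sketch below. The keyword table, the detect_provider function, and the two stub classes are assumptions made for illustration; the actual mapping in get_headroom_provider may differ.

```python
import warnings

# Hypothetical keyword-to-provider table; the real mapping may differ.
PROVIDER_KEYWORDS = {
    "anthropic": "anthropic",
    "claude": "anthropic",
    "openai": "openai",
    "gpt": "openai",
    "gemini": "google",
    "google": "google",
    "cohere": "cohere",
}


def detect_provider(model: object) -> str:
    """Guess a provider name from the class name, module path, or model ID."""
    candidates = [
        type(model).__name__,
        type(model).__module__,
        str(getattr(model, "id", "")),
    ]
    for text in candidates:
        lowered = text.lower()
        for keyword, provider in PROVIDER_KEYWORDS.items():
            if keyword in lowered:
                return provider
    # Nothing matched: warn and fall back to OpenAI-style counting.
    warnings.warn(
        "Unknown provider; falling back to OpenAI token counting",
        RuntimeWarning,
    )
    return "openai"


class Claude:
    """Stub model whose class name reveals the provider."""
    __module__ = "example.models.anthropic"
    id = "claude-sonnet"


class Mystery:
    """Stub model that matches no known keyword."""
    __module__ = "example.models.local"
    id = "local-7b"


print(detect_provider(Claude()))
```

Falling back with a warning rather than raising keeps the agent running, at the cost of a token count that may be slightly off for the unrecognised backend.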
Message Flow Through the Pipeline
When an Agno agent calls model.response(messages)—including async variants like aresponse and streaming methods like response_stream—the wrapper executes the following sequence:
- _ensure_message_objects validates that all items in the incoming list are properly typed Agno Message objects.
- _optimize_messages converts the Agno messages into OpenAI-formatted dictionaries via _convert_messages_to_openai.
- The wrapper checks for Claude extended-thinking blocks with _has_thinking_blocks. If present, optimisation is skipped entirely to preserve native reasoning structure.
- Assuming no thinking blocks, the wrapper lazily initialises the TransformPipeline, auto-detecting the provider via get_headroom_provider.
- The pipeline runs transforms such as SmartCrusher and TagProtector on the message list, producing optimised messages and updating OptimizationMetrics.
- The running total of saved tokens is accumulated in _total_tokens_saved.
- The OpenAI-formatted messages are converted back into Agno Message objects via _convert_messages_from_openai so the underlying model can call its own _log_messages.
- Finally, the wrapper delegates to the wrapped model’s invoke, ainvoke, invoke_stream, or ainvoke_stream.
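The skip-on-thinking-blocks decision and the saved-token accounting in the sequence above can be sketched as follows. Everything here is an assumption for illustration: the block types checked ("thinking" and "redacted_thinking") follow Anthropic's content-block naming, the token count is a crude whitespace estimate, and truncate_long is a toy transform standing in for the real pipeline.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def has_thinking_blocks(messages: List[Message]) -> bool:
    """Return True if any message carries a thinking-style content block."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") in (
                    "thinking",
                    "redacted_thinking",
                ):
                    return True
    return False


def count_tokens(messages: List[Message]) -> int:
    """Crude estimate: whitespace-separated words in string content."""
    return sum(
        len(m["content"].split())
        for m in messages
        if isinstance(m.get("content"), str)
    )


def truncate_long(messages: List[Message], limit: int = 5) -> List[Message]:
    """Toy transform: keep only the first `limit` words of each message."""
    result = []
    for m in messages:
        content = m.get("content")
        if isinstance(content, str) and len(content.split()) > limit:
            m = {**m, "content": " ".join(content.split()[:limit])}
        result.append(m)
    return result


def optimize(messages: List[Message]) -> Tuple[List[Message], int]:
    """Skip optimisation when thinking blocks are present, else compress."""
    if has_thinking_blocks(messages):
        return messages, 0
    optimized = truncate_long(messages)
    return optimized, count_tokens(messages) - count_tokens(optimized)


plain = [{"role": "user", "content": "one two three four five six seven"}]
thinking = [{"role": "assistant",
             "content": [{"type": "thinking", "thinking": "step by step"}]}]

print(optimize(plain))
print(optimize(thinking))
```

Returning the original list untouched in the thinking-block case mirrors the all-or-nothing skip described in the third step.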
Integration Code Examples
You can wrap an Agno model in a single line.
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
For production agentic workflows, attach the optional observability hooks to monitor token spend.
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

# Wrap the model so every LLM call passes through Headroom.
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

# Optional hooks; the post-hook alerts above the given token threshold.
pre = HeadroomPreHook()
post = HeadroomPostHook(token_alert_threshold=10_000)

agent = Agent(model=model, pre_hooks=[pre], post_hooks=[post])
answer = agent.run("Summarize the latest AI research.")
print("Tokens saved:", model.total_tokens_saved)
print("Post-hook summary:", post.get_summary())
Async and streaming endpoints are supported transparently.
import asyncio

from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel


async def main():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
    msgs = [{"role": "user", "content": "Explain quantum computing"}]
    # The async path runs the same optimisation before delegating.
    resp = await model.aresponse(msgs)
    print(resp)


asyncio.run(main())
If you prefer to optimise messages without wrapping a model, use the standalone helper from headroom/integrations/agno/model.py.
from headroom.integrations.agno import optimize_messages
msgs = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me a 10 KB JSON report on climate data."},
]
opt_msgs, metrics = optimize_messages(msgs, model="gpt-4o")
print("Saved:", metrics["tokens_saved"])
You can also disable provider auto-detection to force a specific tokenizer.
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(
    OpenAIChat(id="gpt-4o"),
    auto_detect_provider=False,  # forces OpenAI token counting
)
Resilience and Observability Features
The Headroom Agno integration is built for production agentic workflows where uptime matters. If the TransformPipeline defined in headroom/transforms/pipeline.py raises an exception during optimisation, the wrapper catches the error, logs a warning, and proceeds with the original unoptimised messages. This stateless fallback guarantees that a compression failure never breaks the agent.
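The fallback behaviour amounts to a guarded call around the pipeline. The sketch below shows the idea under stated assumptions: safe_optimize, broken_pipeline, and the logger name are hypothetical, and the real wrapper may log or structure this differently.

```python
import logging
from typing import Callable, Dict, List

Message = Dict[str, str]

logger = logging.getLogger("headroom_sketch")  # hypothetical logger name


def safe_optimize(
    messages: List[Message],
    pipeline: Callable[[List[Message]], List[Message]],
) -> List[Message]:
    """Run the pipeline, returning the original messages if it raises."""
    try:
        return pipeline(messages)
    except Exception as exc:
        # Log and continue with the unoptimised input.
        logger.warning("Optimisation failed, using original messages: %s", exc)
        return messages


def broken_pipeline(messages: List[Message]) -> List[Message]:
    """Toy pipeline that always fails, to exercise the fallback."""
    raise ValueError("transform crashed")


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

print(safe_optimize(msgs, broken_pipeline) is msgs)
```

Because the function keeps no state between calls, a failure on one request has no effect on the next one.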
Additionally, optional pre- and post-hooks—HeadroomPreHook, HeadroomPostHook, and the create_headroom_hooks factory—let you record request-level metrics and emit alerts when token usage crosses thresholds without modifying model code. The user-facing integration guide at wiki/agno.md contains full installation steps and troubleshooting notes. The test suite in tests/test_integrations/agno/test_model.py validates wrapping behaviour, provider detection, and metric tracking.
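To make the threshold-alert idea concrete, here is a small sketch of a post-request hook that records usage and flags expensive requests. TokenAlertHook and its fields are hypothetical and only loosely modelled on the token_alert_threshold and get_summary names used earlier; this is not the implementation of HeadroomPostHook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokenAlertHook:
    """Hypothetical hook that tracks token usage per request."""

    token_alert_threshold: int
    requests: int = 0
    total_tokens: int = 0
    alerts: List[str] = field(default_factory=list)

    def __call__(self, tokens_used: int) -> None:
        self.requests += 1
        self.total_tokens += tokens_used
        # Record an alert when a single request exceeds the threshold.
        if tokens_used > self.token_alert_threshold:
            self.alerts.append(
                f"request {self.requests} used {tokens_used} tokens"
            )

    def get_summary(self) -> Dict[str, int]:
        return {
            "requests": self.requests,
            "total_tokens": self.total_tokens,
            "alerts": len(self.alerts),
        }


hook = TokenAlertHook(token_alert_threshold=10_000)
hook(2_500)
hook(12_000)
print(hook.get_summary())
```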
Summary
- The HeadroomAgnoModel wraps any Agno model to inject token compression at the model layer.
- Provider auto-detection in headroom/integrations/agno/providers.py ensures accurate token counting across OpenAI, Anthropic, Google, and Cohere backends.
- Every LLM call follows an eight-step message flow that converts, optionally compresses, and converts back Agno messages while respecting Claude extended-thinking blocks.
- The integration is thread-safe, includes a stateless fallback on pipeline errors, and supports optional observability hooks for production monitoring.
- Standalone helpers like optimize_messages let developers use Headroom without permanently wrapping a model.
Frequently Asked Questions
What happens if Headroom cannot detect the Agno model provider?
If get_headroom_provider cannot infer the provider from the model class or ID, it falls back to OpenAIProvider and emits a runtime warning so token counting continues with a best-effort estimate. This fallback ensures your agent keeps running even when the provider mapping is ambiguous.
Does Headroom break Claude’s extended-thinking or reasoning blocks?
No. The wrapper explicitly checks for extended-thinking blocks via _has_thinking_blocks inside headroom/integrations/agno/model.py, and if they are present it skips optimisation entirely to preserve the native reasoning structure. This prevents Headroom from altering the special message content that Claude requires for reasoning.
Can I use Headroom with async and streaming Agno agents?
Yes. HeadroomAgnoModel overrides the async variants (aresponse, ainvoke, ainvoke_stream) and streaming methods (response_stream, invoke_stream) so all invocation patterns pass through the same _optimize_messages pipeline. As a result, async and streaming agents receive the same token savings as synchronous calls without any extra configuration.
Is it possible to optimise messages without wrapping the Agno model?
Yes. The optimize_messages helper exported from headroom/integrations/agno applies the same TransformPipeline logic to a raw message list and returns the compressed messages plus OptimizationMetrics, without requiring a HeadroomAgnoModel instance. It is useful for one-off optimisations or when you want to inspect metrics before deciding to wrap a model permanently.