How Headroom Works with Agno Models for Agentic Workflows

Headroom wraps any Agno model in a HeadroomAgnoModel that intercepts every LLM call to compress message history via a token-saving pipeline before delegating back to the underlying model.

The open-source Headroom library from chopratejas/headroom gives Agno agents built-in context optimisation by acting as a transparent proxy layer between the agent and the model. When you integrate Headroom with Agno models for agentic workflows, every normal request, tool call, and streamed response passes through a compression pipeline that reduces token usage without breaking reasoning chains. This integration lives in the headroom/integrations/agno/ package and requires no changes to existing agent logic.

HeadroomAgnoModel Wrapper Architecture

The core of the integration is the HeadroomAgnoModel dataclass defined in headroom/integrations/agno/model.py. It inherits from agno.models.base.Model and stores three critical attributes: the original Agno model, a HeadroomConfig, and a lazily initialised TransformPipeline.

Because the wrapper subclasses Agno’s base model, it overrides every response* and invoke* method. Each override first calls _optimize_messages to compress the message history, then delegates to the wrapped model’s matching method. This design means Headroom’s optimisation happens right at the model layer, ensuring no LLM call bypasses the pipeline. A threading.Lock protects the shared metrics state so concurrent agent requests remain safe.
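The optimise-then-delegate pattern described above can be sketched without any Agno or Headroom dependency. Everything in this sketch is hypothetical: EchoModel, OptimizingWrapper, and keep_last_two are illustrative stand-ins, not the library's classes, and the toy "compression" simply drops older messages.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


class EchoModel:
    """Hypothetical stand-in for a wrapped Agno model."""

    def response(self, messages: List[Message]) -> str:
        return f"saw {len(messages)} message(s)"


@dataclass
class OptimizingWrapper:
    """Illustrative wrapper: optimise the history first, then delegate."""

    wrapped: Any
    optimize: Callable[[List[Message]], List[Message]]
    total_dropped: int = 0
    _lock: Any = field(default_factory=threading.Lock, repr=False)

    def response(self, messages: List[Message]) -> str:
        optimized = self.optimize(messages)
        # Guard the shared counter so concurrent calls do not lose updates.
        with self._lock:
            self.total_dropped += len(messages) - len(optimized)
        # Delegate to the wrapped model's matching method.
        return self.wrapped.response(optimized)


def keep_last_two(messages: List[Message]) -> List[Message]:
    """Toy 'compression': keep only the two most recent messages."""
    return messages[-2:]


model = OptimizingWrapper(EchoModel(), keep_last_two)
history = [{"role": "user", "content": str(i)} for i in range(5)]
reply = model.response(history)
print(reply, "| dropped:", model.total_dropped)
```

Because the counter update is the only shared mutable state, holding the lock just around that line keeps the critical section short and leaves the model call itself unserialised.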

Provider Detection and Token Accounting

Accurate token counting requires knowing which provider tokenizer to use. Headroom solves this in headroom/integrations/agno/providers.py through the get_headroom_provider helper.

This function inspects the Agno model’s class name, module path, or model ID to map it to the correct Headroom token counter, whether the backend is OpenAI, Anthropic, Google, Cohere, or another supported provider. If detection fails, the system falls back to OpenAIProvider and emits a runtime warning so the agent keeps running with a best-effort count.
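A rough sketch of this kind of detection follows. It is not the code in providers.py: the detect_provider function, the keyword table, and the string return values are assumptions made for illustration, and the real helper returns Headroom provider objects rather than strings.

```python
import warnings
from typing import Any

# Hypothetical keyword table; the real mapping in providers.py may differ.
_KEYWORDS = {
    "anthropic": ("anthropic", "claude"),
    "google": ("google", "gemini"),
    "cohere": ("cohere", "command-r"),
    "openai": ("openai", "gpt"),
}


def detect_provider(model: Any) -> str:
    """Guess a provider name from the class name, module path, and model ID."""
    haystack = " ".join(
        [
            type(model).__name__,
            type(model).__module__,
            str(getattr(model, "id", "")),
        ]
    ).lower()
    for provider, keywords in _KEYWORDS.items():
        if any(keyword in haystack for keyword in keywords):
            return provider
    # Nothing matched: warn and fall back so the caller keeps running.
    warnings.warn(
        "Could not detect provider; falling back to OpenAI token counting",
        RuntimeWarning,
    )
    return "openai"


class Claude:
    """Hypothetical model class used only to exercise the sketch."""

    id = "claude-example"


print(detect_provider(Claude()))
```

Substring matching is deliberately loose here. A stricter version could check the module path first and treat the model ID only as a tie-breaker.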

Message Flow Through the Pipeline

When an Agno agent calls model.response(messages), or an async variant such as aresponse or a streaming method such as response_stream, the wrapper executes the following sequence:

  1. _ensure_message_objects validates that all items in the incoming list are properly typed Agno Message objects.
  2. _optimize_messages converts the Agno messages into OpenAI-formatted dictionaries via _convert_messages_to_openai.
  3. The wrapper checks for Claude extended-thinking blocks with _has_thinking_blocks. If present, optimisation is skipped entirely to preserve native reasoning structure.
  4. Assuming no thinking blocks, the wrapper lazily initialises the TransformPipeline, auto-detecting the provider via get_headroom_provider.
  5. The pipeline runs transforms such as SmartCrusher and TagProtector on the message list, producing optimised messages and updating OptimizationMetrics.
  6. The running total of saved tokens is accumulated in _total_tokens_saved.
  7. The OpenAI-formatted messages are converted back into Agno Message objects via _convert_messages_from_openai so the underlying model can call its own _log_messages.
  8. Finally, the wrapper delegates to the wrapped model’s invoke, ainvoke, invoke_stream, or ainvoke_stream.
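The middle of this sequence (steps 3 to 6) can be approximated as follows, assuming messages are already OpenAI-style dictionaries. The helper names, the whitespace token counter, and the truncating transform are hypothetical simplifications. They are not Headroom's _has_thinking_blocks, tokenizers, or SmartCrusher.

```python
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]


def has_thinking_blocks(messages: List[Message]) -> bool:
    """True if any message has a list-style content block of type 'thinking'."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict) and block.get("type") == "thinking"
            for block in content
        ):
            return True
    return False


def count_tokens(messages: List[Message]) -> int:
    """Crude whitespace count, standing in for a provider tokenizer."""
    return sum(len(str(m.get("content", "")).split()) for m in messages)


def truncate_tool_output(messages: List[Message], limit: int = 5) -> List[Message]:
    """Toy transform: keep only the first `limit` words of each tool message."""
    out = []
    for m in messages:
        if m.get("role") == "tool" and isinstance(m.get("content"), str):
            m = {**m, "content": " ".join(m["content"].split()[:limit])}
        out.append(m)
    return out


def optimize(
    messages: List[Message],
    transform: Callable[[List[Message]], List[Message]],
) -> Tuple[List[Message], int]:
    """Return (messages, tokens_saved), skipping work when thinking blocks exist."""
    if has_thinking_blocks(messages):
        return messages, 0
    before = count_tokens(messages)
    optimized = transform(messages)
    return optimized, max(before - count_tokens(optimized), 0)


msgs = [
    {"role": "user", "content": "list files"},
    {"role": "tool", "content": "a b c d e f g h"},
]
optimized, saved = optimize(msgs, truncate_tool_output)
print(optimized[1]["content"], "| saved:", saved)
```

Checking for thinking blocks before counting tokens means the skip path does no tokenizer work at all, which mirrors the ordering of steps 3 and 4.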

Integration Code Examples

You can wrap an Agno model in a single line.

from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))

For production agentic workflows, attach the optional observability hooks to monitor token spend.

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
pre = HeadroomPreHook()
post = HeadroomPostHook(token_alert_threshold=10_000)

agent = Agent(model=model, pre_hooks=[pre], post_hooks=[post])
answer = agent.run("Summarize the latest AI research.")
print("Tokens saved:", model.total_tokens_saved)
print("Post-hook summary:", post.get_summary())

Async and streaming endpoints are supported transparently.

import asyncio
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

async def main():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
    msgs = [{"role": "user", "content": "Explain quantum computing"}]
    resp = await model.aresponse(msgs)
    print(resp)

asyncio.run(main())

If you prefer to optimise messages without wrapping a model, use the standalone helper from headroom/integrations/agno/model.py.

from headroom.integrations.agno import optimize_messages

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a 10 KB JSON report on climate data."},
]
opt_msgs, metrics = optimize_messages(msgs, model="gpt-4o")
print("Saved:", metrics["tokens_saved"])

You can also disable provider auto-detection to force a specific tokenizer.

from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(
    OpenAIChat(id="gpt-4o"),
    auto_detect_provider=False,   # forces OpenAI token counting
)

Resilience and Observability Features

The Headroom Agno integration is built for production agentic workflows where uptime matters. If the TransformPipeline defined in headroom/transforms/pipeline.py raises an exception during optimisation, the wrapper catches the error, logs a warning, and proceeds with the original unoptimised messages. This stateless fallback guarantees that a compression failure never breaks the agent.
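The fallback behaviour amounts to a guarded call around the pipeline. The sketch below uses a hypothetical safe_optimize function and a deliberately failing pipeline. It shows the pattern only and is not copied from the wrapper.

```python
import logging
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]
logger = logging.getLogger("headroom_sketch")  # hypothetical logger name


def safe_optimize(
    messages: List[Message],
    pipeline: Callable[[List[Message]], List[Message]],
) -> List[Message]:
    """Run the pipeline, returning the original messages if it raises."""
    try:
        return pipeline(messages)
    except Exception as exc:
        # Log and continue: a compression failure should not stop the agent.
        logger.warning("Optimisation failed, using original messages: %s", exc)
        return messages


def broken_pipeline(messages: List[Message]) -> List[Message]:
    """Stand-in for a transform that fails at runtime."""
    raise ValueError("transform blew up")


original = [{"role": "user", "content": "hello"}]
result = safe_optimize(original, broken_pipeline)
print(result is original)
```

Returning the untouched input list keeps the fallback stateless: nothing from the failed attempt is carried into the model call.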

Additionally, optional pre- and post-hooks (HeadroomPreHook, HeadroomPostHook, and the create_headroom_hooks factory) let you record request-level metrics and emit alerts when token usage crosses thresholds, all without modifying model code. The user-facing integration guide at wiki/agno.md contains full installation steps and troubleshooting notes. The test suite in tests/test_integrations/agno/test_model.py validates wrapping behaviour, provider detection, and metric tracking.
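To make the threshold-alert idea concrete, here is a minimal sketch of a post-run hook that records per-request token totals and flags any request above a limit. TokenAlertHook is hypothetical and is not HeadroomPostHook. Only the token_alert_threshold parameter name and the get_summary method name are borrowed from the earlier example, and the summary fields shown are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokenAlertHook:
    """Hypothetical post-run hook; not Headroom's HeadroomPostHook."""

    token_alert_threshold: int
    totals: List[int] = field(default_factory=list)
    alerts: List[int] = field(default_factory=list)

    def __call__(self, total_tokens: int) -> None:
        """Record one request and flag it if it exceeds the threshold."""
        self.totals.append(total_tokens)
        if total_tokens > self.token_alert_threshold:
            self.alerts.append(total_tokens)

    def get_summary(self) -> Dict[str, int]:
        """Aggregate counts across every request seen so far."""
        return {
            "requests": len(self.totals),
            "total_tokens": sum(self.totals),
            "alerts": len(self.alerts),
        }


hook = TokenAlertHook(token_alert_threshold=10_000)
hook(4_000)
hook(12_500)
print(hook.get_summary())
```

Keeping the hook separate from the model wrapper means monitoring thresholds can change without touching the optimisation path.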

Summary

  • The HeadroomAgnoModel wraps any Agno model to inject token compression at the model layer.
  • Provider auto-detection in headroom/integrations/agno/providers.py ensures accurate token counting across OpenAI, Anthropic, Google, and Cohere backends.
  • Every LLM call follows an eight-step message flow that converts Agno messages to OpenAI format, compresses them unless Claude extended-thinking blocks are present, and converts them back.
  • The integration is thread-safe, includes a stateless fallback on pipeline errors, and supports optional observability hooks for production monitoring.
  • Standalone helpers like optimize_messages let developers use Headroom without permanently wrapping a model.

Frequently Asked Questions

What happens if Headroom cannot detect the Agno model provider?

If get_headroom_provider cannot infer the provider from the model class or ID, it falls back to OpenAIProvider and emits a runtime warning so token counting continues with a best-effort estimate. This fallback ensures your agent keeps running even when the provider mapping is ambiguous.

Does Headroom break Claude’s extended-thinking or reasoning blocks?

No. The wrapper explicitly checks for extended-thinking blocks via _has_thinking_blocks inside headroom/integrations/agno/model.py, and if they are present it skips optimisation entirely to preserve the native reasoning structure. This prevents Headroom from altering the special message content that Claude requires for reasoning.

Can I use Headroom with async and streaming Agno agents?

Yes. HeadroomAgnoModel overrides the async variants (aresponse, ainvoke, ainvoke_stream) and streaming methods (response_stream, invoke_stream) so all invocation patterns pass through the same _optimize_messages pipeline. As a result, async and streaming agents receive the same token savings as synchronous calls without any extra configuration.

Is it possible to optimise messages without wrapping the Agno model?

Yes. The optimize_messages helper exported from headroom/integrations/agno applies the same TransformPipeline logic to a raw message list and returns the compressed messages plus OptimizationMetrics, without requiring a HeadroomAgnoModel instance. It is useful for one-off optimisations or when you want to inspect metrics before deciding to wrap a model permanently.
