# How Headroom Works with Agno Models for Agentic Workflows

> Learn how Headroom optimizes agentic workflows with Agno models. Discover its token-saving pipeline for efficient LLM call management and history compression.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-05

---

**Headroom wraps any Agno model in a `HeadroomAgnoModel` that intercepts every LLM call to compress message history via a token-saving pipeline before delegating back to the underlying model.**

The open-source Headroom library from `chopratejas/headroom` gives Agno agents built-in context optimisation by acting as a transparent proxy layer between the agent and the model. Every normal request, tool call, and streamed response passes through a compression pipeline that reduces token usage while leaving Claude extended-thinking content untouched. The integration lives in the `headroom/integrations/agno/` package and requires no changes to existing agent logic beyond wrapping the model.

## HeadroomAgnoModel Wrapper Architecture

The core of the integration is the **`HeadroomAgnoModel`** dataclass defined in [`headroom/integrations/agno/model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/agno/model.py). It inherits from `agno.models.base.Model` and stores three critical attributes: the original Agno model, a **`HeadroomConfig`**, and a lazily initialised **`TransformPipeline`**.

Because the wrapper subclasses Agno’s base model, it can override every `response*` and `invoke*` method. Each override first calls **`_optimize_messages`** to compress the message history, then delegates to the wrapped model’s matching method. Optimisation therefore happens at the model layer, so every LLM call made through the wrapper passes through the same entry point. A **`threading.Lock`** protects the shared metrics state so that concurrent agent requests update it safely.
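The optimise-then-delegate pattern with a lock-protected counter can be sketched in plain Python. This is an illustration only, not Headroom's implementation: `WrapperSketch`, `EchoModel`, `rough_tokens`, and `keep_last_two` are hypothetical names, and the whitespace token count stands in for a real tokenizer.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Message = Dict[str, str]


def rough_tokens(messages: List[Message]) -> int:
    """Crude whitespace token count, standing in for a real tokenizer."""
    return sum(len(m["content"].split()) for m in messages)


class EchoModel:
    """Hypothetical underlying model that reports how many messages it saw."""

    def response(self, messages: List[Message]) -> str:
        return f"saw {len(messages)} messages"


@dataclass
class WrapperSketch:
    """Illustrative wrapper: optimise first, then delegate to the wrapped model."""

    wrapped: Any
    optimize: Callable[[List[Message]], List[Message]]
    total_tokens_saved: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def response(self, messages: List[Message]) -> str:
        optimized = self.optimize(messages)
        saved = rough_tokens(messages) - rough_tokens(optimized)
        with self._lock:  # guard the shared counter across threads
            self.total_tokens_saved += saved
        return self.wrapped.response(optimized)


def keep_last_two(messages: List[Message]) -> List[Message]:
    """Toy 'compression': drop everything except the two newest messages."""
    return messages[-2:]


history = [{"role": "user", "content": f"message number {i}"} for i in range(5)]
model = WrapperSketch(EchoModel(), keep_last_two)

# Eight concurrent calls share one wrapper and one metrics counter.
threads = [threading.Thread(target=model.response, args=(history,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Tokens saved:", model.total_tokens_saved)
```

The lock only wraps the counter update, so the slow part of each call (the model request) still runs concurrently.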

## Provider Detection and Token Accounting

Accurate token counting requires knowing which provider tokenizer to use. Headroom solves this in [`headroom/integrations/agno/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/agno/providers.py) through the **`get_headroom_provider`** helper.

This function inspects the Agno model’s class name, module path, or model ID and maps it to the matching Headroom token counter, whether the backend is OpenAI, Anthropic, Google, Cohere, or another supported provider. If detection fails, it falls back to **`OpenAIProvider`** and emits a runtime warning, so the agent keeps running with a best-effort count.
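A minimal sketch of this kind of detection follows, assuming a simple keyword lookup. The hint table, the `detect_provider_sketch` function, and the two stand-in model classes are hypothetical; the real mapping in `providers.py` may use different rules and returns provider objects rather than strings.

```python
import warnings

# Hypothetical keyword table; the real mapping may differ.
_PROVIDER_HINTS = {
    "openai": ("openai", "gpt-"),
    "anthropic": ("anthropic", "claude"),
    "google": ("google", "gemini"),
    "cohere": ("cohere", "command-r"),
}


def detect_provider_sketch(model: object) -> str:
    """Guess a provider name from the class name, module path, and model id."""
    haystack = " ".join(
        [
            type(model).__name__,
            type(model).__module__,
            str(getattr(model, "id", "")),
        ]
    ).lower()
    for provider, hints in _PROVIDER_HINTS.items():
        if any(hint in haystack for hint in hints):
            return provider
    # Unknown backend: warn and fall back to OpenAI-style counting.
    warnings.warn("Unknown provider; falling back to OpenAI token counting")
    return "openai"


class FakeClaude:
    """Stand-in model whose module path and id both point at Anthropic."""

    __module__ = "agno.models.anthropic"
    id = "claude-sonnet"


class MysteryModel:
    """Stand-in model that matches none of the hints."""

    __module__ = "my_company.llm"
    id = "local-7b"


print(detect_provider_sketch(FakeClaude()))
```

Passing `MysteryModel()` instead triggers the warning and returns the OpenAI fallback, mirroring the behaviour described above.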

## Message Flow Through the Pipeline

When an Agno agent calls `model.response(messages)`, or an async or streaming variant such as `aresponse` or `response_stream`, the wrapper executes the following sequence:

1. **`_ensure_message_objects`** validates that all items in the incoming list are properly typed Agno `Message` objects.
2. **`_optimize_messages`** converts the Agno messages into OpenAI-formatted dictionaries via **`_convert_messages_to_openai`**.
3. The wrapper checks for Claude extended-thinking blocks with **`_has_thinking_blocks`**. If present, optimisation is skipped entirely to preserve native reasoning structure.
4. Assuming no thinking blocks, the wrapper lazily initialises the **`TransformPipeline`**, auto-detecting the provider via `get_headroom_provider`.
5. The pipeline runs transforms such as **`SmartCrusher`** and **`TagProtector`** on the message list, producing optimised messages and updating **`OptimizationMetrics`**.
6. The running total of saved tokens is accumulated in **`_total_tokens_saved`**.
7. The OpenAI-formatted messages are converted back into Agno `Message` objects via **`_convert_messages_from_openai`** so the underlying model can call its own `_log_messages`.
8. Finally, the wrapper delegates to the wrapped model’s `invoke`, `ainvoke`, `invoke_stream`, or `ainvoke_stream`.
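The convert, check, compress, and convert-back steps above can be sketched as follows. Everything here is a hypothetical stand-in: `Msg` replaces Agno's `Message`, `toy_compress` replaces the real transforms, character counts replace token counts, and the thinking-block check assumes Anthropic-style content blocks with `"type": "thinking"`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Msg:
    """Hypothetical stand-in for an Agno Message."""

    role: str
    content: Any


def to_openai(messages: List[Msg]) -> List[Dict[str, Any]]:
    return [{"role": m.role, "content": m.content} for m in messages]


def from_openai(dicts: List[Dict[str, Any]]) -> List[Msg]:
    return [Msg(d["role"], d["content"]) for d in dicts]


def has_thinking_blocks(dicts: List[Dict[str, Any]]) -> bool:
    """True if any message carries a structured block of type 'thinking'."""
    for d in dicts:
        content = d["content"]
        if isinstance(content, list) and any(
            isinstance(block, dict) and block.get("type") == "thinking"
            for block in content
        ):
            return True
    return False


def toy_compress(dicts: List[Dict[str, Any]], max_chars: int = 40) -> List[Dict[str, Any]]:
    """Toy transform: truncate long tool outputs, leave everything else alone."""
    out = []
    for d in dicts:
        content = d["content"]
        if d["role"] == "tool" and isinstance(content, str) and len(content) > max_chars:
            content = content[:max_chars] + "..."
        out.append({"role": d["role"], "content": content})
    return out


def optimize_flow(messages: List[Msg]) -> Tuple[List[Msg], int]:
    """Return (messages to send, characters saved)."""
    dicts = to_openai(messages)
    if has_thinking_blocks(dicts):
        return messages, 0  # skip optimisation entirely
    compressed = toy_compress(dicts)
    before = sum(len(str(d["content"])) for d in dicts)
    after = sum(len(str(d["content"])) for d in compressed)
    return from_openai(compressed), before - after


messages = [Msg("user", "List the files"), Msg("tool", "x" * 100)]
optimized, saved = optimize_flow(messages)
print(saved, optimized[1].content)
```

Returning the original list untouched in the thinking-block branch keeps that path free of any conversion round trip.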

## Integration Code Examples

You can wrap an Agno model in a single line.

```python
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
```

For production agentic workflows, attach the optional observability hooks to monitor token spend.

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import (
    HeadroomAgnoModel,
    HeadroomPreHook,
    HeadroomPostHook,
)

# Wrap the model, then attach hooks that track token usage per request.
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
pre = HeadroomPreHook()
post = HeadroomPostHook(token_alert_threshold=10_000)

agent = Agent(model=model, pre_hooks=[pre], post_hooks=[post])
answer = agent.run("Summarize the latest AI research.")
print("Tokens saved:", model.total_tokens_saved)
print("Post-hook summary:", post.get_summary())
```

Async and streaming endpoints are supported transparently.

```python
import asyncio
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

async def main():
    model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
    msgs = [{"role": "user", "content": "Explain quantum computing"}]
    resp = await model.aresponse(msgs)
    print(resp)

asyncio.run(main())
```

If you prefer to optimise messages without wrapping a model, use the standalone helper from [`headroom/integrations/agno/model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/agno/model.py).

```python
from headroom.integrations.agno import optimize_messages

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a 10 KB JSON report on climate data."},
]
opt_msgs, metrics = optimize_messages(msgs, model="gpt-4o")
print("Saved:", metrics["tokens_saved"])
```

You can also disable provider auto-detection to force a specific tokenizer.

```python
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(
    OpenAIChat(id="gpt-4o"),
    auto_detect_provider=False,  # forces OpenAI token counting
)
```

## Resilience and Observability Features

The Headroom Agno integration is built for production agentic workflows where uptime matters. If the **`TransformPipeline`** defined in [`headroom/transforms/pipeline.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/pipeline.py) raises an exception during optimisation, the wrapper catches the error, logs a warning, and proceeds with the original unoptimised messages. With this **stateless fallback**, a compression failure costs only the token savings for that call and does not break the agent.
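The fallback described above follows a common pattern, sketched here under stated assumptions: `safe_optimize`, the logger name, and the two toy pipelines are hypothetical, and the real wrapper's error handling may differ in detail.

```python
import logging
from typing import Callable, Dict, List

Message = Dict[str, str]
logger = logging.getLogger("headroom_sketch")  # hypothetical logger name


def safe_optimize(
    messages: List[Message],
    pipeline: Callable[[List[Message]], List[Message]],
) -> List[Message]:
    """Run the pipeline, but never let its failure reach the caller."""
    try:
        return pipeline(messages)
    except Exception as exc:
        logger.warning("Optimisation failed, using original messages: %s", exc)
        return messages  # untouched input, so the LLM call still goes ahead


def broken_pipeline(messages: List[Message]) -> List[Message]:
    raise ValueError("boom")


def trimming_pipeline(messages: List[Message]) -> List[Message]:
    return messages[-1:]


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
print(len(safe_optimize(msgs, broken_pipeline)))    # original list survives
print(len(safe_optimize(msgs, trimming_pipeline)))  # compressed list is used
```

Because the function keeps no state between calls, one failed optimisation has no effect on the next request.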

Optional pre- and post-hooks (**`HeadroomPreHook`**, **`HeadroomPostHook`**, and the **`create_headroom_hooks`** factory) let you record request-level metrics and emit alerts when token usage crosses a threshold, without modifying model code. The user-facing integration guide at [`wiki/agno.md`](https://github.com/chopratejas/headroom/blob/main/wiki/agno.md) contains full installation steps and troubleshooting notes. The test suite in [`tests/test_integrations/agno/test_model.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_integrations/agno/test_model.py) validates wrapping behaviour, provider detection, and metric tracking.
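To illustrate the threshold-alert idea behind the post-hook, here is a small self-contained sketch. `TokenAlertSketch` is hypothetical and does not reproduce the `HeadroomPostHook` interface; only the `token_alert_threshold` parameter name and the `get_summary` method name are borrowed from the usage shown earlier, and the summary fields are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokenAlertSketch:
    """Hypothetical post-run hook: records usage and flags threshold breaches."""

    token_alert_threshold: int
    usage: List[int] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)

    def __call__(self, tokens_used: int) -> None:
        self.usage.append(tokens_used)
        if tokens_used > self.token_alert_threshold:
            self.alerts.append(
                f"request {len(self.usage)} used {tokens_used} tokens"
            )

    def get_summary(self) -> Dict[str, int]:
        return {
            "requests": len(self.usage),
            "total_tokens": sum(self.usage),
            "alerts": len(self.alerts),
        }


hook = TokenAlertSketch(token_alert_threshold=10_000)
for tokens in (2_500, 12_000, 9_999):
    hook(tokens)
print(hook.get_summary())
print(hook.alerts)
```

Keeping the alert logic in a hook object rather than in the model means the threshold can change per agent without touching the wrapper.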

## Summary

- The **`HeadroomAgnoModel`** wraps any Agno model to inject token compression at the model layer.
- **Provider auto-detection** in [`headroom/integrations/agno/providers.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/agno/providers.py) ensures accurate token counting across OpenAI, Anthropic, Google, and Cohere backends.
- Every LLM call follows an eight-step message flow that converts Agno messages to OpenAI format, compresses them, and converts them back, skipping compression when Claude extended-thinking blocks are present.
- The integration is **thread-safe**, includes a **stateless fallback** on pipeline errors, and supports optional **observability hooks** for production monitoring.
- Standalone helpers like **`optimize_messages`** let developers use Headroom without permanently wrapping a model.

## Frequently Asked Questions

### What happens if Headroom cannot detect the Agno model provider?

If `get_headroom_provider` cannot infer the provider from the model class or ID, it falls back to `OpenAIProvider` and emits a runtime warning so token counting continues with a best-effort estimate. This fallback ensures your agent keeps running even when the provider mapping is ambiguous.

### Does Headroom break Claude’s extended-thinking or reasoning blocks?

No. The wrapper explicitly checks for extended-thinking blocks via `_has_thinking_blocks` inside [`headroom/integrations/agno/model.py`](https://github.com/chopratejas/headroom/blob/main/headroom/integrations/agno/model.py), and if they are present it skips optimisation entirely to preserve the native reasoning structure. This prevents Headroom from altering the special message content that Claude requires for reasoning.

### Can I use Headroom with async and streaming Agno agents?

Yes. `HeadroomAgnoModel` overrides the async variants (`aresponse`, `ainvoke`, `ainvoke_stream`) and streaming methods (`response_stream`, `invoke_stream`) so all invocation patterns pass through the same `_optimize_messages` pipeline. As a result, async and streaming agents receive the same token savings as synchronous calls without any extra configuration.

### Is it possible to optimise messages without wrapping the Agno model?

Yes. The `optimize_messages` helper exported from `headroom/integrations/agno` applies the same `TransformPipeline` logic to a raw message list and returns the compressed messages plus `OptimizationMetrics`, without requiring a `HeadroomAgnoModel` instance. It is useful for one-off optimisations or when you want to inspect metrics before deciding to wrap a model permanently.