What Is the Difference Between Audit, Optimize, and Simulate Modes in Headroom?

Headroom's Audit mode observes requests without modifying them, Optimize mode applies deterministic transforms to reduce tokens, and Simulate mode returns a transformation plan without calling the upstream LLM.

Headroom is an open-source LLM proxy that optimizes context windows and reduces API costs through intelligent compression. The tool's runtime behavior is governed by the HeadroomMode enum defined in headroom/models/config.py, which determines whether the proxy merely observes traffic, actively transforms payloads, or runs a cost-estimation dry-run.

Understanding the HeadroomMode Enum

The core architecture defines three mutually exclusive operating modes in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py):

from enum import Enum

class HeadroomMode(str, Enum):
    AUDIT = "audit"       # Observe only, no modifications
    OPTIMIZE = "optimize" # Apply deterministic transforms
    SIMULATE = "simulate" # Return transform plan without API call

Each mode serves distinct operational requirements, from safe production monitoring to aggressive cost optimization.
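Because HeadroomMode subclasses str, its members compare equal to their string values, which is presumably why the SDK examples in this article can pass plain strings such as "audit". The following is a minimal sketch of that behavior. It re-declares the enum locally instead of importing it from Headroom, and the parse_mode helper is hypothetical, not part of the library.

```python
from enum import Enum


class HeadroomMode(str, Enum):
    """Local copy of the enum shown above, for illustration only."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def parse_mode(value):
    """Hypothetical helper: normalize user input to a HeadroomMode member."""
    if isinstance(value, HeadroomMode):
        return value
    # Enum lookup by value raises ValueError for unknown strings.
    return HeadroomMode(value.strip().lower())


print(parse_mode(" Audit ") is HeadroomMode.AUDIT)
print(HeadroomMode.OPTIMIZE == "optimize")
```

Normalizing input once at the boundary like this means a misspelled mode fails immediately with a ValueError instead of silently falling through to some default.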

Audit Mode: Observe Without Modifying

Audit mode is the safety-first option for production monitoring. In this mode, Headroom inspects every request and records what transforms would have been applied, but the payload sent to the LLM remains completely unchanged.

This mode is ideal for baseline measurement and validating compression strategies before enabling them live. The proxy returns standard LLM responses augmented with X-Headroom-* headers containing the audit metadata.

Using Audit Mode

Configure Audit mode at the SDK level or per-request:

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",          # ← observe only
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",        # optional per-request override
)

The request proceeds to the LLM unmodified while Headroom logs which transforms would have run.
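To work with the X-Headroom-* audit metadata, one option is to filter it out of the full set of response headers. The sketch below operates on a plain dictionary because this article does not show how the SDK exposes response headers. The specific header names and values in the sample are made up for illustration, and only the X-Headroom- prefix comes from the text above.

```python
def headroom_headers(headers):
    """Return only the X-Headroom-* entries, with lowercased names."""
    prefix = "x-headroom-"
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith(prefix)  # HTTP header names are case-insensitive
    }


# Hypothetical header names and values, for illustration only.
sample = {
    "Content-Type": "application/json",
    "X-Headroom-Mode": "audit",
    "X-Headroom-Tokens-Saved": "0",
}
audit_info = headroom_headers(sample)
print(audit_info)
```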

Optimize Mode: Apply Production Transforms

Optimize mode is the default for performance-focused deployments. When enabled, Headroom applies deterministic transforms such as SmartCrusher, CacheAligner, and RollingWindow to compress the request before it reaches the LLM.

According to the source code in headroom/transforms/, these transforms reduce token count by removing redundant context, aligning cache prefixes, and compressing large JSON payloads, with the goal of preserving the semantic meaning of the request.
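To make "deterministic transform" concrete, here is a toy rolling-window function. It is not Headroom's RollingWindow implementation, only a stand-in under the assumption that such a transform keeps system messages and trims older conversation turns. The function name, the keep_last parameter, and the message shape are all illustrative.

```python
def rolling_window(messages, keep_last=4):
    """Toy transform: keep system messages plus the last `keep_last` others."""
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    # Guard keep_last <= 0, because a [-0:] slice would keep everything.
    kept = set(non_system[-keep_last:]) if keep_last > 0 else set()
    # Build a new list so the caller's messages are never mutated.
    return [
        m for i, m in enumerate(messages)
        if m["role"] == "system" or i in kept
    ]


history = [{"role": "system", "content": "Be brief."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(6)
]
trimmed = rolling_window(history, keep_last=4)
print(len(history), "->", len(trimmed), "messages")
```

The function is deterministic in the sense that it depends only on its inputs, so the same request always produces the same trimmed payload.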

Using Optimize Mode

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",      # ← enable compression
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)

The response includes headers showing actual tokens saved. To disable optimization entirely via CLI, use headroom proxy --no-optimize, which effectively forces Audit behavior across all traffic.

Simulate Mode: Dry-Run Cost Estimation

Simulate mode provides a complete dry-run of the compression pipeline without incurring LLM costs. Instead of calling the upstream API, Headroom returns a Plan object describing which transforms would execute and the estimated savings.

This mode is essential for CI pipelines, cost forecasting, and testing compression strategies against historical traffic. The Plan object contains tokens_saved, transforms, and estimated_savings fields.

Using Simulate Mode

Invoke via the dedicated simulate method in headroom/client.py:

plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)

print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)

No network call to the LLM provider occurs during simulation.
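In a CI pipeline, the plan can feed a simple pass/fail check. The sketch below uses a stand-in Plan dataclass so that it runs without Headroom installed. The three field names come from this article, while their types, the check_plan helper, the threshold, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Stand-in for Headroom's Plan object; field types are assumptions."""
    tokens_saved: int
    transforms: list = field(default_factory=list)
    estimated_savings: float = 0.0


def check_plan(plan, min_tokens_saved):
    """Hypothetical CI gate: True when the plan saves enough tokens."""
    return plan.tokens_saved >= min_tokens_saved


# Invented sample values; a real plan would come from the simulate call.
plan = Plan(tokens_saved=1200, transforms=["SmartCrusher"], estimated_savings=0.006)
print("gate passed" if check_plan(plan, min_tokens_saved=1000) else "gate failed")
```

A check like this could run against a fixed set of representative prompts so that a configuration change that reduces savings is caught before deployment.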

Configuring Modes at Three Levels

Headroom supports mode configuration at multiple granularity levels:

  1. SDK Construction: Set default_mode to "audit", "optimize", or "simulate" when instantiating HeadroomClient.

  2. Per-Request Override: Pass headroom_mode="audit" (or any valid mode) into client.chat.completions.create() to override the default for a single call.

  3. Proxy Command-Line: Launch the proxy with headroom proxy --no-optimize to disable optimization globally, forcing Audit behavior.
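The three levels above can be read as a precedence order. The sketch below shows one plausible resolution rule, in which a per-request value wins over the client default and the --no-optimize flag downgrades Optimize to Audit. This ordering is an assumption rather than Headroom's documented behavior, the resolve_mode function is hypothetical, and the effect of the flag on Simulate requests is not stated in this article, so the sketch leaves Simulate untouched.

```python
def resolve_mode(default_mode, request_mode=None, no_optimize=False):
    """Hypothetical resolution: request override, then default, then CLI flag."""
    mode = request_mode if request_mode is not None else default_mode
    # Assumed behavior: the flag only affects requests that would optimize.
    if no_optimize and mode == "optimize":
        mode = "audit"
    return mode


print(resolve_mode("optimize"))
print(resolve_mode("optimize", request_mode="audit"))
print(resolve_mode("optimize", no_optimize=True))
```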

Summary

  • Audit mode observes traffic and reports potential savings without modifying requests, making it safe for production monitoring.
  • Optimize mode actively applies transforms like SmartCrusher and CacheAligner to reduce token counts and latency in live environments.
  • Simulate mode returns a Plan object with cost estimates and transform details without calling the upstream LLM, perfect for testing and CI.
  • All three modes are defined in headroom/models/config.py and can be configured via the SDK, per-request parameters, or CLI flags.

Frequently Asked Questions

What happens to my LLM response in Audit mode?

The response returns normally from the upstream LLM without modification. Headroom only injects X-Headroom-* headers indicating which transforms would have been applied and the estimated tokens that could have been saved.

Can I switch modes for individual requests without changing the SDK default?

Yes. Pass the headroom_mode parameter directly to client.chat.completions.create() to override the client's default mode for that specific request. This allows you to audit specific high-risk requests while keeping optimization enabled by default.

Which transforms run in Optimize mode?

The specific transforms depend on your configuration, but commonly include SmartCrusher for aggressive context compression, CacheAligner for prefix optimization, and RollingWindow for managing conversation history. These are implemented in the headroom/transforms/ directory and applied deterministically before the request reaches the LLM.

Does Simulate mode consume LLM API tokens?

No. Simulate mode performs all transformation logic locally and returns a Plan object without making any network call to the upstream LLM provider. This makes it ideal for cost estimation and integration testing without incurring API charges.
