Headroom Audit vs Optimize vs Simulate: Three Operating Modes Explained

Headroom’s audit mode observes requests without modifying payloads, optimize applies deterministic transforms to compress context, and simulate runs a dry-run that returns a transformation plan without calling the LLM.

The Headroom proxy intercepts LLM requests to reduce token usage and latency. Its runtime behavior is controlled by the HeadroomMode enum defined in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py), which determines whether the system observes traffic, actively optimizes payloads, or simulates changes for cost estimation. Understanding these three operating modes is essential for deploying Headroom safely in production environments.
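
To make the three-way split concrete, the sketch below models the modes as a plain Python enum and shows how a proxy could branch on it. It is an illustration only: the `Mode` class, its member names, the `handle` function, and the returned dictionaries are assumptions, not code copied from headroom/models/config.py, and the whitespace-collapsing "transform" is a toy.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for HeadroomMode; member names are assumed, not taken from the repo."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def handle(mode: Mode, payload: str) -> dict:
    """Hypothetical dispatcher showing what each mode does with a payload."""
    # Toy transform: collapse runs of whitespace into single spaces.
    compressed = " ".join(payload.split())
    delta = len(payload) - len(compressed)
    if mode is Mode.AUDIT:
        # Forward the original payload; only report what would have changed.
        return {"sent": payload, "calls_llm": True, "would_save_chars": delta}
    if mode is Mode.OPTIMIZE:
        # Forward the transformed payload and report the actual saving.
        return {"sent": compressed, "calls_llm": True, "saved_chars": delta}
    # SIMULATE: nothing is sent upstream; only the estimate is returned.
    return {"sent": None, "calls_llm": False, "would_save_chars": delta}
```

The point of the sketch is the asymmetry: audit and optimize both reach the LLM but differ in what they send, while simulate never makes the upstream call.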

Audit Mode: Observation Without Modification

In audit mode, the proxy observes every request and records what it would change, but does not modify the payload sent to the LLM. This mode is ideal for production monitoring, baseline measurement, and safety-first deployments where you need visibility into Headroom's behavior without affecting live traffic.

When running in audit mode, the request passes through to the LLM unchanged, but the proxy adds X-Headroom-* headers containing audit information about which transforms would have been applied.

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

# Wrap the OpenAI client; in audit mode requests are observed but not changed.
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",          # ← observe only
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",         # optional per-request override
)
print(resp)   # normal LLM output; audit info is reported in the X-Headroom-* headers
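
Once you have the response headers in hand, pulling out the audit information is a matter of filtering by prefix. The helper below is a minimal sketch: the function name and the specific header names in the example are hypothetical, and how you obtain the raw headers depends on your client and is not shown here.

```python
from typing import Mapping


def headroom_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Return only the X-Headroom-* entries, matching the prefix case-insensitively."""
    prefix = "x-headroom-"
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }


# Hypothetical header names and values, for illustration only.
example = {
    "Content-Type": "application/json",
    "X-Headroom-Mode": "audit",
    "x-headroom-tokens-before": "1200",
}
audit_info = headroom_headers(example)
print(audit_info)
```

Matching case-insensitively matters because HTTP header names are case-insensitive and different clients normalize them differently.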

Optimize Mode: Live Compression and Transformation

The optimize mode applies safe, deterministic transforms to the request before it reaches the LLM. This is the default for performance-focused deployments and actively compresses context to reduce token costs and latency.

Transforms applied in this mode include SmartCrusher, CacheAligner, and RollingWindow, which compress large JSON payloads, align cache prefixes, and remove low-importance conversation turns. The modified payload is then sent to the LLM, with X-Headroom-* headers showing actual token savings.

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",      # ← enable compression
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(resp)   # payload was compressed before sending; X-Headroom-* headers show savings
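
To give a feel for what a deterministic transform looks like, here is a toy rolling-window function that keeps system messages plus the most recent turns. It is not Headroom's RollingWindow implementation: the function name, the `max_turns` parameter, and the keep-the-newest policy are assumptions chosen for illustration, and the real transform is described as dropping low-importance turns rather than simply the oldest ones.

```python
def rolling_window(messages: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep every system message and the last `max_turns` non-system messages, in order."""
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    # Guard max_turns <= 0 explicitly: a slice of [-0:] would keep everything.
    keep = set(non_system[-max_turns:]) if max_turns > 0 else set()
    return [m for i, m in enumerate(messages) if m["role"] == "system" or i in keep]


history = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
trimmed = rolling_window(history, max_turns=3)
print([m["content"] for m in trimmed])
```

Because the function depends only on its inputs, the same conversation always produces the same trimmed payload, which is what makes a transform of this kind safe to audit ahead of time.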

Simulate Mode: Dry-Run Cost Estimation

The simulate mode does not call the upstream LLM. Instead, it returns a Plan object describing which transforms would run and the estimated token savings. This mode is designed for testing, cost-estimation, CI pipelines, and any scenario requiring a dry-run without incurring LLM usage charges.

The Plan object contains tokens_saved, transforms, and estimated_savings properties, allowing you to preview optimization impact before enabling live mode.

# Reuses the HeadroomClient instance constructed in the earlier examples.
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)

print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)
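
The shape of such a plan can be pictured with a small dataclass. This is a sketch under assumptions: the text above only names the three properties, so the `PlanSketch` class, its field types, the `estimate_savings` helper, and the per-token price are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlanSketch:
    """Hypothetical stand-in mirroring the three properties named above."""
    tokens_saved: int
    transforms: list[str] = field(default_factory=list)
    estimated_savings: float = 0.0  # assumed here to be a currency amount


def estimate_savings(
    tokens_before: int,
    tokens_after: int,
    usd_per_million_tokens: float,
    transforms: list[str],
) -> PlanSketch:
    """Build a plan from before/after token counts and an assumed input price."""
    saved = max(tokens_before - tokens_after, 0)  # never report negative savings
    return PlanSketch(
        tokens_saved=saved,
        transforms=list(transforms),
        estimated_savings=saved * usd_per_million_tokens / 1_000_000,
    )


sketch = estimate_savings(12_000, 4_000, usd_per_million_tokens=2.0,
                          transforms=["SmartCrusher"])
print(sketch.tokens_saved, sketch.transforms, sketch.estimated_savings)
```

In a CI pipeline, a check along these lines could fail a build when `tokens_saved` drops below a threshold you choose, without spending anything on LLM calls.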

Configuring Headroom Modes

You can configure the operating mode at three levels according to the source code in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py):

  1. SDK Construction: Set default_mode="audit", "optimize", or "simulate" when initializing HeadroomClient in [headroom/client.py](https://github.com/chopratejas/headroom/blob/main/headroom/client.py).
  2. Per-Request Override: Pass headroom_mode="audit" (or optimize/simulate) into client.chat.completions.create() to override the default for a single request.
  3. Proxy Command-Line: Use headroom proxy --no-optimize to disable optimization entirely, effectively forcing audit mode at the infrastructure level.
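
The precedence implied by this list, where a per-request value wins over the client default, can be sketched as a small resolver. The function name, the string-based mode values, and the treatment of the CLI flag as a boolean that downgrades optimize to audit are assumptions for illustration, not Headroom's actual code.

```python
from typing import Optional

VALID_MODES = {"audit", "optimize", "simulate"}


def resolve_mode(
    request_mode: Optional[str],
    default_mode: str,
    no_optimize: bool = False,
) -> str:
    """Pick the effective mode: per-request override first, then the client default."""
    mode = request_mode if request_mode is not None else default_mode
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: an infrastructure-level no-optimize switch downgrades optimize to audit.
    if no_optimize and mode == "optimize":
        return "audit"
    return mode
```

For example, `resolve_mode("audit", "optimize")` yields `"audit"` for that one request, while `resolve_mode(None, "optimize")` falls back to the client default.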

Implementation Details

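The mode logic is spread across several files in the chopratejas/headroom repository. The ones referenced in this article are [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py), which defines the HeadroomMode enum; [headroom/client.py](https://github.com/chopratejas/headroom/blob/main/headroom/client.py), which contains HeadroomClient; and headroom/transforms/, which holds the transforms that run in optimize mode.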

Summary

  • Audit mode observes traffic and logs potential changes without modifying requests, perfect for production monitoring.
  • Optimize mode applies deterministic transforms like SmartCrusher and CacheAligner to reduce tokens and latency in live traffic.
  • Simulate mode returns a Plan object with cost estimates without calling the LLM, ideal for CI testing and dry-runs.
  • Configure modes via the SDK constructor, a per-request parameter, or the proxy CLI flag; the modes themselves are defined in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py).

Frequently Asked Questions

Can I switch between audit and optimize mode without restarting the proxy?

Yes, when you go through the SDK. You can override the default mode on a per-request basis by passing headroom_mode="audit" or headroom_mode="optimize" to client.chat.completions.create(). Alternatively, starting the proxy with headroom proxy --no-optimize forces audit mode across all traffic without code changes. Because that flag is supplied at launch, applying it to an already running proxy means restarting it.

What information does the simulate mode return?

Simulate mode returns a Plan object containing tokens_saved, transforms, and estimated_savings properties. This object details exactly which transforms (such as SmartCrusher or CacheAligner) would execute and quantifies the expected token reduction without making an actual LLM API call.

Does audit mode impact latency?

Audit mode adds minimal latency because it only inspects requests and adds headers without performing compute-intensive transforms. However, it does not provide the token savings or latency reduction benefits of optimize mode, which actively compresses payloads before transmission to the LLM.

Which transforms run in optimize mode?

The optimize mode executes deterministic transforms located in headroom/transforms/, including SmartCrusher for JSON compression, CacheAligner for prefix optimization, and RollingWindow for conversation history management. These transforms modify the request payload before it reaches the LLM.
