Headroom Audit, Optimize, and Simulate Modes Explained

Headroom's three operating modes—audit, optimize, and simulate—determine whether the proxy observes traffic without changes, applies compression transforms, or runs a dry-run to estimate savings without calling the LLM.

Headroom (chopratejas/headroom) is an open-source LLM proxy that reduces token usage through context compression. Understanding how the audit, optimize, and simulate modes differ is essential for deploying the tool safely across development, staging, and production environments. The modes are mutually exclusive: for any given request, transforms are either observed, applied, or merely planned.

How Headroom Modes Are Defined

In headroom/models/config.py, the HeadroomMode enum defines the three operating states that drive runtime behavior:

from enum import Enum


class HeadroomMode(str, Enum):
    AUDIT = "audit"        # Observe only, no modifications
    OPTIMIZE = "optimize"  # Apply deterministic transforms
    SIMULATE = "simulate"  # Return transform plan without API call

This enum is referenced throughout the SDK implementation in headroom/client.py and determines which path the request takes through the proxy pipeline.
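To make the dispatch idea concrete, here is a minimal sketch of how a string-backed enum like this one can select a request path. The enum is copied locally so the snippet runs on its own, and describe_path is a hypothetical helper invented for illustration; it is not a function from headroom/client.py.

```python
from enum import Enum


class HeadroomMode(str, Enum):
    """Local copy of the enum shown above, so this sketch runs on its own."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def describe_path(mode: str) -> str:
    """Hypothetical dispatcher: map a mode string to the path a request takes."""
    parsed = HeadroomMode(mode)  # raises ValueError for unknown strings
    if parsed is HeadroomMode.AUDIT:
        return "forward original request, record would-be savings"
    if parsed is HeadroomMode.OPTIMIZE:
        return "apply transforms, forward compressed request"
    return "build transform plan, skip upstream call"


# Because the enum subclasses str, members compare equal to their raw values.
assert HeadroomMode.AUDIT == "audit"
print(describe_path("simulate"))
```

Subclassing str is what lets plain strings such as "audit" be passed in configuration and still compare equal to enum members.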

Audit Mode: Production Observation Without Risk

In audit mode, Headroom acts as a transparent proxy: it intercepts every request and evaluates which transforms would apply, but sends the original payload unchanged to the LLM. This mode is designed for production monitoring and baseline measurement, where you need visibility into potential savings without affecting live traffic.

When running in audit, the request reaches the upstream LLM exactly as sent by the client. However, the response includes X-Headroom-* headers containing metadata about which transforms would have run and the estimated token savings. This safety-first approach lets you measure impact before enabling live compression.
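The audit flow described above can be sketched as follows. This is an illustration under stated assumptions, not Headroom's implementation: estimate_tokens is a crude character-based stand-in for a tokenizer, compact_json is a toy transform, and the two header names are hypothetical examples of the X-Headroom-* family.

```python
import copy
import json
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def estimate_tokens(messages: List[Message]) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact_json(messages: List[Message]) -> List[Message]:
    """Toy transform: re-serialize JSON content without extra whitespace."""
    out = []
    for m in messages:
        try:
            content = json.dumps(json.loads(m["content"]), separators=(",", ":"))
        except ValueError:
            content = m["content"]  # not JSON, leave untouched
        out.append({**m, "content": content})
    return out


def audit(
    messages: List[Message],
    send: Callable[[List[Message]], str],
) -> Tuple[str, Dict[str, str]]:
    """Measure what a transform would save, but forward the original payload."""
    # Evaluate the transform on a copy so the caller's payload is never touched.
    candidate = compact_json(copy.deepcopy(messages))
    saved = estimate_tokens(messages) - estimate_tokens(candidate)
    body = send(messages)  # upstream receives the request exactly as sent
    headers = {
        "X-Headroom-Mode": "audit",                  # hypothetical header name
        "X-Headroom-Tokens-Saved-Estimate": str(saved),  # hypothetical header name
    }
    return body, headers


payload = [{"role": "user", "content": '{ "a": 1,   "b": [1, 2, 3],  "note": "padding padding" }'}]
body, headers = audit(payload, send=lambda msgs: "stubbed LLM reply")
print(body, headers)
```

The key property is that send always receives the untouched messages, while the would-be savings travel only in metadata.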

Optimize Mode: Live Compression and Latency Reduction

Optimize mode enables Headroom to actively apply safe, deterministic transforms before requests reach the LLM. According to the source code, this includes transforms such as SmartCrusher, CacheAligner, and RollingWindow located in the headroom/transforms/ directory.

This is the default mode for performance-focused deployments. When enabled, the proxy compresses large JSON payloads, aligns cache prefixes to improve cache hit rates, and drops low-importance conversation turns to cut token usage and latency. The transforms are deterministic, meaning the same input always produces the same compressed output, which keeps behavior consistent and reproducible in production.
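As an illustration of what a deterministic transform looks like, the sketch below trims a conversation to its system messages plus the most recent turns. It is a simplified stand-in written for this article: the function name rolling_window and its keep_last parameter are assumptions, and Headroom's actual RollingWindow may choose which turns to drop quite differently.

```python
from typing import Dict, List

Message = Dict[str, str]


def rolling_window(messages: List[Message], keep_last: int) -> List[Message]:
    """Keep every system message plus the most recent `keep_last` other turns."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    # Indices of all non-system turns, in conversation order.
    others = [i for i, m in enumerate(messages) if m["role"] != "system"]
    # Guard keep_last == 0 explicitly: others[-0:] would select the whole list.
    kept = set(others[-keep_last:]) if keep_last else set()
    return [m for i, m in enumerate(messages) if m["role"] == "system" or i in kept]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Second question"},
    {"role": "assistant", "content": "Second answer"},
    {"role": "user", "content": "Third question"},
]

trimmed = rolling_window(history, keep_last=2)
# Deterministic: the same input always yields the same output.
assert trimmed == rolling_window(history, keep_last=2)
print([m["content"] for m in trimmed])
```

Because the function uses no randomness, clock, or external state, repeated calls on the same history return identical results, which is the property the text refers to as deterministic.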

Simulate Mode: Dry-Run Testing and Cost Estimation

Simulate mode provides a complete dry-run of the compression pipeline without calling the upstream LLM. Instead of returning LLM output, the call returns a Plan object describing which transforms would execute and their projected impact.

As implemented in headroom/client.py, calling client.chat.completions.simulate() returns an object containing tokens_saved, transforms, and estimated_savings. This mode is ideal for CI pipelines, cost estimation workflows, or any scenario where you want to calculate potential savings without incurring LLM API charges.
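One way to use such a plan in a CI pipeline is to fail the build when projected savings fall below a threshold. The sketch below uses a hypothetical Plan dataclass carrying the three fields named above; the field types, the meaning of estimated_savings as a dollar amount, and the meets_budget helper are all assumptions for illustration, not the real class from the library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for the object a simulate call returns."""
    tokens_saved: int
    transforms: List[str] = field(default_factory=list)
    estimated_savings: float = 0.0  # assumed here to be a dollar amount


def meets_budget(plan: Plan, min_tokens_saved: int) -> bool:
    """Hypothetical CI gate: pass only if the plan saves enough tokens."""
    return plan.tokens_saved >= min_tokens_saved


# In a real pipeline the plan would come from the simulate call; this one is made up.
plan = Plan(tokens_saved=1200, transforms=["SmartCrusher"], estimated_savings=0.003)

if meets_budget(plan, min_tokens_saved=1000):
    print("savings target met:", plan.transforms)
else:
    raise SystemExit("projected savings below target")
```

Since no upstream call is involved, a check like this can run on every commit against recorded prompts without incurring API charges.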

Configuring Modes in Your Application

You can set Headroom's operating mode at three different levels, as documented in wiki/configuration.md:

  1. SDK Construction: Set default_mode when initializing HeadroomClient
  2. Per-Request Override: Pass headroom_mode parameter to client.chat.completions.create()
  3. Command-Line Flag: Use --no-optimize to force audit mode in the proxy
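The three levels above imply a precedence order, which can be sketched as a small resolver. The ordering shown (proxy flag first, then per-request override, then SDK default) is inferred from the descriptions in this article rather than confirmed against the source, and resolve_mode is a hypothetical helper, not part of the Headroom API.

```python
from enum import Enum
from typing import Optional


class HeadroomMode(str, Enum):
    """Local copy of the enum, so this sketch runs on its own."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def resolve_mode(
    default_mode: str,
    request_mode: Optional[str] = None,
    no_optimize: bool = False,
) -> HeadroomMode:
    """Pick the effective mode for one request (assumed precedence)."""
    if no_optimize:
        # Proxy-level flag: assumed to win over any client-side setting.
        return HeadroomMode.AUDIT
    if request_mode is not None:
        # Per-request override beats the SDK default.
        return HeadroomMode(request_mode)
    return HeadroomMode(default_mode)


print(resolve_mode("audit"))
print(resolve_mode("audit", request_mode="simulate"))
print(resolve_mode("optimize", no_optimize=True))
```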

Setting Mode at SDK Initialization

from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",  # Options: "audit", "optimize", "simulate"
)

Overriding Mode Per Request

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="simulate",  # Override default for this specific call
)

CLI Configuration

When running the Headroom proxy, use the --no-optimize flag to disable optimization entirely, effectively forcing all traffic into audit mode regardless of client-side settings:

headroom proxy --no-optimize

Practical Examples by Mode

Using Audit Mode for Production Monitoring

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",
)
print(resp)   # Normal LLM output; audit info is reported in the X-Headroom-* headers

Running Optimize for Live Compression

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(resp)   # Request payload was compressed; X-Headroom-* headers show actual savings

Estimating Savings with Simulate

plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)

print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)

Summary

  • Audit mode observes traffic and logs what would change without modifying requests, ideal for safe production monitoring and baseline measurement
  • Optimize mode applies deterministic transforms like SmartCrusher and CacheAligner to reduce tokens and latency in live production environments
  • Simulate mode runs a dry-run returning a Plan object with tokens_saved and estimated_savings without calling the LLM, perfect for CI testing and cost estimation
  • Modes can be configured via SDK initialization (default_mode), per-request overrides (headroom_mode), or CLI flags (--no-optimize)
  • The HeadroomMode enum is defined in headroom/models/config.py and implemented throughout the headroom/ source tree, including headroom/client.py and the transforms directory

Frequently Asked Questions

Can I switch between audit and optimize mode without restarting my application?

Yes. You can override the mode on a per-request basis by passing the headroom_mode parameter to client.chat.completions.create(). This allows you to keep the SDK initialized with one default mode while selectively running individual requests in a different mode, such as testing optimize behavior on a single request while the rest of your traffic remains in audit.

Does simulate mode cost anything or call the LLM?

No. Simulate mode performs a dry-run of the compression pipeline against your request and returns a Plan object without calling the upstream LLM. This makes it useful for CI pipelines and cost estimation workflows where you want to calculate potential token savings and visualize which transforms would apply without incurring any API charges.

What transforms run in optimize mode?

In optimize mode, Headroom applies deterministic transforms including SmartCrusher for payload compression, CacheAligner for prefix optimization, and RollingWindow for context window management. These are implemented in the headroom/transforms/ directory and apply only when the system determines they can safely reduce token count without affecting response quality.

How do I force audit mode across all requests?

You can force audit mode by launching the Headroom proxy with the --no-optimize command-line flag. This disables optimization entirely at the proxy level, ensuring all traffic is observed but not modified regardless of what default_mode or headroom_mode settings clients attempt to use.
