Headroom Audit vs Optimize vs Simulate: Three Operating Modes Explained
Headroom’s audit mode observes requests without modifying payloads, optimize applies deterministic transforms to compress context, and simulate runs a dry-run that returns a transformation plan without calling the LLM.
The Headroom proxy intercepts LLM requests to reduce token usage and latency. Its runtime behavior is controlled by the HeadroomMode enum defined in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py), which determines whether the system observes traffic, actively optimizes payloads, or simulates changes for cost estimation. Understanding these three operating modes is essential for deploying Headroom safely in production environments.
Audit Mode: Observation Without Modification
In audit mode, the proxy observes every request and records what it would change, but does not modify the payload sent to the LLM. This mode is ideal for production monitoring, baseline measurement, and safety-first deployments where you need visibility into Headroom's behavior without affecting live traffic.
When running in audit mode, the request passes through to the LLM unchanged, but the proxy adds X-Headroom-* headers containing audit information about which transforms would have been applied.
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",  # ← observe only
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",  # override per-request (optional)
)
print(resp)  # contains normal LLM output plus X-Headroom-* headers with audit info
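If you inspect responses at the HTTP level, one simple way to pull out the audit information is to filter the response headers by the X-Headroom- prefix. The sketch below is illustrative only: the helper name headroom_headers and the sample header names and values are assumptions made for the example, not names confirmed by the Headroom source.

```python
def headroom_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return only the X-Headroom-* entries, matched case-insensitively."""
    prefix = "x-headroom-"
    return {k: v for k, v in headers.items() if k.lower().startswith(prefix)}


# Hypothetical header names and values, for illustration only.
sample = {
    "Content-Type": "application/json",
    "X-Headroom-Mode": "audit",
    "x-headroom-tokens-saved": "0",
}

audit_info = headroom_headers(sample)
for name, value in audit_info.items():
    print(f"{name}: {value}")
```

Matching on the lowercased name keeps the filter robust, since HTTP header names are case-insensitive.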
Optimize Mode: Live Compression and Transformation
The optimize mode applies safe, deterministic transforms to the request before it reaches the LLM. This is the default for performance-focused deployments and actively compresses context to reduce token costs and latency.
Transforms applied in this mode include SmartCrusher, CacheAligner, and RollingWindow, which compress large JSON payloads, align cache prefixes, and remove low-importance conversation turns. The modified payload is then sent to the LLM, with X-Headroom-* headers showing actual token savings.
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",  # ← enable compression
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(resp)  # payload is compressed; X-Headroom-* headers show savings
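To make the idea of trimming conversation history concrete, here is a minimal sketch of a rolling-window transform. It is not Headroom's RollingWindow implementation: it assumes recency is a stand-in for importance, keeps every system message, and the function name rolling_window and the max_turns parameter are hypothetical.

```python
def rolling_window(messages, max_turns=4):
    """Keep all system messages plus the most recent `max_turns` other messages.

    System messages are moved to the front of the result; the relative order
    of the remaining messages is preserved.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against max_turns <= 0, because rest[-0:] would return everything.
    recent = rest[-max_turns:] if max_turns > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is quantum tunneling?"},
    {"role": "assistant", "content": "A particle crossing a barrier."},
    {"role": "user", "content": "Give an example."},
]

trimmed = rolling_window(history, max_turns=2)
print(len(history), "->", len(trimmed), "messages")
```

The transform is deterministic: the same history and max_turns always produce the same trimmed list, which is the property the article attributes to optimize-mode transforms.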
Simulate Mode: Dry-Run Cost Estimation
The simulate mode does not call the upstream LLM. Instead, it returns a Plan object describing which transforms would run and the estimated token savings. This mode is designed for testing, cost estimation, CI pipelines, and any other scenario that needs a dry-run without incurring LLM usage charges.
The Plan object contains tokens_saved, transforms, and estimated_savings properties, allowing you to preview optimization impact before enabling live mode.
# `client` is the HeadroomClient constructed in the earlier examples
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)
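In a CI pipeline, a plan like this could back a simple savings gate. The sketch below uses a stand-in dataclass, FakePlan, that carries the three properties named above; the field types, the sample values, and the meets_savings_target helper are assumptions for illustration, and the real Plan object may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakePlan:
    """Stand-in for the Plan object; field types are assumed."""
    tokens_saved: int
    transforms: list[str] = field(default_factory=list)
    estimated_savings: float = 0.0


def meets_savings_target(plan: FakePlan, min_tokens_saved: int) -> bool:
    """Return True when the plan saves at least `min_tokens_saved` tokens."""
    return plan.tokens_saved >= min_tokens_saved


# Sample values are invented for the example.
plan = FakePlan(
    tokens_saved=1200,
    transforms=["SmartCrusher", "CacheAligner"],
    estimated_savings=0.003,
)

if meets_savings_target(plan, min_tokens_saved=1000):
    print("Savings target met:", plan.transforms)
else:
    print("Savings target missed")
```

Because simulate mode never calls the LLM, a check like this can run on every commit without incurring usage charges.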
Configuring Headroom Modes
You can configure the operating mode at three levels according to the source code in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py):
- SDK Construction: Set default_mode="audit", "optimize", or "simulate" when initializing HeadroomClient in [headroom/client.py](https://github.com/chopratejas/headroom/blob/main/headroom/client.py).
- Per-Request Override: Pass headroom_mode="audit" (or optimize/simulate) into client.chat.completions.create() to override the default for a single request.
- Proxy Command-Line: Use headroom proxy --no-optimize to disable optimization entirely, effectively forcing audit mode at the infrastructure level.
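The relationship between the first two levels can be sketched as a small resolution function: a per-request value, when present, wins over the client default. This is an assumption-laden illustration, not Headroom's code; the names resolve_mode and VALID_MODES are hypothetical, and the proxy flag is left out because it acts outside the SDK.

```python
VALID_MODES = ("audit", "optimize", "simulate")


def resolve_mode(default_mode, request_mode=None):
    """Pick the effective mode: a per-request override beats the client default."""
    mode = request_mode if request_mode is not None else default_mode
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode


print(resolve_mode("audit"))              # no override, so the default applies
print(resolve_mode("audit", "optimize"))  # the per-request override applies
```

Validating the mode string at this point means a typo such as "optimise" fails fast instead of silently falling back to some default.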
Implementation Details
The mode logic is implemented across several key files in the chopratejas/headroom repository:
- [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py): Defines the HeadroomMode enum with AUDIT, OPTIMIZE, and SIMULATE values.
- [headroom/client.py](https://github.com/chopratejas/headroom/blob/main/headroom/client.py): Implements the Python SDK, including mode switching in chat.completions.create() and the simulate() method.
- headroom/transforms/: Contains individual transform implementations (e.g., smart_crusher.py, cache_aligner.py) that execute only when optimize mode is active.
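For orientation, a three-value mode enum of this kind might look like the sketch below. It is a stand-in named HeadroomModeSketch, not a copy of the real HeadroomMode: the lowercase string values and the str mixin are assumptions inferred from the strings passed to default_mode, and the two helper functions simply restate the behavior described in this article.

```python
from enum import Enum


class HeadroomModeSketch(str, Enum):
    """Stand-in for the real enum; the string values are assumed."""
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def modifies_payload(mode: HeadroomModeSketch) -> bool:
    """Only optimize changes the payload that is sent upstream."""
    return mode is HeadroomModeSketch.OPTIMIZE


def calls_llm(mode: HeadroomModeSketch) -> bool:
    """Audit and optimize forward the request; simulate does not."""
    return mode is not HeadroomModeSketch.SIMULATE


for mode in HeadroomModeSketch:
    print(f"{mode.value}: modifies_payload={modifies_payload(mode)}, calls_llm={calls_llm(mode)}")
```

Deriving the enum from str lets a plain string such as "audit" be converted with HeadroomModeSketch("audit"), which is one way a string-valued default_mode could map onto enum members.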
Summary
- Audit mode observes traffic and logs potential changes without modifying requests, perfect for production monitoring.
- Optimize mode applies deterministic transforms like SmartCrusher and CacheAligner to reduce tokens and latency in live traffic.
- Simulate mode returns a Plan object with cost estimates without calling the LLM, ideal for CI testing and dry-runs.
- Configure modes via the SDK constructor, per-request parameters, or proxy CLI flags; the HeadroomMode enum itself is defined in [headroom/models/config.py](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py).
Frequently Asked Questions
Can I switch between audit and optimize mode without restarting the proxy?
Yes. You can override the default mode on a per-request basis by passing headroom_mode="audit" or headroom_mode="optimize" to client.chat.completions.create(), which needs no restart. To force audit mode across all traffic without code changes, launch the proxy with the headroom proxy --no-optimize CLI flag; because it is a command-line flag, it takes effect when the proxy starts.
What information does the simulate mode return?
Simulate mode returns a Plan object containing tokens_saved, transforms, and estimated_savings properties. This object details exactly which transforms (such as SmartCrusher or CacheAligner) would execute and quantifies the expected token reduction without making an actual LLM API call.
Does audit mode impact latency?
Audit mode adds minimal latency because it only inspects requests and adds headers without performing compute-intensive transforms. However, it does not provide the token savings or latency reduction benefits of optimize mode, which actively compresses payloads before transmission to the LLM.
Which transforms run in optimize mode?
The optimize mode executes deterministic transforms located in headroom/transforms/, including SmartCrusher for JSON compression, CacheAligner for prefix optimization, and RollingWindow for conversation history management. These transforms modify the request payload before it reaches the LLM.