Headroom Audit, Optimize, and Simulate Modes Explained
Headroom's three operating modes (audit, optimize, and simulate) determine whether the proxy observes traffic without changes, applies compression transforms, or performs a dry run that estimates savings without calling the LLM.
Headroom is an open-source LLM proxy, developed in the chopratejas/headroom repository, that reduces token usage through context compression. Understanding the differences between audit, optimize, and simulate modes is essential for safely deploying the tool across development, staging, and production environments. These mutually exclusive modes control whether transforms are observed, applied, or merely planned against your traffic.
How Headroom Modes Are Defined
In headroom/models/config.py, the HeadroomMode enum defines the three operating states that drive runtime behavior:
from enum import Enum

class HeadroomMode(str, Enum):
    AUDIT = "audit"        # Observe only, no modifications
    OPTIMIZE = "optimize"  # Apply deterministic transforms
    SIMULATE = "simulate"  # Return transform plan without API call
This enum is referenced throughout the SDK implementation in headroom/client.py and determines which path the request takes through the proxy pipeline.
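To make the three paths concrete, here is a minimal, self-contained sketch of mode-based dispatch. The enum mirrors the one shown above; everything else (`handle_request`, `plan_transforms`, the `call_llm` callback, and the truncation "transform") is a hypothetical stand-in for illustration and is not taken from headroom/client.py.

```python
from enum import Enum
from typing import Callable


class HeadroomMode(str, Enum):
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def plan_transforms(messages: list[dict]) -> dict:
    """Hypothetical planner: pretend each message can be cut to 20 characters."""
    shortened = [{**m, "content": m["content"][:20]} for m in messages]
    before = sum(len(m["content"]) for m in messages)
    after = sum(len(m["content"]) for m in shortened)
    return {"messages": shortened, "chars_saved": before - after}


def handle_request(
    mode: str,
    messages: list[dict],
    call_llm: Callable[[list[dict]], str],
) -> dict:
    """Route one request according to the mode (illustrative only)."""
    mode = HeadroomMode(mode)  # accepts "audit" as well as HeadroomMode.AUDIT
    plan = plan_transforms(messages)

    if mode is HeadroomMode.SIMULATE:
        # Dry run: report the plan, never call the LLM.
        return {"called_llm": False, "plan": plan}

    # Audit forwards the original payload; optimize forwards the transformed one.
    outgoing = plan["messages"] if mode is HeadroomMode.OPTIMIZE else messages
    return {"called_llm": True, "output": call_llm(outgoing), "plan": plan}
```

Because the enum subclasses `str`, a plain string such as `"audit"` converts cleanly via `HeadroomMode("audit")`, which is why the SDK examples later in this article can pass mode names as strings.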
Audit Mode: Production Observation Without Risk
In audit mode, Headroom acts as a transparent proxy: it intercepts every request and evaluates which transforms would apply, but sends the original payload unchanged to the LLM. This mode is designed for production monitoring and baseline measurement, where you need visibility into potential savings without affecting live traffic.
When running in audit, the request reaches the upstream LLM exactly as sent by the client. However, the response includes X-Headroom-* headers containing metadata about which transforms would have run and the estimated token savings. This safety-first approach lets you measure impact before enabling live compression.
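As a rough illustration of consuming that metadata, the sketch below filters a header mapping down to the X-Headroom-* entries. The helper name `headroom_headers` and the two example header names are assumptions made for this example; the article does not list the actual header names, so check the real response before relying on any of them.

```python
from typing import Mapping


def headroom_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Return only the X-Headroom-* headers, with lowercased names.

    HTTP header names are case-insensitive, so the comparison is done in lowercase.
    """
    prefix = "x-headroom-"
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }


# Hypothetical headers: the names after the X-Headroom- prefix are invented here.
example = {
    "Content-Type": "application/json",
    "X-Headroom-Example-Transforms": "SmartCrusher,CacheAligner",
    "X-Headroom-Example-Tokens-Saved": "1200",
}

audit_info = headroom_headers(example)
print(audit_info)
```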
Optimize Mode: Live Compression and Latency Reduction
Optimize mode enables Headroom to actively apply safe, deterministic transforms before requests reach the LLM. According to the source code, this includes transforms such as SmartCrusher, CacheAligner, and RollingWindow located in the headroom/transforms/ directory.
This is the default mode for performance-focused deployments. When enabled, the proxy compresses large JSON payloads, aligns cache prefixes for better cache hits, and drops low-importance conversation turns to minimize token usage and reduce latency. The transforms are deterministic: the same input always produces the same compressed output, which makes this mode safe for production environments where consistent behavior is required.
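The determinism property can be checked mechanically. The sketch below uses a toy windowing function as a stand-in; it is not Headroom's RollingWindow, and both `toy_rolling_window` and `is_deterministic` are hypothetical names. The idea is simply to run a transform several times on the same input and confirm the serialized outputs are identical.

```python
import json


def toy_rolling_window(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Toy stand-in: keep system messages plus the last `keep_last` other turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    start = max(len(rest) - keep_last, 0)  # avoids the rest[-0:] pitfall
    return system + rest[start:]


def is_deterministic(transform, messages: list[dict], runs: int = 5) -> bool:
    """Run the transform repeatedly and compare canonical JSON serializations."""
    outputs = {json.dumps(transform(messages), sort_keys=True) for _ in range(runs)}
    return len(outputs) == 1


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
]

trimmed = toy_rolling_window(history, keep_last=2)
print([m["content"] for m in trimmed])
print(is_deterministic(toy_rolling_window, history))
```

A check like this is most useful as a regression test: a transform that depends on timestamps, random sampling, or unordered iteration would fail it.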
Simulate Mode: Dry-Run Testing and Cost Estimation
Simulate mode provides a complete dry-run of the compression pipeline without actually calling the upstream LLM. Instead of returning LLM output, the method returns a Plan object describing exactly which transforms would execute and their projected impact.
As implemented in headroom/client.py, calling client.chat.completions.simulate() returns an object containing tokens_saved, transforms, and estimated_savings. This mode is ideal for CI pipelines, cost estimation workflows, or any scenario where you want to calculate potential savings without incurring LLM API charges.
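For the CI use case, a savings gate might look like the sketch below. `PlanStub` is a hypothetical stand-in for the Plan object: only the three field names come from the description above, while the field types, the `check_min_savings` helper, and the threshold are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStub:
    """Hypothetical stand-in for Plan; field types are assumed, not confirmed."""
    tokens_saved: int
    transforms: list[str] = field(default_factory=list)
    estimated_savings: float = 0.0


def check_min_savings(plan: PlanStub, min_tokens_saved: int) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if plan.tokens_saved < min_tokens_saved:
        failures.append(
            f"expected at least {min_tokens_saved} tokens saved, "
            f"got {plan.tokens_saved}"
        )
    if not plan.transforms:
        failures.append("no transforms would run")
    return failures


good = PlanStub(tokens_saved=1500, transforms=["SmartCrusher"], estimated_savings=0.01)
bad = PlanStub(tokens_saved=10)

print(check_min_savings(good, min_tokens_saved=1000))
print(check_min_savings(bad, min_tokens_saved=1000))
```

In a real pipeline the stub would be replaced by whatever `simulate()` actually returns, and a non-empty failure list would fail the build.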
Configuring Modes in Your Application
You can set Headroom's operating mode at three different levels, as documented in wiki/configuration.md:
- SDK Construction: Set default_mode when initializing HeadroomClient
- Per-Request Override: Pass the headroom_mode parameter to client.chat.completions.create()
- Command-Line Flag: Use --no-optimize to force audit mode in the proxy
Setting Mode at SDK Initialization
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",  # Options: "audit", "optimize", "simulate"
)
Overriding Mode Per Request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="simulate",  # Override default for this specific call
)
CLI Configuration
When running the Headroom proxy, use the --no-optimize flag to disable optimization entirely, effectively forcing all traffic into audit mode regardless of client-side settings:
headroom proxy --no-optimize
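Taken together, the three levels imply a precedence order: the proxy flag wins, then the per-request override, then the SDK default. The function below is a sketch of that reading of the text, not Headroom's actual resolution logic; `resolve_mode` and its `no_optimize` parameter are hypothetical names chosen to match the settings described above.

```python
from typing import Optional

VALID_MODES = {"audit", "optimize", "simulate"}


def resolve_mode(
    default_mode: str,
    headroom_mode: Optional[str] = None,
    no_optimize: bool = False,
) -> str:
    """Pick the effective mode for one request (illustrative precedence only)."""
    if no_optimize:
        # Per the description above, the proxy flag forces audit for all traffic.
        return "audit"
    mode = headroom_mode if headroom_mode is not None else default_mode
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode


print(resolve_mode("audit"))
print(resolve_mode("audit", headroom_mode="simulate"))
print(resolve_mode("optimize", headroom_mode="optimize", no_optimize=True))
```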
Practical Examples by Mode
Using Audit Mode for Production Monitoring
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",
)
print(resp)  # Normal LLM output plus X-Headroom-* headers with audit info
Running Optimize for Live Compression
client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(resp)  # Payload is compressed; X-Headroom-* headers show actual savings
Estimating Savings with Simulate
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)
Summary
- Audit mode observes traffic and logs what would change without modifying requests, ideal for safe production monitoring and baseline measurement
- Optimize mode applies deterministic transforms like SmartCrusher and CacheAligner to reduce tokens and latency in live production environments
- Simulate mode runs a dry-run returning a Plan object with tokens_saved and estimated_savings without calling the LLM, perfect for CI testing and cost estimation
- Modes can be configured via SDK initialization (default_mode), per-request overrides (headroom_mode), or CLI flags (--no-optimize)
- The HeadroomMode enum is defined in headroom/models/config.py and implemented throughout the headroom/ source tree, including headroom/client.py and the transforms directory
Frequently Asked Questions
Can I switch between audit and optimize mode without restarting my application?
Yes. You can override the mode on a per-request basis by passing the headroom_mode parameter to client.chat.completions.create(). This allows you to keep the SDK initialized with one default mode while selectively running individual requests in a different mode, such as testing optimize behavior on a single request while the rest of your traffic remains in audit.
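One way to use the per-request override for a gradual rollout is to route a fixed percentage of requests to optimize and leave the rest in audit. The sketch below is an assumption-laden illustration: `mode_for_request` and the idea of keying on a request ID are hypothetical and not part of Headroom. It hashes the ID so the same request always lands in the same bucket.

```python
import hashlib


def mode_for_request(request_id: str, optimize_percent: int) -> str:
    """Deterministically assign a request to "optimize" or "audit"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return "optimize" if bucket < optimize_percent else "audit"


# The same ID always maps to the same mode, regardless of process or restart.
first = mode_for_request("req-123", optimize_percent=10)
second = mode_for_request("req-123", optimize_percent=10)
print(first == second)
```

The returned string could then be passed as the headroom_mode argument described above, for example `headroom_mode=mode_for_request(request_id, 10)`.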
Does simulate mode cost anything or call the LLM?
No. Simulate mode performs a dry-run of the compression pipeline against your request and returns a Plan object without calling the upstream LLM. This makes it useful for CI pipelines and cost estimation workflows where you want to calculate potential token savings and visualize which transforms would apply without incurring any API charges.
What transforms run in optimize mode?
In optimize mode, Headroom applies deterministic transforms including SmartCrusher for payload compression, CacheAligner for prefix optimization, and RollingWindow for context window management. These are implemented in the headroom/transforms/ directory and apply only when the system determines they can safely reduce token count without affecting response quality.
How do I force audit mode across all requests?
You can force audit mode by launching the Headroom proxy with the --no-optimize command-line flag. This disables optimization entirely at the proxy level, ensuring all traffic is observed but not modified regardless of what default_mode or headroom_mode settings clients attempt to use.