# Headroom Audit, Optimize, and Simulate Modes Explained

> Understand Headroom modes: audit observes traffic, optimize applies compression, and simulate estimates savings without LLM calls. Learn which mode fits your needs.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-08

---

**Headroom's three operating modes (audit, optimize, and simulate) determine whether the proxy observes traffic without changes, applies compression transforms, or runs a dry run that estimates savings without calling the LLM.**

Headroom is an open-source LLM proxy (`chopratejas/headroom` on GitHub) that reduces token usage through context compression. Understanding the differences between **audit**, **optimize**, and **simulate** modes is essential for deploying it safely across development, staging, and production environments. The modes are mutually exclusive and control whether transforms are only evaluated (audit), actually applied (optimize), or planned without any upstream call (simulate).

## How Headroom Modes Are Defined

In [`headroom/models/config.py`](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py), the `HeadroomMode` enum defines the three operating states that drive runtime behavior:

```python
from enum import Enum


class HeadroomMode(str, Enum):
    AUDIT = "audit"        # Observe only, no modifications
    OPTIMIZE = "optimize"  # Apply deterministic transforms
    SIMULATE = "simulate"  # Return transform plan without API call
```

This enum is referenced throughout the SDK implementation in [`headroom/client.py`](https://github.com/chopratejas/headroom/blob/main/headroom/client.py) and determines which path the request takes through the proxy pipeline.
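The actual routing lives in the SDK and is not reproduced here. As a minimal sketch, assuming string mode values such as `"audit"` are parsed into the enum, mode-based dispatch could look like the following; `route_request` and its return labels are hypothetical and do not come from `headroom/client.py`.

```python
from enum import Enum


class HeadroomMode(str, Enum):
    # Mirrors the enum shown above so this sketch runs on its own.
    AUDIT = "audit"
    OPTIMIZE = "optimize"
    SIMULATE = "simulate"


def route_request(mode: str) -> str:
    """Return a label for the pipeline path a request would take (hypothetical)."""
    # Because HeadroomMode subclasses str, a plain string such as "audit"
    # converts to a member; an unknown string raises ValueError.
    parsed = HeadroomMode(mode)
    if parsed is HeadroomMode.AUDIT:
        return "forward original request, record would-be transforms"
    if parsed is HeadroomMode.OPTIMIZE:
        return "apply transforms, forward compressed request"
    return "compute transform plan, skip upstream call"
```

The `str` mixin is what lets configuration accept plain strings while code compares against enum members (`HeadroomMode.AUDIT == "audit"` is `True`).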

## Audit Mode: Production Observation Without Risk

In **audit** mode, Headroom acts as a transparent proxy that intercepts every request, evaluates which transforms *would* apply, but sends the original payload unchanged to the LLM. This mode is designed for production monitoring and baseline measurement where you need visibility into potential savings without affecting live traffic.

In audit mode, the upstream LLM receives the request exactly as the client sent it. The response, however, carries `X-Headroom-*` headers describing which transforms would have run and the estimated token savings. This safety-first approach lets you measure impact before enabling live compression.
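How you reach the raw response headers depends on your HTTP client. A minimal sketch, assuming you already have them as a name-to-value mapping, that pulls out the `X-Headroom-*` entries is shown below. HTTP header names are case-insensitive, so the match ignores case. The specific header names Headroom emits are not listed in this article, so the names in the test data are hypothetical.

```python
from collections.abc import Mapping


def headroom_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Collect X-Headroom-* headers (case-insensitively) from a header mapping."""
    prefix = "x-headroom-"
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }
```

Logging this filtered dictionary per request is one way to build the baseline measurement that audit mode is meant for.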

## Optimize Mode: Live Compression and Latency Reduction

**Optimize** mode makes Headroom apply safe, deterministic transforms before requests reach the LLM. These include `SmartCrusher`, `CacheAligner`, and `RollingWindow`, implemented in the `headroom/transforms/` directory.

Optimize is the typical choice for performance-focused deployments. When it is enabled, the proxy compresses large payloads such as JSON (`SmartCrusher`), aligns cache prefixes to improve cache hit rates (`CacheAligner`), and drops low-importance conversation turns (`RollingWindow`), reducing token usage and latency. Because the transforms are deterministic (the same input always produces the same compressed output), their effect is reproducible, which matters in production environments that require consistent behavior.
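Headroom's `RollingWindow` decides which turns to drop with its own logic, which is not reproduced here. As a toy stand-in, the sketch below keeps every system message plus the most recent N other turns, preserving original order; `rolling_window` and `keep_last` are hypothetical names. It illustrates why a deterministic transform is predictable: it is a pure function of the message list.

```python
from typing import Any

Message = dict[str, Any]


def rolling_window(messages: list[Message], keep_last: int) -> list[Message]:
    """Toy transform: keep all system messages plus the last `keep_last` other turns."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    non_system_idx = [i for i, m in enumerate(messages) if m.get("role") != "system"]
    # Slicing with [-0:] would keep everything, so handle keep_last == 0 explicitly.
    kept = set(non_system_idx[-keep_last:]) if keep_last else set()
    # Preserve the original ordering of whatever survives.
    return [
        m for i, m in enumerate(messages)
        if m.get("role") == "system" or i in kept
    ]
```

Calling it twice on the same input yields equal outputs, which is the property the paragraph above relies on.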

## Simulate Mode: Dry-Run Testing and Cost Estimation

**Simulate** mode runs the full compression pipeline as a dry run without calling the upstream LLM. Instead of LLM output, a simulate call returns a `Plan` object describing which transforms would execute and their projected impact.

As implemented in [`headroom/client.py`](https://github.com/chopratejas/headroom/blob/main/headroom/client.py), calling `client.chat.completions.simulate()` returns an object containing `tokens_saved`, `transforms`, and `estimated_savings`. This mode is ideal for CI pipelines, cost estimation workflows, or any scenario where you want to calculate potential savings without incurring LLM API charges.
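Headroom's real `Plan` comes from its own pipeline and token counting. To show what a dry run computes, here is a toy version in which `ToyPlan`, `toy_simulate`, the whitespace-collapsing transform, the roughly 4-characters-per-token heuristic, and the `usd_per_token` price are all assumptions; only the three field names mirror the ones described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlan:
    # Field names mirror those described for Headroom's Plan; types are assumed.
    tokens_saved: int
    transforms: list[str] = field(default_factory=list)
    estimated_savings: float = 0.0


def rough_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token), not a real tokenizer.
    return (len(text) + 3) // 4


def toy_simulate(messages: list[dict], usd_per_token: float) -> ToyPlan:
    """Estimate savings from collapsing whitespace, with no network call."""
    before = sum(rough_tokens(m["content"]) for m in messages)
    after = sum(rough_tokens(" ".join(m["content"].split())) for m in messages)
    saved = before - after  # never negative: collapsing whitespace only shortens text
    return ToyPlan(
        tokens_saved=saved,
        transforms=["collapse_whitespace"] if saved > 0 else [],
        estimated_savings=saved * usd_per_token,
    )
```

The key property is that everything is computed locally from the request, so the estimate costs nothing in API charges.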

## Configuring Modes in Your Application

You can set Headroom's operating mode at three different levels, as documented in [`wiki/configuration.md`](https://github.com/chopratejas/headroom/blob/main/wiki/configuration.md):

1. **SDK Construction**: Set `default_mode` when initializing `HeadroomClient`
2. **Per-Request Override**: Pass `headroom_mode` parameter to `client.chat.completions.create()`
3. **Command-Line Flag**: Use `--no-optimize` to force audit mode in the proxy
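The configuration docs list these levels, but this article does not show a single resolution function. The sketch below models the precedence implied by the per-request and CLI descriptions in this article: a per-request `headroom_mode` overrides `default_mode`, and the proxy's `--no-optimize` forces audit regardless of client settings. `resolve_mode` and `proxy_no_optimize` are hypothetical names.

```python
from typing import Optional

VALID_MODES = {"audit", "optimize", "simulate"}


def resolve_mode(
    default_mode: str,
    request_mode: Optional[str] = None,
    proxy_no_optimize: bool = False,
) -> str:
    """Pick the effective mode under a hypothetical precedence model."""
    # The proxy-level --no-optimize flag wins over any client-side setting.
    if proxy_no_optimize:
        return "audit"
    # A per-request headroom_mode overrides the client's default_mode.
    mode = request_mode if request_mode is not None else default_mode
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode
```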

### Setting Mode at SDK Initialization

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",  # Options: "audit", "optimize", "simulate"
)
```

### Overriding Mode Per Request

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="simulate",  # Override default for this specific call
)
```

### CLI Configuration

When running the Headroom proxy, use the `--no-optimize` flag to disable optimization entirely, effectively forcing all traffic into audit mode regardless of client-side settings:

```bash
headroom proxy --no-optimize
```

## Practical Examples by Mode

### Using Audit Mode for Production Monitoring

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="audit",
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum tunneling"}],
    headroom_mode="audit",  # Same as default_mode; shown explicitly for clarity
)
print(resp)  # Unmodified LLM output; audit info is reported in X-Headroom-* headers
```

### Running Optimize for Live Compression

```python
from headroom import HeadroomClient, OpenAIProvider
from openai import OpenAI

client = HeadroomClient(
    original_client=OpenAI(),
    provider=OpenAIProvider(),
    default_mode="optimize",
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)
print(resp)  # Request compressed where transforms apply; X-Headroom-* headers show actual savings
```

### Estimating Savings with Simulate

```python
# Assumes `client` is a HeadroomClient constructed as in the examples above.
plan = client.chat.completions.simulate(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Generate a large JSON report"}],
)

print(f"Would save {plan.tokens_saved} tokens")
print("Transforms that would run:", plan.transforms)
print("Estimated cost reduction:", plan.estimated_savings)
```
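In a CI pipeline, a simulate result is most useful as a pass/fail gate. The minimal sketch below is duck-typed: it only reads a `tokens_saved` attribute, so it works with the plan object above or any stand-in. `ci_exit_code` and the threshold value are hypothetical.

```python
def ci_exit_code(plan, min_tokens_saved: int) -> int:
    """Return 0 (pass) if the plan saves at least `min_tokens_saved` tokens, else 1 (fail)."""
    # Only plan.tokens_saved is read, so any object exposing it will do.
    return 0 if plan.tokens_saved >= min_tokens_saved else 1
```

A CI script could then finish with `sys.exit(ci_exit_code(plan, 500))` after the `simulate()` call; the 500-token threshold is an arbitrary example.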

## Summary

- **Audit** mode observes traffic and logs what would change without modifying requests, ideal for safe production monitoring and baseline measurement
- **Optimize** mode applies deterministic transforms like `SmartCrusher` and `CacheAligner` to reduce tokens and latency in live production environments
- **Simulate** mode runs a dry-run returning a `Plan` object with `tokens_saved` and `estimated_savings` without calling the LLM, perfect for CI testing and cost estimation
- Modes can be configured via SDK initialization (`default_mode`), per-request overrides (`headroom_mode`), or CLI flags (`--no-optimize`)
- The `HeadroomMode` enum is defined in [`headroom/models/config.py`](https://github.com/chopratejas/headroom/blob/main/headroom/models/config.py) and implemented throughout the `headroom/` source tree, including [`headroom/client.py`](https://github.com/chopratejas/headroom/blob/main/headroom/client.py) and the transforms directory

## Frequently Asked Questions

### Can I switch between audit and optimize mode without restarting my application?

Yes. You can override the mode on a per-request basis by passing the `headroom_mode` parameter to `client.chat.completions.create()`. This allows you to keep the SDK initialized with one default mode while selectively running individual requests in a different mode, such as testing optimize behavior on a single request while the rest of your traffic remains in audit.

### Does simulate mode cost anything or call the LLM?

No. Simulate mode performs a dry-run of the compression pipeline against your request and returns a `Plan` object without calling the upstream LLM. This makes it useful for CI pipelines and cost estimation workflows where you want to calculate potential token savings and visualize which transforms would apply without incurring any API charges.

### What transforms run in optimize mode?

In optimize mode, Headroom applies deterministic transforms including `SmartCrusher` for payload compression, `CacheAligner` for prefix optimization, and `RollingWindow` for context window management. These are implemented in the `headroom/transforms/` directory and are intended to apply only where they can reduce token count without degrading response quality.
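Provider-side prompt caching generally rewards requests that share a long, identical prefix. The sketch below is a toy illustration of that idea, not `CacheAligner`'s algorithm: it measures the shared prefix of two serialized prompts and shows that placing a changing value (a date) at the end preserves far more of the prefix than placing it first. All names here are hypothetical.

```python
import os

STATIC = "You are a helpful assistant. " * 20  # stand-in for long fixed instructions


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two serialized prompts."""
    return len(os.path.commonprefix([a, b]))


def prompt_dynamic_first(date: str) -> str:
    # A changing value at the start breaks the shared prefix almost immediately.
    return f"Today is {date}. " + STATIC


def prompt_static_first(date: str) -> str:
    # Fixed instructions first and the changing value last keeps the prefix long.
    return STATIC + f"Today is {date}."
```

Comparing two consecutive days, the static-first layout shares the entire instruction block, while the dynamic-first layout diverges within the first few characters.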

### How do I force audit mode across all requests?

You can force audit mode by launching the Headroom proxy with the `--no-optimize` command-line flag. This disables optimization entirely at the proxy level, ensuring all traffic is observed but not modified regardless of what `default_mode` or `headroom_mode` settings clients attempt to use.