Headroom OpenTelemetry Monitoring and Observability Metrics: Complete Reference

Headroom exposes more than 28 production-grade OpenTelemetry metrics, all defined in headroom/observability/metrics.py and emitted through the HeadroomOtelMetrics façade. They include counters for proxy requests, token savings, and compression pipelines; histograms for end-to-end latency and per-stage duration; and observable gauges for subscription-window utilization.

The chopratejas/headroom repository ships a dedicated OpenTelemetry (OTel) metrics façade that records telemetry for every proxy request, compression pipeline run, and subscription-window update. The façade is instantiated by configure_otel_metrics() and retrieved with get_otel_metrics(), which returns a no-op instance when OTel is disabled. When enabled, the metrics are exported via standard OTLP endpoints or the console.

Proxy Request Metrics

The proxy layer emits counters and histograms that track request volume, cache behavior, token throughput, and end-to-end latency. Each metric is created inside the HeadroomOtelMetrics class constructor in headroom/observability/metrics.py.

Request Volume and Status Counters

  • headroom.proxy.requests — Total proxy requests handled (lines 38-41).
  • headroom.proxy.requests.cached — Requests served directly from a provider cache (lines 44-46).
  • headroom.proxy.requests.failed — Requests that resulted in an error (lines 48-50).
  • headroom.proxy.requests.rate_limited — Requests rejected by rate-limiting (lines 52-54).

Token and Cache Counters

  • headroom.proxy.tokens.input — Input tokens received by the proxy (lines 58-60).
  • headroom.proxy.tokens.output — Output tokens returned by upstream providers (lines 62-64).
  • headroom.proxy.tokens.saved — Tokens saved by Headroom’s compression logic (lines 66-68).
  • headroom.proxy.cache.read_tokens — Provider-cache read tokens observed (lines 70-72).
  • headroom.proxy.cache.write_tokens — Provider-cache write tokens observed (lines 74-76).
  • headroom.proxy.cache.write_ttl_tokens — Cache-write tokens bucketed by TTL values such as "5m" and "1h" (lines 78-84).
  • headroom.proxy.cache.uncached_input_tokens — Input tokens that missed the cache (lines 86-88).
  • headroom.proxy.cache.busts — Requests that triggered a cache bust (lines 90-92).
  • headroom.proxy.cache.bust_tokens_lost — Tokens lost because of a cache bust (lines 94-96).
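
The token and cache counters above are most useful as ratios. The sketch below shows one way to derive a savings rate and a cache-read share from counter totals. TokenTotals, savings_rate, and cache_read_share are hypothetical names, not part of Headroom, and the choice of denominators is an assumption that should be checked against how a given deployment counts tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenTotals:
    """Hypothetical snapshot of counter totals read back from a metrics backend."""

    input_tokens: int            # headroom.proxy.tokens.input
    saved_tokens: int            # headroom.proxy.tokens.saved
    cache_read_tokens: int       # headroom.proxy.cache.read_tokens
    uncached_input_tokens: int   # headroom.proxy.cache.uncached_input_tokens


def savings_rate(totals: TokenTotals) -> float:
    """Saved tokens as a fraction of (input + saved).

    Assumption: input is counted after savings, so the original size is
    input + saved. Use input alone as the denominator if that is not the case.
    """
    original = totals.input_tokens + totals.saved_tokens
    return totals.saved_tokens / original if original else 0.0


def cache_read_share(totals: TokenTotals) -> float:
    """Cache-read tokens as a fraction of cache-read plus uncached input tokens."""
    denominator = totals.cache_read_tokens + totals.uncached_input_tokens
    return totals.cache_read_tokens / denominator if denominator else 0.0
```

Both functions return 0.0 when no tokens have been counted yet, which avoids a division by zero on a freshly started process.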

Latency and Duration Histograms

  • headroom.proxy.request.duration — End-to-end request latency in seconds (lines 98-100).
  • headroom.proxy.overhead.duration — Time spent inside Headroom’s optimization logic in seconds (lines 102-104).
  • headroom.proxy.ttfb.duration — Time-to-first-byte from the upstream provider in seconds (lines 106-108).
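
These histograms are described in seconds, while record_proxy_request() accepts latency_ms, overhead_ms, and ttfb_ms. Whether the helper converts units internally is not stated here, so the sketch below is only an illustration of the arithmetic involved when comparing raw millisecond timings with the second-based histograms. Both function names are hypothetical.

```python
def ms_to_seconds(milliseconds: float) -> float:
    """Convert a millisecond timing to seconds, the unit used by the histograms."""
    return milliseconds / 1000.0


def overhead_share(latency_ms: float, overhead_ms: float) -> float:
    """Fraction of end-to-end latency spent inside the optimization logic.

    Returns 0.0 for a non-positive latency and caps the result at 1.0 so that
    clock jitter cannot produce a share above 100%.
    """
    if latency_ms <= 0:
        return 0.0
    return min(overhead_ms / latency_ms, 1.0)
```

With the values from the proxy example in this article (210.5 ms total, 15.2 ms overhead), the overhead share comes out at roughly 7%.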

Compression Pipeline Metrics

The compression subsystem exposes counters for throughput and failures, histograms for stage timing, and metadata about individual transforms and waste detection. All of these live in the same headroom/observability/metrics.py module and are surfaced via record_pipeline_run() and record_compression_failure().

Execution and Failure Counters

  • headroom.compression.runs — Number of compression pipeline executions (lines 110-112).
  • headroom.compression.failures — Compression attempts that failed before producing a result (lines 114-116).

Token Throughput Counters

  • headroom.compression.tokens.input — Tokens fed into the compression pipeline (lines 118-120).
  • headroom.compression.tokens.output — Tokens emitted by the compression pipeline (lines 122-124).
  • headroom.compression.tokens.saved — Net tokens removed by compression, calculated as input minus output (lines 126-128).
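
Since the saved counter is described as input minus output, the relationship can be written out directly. This is a minimal sketch with hypothetical function names; it does not reproduce Headroom's internal code.

```python
def tokens_saved(tokens_before: int, tokens_after: int) -> int:
    """Net tokens removed by compression: input minus output."""
    return tokens_before - tokens_after


def reduction_ratio(tokens_before: int, tokens_after: int) -> float:
    """Share of the input removed by compression, or 0.0 for an empty input."""
    if tokens_before <= 0:
        return 0.0
    return tokens_saved(tokens_before, tokens_after) / tokens_before
```

For a run that takes 1500 tokens in and emits 900, this gives 600 tokens saved and a 40% reduction.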

Timing and Transform Histograms

  • headroom.compression.pipeline.duration — Total time spent running the compression pipeline in seconds (lines 130-132).
  • headroom.compression.stage.duration — Per-stage timing inside the pipeline in seconds (lines 134-136).
  • headroom.compression.transforms — Count of each transform applied during compression, such as smart_crusher or diff_compressor (lines 138-140).
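
Per-stage timings have to be collected somewhere before they can be recorded. One possible approach, sketched below, is a small context manager that accumulates milliseconds per stage into a dict shaped like the timing argument used in the pipeline example later in this article. timed_stage is a hypothetical helper, not a Headroom API.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_stage(timing: Dict[str, float], stage: str) -> Iterator[None]:
    """Add the elapsed wall-clock time of the block, in milliseconds, to timing[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Accumulate so that a stage that runs more than once is summed.
        timing[stage] = timing.get(stage, 0.0) + elapsed_ms
```

The timing is written in a finally clause, so a stage that raises still contributes its elapsed time.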

Waste Detection Counters

  • headroom.compression.waste.tokens — Tokens identified as “waste” by waste-signal detectors like empty_lines (lines 142-144).

Subscription Window and Overage Gauges

Headroom provides observable gauges that report Anthropic-specific rate-limit and billing data. These are defined in headroom/observability/metrics.py within the HeadroomOtelMetrics constructor and report values as percentages, durations, or USD.

  • headroom.subscription.5h_utilization_pct — Anthropic 5-hour rate-limit window utilization, reported as a 0-100 percentage.
  • headroom.subscription.7d_utilization_pct — Anthropic 7-day rate-limit window utilization, reported as a 0-100 percentage.
  • headroom.subscription.5h_seconds_to_reset — Seconds until the 5-hour window resets.
  • headroom.subscription.7d_seconds_to_reset — Seconds until the 7-day window resets.
  • headroom.subscription.overage_usd — Anthropic extra-usage credits consumed, measured in USD.
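
The utilization and seconds-to-reset gauges can be combined into a rough end-of-window projection. The sketch below uses a simple linear extrapolation and assumes a fixed-length window that started exactly one window length before the reset; that assumption, the function names, and the thresholds are all illustrative rather than taken from Headroom.

```python
def projected_utilization_pct(
    utilization_pct: float, seconds_to_reset: float, window_seconds: float
) -> float:
    """Linearly extrapolate current utilization to the end of the window."""
    elapsed = window_seconds - seconds_to_reset
    if elapsed <= 0:
        # Window has just started (or inputs are inconsistent): nothing to extrapolate.
        return utilization_pct
    return utilization_pct * window_seconds / elapsed


def window_status(
    utilization_pct: float, warn_pct: float = 80.0, critical_pct: float = 95.0
) -> str:
    """Map a 0-100 utilization value to 'ok', 'warn', or 'critical'."""
    if utilization_pct >= critical_pct:
        return "critical"
    if utilization_pct >= warn_pct:
        return "warn"
    return "ok"
```

For example, 40% utilization halfway through the 5-hour window (9000 of 18000 seconds remaining) projects to 80% at reset.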

Recording Metrics in Production

The HeadroomOtelMetrics class exposes helper methods such as record_proxy_request(), record_pipeline_run(), and record_compression_failure() that increment counters and record histogram values in a single call. The following examples show how to emit telemetry from application code.

Proxy Request Example

from headroom.observability.metrics import configure_otel_metrics, get_otel_metrics

# Enable OTel export (normally done once at process start)
configure_otel_metrics()

metrics = get_otel_metrics()

# Record a successful proxy request
metrics.record_proxy_request(
    provider="openai",
    model="gpt-4o",
    input_tokens=1200,
    output_tokens=800,
    tokens_saved=400,
    latency_ms=210.5,
    cached=False,
    overhead_ms=15.2,
    ttfb_ms=85.0,
    cache_read_tokens=0,
    cache_write_tokens=0,
    cache_write_5m_tokens=0,
    cache_write_1h_tokens=0,
    uncached_input_tokens=1200,
)

Compression Pipeline Example


# Record a compression pipeline run
# (reuses the metrics object created in the proxy request example)
metrics.record_pipeline_run(
    model="gpt-4o-mini",
    provider="openai",
    tokens_before=1500,
    tokens_after=900,
    duration_ms=45.3,
    transforms_applied=["smart_crusher", "diff_compressor"],
    timing={"smart_crusher": 12.5, "diff_compressor": 30.1},
    waste_signals={"empty_lines": 10},
)

Compression Failure Example


# Record a compression failure
# (reuses the metrics object created in the proxy request example)
metrics.record_compression_failure(
    model="gpt-3.5-turbo",
    operation="smart_crusher",
    error_type="TimeoutError",
)

These methods update the counters and histograms defined in headroom/observability/metrics.py, which are then exported through the configured OTLP endpoint or the console.
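
In application code, failure recording usually sits in an exception handler. The sketch below shows one way to wrap a compression call so that any exception is reported before being re-raised. StubMetrics is a stand-in that only mimics the record_compression_failure() keyword arguments shown above so the example runs without Headroom installed; run_with_failure_metric is a hypothetical helper.

```python
from typing import Callable, List, Tuple


class StubMetrics:
    """Stand-in recorder (not the real facade) that keeps failures in a list."""

    def __init__(self) -> None:
        self.failures: List[Tuple[str, str, str]] = []

    def record_compression_failure(
        self, model: str, operation: str, error_type: str
    ) -> None:
        self.failures.append((model, operation, error_type))


def run_with_failure_metric(
    metrics, model: str, operation: str, fn: Callable[[], str]
) -> str:
    """Run fn(); on any exception, record a failure and re-raise it."""
    try:
        return fn()
    except Exception as exc:
        metrics.record_compression_failure(
            model=model,
            operation=operation,
            error_type=type(exc).__name__,  # e.g. "TimeoutError"
        )
        raise
```

Re-raising keeps the caller's error handling unchanged; the wrapper only adds the metric.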

Enabling and Configuring the Exporter

Export behavior is controlled by environment variables read during configure_otel_metrics(). The key variables include:

  • HEADROOM_OTEL_METRICS_ENABLED — toggles the OTel metrics pipeline.
  • HEADROOM_OTEL_METRICS_EXPORTER — selects the exporter backend, such as console or otlp.
  • HEADROOM_OTEL_METRICS_ENDPOINT — sets the target OTLP endpoint URL.

When OTel is disabled, get_otel_metrics() returns a no-op instance that silently discards recordings so that production code does not require branching logic.
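
The enable-or-no-op pattern described above can be illustrated in a few lines. This is a sketch of the general idea, not Headroom's implementation: the set of accepted truthy strings, otel_metrics_enabled, and NoOpMetrics are all assumptions made for the example.

```python
import os
from typing import Any, Callable, Mapping, Optional

# Assumed truthy spellings; the values Headroom actually accepts may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def otel_metrics_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if HEADROOM_OTEL_METRICS_ENABLED is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get("HEADROOM_OTEL_METRICS_ENABLED", "")
    return value.strip().lower() in _TRUTHY


class NoOpMetrics:
    """Accepts any record_*() call and discards it."""

    def __getattr__(self, name: str) -> Callable[..., None]:
        if name.startswith("record_"):
            def _discard(*args: Any, **kwargs: Any) -> None:
                return None
            return _discard
        raise AttributeError(name)
```

Because the no-op object answers every record_*() call, calling code can record metrics unconditionally and leave the on/off decision to configuration.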

Core Observability Source Files

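The full monitoring surface is implemented across a small set of authoritative files. The central one is headroom/observability/metrics.py, which defines the HeadroomOtelMetrics façade and every metric listed above.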

Summary

  • Headroom exposes more than 28 OpenTelemetry counters, histograms, and observable gauges through the HeadroomOtelMetrics façade in headroom/observability/metrics.py.
  • Proxy metrics cover request volume, cache hits, token savings, and end-to-end latency.
  • Compression metrics capture pipeline runs, per-stage timing, transform counts, and waste-signal detection.
  • Subscription gauges surface Anthropic rate-limit utilization and overage costs.
  • configure_otel_metrics() and get_otel_metrics() provide a zero-friction setup that falls back to a no-op implementation when OTel is disabled.

Frequently Asked Questions

How do I enable OpenTelemetry metrics in Headroom?

Call configure_otel_metrics() at process startup and set the environment variable HEADROOM_OTEL_METRICS_ENABLED to a truthy value. You can also set HEADROOM_OTEL_METRICS_EXPORTER and HEADROOM_OTEL_METRICS_ENDPOINT to route data to an OTLP backend rather than the console.

What is the difference between headroom.proxy.tokens.saved and headroom.compression.tokens.saved?

headroom.proxy.tokens.saved measures net tokens saved at the proxy layer after all optimizations, while headroom.compression.tokens.saved isolates the reduction achieved strictly inside the compression pipeline by comparing input and output token counts. The proxy metric may include savings from caching or other upstream logic, whereas the compression metric is scoped to the pipeline alone.

Which metric tracks upstream provider latency?

The headroom.proxy.ttfb.duration histogram records time-to-first-byte from the upstream provider in seconds. For the total request lifecycle—including Headroom’s own overhead—you should also monitor headroom.proxy.request.duration.

How does Headroom report rate-limit utilization?

Headroom exposes two observable gauges, headroom.subscription.5h_utilization_pct and headroom.subscription.7d_utilization_pct, which report Anthropic window utilization as a 0-100 percentage. Complementary gauges such as headroom.subscription.5h_seconds_to_reset and headroom.subscription.7d_seconds_to_reset tell you exactly how long remains until each window refreshes.
