Headroom OpenTelemetry Monitoring and Observability Metrics: Complete Reference
Headroom exposes more than 28 production-grade OpenTelemetry metrics, all defined in headroom/observability/metrics.py and emitted through the HeadroomOtelMetrics façade: counters for proxy requests, token savings, and compression pipelines; histograms for end-to-end latency and per-stage duration; and observable gauges for subscription-window utilization.
The chopratejas/headroom repository ships a dedicated OpenTelemetry (OTel) metrics façade that records production-grade telemetry for every proxy request, compression pipeline run, and subscription-window update. The façade is set up by configure_otel_metrics() and retrieved with get_otel_metrics(), which returns a no-op instance when OTel is disabled. When OTel is enabled, the metrics are exported via standard OTLP endpoints or the console.
Proxy Request Metrics
The proxy layer emits counters and histograms that track request volume, cache behavior, token throughput, and end-to-end latency. Each metric is created inside the HeadroomOtelMetrics class constructor in headroom/observability/metrics.py.
Request Volume and Status Counters
- headroom.proxy.requests: Total proxy requests handled (lines 38-41).
- headroom.proxy.requests.cached: Requests served directly from a provider cache (lines 44-46).
- headroom.proxy.requests.failed: Requests that resulted in an error (lines 48-50).
- headroom.proxy.requests.rate_limited: Requests rejected by rate-limiting (lines 52-54).
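These counters are usually read as ratios rather than raw totals. The sketch below is a self-contained illustration, not Headroom code: RequestCounts and request_ratios() are hypothetical names, and it assumes that cached, failed, and rate-limited requests are all included in the headroom.proxy.requests total, which the source does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical snapshot of the four request counters over one time window."""

    total: int         # headroom.proxy.requests
    cached: int        # headroom.proxy.requests.cached
    failed: int        # headroom.proxy.requests.failed
    rate_limited: int  # headroom.proxy.requests.rate_limited


def request_ratios(counts: RequestCounts) -> dict[str, float]:
    """Return each status counter as a fraction of total requests."""
    if counts.total <= 0:
        # Avoid dividing by zero when no traffic was observed.
        return {"cached": 0.0, "failed": 0.0, "rate_limited": 0.0}
    return {
        "cached": counts.cached / counts.total,
        "failed": counts.failed / counts.total,
        "rate_limited": counts.rate_limited / counts.total,
    }


ratios = request_ratios(RequestCounts(total=200, cached=50, failed=4, rate_limited=2))
print(ratios)  # {'cached': 0.25, 'failed': 0.02, 'rate_limited': 0.01}
```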
Token and Cache Counters
- headroom.proxy.tokens.input: Input tokens received by the proxy (lines 58-60).
- headroom.proxy.tokens.output: Output tokens returned by upstream providers (lines 62-64).
- headroom.proxy.tokens.saved: Tokens saved by Headroom’s compression logic (lines 66-68).
- headroom.proxy.cache.read_tokens: Provider-cache read tokens observed (lines 70-72).
- headroom.proxy.cache.write_tokens: Provider-cache write tokens observed (lines 74-76).
- headroom.proxy.cache.write_ttl_tokens: Cache-write tokens bucketed by TTL values such as "5m" and "1h" (lines 78-84).
- headroom.proxy.cache.uncached_input_tokens: Input tokens that missed the cache (lines 86-88).
- headroom.proxy.cache.busts: Requests that triggered a cache bust (lines 90-92).
- headroom.proxy.cache.bust_tokens_lost: Tokens lost because of a cache bust (lines 94-96).
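A single counter bucketed by TTL typically means one data point per bucket, each tagged with an attribute. The following sketch shows that idea in plain Python. It is an assumption about the shape of the data, not the library's implementation: ttl_data_points() is a hypothetical helper and the attribute key "ttl" is invented for illustration.

```python
def ttl_data_points(
    cache_write_5m_tokens: int,
    cache_write_1h_tokens: int,
) -> list[tuple[int, dict[str, str]]]:
    """Return (token_count, attributes) pairs, one per non-empty TTL bucket."""
    buckets = {"5m": cache_write_5m_tokens, "1h": cache_write_1h_tokens}
    # Skip empty buckets so no zero-valued data points are produced.
    return [(tokens, {"ttl": ttl}) for ttl, tokens in buckets.items() if tokens > 0]


print(ttl_data_points(300, 0))  # [(300, {'ttl': '5m'})]
```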
Latency and Duration Histograms
- headroom.proxy.request.duration: End-to-end request latency in seconds (lines 98-100).
- headroom.proxy.overhead.duration: Time spent inside Headroom’s optimization logic in seconds (lines 102-104).
- headroom.proxy.ttfb.duration: Time-to-first-byte from the upstream provider in seconds (lines 106-108).
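The histograms are expressed in seconds, while the record_proxy_request() example later in this reference passes millisecond values (latency_ms, overhead_ms, ttfb_ms). The sketch below shows the unit conversion and a simple overhead ratio. Both functions are hypothetical helpers written for this illustration. Whether Headroom converts units in exactly this way is an assumption.

```python
def to_seconds(value_ms: float) -> float:
    """Convert a millisecond measurement to seconds."""
    return value_ms / 1000.0


def overhead_fraction(latency_ms: float, overhead_ms: float) -> float:
    """Return the share of end-to-end latency spent in optimization logic."""
    if latency_ms <= 0:
        return 0.0
    return overhead_ms / latency_ms


print(to_seconds(210.5))                          # 0.2105
print(round(overhead_fraction(210.5, 15.2), 3))   # 0.072
```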
Compression Pipeline Metrics
The compression subsystem exposes counters for throughput and failures, histograms for stage timing, and metadata about individual transforms and waste detection. All of these live in the same headroom/observability/metrics.py module and are surfaced via record_pipeline_run() and record_compression_failure().
Execution and Failure Counters
- headroom.compression.runs: Number of compression pipeline executions (lines 110-112).
- headroom.compression.failures: Compression attempts that failed before producing a result (lines 114-116).
Token Throughput Counters
- headroom.compression.tokens.input: Tokens fed into the compression pipeline (lines 118-120).
- headroom.compression.tokens.output: Tokens emitted by the compression pipeline (lines 122-124).
- headroom.compression.tokens.saved: Net tokens removed by compression, calculated as input minus output (lines 126-128).
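The relationship between the three counters is simple arithmetic. The sketch below works through it: compression_savings() is a hypothetical helper, and the savings ratio is an extra derived figure that the source does not list as a metric.

```python
def compression_savings(tokens_before: int, tokens_after: int) -> tuple[int, float]:
    """Return (tokens_saved, savings_ratio) for one pipeline run."""
    saved = tokens_before - tokens_after  # input minus output
    ratio = saved / tokens_before if tokens_before > 0 else 0.0
    return saved, ratio


print(compression_savings(1500, 900))  # (600, 0.4)
```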
Timing and Transform Histograms
- headroom.compression.pipeline.duration: Total time spent running the compression pipeline in seconds (lines 130-132).
- headroom.compression.stage.duration: Per-stage timing inside the pipeline in seconds (lines 134-136).
- headroom.compression.transforms: Count of each transform applied during compression, such as smart_crusher or diff_compressor (lines 138-140).
Waste Detection Counters
- headroom.compression.waste.tokens: Tokens identified as “waste” by waste-signal detectors like empty_lines (lines 142-144).
Subscription Window and Overage Gauges
Headroom provides observable gauges that report Anthropic-specific rate-limit and billing data. These are defined in headroom/observability/metrics.py within the HeadroomOtelMetrics constructor and report values as percentages, durations, or USD.
- headroom.subscription.5h_utilization_pct: Anthropic 5-hour rate-limit window utilization, reported as a 0-100 percentage.
- headroom.subscription.7d_utilization_pct: Anthropic 7-day rate-limit window utilization, reported as a 0-100 percentage.
- headroom.subscription.5h_seconds_to_reset: Seconds until the 5-hour window resets.
- headroom.subscription.7d_seconds_to_reset: Seconds until the 7-day window resets.
- headroom.subscription.overage_usd: Anthropic extra-usage credits consumed, measured in USD.
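A utilization percentage and a seconds-to-reset value together are enough to drive a basic alert. The sketch below classifies one window from those two readings. It is illustrative only: window_status() and the 80 percent warning threshold are assumptions, not anything Headroom ships.

```python
def window_status(
    utilization_pct: float,
    seconds_to_reset: float,
    warn_at: float = 80.0,  # hypothetical warning threshold
) -> tuple[str, float]:
    """Classify a rate-limit window and report minutes until it resets."""
    if not 0.0 <= utilization_pct <= 100.0:
        raise ValueError("utilization_pct must be within 0-100")
    if utilization_pct >= 100.0:
        level = "exhausted"
    elif utilization_pct >= warn_at:
        level = "warning"
    else:
        level = "ok"
    # Clamp negative values in case a reading arrives after the reset time.
    return level, max(seconds_to_reset, 0.0) / 60.0


print(window_status(85.0, 1800))  # ('warning', 30.0)
```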
Recording Metrics in Production
The HeadroomOtelMetrics class exposes helper methods such as record_proxy_request(), record_pipeline_run(), and record_compression_failure() that increment counters and record histogram values in a single call. The following examples show how to emit telemetry from application code.
Proxy Request Example
```python
from headroom.observability.metrics import configure_otel_metrics, get_otel_metrics

# Enable OTel export (normally done once at process start)
configure_otel_metrics()
metrics = get_otel_metrics()

# Record a successful proxy request
metrics.record_proxy_request(
    provider="openai",
    model="gpt-4o",
    input_tokens=1200,
    output_tokens=800,
    tokens_saved=400,
    latency_ms=210.5,
    cached=False,
    overhead_ms=15.2,
    ttfb_ms=85.0,
    cache_read_tokens=0,
    cache_write_tokens=0,
    cache_write_5m_tokens=0,
    cache_write_1h_tokens=0,
    uncached_input_tokens=1200,
)
```
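The example above passes a fixed latency_ms value. In real code that number has to be measured. One common way to obtain it is a small timing wrapper around the upstream call, sketched below with only the standard library. timed_call() is a hypothetical helper and is not part of Headroom.

```python
import time
from typing import Any, Callable


def timed_call(fn: Callable[[], Any]) -> tuple[Any, float]:
    """Run fn and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in for an upstream provider call.
result, latency_ms = timed_call(lambda: sum(range(1000)))
print(result)  # 499500
```

The measured latency_ms could then be passed to record_proxy_request() alongside the token counts.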
Compression Pipeline Example
```python
# Record a compression pipeline run (reuses the metrics instance created above)
metrics.record_pipeline_run(
    model="gpt-4o-mini",
    provider="openai",
    tokens_before=1500,
    tokens_after=900,
    duration_ms=45.3,
    transforms_applied=["smart_crusher", "diff_compressor"],
    timing={"smart_crusher": 12.5, "diff_compressor": 30.1},
    waste_signals={"empty_lines": 10},
)
```
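A single pipeline run carries three collections: the transforms applied, per-stage timing, and waste signals. The sketch below shows one plausible way such a call could fan out into individual data points for the compression metrics listed earlier. It is not Headroom's implementation. fan_out() is hypothetical, as are the attribute keys transform, stage, and signal, and so is the assumption that the timing values are milliseconds.

```python
from typing import Union

DataPoint = tuple[str, Union[int, float], dict[str, str]]


def fan_out(
    transforms_applied: list[str],
    timing: dict[str, float],
    waste_signals: dict[str, int],
) -> list[DataPoint]:
    """Expand one pipeline run into (metric_name, value, attributes) tuples."""
    points: list[DataPoint] = []
    for name in transforms_applied:
        # One count per transform that ran.
        points.append(("headroom.compression.transforms", 1, {"transform": name}))
    for stage, duration_ms in timing.items():
        # Assumed milliseconds in, seconds out.
        points.append(
            ("headroom.compression.stage.duration", duration_ms / 1000.0, {"stage": stage})
        )
    for signal, tokens in waste_signals.items():
        points.append(("headroom.compression.waste.tokens", tokens, {"signal": signal}))
    return points


points = fan_out(
    ["smart_crusher", "diff_compressor"],
    {"smart_crusher": 12.5, "diff_compressor": 30.1},
    {"empty_lines": 10},
)
print(len(points))  # 5
```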
Compression Failure Example
```python
# Record a compression failure
metrics.record_compression_failure(
    model="gpt-3.5-turbo",
    operation="smart_crusher",
    error_type="TimeoutError",
)
```
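In practice the error_type string usually comes from a caught exception. The sketch below shows that pattern with a stand-in recorder so it runs without Headroom installed. ListRecorder, run_transform(), and the fallback to the original text are assumptions made for illustration. Only the keyword arguments model, operation, and error_type mirror the example above.

```python
from typing import Callable


class ListRecorder:
    """Stand-in for the metrics façade that stores failures in a list."""

    def __init__(self) -> None:
        self.failures: list[tuple[str, str, str]] = []

    def record_compression_failure(self, *, model: str, operation: str, error_type: str) -> None:
        self.failures.append((model, operation, error_type))


def run_transform(
    recorder: ListRecorder,
    model: str,
    operation: str,
    transform: Callable[[str], str],
    text: str,
) -> str:
    """Apply transform; on error, record the failure and return text unchanged."""
    try:
        return transform(text)
    except Exception as exc:
        recorder.record_compression_failure(
            model=model,
            operation=operation,
            error_type=type(exc).__name__,  # e.g. "TimeoutError"
        )
        return text


def always_times_out(text: str) -> str:
    raise TimeoutError("transform exceeded its budget")


recorder = ListRecorder()
output = run_transform(recorder, "gpt-3.5-turbo", "smart_crusher", always_times_out, "abc")
print(output, recorder.failures)
```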
These methods update the counters, histograms, and gauges defined in headroom/observability/metrics.py and export them through the configured OTLP endpoint or console.
Enabling and Configuring the Exporter
Export behavior is controlled by environment variables read during configure_otel_metrics(). The key variables include:
- HEADROOM_OTEL_METRICS_ENABLED: toggles the OTel metrics pipeline.
- HEADROOM_OTEL_METRICS_EXPORTER: selects the exporter backend, such as console or otlp.
- HEADROOM_OTEL_METRICS_ENDPOINT: sets the target OTLP endpoint URL.
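To make the three variables concrete, the sketch below parses them into a small settings object. It is a guess at reasonable behavior rather than a copy of configure_otel_metrics(). ExporterSettings, load_settings(), the set of accepted truthy strings, and the console default are all assumptions. The endpoint in the demo is only an example value.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed truthy spellings; the real parser may accept a different set.
TRUTHY = frozenset({"1", "true", "yes", "on"})


@dataclass(frozen=True)
class ExporterSettings:
    enabled: bool
    exporter: str
    endpoint: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> ExporterSettings:
    """Read the three HEADROOM_OTEL_METRICS_* variables from env (default: os.environ)."""
    env = os.environ if env is None else env
    enabled = env.get("HEADROOM_OTEL_METRICS_ENABLED", "").strip().lower() in TRUTHY
    exporter = env.get("HEADROOM_OTEL_METRICS_EXPORTER", "console").strip().lower()
    endpoint = env.get("HEADROOM_OTEL_METRICS_ENDPOINT") or None
    return ExporterSettings(enabled=enabled, exporter=exporter, endpoint=endpoint)


demo = load_settings(
    {
        "HEADROOM_OTEL_METRICS_ENABLED": "true",
        "HEADROOM_OTEL_METRICS_EXPORTER": "otlp",
        "HEADROOM_OTEL_METRICS_ENDPOINT": "http://localhost:4317",
    }
)
print(demo)
```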
When OTel is disabled, get_otel_metrics() returns a no-op instance that silently discards recordings so that production code does not require branching logic.
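The no-op pattern described here can be sketched in a few lines. NoOpMetrics below is a hypothetical illustration of the idea, not the class Headroom returns. It accepts any method whose name starts with record_ and ignores the arguments, so calling code never needs to check whether OTel is enabled.

```python
from typing import Any, Callable


class NoOpMetrics:
    """Accepts any record_* call and silently discards it."""

    def __getattr__(self, name: str) -> Callable[..., None]:
        # __getattr__ is only consulted for attributes that are not otherwise defined.
        if name.startswith("record_"):
            return lambda *args, **kwargs: None
        raise AttributeError(name)


metrics = NoOpMetrics()
# Same call shape as a real recorder, but nothing is stored or exported.
metrics.record_proxy_request(provider="openai", model="gpt-4o", input_tokens=1200)
metrics.record_compression_failure(model="gpt-4o", operation="smart_crusher", error_type="TimeoutError")
```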
Core Observability Source Files
The full monitoring surface is implemented across a small set of authoritative files:
- headroom/observability/metrics.py: Core OTel metrics definitions and HeadroomOtelMetrics helpers.
- headroom/observability/tracing.py: OTel tracing façade, including spans and Langfuse integration.
- headroom/transforms/observability.py: CompressionObserver protocol that downstream transforms call to emit per-event data consumed by the metrics façade.
- tests/test_telemetry.py: Unit tests confirming metric emission behavior for the signals above.
Summary
- Headroom exposes more than 28 OpenTelemetry counters, histograms, and observable gauges through the HeadroomOtelMetrics façade in headroom/observability/metrics.py.
- Proxy metrics cover request volume, cache hits, token savings, and end-to-end latency.
- Compression metrics capture pipeline runs, per-stage timing, transform counts, and waste-signal detection.
- Subscription gauges surface Anthropic rate-limit utilization and overage costs.
- configure_otel_metrics() and get_otel_metrics() provide a zero-friction setup that falls back to a no-op implementation when OTel is disabled.
Frequently Asked Questions
How do I enable OpenTelemetry metrics in Headroom?
Call configure_otel_metrics() at process startup and set the environment variable HEADROOM_OTEL_METRICS_ENABLED to a truthy value. You can also set HEADROOM_OTEL_METRICS_EXPORTER and HEADROOM_OTEL_METRICS_ENDPOINT to route data to an OTLP backend rather than the console.
What is the difference between headroom.proxy.tokens.saved and headroom.compression.tokens.saved?
headroom.proxy.tokens.saved measures net tokens saved at the proxy layer after all optimizations, while headroom.compression.tokens.saved isolates the reduction achieved strictly inside the compression pipeline by comparing input and output token counts. The proxy metric may include savings from caching or other upstream logic, whereas the compression metric is scoped to the pipeline alone.
Which metric tracks upstream provider latency?
The headroom.proxy.ttfb.duration histogram records time-to-first-byte from the upstream provider in seconds. For the total request lifecycle—including Headroom’s own overhead—you should also monitor headroom.proxy.request.duration.
How does Headroom report rate-limit utilization?
Headroom exposes two observable gauges, headroom.subscription.5h_utilization_pct and headroom.subscription.7d_utilization_pct, which report Anthropic window utilization as a 0-100 percentage. Complementary gauges such as headroom.subscription.5h_seconds_to_reset and headroom.subscription.7d_seconds_to_reset tell you exactly how long remains until each window refreshes.