# How Headroom Achieves Image Compression with ML Routing for Significant Token Reduction

> Discover how Headroom cuts LLM image token costs by up to 90% using ML routing and content-aware image compression.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-09

---

**Headroom reduces LLM image token costs by up to 90% through a three-stage pipeline that combines geometric tile optimization, lightweight on-device ML routing to select the optimal compression technique, and execution of OCR transcoding or intelligent resizing based on image content and user query context.**

The open-source `chopratejas/headroom` library implements **image compression with ML routing** to minimize token expenditure when sending images to providers like OpenAI, Anthropic, and Google. By analyzing both the visual content and the user's textual query, the system selects among preserving the original, OCR transcoding, low-detail resizing, and cropping, keeping the information relevant to the query while reducing payload size.

## Three-Stage Compression Pipeline

The `ImageCompressor` class in [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) orchestrates a deterministic pipeline that processes images before they reach the LLM provider.

### Stage 1: Tile-Boundary Optimization

Before invoking any machine learning models, the system runs a pure-math optimizer via `tile_optimizer.optimize_images_in_messages` (lines 336-344). This step trims unused image tiles without quality loss, immediately reducing token counts according to the OpenAI formula (85 base tokens plus 170 tokens per 512×512 tile). Savings from this stage are added to the final accounting before any ML work begins.
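To make the tile arithmetic concrete, the sketch below estimates high-detail image tokens following OpenAI's published accounting: fit the image within 2048×2048, shrink it so the shortest side is at most 768px, then charge 85 base tokens plus 170 per 512×512 tile. The function name is hypothetical and this is not Headroom's `tile_optimizer` code. It only illustrates why image dimensions near a tile boundary matter.

```python
import math


def openai_high_detail_tokens(width: int, height: int) -> int:
    """Estimate tokens for a high-detail image: 85 base + 170 per 512px tile."""
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768px (no upscaling).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512x512 tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


# A 768x1030 image spills six pixel rows into a third row of tiles.
print(openai_high_detail_tokens(768, 1030))  # 6 tiles
print(openai_high_detail_tokens(768, 1024))  # 4 tiles
```

Under this formula, an image that overshoots a tile boundary by a few pixels pays for an entire extra row or column of tiles, which is the kind of waste a geometric pass can remove before any model runs.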

### Stage 2: ML Routing Decision

If geometric optimization is insufficient, the system employs a **lazy-loaded ML router** to analyze the raw image bytes alongside the user's query. The router is implemented in either [`headroom/image/onnx_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/onnx_router.py) (ONNX runtime) or [`headroom/image/trained_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py) (PyTorch fallback). It receives the image payload and extracted query text, then returns a `Technique` enum value with a confidence score (lines 500-506). Available techniques include:

- **`preserve`** – Retain original quality when the query requires fine detail
- **`full_low`** – Reduce to low detail for general context questions
- **`crop`** – Intelligently crop to relevant regions
- **`transcode`** – Convert to text via OCR for document-style images
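The four techniques and the router's output can be modeled roughly as follows. This is a minimal sketch: the member names, the `RouteDecision` container, and the fall-back-to-`preserve` policy for low confidence are assumptions made for illustration and are not taken from the Headroom source.

```python
from dataclasses import dataclass
from enum import Enum


class Technique(Enum):
    """The four routing outcomes described above (member names assumed)."""
    PRESERVE = "preserve"
    FULL_LOW = "full_low"
    CROP = "crop"
    TRANSCODE = "transcode"


@dataclass(frozen=True)
class RouteDecision:
    """Hypothetical container for a router's answer."""
    technique: Technique
    confidence: float


def resolve(decision: RouteDecision, min_confidence: float = 0.5) -> Technique:
    """Assumed policy: keep the original image when the router is unsure."""
    if decision.confidence < min_confidence:
        return Technique.PRESERVE
    return decision.technique
```

Defaulting to `preserve` on low confidence is the conservative choice in a design like this: a wrong `transcode` or `crop` can destroy information the query needs, while a wrong `preserve` only costs tokens.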

### Stage 3: Technique Application

The `_apply_compression()` method (lines 388-406, 511-571) executes the router's decision across all detected image blocks:

- **Transcode**: Runs OCR using `_ocr_extract()` and substitutes the image with a text block prefixed by `"[OCR from image]"` if the OCR confidence exceeds the configured threshold.
- **Full low / Crop**: For OpenAI, the method switches the `detail` flag to `"low"`; for Anthropic and Google, it resizes images to maximum dimensions (512px for Anthropic, 768px for Google) using Pillow, re-encodes as JPEG, and rewrites the base-64 payload.
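The transformations in this stage can be sketched with plain dictionaries. The helper names, the default threshold of 0.8, and the newline after the OCR prefix are assumptions. Only the `"[OCR from image]"` prefix, the `detail` flag, and the 512px/768px limits come from the description above.

```python
import copy

# Maximum longest-side dimension per provider, as quoted above.
MAX_DIM = {"anthropic": 512, "google": 768}


def ocr_to_text_block(ocr_text: str, ocr_confidence: float, threshold: float = 0.8):
    """Return a replacement text block, or None to keep the original image."""
    # Only substitute when OCR confidence exceeds the threshold.
    if ocr_confidence <= threshold or not ocr_text.strip():
        return None
    return {"type": "text", "text": f"[OCR from image]\n{ocr_text}"}


def set_low_detail(block: dict) -> dict:
    """Return a copy of an OpenAI image_url block with detail set to 'low'."""
    new_block = copy.deepcopy(block)
    new_block["image_url"]["detail"] = "low"
    return new_block


def fit_within(width: int, height: int, max_dim: int) -> tuple:
    """Target size with the longest side <= max_dim, never upscaling."""
    scale = min(1.0, max_dim / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the size returned by `fit_within` would be passed to `Image.resize` before saving the result as JPEG and base-64 encoding the bytes back into the message.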

## Core Implementation Details

### Image Detection and Data Extraction

The pipeline begins with `has_images()` (lines 74-88), which walks the message list and recognizes three provider formats: OpenAI's `image_url`, Anthropic's `image`, and Google's `inlineData`. The `_extract_image_data()` method (lines 119-149) pulls the first base-64 payload while handling all three encoding schemes, and `_extract_query()` (lines 92-108) captures the latest user text to inform routing decisions with conversational context.
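Detection across the three formats might look like the sketch below. The function names are hypothetical stand-ins rather than the library's `has_images()` and `_extract_query()`, and the assumption that Google-style messages carry their blocks under `parts` is mine.

```python
def _blocks(message: dict):
    """Yield dict blocks from 'content' (OpenAI, Anthropic) or 'parts' (Google)."""
    for key in ("content", "parts"):
        value = message.get(key)
        if isinstance(value, list):
            yield from (b for b in value if isinstance(b, dict))


def is_image_block(block: dict) -> bool:
    if block.get("type") == "image_url":  # OpenAI
        return True
    if block.get("type") == "image":      # Anthropic
        return True
    return "inlineData" in block          # Google


def contains_images(messages: list) -> bool:
    return any(is_image_block(b) for m in messages for b in _blocks(m))


def latest_user_text(messages: list) -> str:
    """Return the text of the most recent user message, or '' if none."""
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            return content
        texts = [b["text"] for b in _blocks(message) if isinstance(b.get("text"), str)]
        if texts:
            return " ".join(texts)
    return ""
```

Walking the list in reverse means the router sees the question the user asked most recently, which is usually the one the image must answer.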

### Router Architecture and Lazy Loading

The ML router is resolved lazily within `compress()` (lines 470-496) to maintain lightweight performance on standard request paths. The system prioritizes the **ONNX-based `OnnxTechniqueRouter`** for fast inference, falling back to the PyTorch **`TrainedRouter`** only if the ONNX runtime is unavailable. This lazy-loading pattern ensures heavy model states are not initialized unless image compression is actually triggered.
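The lazy-loading-with-fallback pattern described here can be sketched generically. `LazyRouter` and its loader callables are hypothetical. In Headroom the two backends would correspond to `OnnxTechniqueRouter` and `TrainedRouter`, but their constructors are not shown in this article, so the sketch takes loaders as parameters.

```python
class LazyRouter:
    """Resolve a router backend on first use, trying loaders in order."""

    def __init__(self, loaders):
        # loaders: iterable of (backend_name, zero-argument callable) pairs.
        self._loaders = list(loaders)
        self._router = None
        self.backend = None

    def get(self):
        if self.backend is None:
            errors = []
            for name, loader in self._loaders:
                try:
                    self._router = loader()
                    self.backend = name
                    break
                except ImportError as exc:
                    # Backend dependency is missing; try the next one.
                    errors.append(f"{name}: {exc}")
            else:
                raise RuntimeError("no router backend available: " + "; ".join(errors))
        return self._router
```

Nothing is imported or constructed until `get()` is called, so requests without images never pay the model-loading cost, and the resolved backend is cached for later calls.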

## Token Accounting and OCR Backend

### Accurate Token Estimation

The compressor tracks savings meticulously using provider-specific calculations. `_estimate_tokens()` (lines 94-112) calculates original costs using the tile-based formula. Post-compression, `_count_result_tokens()` (lines 122-144) recounts tokens based on the applied technique: OCR-derived text is measured by character count, resized images are re-estimated using tile dimensions, and low-detail images use the fixed 85-token cost. The `CompressionResult` object (lines 46-53) stores original and compressed token counts, selected technique, and router confidence, accessible via `last_result` and `last_savings` properties.
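The accounting can be illustrated with a small result object. This sketch assumes field names and a four-characters-per-token ratio for OCR text, neither of which is confirmed by the source. Only the fixed 85-token low-detail cost comes from the description above.

```python
from dataclasses import dataclass

LOW_DETAIL_TOKENS = 85   # fixed cost of a low-detail image
CHARS_PER_TOKEN = 4      # assumed ratio for character-based text estimates


@dataclass
class CompressionSummary:
    """Hypothetical stand-in for the CompressionResult described above."""
    original_tokens: int
    compressed_tokens: int
    technique: str
    confidence: float

    @property
    def savings_percent(self) -> float:
        if self.original_tokens <= 0:
            return 0.0
        saved = self.original_tokens - self.compressed_tokens
        return 100.0 * saved / self.original_tokens


def text_tokens(text: str) -> int:
    """Estimate tokens from character count using ceiling division."""
    return -(-len(text) // CHARS_PER_TOKEN)


# A four-tile image (765 tokens) switched to low detail (85 tokens).
summary = CompressionSummary(765, LOW_DETAIL_TOKENS, "full_low", 0.9)
print(f"{summary.savings_percent:.1f}% saved")
```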

### Flexible OCR Resolution

The OCR engine resolves lazily from either `rapidocr-onnxruntime` (v1) or `rapidocr` (v3), caching the selected class and API version in the compressor instance. If neither package is installed, the system gracefully degrades to disable OCR functionality entirely (lines 48-84), ensuring the pipeline remains functional even without heavy dependencies.
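A resolution routine along these lines could probe the two packages in turn. The preference order and the injectable `import_module` parameter are assumptions for illustration. The import names `rapidocr_onnxruntime` and `rapidocr`, each exposing a `RapidOCR` class, match how those packages are normally imported.

```python
import importlib

# (import name, API version) pairs in an assumed preference order.
CANDIDATES = (("rapidocr_onnxruntime", "v1"), ("rapidocr", "v3"))


def resolve_ocr(import_module=importlib.import_module):
    """Return (engine_class, api_version), or (None, None) if OCR is unavailable."""
    for module_name, version in CANDIDATES:
        try:
            module = import_module(module_name)
        except ImportError:
            continue  # package not installed; try the next candidate
        engine_cls = getattr(module, "RapidOCR", None)
        if engine_cls is not None:
            return engine_cls, version
    return None, None


engine_cls, api_version = resolve_ocr()
ocr_enabled = engine_cls is not None  # callers skip transcoding when False
```

Returning `(None, None)` instead of raising lets the caller treat OCR as an optional capability, so the other techniques keep working without it.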

## Practical Usage Examples

```python
# Basic usage: compress a list of OpenAI-style messages
from headroom.image import ImageCompressor

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this picture show?"},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}}
        ],
    }
]

compressor = ImageCompressor()        # loads the router on first use

compressed = compressor.compress(messages, provider="openai")
print("Saved ≈", compressor.last_savings, "% tokens")
compressor.close()                     # releases heavy model state
```

```python
# Convenience helper: one-liner for any provider
from headroom.image import compress_images

# `messages` is a list of provider-formatted messages, as in the basic example
compressed = compress_images(messages, provider="anthropic")
```

```python
# Inspect the routing decision (useful for debugging or metrics)
from headroom.image import ImageCompressor

# `messages` is the same list built in the basic example
compressor = ImageCompressor()
compressor.compress(messages, provider="openai")
print("Technique chosen:", compressor.last_result.technique)
print("Router confidence:", compressor.last_result.confidence)
compressor.close()
```

## Summary

- Headroom implements **image compression with ML routing** through a three-stage pipeline: geometric optimization, ML-based technique selection, and targeted transformation execution.
- The `ImageCompressor` class in [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) handles detection, routing, and token accounting for OpenAI, Anthropic, and Google providers using lazy-loaded models.
- ML routing selects from four techniques (`preserve`, `full_low`, `crop`, `transcode`) based on visual content and user query context, with ONNX inference preferred for speed.
- Token savings are accurately calculated using provider-specific formulas, with OCR transcoding offering the highest reduction potential for text-heavy images.
- The architecture uses lazy loading for both ML routers and OCR backends to maintain performance on standard request paths while supporting graceful degradation when dependencies are absent.

## Frequently Asked Questions

### What compression techniques does Headroom's ML router support?

The router selects from four techniques defined in the `Technique` enum: `preserve` (retain original quality), `full_low` (reduce detail), `crop` (intelligent cropping), and `transcode` (OCR text extraction). The choice depends on image content complexity and the user's textual query, with confidence scores returned alongside the decision to indicate certainty levels.

### How does Headroom calculate token savings for compressed images?

The system estimates original tokens using the OpenAI formula (85 base tokens plus 170 tokens per 512×512 tile) via `_estimate_tokens()`. After compression, `_count_result_tokens()` recalculates based on the applied technique: OCR output is measured by character count, resized images use re-estimated tile counts, and low-detail images use a fixed 85-token cost. The difference is exposed through the `last_savings` property as a percentage reduction.

### Can Headroom work without installing heavy ML dependencies?

Yes. The library implements graceful degradation throughout the pipeline: if the ONNX runtime is unavailable, it falls back to PyTorch; if OCR packages (`rapidocr-onnxruntime` or `rapidocr`) are missing, OCR functionality is disabled without crashing. Both the router and OCR engine use lazy-loading patterns, ensuring the compression pipeline initializes quickly and only loads heavy models when specifically invoked.

### Which LLM providers does Headroom support for image compression?

Headroom supports OpenAI (GPT-4 Vision), Anthropic (Claude), and Google Gemini. The `has_images()` method recognizes OpenAI's `image_url` format, Anthropic's `image` objects, and Google's `inlineData` payloads. Provider-specific optimizations exist in `_apply_compression()`: OpenAI uses the native `detail` parameter, while images bound for Anthropic and Google are resized client-side with Pillow to a maximum of 512px and 768px respectively, then re-encoded as JPEG before the request is sent.