# How Headroom's Image Compression Works with the Trained ML Router: A Three-Stage Pipeline

> Discover how Headroom's image compression uses a trained ML router and a three-stage pipeline for intelligent OCR transcoding, cropping, or low-detail compression. Optimize your LLM chat messages.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-06

---

**Headroom optimizes images in LLM chat messages through a three-stage pipeline that combines tile-boundary mathematics with a trained Mini-LM and SigLIP router, which chooses among OCR transcoding, cropping, low-detail compression, and leaving the image untouched.**

The `chopratejas/headroom` repository implements an intelligent image compression system specifically designed for LLM API cost optimization. Headroom's image compression analyzes both the user's textual query and the visual content to determine the optimal compression technique, reducing token costs while preserving information critical to the conversation.

## The Three-Stage Compression Pipeline

Headroom processes every image through three distinct stages, each handled by specialized modules in the `headroom/image/` directory.

### Stage 1: Tile-Boundary Optimization

The first stage applies pure mathematics to resize images onto provider-specific tile boundaries without quality loss. This reduces token counts before any ML analysis begins.

In [`headroom/image/tile_optimizer.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/tile_optimizer.py), the functions `estimate_openai_tokens` and `estimate_anthropic_tokens` calculate provider-specific costs. OpenAI bills high-detail images per 512px tile, while Anthropic charges by area, at roughly one token per 750 pixels (tokens ≈ width × height / 750). The `optimize_images_in_messages` function then resizes images to optimal dimensions, returning immediate token savings (`tile_saved`) that require no ML inference.
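The cost models behind this stage can be approximated from the providers' published vision guidance. The sketch below is not the code in `tile_optimizer.py`: the function names `sketch_openai_tokens` and `sketch_anthropic_tokens` are hypothetical, and the constants (85 base tokens plus 170 per 512px tile for OpenAI high-detail images, about one token per 750 pixels for Anthropic) are assumptions that may differ between models.

```python
import math


def sketch_openai_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate OpenAI image tokens (assumed constants, see lead-in)."""
    if detail == "low":
        return 85  # low-detail images are billed a flat base cost
    # Step 1: fit the image inside a 2048x2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def sketch_anthropic_tokens(width: int, height: int) -> int:
    """Approximate Anthropic image tokens as area / 750, rounded up."""
    return math.ceil(width * height / 750)


if __name__ == "__main__":
    # One extra pixel past a tile boundary quadruples the tile count.
    print(sketch_openai_tokens(513, 513))  # 4 tiles
    print(sketch_openai_tokens(512, 512))  # 1 tile
    print(sketch_anthropic_tokens(1000, 1000))
```

Under this model a 513×513 image needs four tiles (765 tokens) while a 512×512 image needs one (255 tokens), which illustrates why snapping dimensions to a tile boundary can save tokens before any ML runs.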

### Stage 2: ML-Based Technique Routing

The second stage employs a **trained ML router** that analyzes both the user's query intent and image characteristics to select the optimal compression strategy.

The router implementation lives in two files:
- [`headroom/image/trained_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py) – PyTorch implementation using **Mini-LM** for query classification and **SigLIP** for image analysis
- [`headroom/image/onnx_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/onnx_router.py) – Production-ready ONNX fallback (~32MB classifier + ~95MB SigLIP) that runs on CPU without PyTorch dependencies

The router is lazily loaded via `_get_router()` in [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) only when first needed.
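Lazy loading with a backend fallback can be sketched as follows. This is an illustration of the pattern only, not the body of `_get_router()`: the helper `make_lazy_getter` and both stub loaders are hypothetical, and the ONNX-first order follows the FAQ at the end of this article.

```python
from typing import Callable, Optional, Sequence


def make_lazy_getter(loaders: Sequence[Callable[[], object]]) -> Callable[[], object]:
    """Return a getter that builds the router on first use and caches it."""
    cached: Optional[object] = None

    def get_router() -> object:
        nonlocal cached
        if cached is not None:
            return cached  # already loaded, skip all loaders
        errors = []
        for load in loaders:
            try:
                cached = load()
                return cached
            except Exception as exc:  # e.g. a missing optional dependency
                errors.append(exc)
        raise RuntimeError(f"no router backend could be loaded: {errors}")

    return get_router


# Stand-in loaders for illustration; real loaders would build model sessions.
calls = {"onnx": 0, "torch": 0}


def load_onnx_stub() -> object:
    calls["onnx"] += 1
    raise ImportError("onnxruntime not installed")


def load_torch_stub() -> object:
    calls["torch"] += 1
    return "torch-router"


get_router = make_lazy_getter([load_onnx_stub, load_torch_stub])
first = get_router()   # tries ONNX, falls back to the PyTorch stub
second = get_router()  # served from the cache, no loader runs
```

Deferring the load this way means conversations without images never pay the model start-up cost.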

### Stage 3: Technique Application

The final stage executes the chosen compression technique based on the router's `RouteDecision`. The `_apply_compression` method in [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) implements three provider formats (OpenAI, Anthropic, Google) for each technique:
- **TRANSCODE**: OCR extraction via RapidOCR (supports v1 and v3 APIs)
- **CROP / FULL_LOW**: Dimension-based resizing with JPEG compression
- **PRESERVE**: Passing the image unchanged

## How the ML Router Makes Routing Decisions

The router's decision flow combines textual intent classification with visual signal extraction.

### Query Intent Classification

The `classify_query()` method uses **Mini-LM** to predict a `Technique` enum value (`TRANSCODE`, `CROP`, `FULL_LOW`, or `PRESERVE`) along with a confidence score. The method extracts the query text by walking messages in reverse order via `_extract_query()`, concatenating multi-part blocks if necessary.
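The reverse walk can be pictured with a short sketch. It assumes OpenAI-style message dictionaries and considers only `user` messages; both choices are assumptions of this example, and `extract_query_sketch` is a hypothetical stand-in rather than the actual `_extract_query()`.

```python
from typing import Any, Dict, List


def extract_query_sketch(messages: List[Dict[str, Any]]) -> str:
    """Return the text of the most recent user message that contains text."""
    for message in reversed(messages):
        if message.get("role") != "user":
            continue
        content = message.get("content")
        if isinstance(content, str):
            if content.strip():
                return content.strip()
            continue
        if isinstance(content, list):
            # Multi-part content: keep the text blocks, skip image blocks.
            parts = [
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            ]
            text = " ".join(p.strip() for p in parts if p.strip())
            if text:
                return text
    return ""


example_messages = [
    {"role": "user", "content": "Here is an older question."},
    {"role": "assistant", "content": "An earlier answer."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
            {"type": "text", "text": "from this receipt."},
        ],
    },
]
query = extract_query_sketch(example_messages)
```

Walking backwards means the query that accompanies the newest image wins over earlier turns.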

### Image Signal Analysis

When `use_siglip` is enabled, `analyze_image()` extracts four critical signals:
- `has_text` – Presence of readable text
- `is_document` – Document-like structure
- `is_complex` – Visual complexity
- `has_small_details` – Fine detail presence

These signals adjust the confidence scores. For example, the router lowers confidence for **TRANSCODE** when SigLIP detects no text in the image, preventing wasted OCR attempts.
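A minimal sketch of that adjustment is shown below. The four field names come from the list above, while the `ImageSignals` container, the function name, and the 0.5 penalty factor are assumptions chosen for illustration; the real router may weigh signals differently.

```python
from dataclasses import dataclass


@dataclass
class ImageSignals:
    has_text: bool
    is_document: bool
    is_complex: bool
    has_small_details: bool


# Assumed penalty; the actual value used by the router is not documented here.
NO_TEXT_PENALTY = 0.5


def adjust_transcode_confidence(confidence: float, signals: ImageSignals) -> float:
    """Lower TRANSCODE confidence when the image shows no readable text."""
    if not signals.has_text:
        confidence *= NO_TEXT_PENALTY
    # Keep the result inside the valid probability range.
    return max(0.0, min(1.0, confidence))


photo = ImageSignals(has_text=False, is_document=False,
                     is_complex=True, has_small_details=False)
receipt = ImageSignals(has_text=True, is_document=True,
                       is_complex=False, has_small_details=True)

photo_conf = adjust_transcode_confidence(0.9, photo)
receipt_conf = adjust_transcode_confidence(0.9, receipt)
```

With these assumed numbers, a query that sounds like an OCR request keeps its confidence for the receipt but loses half of it for the text-free photo.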

### Final Route Decision

The `RouteDecision` object contains the chosen technique, confidence score, reasoning string, and raw image signals. This decision drives whether the system runs OCR, resizes the image, or preserves quality based on the query's needs.
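The shape of such a decision, and how it might map onto an action, can be sketched like this. The class name `RouteDecisionSketch`, the enum string values, and the `image_signals` field name are assumptions; only `technique`, `confidence`, and `reason` appear in the usage example later in this article.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Technique(Enum):
    TRANSCODE = "transcode"
    CROP = "crop"
    FULL_LOW = "full_low"
    PRESERVE = "preserve"


@dataclass
class RouteDecisionSketch:
    technique: Technique
    confidence: float
    reason: str
    image_signals: Dict[str, bool] = field(default_factory=dict)


def planned_action(decision: RouteDecisionSketch) -> str:
    """Describe what the compressor would do for a given decision."""
    if decision.technique is Technique.TRANSCODE:
        return "run OCR and replace the image with text"
    if decision.technique in (Technique.CROP, Technique.FULL_LOW):
        return "send a low-detail or resized image"
    return "pass the image through unchanged"


decision = RouteDecisionSketch(
    technique=Technique.TRANSCODE,
    confidence=0.92,
    reason="query asks for text and the image contains text",
    image_signals={"has_text": True, "is_document": True},
)
action = planned_action(decision)
```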

## Compression Techniques Explained

Each technique in [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) serves specific cost-optimization scenarios.

### Transcode

The **TRANSCODE** technique runs RapidOCR via `_ocr_extract()` to convert images containing text into text blocks. Upon successful extraction, the image is replaced with a `[OCR from image]` text block. This eliminates image token costs entirely when the user needs only the text content.
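The block replacement can be sketched for OpenAI-style content lists as follows. The function `transcode_blocks` is hypothetical and takes the OCR step as a plain callable, so it makes no claim about RapidOCR's API or about how `_ocr_extract()` is written; the exact layout of the replacement text is also an assumption.

```python
from typing import Any, Callable, Dict, List

Block = Dict[str, Any]


def transcode_blocks(content: List[Block], ocr: Callable[[str], str]) -> List[Block]:
    """Swap image blocks for text blocks when OCR returns any text."""
    result: List[Block] = []
    for block in content:
        if block.get("type") == "image_url":
            text = ocr(block["image_url"]["url"]).strip()
            if text:
                result.append({"type": "text", "text": f"[OCR from image]\n{text}"})
                continue
            # Empty OCR output: fall through and keep the original image.
        result.append(block)
    return result


def fake_ocr(url: str) -> str:
    """Stand-in for a real OCR engine, used only in this example."""
    return "TOTAL 42.00" if "receipt" in url else ""


content = [
    {"type": "text", "text": "What is the total?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,receipt"}},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,photo"}},
]
transcoded = transcode_blocks(content, fake_ocr)
```

Keeping the image when OCR yields nothing avoids silently dropping content the model may still need.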

### Crop and Full-Low

For **CROP** or **FULL_LOW** decisions:
- OpenAI implementations set `detail: "low"` in the message block
- Anthropic and Google implementations use `_resize_image()` to create low-detail JPEGs with maximum dimension constraints

This balances token savings against the need for visual understanding.
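Both branches can be sketched without any imaging library. The helpers `openai_low_detail` and `fit_within` are hypothetical, and the 512px limit in the example is an assumed value rather than the constraint `_resize_image()` actually applies; the `detail` field itself is part of OpenAI's image block format.

```python
from typing import Any, Dict, Tuple


def openai_low_detail(block: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an OpenAI image block with detail set to 'low'."""
    updated = dict(block)
    updated["image_url"] = {**block["image_url"], "detail": "low"}
    return updated


def fit_within(width: int, height: int, max_dim: int) -> Tuple[int, int]:
    """Scale dimensions down so the longest side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, never upscale
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


original = {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
low = openai_low_detail(original)

# Target size a resizer would use for a 2000x1000 image (assumed 512px cap).
target = fit_within(2000, 1000, 512)
```

A real resizer would pass the computed size to an imaging library and re-encode the result as JPEG.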

### Preserve

The **PRESERVE** technique passes images unchanged when the router determines that full detail is necessary for the query (e.g., "describe this complex diagram in detail").

## Implementation Code Examples

### Basic Usage with ImageCompressor

```python
from headroom.image import ImageCompressor

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."
                },
            },
        ],
    }
]

compressor = ImageCompressor()
compressed = compressor.compress(messages, provider="openai")
print("Saved:", compressor.last_savings, "%")
```

*Source:* [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)

### Convenience Function

```python
from headroom.image import compress_images

compressed = compress_images(messages, provider="anthropic")
```

*Source:* [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)

### Direct Router Invocation

```python
from headroom.image.trained_router import TrainedRouter, Technique

router = TrainedRouter(use_siglip=True)          # Loads Mini-LM + SigLIP

with open("my_photo.png", "rb") as f:
    img_bytes = f.read()

decision = router.classify(img_bytes, "extract the text")
print(decision.technique, decision.confidence, decision.reason)
```

*Source:* [`headroom/image/trained_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py)

## Summary

- **Headroom's image compression** uses a three-stage pipeline: tile optimization, ML routing, and technique application.
- The **trained ML router** combines Mini-LM for query intent and SigLIP for image analysis to choose among transcoding, cropping, low-detail compression, and preserving the image.
- Implementation files include [`headroom/image/compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py) for orchestration, [`headroom/image/trained_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py) for PyTorch inference, and [`headroom/image/onnx_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/onnx_router.py) for lightweight CPU-only production deployment.
- The router makes query-aware decisions that override naive compression, ensuring OCR only runs when text is detected and low-detail mode only activates when visual fidelity is unnecessary.
- Token accounting via `_count_result_tokens()` provides measurable savings percentages through `CompressionResult` objects.

## Frequently Asked Questions

### What machine learning models power Headroom's image compression router?

The router uses a **Mini-LM** sentence transformer for classifying user query intent and **SigLIP** (Sigmoid Loss for Language Image Pre-Training) for analyzing image content. These models detect whether images contain text, represent documents, or contain complex details that require high-resolution preservation.

### How does Headroom choose between PyTorch and ONNX inference?

The system attempts to load the **ONNX router** ([`headroom/image/onnx_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/onnx_router.py)) by default for production deployments, as it requires only ~127MB of model weights (the ~32MB classifier plus the ~95MB SigLIP model) and runs efficiently on CPU. If ONNX loading fails, or if the test suite has monkey-patched `_get_router()`, the system falls back to the **PyTorch router** ([`headroom/image/trained_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py)), which requires full PyTorch dependencies.

### What are the token savings from Headroom's compression techniques?

Savings vary by technique and provider. **Tile-boundary optimization** provides immediate mathematical savings by forcing images onto provider-specific grids (OpenAI's 512px tiles or Anthropic's density calculations). **Transcoding** provides maximum savings by replacing images entirely with text tokens. The `CompressionResult` object exposes `savings_percent` and original/compressed token counts for monitoring.
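As a worked example of the arithmetic behind a savings percentage, the sketch below compares token counts before and after compression. The function is hypothetical and does not reproduce `CompressionResult`; the sample figures assume an OpenAI image that costs 765 tokens at high detail and 85 tokens at low detail.

```python
def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    if original_tokens <= 0:
        return 0.0  # nothing to save, and avoids division by zero
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


# Assumed example: a four-tile image downgraded to low detail.
saved = savings_percent(765, 85)
print(f"Saved {saved:.1f}% of image tokens")
```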

### Does the ML router work offline after initial download?

Yes. Both the PyTorch and ONNX routers lazy-load model weights from HuggingFace on first invocation, then cache them locally. Once downloaded, all inference runs completely offline. For the ONNX router, [`headroom/image/onnx_runtime.py`](https://github.com/chopratejas/headroom/blob/main/headroom/image/onnx_runtime.py) handles model downloading and session creation, which lets air-gapped environments run from a previously populated cache.