How Headroom's Image Compression Works with the Trained ML Router: A Three-Stage Pipeline

June 6, 2026 · chopratejas/headroom

Headroom optimizes images in LLM chat messages through a three-stage pipeline: tile-boundary arithmetic first, then a trained router (Mini-LM plus SigLIP) that chooses among OCR transcoding, cropping, low-detail compression, and leaving the image unchanged.

The chopratejas/headroom repository implements an image compression system designed to reduce LLM API costs. It analyzes both the user's textual query and the visual content to pick a compression technique, cutting token costs while preserving the information the conversation depends on.

The Three-Stage Compression Pipeline

Headroom processes every image through three distinct stages, each handled by specialized modules in the headroom/image/ directory.

Stage 1: Tile-Boundary Optimization

The first stage applies pure mathematics to resize images onto provider-specific tile boundaries without quality loss. This reduces token counts before any ML analysis begins.

In headroom/image/tile_optimizer.py, the functions estimate_openai_tokens and estimate_anthropic_tokens calculate provider-specific costs: OpenAI bills high-detail images per 512px tile, while Anthropic charges roughly one token per 750 pixels of image area (width × height / 750). The optimize_images_in_messages function then resizes images to optimal dimensions, returning immediate token savings (tile_saved) that require no ML inference.
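To make the tile arithmetic concrete, here is a minimal sketch of both estimates. It follows the providers' publicly documented rules rather than Headroom's own code: the function names sketch_openai_tokens and sketch_anthropic_tokens are hypothetical, the 85-token base and 170 tokens per tile are the constants OpenAI publishes for GPT-4o-class models in high-detail mode (other models may differ), and the sketch assumes images are only ever downscaled.

```python
import math


def sketch_openai_tokens(width: int, height: int) -> int:
    """Estimate high-detail image tokens with OpenAI's published tile rule."""
    # Step 1: fit the image inside a 2048 x 2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: bring the shortest side down to 768 px (downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: 170 tokens per 512 px tile, plus a flat 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def sketch_anthropic_tokens(width: int, height: int) -> int:
    """Estimate tokens as image area divided by 750, rounded up."""
    return math.ceil(width * height / 750)


print(sketch_openai_tokens(513, 513))       # 765: one pixel over the boundary costs 4 tiles
print(sketch_openai_tokens(512, 512))       # 255: exactly on the boundary costs 1 tile
print(sketch_anthropic_tokens(1000, 1000))  # 1334
```

The first two calls show why snapping to a tile boundary pays off: shaving a single pixel from each side drops the estimate from four tiles to one.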

Stage 2: ML-Based Technique Routing

The second stage employs a trained ML router that analyzes both the user's query intent and image characteristics to select the optimal compression strategy.

The router implementation lives in two files:

headroom/image/trained_router.py – PyTorch implementation using Mini-LM for query classification and SigLIP for image analysis
headroom/image/onnx_router.py – Production-ready ONNX variant (~32MB classifier + ~95MB SigLIP) that runs on CPU without PyTorch dependencies

The router is lazily loaded via _get_router() in headroom/image/compressor.py only when first needed.

Stage 3: Technique Application

The final stage executes the chosen compression technique based on the router's RouteDecision. The _apply_compression method in headroom/image/compressor.py implements three provider formats (OpenAI, Anthropic, Google) for each technique:

TRANSCODE: OCR extraction via RapidOCR (supports v1 and v3 APIs)
CROP / FULL_LOW: Dimension-based resizing with JPEG compression
PRESERVE: Passing the image unchanged
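Supporting three providers mostly means wrapping the same image bytes in three different block shapes. The sketch below uses the shapes from each provider's public API; whether _apply_compression emits exactly these dictionaries is an assumption, and build_image_block is a hypothetical helper name.

```python
import base64


def build_image_block(
    provider: str, image_bytes: bytes, media_type: str = "image/jpeg"
) -> dict:
    """Wrap raw image bytes in a provider-specific content block."""
    data = base64.b64encode(image_bytes).decode("ascii")
    if provider == "openai":
        # OpenAI chat format: a data URL inside an image_url block.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{media_type};base64,{data}"},
        }
    if provider == "anthropic":
        # Anthropic Messages format: a base64 source object.
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    if provider == "google":
        # Gemini REST format: an inline_data part.
        return {"inline_data": {"mime_type": media_type, "data": data}}
    raise ValueError(f"unsupported provider: {provider!r}")


print(build_image_block("openai", b"abc")["image_url"]["url"])
# data:image/jpeg;base64,YWJj
```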

How the ML Router Makes Routing Decisions

The router's decision flow combines textual intent classification with visual signal extraction.

Query Intent Classification

The classify_query() method uses Mini-LM to predict a Technique enum value (TRANSCODE, CROP, FULL_LOW, or PRESERVE) along with a confidence score. The method extracts the query text by walking messages in reverse order via _extract_query(), concatenating multi-part blocks if necessary.
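The reverse walk can be sketched as follows. This is a hypothetical stand-in for _extract_query(), not its actual body: the extract_query name, the restriction to user-role messages, and joining text parts with a single space are assumptions.

```python
def extract_query(messages: list[dict]) -> str:
    """Return the text of the most recent user message, or "" if none exists."""
    for message in reversed(messages):
        if message.get("role") != "user":
            continue  # assumption: only user turns carry the query
        content = message.get("content")
        if isinstance(content, str):
            if content.strip():
                return content.strip()
            continue
        if isinstance(content, list):
            # Multi-part content: keep text blocks, skip image blocks.
            parts = [
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            ]
            text = " ".join(p.strip() for p in parts if p.strip())
            if text:
                return text
    return ""


example = [
    {"role": "user", "content": "Earlier question"},
    {"role": "assistant", "content": "Earlier answer"},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the text"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
            {"type": "text", "text": "from this receipt."},
        ],
    },
]
print(extract_query(example))  # Extract the text from this receipt.
```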

Image Signal Analysis

When use_siglip is enabled, analyze_image() extracts four critical signals:

has_text – Presence of readable text
is_document – Document-like structure
is_complex – Visual complexity
has_small_details – Fine detail presence

These signals adjust the confidence scores. For example, the router lowers confidence for TRANSCODE when SigLIP detects no text in the image, preventing wasted OCR attempts.
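A minimal sketch of that adjustment is shown below. Only the first rule (penalize TRANSCODE when no text is detected) comes from the description above; the 0.5 and 0.7 multipliers, the second rule, and the adjust_confidence name are illustrative assumptions rather than the router's real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSignals:
    has_text: bool
    is_document: bool
    is_complex: bool
    has_small_details: bool


def adjust_confidence(technique: str, confidence: float, signals: ImageSignals) -> float:
    """Scale the query-based confidence using the image signals."""
    if technique == "TRANSCODE" and not signals.has_text:
        # From the article: OCR is unlikely to help if no text is visible.
        return confidence * 0.5  # assumed penalty factor
    if technique == "FULL_LOW" and signals.has_small_details:
        # Illustrative extra rule: low detail may erase fine structure.
        return confidence * 0.7  # assumed penalty factor
    return confidence


photo = ImageSignals(has_text=False, is_document=False, is_complex=True, has_small_details=True)
scan = ImageSignals(has_text=True, is_document=True, is_complex=False, has_small_details=False)

print(adjust_confidence("TRANSCODE", 0.8, photo))  # 0.4
print(adjust_confidence("TRANSCODE", 0.8, scan))   # 0.8
```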

Final Route Decision

The RouteDecision object contains the chosen technique, confidence score, reasoning string, and raw image signals. This decision drives whether the system runs OCR, resizes the image, or preserves quality based on the query's needs.

Compression Techniques Explained

Each technique in headroom/image/compressor.py serves specific cost-optimization scenarios.

Transcode

The TRANSCODE technique runs RapidOCR via _ocr_extract() to convert images containing text into text blocks. Upon successful extraction, the image is replaced with a [OCR from image] text block. This eliminates image token costs entirely when the user needs only the text content.
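The replacement step can be sketched for OpenAI-style content blocks. The OCR engine is passed in as a plain callable so the example runs without RapidOCR installed; the transcode_blocks name, the exact layout of the text block, and the keep-the-image-on-empty-OCR fallback are assumptions.

```python
from typing import Callable


def transcode_blocks(content: list[dict], ocr: Callable[[dict], str]) -> list[dict]:
    """Replace each image block with OCR text when extraction succeeds."""
    result = []
    for block in content:
        if block.get("type") == "image_url":
            text = ocr(block).strip()
            if text:
                # Successful extraction: the image is dropped entirely.
                result.append({"type": "text", "text": f"[OCR from image]\n{text}"})
                continue
            # Empty OCR result: fall through and keep the original image.
        result.append(block)
    return result


def fake_ocr(block: dict) -> str:
    # Hypothetical stand-in for a real OCR call.
    return "TOTAL: $42.00"


content = [
    {"type": "text", "text": "What is the total?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
for block in transcode_blocks(content, fake_ocr):
    print(block["type"])
# text
# text
```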

Crop and Full-Low

For CROP or FULL_LOW decisions:

OpenAI implementations set detail: "low" in the message block
Anthropic and Google implementations use _resize_image() to create low-detail JPEGs with maximum dimension constraints

This balances token savings against the need for visual understanding.
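Both branches are easy to sketch without an imaging library. The first helper sets the detail field that OpenAI's API accepts on image_url blocks; the second computes the target size for a maximum-dimension constraint, which a resizer would then apply. The helper names and the 512px limit in the usage line are assumptions, not values taken from _resize_image().

```python
import copy


def set_low_detail(block: dict) -> dict:
    """Return a copy of an OpenAI image_url block with detail set to low."""
    low = copy.deepcopy(block)  # leave the caller's message untouched
    low["image_url"]["detail"] = "low"
    return low


def fit_within(width: int, height: int, max_dim: int) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, never upscale
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


block = {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}}
print(set_low_detail(block)["image_url"]["detail"])  # low
print(fit_within(2000, 1000, 512))                   # (512, 256)
```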

Preserve

The PRESERVE technique passes images unchanged when the router determines that full detail is necessary for the query (e.g., "describe this complex diagram in detail").

Implementation Code Examples

Basic Usage with ImageCompressor

from headroom.image import ImageCompressor

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."
                },
            },
        ],
    }
]

compressor = ImageCompressor()
compressed = compressor.compress(messages, provider="openai")
print("Saved:", compressor.last_savings, "%")

Source: [headroom/image/compressor.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)

Convenience Function

from headroom.image import compress_images

# `messages` is the list built in the previous example.
compressed = compress_images(messages, provider="anthropic")

Source: [headroom/image/compressor.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)

Direct Router Invocation

from headroom.image.trained_router import TrainedRouter, Technique

router = TrainedRouter(use_siglip=True)  # Loads Mini-LM + SigLIP

with open("my_photo.png", "rb") as f:
    img_bytes = f.read()

decision = router.classify(img_bytes, "extract the text")
print(decision.technique, decision.confidence, decision.reason)

Source: [headroom/image/trained_router.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py)

Summary

Headroom's image compression uses a three-stage pipeline: tile optimization, ML routing, and technique application.
The trained ML router combines Mini-LM for query intent and SigLIP for image analysis to choose among transcoding, cropping, low-detail compression, and preserving images.
Implementation files include headroom/image/compressor.py for orchestration, trained_router.py for PyTorch inference, and onnx_router.py for lightweight CPU-only production deployment.
The router makes query-aware decisions that override naive compression, so OCR runs only when text is detected and low-detail mode activates only when visual fidelity is unnecessary.
Token accounting via _count_result_tokens() provides measurable savings percentages through CompressionResult objects.

Frequently Asked Questions

What machine learning models power Headroom's image compression router?

The router uses a Mini-LM sentence transformer for classifying user query intent and SigLIP (Sigmoid Loss for Language Image Pre-Training) for analyzing image content. These models detect whether images contain text, represent documents, or contain complex details that require high-resolution preservation.

How does Headroom choose between PyTorch and ONNX inference?

The system attempts to load the ONNX router (onnx_router.py) by default for production deployments, as it requires only ~127MB of model weights and runs efficiently on CPU. If ONNX loading fails, or if the test suite has monkey-patched _get_router(), the system falls back to the PyTorch router (trained_router.py) which requires full PyTorch dependencies.

What are the token savings from Headroom's compression techniques?

Savings vary by technique and provider. Tile-boundary optimization provides immediate arithmetic savings by resizing images onto provider-specific grids (OpenAI's 512px tiles or Anthropic's per-pixel-area estimate). Transcoding provides the largest savings by replacing images entirely with text tokens. The CompressionResult object exposes savings_percent and original/compressed token counts for monitoring.
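As a rough illustration of how such a percentage is derived from the two token counts, consider the sketch below. The standalone savings_percent function is hypothetical and only mirrors the attribute name; the sample numbers are a four-tile OpenAI image (85 + 4 × 170 = 765 tokens) shrunk to a single tile (85 + 170 = 255 tokens).

```python
def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    if original_tokens <= 0:
        return 0.0  # avoid dividing by zero when there was nothing to compress
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


print(round(savings_percent(765, 255), 1))  # 66.7
```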

Does the ML router work offline after initial download?

Yes. Both the PyTorch and ONNX routers lazy-load model weights from Hugging Face on first invocation, then cache them locally. Once the weights are cached, all inference runs offline, which also makes the router usable in air-gapped environments. The ONNX path uses headroom/image/onnx_runtime.py to handle model downloading and session creation.
