How Headroom Achieves Image Compression with ML Routing for Significant Token Reduction

Headroom reduces LLM image token costs by up to 90% through a three-stage pipeline. It first applies geometric tile optimization, then uses a lightweight on-device ML router to pick a compression technique, and finally executes OCR transcoding or resizing based on the image content and the user's query.

The open-source chopratejas/headroom library implements image compression with ML routing to minimize token expenditure when sending images to providers like OpenAI, Anthropic, and Google. By analyzing both the visual content and the user's textual query, the system selects between OCR transcoding, low-detail resizing, or cropping to preserve relevant information while reducing payload size.

Three-Stage Compression Pipeline

The ImageCompressor class in headroom/image/compressor.py orchestrates a deterministic pipeline that processes images before they reach the LLM provider.

Stage 1: Tile-Boundary Optimization

Before invoking any machine learning models, the system runs a pure-math optimizer via tile_optimizer.optimize_images_in_messages (lines 336-344). This step trims unused image tiles without quality loss, immediately reducing token counts under OpenAI's tile formula (170 tokens per 512×512 tile plus an 85-token base). Savings from this stage are added to the final accounting before any ML work begins.
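As a worked example of why tile boundaries matter, the sketch below implements OpenAI's published high-detail calculation: scale the image to fit within 2048×2048, scale the shortest side down to 768px, then charge 170 tokens per 512×512 tile plus an 85-token base. The function name is illustrative and is not Headroom's own estimator, and how tile_optimizer chooses its target dimensions is an assumption not confirmed here.

```python
import math


def estimate_openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens with OpenAI's published tile formula (illustrative helper)."""
    if detail == "low":
        return 85  # low-detail images cost a flat 85 tokens
    # Step 1: scale down to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles; each costs 170 tokens on top of an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


# An image slightly wider than a tile boundary pays for an extra column of tiles.
wide = estimate_openai_image_tokens(1100, 768)     # 3 x 2 tiles
trimmed = estimate_openai_image_tokens(1024, 768)  # 2 x 2 tiles
print(wide, trimmed)
```

In this example, bringing the width from 1100px down to the 1024px boundary removes two tiles, which is the kind of saving a geometric pass can capture before any model runs.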

Stage 2: ML Routing Decision

If geometric optimization is insufficient, the system employs a lazy-loaded ML router to analyze the raw image bytes alongside the user's query. The router is implemented in either headroom/image/onnx_router.py (ONNX runtime) or headroom/image/trained_router.py (PyTorch fallback). It receives the image payload and the extracted query text, then returns a Technique enum value with a confidence score (lines 500-506). Available techniques include:

  • preserve – Retain original quality when the query requires fine detail
  • full_low – Reduce to low detail for general context questions
  • crop – Intelligently crop to relevant regions
  • transcode – Convert to text via OCR for document-style images
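The four techniques listed above could be modeled roughly as follows. This is a minimal sketch: the member values mirror the names in the list, but the RoutingDecision type and the changes_payload helper are hypothetical and only illustrate the idea of a technique paired with a confidence score.

```python
from dataclasses import dataclass
from enum import Enum


class Technique(Enum):
    """Technique names taken from the list above; the class layout is assumed."""
    PRESERVE = "preserve"
    FULL_LOW = "full_low"
    CROP = "crop"
    TRANSCODE = "transcode"


@dataclass(frozen=True)
class RoutingDecision:
    """Hypothetical container for what a router returns."""
    technique: Technique
    confidence: float  # assumed to be a score between 0.0 and 1.0


def changes_payload(decision: RoutingDecision) -> bool:
    """Every technique except PRESERVE modifies the outgoing image block."""
    return decision.technique is not Technique.PRESERVE


decision = RoutingDecision(Technique("transcode"), confidence=0.93)
print(decision.technique.value, changes_payload(decision))
```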

Stage 3: Technique Application

The _apply_compression() method (lines 388-406, 511-571) executes the router's decision across all detected image blocks:

  • Transcode: Runs OCR using _ocr_extract() and substitutes the image with a text block prefixed by "[OCR from image]" if the OCR confidence exceeds the configured threshold.
  • Full low / Crop: For OpenAI, the method switches the detail flag to "low"; for Anthropic and Google, it resizes images to maximum dimensions (512px for Anthropic, 768px for Google) using Pillow, re-encodes as JPEG, and rewrites the base64 payload.
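The two branches above can be sketched for an OpenAI-style content block as shown below. The function name, the 0.8 default threshold, the newline after the prefix, and the choice to keep the image when OCR confidence is too low are all assumptions for illustration; only the "[OCR from image]" prefix and the "low" detail flag come from the description above.

```python
import copy

OCR_PREFIX = "[OCR from image]"


def apply_to_openai_block(block, technique, ocr_text="", ocr_confidence=0.0,
                          min_ocr_confidence=0.8):
    """Return a new content block for the chosen technique (hypothetical helper)."""
    if technique == "transcode":
        # Swap the image for text only when OCR is trustworthy enough.
        if ocr_text and ocr_confidence > min_ocr_confidence:
            return {"type": "text", "text": f"{OCR_PREFIX}\n{ocr_text}"}
        return block  # assumed fallback: leave the image untouched
    if technique in ("full_low", "crop"):
        # OpenAI accepts a "detail" field on image_url blocks.
        new_block = copy.deepcopy(block)
        new_block["image_url"]["detail"] = "low"
        return new_block
    return block  # "preserve": no change


image_block = {"type": "image_url",
               "image_url": {"url": "data:image/png;base64,AAAA"}}
low = apply_to_openai_block(image_block, "full_low")
text = apply_to_openai_block(image_block, "transcode",
                             ocr_text="Invoice #123", ocr_confidence=0.95)
print(low["image_url"]["detail"], text["type"])
```

Copying the block before editing keeps the caller's original messages intact, which makes it easier to compare token counts before and after.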

Core Implementation Details

Image Detection and Data Extraction

The pipeline begins with has_images() (lines 74-88), which walks the message list and recognizes three provider formats: OpenAI's image_url, Anthropic's image, and Google's inlineData. The _extract_image_data() method (lines 119-149) pulls the first base64 payload while handling all three encoding schemes, and _extract_query() (lines 92-108) captures the latest user text so that routing decisions take the conversational context into account.
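To make the three formats concrete, here is a rough sketch of detection and payload extraction. The helper names are stand-ins rather than Headroom's methods, and the block shapes follow each provider's public message format as commonly documented (data URLs for OpenAI, a source object for Anthropic, inlineData inside parts for Google).

```python
from typing import Any, Dict, List, Optional


def block_has_image(block: Any) -> bool:
    """Recognize an image block in any of the three provider formats."""
    if not isinstance(block, dict):
        return False
    if block.get("type") in ("image_url", "image"):  # OpenAI / Anthropic
        return True
    return "inlineData" in block                      # Google


def messages_contain_images(messages: List[Dict[str, Any]]) -> bool:
    """Walk the message list and report whether any block carries an image."""
    for message in messages:
        blocks = message.get("content") or message.get("parts") or []
        if isinstance(blocks, str):
            continue  # plain-text message, nothing to inspect
        if any(block_has_image(b) for b in blocks):
            return True
    return False


def extract_base64(block: Dict[str, Any]) -> Optional[str]:
    """Return the inline base64 payload of an image block, if it has one."""
    if block.get("type") == "image_url":
        url = block["image_url"]["url"]
        if url.startswith("data:") and ";base64," in url:
            return url.split(";base64,", 1)[1]
        return None  # remote URL: no inline payload
    if block.get("type") == "image":
        return block.get("source", {}).get("data")
    if "inlineData" in block:
        return block["inlineData"].get("data")
    return None


openai_block = {"type": "image_url",
                "image_url": {"url": "data:image/png;base64,QUJD"}}
print(extract_base64(openai_block))
```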

Router Architecture and Lazy Loading

The ML router is resolved lazily within compress() (lines 470-496) to maintain lightweight performance on standard request paths. The system prioritizes the ONNX-based OnnxTechniqueRouter for fast inference, falling back to the PyTorch TrainedRouter only if the ONNX runtime is unavailable. This lazy-loading pattern ensures heavy model states are not initialized unless image compression is actually triggered.
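The lazy resolution with a fallback can be illustrated with a small standalone sketch. It does not import Headroom; LazyBackendHolder and first_available are hypothetical names, and the module names "onnxruntime" and "torch" are assumed to be the import names of the two runtimes.

```python
import importlib.util
from typing import Optional, Sequence


def first_available(module_names: Sequence[str]) -> Optional[str]:
    """Return the first top-level module that can be imported, or None."""
    for name in module_names:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


class LazyBackendHolder:
    """Resolve the preferred backend on first access, then cache the answer."""

    def __init__(self, preferred: Sequence[str] = ("onnxruntime", "torch")):
        self._preferred = tuple(preferred)
        self._backend: Optional[str] = None
        self._resolved = False  # nothing is probed at construction time

    @property
    def backend(self) -> Optional[str]:
        if not self._resolved:
            self._backend = first_available(self._preferred)
            self._resolved = True
        return self._backend


holder = LazyBackendHolder()
print("backend:", holder.backend)  # may be None if neither runtime is installed
```

Because the probe runs only when the property is first read, requests without images never pay for the lookup or for loading model weights.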

Token Accounting and OCR Backend

Accurate Token Estimation

The compressor tracks savings with provider-specific calculations. _estimate_tokens() (lines 94-112) calculates original costs using the tile-based formula. After compression, _count_result_tokens() (lines 122-144) recounts tokens based on the applied technique: OCR-derived text is measured by character count, resized images are re-estimated using tile dimensions, and low-detail images use the fixed 85-token cost. The CompressionResult object (lines 46-53) stores original and compressed token counts, selected technique, and router confidence, accessible via the last_result and last_savings properties.
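The accounting described above reduces to simple arithmetic, sketched below. The field names on CompressionResultSketch and the 4-characters-per-token ratio are assumptions for illustration; the article states only which values are stored and that OCR text is measured by character count.

```python
import math
from dataclasses import dataclass


@dataclass
class CompressionResultSketch:
    """Hypothetical stand-in for the stored result; field names are assumed."""
    original_tokens: int
    compressed_tokens: int
    technique: str
    confidence: float

    @property
    def savings_percent(self) -> float:
        """Percentage of tokens saved relative to the original estimate."""
        if self.original_tokens <= 0:
            return 0.0
        saved = self.original_tokens - self.compressed_tokens
        return 100.0 * saved / self.original_tokens


def estimate_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough character-based token estimate (the ratio is an assumption)."""
    return max(1, math.ceil(len(text) / chars_per_token))


# A 1024x1024 high-detail image (765 tokens) switched to low detail (85 tokens).
result = CompressionResultSketch(765, 85, "full_low", 0.9)
print(round(result.savings_percent, 1))
```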

Flexible OCR Resolution

The OCR engine resolves lazily from either rapidocr-onnxruntime (v1) or rapidocr (v3), caching the selected class and API version in the compressor instance. If neither package is installed, the system gracefully degrades to disable OCR functionality entirely (lines 48-84), ensuring the pipeline remains functional even without heavy dependencies.

Practical Usage Examples


# Basic usage – compress a list of OpenAI‑style messages

from headroom.image import ImageCompressor

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this picture show?"},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}}
        ],
    }
]

compressor = ImageCompressor()        # loads the router on first use

compressed = compressor.compress(messages, provider="openai")
print("Saved ≈", compressor.last_savings, "% tokens")
compressor.close()                     # releases heavy model state

# Convenience helper – one‑liner for any provider

from headroom.image import compress_images

compressed = compress_images(messages, provider="anthropic")

# Inspect the routing decision (useful for debugging or metrics)

compressor = ImageCompressor()
compressor.compress(messages, provider="openai")
print("Technique chosen:", compressor.last_result.technique)
print("Router confidence:", compressor.last_result.confidence)
compressor.close()

Summary

  • Headroom implements image compression with ML routing through a three-stage pipeline: geometric optimization, ML-based technique selection, and targeted transformation execution.
  • The ImageCompressor class in headroom/image/compressor.py handles detection, routing, and token accounting for OpenAI, Anthropic, and Google providers using lazy-loaded models.
  • ML routing selects from four techniques (preserve, full_low, crop, transcode) based on visual content and user query context, with ONNX inference preferred for speed.
  • Token savings are accurately calculated using provider-specific formulas, with OCR transcoding offering the highest reduction potential for text-heavy images.
  • The architecture uses lazy loading for both ML routers and OCR backends to maintain performance on standard request paths while supporting graceful degradation when dependencies are absent.

Frequently Asked Questions

What compression techniques does Headroom's ML router support?

The router selects from four techniques defined in the Technique enum: preserve (retain original quality), full_low (reduce detail), crop (intelligent cropping), and transcode (OCR text extraction). The choice depends on image content complexity and the user's textual query, with confidence scores returned alongside the decision to indicate certainty levels.

How does Headroom calculate token savings for compressed images?

The system estimates original tokens using the OpenAI formula (170 tokens per 512×512 tile plus an 85-token base) via _estimate_tokens(). After compression, _count_result_tokens() recalculates based on the applied technique: OCR output is measured by character count, resized images use re-estimated tile counts, and low-detail images use a fixed 85-token cost. The difference is exposed through the last_savings property as a percentage reduction.

Can Headroom work without installing heavy ML dependencies?

Yes. The library implements graceful degradation throughout the pipeline: if the ONNX runtime is unavailable, it falls back to PyTorch; if OCR packages (rapidocr-onnxruntime or rapidocr) are missing, OCR functionality is disabled without crashing. Both the router and OCR engine use lazy-loading patterns, ensuring the compression pipeline initializes quickly and only loads heavy models when specifically invoked.

Which LLM providers does Headroom support for image compression?

Headroom supports OpenAI (GPT-4 Vision), Anthropic (Claude), and Google Gemini. The has_images() method recognizes OpenAI's image_url format, Anthropic's image objects, and Google's inlineData payloads. Provider-specific optimizations exist in _apply_compression(): OpenAI uses the native detail parameter, while images bound for Anthropic and Google are resized locally with Pillow to a maximum of 512px and 768px respectively before being re-encoded as JPEG.
