How Headroom's Image Compression Works with the Trained ML Router: A Three-Stage Pipeline
Headroom optimizes images in LLM chat messages through a three-stage pipeline. It combines tile-boundary mathematics with a trained router (Mini-LM plus SigLIP) that selects among OCR transcoding, cropping, low-detail compression, and leaving the image unchanged.
The chopratejas/headroom repository implements an intelligent image compression system specifically designed for LLM API cost optimization. Headroom's image compression analyzes both the user's textual query and the visual content to determine the optimal compression technique, reducing token costs while preserving information critical to the conversation.
The Three-Stage Compression Pipeline
Headroom processes every image through three distinct stages, each handled by specialized modules in the headroom/image/ directory.
Stage 1: Tile-Boundary Optimization
The first stage applies pure mathematics to resize images onto provider-specific tile boundaries without quality loss. This reduces token counts before any ML analysis begins.
In headroom/image/tile_optimizer.py, the functions estimate_openai_tokens and estimate_anthropic_tokens calculate provider-specific costs: OpenAI bills high-detail images in 512px tiles, while Anthropic charges roughly one token per 750 pixels of image area. The optimize_images_in_messages function then resizes images to optimal dimensions and reports the resulting token savings (tile_saved), which require no ML inference.
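The arithmetic behind these estimates can be sketched from the providers' published pricing rules. The snippet below is an illustrative approximation, not Headroom's actual implementation: the function names sketch_openai_tokens and sketch_anthropic_tokens are hypothetical, and the OpenAI branch assumes the high-detail rule (fit within 2048px, shortest side scaled down to 768px, 170 tokens per 512px tile plus an 85-token base).

```python
import math


def sketch_openai_tokens(width: int, height: int) -> int:
    """Approximate high-detail image tokens using OpenAI's published tile rule."""
    # Step 1: fit the image inside a 2048 x 2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: 170 tokens per 512px tile, plus a flat 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def sketch_anthropic_tokens(width: int, height: int) -> int:
    """Approximate tokens with Anthropic's area rule: (width * height) / 750."""
    return math.ceil(width * height / 750)


# A few pixels past a tile boundary cost a whole extra row of tiles.
just_over = sketch_openai_tokens(768, 1030)    # 2 x 3 tiles
on_boundary = sketch_openai_tokens(768, 1024)  # 2 x 2 tiles
print(just_over, on_boundary)
```

Under this rule a 768x1030 image is billed as six tiles (1105 tokens), while trimming it to 768x1024 drops it to four tiles (765 tokens). That is the kind of saving a tile-boundary pass can capture before any model runs.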
Stage 2: ML-Based Technique Routing
The second stage employs a trained ML router that analyzes both the user's query intent and image characteristics to select the optimal compression strategy.
The router implementation lives in two files:
- headroom/image/trained_router.py: PyTorch implementation using Mini-LM for query classification and SigLIP for image analysis
- headroom/image/onnx_router.py: Production-ready ONNX fallback (~32MB classifier + ~95MB SigLIP) that runs on CPU without PyTorch dependencies
The router is lazily loaded via _get_router() in headroom/image/compressor.py only when first needed.
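Lazy loading of this kind is commonly built around a module-level cache. The following is a minimal sketch of the pattern only: get_router and the factory argument are hypothetical names, and the real _get_router() may be structured differently.

```python
from typing import Callable, Optional

# Module-level cache; stays None until a router is first requested.
_router: Optional[object] = None


def get_router(factory: Callable[[], object]) -> object:
    """Build the router on first call, then reuse the cached instance."""
    global _router
    if _router is None:
        # The expensive model load happens here, once.
        _router = factory()
    return _router
```

Callers that never send an image never pay the model-loading cost, which matters when the weights run to tens of megabytes.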
Stage 3: Technique Application
The final stage executes the chosen compression technique based on the router's RouteDecision. The _apply_compression method in headroom/image/compressor.py implements three provider formats (OpenAI, Anthropic, Google) for each technique:
- TRANSCODE: OCR extraction via RapidOCR (supports v1 and v3 APIs)
- CROP / FULL_LOW: Dimension-based resizing with JPEG compression
- PRESERVE: Passing the image unchanged
How the ML Router Makes Routing Decisions
The router's decision flow combines textual intent classification with visual signal extraction.
Query Intent Classification
The classify_query() method uses Mini-LM to predict a Technique enum value (TRANSCODE, CROP, FULL_LOW, or PRESERVE) along with a confidence score. The method extracts the query text by walking messages in reverse order via _extract_query(), concatenating multi-part blocks if necessary.
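The reverse walk over messages can be illustrated as follows. This is a sketch under assumptions, not the library's code: extract_query is a hypothetical stand-in for _extract_query(), it assumes OpenAI-style message dictionaries, and it assumes only user messages are considered.

```python
from typing import Any


def extract_query(messages: list[dict[str, Any]]) -> str:
    """Return the text of the most recent user message, or "" if none exists."""
    for message in reversed(messages):
        if message.get("role") != "user":
            continue  # assumption: only user turns carry the query
        content = message.get("content")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Multi-part content: keep text blocks, skip image blocks.
            parts = [
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            ]
            text = " ".join(part for part in parts if part)
            if text:
                return text
    return ""
```

Walking backwards means the classifier sees the question that accompanies the newest image rather than an earlier, unrelated turn.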
Image Signal Analysis
When use_siglip is enabled, analyze_image() extracts four critical signals:
- has_text: Presence of readable text
- is_document: Document-like structure
- is_complex: Visual complexity
- has_small_details: Fine detail presence
These signals adjust the confidence scores. For example, the router lowers confidence for TRANSCODE when SigLIP detects no text in the image, preventing wasted OCR attempts.
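The TRANSCODE adjustment described above might look like the sketch below. The ImageSignals container, the adjust_confidence function, and the 0.5 penalty factor are all assumptions made for illustration; the source does not state how large the real adjustment is.

```python
from dataclasses import dataclass


@dataclass
class ImageSignals:
    """Hypothetical container for the four SigLIP-derived signals."""
    has_text: bool
    is_document: bool
    is_complex: bool
    has_small_details: bool


def adjust_confidence(technique: str, confidence: float, signals: ImageSignals) -> float:
    """Lower TRANSCODE confidence when the image shows no readable text."""
    if technique == "TRANSCODE" and not signals.has_text:
        confidence *= 0.5  # assumed penalty factor, not taken from the source
    # Keep the score inside the valid probability range.
    return max(0.0, min(1.0, confidence))
```

A caller could then compare the adjusted score against a threshold and fall back to a safer technique when it drops too low.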
Final Route Decision
The RouteDecision object contains the chosen technique, confidence score, reasoning string, and raw image signals. This decision drives whether the system runs OCR, resizes the image, or preserves quality based on the query's needs.
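As a rough picture of that data shape, the sketch below models the decision as a dataclass. The technique, confidence, and reason field names follow the usage example later in this article; the enum string values, the image_signals field name, and the next_action helper are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Technique(Enum):
    # Member names come from the article; the string values are assumed.
    TRANSCODE = "transcode"
    CROP = "crop"
    FULL_LOW = "full_low"
    PRESERVE = "preserve"


@dataclass
class RouteDecision:
    technique: Technique
    confidence: float
    reason: str
    image_signals: dict[str, bool] = field(default_factory=dict)  # assumed name


def next_action(decision: RouteDecision) -> str:
    """Map a decision onto the three behaviours described in the text."""
    if decision.technique is Technique.TRANSCODE:
        return "run OCR"
    if decision.technique in (Technique.CROP, Technique.FULL_LOW):
        return "resize"
    return "keep original"
```

Carrying the reason string and raw signals alongside the technique makes a routing choice auditable after the fact.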
Compression Techniques Explained
Each technique in headroom/image/compressor.py serves specific cost-optimization scenarios.
Transcode
The TRANSCODE technique runs RapidOCR via _ocr_extract() to convert images containing text into text blocks. Upon successful extraction, the image is replaced with a [OCR from image] text block. This eliminates image token costs entirely when the user needs only the text content.
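The replacement step can be sketched independently of the OCR engine. In this illustration the function name replace_image_with_ocr_text is hypothetical, the OCR text is passed in as a plain string, and the exact layout of the text block (the marker followed by a newline) is an assumption.

```python
from typing import Any


def replace_image_with_ocr_text(
    content: list[dict[str, Any]], index: int, ocr_text: str
) -> list[dict[str, Any]]:
    """Swap the image block at `index` for a text block holding the OCR output."""
    cleaned = ocr_text.strip()
    if not cleaned:
        # OCR produced nothing usable: keep the image rather than lose it.
        return list(content)
    replaced = list(content)
    replaced[index] = {"type": "text", "text": f"[OCR from image]\n{cleaned}"}
    return replaced
```

Returning a copy leaves the caller's original message untouched, so a failed or empty extraction never destroys the image.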
Crop and Full-Low
For CROP or FULL_LOW decisions:
- OpenAI implementations set detail: "low" in the message block
- Anthropic and Google implementations use _resize_image() to create low-detail JPEGs with maximum dimension constraints
This balances token savings against the need for visual understanding.
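Both branches can be illustrated without an imaging library. The sketch below covers only the bookkeeping: set_openai_low_detail and fit_within are hypothetical helpers, the 512px cap in the usage line is an arbitrary example value, and the actual pixel resampling and JPEG encoding performed by _resize_image() are not shown.

```python
from typing import Any


def set_openai_low_detail(block: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an OpenAI image_url block with detail set to "low"."""
    image_url = dict(block["image_url"], detail="low")
    return {**block, "image_url": image_url}


def fit_within(width: int, height: int, max_dim: int) -> tuple[int, int]:
    """Scale dimensions so the longest side is at most max_dim, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, never upscale
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


target = fit_within(2000, 1000, 512)  # 512 is an example cap, not Headroom's value
```

The computed target size would then be handed to whatever image library performs the resize and JPEG encode.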
Preserve
The PRESERVE technique passes images unchanged when the router determines that full detail is necessary for the query (e.g., "describe this complex diagram in detail").
Implementation Code Examples
Basic Usage with ImageCompressor
from headroom.image import ImageCompressor

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."
                },
            },
        ],
    }
]

compressor = ImageCompressor()
compressed = compressor.compress(messages, provider="openai")
print("Saved:", compressor.last_savings, "%")
Source: [headroom/image/compressor.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)
Convenience Function
from headroom.image import compress_images
compressed = compress_images(messages, provider="anthropic")
Source: [headroom/image/compressor.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/compressor.py)
Direct Router Invocation
from headroom.image.trained_router import TrainedRouter, Technique

router = TrainedRouter(use_siglip=True)  # Loads Mini-LM + SigLIP

# Read the raw image bytes; the body of the with-block must be indented.
with open("my_photo.png", "rb") as f:
    img_bytes = f.read()

decision = router.classify(img_bytes, "extract the text")
print(decision.technique, decision.confidence, decision.reason)
Source: [headroom/image/trained_router.py](https://github.com/chopratejas/headroom/blob/main/headroom/image/trained_router.py)
Summary
- Headroom's image compression uses a three-stage pipeline: tile optimization, ML routing, and technique application.
- The trained ML router combines Mini-LM for query intent and SigLIP for image analysis to select between transcoding, cropping, or preserving images.
- Implementation files include headroom/image/compressor.py for orchestration, trained_router.py for PyTorch inference, and onnx_router.py for lightweight CPU-only production deployment.
- The router makes query-aware decisions that override naive compression, ensuring OCR only runs when text is detected and low-detail mode only activates when visual fidelity is unnecessary.
- Token accounting via _count_result_tokens() provides measurable savings percentages through CompressionResult objects.
Frequently Asked Questions
What machine learning models power Headroom's image compression router?
The router uses a Mini-LM sentence transformer for classifying user query intent and SigLIP (Sigmoid Loss for Language Image Pre-Training) for analyzing image content. These models detect whether images contain text, represent documents, or contain complex details that require high-resolution preservation.
How does Headroom choose between PyTorch and ONNX inference?
The system attempts to load the ONNX router (onnx_router.py) by default for production deployments, as it requires only ~127MB of model weights and runs efficiently on CPU. If ONNX loading fails, or if the test suite has monkey-patched _get_router(), the system falls back to the PyTorch router (trained_router.py) which requires full PyTorch dependencies.
What are the token savings from Headroom's compression techniques?
Savings vary by technique and provider. Tile-boundary optimization provides immediate mathematical savings by forcing images onto provider-specific grids (OpenAI's 512px tiles or Anthropic's density calculations). Transcoding provides maximum savings by replacing images entirely with text tokens. The CompressionResult object exposes savings_percent and original/compressed token counts for monitoring.
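The percentage itself is simple arithmetic over the two token counts. The helper below is an illustrative sketch rather than the library's code: the savings_percent function is hypothetical, and it treats a non-positive original count as zero savings.

```python
def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed, relative to the original count."""
    if original_tokens <= 0:
        return 0.0  # avoid dividing by zero when there was nothing to compress
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


# Example: an image billed at 1105 tokens that drops to 765 after optimization.
example = round(savings_percent(1105, 765), 1)
```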
Does the ML router work offline after initial download?
Yes. Both the PyTorch and ONNX routers lazy-load model weights from HuggingFace on first invocation, then cache them locally. Once downloaded, all inference runs completely offline. The ONNX runtime specifically uses headroom/image/onnx_runtime.py to handle model downloading and session creation for air-gapped environments.