How Headroom's ContentRouter Identifies Content Types for Optimal Compression

Headroom's ContentRouter analyzes input through a three-phase pipeline that detects mixed-content boundaries, classifies pure content via Rust-based heuristics or regex fallbacks, and maps detected types to specialized compression strategies using a static configuration dictionary.

Headroom is an open-source text compression framework that optimizes token usage by routing different content types to specialized compressors. The ContentRouter serves as the central decision engine in headroom/transforms/content_router.py, determining whether input contains heterogeneous document sections or uniform content before selecting the optimal compression strategy.

The Three-Phase Detection Pipeline

The ContentRouter implements a sequential decision process through the compress() and _determine_strategy() methods. This architecture separates mixed-content detection from pure-type classification and strategy mapping.

Phase 1: Mixed-Content Boundary Detection

The router first invokes is_mixed_content() to identify heterogeneous documents. This function scans input using four compiled regex patterns: _CODE_FENCE_PATTERN for code blocks, _JSON_BLOCK_START for JSON arrays, _SEARCH_RESULT_PATTERN for search results, and _PROSE_PATTERN for natural language text. When at least two indicators are present, the router selects CompressionStrategy.MIXED and delegates to _compress_mixed(), which splits the document using split_into_sections() and compresses each section with its own optimal strategy.
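
The two-indicator rule above can be sketched in a few lines. This is a minimal illustration, not Headroom's implementation: the four regexes below are hypothetical stand-ins for _CODE_FENCE_PATTERN, _JSON_BLOCK_START, _SEARCH_RESULT_PATTERN, and _PROSE_PATTERN, and the helper names count_indicators and looks_mixed are invented for this example.

```python
import re

# Hypothetical stand-ins for the four patterns named above; Headroom's
# actual regexes are not reproduced in this article.
CODE_FENCE = re.compile(r"^`{3}", re.MULTILINE)               # start of a fenced code block
JSON_BLOCK_START = re.compile(r"^\s*\[\s*\{", re.MULTILINE)  # start of a JSON array of objects
SEARCH_RESULT = re.compile(r"^\S+:\d+:", re.MULTILINE)       # grep-style "path:line:" prefix
PROSE = re.compile(r"^[A-Z][a-z]+(?: [A-Za-z',-]+){3,}[.!?]$", re.MULTILINE)  # a plain sentence

INDICATORS = (CODE_FENCE, JSON_BLOCK_START, SEARCH_RESULT, PROSE)


def count_indicators(text: str) -> int:
    """Count how many distinct indicators appear at least once."""
    return sum(1 for pattern in INDICATORS if pattern.search(text))


def looks_mixed(text: str) -> bool:
    """Treat content as mixed when two or more indicators are present."""
    return count_indicators(text) >= 2


fence = "`" * 3  # built programmatically to keep this example's own fence intact
doc = (
    f"# Documentation\n\n{fence}python\ndef example():\n    return True\n{fence}\n\n"
    "This function returns boolean values.\n"
)
print(looks_mixed(doc))                       # True: code fence + prose
print(looks_mixed('[{"id": 1}, {"id": 2}]'))  # False: JSON only
```

Counting distinct indicators, rather than total matches, keeps a long single-type document (for example, one with many code fences) from being misread as mixed.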

Phase 2: Content-Type Classification

For uniform content, _detect_content() invokes the Rust-based headroom._core.detect_content_type binding. The Rust detector returns structural classifications that the Python layer converts to ContentType enum values including SOURCE_CODE, JSON_ARRAY, SEARCH_RESULTS, BUILD_OUTPUT, GIT_DIFF, HTML, and PLAIN_TEXT. If the Rust layer returns plain_text, the system falls back to _regex_detect_content_type() for lightweight pattern matching against specific formats like diff output or build logs.
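
As a rough sketch of this two-tier flow, the snippet below converts a detector's string label to an enum member and only consults a secondary detector when the first answer is plain text. The enum member names come from the text above; the string values, the detect and to_content_type helpers, and both stub detectors are assumptions made for illustration and do not call the real Rust binding.

```python
from enum import Enum
from typing import Callable


class ContentType(Enum):
    # Member names come from the article; the string values are assumed.
    SOURCE_CODE = "source_code"
    JSON_ARRAY = "json_array"
    SEARCH_RESULTS = "search_results"
    BUILD_OUTPUT = "build_output"
    GIT_DIFF = "git_diff"
    HTML = "html"
    PLAIN_TEXT = "plain_text"


def to_content_type(label: str) -> ContentType:
    """Convert a detector's string label to an enum member, defaulting to PLAIN_TEXT."""
    try:
        return ContentType(label)
    except ValueError:
        return ContentType.PLAIN_TEXT


def detect(
    text: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], ContentType],
) -> ContentType:
    """Ask the primary detector first; use the fallback only for plain text."""
    detected = to_content_type(primary(text))
    if detected is ContentType.PLAIN_TEXT:
        return fallback(text)
    return detected


def stub_primary(text: str) -> str:
    """Stand-in for the native detector: recognizes only JSON arrays."""
    return "json_array" if text.lstrip().startswith("[") else "plain_text"


def stub_fallback(text: str) -> ContentType:
    """Stand-in for the regex pass: recognizes only git diffs."""
    if text.startswith("diff --git "):
        return ContentType.GIT_DIFF
    return ContentType.PLAIN_TEXT


print(detect('[{"id": 1}]', stub_primary, stub_fallback))
print(detect("diff --git a/x.py b/x.py", stub_primary, stub_fallback))
```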

Phase 3: Strategy Mapping and Configuration

The _strategy_from_detection() method translates ContentType to CompressionStrategy via a static mapping dictionary. Available strategies include CODE_AWARE, SMART_CRUSHER, SEARCH, LOG, DIFF, HTML, TEXT, and KOMPRESS. Configuration flags such as prefer_code_aware_for_code and fallback_strategy (defaulting to KOMPRESS) allow runtime customization. If the detected type lacks a specific mapping, the router uses self.config.fallback_strategy.
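
A static mapping with a configurable fallback might look like the sketch below. The enum member names and the two config fields are taken from the text above; the specific type-to-strategy pairings (other than JSON arrays going to SMART_CRUSHER, which the usage example later in this article shows), the RouterConfig class, and the behavior when prefer_code_aware_for_code is False are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ContentType(Enum):
    SOURCE_CODE = auto()
    JSON_ARRAY = auto()
    SEARCH_RESULTS = auto()
    BUILD_OUTPUT = auto()
    GIT_DIFF = auto()
    HTML = auto()
    PLAIN_TEXT = auto()


class CompressionStrategy(Enum):
    CODE_AWARE = auto()
    SMART_CRUSHER = auto()
    SEARCH = auto()
    LOG = auto()
    DIFF = auto()
    HTML = auto()
    TEXT = auto()
    KOMPRESS = auto()


@dataclass
class RouterConfig:
    """Hypothetical config holding the two options named in the text."""
    prefer_code_aware_for_code: bool = True
    fallback_strategy: CompressionStrategy = CompressionStrategy.KOMPRESS


# Assumed pairings, matched by name; the real dictionary may differ.
STRATEGY_BY_TYPE = {
    ContentType.SOURCE_CODE: CompressionStrategy.CODE_AWARE,
    ContentType.JSON_ARRAY: CompressionStrategy.SMART_CRUSHER,
    ContentType.SEARCH_RESULTS: CompressionStrategy.SEARCH,
    ContentType.BUILD_OUTPUT: CompressionStrategy.LOG,
    ContentType.GIT_DIFF: CompressionStrategy.DIFF,
    ContentType.HTML: CompressionStrategy.HTML,
    ContentType.PLAIN_TEXT: CompressionStrategy.TEXT,
}


def strategy_for(content_type: ContentType, config: RouterConfig) -> CompressionStrategy:
    """Look up the strategy for a type, using the configured fallback when unmapped."""
    if content_type is ContentType.SOURCE_CODE and not config.prefer_code_aware_for_code:
        # Assumption: opting out of code-aware compression routes code to the fallback.
        return config.fallback_strategy
    return STRATEGY_BY_TYPE.get(content_type, config.fallback_strategy)


print(strategy_for(ContentType.JSON_ARRAY, RouterConfig()))
print(strategy_for(ContentType.SOURCE_CODE, RouterConfig(prefer_code_aware_for_code=False)))
```

Keeping the mapping in a plain dictionary means adding a new content type is a one-line change, and dict.get supplies the fallback without extra branching.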

Content Detection Implementation

The detection system combines high-performance Rust analysis with Python regex fallbacks defined in headroom/transforms/content_detector.py.

Rust-Based Structural Analysis

The native detector analyzes token patterns and structural signatures to distinguish between code, JSON, HTML, and prose. This processing occurs in the _detect_content() method, which handles the conversion between Rust string outputs and Python enum members.

Regex Fallback Patterns

When the Rust detector identifies generic plaintext, the router applies regex heuristics to detect specialized formats. These patterns identify content that might benefit from strategy-specific compression despite lacking explicit structural markers detectable by the Rust layer.
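
To make the idea concrete, here is a small sketch of a regex pass that looks for diff and build-log signatures in text already labeled as plain. The patterns, the regex_detect name, the min_build_lines threshold, and the returned string labels are all hypothetical; the article does not list the patterns used by _regex_detect_content_type().

```python
import re

# Hypothetical patterns for two specialized formats.
DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/\S+", re.MULTILINE)
DIFF_HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)
BUILD_LINE = re.compile(r"^.*\b(?:error|warning|FAILED)\b", re.MULTILINE)


def regex_detect(text: str, min_build_lines: int = 2) -> str:
    """Return an assumed label: 'git_diff', 'build_output', or 'plain_text'."""
    if DIFF_HEADER.search(text) or DIFF_HUNK.search(text):
        return "git_diff"
    # Require several matching lines so one stray "warning" in prose does not count.
    if len(BUILD_LINE.findall(text)) >= min_build_lines:
        return "build_output"
    return "plain_text"


diff_text = "diff --git a/app.py b/app.py\n@@ -1,2 +1,3 @@\n+import os\n"
log_text = (
    "src/main.c:10: warning: unused variable\n"
    "src/main.c:22: error: missing semicolon\n"
    "Build FAILED\n"
)
print(regex_detect(diff_text))
print(regex_detect(log_text))
print(regex_detect("The build went fine with one warning."))
```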

Strategy Selection and Fallback Chains

The ContentRouter implements defensive fallback chains to ensure robust compression. The _apply_strategy_to_content() method lazy-loads compressor instances and builds fallback sequences: if CODE_AWARE is requested but disabled via ContentRouterConfig, the router cascades to KOMPRESS. Similarly, SMART_CRUSHER falls back to KOMPRESS when unavailable. Final strategy decisions are recorded in RouterCompressionResult objects, with optional telemetry logging via _record_to_toin().
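
The combination of lazy loading and a fallback chain can be sketched as follows. Everything here is illustrative: LazyCompressors, apply_with_fallback, the string strategy names, and the toy compressors are assumptions, not the code inside _apply_strategy_to_content().

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Compressor = Callable[[str], str]
Factory = Callable[[], Optional[Compressor]]


class LazyCompressors:
    """Build each compressor on first use and cache the result."""

    def __init__(self, factories: Dict[str, Factory]) -> None:
        self._factories = factories
        self._cache: Dict[str, Optional[Compressor]] = {}

    def get(self, name: str) -> Optional[Compressor]:
        if name not in self._cache:
            factory = self._factories.get(name)
            self._cache[name] = factory() if factory else None
        return self._cache[name]


def apply_with_fallback(
    text: str,
    chain: Iterable[str],
    compressors: LazyCompressors,
    enabled: Set[str],
) -> Tuple[str, str]:
    """Try each strategy in order; return (strategy_used, output)."""
    for name in chain:
        if name not in enabled:
            continue  # disabled by configuration, never instantiated
        compressor = compressors.get(name)
        if compressor is None:
            continue  # e.g. an optional dependency is missing
        return name, compressor(text)
    return "passthrough", text  # nothing usable: return the input unchanged


loaded: List[str] = []  # records which factories actually ran


def make_code_aware() -> Compressor:
    loaded.append("code_aware")
    return lambda text: text


def make_kompress() -> Compressor:
    loaded.append("kompress")
    return lambda text: " ".join(text.split())  # toy stand-in: collapse whitespace


compressors = LazyCompressors({"code_aware": make_code_aware, "kompress": make_kompress})
used, output = apply_with_fallback(
    "def  add(a, b):\n    return a + b",
    chain=["code_aware", "kompress"],
    compressors=compressors,
    enabled={"kompress"},
)
print(used, loaded)  # kompress ['kompress']
```

Checking the enabled set before touching the registry means a disabled compressor is never constructed, which matters when construction is expensive.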

Practical Usage Examples

Route JSON arrays automatically to the SmartCrusher compressor:

```python
from headroom.transforms import ContentRouter

router = ContentRouter()
json_payload = '[{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]'
result = router.compress(json_payload)
print(result.strategy_used)  # CompressionStrategy.SMART_CRUSHER
```

Handle mixed-content documents containing code and prose:

````python
# Reuses the router created in the previous example.
mixed_doc = """# Documentation

```python
def example():
    return True
```

This function returns boolean values.
"""
result = router.compress(mixed_doc)
print(result.strategy_used)  # CompressionStrategy.MIXED
print(result.routing_log)  # List of per-section routing decisions
````

Override default strategy selection with custom configuration:

```python
from headroom.transforms import ContentRouter, ContentRouterConfig, CompressionStrategy

config = ContentRouterConfig(
    enable_code_aware=False,
    fallback_strategy=CompressionStrategy.TEXT,
    prefer_code_aware_for_code=False
)
router = ContentRouter(config=config)
code_result = router.compress("def add(a, b): return a + b")
print(code_result.strategy_used)  # CompressionStrategy.KOMPRESS
```

Summary

  • Headroom's ContentRouter implements a three-phase pipeline in headroom/transforms/content_router.py to analyze content structure before compression.
  • Mixed-content detection uses regex patterns (_CODE_FENCE_PATTERN, _JSON_BLOCK_START, _SEARCH_RESULT_PATTERN, _PROSE_PATTERN) to identify documents containing multiple content types, triggering section-based compression via CompressionStrategy.MIXED.
  • Pure-content classification relies on Rust-based heuristics (headroom._core.detect_content_type) with Python regex fallbacks to categorize input into ContentType enum values.
  • Strategy mapping converts detected types to optimal compressors through a static dictionary, with configurable fallback chains defaulting to KOMPRESS when specific strategies are disabled or unavailable.

Frequently Asked Questions

How does ContentRouter detect mixed-content documents?

The router calls is_mixed_content() to scan for four distinct indicators using compiled regex patterns: code fences, JSON blocks, search results, and prose sections. When at least two patterns match, the content is classified as mixed and routed to _compress_mixed() for section-by-section processing with CompressionStrategy.MIXED.

What happens when the Rust content detector returns plain text?

If headroom._core.detect_content_type returns plain_text, the _detect_content() method falls back to _regex_detect_content_type() to identify specific formats like build logs or diff output that might benefit from specialized compression strategies despite lacking strong structural markers.

Can I force ContentRouter to use a specific compression strategy?

Only indirectly, based on the options covered here. You can instantiate ContentRouter with a custom ContentRouterConfig that sets fallback_strategy or disables specific compressors via flags like enable_code_aware and enable_smart_crusher. When a disabled strategy is requested, the router cascades to a fallback, which is KOMPRESS in the fallback chains described above.

Which compression strategy does Headroom use as the default fallback?

According to the source code in headroom/transforms/content_router.py, the default fallback_strategy is CompressionStrategy.KOMPRESS. This serves as the final fallback when content type detection yields no specific mapping or when requested strategies are disabled.
