# How Headroom's ContentRouter Identifies Content Types for Optimal Compression

> Discover how Headroom's ContentRouter analyzes content types for optimal compression. Learn about its three-phase pipeline including heuristics and regex for efficient data handling.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-07

---

**Headroom's ContentRouter analyzes input through a three-phase pipeline that detects mixed-content boundaries, classifies pure content via Rust-based heuristics or regex fallbacks, and maps detected types to specialized compression strategies using a static configuration dictionary.**

Headroom is an open-source text compression framework that optimizes token usage by routing different content types to specialized compressors. The **ContentRouter** serves as the central decision engine in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py), determining whether input contains heterogeneous document sections or uniform content before selecting the optimal compression strategy.

## The Three-Phase Detection Pipeline

The ContentRouter implements a sequential decision process through the `compress()` and `_determine_strategy()` methods. This architecture separates mixed-content detection from pure-type classification and strategy mapping.

### Phase 1: Mixed-Content Boundary Detection

The router first invokes `is_mixed_content()` to identify heterogeneous documents. This function scans input using four compiled regex patterns: `_CODE_FENCE_PATTERN` for code blocks, `_JSON_BLOCK_START` for JSON arrays, `_SEARCH_RESULT_PATTERN` for search results, and `_PROSE_PATTERN` for natural language text. When at least two indicators are present, the router selects `CompressionStrategy.MIXED` and delegates to `_compress_mixed()`, which splits the document using `split_into_sections()` and compresses each section with its own optimal strategy.
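The two-indicator rule can be sketched as follows. The regexes here are simplified stand-ins chosen for illustration, not the patterns Headroom actually compiles, and `count_indicators` and `looks_mixed` are hypothetical helper names.

```python
import re

# Illustrative stand-ins only; Headroom's real patterns may differ.
FENCE = "`" * 3  # built programmatically so this example nests inside Markdown
CODE_FENCE = re.compile(rf"^{FENCE}\w*\s*$", re.MULTILINE)
JSON_BLOCK_START = re.compile(r"^\s*\[\s*\{", re.MULTILINE)
SEARCH_RESULT = re.compile(r"^[^\s:]+:\d+:", re.MULTILINE)  # grep-style path:line:
PROSE = re.compile(r"^[A-Z][^\n]{40,}[.!?]\s*$", re.MULTILINE)

INDICATORS = (CODE_FENCE, JSON_BLOCK_START, SEARCH_RESULT, PROSE)


def count_indicators(text: str) -> int:
    """Count how many distinct indicator patterns appear at least once."""
    return sum(1 for pattern in INDICATORS if pattern.search(text))


def looks_mixed(text: str) -> bool:
    """Treat input as mixed when two or more indicators are present."""
    return count_indicators(text) >= 2
```

Counting distinct indicators rather than total matches means a long file of pure JSON or pure prose still registers a single indicator and stays on the uniform-content path.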

### Phase 2: Content-Type Classification

For uniform content, `_detect_content()` invokes the Rust-based `headroom._core.detect_content_type` binding. The Rust detector returns structural classifications that the Python layer converts to `ContentType` enum values including `SOURCE_CODE`, `JSON_ARRAY`, `SEARCH_RESULTS`, `BUILD_OUTPUT`, `GIT_DIFF`, `HTML`, and `PLAIN_TEXT`. If the Rust layer returns `plain_text`, the system falls back to `_regex_detect_content_type()` for lightweight pattern matching against specific formats like diff output or build logs.
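The conversion and fallback flow described above might look like the sketch below. The `ContentType` enum is a local stand-in that reuses the member names listed above with assumed lowercase values, `native_detect` is a stub standing in for the `headroom._core.detect_content_type` binding (whose exact return shape is not shown in this article), and `regex_detect` is a hypothetical fallback.

```python
from enum import Enum
from typing import Callable


class ContentType(Enum):
    """Local stand-in; the string values are assumptions."""
    SOURCE_CODE = "source_code"
    JSON_ARRAY = "json_array"
    SEARCH_RESULTS = "search_results"
    BUILD_OUTPUT = "build_output"
    GIT_DIFF = "git_diff"
    HTML = "html"
    PLAIN_TEXT = "plain_text"


def native_detect(text: str) -> str:
    """Stub for the native detector: returns a lowercase label (assumed)."""
    stripped = text.lstrip()
    if stripped.startswith("[{"):
        return "json_array"
    if stripped.lower().startswith(("<!doctype html", "<html")):
        return "html"
    return "plain_text"


def regex_detect(text: str) -> ContentType:
    """Hypothetical regex fallback; only recognizes git diffs here."""
    if text.startswith("diff --git "):
        return ContentType.GIT_DIFF
    return ContentType.PLAIN_TEXT


def detect_content(
    text: str,
    native: Callable[[str], str] = native_detect,
    fallback: Callable[[str], ContentType] = regex_detect,
) -> ContentType:
    """Convert the native label to an enum, deferring to regex on plain text."""
    try:
        detected = ContentType(native(text))
    except ValueError:
        # Unknown label from the native layer: treat it as plain text.
        detected = ContentType.PLAIN_TEXT
    if detected is ContentType.PLAIN_TEXT:
        return fallback(text)
    return detected
```

Only the `plain_text` result triggers the regex pass, so structurally obvious inputs never pay for the second stage.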

### Phase 3: Strategy Mapping and Configuration

The `_strategy_from_detection()` method translates `ContentType` to `CompressionStrategy` via a static mapping dictionary. Available strategies include `CODE_AWARE`, `SMART_CRUSHER`, `SEARCH`, `LOG`, `DIFF`, `HTML`, `TEXT`, and `KOMPRESS`. Configuration flags such as `prefer_code_aware_for_code` and `fallback_strategy` (defaulting to `KOMPRESS`) allow runtime customization. If the detected type lacks a specific mapping, the router uses `self.config.fallback_strategy`.
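A minimal sketch of such a static mapping with a configurable fallback is shown below. The enums and `RouterConfig` are local stand-ins, and `strategy_for` is a hypothetical name. Apart from `JSON_ARRAY` mapping to `SMART_CRUSHER`, the pairings are inferred from the strategy names rather than taken from the source, and the effect given to `prefer_code_aware_for_code` is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ContentType(Enum):
    SOURCE_CODE = auto()
    JSON_ARRAY = auto()
    SEARCH_RESULTS = auto()
    BUILD_OUTPUT = auto()
    GIT_DIFF = auto()
    HTML = auto()
    PLAIN_TEXT = auto()


class CompressionStrategy(Enum):
    CODE_AWARE = auto()
    SMART_CRUSHER = auto()
    SEARCH = auto()
    LOG = auto()
    DIFF = auto()
    HTML = auto()
    TEXT = auto()
    KOMPRESS = auto()


@dataclass
class RouterConfig:
    """Hypothetical config mirroring the two flags named above."""
    prefer_code_aware_for_code: bool = True
    fallback_strategy: CompressionStrategy = CompressionStrategy.KOMPRESS


# Assumed pairings. PLAIN_TEXT is left unmapped on purpose so the
# fallback path is exercised.
STRATEGY_MAP = {
    ContentType.SOURCE_CODE: CompressionStrategy.CODE_AWARE,
    ContentType.JSON_ARRAY: CompressionStrategy.SMART_CRUSHER,
    ContentType.SEARCH_RESULTS: CompressionStrategy.SEARCH,
    ContentType.BUILD_OUTPUT: CompressionStrategy.LOG,
    ContentType.GIT_DIFF: CompressionStrategy.DIFF,
    ContentType.HTML: CompressionStrategy.HTML,
}


def strategy_for(content_type: ContentType, config: RouterConfig) -> CompressionStrategy:
    """Look up the strategy, using the configured fallback when unmapped."""
    if content_type is ContentType.SOURCE_CODE and not config.prefer_code_aware_for_code:
        # Assumed behavior: opting out of code-aware routing uses the fallback.
        return config.fallback_strategy
    return STRATEGY_MAP.get(content_type, config.fallback_strategy)
```

Keeping the mapping in a plain dictionary makes the routing table easy to audit, and `dict.get` with a default gives the fallback behavior in a single lookup.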

## Content Detection Implementation

The detection system combines high-performance Rust analysis with Python regex fallbacks defined in [`headroom/transforms/content_detector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_detector.py).

### Rust-Based Structural Analysis

The native detector analyzes token patterns and structural signatures to distinguish between code, JSON, HTML, and prose. The Python `_detect_content()` method invokes the detector and converts its string output into `ContentType` enum members.

### Regex Fallback Patterns

When the Rust detector returns `plain_text`, the router applies regex heuristics to detect specialized formats such as diff output or build logs. These patterns catch content that can benefit from strategy-specific compression but lacks the structural markers the Rust layer looks for.
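As an illustration of what such a fallback could check, the sketch below recognizes git diffs and build logs. Both patterns and the `regex_label` helper are hypothetical; Headroom's actual regexes are not reproduced here.

```python
import re
from typing import Optional

# Illustrative patterns only.
GIT_DIFF = re.compile(
    r"^(?:diff --git |@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@)", re.MULTILINE
)
BUILD_LOG = re.compile(
    r"^[ \t]*\[?(?:ERROR|WARNING|WARN|INFO|FAILED)\b", re.MULTILINE
)


def regex_label(text: str) -> Optional[str]:
    """Return a specialized label, or None when the text stays plain."""
    if GIT_DIFF.search(text):
        return "git_diff"
    # Require two log-level lines so a single stray "INFO" is not enough.
    if sum(1 for _ in BUILD_LOG.finditer(text)) >= 2:
        return "build_output"
    return None
```

Requiring more than one log-style line is a deliberate choice in this sketch: it trades a little recall for fewer false positives on ordinary prose.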

## Strategy Selection and Fallback Chains

The ContentRouter implements defensive fallback chains to ensure robust compression. The `_apply_strategy_to_content()` method lazy-loads compressor instances and builds fallback sequences: if `CODE_AWARE` is requested but disabled via `ContentRouterConfig`, the router cascades to `KOMPRESS`. Similarly, `SMART_CRUSHER` falls back to `KOMPRESS` when unavailable. Final strategy decisions are recorded in `RouterCompressionResult` objects, with optional telemetry logging via `_record_to_toin()`.
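The lazy loading and cascading described above can be sketched as follows. `LazyCompressors`, `apply_with_fallback`, `FALLBACK_CHAINS`, and the `"PASSTHROUGH"` label are hypothetical names for illustration, and strategies are plain strings here rather than Headroom's enum.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Compressor = Callable[[str], str]

# Chains taken from the description above: both strategies end at KOMPRESS.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "CODE_AWARE": ["CODE_AWARE", "KOMPRESS"],
    "SMART_CRUSHER": ["SMART_CRUSHER", "KOMPRESS"],
}


class LazyCompressors:
    """Builds each compressor on first use and caches the instance."""

    def __init__(
        self,
        factories: Dict[str, Callable[[], Compressor]],
        disabled: Iterable[str] = (),
    ) -> None:
        self._factories = factories
        self._disabled = set(disabled)
        self._cache: Dict[str, Compressor] = {}

    def get(self, name: str) -> Optional[Compressor]:
        """Return the compressor, or None if it is disabled or unknown."""
        if name in self._disabled or name not in self._factories:
            return None
        if name not in self._cache:
            self._cache[name] = self._factories[name]()
        return self._cache[name]


def apply_with_fallback(
    content: str, strategy: str, compressors: LazyCompressors
) -> Tuple[str, str]:
    """Try each strategy in the chain; return (output, strategy actually used)."""
    for name in FALLBACK_CHAINS.get(strategy, [strategy]):
        compressor = compressors.get(name)
        if compressor is not None:
            return compressor(content), name
    # Nothing in the chain was available: return the input unchanged.
    return content, "PASSTHROUGH"
```

Returning the strategy that actually ran alongside the output mirrors the idea of recording the final decision in a result object, so callers can tell a requested strategy apart from the one that was applied.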

## Practical Usage Examples

Route JSON arrays automatically to the SmartCrusher compressor:

```python
from headroom.transforms import ContentRouter

router = ContentRouter()
json_payload = '[{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]'
result = router.compress(json_payload)
print(result.strategy_used)  # CompressionStrategy.SMART_CRUSHER
```

Handle mixed-content documents containing code and prose:

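```python
from headroom.transforms import ContentRouter

router = ContentRouter()

# Build the fence marker programmatically so this example nests cleanly in Markdown.
fence = "`" * 3
mixed_doc = f"""# Documentation

{fence}python
def example():
    return True
{fence}

This function returns boolean values.
"""
result = router.compress(mixed_doc)
print(result.strategy_used)  # CompressionStrategy.MIXED
print(result.routing_log)    # List of per-section routing decisions
```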

Override default strategy selection with custom configuration:

```python
from headroom.transforms import ContentRouter, ContentRouterConfig, CompressionStrategy

config = ContentRouterConfig(
    enable_code_aware=False,
    fallback_strategy=CompressionStrategy.TEXT,
    prefer_code_aware_for_code=False
)
router = ContentRouter(config=config)
code_result = router.compress("def add(a, b): return a + b")
# CODE_AWARE is disabled, so the router cascades along the fallback chain described above.
print(code_result.strategy_used)  # CompressionStrategy.KOMPRESS
```

## Summary

- **Headroom's ContentRouter** implements a three-phase pipeline in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py) to analyze content structure before compression.
- **Mixed-content detection** uses regex patterns (`_CODE_FENCE_PATTERN`, `_JSON_BLOCK_START`, `_SEARCH_RESULT_PATTERN`, `_PROSE_PATTERN`) to identify documents containing multiple content types, triggering section-based compression via `CompressionStrategy.MIXED`.
- **Pure-content classification** relies on Rust-based heuristics (`headroom._core.detect_content_type`) with Python regex fallbacks to categorize input into `ContentType` enum values.
- **Strategy mapping** converts detected types to optimal compressors through a static dictionary, with configurable fallback chains defaulting to `KOMPRESS` when specific strategies are disabled or unavailable.

## Frequently Asked Questions

### How does ContentRouter detect mixed-content documents?

The router calls `is_mixed_content()` to scan for four distinct indicators using compiled regex patterns: code fences, JSON blocks, search results, and prose sections. When at least two patterns match, the content is classified as mixed and routed to `_compress_mixed()` for section-by-section processing with `CompressionStrategy.MIXED`.

### What happens when the Rust content detector returns plain text?

If `headroom._core.detect_content_type` returns `plain_text`, the `_detect_content()` method falls back to `_regex_detect_content_type()` to identify specific formats like build logs or diff output that might benefit from specialized compression strategies despite lacking strong structural markers.

### Can I force ContentRouter to use a specific compression strategy?

Yes. You can steer selection by instantiating `ContentRouter` with a custom `ContentRouterConfig` that sets `fallback_strategy` or disables specific compressors via flags like `enable_code_aware` and `enable_smart_crusher`. When a disabled strategy is requested, the router cascades along its fallback chain (to `KOMPRESS` for `CODE_AWARE` and `SMART_CRUSHER`), while `fallback_strategy` applies when the detected type has no specific mapping.

### Which compression strategy does Headroom use as the default fallback?

According to the source code in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py), the default `fallback_strategy` is `CompressionStrategy.KOMPRESS`. This serves as the final fallback when content type detection yields no specific mapping or when requested strategies are disabled.