How Headroom's ContentRouter Selects the Optimal Compression Strategy for Different Content Types
Headroom's ContentRouter determines the optimal compression strategy through a three-phase pipeline: detecting mixed-content documents via regex patterns, classifying pure content using a Rust detector with Python fallback, and mapping the resulting ContentType to a specific CompressionStrategy while respecting configuration overrides.
The ContentRouter in the chopratejas/headroom repository acts as an intelligent traffic controller for text compression, automatically routing source code, JSON arrays, search results, and mixed documents to specialized compressors. Understanding how it selects between strategies like CODE_AWARE, SMART_CRUSHER, and KOMPRESS is essential for optimizing token reduction across diverse content types.
The Three-Phase Strategy Selection Pipeline
The strategy selection logic is implemented in headroom/transforms/content_router.py and operates sequentially until a definitive strategy is assigned.
Phase 1: Mixed-Content Detection via Regex Analysis
Before invoking heavy classification logic, the router checks if the input contains heterogeneous content types that would benefit from section-specific compression. The is_mixed_content() function (located at lines 27‑41 in content_router.py) compiles four detection patterns:
- _CODE_FENCE_PATTERN: identifies fenced code blocks (e.g., ```python)
- _JSON_BLOCK_START: detects JSON object/array beginnings
- _SEARCH_RESULT_PATTERN: recognizes structured search result formats
- _PROSE_PATTERN: flags natural language text
If at least two of these indicators return true, the content is classified as mixed. The router immediately returns CompressionStrategy.MIXED and delegates to _compress_mixed(), which splits the document via split_into_sections() and processes each section with its own optimal strategy. This prevents code-aware compressors from mangling surrounding prose and vice versa.
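The "at least two indicators" rule can be sketched as follows. This is a minimal illustration, not the repository's code: the four regular expressions are assumptions chosen for readability, and the name is_mixed_content_sketch is hypothetical.

```python
import re

# Illustrative patterns only; the real ones in content_router.py may differ.
CODE_FENCE = re.compile(r"^`{3}\w*\s*$", re.MULTILINE)
JSON_BLOCK_START = re.compile(r"^\s*[\[{]", re.MULTILINE)
SEARCH_RESULT = re.compile(r"^\S+:\d+:", re.MULTILINE)
PROSE = re.compile(r"^[A-Z][^\n]{40,}[.!?]\s*$", re.MULTILINE)


def is_mixed_content_sketch(text: str) -> bool:
    """Return True when at least two content indicators match (hypothetical helper)."""
    indicators = [
        bool(CODE_FENCE.search(text)),
        bool(JSON_BLOCK_START.search(text)),
        bool(SEARCH_RESULT.search(text)),
        bool(PROSE.search(text)),
    ]
    return sum(indicators) >= 2


FENCE = "`" * 3  # built programmatically to avoid nesting fences in this example
mixed_doc = (
    "This introduction is long enough to be counted as prose by the sketch.\n"
    f"{FENCE}python\nx = 1\n{FENCE}\n"
)
print(is_mixed_content_sketch(mixed_doc))      # prose + code fence
print(is_mixed_content_sketch('[{"id": 1}]'))  # JSON only
```

Counting boolean indicators keeps the check cheap, which fits its role as a gate in front of the heavier classification step.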
Phase 2: Content-Type Classification via Rust Detector
For non-mixed (pure) content, _determine_strategy() calls _detect_content() (lines 10‑27 in content_router.py). This function attempts classification through two layers:
- Primary Rust Binding: Invokes headroom._core.detect_content_type, a Rust-based detector that returns a lowercase content tag (e.g., "source_code", "json_array"). The Python layer converts this string into the corresponding ContentType enum member.
- Regex Fallback: If the Rust layer returns plain_text, the system invokes _regex_detect_content_type() (defined in headroom/transforms/content_detector.py) to perform lightweight pattern matching for edge cases not covered by the native detector.
The resulting ContentType enum value—such as ContentType.SOURCE_CODE, ContentType.JSON_ARRAY, or ContentType.SEARCH_RESULTS—is then passed to the mapping layer.
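The two-layer detection described above might look roughly like this. The sketch does not import Headroom: the enum is a reduced stand-in with assumed string values, detect_with_fallback is a hypothetical name, and both detectors are passed in as plain callables so the control flow can be run in isolation.

```python
from enum import Enum
from typing import Callable


class ContentType(Enum):
    # Subset of the members named in this article; the values are assumed tags.
    SOURCE_CODE = "source_code"
    JSON_ARRAY = "json_array"
    PLAIN_TEXT = "plain_text"


def detect_with_fallback(
    text: str,
    native_detect: Callable[[str], str],
    regex_detect: Callable[[str], ContentType],
) -> ContentType:
    """Try the native detector first; consult the regex layer on plain_text."""
    try:
        detected = ContentType(native_detect(text))
    except (ValueError, ImportError, RuntimeError):
        # Assumption: an unknown tag or unavailable binding counts as plain text.
        detected = ContentType.PLAIN_TEXT
    if detected is ContentType.PLAIN_TEXT:
        return regex_detect(text)
    return detected


# Stand-ins for the Rust binding and the regex fallback (both hypothetical).
def fake_native(text: str) -> str:
    return "json_array" if text.lstrip().startswith("[") else "plain_text"


def fake_regex(text: str) -> ContentType:
    return ContentType.SOURCE_CODE if "def " in text else ContentType.PLAIN_TEXT


print(detect_with_fallback("[1, 2]", fake_native, fake_regex))
print(detect_with_fallback("def f(): pass", fake_native, fake_regex))
```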
Phase 3: Strategy Mapping and Configuration Overrides
The _strategy_from_detection() method (lines 27‑36 in content_router.py) implements a static dictionary mapping ContentType to CompressionStrategy:
```python
mapping = {
    ContentType.SOURCE_CODE: CompressionStrategy.CODE_AWARE,
    ContentType.JSON_ARRAY: CompressionStrategy.SMART_CRUSHER,
    ContentType.SEARCH_RESULTS: CompressionStrategy.SEARCH,
    ContentType.BUILD_OUTPUT: CompressionStrategy.LOG,
    ContentType.GIT_DIFF: CompressionStrategy.DIFF,
    ContentType.HTML: CompressionStrategy.HTML,
    ContentType.PLAIN_TEXT: CompressionStrategy.TEXT,
}
```
If the detected type exists in this mapping, the router returns the associated strategy. If absent, it falls back to self.config.fallback_strategy (defaulting to CompressionStrategy.KOMPRESS). Configuration flags can also override the mapping: when prefer_code_aware_for_code is disabled, source code is routed to the fallback strategy (KOMPRESS by default) instead of the AST-aware compressor.
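The lookup-with-fallback behaviour can be tried in isolation with stand-in enums. In this sketch the enums are reduced copies of the names used in the article, ContentType.UNKNOWN is a hypothetical member standing in for any unmapped type, and strategy_for is an assumed name rather than the router's method.

```python
from enum import Enum, auto


class ContentType(Enum):
    SOURCE_CODE = auto()
    JSON_ARRAY = auto()
    PLAIN_TEXT = auto()
    UNKNOWN = auto()  # hypothetical: stands in for a type missing from the mapping


class CompressionStrategy(Enum):
    CODE_AWARE = auto()
    SMART_CRUSHER = auto()
    TEXT = auto()
    KOMPRESS = auto()


MAPPING = {
    ContentType.SOURCE_CODE: CompressionStrategy.CODE_AWARE,
    ContentType.JSON_ARRAY: CompressionStrategy.SMART_CRUSHER,
    ContentType.PLAIN_TEXT: CompressionStrategy.TEXT,
}


def strategy_for(
    content_type: ContentType,
    fallback: CompressionStrategy = CompressionStrategy.KOMPRESS,
) -> CompressionStrategy:
    """Return the mapped strategy, or the fallback when the type is unmapped."""
    return MAPPING.get(content_type, fallback)


print(strategy_for(ContentType.JSON_ARRAY))
print(strategy_for(ContentType.UNKNOWN))
```

Using dict.get with a default keeps the fallback in one place, so adding a new content type only requires a new mapping entry.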
How Configuration Influences Strategy Selection
The ContentRouterConfig dataclass provides granular control over the selection pipeline:
- enable_code_aware: When set to False, disables the CODE_AWARE compressor entirely, forcing the fallback chain CODE_AWARE → KOMPRESS.
- fallback_strategy: Defines the compressor used when content type detection is ambiguous or when a specific compressor is disabled.
- prefer_code_aware_for_code: If True (default), source code routes to CODE_AWARE; if False, it uses the global fallback.
These overrides are evaluated in _strategy_from_detection() after the initial mapping lookup, ensuring user preferences take precedence over automatic detection.
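One way to picture overrides applied after the mapping lookup is the sketch below. The field names come from the article, but RouterConfigSketch and apply_overrides are hypothetical, and the default of enable_code_aware=True is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CompressionStrategy(Enum):
    CODE_AWARE = auto()
    SMART_CRUSHER = auto()
    TEXT = auto()
    KOMPRESS = auto()


@dataclass
class RouterConfigSketch:
    """Hypothetical stand-in for ContentRouterConfig with three of its flags."""

    enable_code_aware: bool = True  # assumed default
    prefer_code_aware_for_code: bool = True
    fallback_strategy: CompressionStrategy = CompressionStrategy.KOMPRESS


def apply_overrides(
    mapped: CompressionStrategy, config: RouterConfigSketch
) -> CompressionStrategy:
    """Replace CODE_AWARE with the fallback when either flag turns it off."""
    code_aware_allowed = (
        config.enable_code_aware and config.prefer_code_aware_for_code
    )
    if mapped is CompressionStrategy.CODE_AWARE and not code_aware_allowed:
        return config.fallback_strategy
    return mapped


disabled = RouterConfigSketch(enable_code_aware=False)
print(apply_overrides(CompressionStrategy.CODE_AWARE, disabled))
print(apply_overrides(CompressionStrategy.SMART_CRUSHER, disabled))
```

Because the override only touches CODE_AWARE, strategies chosen for other content types pass through unchanged.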
Practical Code Examples
Routing JSON Arrays to SmartCrusher
```python
from headroom.transforms import ContentRouter

router = ContentRouter()
json_payload = '[{"id":1,"msg":"hello"},{"id":2,"msg":"world"}]'
result = router.compress(json_payload)
print(result.strategy_used)  # ➜ CompressionStrategy.SMART_CRUSHER
print(result.compressed)  # Minified JSON output
```
When _detect_content() identifies ContentType.JSON_ARRAY, the router automatically selects SMART_CRUSHER, which applies structural token reduction optimized for JSON.
Handling Mixed Markdown Documents
````python
# Reuses the router created in the previous example.
readme = """
# API Documentation
```python
def authenticate(token):
    return verify(token)
```
Configure the endpoint using the settings above.
"""

result = router.compress(readme)
print(result.strategy_used)  # ➜ CompressionStrategy.MIXED
print(len(result.routing_log))  # Multiple RoutingDecision entries
````
The is_mixed_content() function detects both prose (# API Documentation) and code fences, triggering the MIXED strategy. The router splits the document and compresses the Python block with CODE_AWARE (or KOMPRESS if disabled) and the markdown with TEXT.
Disabling Code-Aware Compression via Configuration
```python
from headroom.transforms import ContentRouter, ContentRouterConfig, CompressionStrategy

config = ContentRouterConfig(
    enable_code_aware=False,
    fallback_strategy=CompressionStrategy.KOMPRESS,
)
router = ContentRouter(config=config)

code = "def add(a, b):\n    return a + b"
result = router.compress(code)
print(result.strategy_used)  # ➜ CompressionStrategy.KOMPRESS
```
Even though the Rust detector correctly identifies ContentType.SOURCE_CODE, the configuration override forces the router to use KOMPRESS instead of the AST-aware compressor.
Key Source Files and Architecture
| File | Responsibility |
|---|---|
| headroom/transforms/content_router.py | Contains the ContentRouter class, is_mixed_content(), _detect_content(), _determine_strategy(), and _strategy_from_detection(). |
| headroom/transforms/content_detector.py | Defines the ContentType enum and the _regex_detect_content_type() fallback logic. |
| headroom/transforms/base.py | Base Transform class inherited by ContentRouter for pipeline integration. |
| headroom/compression/strategies/ | Concrete implementations (e.g., code_aware.py, smart_crusher.py) referenced by the strategy enum. |
The _apply_strategy_to_content() method handles lazy loading of compressor instances and builds fallback chains (e.g., attempting CODE_AWARE before falling back to KOMPRESS if the former raises an exception).
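Lazy loading combined with a fallback chain can be sketched as below. None of these names are taken from the repository: LazyCompressors, compress_with_chain, and the two stub compressors are hypothetical, and compressors are modelled as plain functions from string to string.

```python
from typing import Callable, Dict, List, Optional, Tuple

Compressor = Callable[[str], str]


class LazyCompressors:
    """Create each compressor on first use and cache it afterwards."""

    def __init__(self, factories: Dict[str, Callable[[], Compressor]]) -> None:
        self._factories = factories
        self._instances: Dict[str, Compressor] = {}

    def get(self, name: str) -> Compressor:
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


def compress_with_chain(
    text: str, chain: List[str], registry: LazyCompressors
) -> Tuple[str, str]:
    """Try each strategy in order; return (strategy name, output) of the first success."""
    last_error: Optional[Exception] = None
    for name in chain:
        try:
            return name, registry.get(name)(text)
        except Exception as exc:  # any compressor failure moves to the next link
            last_error = exc
    raise RuntimeError("all compressors in the chain failed") from last_error


def failing_code_aware(text: str) -> str:
    raise ValueError("parse error")  # simulates an AST parse failure


def kompress_stub(text: str) -> str:
    return " ".join(text.split())  # placeholder: collapses whitespace only


registry = LazyCompressors(
    {"CODE_AWARE": lambda: failing_code_aware, "KOMPRESS": lambda: kompress_stub}
)
used, output = compress_with_chain(
    "def  add(a, b):\n    return a + b", ["CODE_AWARE", "KOMPRESS"], registry
)
print(used, "|", output)
```

Returning the strategy name alongside the output makes it easy to record which link of the chain actually produced the result.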
Summary
- Mixed-content detection uses regex heuristics to identify documents containing multiple content types, routing them to the MIXED strategy for section-aware processing.
- Content classification relies on a Rust-based detector (headroom._core.detect_content_type) with a Python regex fallback to assign ContentType labels.
- Strategy mapping translates content types to compressors via a static dictionary, with ContentRouterConfig options enabling user overrides for disabled features or preferred fallbacks.
- Graceful degradation ensures that if a specific compressor is unavailable or fails, the system falls back to KOMPRESS or the user-defined fallback_strategy.
Frequently Asked Questions
What happens if the Rust content detector cannot identify the file type?
If the Rust detect_content_type function returns "plain_text" or an unrecognized tag, the router invokes _regex_detect_content_type() as a secondary check. If this also fails to identify a specific type, the router uses the fallback_strategy configured in ContentRouterConfig (defaulting to CompressionStrategy.KOMPRESS).
How does Headroom handle documents that contain both code and natural language?
Documents triggering multiple detection patterns (e.g., code fences alongside prose paragraphs) are flagged as mixed by is_mixed_content(). The router selects CompressionStrategy.MIXED, splits the document into isolated sections using split_into_sections(), and recursively applies the optimal strategy to each section independently before reassembling the output.
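As a rough illustration of section splitting, the sketch below separates fenced code blocks from the surrounding prose. It is an assumption about the general approach, not the behaviour of split_into_sections(): the function name, the regular expression, and the ("prose" | "code", text) tuple format are all hypothetical.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # built programmatically to avoid nesting fences in this example

# Matches an opening fence with optional language tag, the body, and a closing fence.
BLOCK = re.compile(r"^`{3}(\w*)[ \t]*\n(.*?)^`{3}[ \t]*$", re.MULTILINE | re.DOTALL)


def split_sections_sketch(text: str) -> List[Tuple[str, str]]:
    """Split text into ("prose", ...) and ("code", ...) sections in document order."""
    sections: List[Tuple[str, str]] = []
    cursor = 0
    for match in BLOCK.finditer(text):
        prose = text[cursor:match.start()].strip()
        if prose:
            sections.append(("prose", prose))
        sections.append(("code", match.group(2)))
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        sections.append(("prose", tail))
    return sections


doc = (
    "# API Documentation\n"
    f"{FENCE}python\ndef authenticate(token):\n    return verify(token)\n{FENCE}\n"
    "Configure the endpoint using the settings above.\n"
)
for kind, body in split_sections_sketch(doc):
    print(kind, "->", body.splitlines()[0])
```

Each section could then be routed independently, so the code body reaches a code-oriented compressor while the prose goes to a text-oriented one.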
Can I force the ContentRouter to always use a specific compression strategy?
While the router is designed for automatic selection, you can effectively force a specific strategy by setting the fallback_strategy in your configuration and disabling all specialized compressors (e.g., enable_code_aware=False, enable_smart_crusher=False). However, for production use, it is recommended to let the router select strategies while tuning via prefer_code_aware_for_code and similar flags.
What compression strategy does Headroom use for unknown content types?
For content types not present in the static mapping (such as binary data or unrecognized markup), the router defaults to the fallback_strategy specified in the configuration. By default, this is CompressionStrategy.KOMPRESS, a general-purpose compressor designed to handle arbitrary text safely when specialized strategies are unavailable.