# How Headroom Handles Mixed Content with Code, JSON, and Prose: A Technical Deep Dive

Headroom processes mixed content containing code, JSON, and prose by splitting the input into typed sections, protecting code structures with Rust-backed tag protection, applying selective compression to each content type, and reassembling the parts in their original order.

The chopratejas/headroom repository provides an open-source text compression pipeline designed specifically for LLM workflows. When your input contains a blend of code snippets, JSON arrays, and natural language prose, Headroom employs a section-based architecture that treats each content type according to its specific preservation requirements rather than applying uniform compression across the entire document.

## Section Detection and ContentType Classification

Headroom begins processing by running the `content_router` transform, which walks the raw text line by line to build a list of `Section` objects. Each section is assigned one of three `ContentType` enum values: `CODE`, `JSON_ARRAY`, or `PROSE`.

### JSON Boundary Detection with Bracket Balancing

For JSON detection, Headroom uses the `_extract_json_block` function, which balances brackets while specifically ignoring brackets that appear inside quoted strings. This prevents malformed extraction when JSON contains string values with nested brackets. The logic is validated in `tests/test_transforms_content_router.py` within the `test_mixed_content_section_splitting_and_json_extraction` test case, ensuring that JSON arrays embedded inside longer prose blocks are isolated without corruption.
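The bracket-balancing idea can be illustrated with a short standalone sketch. This is not Headroom's `_extract_json_block`; the function name `extract_balanced_block` and its signature are hypothetical, and the code only shows the general technique of tracking string state so that brackets inside quoted values are skipped.

```python
import json
from typing import Optional


def extract_balanced_block(text: str, start: int) -> Optional[str]:
    """Return the balanced [...] or {...} block starting at text[start], or None.

    Hypothetical helper for illustration; not part of the Headroom API.
    """
    pairs = {"[": "]", "{": "}"}
    if start >= len(text) or text[start] not in pairs:
        return None

    stack = []         # expected closing brackets, innermost last
    in_string = False  # True while inside a double-quoted JSON string
    escaped = False    # True right after a backslash inside a string

    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string, brackets are data, not structure.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in pairs:
            stack.append(pairs[ch])
        elif ch in "]}":
            if not stack or stack.pop() != ch:
                return None  # mismatched closer: not a well-formed block
            if not stack:
                return text[start : i + 1]
    return None  # text ended before the block closed


sample = 'Results: [{"id": 1, "note": "see [draft] }"}] and more prose.'
block = extract_balanced_block(sample, sample.index("["))
parsed = json.loads(block)
```

In this sample the string value contains both `]` and `}`, yet the extracted block still ends at the real closing bracket and parses as valid JSON.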


```python
# Conceptual flow based on the headroom/_core Rust implementation
# (illustrative only: Section and ContentType are not imported here)

sections = [
    Section(type=ContentType.PROSE, content="Here is background..."),
    Section(type=ContentType.JSON_ARRAY, content='[{"id": 1, "value": "foo"}]'),
    Section(type=ContentType.CODE, content="<my-tag>protected</my-tag>"),
]
```

## Protecting Code Structures with Tag Protection

Before any compression runs, Headroom protects custom XML tags that must remain intact. The `headroom/transforms/tag_protector.py` module implements a Rust-backed walker called `protect_tags` that replaces each tag (e.g., `<my-tag>…</my-tag>`) with a unique placeholder and records the mapping. This prevents the compressor from accidentally stripping or merging critical tags during aggressive token reduction.

The tag protector integrates with `headroom/_core`, the Rust crate containing the core detection and compression algorithms, ensuring high-performance processing even with large mixed-content inputs.
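The placeholder approach can be sketched in a few lines of plain Python. This is a simplified illustration, not the Rust-backed `protect_tags` walker: the `protect` and `restore` helpers, the regular expression, and the `__PROTECTED_n__` placeholder format are all assumptions made for the example, and the pattern ignores attributes and same-name nesting.

```python
import re
from typing import Dict, Tuple

# Matches a paired tag such as <my-tag>...</my-tag>.
# Simplification: no attributes, no nesting of the same tag name.
TAG_PATTERN = re.compile(r"<([A-Za-z][\w-]*)>.*?</\1>", re.DOTALL)


def protect(text: str) -> Tuple[str, Dict[str, str]]:
    """Swap each tagged span for a placeholder and record the mapping."""
    mapping: Dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        placeholder = f"__PROTECTED_{len(mapping)}__"  # hypothetical format
        mapping[placeholder] = match.group(0)
        return placeholder

    return TAG_PATTERN.sub(_swap, text), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original tagged spans back in place of their placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


source = "Keep <my-tag>do  not  touch</my-tag> while  squeezing   spaces."
masked, mapping = protect(source)
squeezed = re.sub(r"\s+", " ", masked)  # stand-in for a real compressor
restored = restore(squeezed, mapping)
```

Here the whitespace-collapsing step plays the role of the compressor: it rewrites the surrounding prose, but the double spaces inside `<my-tag>` survive because the compressor only ever sees the placeholder.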

## Marker-Based Transformation Tracking

Headroom inserts its own markers into the text so that content can be reconstructed faithfully. Utility functions in `headroom/utils.py`, including `create_marker` and `create_tool_digest_marker`, generate these markers (such as `<headroom:tool_digest …>`), which track transformations and support content restoration after compression completes.

## Selective Compression Strategies

Each section type receives appropriate compression treatment:

- JSON sections are passed to the **SmartCrusher** transform, which applies aggressive token-level compression while strictly preserving JSON structure and schema integrity.
- Code sections remain untouched or receive only light compression, depending on the `compress_tagged_content` configuration flag.
- Prose sections undergo standard text compression techniques such as summarization and token-budget reduction.

This selective approach ensures that structured data retains its machine-parseable format while natural language receives the heavy compression typically required for LLM context windows.
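A rough sketch of per-type dispatch and in-order reassembly is shown below. It does not use Headroom's classes: `ContentType` and `Section` are redefined locally for illustration, and the three handlers are deliberately trivial stand-ins (JSON whitespace stripping instead of SmartCrusher, whitespace collapsing instead of real prose compression, pass-through for code).

```python
import json
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class ContentType(Enum):
    CODE = "code"
    JSON_ARRAY = "json_array"
    PROSE = "prose"


@dataclass
class Section:
    type: ContentType
    content: str


def compact_json(text: str) -> str:
    # Stand-in for SmartCrusher: only strips insignificant whitespace.
    return json.dumps(json.loads(text), separators=(",", ":"))


def squeeze_prose(text: str) -> str:
    # Stand-in for prose compression: collapses runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


def keep_code(text: str) -> str:
    # Code passes through unchanged in this sketch.
    return text


HANDLERS: Dict[ContentType, Callable[[str], str]] = {
    ContentType.JSON_ARRAY: compact_json,
    ContentType.PROSE: squeeze_prose,
    ContentType.CODE: keep_code,
}


def compress_sections(sections: List[Section]) -> str:
    """Apply the handler for each section type, preserving the original order."""
    return "\n".join(HANDLERS[s.type](s.content) for s in sections)


sections = [
    Section(ContentType.PROSE, "Here   is  background. "),
    Section(ContentType.JSON_ARRAY, '[{"id": 1, "value": "foo"}]'),
    Section(ContentType.CODE, "<my-tag>protected</my-tag>"),
]
result = compress_sections(sections)
```

Keeping the handlers in a lookup table keyed by content type means each section is processed independently, so the output order always mirrors the input order.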

## Implementation Example

The following example demonstrates the complete pipeline using the HeadroomClient:

````python
from headroom import HeadroomClient

mixed = """
Here is some background text for the user.

```json
[
  {"id": 1, "value": "foo"},
  {"id": 2, "value": "bar"}
]
```

And now a custom XML tag that must stay intact: <my-tag>important</my-tag>

Finally a short plain-text instruction.
"""

client = HeadroomClient()
compressed = client.compress(mixed)  # Runs the full pipeline

print(compressed)
````

The output contains a heavily compressed JSON array, preserved `<my-tag>` content (protected by [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py)), and readable surrounding prose, demonstrating how Headroom maintains the exact mix of content types the caller supplied.

## Summary

- Headroom splits mixed input into three `ContentType` categories (`CODE`, `JSON_ARRAY`, `PROSE`) via the `content_router` transform.
- The `_extract_json_block` function balances brackets while ignoring quoted strings to accurately isolate JSON within prose.
- **Tag protection** in [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py) preserves XML structures using Rust-backed `protect_tags` walkers and placeholder substitution.
- **SmartCrusher** applies aggressive compression specifically to JSON sections while leaving code untouched or lightly compressed based on the `compress_tagged_content` flag.
- **Headroom markers** generated by [`headroom/utils.py`](https://github.com/chopratejas/headroom/blob/main/headroom/utils.py) enable precise reconstruction and section tracking throughout the pipeline.
- The `headroom/_core` Rust crate provides the underlying high-performance algorithms for content detection and processing.

## Frequently Asked Questions

### How does Headroom identify JSON arrays embedded within prose text?

Headroom uses the `_extract_json_block` function to detect JSON boundaries by balancing opening and closing brackets while specifically ignoring any brackets that appear inside quoted strings. This algorithm, tested in [`tests/test_transforms_content_router.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms_content_router.py), ensures that JSON arrays surrounded by natural language are extracted as distinct `JSON_ARRAY` sections without being corrupted by string values that happen to contain bracket characters.

### What happens to custom XML tags during the compression process?

Custom XML tags are protected before compression runs by [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py), which utilizes a Rust-backed `protect_tags` walker to replace each tag with a unique placeholder. The mapping is stored and used to restore the original tags after all transforms complete, preventing the compressor from stripping or altering these structural elements regardless of the compression settings applied to surrounding content.

### Can I customize compression levels for specific content types in Headroom?

Yes, Headroom supports selective compression strategies through configuration flags like `compress_tagged_content`, which controls whether code sections receive compression. JSON sections are automatically routed to **SmartCrusher** for aggressive token reduction, while prose receives standard text compression, allowing each content type to be processed according to its preservation requirements and your specific token budget constraints.

### Where is the core logic for mixed content handling implemented?

The core detection and compression algorithms reside in `headroom/_core`, a Rust crate that provides high-performance implementations for bracket balancing, tag protection, and section routing. Python shims in `headroom/transforms/` and [`headroom/utils.py`](https://github.com/chopratejas/headroom/blob/main/headroom/utils.py) provide the interface layer, with the actual content type detection and JSON extraction occurring in the compiled Rust code for optimal performance.
