# How Headroom Handles Mixed Content with Code, JSON, and Prose: A Technical Deep Dive

> Discover how Headroom manages mixed content that includes code, JSON, and prose. Learn about its segmentation, protection, compression, and reassembly techniques.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-06

---

**Headroom processes mixed content containing code, JSON, and prose by splitting the input into typed sections, protecting code structures with Rust-backed tag protection, applying selective compression to each content type, and reassembling the parts in their original order.**

The `chopratejas/headroom` repository provides an open-source text compression pipeline designed specifically for LLM workflows. When your input contains a blend of code snippets, JSON arrays, and natural language prose, Headroom employs a section-based architecture that treats each content type according to its specific preservation requirements rather than applying uniform compression across the entire document.

## Section Detection and ContentType Classification

Headroom begins processing by running the `content_router` transform, which walks the raw text line by line to build a list of `Section` objects. Each section is classified as one of three values of the `ContentType` enum: `CODE`, `JSON_ARRAY`, or `PROSE`.
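The line-by-line walk described above can be sketched roughly as follows. This is a minimal illustration, not the `content_router` implementation: `split_sections`, `is_json_array`, and the `Section` and `ContentType` definitions here are hypothetical stand-ins, and the sketch assumes code arrives in Markdown fences and JSON arrays fit on one line.

```python
import json
from dataclasses import dataclass
from enum import Enum

FENCE = "`" * 3  # Markdown code-fence marker


class ContentType(Enum):
    CODE = "code"
    JSON_ARRAY = "json_array"
    PROSE = "prose"


@dataclass
class Section:
    type: ContentType
    content: str


def is_json_array(line: str) -> bool:
    """True when the line parses as a complete JSON array."""
    try:
        return isinstance(json.loads(line), list)
    except ValueError:
        return False


def split_sections(text: str) -> list[Section]:
    """Hypothetical splitter: fenced blocks -> CODE, one-line arrays -> JSON_ARRAY."""
    sections: list[Section] = []
    buffer: list[str] = []
    in_fence = False

    def flush(kind: ContentType) -> None:
        # Emit the buffered lines as one section, skipping blank-only buffers.
        if any(line.strip() for line in buffer):
            sections.append(Section(kind, "\n".join(buffer)))
        buffer.clear()

    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            if not in_fence:
                flush(ContentType.PROSE)  # close the prose before the fence
            buffer.append(line)
            if in_fence:
                flush(ContentType.CODE)  # closing fence ends the code section
            in_fence = not in_fence
        elif not in_fence and is_json_array(line):
            flush(ContentType.PROSE)
            buffer.append(line)
            flush(ContentType.JSON_ARRAY)
        else:
            buffer.append(line)
    flush(ContentType.CODE if in_fence else ContentType.PROSE)
    return sections
```

Because sections are appended in the order they are encountered, the resulting list preserves the document order that reassembly later relies on.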

### JSON Boundary Detection with Bracket Balancing

For JSON detection, Headroom uses the `_extract_json_block` function, which balances brackets while specifically **ignoring brackets that appear inside quoted strings**. This prevents malformed extraction when JSON contains string values with nested brackets. The logic is validated in [`tests/test_transforms_content_router.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms_content_router.py) within the `test_mixed_content_section_splitting_and_json_extraction` test case, ensuring that JSON arrays embedded inside longer prose blocks are isolated without corruption.

```python
# Conceptual sketch of the router's output: one Section per typed span,
# kept in document order. Not runnable on its own.
sections = [
    Section(type=ContentType.PROSE, content="Here is background..."),
    Section(type=ContentType.JSON_ARRAY, content='[{"id": 1, "value": "foo"}]'),
    Section(type=ContentType.CODE, content="<my-tag>protected</my-tag>"),
]
```
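To make the bracket-balancing idea concrete, here is a minimal sketch of a string-aware balancer. It is an assumption-laden illustration rather than the body of `_extract_json_block`: the function name `extract_balanced`, its signature, and its return convention are hypothetical.

```python
from typing import Optional


def extract_balanced(text: str, start: int) -> Optional[str]:
    """Return the balanced bracket span beginning at text[start], or None.

    Hypothetical helper: brackets inside double-quoted strings are ignored,
    and backslash escapes inside strings are honored.
    """
    pairs = {"[": "]", "{": "}"}
    if start >= len(text) or text[start] not in pairs:
        return None

    stack: list[str] = []  # closing brackets we still expect
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False  # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in pairs:
            stack.append(pairs[ch])
        elif ch in pairs.values():
            if not stack or stack.pop() != ch:
                return None  # mismatched closer
            if not stack:
                return text[start : i + 1]
    return None  # input ended before the span closed
```

Tracking the in-string state is what keeps a value such as `"a ] b"` from closing the array early; a plain bracket counter would cut the block off at that point.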

## Protecting Code Structures with Tag Protection

Before any compression runs, Headroom protects custom XML tags that must remain intact. The [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py) module implements a Rust-backed walker called `protect_tags` that replaces each tag (e.g., `<my-tag>…</my-tag>`) with a unique placeholder and records the mapping. This prevents the compressor from accidentally stripping or merging critical tags during aggressive token reduction.

The tag protector integrates with `headroom/_core`, the Rust crate containing the core detection and compression algorithms, ensuring high-performance processing even with large mixed-content inputs.
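The placeholder-and-mapping idea can be sketched in plain Python. This is only an illustration of the technique: the real walker is Rust-backed, and the `protect` and `restore` functions, the regex, and the `__PROTECTED_n__` placeholder format below are all assumptions made for the example.

```python
import re

# Matches a simple paired tag such as <my-tag>...</my-tag>.
# Assumption: same-named tags are not nested inside each other.
TAG_PATTERN = re.compile(r"<([A-Za-z][\w-]*)(\s[^<>]*)?>.*?</\1>", re.DOTALL)


def protect(text: str) -> tuple[str, dict[str, str]]:
    """Swap each tagged span for a unique placeholder and record the mapping."""
    mapping: dict[str, str] = {}

    def swap(match: re.Match) -> str:
        placeholder = f"__PROTECTED_{len(mapping)}__"
        mapping[placeholder] = match.group(0)
        return placeholder

    return TAG_PATTERN.sub(swap, text), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original tagged spans back after other transforms have run."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Any transform that runs between `protect` and `restore` only ever sees the opaque placeholders, so it cannot strip or merge the tags as long as it leaves the placeholders themselves intact.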

## Marker-Based Transformation Tracking

Headroom inserts **Headroom markers** into the text to enable faithful reconstruction. Utility functions in [`headroom/utils.py`](https://github.com/chopratejas/headroom/blob/main/headroom/utils.py), including `create_marker` and `create_tool_digest_marker`, generate these markers (such as `<headroom:tool_digest …>`), which track transformations and facilitate content restoration after compression completes.

## Selective Compression Strategies

Each section type receives appropriate compression treatment:

- **JSON sections** are passed to the **SmartCrusher** transform, which applies aggressive token-level compression while strictly preserving JSON structure and schema integrity.
- **Code sections** remain untouched or receive only light compression, depending on the `compress_tagged_content` configuration flag.
- **Prose sections** undergo standard text compression techniques like summarization and token-budget reduction.

This selective approach ensures that structured data retains its machine-parseable format while natural language receives the heavy compression typically required for LLM context windows.
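The per-type routing and in-order reassembly can be sketched as a small dispatch table. The handlers below are deliberately simple stand-ins and do not reproduce SmartCrusher or Headroom's prose compression; `crush_json`, `squeeze_prose`, `keep_code`, and `compress_sections` are hypothetical names chosen for this example.

```python
import json
from typing import Callable


def crush_json(content: str) -> str:
    """Stand-in for a JSON compressor: re-serialize without whitespace."""
    return json.dumps(json.loads(content), separators=(",", ":"))


def squeeze_prose(content: str) -> str:
    """Stand-in for prose compression: collapse runs of whitespace."""
    return " ".join(content.split())


def keep_code(content: str) -> str:
    """Code passes through unchanged in this sketch."""
    return content


# Assumption: section kinds are plain strings here to keep the sketch self-contained.
HANDLERS: dict[str, Callable[[str], str]] = {
    "json_array": crush_json,
    "prose": squeeze_prose,
    "code": keep_code,
}


def compress_sections(sections: list[tuple[str, str]]) -> str:
    """Apply the handler for each (kind, content) pair and rejoin in original order."""
    return "\n".join(HANDLERS[kind](content) for kind, content in sections)
```

Keeping the handlers in a lookup table means a new content type only needs a new entry, and the JSON handler always returns text that still parses, which mirrors the structure-preserving goal described above.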

## Implementation Example

The following example demonstrates the complete pipeline using the `HeadroomClient`:

````python
from headroom import HeadroomClient

mixed = """
Here is some background text for the user.

```json
[
  {"id": 1, "value": "foo"},
  {"id": 2, "value": "bar"}
]
```

And now a custom XML tag that must stay intact:
<my-tag>important</my-tag>

Finally a short plain-text instruction.
"""

client = HeadroomClient()
compressed = client.compress(mixed)  # Runs the full pipeline

print(compressed)
````

The output contains a heavily compressed JSON array, preserved `<my-tag>` content (protected by [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py)), and readable surrounding prose, demonstrating how Headroom maintains the exact mix of content types the caller supplied.

## Summary

- Headroom splits mixed input into three `ContentType` categories (`CODE`, `JSON_ARRAY`, `PROSE`) via the `content_router` transform.
- The `_extract_json_block` function balances brackets while ignoring quoted strings to accurately isolate JSON within prose.
- **Tag protection** in [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py) preserves XML structures using Rust-backed `protect_tags` walkers and placeholder substitution.
- **SmartCrusher** applies aggressive compression specifically to JSON sections while leaving code untouched or lightly compressed based on the `compress_tagged_content` flag.
- **Headroom markers** generated by [`headroom/utils.py`](https://github.com/chopratejas/headroom/blob/main/headroom/utils.py) enable precise reconstruction and section tracking throughout the pipeline.
- The `headroom/_core` Rust crate provides the underlying high-performance algorithms for content detection and processing.

## Frequently Asked Questions

### How does Headroom identify JSON arrays embedded within prose text?

Headroom uses the `_extract_json_block` function to detect JSON boundaries by balancing opening and closing brackets while ignoring any brackets that appear inside quoted strings. This algorithm, tested in [`tests/test_transforms_content_router.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms_content_router.py), ensures that JSON arrays surrounded by natural language are extracted as distinct `JSON_ARRAY` sections without being corrupted by string values that happen to contain bracket characters.

### What happens to custom XML tags during the compression process?

Custom XML tags are protected before compression runs by [`headroom/transforms/tag_protector.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/tag_protector.py), which utilizes a Rust-backed `protect_tags` walker to replace each tag with a unique placeholder. The mapping is stored and used to restore the original tags after all transforms complete, preventing the compressor from stripping or altering these structural elements regardless of the compression settings applied to surrounding content.

### Can I customize compression levels for specific content types in Headroom?

Yes, Headroom supports selective compression strategies through configuration flags like `compress_tagged_content`, which controls whether code sections receive compression. JSON sections are automatically routed to **SmartCrusher** for aggressive token reduction, while prose receives standard text compression, allowing each content type to be processed according to its preservation requirements and your specific token budget constraints.

### Where is the core logic for mixed content handling implemented?

The core detection and compression algorithms reside in `headroom/_core`, a Rust crate that provides high-performance implementations for bracket balancing, tag protection, and section routing. Python shims in `headroom/transforms/` and [`headroom/utils.py`](https://github.com/chopratejas/headroom/blob/main/headroom/utils.py) provide the interface layer, with the actual content type detection and JSON extraction occurring in the compiled Rust code for optimal performance.