How SmartCrusher Differentiates Compression for JSON Arrays vs Nested Objects

June 7, 2026 chopratejas/headroom

SmartCrusher applies lossy, row-dropping compression to top-level JSON arrays through crush_array_json and a schema-preserving document walk to nested objects through compact_document_json. Both paths mirror their results into a Python CCR store for retrieval.

SmartCrusher is the Rust-backed compression engine in the chopratejas/headroom repository. It adapts its strategy to the shape of the incoming JSON payload. Given a plain array, it can drop rows to enforce token limits while keeping the original retrievable by reference. Given a nested object, it walks the full document to compress inner tabular arrays and large scalar fields without breaking the outer schema. Knowing which path applies to a payload matters when tuning headroom transforms.

Array-First Strategy: Row Dropping with `crush_array_json`

When the payload is a top-level JSON array, SmartCrusher treats each element as an independent record. This design aligns with typical LLM tool output that returns lists of events, logs, or rows where removing individual entries does not violate the overall schema.

How the Rust Core Processes Top-Level Arrays

The Python method crush_array_json in headroom/transforms/smart_crusher.py (lines 325-352) forwards the raw JSON string directly to the Rust core:

result = self._rust.crush_array_json(items_json, query, bias)

The Rust method crush_array_json in crates/headroom-core/src/lib.rs decides whether to keep the entire array losslessly or switch to lossy mode and drop rows. If rows are dropped, the engine injects a CCR marker such as <<ccr:HASH 20_rows_offloaded>> into the result and stores the original array in the CCR store.

After Rust returns, the Python wrapper calls _mirror_ccr_to_python_store to copy the hash into the Python compression_store. This makes the offloaded data resolvable by the /v1/retrieve endpoint or through direct ccr_get calls.
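To make the marker-and-store relationship concrete, here is a minimal sketch of pulling hashes out of CCR-style markers and looking them up in a store. It does not use headroom itself: find_ccr_hashes, resolve_markers, and the plain dict standing in for the compression store are hypothetical, and the marker grammar (a lowercase hex hash followed by a space or comma) is an assumption inferred from the marker shapes shown in this article.

```python
import re

# Assumed marker shape: "<<ccr:" + hex hash + a space or comma + free text + ">>".
CCR_MARKER = re.compile(r"<<ccr:([0-9a-f]+)[ ,][^>]*>>")


def find_ccr_hashes(text: str) -> list[str]:
    """Return every hash referenced by a CCR-style marker in text, in order."""
    return CCR_MARKER.findall(text)


def resolve_markers(text: str, store: dict[str, str]) -> dict[str, str]:
    """Map each marker hash found in text to its stored original, skipping unknown hashes."""
    return {h: store[h] for h in find_ccr_hashes(text) if h in store}


# A plain dict stands in for the Python-side store (hypothetical).
store = {"ab12cd": '[{"id": 1}, {"id": 2}]'}
compressed = '[{"id": 0}, "<<ccr:ab12cd 2_rows_offloaded>>"]'

print(find_ccr_hashes(compressed))
print(resolve_markers(compressed, store))
```

In the real library the lookup goes through ccr_get or the /v1/retrieve endpoint; the dict only illustrates why the hash has to exist on the Python side before either can resolve it.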

from headroom.transforms.smart_crusher import SmartCrusher, SmartCrusherConfig
import json

crusher = SmartCrusher(SmartCrusherConfig(), with_compaction=False)  # lossless-first disabled

items = [{"id": i, "status": "ok"} for i in range(60)]
payload = json.dumps(items)

result = crusher.crush_array_json(payload)
print("Compressed JSON:", result["items"])
print("CCR marker:", result["dropped_summary"])

# Retrieve the full original array later
full = json.loads(crusher.ccr_get(result["ccr_hash"]))

In this example, the resulting items field contains a subset of the original rows. The missing rows remain accessible through the CCR hash.
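The idea of dropping rows while keeping the original retrievable can be modeled in a few lines of plain Python. This is a toy under stated assumptions, not the Rust algorithm: drop_rows, max_rows, and the dict store are hypothetical, the sketch keeps the first rows rather than scoring them against a query or bias, and the truncated SHA-256 key is an arbitrary choice for illustration.

```python
import hashlib
import json


def drop_rows(items_json: str, max_rows: int, store: dict[str, str]) -> dict:
    """Keep the first max_rows rows; stash the full original if any are dropped."""
    items = json.loads(items_json)
    if len(items) <= max_rows:
        # Nothing dropped, so nothing needs to be stored.
        return {"items": items, "marker": None}

    # A content hash of the original payload acts as the retrieval key (assumed scheme).
    key = hashlib.sha256(items_json.encode("utf-8")).hexdigest()[:12]
    store[key] = items_json

    dropped = len(items) - max_rows
    marker = f"<<ccr:{key} {dropped}_rows_offloaded>>"
    return {"items": items[:max_rows], "marker": marker}


store: dict[str, str] = {}
payload = json.dumps([{"id": i, "status": "ok"} for i in range(60)])

result = drop_rows(payload, max_rows=40, store=store)
print(len(result["items"]), result["marker"])
```

The point of the design is that lossy output stays reversible: the caller sees fewer rows, and the marker carries enough information to fetch the rest.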

Document-First Strategy: Schema-Preserving Walk with `compact_document_json`

When the payload is a JSON object, dropping entire top-level keys would break the expected schema. SmartCrusher therefore uses a document walker that keeps the outer structure intact while targeting inner complexity.

How the Rust Core Walks Nested Objects

The Python method compact_document_json in headroom/transforms/smart_crusher.py (lines 374-385) invokes the Rust document walker:

result = self._rust.compact_document_json(doc_json)

The Rust implementation in crates/headroom-core/src/lib.rs traverses the entire JSON document. It compacts sub-arrays into CSV-style strings, replaces long opaque blobs with CCR markers, and may apply lossless compaction to individual objects. Any CCR markers emitted for large fields are mirrored into the Python store, and the transformed document is returned as a single JSON string.

This behavior is exercised in the test suites tests/test_transforms/test_smart_crusher_bugs.py and tests/test_transforms/test_smart_crusher_ccr_roundtrip.py. These tests verify document-walk integrity and ensure that CCR roundtrip retrieval works correctly for both nested arrays and large blob fields.

Compacting Sub-Arrays and Large Fields

Because the outer object schema is preserved, the document walker only compresses inner arrays or oversized scalar values. A nested tabular structure such as an events list is flattened into a CSV-style representation, while a long blob field is replaced by a retrievable CCR marker.

from headroom.transforms.smart_crusher import SmartCrusher
import json, re

crusher = SmartCrusher()
doc = {
    "events": [{"id": i, "action": "click"} for i in range(30)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,  # long field triggers CCR
}
payload = json.dumps(doc)

compressed = crusher.compact_document_json(payload)
doc_obj = json.loads(compressed)

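# The blob field is replaced by a CCR marker; extract its hash to fetch the original
m = re.search(r"<<ccr:([0-9a-f]+),", doc_obj["blob"])
if m is None:
    raise ValueError("blob was not replaced by a CCR marker")
original_blob = crusher.ccr_get(m.group(1))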

Here the events array is compacted, and the oversized blob field is replaced by a marker that resolves back to the original string.
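A schema-preserving walk can be sketched independently of headroom. The version below is a simplified model, not the Rust walker: walk, _is_table, _to_csv, the max_str threshold of 500 characters, and the comma-separated marker layout are all assumptions made for illustration. It keeps every key in place, turns uniform lists of flat dicts into CSV text, and swaps long strings for a marker backed by a dict store.

```python
import csv
import hashlib
import io


def _is_table(value) -> bool:
    """True for a non-empty list of flat dicts that all share the same keys."""
    if not (isinstance(value, list) and value and all(isinstance(r, dict) for r in value)):
        return False
    keys = value[0].keys()
    return bool(keys) and all(
        r.keys() == keys and not any(isinstance(v, (dict, list)) for v in r.values())
        for r in value
    )


def _to_csv(rows: list) -> str:
    """Render a list of uniform dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def walk(node, store: dict, max_str: int = 500):
    """Recursively compact a parsed JSON value without removing any keys."""
    if isinstance(node, dict):
        return {k: walk(v, store, max_str) for k, v in node.items()}
    if _is_table(node):
        return _to_csv(node)
    if isinstance(node, list):
        return [walk(v, store, max_str) for v in node]
    if isinstance(node, str) and len(node) > max_str:
        # Offload the long string and leave a marker in its place (assumed layout).
        key = hashlib.sha256(node.encode("utf-8")).hexdigest()[:12]
        store[key] = node
        return f"<<ccr:{key},{len(node)}_chars>>"
    return node


store: dict = {}
doc = {
    "events": [{"id": i, "action": "click"} for i in range(3)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,
}
out = walk(doc, store)
print(list(out))
print(out["events"])
```

Every top-level key survives the walk; only the values under events and blob change shape.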

Key Differences Between Array and Document Compression

Both paths rely on the Rust core but make different trade-offs based on payload shape. The distinction determines where lossy row dropping is permitted:

crush_array_json treats the payload as a list of independent records. The Rust core may drop rows to stay under token limits and stash the full original array behind a CCR hash marker.
compact_document_json treats the payload as a structured document. The Rust core walks the tree to compress nested arrays into CSV-style strings and replaces large scalar blobs with CCR markers, but it never removes top-level keys.

Both wrappers eventually call _mirror_ccr_to_python_store so that the /v1/retrieve endpoint can resolve any offloaded content from the Python side.

Summary

Top-level arrays are handled by crush_array_json, which may drop rows and store the original data behind a <<ccr:HASH N_rows_offloaded>> marker.
Nested objects are handled by compact_document_json, which performs a schema-preserving document walk that compacts inner arrays and replaces large fields with CCR markers.
The Rust implementation for both paths lives in crates/headroom-core/src/lib.rs, while the Python shims are in headroom/transforms/smart_crusher.py.
CCR mirroring is performed by both paths through _mirror_ccr_to_python_store, enabling later retrieval via ccr_get or the /v1/retrieve endpoint.

Frequently Asked Questions

How does SmartCrusher decide whether to use array or document compression?

SmartCrusher does not auto-detect the appropriate mode. The caller explicitly invokes either crush_array_json() for top-level JSON arrays or compact_document_json() for nested JSON objects. According to the headroom source code, these entry points live at lines 325-352 and 374-385 of headroom/transforms/smart_crusher.py, and each forwards to a distinct Rust function in crates/headroom-core/src/lib.rs.
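Because the choice is left to the caller, a small dispatch helper can make it explicit. The sketch below is not part of headroom: choose_entry_point is a hypothetical function that inspects the top-level JSON type and returns the name of the matching method described in this article.

```python
import json


def choose_entry_point(payload: str) -> str:
    """Return the SmartCrusher method name that matches the payload's top-level shape."""
    top = json.loads(payload)
    if isinstance(top, list):
        return "crush_array_json"
    if isinstance(top, dict):
        return "compact_document_json"
    # Bare scalars (strings, numbers, booleans, null) fit neither path.
    raise ValueError("expected a top-level JSON array or object")


print(choose_entry_point('[{"id": 1}, {"id": 2}]'))
print(choose_entry_point('{"events": [], "blob": "x"}'))
```

A caller could then look the method up on a crusher instance, for example with getattr(crusher, choose_entry_point(payload)), assuming both methods accept the payload string as their first argument as in the examples above.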

Can SmartCrusher losslessly compress a top-level JSON array?

Yes. When lossless-first compaction is enabled, the Rust core attempts a lossless approach during crush_array_json. The array example above passes with_compaction=False to the SmartCrusher constructor, which disables that step. If the array still exceeds the configured threshold, the engine falls back to lossy mode, generates a <<ccr:HASH N_rows_offloaded>> marker, and stores the original data in the CCR store for retrieval.

What happens to nested arrays inside a JSON object during compression?

During compact_document_json, the Rust document walker compacts nested tabular arrays into CSV-style strings while preserving the outer object schema. It does not drop top-level keys; it targets only inner arrays and oversized scalar blobs that exceed size thresholds.

How is compressed data retrieved after SmartCrusher processing?

Both crush_array_json and compact_document_json call _mirror_ccr_to_python_store to copy CCR entries from the Rust core into the Python compression_store. You can resolve the original payload later by calling crusher.ccr_get(hash) or through the /v1/retrieve endpoint.
