How SmartCrusher Differentiates Compression for JSON Arrays vs Nested Objects
SmartCrusher applies row-dropping lossy compression to top-level JSON arrays via crush_array_json and schema-preserving document walking to nested objects via compact_document_json, mirroring both results into a Python CCR store for retrieval.
SmartCrusher is the Rust-backed compression engine inside the chopratejas/headroom repository that adapts its strategy to the shape of the incoming JSON payload. When it receives a plain array, it can drop rows to enforce token limits while preserving a retrievable reference. When it receives a nested object, it walks the full document to compress inner tabular arrays and large scalar fields without breaking the outer schema. Understanding how SmartCrusher differentiates compression for JSON arrays versus nested objects is essential for tuning headroom transforms correctly.
Array-First Strategy: Row Dropping with crush_array_json
When the payload is a top-level JSON array, SmartCrusher treats each element as an independent record. This design aligns with typical LLM tool output that returns lists of events, logs, or rows where removing individual entries does not violate the overall schema.
How the Rust Core Processes Top-Level Arrays
The Python method crush_array_json in headroom/transforms/smart_crusher.py (lines 325-352) forwards the raw JSON string directly to the Rust core:
result = self._rust.crush_array_json(items_json, query, bias)
The Rust method crush_array_json in crates/headroom-core/src/lib.rs decides whether to keep the entire array losslessly or switch to lossy mode and drop rows. If rows are dropped, the engine injects a CCR marker such as <<ccr:HASH 20_rows_offloaded>> into the result and stores the original array in the CCR store.
After Rust returns, the Python wrapper calls _mirror_ccr_to_python_store to copy the hash into the Python compression_store. This makes the offloaded data resolvable by the /v1/retrieve endpoint or through direct ccr_get calls.
from headroom.transforms.smart_crusher import SmartCrusher, SmartCrusherConfig
import json
crusher = SmartCrusher(SmartCrusherConfig(), with_compaction=False) # lossless-first disabled
items = [{"id": i, "status": "ok"} for i in range(60)]
payload = json.dumps(items)
result = crusher.crush_array_json(payload)
print("Compressed JSON:", result["items"])
print("CCR marker:", result["dropped_summary"])
# Retrieve the full original array later
full = json.loads(crusher.ccr_get(result["ccr_hash"]))
In this example, the resulting items field contains a subset of the original rows. The missing rows remain accessible through the CCR hash.
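The row-dropping idea can be illustrated without the Rust core. The sketch below is a toy model, not headroom's implementation: toy_crush_array, ccr_store, the max_rows parameter, the keep-the-first-rows policy, and the truncated SHA-256 key are all assumptions made for illustration. Only the result field names and the marker shape follow the text above.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for the CCR store (not headroom's real store).
ccr_store: dict[str, str] = {}


def toy_crush_array(items_json: str, max_rows: int = 10) -> dict:
    """Keep at most max_rows rows and offload the full array behind a hash."""
    items = json.loads(items_json)
    if len(items) <= max_rows:
        # Small arrays pass through untouched: nothing is dropped or stored.
        return {"items": items, "dropped_summary": "", "ccr_hash": None}
    # Assumption: a truncated SHA-256 of the payload serves as the store key.
    ccr_hash = hashlib.sha256(items_json.encode("utf-8")).hexdigest()[:12]
    ccr_store[ccr_hash] = items_json
    dropped = len(items) - max_rows
    marker = f"<<ccr:{ccr_hash} {dropped}_rows_offloaded>>"
    # Assumption: keep the first max_rows rows; a real engine may rank rows instead.
    return {"items": items[:max_rows], "dropped_summary": marker, "ccr_hash": ccr_hash}


rows = [{"id": i, "status": "ok"} for i in range(60)]
result = toy_crush_array(json.dumps(rows), max_rows=40)
restored = json.loads(ccr_store[result["ccr_hash"]])
print(len(result["items"]), restored == rows)  # 40 True
```

The point of the design is that the marker travels with the truncated result, so a reader of the compressed output can always tell that rows are missing and which key recovers them.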
Document-First Strategy: Schema-Preserving Walk with compact_document_json
When the payload is a JSON object, dropping entire top-level keys would break the expected schema. SmartCrusher therefore uses a document walker that keeps the outer structure intact while targeting inner complexity.
How the Rust Core Walks Nested Objects
The Python method compact_document_json in headroom/transforms/smart_crusher.py (lines 374-385) invokes the Rust document walker:
result = self._rust.compact_document_json(doc_json)
The Rust implementation in crates/headroom-core/src/lib.rs traverses the entire JSON document. It compacts sub-arrays into CSV-style strings, replaces long opaque blobs with CCR markers, and may apply lossless compaction to individual objects. Any CCR markers emitted for large fields are mirrored into the Python store, and the transformed document is returned as a single JSON string.
This behavior is exercised in the test suites tests/test_transforms/test_smart_crusher_bugs.py and tests/test_transforms/test_smart_crusher_ccr_roundtrip.py. These tests verify document-walk integrity and ensure that CCR roundtrip retrieval works correctly for both nested arrays and large blob fields.
Compacting Sub-Arrays and Large Fields
Because the outer object schema is preserved, the document walker only compresses inner arrays or oversized scalar values. A nested tabular structure such as an events list is flattened into a CSV-style representation, while a long blob field is replaced by a retrievable CCR marker.
from headroom.transforms.smart_crusher import SmartCrusher
import json, re
crusher = SmartCrusher()
doc = {
    "events": [{"id": i, "action": "click"} for i in range(30)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,  # long field triggers CCR
}
payload = json.dumps(doc)
compressed = crusher.compact_document_json(payload)
doc_obj = json.loads(compressed)
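# The blob field is replaced by a CCR marker; capture the hash regardless of what follows it
m = re.search(r"<<ccr:([0-9a-f]+)", doc_obj["blob"])
# Guard against a missing marker (for example, if the blob was left inline)
original_blob = crusher.ccr_get(m.group(1)) if m else None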
Here the events array is compacted, and the oversized blob field is replaced by a marker that resolves back to the original string.
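To make the schema-preserving walk concrete, here is a minimal pure-Python sketch of the same idea. It is not the Rust walker: toy_compact, is_uniform_table, blob_store, the 500-character threshold, and the bare <<ccr:KEY>> marker text are assumptions chosen for illustration, and the CSV join is deliberately naive.

```python
import hashlib

# Hypothetical in-memory stand-in for the CCR store.
blob_store: dict[str, str] = {}


def is_uniform_table(rows: list) -> bool:
    """True for a non-empty list of dicts that share keys and hold only scalars."""
    if not rows or not all(isinstance(row, dict) for row in rows):
        return False
    first_keys = list(rows[0].keys())
    return bool(first_keys) and all(
        list(row.keys()) == first_keys
        and not any(isinstance(v, (dict, list)) for v in row.values())
        for row in rows
    )


def toy_compact(node, max_str_len: int = 500):
    """Recursively compact a parsed JSON value while keeping every object key."""
    if isinstance(node, dict):
        # Keys always survive; only the values are rewritten.
        return {key: toy_compact(value, max_str_len) for key, value in node.items()}
    if isinstance(node, list):
        if is_uniform_table(node):
            # One header line, then one line per row.
            # Naive join: commas or newlines inside values are not escaped.
            columns = list(node[0].keys())
            lines = [",".join(columns)]
            lines += [",".join(str(row[c]) for c in columns) for row in node]
            return "\n".join(lines)
        return [toy_compact(value, max_str_len) for value in node]
    if isinstance(node, str) and len(node) > max_str_len:
        # Assumption: truncated SHA-256 as the key; the marker text is illustrative.
        key = hashlib.sha256(node.encode("utf-8")).hexdigest()[:12]
        blob_store[key] = node
        return f"<<ccr:{key}>>"
    return node


doc = {
    "events": [{"id": i, "action": "click"} for i in range(3)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,
}
compacted = toy_compact(doc)
print(list(compacted.keys()))  # ['events', 'metadata', 'blob']
print(compacted["events"])
```

Note that the top-level keys come back unchanged even though two of the three values were rewritten, which is the property that distinguishes the document path from the array path.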
Key Differences Between Array and Document Compression
Both paths rely on the Rust core but make different trade-offs based on payload shape. The distinction determines where lossy row dropping is permitted:
- crush_array_json treats the payload as a list of independent records. The Rust core may drop rows to stay under token limits and stash the full original array behind a CCR hash marker.
- compact_document_json treats the payload as a structured document. The Rust core walks the tree to compress nested arrays into CSV-style strings and replaces large scalar blobs with CCR markers, but it never removes top-level objects.
Both wrappers eventually call _mirror_ccr_to_python_store so that the /v1/retrieve endpoint can resolve any offloaded content from the Python side.
Summary
- Top-level arrays are handled by crush_array_json, which may drop rows and store the original data behind a <<ccr:HASH N_rows_offloaded>> marker.
- Nested objects are handled by compact_document_json, which performs a schema-preserving document walk that compacts inner arrays and replaces large fields with CCR markers.
- Rust implementation for both paths lives in crates/headroom-core/src/lib.rs, while the Python shims are in headroom/transforms/smart_crusher.py.
- CCR mirroring is performed by both paths through _mirror_ccr_to_python_store, enabling later retrieval via ccr_get or the /v1/retrieve endpoint.
Frequently Asked Questions
How does SmartCrusher decide whether to use array or document compression?
SmartCrusher does not auto-detect the appropriate mode. The caller explicitly invokes either crush_array_json() for top-level JSON arrays or compact_document_json() for nested JSON objects. According to the headroom source code, these entry points live at lines 325-352 and 374-385 of headroom/transforms/smart_crusher.py, and each forwards to a distinct Rust function in crates/headroom-core/src/lib.rs.
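Since the choice is left to the caller, a small helper can make it based on the top-level JSON type. This is a sketch of caller-side code, not part of headroom: compress_by_shape and StubCrusher are hypothetical names, and the stub exists only so the example runs without the library installed. With a real SmartCrusher instance, the two branches would return whatever those methods return, which differ in shape between the array and document paths.

```python
import json


def compress_by_shape(crusher, payload: str):
    """Call the array or document entry point based on the top-level JSON type."""
    parsed = json.loads(payload)
    if isinstance(parsed, list):
        return crusher.crush_array_json(payload)
    if isinstance(parsed, dict):
        return crusher.compact_document_json(payload)
    # Scalars (strings, numbers, booleans, null) are returned unchanged.
    return payload


class StubCrusher:
    """Hypothetical stand-in that records which entry point was chosen."""

    def crush_array_json(self, payload):
        return ("array", payload)

    def compact_document_json(self, payload):
        return ("document", payload)


stub = StubCrusher()
print(compress_by_shape(stub, "[1, 2, 3]")[0])  # array
print(compress_by_shape(stub, '{"a": 1}')[0])   # document
```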
Can SmartCrusher losslessly compress a top-level JSON array?
Yes, provided compaction is left enabled. In that case the Rust core attempts a lossless approach first during crush_array_json. If the array still exceeds the configured threshold, the engine falls back to lossy mode, generates a <<ccr:HASH N_rows_offloaded>> marker, and stores the original data in the CCR store for retrieval. Passing with_compaction=False to the SmartCrusher constructor disables the lossless-first pass.
What happens to nested arrays inside a JSON object during compression?
During compact_document_json, the Rust document walker compacts nested tabular arrays into CSV-style strings while preserving the outer object schema. It never drops top-level keys; it only rewrites inner arrays and scalar blobs that exceed the size thresholds.
How is compressed data retrieved after SmartCrusher processing?
Both crush_array_json and compact_document_json call _mirror_ccr_to_python_store to copy CCR entries from the Rust core into the Python compression_store. You can resolve the original payload later by calling crusher.ccr_get(hash) or through the /v1/retrieve endpoint.
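When compressed output contains several markers, it can help to collect every hash before resolving them. The sketch below assumes hashes are lowercase hexadecimal and that markers have the <<ccr:HASH ...>> shape shown earlier; find_ccr_hashes, resolve_all, and fake_store are hypothetical names. In real use, the lookup argument would be crusher.ccr_get.

```python
import re

# Assumption: hashes are lowercase hex and markers look like <<ccr:HASH ...>>.
CCR_MARKER = re.compile(r"<<ccr:([0-9a-f]+)[^>]*>>")


def find_ccr_hashes(text: str) -> list[str]:
    """Return CCR hashes in order of first appearance, without duplicates."""
    seen: list[str] = []
    for ccr_hash in CCR_MARKER.findall(text):
        if ccr_hash not in seen:
            seen.append(ccr_hash)
    return seen


def resolve_all(text: str, lookup) -> dict:
    """Map each hash found in text to whatever lookup(hash) returns."""
    return {ccr_hash: lookup(ccr_hash) for ccr_hash in find_ccr_hashes(text)}


sample = (
    "rows: <<ccr:ab12cd 20_rows_offloaded>> "
    "blob: <<ccr:ff00aa>> "
    "rows again: <<ccr:ab12cd 20_rows_offloaded>>"
)
fake_store = {"ab12cd": "[...]", "ff00aa": "AAAA"}  # stand-in for the CCR store
print(find_ccr_hashes(sample))  # ['ab12cd', 'ff00aa']
print(resolve_all(sample, fake_store.get))
```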