# How SmartCrusher Differentiates Compression for JSON Arrays vs Nested Objects

> SmartCrusher compresses JSON arrays and nested objects differently. Discover how row-dropping and schema-preserving methods optimize your data storage and retrieval.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-07

---

**SmartCrusher applies row-dropping lossy compression to top-level JSON arrays via `crush_array_json` and schema-preserving document walking to nested objects via `compact_document_json`, mirroring both results into a Python CCR store for retrieval.**

SmartCrusher is the Rust-backed compression engine inside the `chopratejas/headroom` repository that adapts its strategy to the shape of the incoming JSON payload. When it receives a plain array, it can drop rows to enforce token limits while preserving a retrievable reference. When it receives a nested object, it walks the full document to compress inner tabular arrays and large scalar fields without breaking the outer schema. Understanding how SmartCrusher differentiates compression for JSON arrays versus nested objects is essential for tuning headroom transforms correctly.

## Array-First Strategy: Row Dropping with `crush_array_json`

When the payload is a top-level JSON array, SmartCrusher treats each element as an independent record. This design aligns with typical LLM tool output that returns lists of events, logs, or rows where removing individual entries does not violate the overall schema.

### How the Rust Core Processes Top-Level Arrays

The Python method `crush_array_json` in [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py) (lines 325-352) forwards the raw JSON string directly to the Rust core:

```python
result = self._rust.crush_array_json(items_json, query, bias)
```

The Rust method `crush_array_json` in [`crates/headroom-core/src/lib.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/lib.rs) decides whether to keep the entire array losslessly or switch to lossy mode and drop rows. If rows are dropped, the engine injects a **CCR marker** such as `<<ccr:HASH 20_rows_offloaded>>` into the result and stores the original array in the CCR store.

After Rust returns, the Python wrapper calls `_mirror_ccr_to_python_store` to copy the hash into the Python `compression_store`. This makes the offloaded data resolvable by the `/v1/retrieve` endpoint or through direct `ccr_get` calls.

```python
import json

from headroom.transforms.smart_crusher import SmartCrusher, SmartCrusherConfig

# with_compaction=False disables the lossless-first pass
crusher = SmartCrusher(SmartCrusherConfig(), with_compaction=False)

items = [{"id": i, "status": "ok"} for i in range(60)]
payload = json.dumps(items)

result = crusher.crush_array_json(payload)
print("Compressed JSON:", result["items"])
print("CCR marker:", result["dropped_summary"])

# Retrieve the full original array later.
# Guard assumes the hash is empty when no rows were dropped.
ccr_hash = result["ccr_hash"]
if ccr_hash:
    full = json.loads(crusher.ccr_get(ccr_hash))
```

In this example, the resulting `items` field contains a subset of the original rows. The missing rows remain accessible through the CCR hash.
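The row-dropping behavior can be pictured with a toy pure-Python model. This is a sketch, not the headroom implementation: the function name `toy_crush_array`, the `max_rows` threshold, the keep-the-first-rows policy, and the truncated SHA-256 hash are all assumptions made for illustration. Only the result keys and the marker shape follow the text above.

```python
import hashlib
import json


def toy_crush_array(items, max_rows, store):
    """Toy model of lossy row dropping (hypothetical, not the Rust core)."""
    if len(items) <= max_rows:
        # Small enough: keep every row, nothing is offloaded.
        return {"items": items, "ccr_hash": None, "dropped_summary": None}

    # Stash the full original array under a content hash (assumed scheme).
    original = json.dumps(items, sort_keys=True)
    ccr_hash = hashlib.sha256(original.encode("utf-8")).hexdigest()[:12]
    store[ccr_hash] = original

    # Keep the first rows; a real engine would likely rank rows instead.
    kept = items[:max_rows]
    dropped = len(items) - len(kept)
    marker = f"<<ccr:{ccr_hash} {dropped}_rows_offloaded>>"
    return {"items": kept, "ccr_hash": ccr_hash, "dropped_summary": marker}


store = {}
rows = [{"id": i, "status": "ok"} for i in range(60)]
result = toy_crush_array(rows, max_rows=40, store=store)
print(result["dropped_summary"])

# The offloaded rows are still recoverable through the hash.
restored = json.loads(store[result["ccr_hash"]])
```

The point of the sketch is the contract: the caller gets fewer rows plus a marker, and the hash in that marker is enough to recover everything that was removed.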

## Document-First Strategy: Schema-Preserving Walk with `compact_document_json`

When the payload is a JSON object, dropping entire top-level keys would break the expected schema. SmartCrusher therefore uses a document walker that keeps the outer structure intact while targeting inner complexity.

### How the Rust Core Walks Nested Objects

The Python method `compact_document_json` in [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py) (lines 374-385) invokes the Rust document walker:

```python
result = self._rust.compact_document_json(doc_json)
```

The Rust implementation in [`crates/headroom-core/src/lib.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/lib.rs) traverses the entire JSON document. It compacts **sub-arrays** into CSV-style strings, replaces long opaque blobs with CCR markers, and may apply lossless compaction to individual objects. Any CCR markers emitted for large fields are mirrored into the Python store, and the transformed document is returned as a single JSON string.

This behavior is exercised in the test suites [`tests/test_transforms/test_smart_crusher_bugs.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms/test_smart_crusher_bugs.py) and [`tests/test_transforms/test_smart_crusher_ccr_roundtrip.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms/test_smart_crusher_ccr_roundtrip.py). These tests verify document-walk integrity and ensure that CCR roundtrip retrieval works correctly for both nested arrays and large blob fields.

### Compacting Sub-Arrays and Large Fields

Because the outer object schema is preserved, the document walker only compresses inner arrays or oversized scalar values. A nested tabular structure such as an `events` list is flattened into a CSV-style representation, while a long `blob` field is replaced by a retrievable CCR marker.

```python
import json
import re

from headroom.transforms.smart_crusher import SmartCrusher

crusher = SmartCrusher()
doc = {
    "events": [{"id": i, "action": "click"} for i in range(30)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,  # long field triggers CCR
}
payload = json.dumps(doc)

compressed = crusher.compact_document_json(payload)
doc_obj = json.loads(compressed)

# The blob field is replaced by a CCR marker; extract the hash from it
m = re.search(r"<<ccr:([0-9a-f]+)", doc_obj["blob"])
if m is None:
    raise ValueError("no CCR marker found in the blob field")
original_blob = crusher.ccr_get(m.group(1))
```

Here the `events` array is compacted, and the oversized `blob` field is replaced by a marker that resolves back to the original string.
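A toy walker makes the schema-preserving idea concrete. This is a minimal sketch under stated assumptions, not the Rust document walker: the names `toy_compact_document` and `is_flat_table`, the `max_len` threshold, the hashing scheme, and the bare `<<ccr:HASH>>` marker text are all hypothetical, and the real marker format may differ.

```python
import csv
import hashlib
import io


def is_flat_table(rows):
    """True for a non-empty list of dicts sharing the same scalar-valued keys."""
    if not rows or not all(isinstance(r, dict) for r in rows):
        return False
    keys = list(rows[0])
    return all(
        list(r) == keys
        and not any(isinstance(v, (dict, list)) for v in r.values())
        for r in rows
    )


def toy_compact_document(node, store, max_len=256):
    """Toy schema-preserving walk (hypothetical, not the Rust core)."""
    if isinstance(node, dict):
        # Every key is kept; only the values are rewritten.
        return {k: toy_compact_document(v, store, max_len) for k, v in node.items()}
    if isinstance(node, list):
        if is_flat_table(node):
            # Tabular sub-array: one header line, then one CSV line per row.
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            writer.writerow(node[0].keys())
            writer.writerows(list(r.values()) for r in node)
            return buf.getvalue()
        return [toy_compact_document(v, store, max_len) for v in node]
    if isinstance(node, str) and len(node) > max_len:
        # Oversized scalar: offload it and leave a marker behind.
        h = hashlib.sha256(node.encode("utf-8")).hexdigest()[:12]
        store[h] = node
        return f"<<ccr:{h}>>"
    return node


store = {}
doc = {
    "events": [{"id": i, "action": "click"} for i in range(3)],
    "metadata": {"author": "alice"},
    "blob": "A" * 2000,
}
compact = toy_compact_document(doc, store)
print(compact["events"])
```

Note that the set of top-level keys is identical before and after the walk. Only the values under `events` and `blob` change shape.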

## Key Differences Between Array and Document Compression

Both paths rely on the Rust core but make different trade-offs based on payload shape. The distinction determines where lossy row dropping is permitted:

- **`crush_array_json`** treats the payload as a list of independent records. The Rust core may drop rows to stay under token limits and stash the full original array behind a CCR hash marker.
- **`compact_document_json`** treats the payload as a structured document. The Rust core walks the tree to compress nested arrays into CSV-style strings and replaces large scalar blobs with CCR markers, but it never removes top-level keys.

Both wrappers eventually call `_mirror_ccr_to_python_store` so that the `/v1/retrieve` endpoint can resolve any offloaded content from the Python side.

## Summary

- **Top-level arrays** are handled by `crush_array_json`, which may drop rows and store the original data behind a `<<ccr:HASH N_rows_offloaded>>` marker.
- **Nested objects** are handled by `compact_document_json`, which performs a schema-preserving document walk that compacts inner arrays and replaces large fields with CCR markers.
- **Rust implementation** for both paths lives in [`crates/headroom-core/src/lib.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/lib.rs), while the Python shims are in [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py).
- **CCR mirroring** is performed by both paths through `_mirror_ccr_to_python_store`, enabling later retrieval via `ccr_get` or the `/v1/retrieve` endpoint.

## Frequently Asked Questions

### How does SmartCrusher decide whether to use array or document compression?

SmartCrusher does not auto-detect the appropriate mode. The caller explicitly invokes either `crush_array_json()` for top-level JSON arrays or `compact_document_json()` for nested JSON objects. According to the `headroom` source code, these entry points live at lines 325-352 and 374-385 of [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py), and each forwards to a distinct Rust function in [`crates/headroom-core/src/lib.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/lib.rs).
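Because the choice is left to the caller, a small dispatch helper can be useful. The sketch below is not part of headroom; `pick_entry_point` is a hypothetical name, and it only inspects the top-level JSON type to return the name of the method to call.

```python
import json


def pick_entry_point(payload: str) -> str:
    """Return the SmartCrusher method name suited to the payload's shape.

    Hypothetical helper: headroom itself leaves this choice to the caller.
    """
    parsed = json.loads(payload)
    if isinstance(parsed, list):
        return "crush_array_json"
    if isinstance(parsed, dict):
        return "compact_document_json"
    raise ValueError("expected a top-level JSON array or object")


print(pick_entry_point('[{"id": 1}, {"id": 2}]'))
print(pick_entry_point('{"events": [], "blob": "..."}'))
```

With a crusher instance in hand, the result could be used as `getattr(crusher, pick_entry_point(payload))(payload)`, assuming both methods accept the payload as their only required argument, as the examples in this article do.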

### Can SmartCrusher losslessly compress a top-level JSON array?

Yes, as long as the lossless-first pass is left enabled. Passing `with_compaction=False` to the `SmartCrusher` constructor, as the array example does, disables that pass. With it enabled, the Rust core first attempts a lossless approach during `crush_array_json`. If the array still exceeds the configured threshold, the engine falls back to lossy mode, generates a `<<ccr:HASH N_rows_offloaded>>` marker, and stores the original data in the CCR store for retrieval.

### What happens to nested arrays inside a JSON object during compression?

During `compact_document_json`, the Rust document walker compacts nested tabular arrays into CSV-style strings while preserving the outer object schema. It does not drop top-level keys. It only targets inner arrays and scalar blobs that exceed size thresholds.

### How is compressed data retrieved after SmartCrusher processing?

Both `crush_array_json` and `compact_document_json` call `_mirror_ccr_to_python_store` to copy CCR entries from the Rust core into the Python `compression_store`. You can resolve the original payload later by calling `crusher.ccr_get(hash)` or through the `/v1/retrieve` endpoint.
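Before calling `ccr_get`, you need the hash from each marker in the compressed output. The helper below is a sketch, not a headroom API: `find_ccr_hashes` is a hypothetical name, and it assumes every marker starts with `<<ccr:` followed by a lowercase hexadecimal hash, which matches the markers shown in this article.

```python
import re

# Assumed marker prefix: "<<ccr:" followed by a lowercase hex hash.
CCR_MARKER = re.compile(r"<<ccr:([0-9a-f]+)")


def find_ccr_hashes(text: str) -> list:
    """Return each CCR hash referenced in text, in order, without duplicates."""
    seen = []
    for ccr_hash in CCR_MARKER.findall(text):
        if ccr_hash not in seen:
            seen.append(ccr_hash)
    return seen


sample = "kept 40 rows <<ccr:ab12cd 20_rows_offloaded>> and again <<ccr:ab12cd 20_rows_offloaded>>"
print(find_ccr_hashes(sample))
```

Each returned hash can then be passed to `crusher.ccr_get(...)` to fetch the offloaded payload.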