# Kompress vs. SmartCrusher Compression and Latency Comparison in Headroom

> Explore the Kompress vs. SmartCrusher compression and latency comparison in Headroom. Discover which solution offers faster processing and better reduction for your data needs.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: performance
- Published: 2026-06-08

---

**SmartCrusher compresses JSON arrays in ~1 ms with 70-90% item reduction, while Kompress uses ML inference to compress plain text in 50-200 ms with 80-95% token reduction, and Headroom runs them sequentially in its TransformPipeline.**

The `chopratejas/headroom` repository provides a dual-stage compression system designed to shrink LLM prompts before they reach the model. Choosing between **Kompress** and **SmartCrusher** depends on whether your bottleneck is token cost or request latency. This guide compares their internals, benchmarks, and configuration using the actual source files from the project.

## SmartCrusher: Structural JSON Compression

**SmartCrusher** is the default, algorithmic compressor that ships enabled in every Headroom client. According to [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py), it automatically inspects JSON arrays larger than 200 tokens and trims them while preserving schema-critical data.

The algorithm keeps the first and last *N* entries, retains anomalies such as errors and warnings, and preserves items that carry a high-information score. Because it never mutates the text inside individual elements, the output remains valid JSON and executes in roughly **1 ms** per array.
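The trimming rule described above can be sketched in a few lines. This is a minimal illustration, not the code in `smart_crusher.py`: the `crush_array` function, the `keep_edges` parameter, and the `ANOMALY_KEYS` markers are hypothetical names, and the high-information scoring step is left out for brevity.

```python
from typing import Any

# Hypothetical markers for anomalous items; the real detector may differ.
ANOMALY_KEYS = ("error", "warning")


def crush_array(items: list[dict[str, Any]], keep_edges: int = 3) -> list[dict[str, Any]]:
    """Keep the first and last `keep_edges` items plus anomalies, in original order."""
    if len(items) <= 2 * keep_edges:
        return list(items)  # too short to be worth trimming

    # Indices of the leading and trailing entries.
    keep = set(range(keep_edges)) | set(range(len(items) - keep_edges, len(items)))

    # Retain anything that looks like an error or warning.
    for i, item in enumerate(items):
        level = str(item.get("level", "")).lower()
        if level in ANOMALY_KEYS or any(key in item for key in ANOMALY_KEYS):
            keep.add(i)

    # Items are returned untouched, so the result still serializes to valid JSON.
    return [items[i] for i in sorted(keep)]


logs = [{"id": i, "level": "info"} for i in range(100)]
logs[42] = {"id": 42, "level": "error", "msg": "timeout"}

crushed = crush_array(logs)
print(len(crushed), [item["id"] for item in crushed])  # 7 [0, 1, 2, 42, 97, 98, 99]
```

Because whole items are either kept or dropped, the work is a single pass over the array, which is consistent with the millisecond-scale cost reported for this stage.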

Benchmarks recorded in [`wiki/benchmarks.md`](https://github.com/chopratejas/headroom/blob/main/wiki/benchmarks.md) show that SmartCrusher achieves a **70-90% reduction** in array items. This makes it ideal for compressing tool outputs, logs, and other structured payloads where deterministic, millisecond-scale latency is required.

## Kompress: ML-Based Semantic Compression

**Kompress** (formerly LLMLINGUA) is an opt-in stage defined in [`headroom/transforms/kompress_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/kompress_compressor.py). It is only available when you install the `headroom-ai[ml]` extra, and it runs after SmartCrusher and other structural transforms.

Kompress loads a **ModernBERT** model via ONNX, together with tokenizers and tree-sitter parsers, to score each token for semantic redundancy. Tokens that fall below a confidence threshold are stripped, and the surviving tokens are reassembled into the compressed message. This per-token classification is CPU-intensive and adds significant latency.
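The threshold step can be illustrated without a model. In the sketch below, the per-token keep scores are hand-written stand-ins for what a classifier would produce, and `prune_tokens` and its `threshold` default are hypothetical names rather than Kompress's actual interface.

```python
def prune_tokens(tokens: list[str], keep_scores: list[float], threshold: float = 0.5) -> str:
    """Drop tokens whose keep score falls below `threshold`, then rejoin the rest."""
    if len(tokens) != len(keep_scores):
        raise ValueError("each token needs exactly one score")
    kept = [tok for tok, score in zip(tokens, keep_scores) if score >= threshold]
    return " ".join(kept)


tokens = ["Please", "note", "that", "the", "server", "returned", "error", "503"]
# Stand-in scores; in Kompress these would come from model inference.
scores = [0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.95, 0.99]

print(prune_tokens(tokens, scores))  # server returned error 503
```

The filtering itself is cheap. The latency comes from producing a score for every token, which requires a forward pass through the model.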

As documented in [`wiki/benchmarks.md`](https://github.com/chopratejas/headroom/blob/main/wiki/benchmarks.md), typical CPU inference for Kompress ranges from **50 ms to 200 ms**. The P90 latency is approximately **608 ms**, made up of **32 ms** of request-handling overhead plus **576 ms** of ONNX pipeline time. In exchange, Kompress delivers an **80-95% token-level reduction** on plain text while preserving semantic meaning.
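The P90 figure is simply the sum of its two components, and working it through shows how much of the tail belongs to inference. The variable names below are illustrative; the numbers are the ones quoted above.

```python
request_overhead_ms = 32   # request-handling overhead
onnx_pipeline_ms = 576     # ONNX pipeline time

p90_total_ms = request_overhead_ms + onnx_pipeline_ms
onnx_share = onnx_pipeline_ms / p90_total_ms

print(p90_total_ms)          # 608
print(round(onnx_share, 3))  # 0.947
```

Roughly 95% of the P90 latency is spent in the ONNX pipeline, so trimming request-handling overhead would barely move the tail.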

## Pipeline Order and End-to-End Latency

Both compressors are wired into the **TransformPipeline** in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py) through the `eager_load_compressors` entry point. When both stages are active, Headroom processes them in a fixed order:

1. **SmartCrusher** crushes JSON arrays in ~1 ms.
2. **Kompress** compresses the remaining plain-text payload in ~50-200 ms.

For a typical LLM request containing a few kilobytes of text, the combined end-to-end latency is roughly **150 ms**. If your payload lands in the slow tail of the ONNX path, however, total latency can climb toward **608 ms**, which can consume most of a real-time application's latency budget.
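A fixed-order pipeline with per-stage timing can be sketched as follows. This is a generic illustration under stated assumptions, not Headroom's `TransformPipeline`: `run_pipeline` and both stage functions are hypothetical stand-ins, with a cheap string cleanup playing the structural stage and a stop-word filter playing the semantic one.

```python
import time
from typing import Callable

Transform = Callable[[str], str]


def run_pipeline(payload: str, stages: list[tuple[str, Transform]]) -> tuple[str, dict[str, float]]:
    """Apply each stage in order, recording how long each one took in milliseconds."""
    timings: dict[str, float] = {}
    for name, transform in stages:
        start = time.perf_counter()
        payload = transform(payload)  # each stage sees the previous stage's output
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


def structural_stage(text: str) -> str:
    # Stand-in for a fast structural transform.
    return text.replace("  ", " ")


def semantic_stage(text: str) -> str:
    # Stand-in for a slower semantic transform.
    return " ".join(w for w in text.split() if w.lower() not in {"the", "a", "an"})


out, timings = run_pipeline(
    "the  quick  brown fox",
    [("structural", structural_stage), ("semantic", semantic_stage)],
)
print(out)  # quick brown fox
print({name: f"{ms:.3f} ms" for name, ms in timings.items()})
```

Because the stages run one after another, total latency is the sum of the stage timings, so the slowest stage sets the floor for the whole request.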

## When to Use Each Compressor

Select the compressor that matches your data shape and latency constraints:

- **SmartCrusher only**: Best for deterministic JSON reduction where schema validity and negligible latency are mandatory.
- **Kompress after SmartCrusher**: Best for heavy text payloads when maximizing token savings justifies the ML inference cost.
- **Disable Kompress**: Best for real-time UIs that must keep total round-trip latency under ~50 ms.
- **Enable both**: Best for cost-optimized LLM prompting where an 80-95% token reduction outweighs the ~150 ms processing penalty.
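The guidance above can be condensed into a small decision helper. The function name, its parameters, and the choice of 200 ms (the upper end of the typical Kompress range) as the cutoff are assumptions made for illustration; tune the cutoff against your own measurements.

```python
# Assumed cutoff: the upper end of the typical Kompress latency range.
KOMPRESS_TYPICAL_MAX_MS = 200.0


def choose_stages(latency_budget_ms: float, has_json_arrays: bool, has_long_text: bool) -> list[str]:
    """Return the stages to enable, in pipeline order, for a given latency budget."""
    stages: list[str] = []
    if has_json_arrays:
        stages.append("smart_crusher")  # cheap enough to run under any realistic budget
    if has_long_text and latency_budget_ms >= KOMPRESS_TYPICAL_MAX_MS:
        stages.append("kompress")  # only when the budget can absorb ML inference
    return stages


print(choose_stages(50, has_json_arrays=True, has_long_text=True))   # ['smart_crusher']
print(choose_stages(500, has_json_arrays=True, has_long_text=True))  # ['smart_crusher', 'kompress']
```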

## Enabling and Benchmarking Both Stages

You can configure both compressors through the Headroom client API. SmartCrusher requires no extra dependencies, but Kompress needs the ML extra installed.

```python
# Example: Enable both compressors in a Headroom client
from headroom import HeadroomClient, SmartCrusherConfig, KompressConfig

client = HeadroomClient(
    # SmartCrusher runs automatically on JSON arrays
    smart_crusher_config=SmartCrusherConfig(),
    # Opt-in to Kompress (requires the `headroom-ai[ml]` extra)
    kompress_config=KompressConfig(),
)

# Send a request with a long textual payload
response = client.compress(
    messages=[
        {"role": "user", "content": "..." * 5000}  # large text
    ]
)

print(response.compressed_messages)  # compressed by both stages
```

To measure the latency of each stage in isolation, instantiate the transform classes directly:

```python
# Example: Benchmark the two stages individually
import time

from headroom.transforms.smart_crusher import SmartCrusher, SmartCrusherConfig
from headroom.transforms.kompress_compressor import KompressCompressor

# SmartCrusher only
sc = SmartCrusher(SmartCrusherConfig())
payload = [{"id": i, "data": "x"} for i in range(1000)]
start = time.perf_counter()
sc_stats = sc.compress(json_array=payload)
sc_ms = (time.perf_counter() - start) * 1000
print(f"SmartCrusher kept {len(sc_stats)} items in {sc_ms:.1f} ms")

# Kompress only (requires the ML extra)
kc = KompressCompressor()
text = " ".join(["word"] * 2000)
# The first call may include model loading, so consider a warm-up call before timing.
start = time.perf_counter()
kc_stats = kc.compress(text)
kc_ms = (time.perf_counter() - start) * 1000
print(f"Kompress reduced to {len(kc_stats.tokens)} tokens in {kc_ms:.1f} ms")
```

## Summary

- **SmartCrusher** handles JSON arrays algorithmically in [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py), delivering **70-90% item reduction** in about **1 ms**.
- **Kompress** handles plain text via an ONNX ModernBERT model in [`headroom/transforms/kompress_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/kompress_compressor.py), delivering **80-95% token reduction** at a cost of **50-200 ms** typical latency.
- Headroom orchestrates both stages through `eager_load_compressors` in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py), running SmartCrusher first and Kompress second.
- For strict latency budgets, rely on SmartCrusher alone; for maximum token savings, enable both stages and accept the ML inference overhead.

## Frequently Asked Questions

### How do SmartCrusher and Kompress differ architecturally?

SmartCrusher is a deterministic JSON array compressor implemented in [`headroom/transforms/smart_crusher.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/smart_crusher.py) that trims array boundaries while preserving anomalies and schema-critical entries. Kompress is a probabilistic, ML-based token classifier implemented in [`headroom/transforms/kompress_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/kompress_compressor.py) that uses a ModernBERT ONNX model to remove semantically redundant tokens from plain text.

### What latency should I expect from Kompress vs. SmartCrusher?

SmartCrusher typically adds **~1 ms** per JSON array according to [`wiki/benchmarks.md`](https://github.com/chopratejas/headroom/blob/main/wiki/benchmarks.md). Kompress adds roughly **50-200 ms** for standard CPU inference, with a P90 of approximately **608 ms** once request-handling overhead and ONNX pipeline time are included.

### Should I enable Kompress if my application has a strict latency budget?

No. Kompress is CPU-intensive and dominates the latency budget, so applications requiring sub-50 ms round trips should disable the ML compressor and rely solely on SmartCrusher. The structural compressor guarantees schema validity without model inference.

### How do I enable Kompress in the Headroom TransformPipeline?

Install the `headroom-ai[ml]` extra and pass `kompress_config=KompressConfig()` to `HeadroomClient`, as detailed in [`wiki/configuration.md`](https://github.com/chopratejas/headroom/blob/main/wiki/configuration.md). The pipeline wiring in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py) will then include Kompress inside `eager_load_compressors`, running it automatically after SmartCrusher.