Kompress vs. SmartCrusher: Compression and Latency Comparison in Headroom
SmartCrusher compresses JSON arrays in ~1 ms with a 70-90% item reduction. Kompress uses ML inference to compress plain text in 50-200 ms with an 80-95% token reduction. Headroom runs them sequentially in its TransformPipeline, SmartCrusher first and Kompress second.
The chopratejas/headroom repository provides a dual-stage compression system designed to shrink LLM prompts before they reach the model. Choosing between Kompress and SmartCrusher depends on whether your bottleneck is token cost or request latency. This guide compares their internals, benchmarks, and configuration using the actual source files from the project.
SmartCrusher: Structural JSON Compression
SmartCrusher is the default, algorithmic compressor that ships enabled in every Headroom client. According to headroom/transforms/smart_crusher.py, it automatically inspects JSON arrays larger than 200 tokens and trims them while preserving schema-critical data.
The algorithm keeps the first and last N entries, retains anomalies such as errors and warnings, and preserves items that carry a high-information score. Because it never mutates the text inside individual elements, the output remains valid JSON and executes in roughly 1 ms per array.
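The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the code in headroom/transforms/smart_crusher.py: the names `crush_array`, `is_anomaly`, `ANOMALY_KEYS`, and the `keep_n` parameter are hypothetical, anomaly detection is reduced to a substring check, and the high-information scoring step is omitted.

```python
import json

# Assumed anomaly markers for illustration; not Headroom's actual list.
ANOMALY_KEYS = ("error", "warning")


def is_anomaly(item):
    """Flag an item whose serialized form mentions an error or warning."""
    text = json.dumps(item).lower()
    return any(key in text for key in ANOMALY_KEYS)


def crush_array(items, keep_n=3):
    """Keep the first and last keep_n items plus anomalies, in original order."""
    if len(items) <= 2 * keep_n:
        return list(items)
    last_start = len(items) - keep_n
    return [
        item
        for index, item in enumerate(items)
        if index < keep_n or index >= last_start or is_anomaly(item)
    ]


rows = [{"id": i, "status": "ok"} for i in range(100)]
rows[42] = {"id": 42, "status": "error", "detail": "timeout"}

crushed = crush_array(rows)
print(len(rows), "->", len(crushed))    # 100 -> 7
print([row["id"] for row in crushed])   # [0, 1, 2, 42, 97, 98, 99]
```

Because whole items are either kept or dropped and never edited, the result serializes back to valid JSON, which is the property the paragraph above relies on.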
Benchmarks recorded in wiki/benchmarks.md show that SmartCrusher achieves a 70-90% reduction in array items. This makes it ideal for compressing tool outputs, logs, and other structured payloads where deterministic, millisecond-scale latency is required.
Kompress: ML-Based Semantic Compression
Kompress (formerly LLMLINGUA) is an opt-in stage defined in headroom/transforms/kompress_compressor.py. It is only available when you install the headroom-ai[ml] extra, and it runs after SmartCrusher and other structural transforms.
Kompress loads a ModernBERT model via ONNX, together with tokenizers and tree-sitter parsers, to score each token for semantic redundancy. Tokens that fall below a confidence threshold are stripped, and the surviving tokens are reassembled into the compressed message. This per-token classification is CPU-intensive and adds significant latency.
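The threshold step can be illustrated without the model. In the sketch below the per-token keep scores are hard-coded stand-ins for what a classifier would produce; `filter_tokens`, the `threshold` default of 0.5, and the sample scores are all hypothetical and are not taken from headroom/transforms/kompress_compressor.py.

```python
def filter_tokens(tokens, keep_scores, threshold=0.5):
    """Drop tokens whose keep score is below the threshold, preserving order."""
    if len(tokens) != len(keep_scores):
        raise ValueError("tokens and keep_scores must have the same length")
    return [tok for tok, score in zip(tokens, keep_scores) if score >= threshold]


tokens = ["please", "note", "that", "the", "server", "returned", "error", "503"]
# Hypothetical scores standing in for model output (higher means keep).
scores = [0.10, 0.20, 0.05, 0.30, 0.90, 0.80, 0.95, 0.99]

kept = filter_tokens(tokens, scores)
reduction = 1 - len(kept) / len(tokens)

print(" ".join(kept))                        # server returned error 503
print(f"token reduction: {reduction:.0%}")   # token reduction: 50%
```

The filtering itself is cheap; under this reading, the latency described above comes from producing the scores, which requires a model forward pass over the input.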
As documented in wiki/benchmarks.md, typical CPU inference for Kompress ranges from 50 ms to 200 ms, with a P90 of approximately 608 ms (32 ms of request-handling overhead plus 576 ms of ONNX pipeline time). In exchange, Kompress delivers an 80-95% token-level reduction on plain text while preserving semantic meaning.
Pipeline Order and End-to-End Latency
Both compressors are wired into the TransformPipeline in headroom/transforms/content_router.py through the eager_load_compressors entry point. When both stages are active, Headroom processes them in a fixed order:
- SmartCrusher crushes JSON arrays in ~1 ms.
- Kompress compresses the remaining plain-text payload in ~50-200 ms.
For a typical LLM request containing a few kilobytes of text, the combined end-to-end latency is roughly 150 ms, almost all of it spent in Kompress. If your payload triggers the slow ONNX path, however, total latency can climb toward the 608 ms P90 figure, which consumes most of the budget in real-time applications.
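The arithmetic behind these figures is easy to check against a latency budget. The sketch below only recombines the numbers quoted in this article; the constant names and the helpers `combined_ms` and `fits` are illustrative, and the 50 ms budget is just an example value.

```python
# Figures quoted in this article (milliseconds).
SMART_CRUSHER_MS = 1.0
KOMPRESS_TYPICAL_MS = (50.0, 200.0)
REQUEST_OVERHEAD_MS = 32.0
ONNX_PIPELINE_MS = 576.0


def combined_ms(kompress_ms, smart_crusher_ms=SMART_CRUSHER_MS):
    """Sequential pipeline: stage latencies simply add up."""
    return smart_crusher_ms + kompress_ms


def fits(total_ms, budget_ms):
    """Return True when the compression time stays within the budget."""
    return total_ms <= budget_ms


typical_range = tuple(combined_ms(k) for k in KOMPRESS_TYPICAL_MS)
p90_ms = REQUEST_OVERHEAD_MS + ONNX_PIPELINE_MS

print(f"typical combined latency: {typical_range[0]:.0f}-{typical_range[1]:.0f} ms")
print(f"P90 figure: {p90_ms:.0f} ms")
print("SmartCrusher alone fits a 50 ms budget:", fits(SMART_CRUSHER_MS, 50))
print("Both stages fit a 50 ms budget:", fits(typical_range[0], 50))
```

Even the best typical case for both stages (51 ms) misses a 50 ms budget, while SmartCrusher alone leaves nearly all of it free.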
When to Use Each Compressor
Select the compressor that matches your data shape and latency constraints:
- SmartCrusher only: Best for deterministic JSON reduction where schema validity and negligible latency are mandatory.
- Kompress after SmartCrusher: Best for heavy text payloads when maximizing token savings justifies the ML inference cost.
- Disable Kompress: Best for real-time UIs that must keep total round-trip latency under ~50 ms.
- Enable both: Best for cost-optimized LLM prompting where an 80-95% token reduction outweighs the ~150 ms processing penalty.
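The guidance in this list can be condensed into a small decision helper. This is a sketch under stated assumptions: `choose_stages` and its parameters are hypothetical, the stage names are plain strings rather than Headroom identifiers, and the 150 ms default is the typical processing penalty quoted above, not a value read from the project.

```python
def choose_stages(latency_budget_ms, text_heavy, kompress_cost_ms=150.0):
    """Pick compression stages from a latency budget and the payload shape."""
    stages = ["smart_crusher"]  # cheap and deterministic, so always on
    if text_heavy and latency_budget_ms >= kompress_cost_ms:
        stages.append("kompress")  # only when the budget absorbs ML inference
    return stages


print(choose_stages(latency_budget_ms=50, text_heavy=True))    # ['smart_crusher']
print(choose_stages(latency_budget_ms=500, text_heavy=True))   # ['smart_crusher', 'kompress']
print(choose_stages(latency_budget_ms=500, text_heavy=False))  # ['smart_crusher']
```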
Enabling and Benchmarking Both Stages
You can configure both compressors through the Headroom client API. SmartCrusher requires no extra dependencies, but Kompress needs the ML extra installed.
# Example: Enable both compressors in a Headroom client
from headroom import HeadroomClient, SmartCrusherConfig, KompressConfig

client = HeadroomClient(
    # SmartCrusher runs automatically on JSON arrays
    smart_crusher_config=SmartCrusherConfig(),
    # Opt-in to Kompress (requires the `headroom-ai[ml]` extra)
    kompress_config=KompressConfig(),
)

# Send a request with a long textual payload
response = client.compress(
    messages=[
        {"role": "user", "content": "..." * 5000}  # large text
    ]
)
print(response.compressed_messages)  # compressed by both stages
To measure the latency of each stage in isolation, instantiate the transform classes directly:
# Example: Benchmark the two stages individually
import time

from headroom.transforms.smart_crusher import SmartCrusher, SmartCrusherConfig
from headroom.transforms.kompress_compressor import KompressCompressor

# SmartCrusher only: build the payload first so only compress() is timed
sc = SmartCrusher(SmartCrusherConfig())
payload = [{"id": i, "data": "x"} for i in range(1000)]
start = time.perf_counter()
sc_stats = sc.compress(json_array=payload)
sc_ms = (time.perf_counter() - start) * 1000
print(f"SmartCrusher kept {len(sc_stats)} items in {sc_ms:.1f} ms")

# Kompress only (requires the ML extra)
kc = KompressCompressor()
text = " ".join(["word"] * 2000)
start = time.perf_counter()
kc_stats = kc.compress(text)
kc_ms = (time.perf_counter() - start) * 1000
print(f"Kompress reduced to {len(kc_stats.tokens)} tokens in {kc_ms:.1f} ms")
Summary
- SmartCrusher handles JSON arrays algorithmically in headroom/transforms/smart_crusher.py, delivering 70-90% item reduction in about 1 ms.
- Kompress handles plain text via an ONNX ModernBERT model in headroom/transforms/kompress_compressor.py, delivering 80-95% token reduction at a cost of 50-200 ms typical latency.
- Headroom orchestrates both stages through eager_load_compressors in headroom/transforms/content_router.py, running SmartCrusher first and Kompress second.
- For strict latency budgets, rely on SmartCrusher alone; for maximum token savings, enable both stages and accept the ML inference overhead.
Frequently Asked Questions
How do SmartCrusher and Kompress differ architecturally?
SmartCrusher is a deterministic JSON array compressor implemented in headroom/transforms/smart_crusher.py. It trims arrays while keeping the first and last entries, anomalies, and schema-critical items. Kompress is a probabilistic, ML-based token classifier implemented in headroom/transforms/kompress_compressor.py that uses a ModernBERT ONNX model to remove semantically redundant tokens from plain text.
What latency should I expect from Kompress vs. SmartCrusher?
SmartCrusher typically adds ~1 ms per JSON array according to wiki/benchmarks.md. Kompress adds roughly 50-200 ms for standard CPU inference, with a P90 of approximately 608 ms once the 32 ms request-handling overhead and 576 ms ONNX pipeline time are included.
Should I enable Kompress if my application has a strict latency budget?
No. Kompress is CPU-intensive and dominates the latency budget, so applications requiring sub-50 ms round trips should disable the ML compressor and rely solely on SmartCrusher. The structural compressor guarantees schema validity without model inference.
How do I enable Kompress in the Headroom TransformPipeline?
Install the headroom-ai[ml] extra and pass kompress_config=KompressConfig() to HeadroomClient, as detailed in wiki/configuration.md. The pipeline wiring in headroom/transforms/content_router.py will then include Kompress inside eager_load_compressors, running it automatically after SmartCrusher.