# How the Kompress-base ML Model Achieves Text Compression While Preserving Meaning

> Discover how the Kompress-base ML model compresses text by 30-70% while preserving meaning. Learn about its dual-head neural network and CNN-based span scoring.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-06

---

**Kompress-base uses a ModernBERT-style dual-head neural network to classify each token as keep or drop, while a span-level CNN preserves phrase coherence, achieving 30-70% compression without semantic loss.**

The **Kompress-base** model powers intelligent text compression in the `chopratejas/headroom` repository by combining token-level classification with span-aware scoring. Unlike naive truncation or keyword removal, this ModernBERT-based architecture learns to identify semantically critical tokens while maintaining natural language flow. The model treats compression as a supervised token classification task trained on agentic interaction traces, and it often compresses text to 30-50% of its original length.

## Core Architecture: Dual-Head Token Classification

The model is defined in [`headroom/transforms/kompress_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/kompress_compressor.py) within the `HeadroomCompressorModel` class (lines 37-55). It employs two specialized heads that work in concert:

### Token Classification Head

A binary classifier using a linear projection (`Linear → 2`) that outputs logits for each input token. This head predicts whether individual tokens should be **kept** or **dropped** based on local context and semantic importance.
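As a minimal sketch of what a two-way head implies at inference time, the snippet below turns a pair of logits into a keep probability with a softmax and applies a threshold. The function names, the `[drop, keep]` logit order, and the 0.5 threshold are illustrative assumptions, not taken from the repository.

```python
import math

def keep_probability(logits):
    """Softmax over a (drop, keep) logit pair; returns P(keep).

    The (drop, keep) ordering is an assumption made for this sketch.
    """
    drop_logit, keep_logit = logits
    m = max(drop_logit, keep_logit)          # subtract the max for numerical stability
    e_drop = math.exp(drop_logit - m)
    e_keep = math.exp(keep_logit - m)
    return e_keep / (e_drop + e_keep)

def keep_mask(token_logits, threshold=0.5):
    """Mark a token as kept when its keep probability reaches the threshold."""
    return [keep_probability(pair) >= threshold for pair in token_logits]

# Three tokens: clearly keep, clearly drop, exactly undecided
token_logits = [(-1.2, 2.3), (0.8, -0.4), (0.0, 0.0)]
mask = keep_mask(token_logits)
print(mask)
```

Raising the threshold in a sketch like this makes the classifier more aggressive about dropping borderline tokens.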

### Span Scoring Head

A 1-D convolutional network (`Conv1d → GELU → Conv1d → Sigmoid`) that evaluates contiguous token spans. This captures longer-range dependencies, ensuring that borderline tokens within important phrases receive boosted scores. The span head prevents fragmentation of coherent expressions by identifying semantically relevant segments that should remain intact.
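To make the `Conv1d → GELU → Conv1d → Sigmoid` data flow concrete, here is a dependency-free sketch that runs the same sequence of operations over a single scalar feature per token. The real head presumably convolves over multi-channel hidden states with learned weights; the kernels, feature values, and function names below are made up for illustration.

```python
import math

def conv1d_same(xs, kernel, bias=0.0):
    """1-D cross-correlation with zero padding so output length equals input length.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(len(xs))
    ]

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_scores(token_features, k1=(0.5, 1.0, 0.5), k2=(0.25, 0.5, 0.25)):
    """Conv1d -> GELU -> Conv1d -> Sigmoid over one feature per token."""
    hidden = [gelu(h) for h in conv1d_same(token_features, k1)]
    return [sigmoid(z) for z in conv1d_same(hidden, k2)]

# Token 2 is weak on its own (0.2) but sits between two strong tokens;
# tokens 5 and 6 are weak and surrounded by weak neighbours.
features = [0.1, 2.0, 0.2, 2.2, 0.0, -1.5, -1.0]
scores = span_scores(features)
print([round(s, 3) for s in scores])
```

Because each output position mixes in its neighbours, the weak token at index 2 ends up with a higher span score than the tokens in the weak region, which is the smoothing effect described above.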

## The Compression Pipeline

The compression workflow in [`headroom/transforms/kompress_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/kompress_compressor.py) processes text through deterministic stages ranging from model initialization to final reconstruction:

### 1. Lazy Model Loading

The `_load_kompress()` function (lines 39-66) downloads model weights (`onnx/kompress-int8.onnx` or `model.safetensors`) and the `answerdotai/ModernBERT-base` tokenizer from HuggingFace only when first requested. This lazy initialization ensures Headroom operates without ML dependencies until compression is explicitly invoked.
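The lazy-loading pattern itself can be sketched without any ML dependency. The `LazyCompressor` class and `fake_loader` function below are hypothetical stand-ins; they only illustrate deferring an expensive load until the first `compress()` call and reusing it afterwards, not how `_load_kompress()` is actually written.

```python
class LazyCompressor:
    """Defers an expensive model load until compress() is first called."""

    def __init__(self, loader):
        self._loader = loader      # callable that performs the expensive load
        self._model = None         # nothing is loaded at construction time
        self.load_count = 0        # exposed only to make the behaviour observable

    def _ensure_loaded(self):
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return self._model

    def compress(self, text):
        model = self._ensure_loaded()
        return model(text)

def fake_loader():
    """Stand-in for downloading weights: returns a toy 'model' that keeps words longer than 3 characters."""
    return lambda text: " ".join(w for w in text.split() if len(w) > 3)

compressor = LazyCompressor(fake_loader)
print(compressor.load_count)                       # 0: constructing does not load
print(compressor.compress("the quick brown fox"))  # first call triggers the load
print(compressor.load_count)                       # 1: later calls reuse the model
```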

### 2. Word-Aware Tokenization

The `compress()` method tokenizes input while preserving `word_ids` mappings to align token-level decisions with original word boundaries. This ensures reconstruction maintains valid word sequences rather than arbitrary subword fragments.
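The alignment step can be illustrated with a hand-written `word_ids` list of the kind Hugging Face fast tokenizers return from `BatchEncoding.word_ids()`, where special tokens map to `None`. The rule used here (keep a word if any of its subword tokens is kept) is an assumption for the sketch; the repository may aggregate differently.

```python
def word_keep_decisions(word_ids, token_keep):
    """Collapse token-level keep flags to word-level flags.

    A word is kept if any of its subword tokens is kept (assumed rule).
    """
    decisions = {}
    for wid, keep in zip(word_ids, token_keep):
        if wid is None:            # special tokens such as [CLS] / [SEP]
            continue
        decisions[wid] = decisions.get(wid, False) or keep
    return decisions

def reconstruct(words, word_ids, token_keep):
    """Rebuild text from whole words only, in their original order."""
    decisions = word_keep_decisions(word_ids, token_keep)
    return " ".join(w for i, w in enumerate(words) if decisions.get(i, False))

words = ["Please", "summarize", "the", "tokenization", "step"]
# "summarize" splits into 2 subword tokens and "tokenization" into 3 in this toy example
word_ids = [None, 0, 1, 1, 2, 3, 3, 3, 4, None]
token_keep = [False, False, True, False, False, True, False, True, True, False]

print(reconstruct(words, word_ids, token_keep))  # summarize tokenization step
```

Deciding at the word level means a word is never emitted as a partial subword fragment, even when the model disagrees with itself across the pieces of that word.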

### 3. Dual-Path Inference

Depending on whether `target_ratio` is set, the compressor calls one of two model methods:
- **`get_keep_mask`**: A boolean mask indicating which tokens to retain (used when `target_ratio` is `None`)
- **`get_scores`**: Per-token keep probabilities, ranked to meet the requested ratio (used when `target_ratio` is set)

The implementation abstracts both ONNX (lightweight CPU) and PyTorch (GPU-accelerated) backends through the `_OnnxModel` class (lines 99-115) and the native PyTorch `HeadroomCompressorModel`.
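A rough sketch of this dispatch follows. `StubBackend` is a hypothetical stand-in that exposes the two method names mentioned above; the real methods take tokenized tensors (`input_ids`, `attention_mask`), which the stub omits for brevity, and the top-k rounding rule is an assumption.

```python
class StubBackend:
    """Hypothetical backend with fixed scores, standing in for ONNX or PyTorch."""

    def __init__(self, scores, threshold=0.5):
        self._scores = list(scores)
        self._threshold = threshold

    def get_scores(self):
        return list(self._scores)

    def get_keep_mask(self):
        return [s >= self._threshold for s in self._scores]

def select(backend, target_ratio=None):
    """Use the native mask when no ratio is given, otherwise keep the top-k scores."""
    if target_ratio is None:
        return backend.get_keep_mask()
    scores = backend.get_scores()
    k = max(1, round(len(scores) * target_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(scores))]

backend = StubBackend([0.9, 0.2, 0.6, 0.4, 0.7])
print(select(backend))                    # threshold path
print(select(backend, target_ratio=0.4))  # ratio path keeps the 2 best of 5
```

Any object with the same two methods could be passed to `select`, which is the property a shared backend interface provides.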

### 4. Ratio-Based Selection with Span Boosting

When `target_ratio` is specified (e.g., 0.4 for 40% retention), scores are sorted per word and the top-k tokens satisfying the ratio are preserved. The span head boosts tokens within high-scoring regions, preventing the selection of isolated words that would break semantic coherence.
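The effect of span boosting on ratio-targeted selection can be sketched as follows. The additive boost, its 0.5 weight, and all scores are invented for illustration; the article does not state how the two heads are combined numerically.

```python
def select_words(word_scores, span_scores, target_ratio, boost=0.5):
    """Return indices of kept words, in original order.

    Each word's score is raised by `boost * span_score` (assumed rule),
    then the top-k words for the requested ratio are kept.
    """
    boosted = [w + boost * s for w, s in zip(word_scores, span_scores)]
    k = max(1, round(len(boosted) * target_ratio))
    ranked = sorted(range(len(boosted)), key=lambda i: boosted[i], reverse=True)
    return sorted(ranked[:k])

words = ["error", "in", "module", "auth", "please", "kindly", "check", "logs", "thanks", "again"]
word_scores = [0.9, 0.45, 0.6, 0.8, 0.55, 0.1, 0.5, 0.7, 0.1, 0.05]
# "error in module auth" forms one high-scoring span
span_scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.4, 0.4, 0.1, 0.1]

plain = select_words(word_scores, span_scores, target_ratio=0.5, boost=0.0)
boosted = select_words(word_scores, span_scores, target_ratio=0.5)

print(" ".join(words[i] for i in plain))    # error module auth please logs
print(" ".join(words[i] for i in boosted))  # error in module auth logs
```

Without the boost, the connective "in" loses to the isolated word "please"; with it, the phrase "error in module auth" survives intact.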

### 5. CCR Caching

If the compressed output is shorter than 80% of the original (a ratio below 0.8, indicating significant reduction), the result enters the **Compress-Cache-Retrieve (CCR)** store via `_store_in_ccr()` (lines 117-124). This allows retrieval of the original text without re-running neural inference.
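A minimal sketch of this gating logic is shown below, assuming a word-count ratio and an in-memory dictionary keyed by a content hash. `CcrStore`, `maybe_store`, and the key format are hypothetical; the actual `_store_in_ccr()` implementation and storage backend are not described in this article.

```python
import hashlib

class CcrStore:
    """Toy in-memory store mapping a content hash to the original and compressed text."""

    def __init__(self):
        self._records = {}

    def put(self, original, compressed):
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._records[key] = {"original": original, "compressed": compressed}
        return key

    def get(self, key):
        return self._records.get(key)

def maybe_store(store, original, compressed, min_ratio=0.8):
    """Store only when compression was worthwhile (ratio below min_ratio)."""
    ratio = len(compressed.split()) / max(1, len(original.split()))
    if ratio < min_ratio:
        return store.put(original, compressed)
    return None   # too little was removed to justify caching

store = CcrStore()
original = "the deploy failed because the auth token expired at midnight UTC"
key = maybe_store(store, original, "deploy failed auth token expired")
print(key is not None)
print(store.get(key)["original"] == original)
```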

### 6. Chunk Processing and Batching

Inputs exceeding 350 words (configurable via `chunk_words` in `KompressConfig`) are processed in chunks to respect ModernBERT's 512-token limit. The `compress_batch()` method (lines 70-90) batches multiple chunks on GPU backends, though ONNX CPU execution falls back to sequential processing via `_should_use_sequential_fallback()`.
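Word-based chunking of this kind can be sketched in a few lines. The helper names and the toy per-chunk compressor are illustrative assumptions; only the 350-word default comes from the text above.

```python
def chunk_words(text, chunk_size=350):
    """Split text into consecutive lists of at most `chunk_size` words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def compress_long(text, compress_chunk, chunk_size=350):
    """Compress each chunk independently and concatenate the kept words in order."""
    kept = []
    for chunk in chunk_words(text, chunk_size):
        kept.extend(compress_chunk(chunk))
    return " ".join(kept)

# 800 synthetic words split into chunks of 350, 350, and 100
text = " ".join(f"w{i}" for i in range(800))
print([len(c) for c in chunk_words(text)])

# Toy stand-in for the model: keep every other word of each chunk
result = compress_long(text, lambda chunk: chunk[::2])
print(len(result.split()))
```

Chunking on words rather than tokens leaves headroom for subword expansion, since a 350-word chunk usually tokenizes to more than 350 tokens.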

## Practical Code Examples

```python
from headroom.transforms.kompress_compressor import KompressCompressor, KompressConfig

# Placeholder input; substitute any long document or tool output
long_text = "..."

# Configure compression parameters
cfg = KompressConfig(chunk_words=350, score_threshold=0.5, enable_ccr=True)
compressor = KompressCompressor(cfg)

# Automatic compression: the model's own keep/drop predictions decide what stays
result = compressor.compress(long_text)
print(f"Compressed to {result.compression_ratio:.1%}: {result.compressed}")
```

```python
# Explicit ratio control: target roughly 40% retention
# (reuses `compressor` and `long_text` from the previous example)
result = compressor.compress(long_text, target_ratio=0.4)
print(f"Kept {len(result.compressed)} characters from {len(long_text)}")
```

```python
# Batch processing for multiple documents (chunks are batched on GPU backends)
# Placeholder documents; substitute real inputs
texts = ["first document ...", "second document ...", "third document ..."]
results = compressor.compress_batch(texts)
for r in results:
    print(f"Ratio: {r.compression_ratio:.2%}, Length: {len(r.compressed)}")
```

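```python
# Inspecting underlying token scores for debugging.
# `_load_kompress` is a private helper, so its exact call form may differ between versions.
model, tokenizer, backend = compressor._load_kompress()

# Pass pre-split words so token positions can be mapped back to word boundaries
encoding = tokenizer(
    ["example", "sentence"],
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
)
scores = model.get_scores(encoding["input_ids"], encoding["attention_mask"])
```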

## Summary

- **Kompress-base** employs a **ModernBERT** architecture with dual prediction heads to achieve semantic compression in `chopratejas/headroom`.
- The **token head** classifies individual tokens as keep/drop, while the **span head** uses a 1-D CNN to preserve coherent phrases.
- The pipeline supports both **ONNX** (CPU) and **PyTorch** (GPU) backends, with automatic batching for high-throughput scenarios.
- **CCR caching** stores results when compression removes more than 20% of the text (ratio < 0.8), eliminating redundant inference.
- Practical compression ratios of 30-70% are achieved while maintaining semantic fidelity through context-aware training on agentic traces.

## Frequently Asked Questions

### How does Kompress-base decide which words to keep?

The model uses a binary classification head that assigns keep/drop probabilities to each token. During inference, it either applies the model's native keep-mask via `get_keep_mask()` or selects the highest-scoring tokens up to a specified `target_ratio`. The span head additionally boosts scores for tokens within important contiguous regions, ensuring phrases remain intact rather than selecting isolated words.

### What is the difference between ONNX and PyTorch backends?

The ONNX backend (`onnx/kompress-int8.onnx`) provides optimized CPU inference with lower memory overhead, suitable for single chunks or sequential processing. The PyTorch backend supports GPU acceleration and batch processing via `compress_batch()`, making it preferable for high-throughput scenarios with multiple documents. On ONNX CPU execution, `compress_batch()` falls back to sequential processing because the provider does not parallelize the batch dimension.

### Why does the model process text in 350-word chunks?

This default chunk size ensures inputs stay within ModernBERT's 512-token limit while accounting for subword tokenization. Processing in chunks via the `compress()` method allows the model to handle arbitrarily long inputs by compressing segments independently and concatenating the retained words, with the `word_ids` mapping ensuring proper alignment between tokens and original word positions.

### How does the CCR cache improve performance?

The **Compress-Cache-Retrieve** mechanism stores compression results via `_store_in_ccr()` when the achieved ratio is favorable (less than 0.8). Subsequent requests for the same text retrieve the cached result rather than re-running neural inference, eliminating model latency for repeated content and reducing computational overhead in production environments.