How the Kompress-base ML Model Achieves Text Compression While Preserving Meaning
Kompress-base uses a ModernBERT-style dual-head neural network to classify each token as keep or drop, while a span-level CNN preserves phrase coherence, achieving 30-70% compression without semantic loss.
The Kompress-base model powers intelligent text compression in the chopratejas/headroom repository by combining token-level classification with span-aware scoring. Unlike naive truncation or keyword removal, this ModernBERT-based architecture learns to identify semantically critical tokens while maintaining natural language flow. The model achieves significant size reduction, often compressing text to 30-50% of its original length, by treating compression as a supervised token classification task trained on agentic interaction traces.
Core Architecture: Dual-Head Token Classification
The model is defined in headroom/transforms/kompress_compressor.py within the HeadroomCompressorModel class (lines 37-55). It employs two specialized heads that work in concert:
Token Classification Head
A binary classifier using a linear projection (Linear → 2) that outputs logits for each input token. This head predicts whether individual tokens should be kept or dropped based on local context and semantic importance.
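To make the two-logit output concrete, here is a minimal sketch of how a keep/drop logit pair can be turned into a keep probability and a boolean mask. The logit values, the label order (index 1 meaning "keep"), the 0.5 threshold, and the helper names are assumptions for illustration and are not taken from the repository.

```python
import math

def keep_probability(logits):
    """Softmax over a (drop, keep) logit pair; returns P(keep).

    Assumption: index 0 is the 'drop' class and index 1 is 'keep'.
    """
    drop_logit, keep_logit = logits
    # Subtract the max before exponentiating for numerical stability.
    m = max(drop_logit, keep_logit)
    e_drop = math.exp(drop_logit - m)
    e_keep = math.exp(keep_logit - m)
    return e_keep / (e_drop + e_keep)

def keep_mask(token_logits, threshold=0.5):
    """Mark a token as kept when its keep probability reaches the threshold."""
    return [keep_probability(pair) >= threshold for pair in token_logits]

# Hypothetical logits for three tokens.
example_logits = [(2.0, -1.0), (-0.5, 1.5), (0.0, 0.0)]
probs = [keep_probability(pair) for pair in example_logits]
mask = keep_mask(example_logits)
```

With equal logits the keep probability is exactly 0.5, so whether such a token survives depends entirely on how the threshold comparison is defined.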
Span Scoring Head
A 1-D convolutional network (Conv1d → GELU → Conv1d → Sigmoid) that evaluates contiguous token spans. This captures longer-range dependencies, ensuring that borderline tokens within important phrases receive boosted scores. The span head prevents fragmentation of coherent expressions by identifying semantically relevant segments that should remain intact.
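The effect of a Conv1d → GELU → Conv1d → Sigmoid stack can be sketched in plain Python. This is a deliberate simplification: it runs on a single channel of scalar token scores with hand-picked kernels, whereas a real span head would presumably operate on encoder hidden states with learned weights. All kernel values, the bias, and the function names are hypothetical.

```python
import math

def conv1d_same(xs, kernel, bias=0.0):
    """Single-channel 1-D convolution with zero padding (output length == input length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(xs)):
        acc = bias
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(xs):  # positions outside the sequence count as zero
                acc += w * xs[j]
        out.append(acc)
    return out

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_scores(token_scores, k1=(0.25, 0.5, 0.25), k2=(1.0, 2.0, 1.0), bias=-2.0):
    """Conv1d -> GELU -> Conv1d -> Sigmoid over a sequence of scalar scores."""
    hidden = [gelu(h) for h in conv1d_same(token_scores, k1)]
    return [sigmoid(z) for z in conv1d_same(hidden, k2, bias)]

# A weak token (index 1) sits between two strong ones; index 4 sits among weak ones.
token_scores = [0.9, 0.4, 0.9, 0.1, 0.1, 0.1]
spans = span_scores(token_scores)
```

In this toy setup the borderline token at index 1 receives a higher span score than the token at index 4 because its neighbours are strong, which is the intuition behind span-level boosting.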
The Compression Pipeline
The compression workflow in headroom/transforms/kompress_compressor.py processes text through deterministic stages ranging from model initialization to final reconstruction:
1. Lazy Model Loading
The _load_kompress() function (lines 39-66) downloads model weights (onnx/kompress-int8.onnx or model.safetensors) and the answerdotai/ModernBERT-base tokenizer from HuggingFace only when first requested. This lazy initialization ensures Headroom operates without ML dependencies until compression is explicitly invoked.
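The lazy-loading idea can be sketched independently of any ML library. The class below defers an expensive load until first use and caches the result; `LazyModel` and `fake_loader` are hypothetical stand-ins and do not reflect how `_load_kompress()` is actually written.

```python
import threading

class LazyModel:
    """Defers an expensive load until first use, then caches the result."""

    def __init__(self, loader):
        self._loader = loader
        self._bundle = None
        self._lock = threading.Lock()
        self.load_count = 0  # exposed only so the example can show the load happens once

    def get(self):
        if self._bundle is None:
            with self._lock:
                if self._bundle is None:  # re-check after acquiring the lock
                    self._bundle = self._loader()
                    self.load_count += 1
        return self._bundle

def fake_loader():
    """Stand-in for downloading weights and a tokenizer (hypothetical)."""
    return {"model": "weights-placeholder", "tokenizer": "tokenizer-placeholder"}

lazy = LazyModel(fake_loader)
first = lazy.get()   # triggers the load
second = lazy.get()  # served from the cache
```

Constructing `LazyModel` costs nothing; the loader only runs when `get()` is first called, which mirrors the goal of keeping heavy dependencies out of the import path.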
2. Word-Aware Tokenization
The compress() method tokenizes input while preserving word_ids mappings to align token-level decisions with original word boundaries. This ensures reconstruction maintains valid word sequences rather than arbitrary subword fragments.
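A small sketch of the alignment step: given a `word_ids` list of the kind Hugging Face fast tokenizers return (one entry per token, `None` for special tokens), token scores can be folded back to word level. Taking the maximum over a word's subword tokens is an assumption made here for illustration; the repository may aggregate differently. The sample words, ids, and scores are invented.

```python
def word_level_scores(word_ids, token_scores, num_words):
    """Collapse per-token scores to per-word scores using word_ids.

    Assumption: a word's score is the max over its subword tokens.
    """
    scores = [0.0] * num_words
    for wid, score in zip(word_ids, token_scores):
        if wid is None:  # special tokens such as [CLS]/[SEP] map to no word
            continue
        scores[wid] = max(scores[wid], score)
    return scores

words = ["refactoring", "the", "tokenizer"]
# Hypothetical mapping: "refactoring" and "tokenizer" each split into two subwords.
word_ids = [None, 0, 0, 1, 2, 2, None]
token_scores = [0.0, 0.8, 0.3, 0.1, 0.6, 0.7, 0.0]

per_word = word_level_scores(word_ids, token_scores, len(words))
kept = [w for w, s in zip(words, per_word) if s >= 0.5]
```

Because decisions are made per word, the output never contains half of a word even when its subword tokens disagree.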
3. Dual-Path Inference
Depending on configuration, the model generates either:
- get_keep_mask: A boolean mask indicating which tokens to retain (used when target_ratio is None)
- get_scores: Per-token probability scores used for ratio-targeted compression
The implementation abstracts both ONNX (lightweight CPU) and PyTorch (GPU-accelerated) backends through the _OnnxModel class (lines 99-115) and the native PyTorch HeadroomCompressorModel.
4. Ratio-Based Selection with Span Boosting
When target_ratio is specified (e.g., 0.4 for 40% retention), scores are sorted per word and the top-k tokens satisfying the ratio are preserved. The span head boosts tokens within high-scoring regions, preventing the selection of isolated words that would break semantic coherence.
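Ratio-targeted selection can be illustrated with a plain top-k over word scores that then restores the original word order. The sketch omits span boosting, and the rounding rule, function name, and sample scores are assumptions rather than the repository's behaviour.

```python
def select_by_ratio(words, word_scores, target_ratio):
    """Keep the highest-scoring words so that about target_ratio of them remain."""
    # Assumption: round to the nearest count and always keep at least one word.
    k = max(1, round(len(words) * target_ratio))
    ranked = sorted(range(len(words)), key=lambda i: word_scores[i], reverse=True)
    keep_indices = sorted(ranked[:k])  # restore original reading order
    return [words[i] for i in keep_indices]

words = "please note the agent should retry the failed request immediately".split()
# Hypothetical per-word keep scores.
scores = [0.10, 0.20, 0.05, 0.90, 0.30, 0.95, 0.05, 0.80, 0.85, 0.40]

compressed = " ".join(select_by_ratio(words, scores, target_ratio=0.4))
```

Sorting the chosen indices before joining is what keeps the output readable; ranking alone would emit words in score order.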
5. CCR Caching
If compression achieves a ratio less than 0.8 (indicating significant reduction), the result enters the Compress-Cache-Retrieve (CCR) store via _store_in_ccr() (lines 117-124). This allows retrieval of the original text without re-running neural inference.
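The threshold logic can be sketched with an in-memory store. Everything here is hypothetical: the class name, the word-count definition of the ratio, and the hash-based key are illustrative choices, and the real `_store_in_ccr()` may differ on all three.

```python
import hashlib

class SimpleCCRStore:
    """Keeps originals retrievable when compression removed enough text."""

    def __init__(self, ratio_threshold=0.8):
        self.ratio_threshold = ratio_threshold
        self._originals = {}

    def maybe_store(self, original, compressed):
        """Store the original if compressed/original word ratio is below the threshold."""
        ratio = len(compressed.split()) / max(1, len(original.split()))
        if ratio >= self.ratio_threshold:
            return None  # too little was removed to be worth caching
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = original
        return key

    def retrieve(self, key):
        return self._originals.get(key)

store = SimpleCCRStore()
original = "the quick brown fox jumps over the lazy dog today"
key = store.maybe_store(original, "fox jumps lazy dog")                      # ratio 0.4
skipped = store.maybe_store(original, "the quick brown fox jumps over the lazy dog")  # ratio 0.9
```

Only the first call returns a key, since the second removes a single word out of ten and stays above the 0.8 threshold.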
6. Chunk Processing and Batching
Inputs exceeding 350 words (configurable via chunk_words in KompressConfig) are processed in chunks to respect ModernBERT's 512-token limit. The compress_batch() method (lines 70-90) batches multiple chunks on GPU backends, though ONNX CPU execution falls back to sequential processing via _should_use_sequential_fallback().
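Word-based chunking itself is simple to sketch. The function below splits on whitespace and slices into fixed-size groups; the name `split_into_chunks` is hypothetical, and the default of 350 follows the `chunk_words` value described above.

```python
def split_into_chunks(text, chunk_size=350):
    """Split text into lists of at most chunk_size whitespace-delimited words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# 800 synthetic words produce two full chunks and one partial chunk.
text = " ".join(f"w{i}" for i in range(800))
chunks = split_into_chunks(text)
sizes = [len(c) for c in chunks]
```

Keeping chunks well under 512 words leaves headroom for words that expand into several subword tokens.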
Practical Code Examples
from headroom.transforms.kompress_compressor import KompressCompressor, KompressConfig
# Configure compression parameters
cfg = KompressConfig(chunk_words=350, score_threshold=0.5, enable_ccr=True)
compressor = KompressCompressor(cfg)
# Automatic compression (model decides keep/drop based on native predictions)
result = compressor.compress(long_text)
print(f"Compressed to {result.compression_ratio:.1%}: {result.compressed}")
# Explicit ratio control: target roughly 40% retention
result = compressor.compress(long_text, target_ratio=0.4)
print(f"Kept {len(result.compressed)} characters from {len(long_text)}")
# Batch processing for multiple documents with GPU acceleration
texts = [doc1, doc2, doc3]
results = compressor.compress_batch(texts)
for r in results:
    print(f"Ratio: {r.compression_ratio:.2%}, Length: {len(r.compressed)}")
# Inspecting underlying token scores for debugging
model, tokenizer, backend = compressor._load_kompress()
encoding = tokenizer(
    ["example", "sentence"],
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
)
scores = model.get_scores(encoding["input_ids"], encoding["attention_mask"])
Summary
- Kompress-base employs a ModernBERT architecture with dual prediction heads to achieve semantic compression in chopratejas/headroom.
- The token head classifies individual words as keep/drop, while the span head uses 1-D CNNs to preserve coherent phrases.
- The pipeline supports both ONNX (CPU) and PyTorch (GPU) backends, with automatic batching for high-throughput scenarios.
- CCR caching stores results when compression removes more than 20% of the text (ratio < 0.8), eliminating redundant inference.
- Practical compression ratios of 30-70% are achieved while maintaining semantic fidelity through context-aware training on agentic traces.
Frequently Asked Questions
How does Kompress-base decide which words to keep?
The model uses a binary classification head that assigns keep/drop probabilities to each token. During inference, it either applies the model's native keep-mask via get_keep_mask() or selects the highest-scoring tokens up to a specified target_ratio. The span head additionally boosts scores for tokens within important contiguous regions, ensuring phrases remain intact rather than selecting isolated words.
What is the difference between ONNX and PyTorch backends?
The ONNX backend (onnx/kompress-int8.onnx) provides optimized CPU inference with lower memory overhead, suitable for single chunks or sequential processing. The PyTorch backend supports GPU acceleration and batch processing via compress_batch(), making it preferable for high-throughput workloads with multiple documents. On the ONNX CPU path, compress_batch() falls back to sequential processing because the provider does not parallelize the batch dimension.
Why does the model process text in 350-word chunks?
This default chunk size ensures inputs stay within ModernBERT's 512-token limit while accounting for subword tokenization. Processing in chunks via the compress() method allows the model to handle arbitrarily long inputs by compressing segments independently and concatenating the retained words, with the word_ids mapping ensuring proper alignment between tokens and original word positions.
How does the CCR cache improve performance?
The Compress-Cache-Retrieve mechanism stores compression results via _store_in_ccr() when the achieved ratio is favorable (less than 0.8). Subsequent requests for the same text retrieve the cached result rather than re-running neural inference, eliminating model latency for repeated content and reducing computational overhead in production environments.