# How SearchCompressor Optimizes Grep and Ripgrep Results in Headroom

> Learn how SearchCompressor optimizes grep and ripgrep results using a four-stage Rust pipeline to preserve relevant matches while shrinking output.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-07

---

**SearchCompressor shrinks raw grep and ripgrep output through a four-stage Rust pipeline—parse, score, select, and format—while preserving the most contextually relevant matches.**

The `SearchCompressor` in `chopratejas/headroom` is a dedicated transformer that turns large volumes of search-tool output into concise, LLM-friendly snippets. Whether you are processing `grep` logs or `ripgrep` codebase results, this component keeps only the most relevant lines. It combines a high-performance Rust core with a thin Python shim to deliver both speed and a stable public API.

## The Four-Stage Optimization Pipeline

The `SearchCompressor` processes raw search results through four tightly coupled stages. The heavy lifting happens in Rust for speed, while the Python shim exposes the logic to the rest of the codebase.

### Parse Search Results with Path-Aware Line Detection

In [`crates/headroom-core/src/transforms/search_compressor.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/transforms/search_compressor.rs), the Rust parser recognizes the line-number marker (`:<num>:`) and extracts the **file path**, **line number**, and **matched line** for every result. The Python shim forwards the text to the Rust helper `parse_search_lines` via `SearchCompressor._parse_search_results` and builds `SearchMatch` and `FileMatches` dataclasses.

This parser correctly handles **Windows drive letters** (`C:\…`) and filenames that contain dashes—cases that the previous pure-Python version missed.
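To make the path-aware idea concrete, here is a minimal Python sketch of splitting a result row at the first `:<num>:` marker. It is an illustration only, not the Rust parser: `LINE_MARKER`, `ParsedLine`, and `parse_line` are hypothetical names, and the real implementation may handle more formats than this one does.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical stand-in for the `:<num>:` line-number marker described above.
LINE_MARKER = re.compile(r":(\d+):")


class ParsedLine(NamedTuple):
    path: str
    line_number: int
    content: str


def parse_line(line: str) -> Optional[ParsedLine]:
    """Split one `path:line:content` row at the first `:<num>:` marker."""
    marker = LINE_MARKER.search(line)
    if marker is None or marker.start() == 0:
        return None  # separator, blank line, or a row without a path
    return ParsedLine(
        path=line[: marker.start()],
        line_number=int(marker.group(1)),
        content=line[marker.end():],
    )


rows = [
    r"C:\repo\src\main.py:10:def login(user):",  # drive-letter colon is not a marker
    "src/my-utils.py:12:# helper",               # dashes stay part of the path
    "--",                                        # group separator, skipped
]
parsed = [parse_line(row) for row in rows]
```

Anchoring on the numeric marker rather than on the first colon is what keeps `C:` attached to the path in this sketch.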

### Score Matches Against User Context

Once parsed, each `SearchMatch` is scored against the user-provided context. The `SearchCompressor._score_matches` method scores lines based on keyword overlap, a hard-coded list of error-priority regexes (`PRIORITY_PATTERNS_SEARCH`), and any extra `context_keywords` supplied in the configuration. Matches that hit known error patterns receive an extra boost of `+0.5`, decreasing by `0.1` per pattern, which helps critical error lines survive downsizing.
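The scoring rule can be sketched as follows. This is one plausible reading of the description above, not the project's code: the pattern list is invented for illustration (the real `PRIORITY_PATTERNS_SEARCH` may differ), and the sketch assumes the bonus is `0.5` for the first pattern in the list and `0.1` lower for each later one, applied once per line.

```python
import re

# Hypothetical priority list; earlier entries earn a larger bonus.
PRIORITY_PATTERNS = [
    re.compile(r"\berror\b", re.IGNORECASE),
    re.compile(r"\bexception\b", re.IGNORECASE),
    re.compile(r"\bfail(ed|ure)?\b", re.IGNORECASE),
]


def score_line(content, context, context_keywords=()):
    """Keyword-overlap score plus a decaying bonus for error-like lines."""
    words = set(re.findall(r"\w+", content.lower()))
    wanted = set(re.findall(r"\w+", context.lower()))
    wanted |= {keyword.lower() for keyword in context_keywords}

    # Fraction of the wanted terms that appear in the line.
    score = len(words & wanted) / len(wanted) if wanted else 0.0

    # Assumed decay: +0.5 for the first pattern, 0.1 less per later pattern.
    for rank, pattern in enumerate(PRIORITY_PATTERNS):
        if pattern.search(content):
            score += max(0.0, 0.5 - 0.1 * rank)
            break
    return score


plain = score_line("# helper", "auth error")
boosted = score_line("connection error: timeout", "auth error")
second_rank = score_line("unhandled exception in worker", "auth")
```

Under these assumptions an error line with partial keyword overlap outranks a line that matches neither the context nor any priority pattern.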

### Select Matches Using Adaptive Sizing

The `SearchCompressor._select_matches` method runs an adaptive-sizing algorithm through `compute_optimal_k`, imported from `headroom.transforms.adaptive_sizer`, to decide the global match budget. It always preserves the **first and last match of each file** when `always_keep_first` and `always_keep_last` are enabled, then picks the highest-scoring lines up to the per-file and global limits (`max_matches_per_file`, `max_total_matches`).

This yields a compact yet representative subset of the original result set without hard-coding a static count.
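A simplified sketch of the selection step is shown below. It does not call `compute_optimal_k`, whose signature is not covered here; a plain `max_total` integer stands in for the budget it would produce. The function name, the tuple layout, and the tie-breaking are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, float, str]  # (line_number, score, content)


def select_matches(
    files: Dict[str, List[Match]],
    max_per_file: int = 3,
    max_total: int = 5,  # stand-in for the adaptive global budget
    keep_first: bool = True,
    keep_last: bool = True,
) -> Dict[str, List[Match]]:
    """Keep first/last matches per file, then fill by descending score."""
    selected: Dict[str, List[Match]] = {}
    remaining = max_total
    for path, matches in files.items():
        if remaining <= 0:
            break
        ordered = sorted(matches)  # tuples sort by line number first
        limit = min(max_per_file, remaining, len(ordered))
        keep: List[Match] = []
        if keep_first and ordered and len(keep) < limit:
            keep.append(ordered[0])
        if keep_last and len(ordered) > 1 and len(keep) < limit:
            keep.append(ordered[-1])
        # Fill the remaining slots with the highest-scoring leftovers.
        rest = sorted(
            (m for m in ordered if m not in keep),
            key=lambda m: m[1],
            reverse=True,
        )
        keep.extend(rest[: limit - len(keep)])
        selected[path] = sorted(keep)  # restore source order for output
        remaining -= len(keep)
    return selected


files = {
    "a.py": [(1, 0.1, "x"), (2, 0.9, "y"), (3, 0.2, "z"), (4, 0.0, "w"), (5, 0.3, "v")],
    "b.py": [(7, 0.5, "q")],
}
picked = select_matches(files, max_per_file=3, max_total=4)
```

In this run `a.py` keeps lines 1 and 5 as anchors plus line 2 as its best-scoring remainder, which leaves one slot of the global budget for `b.py`.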

### Format Output and Persist to Cache

The chosen matches are reassembled into the original `file:line:content` format by `SearchCompressor._format_output`. A short summary (`[… and N more matches in file]`) is appended for each truncated file. If Compress-Cache-Retrieve (CCR) is enabled, the Rust core returns a `cache_key`, and the shim stores it in the Python `CompressionStore` via `_persist_to_python_ccr` so the compressed blob can be retrieved later.
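The formatting step can be pictured with a small sketch like the one below. The function name, its arguments, and the exact summary wording are assumptions for illustration; only the `file:line:content` shape and the idea of a per-file "more matches" note come from the description above.

```python
from typing import Dict, List, Tuple


def format_output(
    selected: Dict[str, List[Tuple[int, str]]],
    totals: Dict[str, int],
) -> Tuple[str, Dict[str, str]]:
    """Rebuild `file:line:content` rows and note how many matches were dropped."""
    lines: List[str] = []
    summaries: Dict[str, str] = {}
    for path, matches in selected.items():
        for line_number, content in matches:
            lines.append(f"{path}:{line_number}:{content}")
        omitted = totals[path] - len(matches)
        if omitted > 0:
            # Summary wording is illustrative, not the library's exact text.
            note = f"[... and {omitted} more matches in {path}]"
            lines.append(note)
            summaries[path] = note
    return "\n".join(lines), summaries


selected = {
    "src/main.py": [(10, "def login(user):")],
    "src/utils.py": [(12, "# helper")],
}
totals = {"src/main.py": 3, "src/utils.py": 1}
text, summaries = format_output(selected, totals)
print(text)
```

Keeping the original row format means downstream tools that already understand grep-style output can read the compressed text unchanged.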

## Key Performance Optimizations

According to the `chopratejas/headroom` source code, the `SearchCompressor` achieves its speed and accuracy through several targeted techniques:

- **Path-aware Rust parsing** – The Rust parser handles Windows drive letters and dash-containing filenames, eliminating dropped matches that plagued the earlier pure-Python parser.
- **Error-priority boosting** – Matches aligned with known error patterns in `PRIORITY_PATTERNS_SEARCH` receive a decaying score bonus, ensuring high-signal lines survive filtering.
- **Adaptive total-match budgeting** – `compute_optimal_k` dynamically sets the global match budget based on a configurable bias parameter, respecting token limits without resorting to fixed cutoffs.
- **CCR integration** – The compressor writes a hash-based cache key to the `CompressionStore`, deduplicating identical search results across requests and reducing downstream API token usage.
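The deduplication effect of a hash-based cache key can be illustrated with a toy store. `InMemoryStore` is a hypothetical stand-in and not headroom's `CompressionStore`; the choice of SHA-256 and a 16-character key is also an assumption.

```python
import hashlib


class InMemoryStore:
    """Toy content-addressed store, used only to illustrate deduplication."""

    def __init__(self):
        self._blobs = {}

    def put(self, original: str, compressed: str) -> str:
        # Identical input text always hashes to the same key.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._blobs.setdefault(key, (original, compressed))
        return key

    def get(self, key: str):
        return self._blobs.get(key)

    def __len__(self):
        return len(self._blobs)


store = InMemoryStore()
raw = "src/main.py:10:def login(user):\nsrc/main.py:45:raise AuthError('x')\n"
first_key = store.put(raw, "src/main.py:10:def login(user):")
second_key = store.put(raw, "src/main.py:10:def login(user):")
```

Because the key is derived from the input text, storing the same search output twice leaves a single entry in the store.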

## Practical Code Examples

The following examples demonstrate how to use `SearchCompressor` to process raw `grep` or `ripgrep` output.

Compress ripgrep output with context keywords:

```python
from headroom.transforms.search_compressor import SearchCompressor, SearchCompressorConfig

raw_results = """
src/main.py:10:def login(user):  # auth entry point
src/main.py:45:raise AuthError("invalid password")
src/utils.py:12:# helper
"""

compressor = SearchCompressor(
    SearchCompressorConfig(
        max_matches_per_file=3,
        always_keep_first=True,
        always_keep_last=True,
        context_keywords=["auth", "error"],
    )
)

result = compressor.compress(raw_results, context="auth error")
print("Compressed output:")
print(result.compressed)
print("Saved ~", result.tokens_saved_estimate, "tokens")
```

Use the bias parameter for more aggressive pruning:

```python
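from headroom.transforms.search_compressor import SearchCompressor

# Reuses raw_results from the previous example.
compressor = SearchCompressor()
aggressive = compressor.compress(raw_results, bias=0.5)  # lower bias → drop more

print(aggressive.compressed)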
```

Exercise the internal helpers directly for testing or debugging:

```python
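from headroom.transforms.search_compressor import SearchCompressor

# Reuses raw_results from the first example. The underscore-prefixed
# helpers are internal, so treat this as a debugging aid only.
parser = SearchCompressor()
parsed = parser._parse_search_results(raw_results)  # stage 1: parse
parser._score_matches(parsed, "authentication")  # stage 2: score
selected = parser._select_matches(parsed, bias=1.2)  # stage 3: select
compressed, summaries = parser._format_output(selected, parsed)  # stage 4: format
print(compressed)
print(summaries)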
```

## Core Source Files

- **[`headroom/transforms/search_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/search_compressor.py)** – The Python shim that exposes the public API, legacy helpers, and CCR persistence logic.
- **[`crates/headroom-core/src/transforms/search_compressor.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/transforms/search_compressor.rs)** – The Rust implementation that performs fast parsing, scoring, and adaptive selection.
- **[`tests/test_transforms_search_compressor.py`](https://github.com/chopratejas/headroom/blob/main/tests/test_transforms_search_compressor.py)** – Unit tests that validate end-to-end behavior from parsing through compression.

## Summary

- `SearchCompressor` runs a four-stage pipeline—**parse**, **score**, **select**, and **format**—to shrink grep and ripgrep results.
- The Rust core in [`crates/headroom-core/src/transforms/search_compressor.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/transforms/search_compressor.rs) handles path-aware parsing and error-priority boosting for maximum accuracy.
- An adaptive-sizing algorithm (`compute_optimal_k`) sets the global match budget from a bias parameter, and the configurable per-file and global limits cap the final selection.
- CCR integration caches compressed results via a Python `CompressionStore`, reducing redundant processing and API token consumption.

## Frequently Asked Questions

### How does SearchCompressor handle Windows file paths in grep output?

The Rust parser in [`crates/headroom-core/src/transforms/search_compressor.rs`](https://github.com/chopratejas/headroom/blob/main/crates/headroom-core/src/transforms/search_compressor.rs) detects line-number markers while preserving Windows drive letters and filenames that contain dashes. This prevents the parser from incorrectly splitting paths or dropping valid matches, which was a known limitation of the earlier pure-Python implementation.

### What makes the Rust implementation faster than a pure-Python compressor?

The heavy-lifting stages (parsing, line-importance detection, and adaptive sizing) run in compiled Rust rather than in interpreted Python loops, which avoids per-line interpreter overhead on large result sets. The Python shim only forwards text and manages configuration, keeping the public API stable without sacrificing performance.

### How does the adaptive sizing algorithm decide how many matches to keep?

The compressor calls `compute_optimal_k` from `headroom.transforms.adaptive_sizer` to determine the global match budget from a bias parameter. It then respects `max_matches_per_file` and `max_total_matches`, while optionally preserving the first and last match of each file, so the output stays within token limits without using a hard-coded static count.

### Can I customize which matches get priority during scoring?

Yes. You can supply `context_keywords` in `SearchCompressorConfig` to boost lines that overlap with your query. Additionally, the built-in `PRIORITY_PATTERNS_SEARCH` regexes automatically elevate matches that look like errors, applying a bonus that starts at `+0.5` and decreases by `0.1` per pattern to help critical lines survive downsizing.