# How CodeAwareCompressor Preserves Syntax Validity When Compressing Source Code

> Discover how CodeAwareCompressor ensures valid code output by using ASTs and Tree-sitter validation instead of text manipulation. Compress your source code reliably.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: how-to-guide
- Published: 2026-06-08

---

**CodeAwareCompressor guarantees syntactically valid output by operating on the Abstract Syntax Tree (AST) rather than raw text, then validating the result with a second Tree-sitter parse before returning.**

Headroom's `CodeAwareCompressor` (located in [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py)) reduces token count without breaking the code's syntax. Unlike text-based minifiers that risk creating unparseable output, this compressor treats source code as a structured tree, so every transformation keeps the output grammatically valid for compilers and interpreters.

## AST-Based Architecture for Guaranteed Validity

The compressor achieves syntax-safe reduction by working at the AST level. This approach eliminates text-based heuristics that might accidentally drop required syntax tokens.

### Tree-sitter Integration and Thread Safety

The foundation of reliable parsing lies in Tree-sitter integration. The compressor creates a thread-local parser instance through `_get_parser` (lines 75-96), caching the parser per thread to ensure thread-safe operation without shared state mutation. This guarantees that parallel execution in thread pools never corrupts the parsing state.
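The per-thread caching idea can be illustrated with a minimal sketch built on `threading.local`. This is not the code from `_get_parser`: `StubParser`, `get_parser`, and `parser_from_new_thread` are hypothetical names, and the stub stands in for a real Tree-sitter parser so the example runs without extra dependencies.

```python
import threading
from typing import Callable


class StubParser:
    """Hypothetical stand-in for a Tree-sitter parser bound to one language."""

    def __init__(self, language: str) -> None:
        self.language = language


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_parser(language: str,
               factory: Callable[[str], StubParser] = StubParser) -> StubParser:
    """Return the calling thread's parser for `language`, creating it on first use."""
    cache = getattr(_local, "parsers", None)
    if cache is None:
        cache = {}
        _local.parsers = cache
    if language not in cache:
        cache[language] = factory(language)
    return cache[language]


def parser_from_new_thread(language: str) -> StubParser:
    """Fetch the parser inside a fresh thread and hand it back for inspection."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_parser(language)))
    worker.start()
    worker.join()
    return seen[0]


main_parser = get_parser("python")
print(get_parser("python") is main_parser)              # same thread reuses its parser
print(parser_from_new_thread("python") is main_parser)  # another thread gets its own
```

Because the cache lives on a `threading.local` object, no lock is needed: a thread can only ever read or write its own dictionary.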

### Language Detection with Error Minimization

Before transformation, `detect_language` (lines 622-698) identifies the source language by running lightweight regex pre-filters (`_LANGUAGE_PREFILTER`) to shortlist candidates and then parsing the source with each candidate grammar. It selects the language producing the **fewest error nodes** via `_count_error_nodes` and returns a confidence score alongside it, so the grammar that best fits the source is the one applied before any AST manipulation begins.
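A minimal sketch of the "pre-filter, then pick the fewest errors" strategy follows. Everything in it is an assumption made for illustration: the regexes, the two toy error counters (the Python one uses the standard-library `ast` parser, the JavaScript one is a crude line-ending heuristic), and the confidence formula. The real `detect_language` parses with Tree-sitter grammars instead.

```python
import ast
import re
from typing import Callable, Dict, Tuple

# Hypothetical pre-filters: cheap regexes that shortlist candidate languages.
PREFILTERS: Dict[str, re.Pattern] = {
    "python": re.compile(r"^\s*(def|class|import|from)\s", re.MULTILINE),
    "javascript": re.compile(r"\b(function|const|let|var)\b|=>"),
}


def python_errors(source: str) -> int:
    """Stand-in error counter: 0 if Python's own parser accepts the source, else 1."""
    try:
        ast.parse(source)
        return 0
    except SyntaxError:
        return 1


def javascript_errors(source: str) -> int:
    """Crude stand-in: count non-blank lines that do not end in ';', '{' or '}'."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    return sum(1 for line in lines if not line.endswith((";", "{", "}")))


ERROR_COUNTERS: Dict[str, Callable[[str], int]] = {
    "python": python_errors,
    "javascript": javascript_errors,
}


def detect_language_sketch(source: str) -> Tuple[str, float]:
    """Return (language, confidence) for the candidate with the fewest errors."""
    candidates = [lang for lang, pat in PREFILTERS.items() if pat.search(source)]
    if not candidates:
        candidates = list(PREFILTERS)  # nothing matched: try every grammar
    errors = {lang: ERROR_COUNTERS[lang](source) for lang in candidates}
    best = min(errors, key=errors.get)
    total_lines = max(1, len(source.splitlines()))
    # Hypothetical confidence: fewer errors per line means higher confidence.
    confidence = max(0.0, 1.0 - errors[best] / total_lines)
    return best, confidence


print(detect_language_sketch("import os\nlet = 1\n"))
print(detect_language_sketch("const x = 1;\nfunction f() {\n  return x;\n}\n"))
```

The first input matches both pre-filters, so the error counts decide the outcome; the second matches only the JavaScript pre-filter and is parsed once.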

## The Compression Workflow

Once language detection completes, the compressor extracts and analyzes code structure to determine what can safely be removed while preserving essential interfaces.

### Structure Extraction via AST Walking

The `_extract_structure` method (lines 610-678) walks the AST and collects critical nodes into a `CodeStructure` object. Because this extraction works directly on the tree representation, it accurately preserves:
- Import statements
- Type and class definitions
- Function signatures
- Docstrings (according to configured `DocstringMode`)
- Comments and top-level code
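The extraction step above can be approximated with Python's standard-library `ast` module, used here as a stand-in for Tree-sitter so the sketch stays dependency-free. `CodeStructureSketch` and `extract_structure` are hypothetical names, and the field layout is an assumption rather than the actual `CodeStructure` definition.

```python
import ast
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeStructureSketch:
    """Hypothetical container for the nodes worth preserving."""
    imports: List[str] = field(default_factory=list)
    classes: List[str] = field(default_factory=list)
    signatures: List[str] = field(default_factory=list)
    docstrings: List[str] = field(default_factory=list)


def extract_structure(source: str) -> CodeStructureSketch:
    """Walk the tree once and collect imports, class names, signatures, docstrings."""
    tree = ast.parse(source)
    lines = source.splitlines()
    out = CodeStructureSketch()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            out.imports.append(ast.get_source_segment(source, node))
        elif isinstance(node, ast.ClassDef):
            out.classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Simplification: assumes the signature fits on the `def` line.
            out.signatures.append(lines[node.lineno - 1].strip())
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                out.docstrings.append(doc)
    return out


SAMPLE = '''import os
from typing import List

def process_data(items: List[str]) -> List[str]:
    """Process a list of items."""
    return [item.strip().lower() for item in items if item]
'''

structure = extract_structure(SAMPLE)
print(structure.imports)
print(structure.signatures)
print(structure.docstrings)
```

Reading node types from the tree, rather than matching text patterns, is what keeps a string that merely contains the word `import` from being mistaken for an import statement.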

### Symbol Importance Analysis

A lightweight intra-file analysis performed by `_analyze_symbol_importance` (lines 810-928) scores every symbol (functions, classes, and methods) based on reference count, fan-out, visibility, and optional context matching. These scores, normalized to a 0-1 range, drive the **budget allocation** process so that high-importance symbols retain more content.
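To make the four signals concrete, here is a rough sketch that scores symbols with the standard-library `ast` module. The weights, the function name `score_symbols`, and the normalization by the top score are all assumptions; the source only states which signals are used and that scores end up in a 0-1 range. Symbols that share a name are collapsed into one entry, which a real implementation would need to handle.

```python
import ast
from typing import Dict, Iterable

_DEF_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def score_symbols(source: str, context_terms: Iterable[str] = ()) -> Dict[str, float]:
    """Score each defined symbol in [0, 1] from references, fan-out,
    visibility and context match (hypothetical weights)."""
    tree = ast.parse(source)
    defs = {n.name: n for n in ast.walk(tree) if isinstance(n, _DEF_TYPES)}

    # Reference count: how often each symbol name is used elsewhere in the file.
    refs = {name: 0 for name in defs}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            ident = node.id
        elif isinstance(node, ast.Attribute):
            ident = node.attr
        else:
            continue
        if ident in refs:
            refs[ident] += 1

    terms = {t.lower() for t in context_terms}
    raw: Dict[str, float] = {}
    for name, node in defs.items():
        fan_out = sum(1 for c in ast.walk(node) if isinstance(c, ast.Call))
        public = 0.0 if name.startswith("_") else 1.0
        context = 1.0 if any(t in name.lower() for t in terms) else 0.0
        raw[name] = 2.0 * refs[name] + 1.0 * fan_out + public + 2.0 * context

    top = max(raw.values(), default=0.0)
    return {name: (value / top if top else 0.0) for name, value in raw.items()}


EXAMPLE = '''
def _helper(x):
    return x + 1

def run(items):
    return [_helper(i) for i in items]

def unused():
    return None
'''

print(score_symbols(EXAMPLE))
print(score_symbols(EXAMPLE, context_terms=["unused"]))
```

Passing `context_terms` shows how a query-aware signal can lift a symbol that would otherwise rank last.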

### Body-Budget Allocation and AST Trimming

Using the target compression rate, `_allocate_body_budget` (lines 970-1008) distributes line budgets across symbols. High-importance symbols receive more lines, while low-importance ones are truncated aggressively. The compressor then isolates body nodes and trims them to the allowed line count via `_compress_function_ast` and `_compress_class_ast`.
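The allocation step can be sketched as a proportional split. This is an illustration under stated assumptions, not the logic of `_allocate_body_budget`: the function name, the `keep_rate` parameter (read here as the fraction of body lines to keep), the `min_lines` floor, and the rounding-down rule are all hypothetical.

```python
from typing import Dict


def allocate_body_budget_sketch(body_lines: Dict[str, int],
                                scores: Dict[str, float],
                                keep_rate: float,
                                min_lines: int = 1) -> Dict[str, int]:
    """Split a total line budget across symbols in proportion to their scores."""
    total = sum(body_lines.values())
    budget = int(total * keep_rate)
    weight_sum = sum(scores.get(name, 0.0) for name in body_lines)

    allocation: Dict[str, int] = {}
    for name, length in body_lines.items():
        if weight_sum:
            share = scores.get(name, 0.0) / weight_sum
        else:
            share = 1.0 / len(body_lines)  # no scores at all: split evenly
        lines = int(budget * share)        # round down to stay within budget
        # Never hand out more lines than the body has, nor fewer than the floor.
        allocation[name] = min(length, max(min_lines, lines))
    return allocation


bodies = {"run": 20, "_helper": 20, "unused": 10}
importance = {"run": 1.0, "_helper": 0.5, "unused": 0.1}
print(allocate_body_budget_sketch(bodies, importance, keep_rate=0.4))
# → {'run': 12, '_helper': 6, 'unused': 1}
```

With 50 body lines and a keep rate of 0.4, the budget is 20 lines; the highest-scoring symbol keeps 12 of them while the lowest keeps a single line.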

## Guaranteed Syntax Validation

The compressor implements multiple safety mechanisms to ensure output validity.

### Post-Compression Validation

After assembling the compressed source, the compressor parses the result **again** with Tree-sitter and executes `_count_error_nodes` (lines 1012-1025). If any `ERROR` or `MISSING` nodes are detected, the compressor discards the compressed output and returns the original source, so compression itself never introduces a syntax error.
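The validate-or-revert pattern is small enough to sketch in full. The node class below is a stub so the example runs without Tree-sitter; its attribute names (`type`, `is_missing`, `children`) are chosen to mirror those of the py-tree-sitter `Node` class, and that correspondence should be treated as an assumption. `count_error_nodes` and `validated_or_original` are hypothetical names, not the functions in `code_compressor.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StubNode:
    """Minimal stand-in for a parse-tree node."""
    type: str
    is_missing: bool = False
    children: List["StubNode"] = field(default_factory=list)


def count_error_nodes(root: StubNode) -> int:
    """Count ERROR and MISSING nodes with an iterative depth-first walk."""
    count = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == "ERROR" or node.is_missing:
            count += 1
        stack.extend(node.children)
    return count


def validated_or_original(original: str, compressed: str,
                          parse: Callable[[str], StubNode]) -> str:
    """Return `compressed` only if its parse tree is error-free."""
    if count_error_nodes(parse(compressed)) == 0:
        return compressed
    return original


# Pretend parse results for two candidate outputs.
good_tree = StubNode("module", children=[StubNode("function_definition")])
bad_tree = StubNode("module", children=[
    StubNode("ERROR"),
    StubNode("identifier", is_missing=True),
])
fake_parse = {"def f(): ...": good_tree, "def f(": bad_tree}.__getitem__

print(validated_or_original("original", "def f(): ...", fake_parse))  # → def f(): ...
print(validated_or_original("original", "def f(", fake_parse))        # → original
```

An explicit stack is used instead of recursion so that deeply nested trees cannot hit Python's recursion limit.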

### Graceful Fallback Mechanisms

If Tree-sitter is unavailable or the language is unknown, the compressor optionally falls back to the generic `Kompress` compressor via `_fallback_compress` rather than risking syntax errors. This defensive design prevents malformed output when the AST-based approach cannot safely execute.
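The decision described above might look roughly like the following. The function name, the `enable_fallback` switch, and the "passthrough" outcome when the fallback is disabled are assumptions; the source says only that the fallback is optional. The availability check uses `importlib.util.find_spec`, which reports whether a module can be imported without importing it.

```python
import importlib.util
from typing import Optional


def choose_strategy(language: Optional[str],
                    enable_fallback: bool = True,
                    tree_sitter_available: Optional[bool] = None) -> str:
    """Pick a compression path: 'ast', 'fallback', or 'passthrough'."""
    if tree_sitter_available is None:
        # Probe for the tree_sitter package without importing it.
        tree_sitter_available = importlib.util.find_spec("tree_sitter") is not None
    if tree_sitter_available and language is not None:
        return "ast"          # safe to run the AST-based pipeline
    if enable_fallback:
        return "fallback"     # hand off to the generic compressor
    return "passthrough"      # assumed behaviour: leave the source untouched


print(choose_strategy("python", tree_sitter_available=True))   # → ast
print(choose_strategy(None, tree_sitter_available=True))       # → fallback
print(choose_strategy("python", enable_fallback=False,
                      tree_sitter_available=False))            # → passthrough
```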

## Implementation and Usage

The following example demonstrates basic usage of the `CodeAwareCompressor`:

```python
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
result = compressor.compress('''
import os
from typing import List

def process_data(items: List[str]) -> List[str]:
    """Process a list of items."""
    results = []
    for item in items:
        if not item:
            continue
        processed = item.strip().lower()
        results.append(processed)
    return results
''')
print(result.compressed)      # Valid Python code, bodies compressed
print(result.syntax_valid)    # → True
```

### Inspecting Compression Metadata

The result object provides detailed metadata about the compression process:

```python
print(result.compression_ratio)     # e.g. 0.42
print(result.preserved_imports)     # 2
print(result.preserved_signatures)  # 1
print(result.symbol_scores)         # {'process_data': 0.87}
```

### Activating via Command Line

Enable the compressor in the Headroom proxy pipeline using:

```bash
headroom proxy --code-aware
```

This flag activates the transformer in the content router ([`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py)), routing source code through the AST-based compression pipeline.

## Summary

- **AST-Based Processing**: `CodeAwareCompressor` operates on the Abstract Syntax Tree in [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py) rather than raw text, preventing syntax-breaking text manipulation.
- **Dual-Parse Validation**: The compressor validates output via a second Tree-sitter parse (`_count_error_nodes`) and discards invalid results before return.
- **Structure Preservation**: The `_extract_structure` method keeps imports, signatures, and type definitions intact, so only function and method bodies are compressed.
- **Intelligent Budgeting**: Symbol importance scoring (`_analyze_symbol_importance`) and budget allocation (`_allocate_body_budget`) prioritize critical code sections.
- **Thread-Safe Operation**: Thread-local parser caching via `_get_parser` enables safe parallel execution without shared state mutation.
- **Defensive Fallback**: When the language is unknown or Tree-sitter is unavailable, the compressor can fall back to `_fallback_compress` instead of risking syntax errors.

## Frequently Asked Questions

### How does CodeAwareCompressor handle multiple programming languages?

The compressor uses data-driven language configuration via `_LANG_CONFIGS` (lines 200-229), where each `LangConfig` defines AST node types for imports, functions, and classes. Language detection runs candidate parsers through `detect_language` and selects the grammar producing the fewest error nodes, ensuring correct parsing regardless of whether the source is Python, JavaScript, or another supported language.
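A data-driven configuration of this kind could be sketched as a frozen dataclass per language plus a lookup table. The class name, field names, and `classify` helper are hypothetical, and the node-type strings are the ones the Tree-sitter Python and JavaScript grammars are believed to use; verify them against the grammars before relying on them.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class LangConfigSketch:
    """Hypothetical per-language table of interesting AST node types."""
    import_types: FrozenSet[str]
    function_types: FrozenSet[str]
    class_types: FrozenSet[str]


LANG_CONFIGS_SKETCH: Dict[str, LangConfigSketch] = {
    "python": LangConfigSketch(
        import_types=frozenset({"import_statement", "import_from_statement"}),
        function_types=frozenset({"function_definition"}),
        class_types=frozenset({"class_definition"}),
    ),
    "javascript": LangConfigSketch(
        import_types=frozenset({"import_statement"}),
        function_types=frozenset({"function_declaration", "method_definition"}),
        class_types=frozenset({"class_declaration"}),
    ),
}


def classify(language: str, node_type: str) -> Optional[str]:
    """Map a node type to 'import', 'function', 'class', or None."""
    config = LANG_CONFIGS_SKETCH.get(language)
    if config is None:
        return None
    if node_type in config.import_types:
        return "import"
    if node_type in config.function_types:
        return "function"
    if node_type in config.class_types:
        return "class"
    return None


print(classify("python", "function_definition"))     # → function
print(classify("javascript", "class_declaration"))   # → class
print(classify("rust", "function_item"))             # → None
```

With this layout, supporting another language means adding one table entry; the tree-walking code does not change.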

### What happens if the compressed code contains syntax errors?

The compressor runs post-compression validation using `_count_error_nodes` (lines 1012-1025) to check for `ERROR` or `MISSING` nodes in the Tree-sitter AST. If any errors are detected, the compressor discards the compressed result and returns the original source code unchanged, guaranteeing that `result.syntax_valid` is always `True` for any returned compressed output.

### Can CodeAwareCompressor run safely in multi-threaded environments?

Yes. The implementation uses thread-local storage through `_get_parser` (lines 75-96) to cache Tree-sitter parsers per thread. This design avoids shared state mutation, making the compressor safe for parallel execution in thread pools while maintaining high performance through parser reuse within each thread.

### Which parts of the code does the compressor preserve versus remove?

The compressor preserves all structural elements including import statements, function signatures, class definitions, decorators, and return type annotations. It removes or truncates only function and method bodies based on the allocated line budget from `_allocate_body_budget` (lines 970-1008), so the public API and type signatures remain intact and readable even though the truncated bodies no longer contain the full implementation.