# How CodeCompressor Handles AST-Aware Compression in Headroom

> Discover how CodeCompressor uses AST-aware compression in Headroom. It parses code into abstract syntax trees, scores symbols, and prunes definitions without losing validity.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-10

---

**AST-aware compression in the Headroom codebase parses source files into tree-sitter abstract syntax trees to identify structural elements, score symbol importance, and prune low-value definitions while preserving syntactic validity.**

The `CodeCompressor` class in the `chopratejas/headroom` repository implements a language-aware compression strategy that goes beyond plain text replacement. By parsing with tree-sitter, it keeps Python, Rust, JavaScript, and other supported languages syntactically valid while reducing token count.

## What Is AST-Aware Compression?

AST-aware compression parses source code into an **abstract syntax tree** (AST) before applying reduction strategies. Unlike regex-based minification, this approach works with the syntactic structure of the code: it can tell import statements, class definitions, function bodies, and comments apart. The compressor preserves the minimum structure required for the code to remain parseable, even when it removes implementation details.
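To make the idea of "distinguishing" structural elements concrete, here is a minimal sketch that groups the top-level nodes of a module by kind. It uses Python's standard-library `ast` module as a stand-in for tree-sitter, and the `outline` helper is a hypothetical name for illustration, not part of Headroom. One difference worth noting: the standard-library parser discards comments, whereas tree-sitter grammars generally keep them as nodes.

```python
import ast

SOURCE = '''
import math
from os import path

class Circle:
    def area(self, r):
        return math.pi * r * r

def main():
    print(Circle().area(2))
'''


def outline(code: str) -> dict[str, list[str]]:
    """Group top-level statements by structural kind (illustrative only)."""
    groups = {"imports": [], "classes": [], "functions": [], "other": []}
    for node in ast.parse(code).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            groups["imports"].append(ast.unparse(node))
        elif isinstance(node, ast.ClassDef):
            groups["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            groups["functions"].append(node.name)
        else:
            groups["other"].append(type(node).__name__)
    return groups


print(outline(SOURCE))
```

A regex would have to guess where each definition ends; the parser reports exact node boundaries, which is what makes targeted pruning safe.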

## How CodeCompressor Implements AST-Aware Compression

### Tree-Sitter Integration and Language Detection

Before parsing, the compressor verifies that tree-sitter is available. In [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py), the [`_check_tree_sitter_available()`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L63) function ensures the dependency is installed, while [`is_tree_sitter_available()`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L139) provides a cached boolean check.
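A cached availability check of this kind can be sketched with the standard library alone. The helper names below are hypothetical and the approach (probing with `importlib.util.find_spec`) is an assumption about how such a check might be written, not a copy of Headroom's code.

```python
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=None)
def module_available(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it.

    The result is cached, so repeated calls do not touch the import system.
    """
    return importlib.util.find_spec(name) is not None


def require_module(name: str) -> None:
    """Raise a clear error when an optional dependency is missing."""
    if not module_available(name):
        raise ImportError(f"{name} is required for AST-aware compression")


# `tree_sitter` is the import name of the tree-sitter Python bindings.
print(module_available("tree_sitter"))
```

Separating the boolean probe from the raising variant lets callers choose between a hard failure and a graceful fallback.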

Language detection occurs via [`detect_language(code: str) -> tuple[CodeLanguage, float]`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L522), which uses regex hints (such as `def` for Python or `#include` for C) to determine the source language and return a confidence score.
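The following sketch shows one way regex hints can produce a language guess plus a confidence value. The pattern table, the `guess_language` name, and the confidence formula (winning hits divided by all hits) are assumptions made for illustration; the real `detect_language` may weigh its hints differently and returns a `CodeLanguage` value rather than a string.

```python
import re

# Hypothetical hint table: a few regexes per language.
HINTS = {
    "python": [r"^\s*def \w+\(", r"^\s*import \w+", r"^\s*from \w+ import "],
    "c": [r"^\s*#include\s*[<\"]", r"\bint main\s*\("],
    "rust": [r"\bfn \w+\(", r"^\s*use \w+::", r"\bimpl\b"],
}


def guess_language(code: str) -> tuple[str, float]:
    """Return (language, confidence) based on how many hints match."""
    hits = {
        lang: sum(bool(re.search(p, code, re.MULTILINE)) for p in patterns)
        for lang, patterns in HINTS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "unknown", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


print(guess_language("#include <stdio.h>\nint main() { return 0; }"))
```

Returning a confidence alongside the guess lets a caller skip AST parsing when the evidence is weak.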

### Language-Specific AST Configuration

The compressor defines per-language node mappings in a configuration block starting at [line 218](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L218). These configurations specify which AST node types correspond to functions, classes, and decorators.

For Python, the configuration at [lines 232-239](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L232) defines:

```python
function_nodes=frozenset({"function_definition"})
class_nodes=frozenset({"class_definition"})
decorator_node="decorated_definition"
```

The [`_get_parser(language: str)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L76) function returns a cached tree-sitter parser for the requested language, so each language is parsed with its own grammar.
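A per-language mapping like this can be modeled as a frozen dataclass keyed by language name. In the sketch below only the three Python values come from the configuration quoted above; the `LangConfig` class, the `CONFIGS` table, and the `classify` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangConfig:
    """Maps tree-sitter node type names to structural roles (illustrative)."""
    function_nodes: frozenset[str]
    class_nodes: frozenset[str]
    decorator_node: Optional[str] = None


CONFIGS = {
    "python": LangConfig(
        function_nodes=frozenset({"function_definition"}),
        class_nodes=frozenset({"class_definition"}),
        decorator_node="decorated_definition",
    ),
}


def classify(language: str, node_type: str) -> str:
    """Translate a grammar-specific node type into a generic role."""
    cfg = CONFIGS[language]
    if node_type in cfg.function_nodes:
        return "function"
    if node_type in cfg.class_nodes:
        return "class"
    if node_type == cfg.decorator_node:
        return "decorated"
    return "other"


print(classify("python", "class_definition"))
```

Keeping grammar-specific names in data rather than in code means the tree walker itself can stay language-agnostic.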

### Structure Extraction and Symbol Analysis

Once parsed, the AST is processed by [`_extract_structure(self, root: Any, code: str) -> CodeStructure`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1656). This method populates a `CodeStructure` dataclass defined at [lines 327-337](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L327), which captures:

- Import statements
- Top-level code blocks
- Class and function definitions
- Type declarations
- Comment blocks

The compressor then analyzes symbol importance via [`_analyze_symbol_importance(self, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L708), counting references and call-sites to generate a `symbol_scores` mapping (referenced at [line 428](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L428)).
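Reference counting of this sort can be illustrated with the standard-library `ast` module. This is a simplified stand-in, not Headroom's scoring logic: the `score_symbols` name is hypothetical, it handles Python only, and it counts bare-name references, so attribute access such as `self.method()` is not counted.

```python
import ast
from collections import Counter


def score_symbols(code: str) -> dict[str, int]:
    """Count how often each defined function or class name is referenced."""
    tree = ast.parse(code)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    refs = Counter()
    for node in ast.walk(tree):
        # A Name in Load context is a read of that identifier (call or reference).
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id in defined:
                refs[node.id] += 1
    return {name: refs[name] for name in sorted(defined)}


SAMPLE = '''
def helper(x):
    return x + 1

def unused():
    return 0

def main():
    return helper(1) + helper(2)
'''

print(score_symbols(SAMPLE))
```

Definitions that score zero, like `unused` here, are the natural first candidates for pruning.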

### Budget Allocation and Compression Strategy

With symbols scored, the compressor allocates token budgets using [`_allocate_body_budget(self, analysis: _SymbolAnalysis, code: str) -> dict[str, int]`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L869). This method distributes a global compression budget across high-value symbols while marking low-usage definitions for removal.
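One simple way to turn importance scores into per-symbol budgets is proportional allocation. The sketch below assumes that policy for illustration only; the `allocate_budget` name and the even split for all-zero scores are hypothetical choices, and the actual method may use a different formula.

```python
def allocate_budget(scores: dict[str, int], total_budget: int) -> dict[str, int]:
    """Split a token budget across symbols in proportion to their scores.

    A symbol that receives 0 is treated as marked for removal.
    """
    if not scores:
        return {}
    total_score = sum(scores.values())
    if total_score == 0:
        # No usage information at all: fall back to an even split.
        share = total_budget // len(scores)
        return {name: share for name in scores}
    return {
        name: total_budget * score // total_score
        for name, score in scores.items()
    }


print(allocate_budget({"helper": 3, "main": 1, "unused": 0}, 100))
```

With scores of 3, 1, and 0 and a budget of 100 tokens, the three symbols receive 75, 25, and 0 tokens respectively.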

### AST-Guided Code Pruning

The actual compression happens in AST walkers, one per kind of definition:

- [`_compress_function_ast(self, node: Any, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1294) handles function bodies
- [`_compress_class_ast(self, node: Any, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1550) processes class definitions

These methods decide whether to retain the full implementation, truncate the body, or replace it with a placeholder, always preserving decorators and signatures to maintain API contracts.
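The three outcomes (keep, truncate, replace) can be sketched on a single Python function using the standard-library `ast` module in place of tree-sitter. The `prune_function` helper is hypothetical, and measuring the budget in top-level body statements is a simplifying assumption; the source describes token budgets.

```python
import ast


def prune_function(source: str, budget: int) -> str:
    """Prune one function definition to at most `budget` body statements.

    Cutting at statement boundaries keeps the result parseable; the
    decorators and signature are never touched.
    """
    tree = ast.parse(source)
    func = tree.body[0]
    placeholder = ast.Expr(value=ast.Constant(value=...))
    if budget <= 0:
        func.body = [placeholder]                         # replace with placeholder
    elif budget < len(func.body):
        func.body = func.body[:budget] + [placeholder]    # truncate
    # otherwise: budget covers the whole body, keep it as-is
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


SRC = '''
@cache
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
'''

for n in (3, 1, 0):
    print(prune_function(SRC, n), end="\n\n")
```

Because the cut always falls between whole statements, every variant still parses, which a line-based truncation could not guarantee.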

### Final Assembly and Output

The public entry point [`compress(self, code: str, language: str = "") -> DiffCompressionResult`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L922) orchestrates the pipeline. It routes input through [`_compress_with_ast`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1106) and returns a `DiffCompressionResult` containing the compressed source and metadata.
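The orchestration step can be pictured as a thin wrapper that runs an AST pass, falls back to the original text if that pass fails, and reports a ratio. Everything in this sketch is hypothetical: `SketchResult` and `run_pipeline` are illustrative names, the ratio is measured in characters rather than tokens, and the fallback policy is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchResult:
    """Illustrative result container; not Headroom's result class."""
    original: str
    compressed: str

    @property
    def compression_ratio(self) -> float:
        # Ratio of output size to input size: 0.52 means a 48% reduction.
        if not self.original:
            return 1.0
        return len(self.compressed) / len(self.original)


def run_pipeline(code: str, ast_pass: Callable[[str], str]) -> SketchResult:
    """Apply an AST-based pass, keeping the input unchanged if it fails."""
    try:
        compressed = ast_pass(code)
    except (SyntaxError, ValueError):
        compressed = code
    return SketchResult(original=code, compressed=compressed)


result = run_pipeline("abcdefghij", lambda text: text[:5])
print(result.compression_ratio)
```

Defining the ratio as compressed size over original size matches the convention used in the usage example, where a smaller number means stronger compression.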

## Practical Usage Examples

```python
# Compress a Python module
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
code = """
import numpy as np

def heavy_compute(x):
    # Complex numerical processing
    total = 0
    for i in range(1000):
        total += x * i
    return total

class DataProcessor:
    def process(self, data):
        return [heavy_compute(x) for x in data]
"""

result = compressor.compress(code, language="python")
print(result.compressed)         # Retains imports and signatures, removes loops
print(result.compression_ratio)  # e.g., 0.52 (48% reduction)
```

```python
# Compress a Rust implementation
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
rust_code = """
use std::collections::HashMap;

pub struct Data {
    pub id: i32,
    pub payload: String,
}

impl Data {
    pub fn new(id: i32, payload: String) -> Self {
        Self { id, payload }
    }

    pub fn compute(&self) -> i32 {
        (0..1000).map(|i| i * self.id).sum()
    }
}
"""

result = compressor.compress(rust_code, language="rust")
print(result.compressed)  # Preserves struct and impl blocks, strips compute body
```

## Integration with the Content Router

The [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py) file automatically routes code payloads to the `CodeAwareCompressor` when `enable_code_aware=True` (as documented in the `eager_load_compressors` section). This integration allows the system to select AST-aware compression for source files while falling back to `LogCompressor` or other strategies for non-code content.
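The routing decision can be sketched as a small function that picks a strategy name from the content and the `enable_code_aware` flag. The heuristics, the `route` function, and the returned strategy labels are all assumptions for illustration; the real router's detection logic is not shown in this article.

```python
import re

# Hypothetical heuristics: lines that start like code, or text with log levels.
CODE_HINT = re.compile(
    r"^\s*(def |class |import |from \w+ import |fn |use |#include)",
    re.MULTILINE,
)
LOG_HINT = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR)\b")


def route(content: str, enable_code_aware: bool = True) -> str:
    """Return the name of the strategy that would handle `content`."""
    if enable_code_aware and CODE_HINT.search(content):
        return "code_aware"
    if LOG_HINT.search(content):
        return "log"
    return "passthrough"


print(route("def f():\n    pass\n"))
print(route("2026-06-10 12:00:00 ERROR disk full"))
```

Checking the flag before the code heuristic means that disabling code-aware compression sends source files down the same path as any other text.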

## Summary

- **AST-aware compression** uses tree-sitter to parse source code into structured trees before pruning, ensuring syntactic validity.
- The `CodeCompressor` class in [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py) implements detection, parsing, structure extraction, symbol scoring, and budget-based pruning.
- Language-specific configurations at [line 218](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L218) define node types for functions, classes, and decorators per language.
- The `compress()` method at [line 922](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L922) returns a `DiffCompressionResult` with the compressed source and ratio.

## Frequently Asked Questions

### What is the difference between AST-aware compression and basic text compression?

AST-aware compression parses code into an abstract syntax tree so it can work with the code's syntactic structure, removing function bodies while preserving signatures and imports. Basic text compression treats code as an opaque string, so it can break syntax or drop critical structural elements.

### Which programming languages does the CodeCompressor support?

As implemented in the source, the compressor supports Python, JavaScript, TypeScript, Rust, C, C++, and Go. Each language has a configuration block (starting at [line 218](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L218)) that maps tree-sitter node types to the compressor's internal structure model.

### How does the compressor decide which functions to keep or remove?

The `_analyze_symbol_importance()` method at [line 708](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L708) calculates usage scores based on call frequency and reference count. The `_allocate_body_budget()` method at [line 869](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L869) then distributes a token budget, prioritizing high-scoring symbols and omitting or truncating low-value definitions.

### Can I use the CodeCompressor without installing tree-sitter?

Not for the AST-aware path. The compressor checks for tree-sitter via `_check_tree_sitter_available()` at [line 63](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L63); if the parser is missing, it either raises an error or falls back to an alternative compression method, so no AST-based pruning takes place.