# What Is AST-Aware Compression and How Does CodeCompressor Use It for Different Languages?

> Discover AST-aware compression and how CodeCompressor in chopratejas/headroom leverages abstract syntax trees for efficient code reduction across Python, Rust, and JavaScript.

- Repository: [Tejas Chopra/headroom](https://github.com/chopratejas/headroom)
- Tags: deep-dive
- Published: 2026-06-07

---

**AST-aware compression parses source code into a language-specific abstract syntax tree so that the `CodeCompressor` class in `chopratejas/headroom` can score symbol importance, discard low-value nodes, and reassemble valid code across Python, Rust, JavaScript, and other languages.**

In the `chopratejas/headroom` repository, **AST-aware compression** transforms source code by leveraging tree-sitter parsers to understand each language's grammar rather than treating files as plain text. This approach enables the compressor to intelligently remove or truncate low-importance functions, classes, and decorators while keeping imports, signatures, and export statements intact. The result is a syntactically valid, reduced payload that respects language-specific node types such as Python's `function_definition` and Rust's `impl_item`.

## What Is AST-Aware Compression?

**AST-aware compression** is a technique that parses source code into an abstract syntax tree (AST) before removing or truncating content. Instead of treating a file as plain text, the compressor understands structural elements such as imports, class definitions, function bodies, and decorators. It scores each symbol based on references and usage, then prunes low-value nodes while reassembling the remaining AST into a valid source file.

The `chopratejas/headroom` implementation relies on **tree-sitter** to perform this parsing. Because tree-sitter exposes language-specific node types, such as Python's `function_definition` or Rust's `function_item`, the same compression engine can operate across multiple languages without brittle text-based heuristics.
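To make the contrast with plain-text truncation concrete, here is a minimal sketch that uses Python's standard-library `ast` module as a stand-in for tree-sitter. It is not the repository's code: the helper names (`truncate_text`, `elide_bodies`, `is_valid_python`) are hypothetical, and the sketch only handles Python.

```python
import ast

SAMPLE = (
    "import os\n"
    "\n"
    "def area(w, h):\n"
    "    total = w * h\n"
    "    return total\n"
)


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def truncate_text(source: str, max_chars: int) -> str:
    """Naive approach: cut the file at a character offset."""
    return source[:max_chars]


def elide_bodies(source: str) -> str:
    """AST approach: keep imports and signatures, replace each body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(tree)


naive = truncate_text(SAMPLE, 30)      # cut lands inside the function
structured = elide_bodies(SAMPLE)      # body replaced, signature kept
print(is_valid_python(naive), is_valid_python(structured))
print(structured)
```

The character cut ends right after the `def` line, so the result no longer parses, while the tree-based edit keeps the import and the signature and still parses.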

## How CodeCompressor Implements AST-Aware Compression for Different Languages

### Checking Tree-Sitter Availability

In [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py), the compressor first verifies that the `tree_sitter` Python package is present. The helper [`_check_tree_sitter_available()`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L63) performs this check, while [`is_tree_sitter_available()`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L139) exposes a public guard for other modules. If the parser is unavailable, the system falls back to non-AST compression strategies.
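An availability guard of this kind can be sketched with the standard library alone. The function names below (`tree_sitter_installed`, `choose_strategy`) and the strategy labels are hypothetical; the repository's helpers may perform the check differently.

```python
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=1)
def tree_sitter_installed() -> bool:
    """Return True if the `tree_sitter` package can be found on this system."""
    return importlib.util.find_spec("tree_sitter") is not None


def choose_strategy() -> str:
    """Pick the AST path when the parser is present, otherwise a text fallback."""
    return "ast" if tree_sitter_installed() else "text-fallback"


print(choose_strategy())
```

Caching the result means the import machinery is consulted only once per process, which matters when the guard is called for every payload.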

### Selecting a Language-Specific Parser

Once tree-sitter is confirmed, the compressor obtains a cached parser instance via [`_get_parser(language: str)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L76). The parser for each language is built once per process and reused for subsequent requests, which minimizes overhead when compressing batches of files.
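The build-once, reuse-many pattern described here can be illustrated with `functools.lru_cache`. This is a sketch under assumptions: `FakeParser` is a hypothetical stand-in for a real tree-sitter parser, and the repository may implement its cache another way.

```python
from functools import lru_cache


class FakeParser:
    """Hypothetical stand-in for a parser bound to one grammar."""

    instances_built = 0

    def __init__(self, language: str) -> None:
        self.language = language
        FakeParser.instances_built += 1


@lru_cache(maxsize=None)
def get_parser(language: str) -> FakeParser:
    """Build the parser for a language on first use, then return the cached one."""
    return FakeParser(language)


first = get_parser("python")
second = get_parser("python")   # served from the cache
other = get_parser("rust")      # different key, so a new parser is built
print(first is second, FakeParser.instances_built)
```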

### Detecting the Programming Language

When a caller does not explicitly specify a language, the compressor infers it from source hints. The [`detect_language(code: str)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L522) function scans for regex patterns—such as `def` and `class` for Python or `#include` for C—and returns a confidence score along with the guessed language. This step ensures the correct grammar is loaded before any AST traversal begins.
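A pattern-counting detector along these lines might look like the sketch below. The pattern table, the `guess_language` name, and the confidence formula (share of all pattern hits won by the top language) are assumptions for illustration, not the rules used by `detect_language`.

```python
import re

# Hypothetical per-language hints; real detectors use far richer patterns.
PATTERNS = {
    "python": [r"^[ \t]*def \w+\(", r"^[ \t]*class \w+", r"^[ \t]*import \w+"],
    "rust": [r"^[ \t]*(?:pub )?fn \w+", r"^[ \t]*use \w+(?:::\w+)*", r"^[ \t]*impl\b"],
    "c": [r"^[ \t]*#include\s*[<\"]", r"^[ \t]*int main\("],
}


def guess_language(code: str) -> tuple[str, float]:
    """Return (language, confidence) based on how many hint patterns match."""
    hits = {
        lang: sum(len(re.findall(p, code, flags=re.MULTILINE)) for p in pats)
        for lang, pats in PATTERNS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "unknown", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


print(guess_language("import os\ndef f(x):\n    return x\n"))
print(guess_language("use std::collections::HashMap;\npub fn run() {}\n"))
```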

### Configuring Per-Language AST Node Maps

Each supported language declares which tree-sitter node types represent functions, classes, and decorators. The configuration block beginning at [line 218](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L218) defines frozen sets for every grammar. For example, the Python configuration near [lines 232-239](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L232) uses:

- `function_nodes = frozenset({"function_definition"})`
- `class_nodes = frozenset({"class_definition"})`
- `decorator_node = "decorated_definition"`

Comparable definitions exist for JavaScript, TypeScript, Rust, C, C++, and Go, allowing the compressor to treat language-specific syntax uniformly.
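One way to picture such a per-language map is a frozen dataclass keyed by language name. The Python entry mirrors the values listed above; the Rust and JavaScript entries use tree-sitter node names as an assumption and are not copied from the repository. `LangConfig` and `classify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangConfig:
    """Which node types count as functions, classes, and decorators."""

    function_nodes: frozenset
    class_nodes: frozenset
    decorator_node: Optional[str] = None


LANG_CONFIGS = {
    "python": LangConfig(
        function_nodes=frozenset({"function_definition"}),
        class_nodes=frozenset({"class_definition"}),
        decorator_node="decorated_definition",
    ),
    # Assumed entries for illustration only.
    "rust": LangConfig(
        function_nodes=frozenset({"function_item"}),
        class_nodes=frozenset({"struct_item", "impl_item"}),
    ),
    "javascript": LangConfig(
        function_nodes=frozenset({"function_declaration", "method_definition"}),
        class_nodes=frozenset({"class_declaration"}),
    ),
}


def classify(language: str, node_type: str) -> str:
    """Map a grammar-specific node type onto a generic category."""
    config = LANG_CONFIGS[language]
    if node_type in config.function_nodes:
        return "function"
    if node_type in config.class_nodes:
        return "class"
    if node_type == config.decorator_node:
        return "decorator"
    return "other"


print(classify("python", "function_definition"), classify("rust", "impl_item"))
```

Because downstream code only sees the generic category, adding a language means adding one table entry rather than new traversal logic.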

### Extracting Structure from the AST

After parsing, the compressor walks the AST to populate a typed representation. The [`_extract_structure(self, root, code)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1656) method collects imports, top-level statements, class definitions, function signatures, type definitions, comments, and other nodes into separate lists. The [`CodeStructure`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L327) dataclass, declared at lines 327-337, gives downstream logic a language-agnostic view of the file.
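The bucketing step can be sketched with a toy tree. Everything here is hypothetical: `Node` imitates only the `type` and `children` shape of a parsed node, and `Structure` carries just four of the lists that the real `CodeStructure` dataclass is described as holding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy syntax node: a type label, its source text, and child nodes."""

    type: str
    text: str
    children: list = field(default_factory=list)


@dataclass
class Structure:
    """Simplified, language-agnostic view of a file."""

    imports: list = field(default_factory=list)
    functions: list = field(default_factory=list)
    classes: list = field(default_factory=list)
    other: list = field(default_factory=list)


# Node type -> Structure attribute; unknown types fall into `other`.
BUCKETS = {
    "import_statement": "imports",
    "function_definition": "functions",
    "class_definition": "classes",
}


def extract_structure(root: Node) -> Structure:
    """Sort each top-level node into the list that matches its type."""
    structure = Structure()
    for child in root.children:
        bucket = BUCKETS.get(child.type, "other")
        getattr(structure, bucket).append(child.text)
    return structure


root = Node("module", "", [
    Node("import_statement", "import os"),
    Node("function_definition", "def f(): ..."),
    Node("expression_statement", "print(f())"),
])
structure = extract_structure(root)
print(structure)
```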

### Scoring Symbols and Allocating Budget

Before any code is removed, the compressor determines what to keep. The [`_analyze_symbol_importance(self, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L708) method counts references, import usage, and call sites to build a `symbol_scores` mapping at [line 428](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L428). High-scoring symbols survive intact, while low-scoring candidates become eligible for truncation. The budget allocator [`_allocate_body_budget(self, analysis, code)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L869) then translates a global token limit into per-definition allowances.
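The scoring and budgeting idea can be sketched as follows. This is a simplified assumption, not the repository's algorithm: scores come from counting name occurrences with a regex, and the budget is expressed in lines rather than tokens. `score_symbols` and `allocate_budget` are hypothetical names.

```python
import re
from collections import Counter


def score_symbols(code: str, symbols: list[str]) -> dict[str, int]:
    """Score each symbol by how often its name appears beyond its definition."""
    words = Counter(re.findall(r"\b\w+\b", code))
    # Subtract one occurrence to discount the definition itself.
    return {name: max(words[name] - 1, 0) for name in symbols}


def allocate_budget(scores: dict[str, int], total_lines: int, floor: int = 1) -> dict[str, int]:
    """Split a line budget across symbols in proportion to their scores."""
    weight = sum(scores.values())
    if weight == 0:
        return {name: floor for name in scores}
    return {
        name: max(floor, total_lines * score // weight)
        for name, score in scores.items()
    }


code = (
    "def helper(x):\n    return x + 1\n\n"
    "def rarely_used(x):\n    return x\n\n"
    "print(helper(1), helper(2), helper(3))\n"
)
scores = score_symbols(code, ["helper", "rarely_used"])
budget = allocate_budget(scores, total_lines=10)
print(scores, budget)
```

The `floor` keeps every definition visible with at least a stub, at the cost of letting the allowances slightly exceed the total when many symbols score zero.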

### Compressing Functions and Classes via AST Subtrees

With budgets assigned, the compressor edits individual AST nodes. [`_compress_function_ast(self, node, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1294) decides whether to preserve a function body, truncate it, or drop it entirely. Similarly, [`_compress_class_ast(self, node, ...)`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1550) handles class definitions. Both methods ensure that required syntax—indentation, decorators, and closing delimiters—remains intact even when inner statements are removed.
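The preserve, truncate, or drop decision can be illustrated with Python's standard-library `ast` module standing in for tree-sitter. The `compress_function` helper and its `keep` parameter (number of top-level body statements to retain) are hypothetical; the point is that cutting at statement boundaries keeps compound statements such as loops intact.

```python
import ast


def compress_function(source: str, name: str, keep: int) -> str:
    """Keep the first `keep` body statements of `name`; elide the rest with `...`.

    keep >= body length preserves the function, keep == 0 drops the body.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            if len(node.body) > keep:
                placeholder = ast.Expr(value=ast.Constant(value=...))
                node.body = node.body[:keep] + [placeholder]
    return ast.unparse(tree)


SOURCE = (
    "def heavy(x):\n"
    "    total = 0\n"
    "    for i in range(1000):\n"
    "        total += x * i\n"
    "    return total\n"
)

truncated = compress_function(SOURCE, "heavy", keep=2)  # keeps the whole loop
dropped = compress_function(SOURCE, "heavy", keep=0)    # signature only
print(truncated)
print(dropped)
```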

### Reassembling the Final Source String

The top-level [`compress(self, code, language="")`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L922) method orchestrates the entire pipeline. It delegates to [`_compress_with_ast`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py#L1106), which runs parsing, detection, extraction, scoring, and AST-guided compression in sequence. The return value is a `DiffCompressionResult` containing the compressed source string and metadata such as the compression ratio.

### Routing Code Payloads Automatically

The compressor integrates with the broader transform system in [`headroom/transforms/content_router.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/content_router.py). When `enable_code_aware=True`, the router loads `CodeAwareCompressor` and automatically selects it for detected code payloads. If the content is not code, the router falls back to other compressors such as `LogCompressor`.
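The routing decision can be sketched as a small dispatch function. The regex hints, the `route` name, and the returned labels are assumptions for illustration; only the `enable_code_aware` flag name comes from the text above.

```python
import re

# Hypothetical content hints; a real router would use a proper detector.
CODE_HINT = re.compile(
    r"^[ \t]*(?:def |class |import |fn |use |#include)", re.MULTILINE
)
LOG_HINT = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", re.MULTILINE)


def route(payload: str, enable_code_aware: bool = True) -> str:
    """Return the label of the compressor that should handle the payload."""
    if enable_code_aware and CODE_HINT.search(payload):
        return "code"
    if LOG_HINT.search(payload):
        return "log"
    return "passthrough"


print(route("import os\nprint(os.getcwd())\n"))
print(route("2026-06-07 12:00:00 INFO started\n"))
```

With `enable_code_aware=False`, code payloads skip the code path entirely and are handled by whichever later rule matches.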

### Documentation and Feature History

High-level usage notes for the compressor are documented in [`wiki/transforms.md`](https://github.com/chopratejas/headroom/blob/main/wiki/transforms.md) under the `CodeAwareCompressor` section. The feature's addition is recorded in [`CHANGELOG.md`](https://github.com/chopratejas/headroom/blob/main/CHANGELOG.md) around line 362.

## Practical Code Examples

### Compressing a Python Snippet

Here is a Python example that imports the compressor and processes a function with a heavy loop:

```python
# Example 1 – compress a Python snippet
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
code = """
import numpy as np
def heavy_compute(x):
    # long comment …
    total = 0
    for i in range(1000):
        total += x * i
    return total
"""
result = compressor.compress(code, language="python")
print(result.compressed)          # → keeps imports, drops the heavy body
print(result.compression_ratio)   # → e.g. 0.48 (48% size reduction)
```

### Compressing a Rust Module

The same class handles Rust code when the `language` parameter is set to `rust`:

```python
# Example 2 – compress a Rust module
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
rust_code = """
use std::collections::HashMap;
pub struct Data { pub id: i32, pub payload: String }
impl Data {
    pub fn new(id: i32, payload: String) -> Self { Self { id, payload } }
    pub fn compute(&self) -> i32 { (0..1000).map(|i| i * self.id).sum() }
}
"""
result = compressor.compress(rust_code, language="rust")
print(result.compressed)          # → strips the compute body while keeping the struct
```

## Summary

- **Tree-sitter parsing** is required for AST-aware compression; the `CodeCompressor` verifies availability with `_check_tree_sitter_available()` before proceeding.
- **Language detection** is handled by `detect_language()`, while `_get_parser()` caches the correct tree-sitter grammar for each target language.
- **Per-language configuration** maps AST node types—such as `function_definition` for Python—to generic categories like functions and classes.
- **Structure extraction** via `_extract_structure()` and the `CodeStructure` dataclass creates a typed, language-agnostic view of imports, definitions, and comments.
- **Symbol scoring** (`_analyze_symbol_importance()`) and **budget allocation** (`_allocate_body_budget()`) determine which bodies to keep, truncate, or drop.
- **AST-subtree compressors** `_compress_function_ast()` and `_compress_class_ast()` perform the actual edits while preserving syntactic validity.
- The **`compress()`** method orchestrates the pipeline through `_compress_with_ast()` and returns a `DiffCompressionResult` with the final source string.

## Frequently Asked Questions

### What is AST-aware compression and why is it better than text-based minification?

AST-aware compression parses code into an abstract syntax tree before making edits, so it understands imports, classes, and function boundaries. Unlike plain text truncation, it can remove an entire low-value function body while keeping its signature and file-level imports intact, producing output that still parses as valid code.

### Which programming languages does Headroom's CodeCompressor support?

The compressor supports Python, JavaScript, TypeScript, Rust, C, C++, and Go according to the per-language configuration blocks in [`headroom/transforms/code_compressor.py`](https://github.com/chopratejas/headroom/blob/main/headroom/transforms/code_compressor.py). Each grammar defines node sets such as `function_nodes` and `class_nodes` that teach the engine how to read that language's AST.

### How does the compressor decide which functions or classes to keep or remove?

It uses `_analyze_symbol_importance()` to score every symbol based on call sites, references, and import usage. `_allocate_body_budget()` then maps those scores and a global token limit into per-definition allowances, directing `_compress_function_ast()` and `_compress_class_ast()` to preserve high-value nodes and prune low-value ones.

### Is the compressed output guaranteed to be syntactically valid?

The pipeline is designed to produce syntactically valid output. The compressor reassembles output from edited AST nodes rather than raw text slices, so indentation, decorators, and closing braces remain balanced. Because the pipeline operates on tree-sitter nodes specific to each language, the resulting source string maintains valid syntax for the target grammar.