How CodeAwareCompressor Preserves Syntax Validity When Compressing Source Code
CodeAwareCompressor guarantees syntactically valid output by operating on the Abstract Syntax Tree (AST) rather than raw text, then validating the result with a second Tree-sitter parse before returning.
Headroom's CodeAwareCompressor (located in headroom/transforms/code_compressor.py) solves the challenge of reducing token count without breaking code execution. Unlike text-based minifiers that risk creating unparseable output, this compressor treats source code as a structured tree, ensuring that every transformation maintains the grammatical correctness required by compilers and interpreters.
AST-Based Architecture for Guaranteed Validity
The compressor achieves syntax-safe reduction by working at the AST level. This approach eliminates text-based heuristics that might accidentally drop required syntax tokens.
Tree-sitter Integration and Thread Safety
The foundation of reliable parsing lies in Tree-sitter integration. The compressor creates a thread-local parser instance through _get_parser (lines 75-96), caching the parser per thread to ensure thread-safe operation without shared state mutation. This guarantees that parallel execution in thread pools never corrupts the parsing state.
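The per-thread caching pattern described above can be sketched with `threading.local`. This is a minimal illustration and not the code from `_get_parser`: `StubParser` and `get_parser` are hypothetical stand-ins, used so the example runs without Tree-sitter installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Attributes set on this object are visible only to the thread that set them.
_local = threading.local()


class StubParser:
    """Hypothetical stand-in for a Tree-sitter parser bound to one language."""

    def __init__(self, language: str) -> None:
        self.language = language


def get_parser(language: str) -> StubParser:
    """Return the calling thread's parser for `language`, creating it on first use."""
    cache = getattr(_local, "parsers", None)
    if cache is None:
        cache = _local.parsers = {}
    if language not in cache:
        cache[language] = StubParser(language)
    return cache[language]


def worker(_: int) -> bool:
    # Two lookups in the same thread should hand back the same cached object.
    return get_parser("python") is get_parser("python")


with ThreadPoolExecutor(max_workers=4) as pool:
    reused_everywhere = all(pool.map(worker, range(8)))
```

Because every thread builds and keeps its own parser, no lock is needed around parsing, at the cost of one parser instance per thread and language.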
Language Detection with Error Minimization
Before transformation, detect_language (lines 622-698) identifies the source language in two steps: lightweight regex pre-filters (_LANGUAGE_PREFILTER) narrow the set of candidate grammars, and each remaining candidate then parses the source. The function selects the language whose parse produces the fewest error nodes, as counted by _count_error_nodes, and returns it with a confidence score so that the most plausible grammar is in place before any AST manipulation begins.
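The "pre-filter, then pick the grammar with the fewest errors" idea can be sketched as follows. This is an assumption-laden illustration: Python's `ast` and `json` modules stand in for Tree-sitter grammars, and the regexes, the `ERROR_COUNTERS` table, and the confidence formula are all hypothetical, not taken from `code_compressor.py`.

```python
import ast
import json
import re
from typing import Callable, Optional

# Hypothetical pre-filters: cheap regexes that decide which grammars are worth trying.
PREFILTERS: dict[str, re.Pattern] = {
    "python": re.compile(r"^\s*(def|class|import|from)\s", re.MULTILINE),
    "json": re.compile(r"^\s*[\[{]"),
}


def _python_errors(source: str) -> int:
    """Stand-in error counter: 0 if the source parses as Python, else 1."""
    try:
        ast.parse(source)
        return 0
    except SyntaxError:
        return 1


def _json_errors(source: str) -> int:
    """Stand-in error counter: 0 if the source parses as JSON, else 1."""
    try:
        json.loads(source)
        return 0
    except json.JSONDecodeError:
        return 1


ERROR_COUNTERS: dict[str, Callable[[str], int]] = {
    "python": _python_errors,
    "json": _json_errors,
}


def detect_language(source: str) -> tuple[Optional[str], float]:
    """Return (language, confidence); (None, 0.0) if no pre-filter matches."""
    candidates = [name for name, pattern in PREFILTERS.items() if pattern.search(source)]
    if not candidates:
        return None, 0.0
    # Fewest errors wins; ties are broken alphabetically by language name.
    errors, best = min((ERROR_COUNTERS[name](source), name) for name in candidates)
    confidence = 1.0 / (1.0 + errors)  # assumed formula, for illustration only
    return best, confidence
```

A real Tree-sitter parse yields a tree even for broken input, so the error count there is a graded signal rather than the 0-or-1 used in this sketch.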
The Compression Workflow
Once language detection completes, the compressor extracts and analyzes code structure to determine what can safely be removed while preserving essential interfaces.
Structure Extraction via AST Walking
The _extract_structure method (lines 610-678) walks the AST and collects critical nodes into a CodeStructure object. Because this extraction works directly on the tree representation, it accurately preserves:
- Import statements
- Type and class definitions
- Function signatures
- Docstrings (according to the configured DocstringMode)
- Comments and top-level code
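To make the tree walk concrete, here is a minimal sketch that collects the same kinds of elements using Python's standard `ast` module as a stand-in for Tree-sitter. The `CodeStructure` fields shown are hypothetical; the article does not list the real ones.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class CodeStructure:
    """Hypothetical container; the real CodeStructure fields may differ."""
    imports: list[str] = field(default_factory=list)
    classes: list[str] = field(default_factory=list)
    signatures: list[str] = field(default_factory=list)
    docstrings: dict[str, str] = field(default_factory=dict)


def extract_structure(source: str) -> CodeStructure:
    """Walk the tree once and record imports, classes, signatures, and docstrings."""
    structure = CodeStructure()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            structure.imports.append(ast.unparse(node))
        elif isinstance(node, ast.ClassDef):
            structure.classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signature = f"{keyword} {node.name}({ast.unparse(node.args)})"
            if node.returns is not None:
                signature += f" -> {ast.unparse(node.returns)}"
            structure.signatures.append(signature)
            doc = ast.get_docstring(node)
            if doc:
                structure.docstrings[node.name] = doc
    return structure


SAMPLE = '''
import os
from typing import List

def process_data(items: List[str]) -> List[str]:
    """Process a list of items."""
    return [item.strip().lower() for item in items if item]
'''

structure = extract_structure(SAMPLE)
```

Working from nodes rather than text means a signature that spans several lines, or an import hidden behind unusual whitespace, is still captured as one unit.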
Symbol Importance Analysis
A lightweight intra-file analysis performed by _analyze_symbol_importance (lines 810-928) scores every symbol—functions, classes, and methods—based on reference count, fan-out, visibility, and optional context matching. These scores, normalized to a 0-1 range, drive the budget allocation process to ensure high-importance symbols retain more content.
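A rough sketch of such a scoring pass is shown below, again using the standard `ast` module. The weights, the restriction to top-level functions, and the omission of context matching are simplifying assumptions; the real `_analyze_symbol_importance` may combine its signals differently.

```python
import ast


def called_names(node: ast.AST) -> list[str]:
    """Names of plain-function calls found anywhere under `node`."""
    return [
        call.func.id
        for call in ast.walk(node)
        if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
    ]


def analyze_symbol_importance(source: str) -> dict[str, float]:
    """Score top-level functions in [0, 1] from references, fan-out, and visibility."""
    tree = ast.parse(source)
    funcs = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    all_calls = called_names(tree)
    raw: dict[str, float] = {}
    for name, node in funcs.items():
        refs = all_calls.count(name)                       # how often it is called
        fan_out = len({c for c in called_names(node) if c in funcs and c != name})
        public = 0 if name.startswith("_") else 1          # visibility signal
        raw[name] = 2.0 * refs + 1.0 * fan_out + 1.0 * public  # assumed weights
    top = max(raw.values(), default=0.0)
    return {name: (value / top if top else 0.0) for name, value in raw.items()}


SAMPLE = '''
def helper(x):
    return x + 1

def _private(x):
    return helper(x)

def main(items):
    return [_private(i) + helper(i) for i in items]
'''

scores = analyze_symbol_importance(SAMPLE)
```

Here `helper` is called twice and is public, so it ends up with the top score of 1.0; dividing by the maximum is what keeps every score in the 0-1 range.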
Body-Budget Allocation and AST Trimming
Using the target compression rate, _allocate_body_budget (lines 970-1008) distributes line budgets across symbols. High-importance symbols receive more lines, while low-importance ones are truncated aggressively. The compressor then isolates body nodes and trims them to the allowed line count via _compress_function_ast and _compress_class_ast.
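One simple way to turn scores into line budgets is a proportional split, sketched below. The proportional rule, the `min_lines` floor, and the parameter names are assumptions made for illustration; the article does not specify the formula used by `_allocate_body_budget`.

```python
def allocate_body_budget(
    body_lines: dict[str, int],
    scores: dict[str, float],
    target_rate: float,
    min_lines: int = 1,
) -> dict[str, int]:
    """Split a total line budget across symbols in proportion to their scores.

    body_lines:  current body length of each symbol, in lines
    scores:      importance of each symbol, normalized to [0, 1]
    target_rate: fraction of all body lines to keep overall
    """
    total_budget = int(sum(body_lines.values()) * target_rate)
    score_sum = sum(scores.get(name, 0.0) for name in body_lines)
    budgets: dict[str, int] = {}
    for name, length in body_lines.items():
        share = total_budget * scores.get(name, 0.0) / score_sum if score_sum else 0.0
        # Never allot more lines than the body has, and never fewer than the floor.
        budgets[name] = max(min_lines, min(length, int(share)))
    return budgets


budgets = allocate_body_budget(
    body_lines={"helper": 10, "_private": 10, "main": 20},
    scores={"helper": 1.0, "_private": 0.5, "main": 0.5},
    target_rate=0.5,
)
```

With these inputs the 20-line budget is divided 10 / 5 / 5, so the highest-scoring symbol keeps its whole body while the others are cut to a quarter or half of their original length.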
Guaranteed Syntax Validation
The compressor implements multiple safety mechanisms to ensure output validity.
Post-Compression Validation
After assembling the compressed source, the compressor parses the result again with Tree-sitter and executes _count_error_nodes (lines 1012-1025). If any ERROR or MISSING nodes are detected, the compressor discards the compressed output and returns the original source, ensuring nothing invalid is ever emitted.
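The validate-or-revert step can be sketched independently of any particular parser. In the example below, `Node` is a hypothetical stand-in that exposes the `type`, `is_missing`, and `children` attributes found on Tree-sitter nodes, and `parse` is passed in as a callable so the sketch runs without Tree-sitter installed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Hypothetical stand-in mirroring the attributes read from a syntax-tree node."""
    type: str
    is_missing: bool = False
    children: list["Node"] = field(default_factory=list)


def count_error_nodes(root: Node) -> int:
    """Count ERROR and MISSING nodes with an explicit stack (no recursion limit)."""
    count = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == "ERROR" or node.is_missing:
            count += 1
        stack.extend(node.children)
    return count


def validate_or_revert(
    original: str,
    compressed: str,
    parse: Callable[[str], Node],
) -> tuple[str, bool]:
    """Return (text, used_compressed): the compressed text only if it parses cleanly."""
    if count_error_nodes(parse(compressed)) > 0:
        return original, False
    return compressed, True


def fake_parse(source: str) -> Node:
    """Toy parser for the demo: flags the text 'def f(:' as a syntax error."""
    children = [Node("ERROR")] if "def f(:" in source else []
    return Node("module", children=children)


ok = validate_or_revert("def f():\n    return 1\n", "def f(): ...\n", fake_parse)
bad = validate_or_revert("def f():\n    return 1\n", "def f(:\n", fake_parse)
```

The second call falls back to the original source because the toy parser reports an error node, which is the behavior the section describes for invalid compressed output.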
Graceful Fallback Mechanisms
If Tree-sitter is unavailable or the language is unknown, the compressor optionally falls back to the generic Kompress compressor via _fallback_compress rather than risking syntax errors. This defensive design prevents malformed output when the AST-based approach cannot safely execute.
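The decision described above amounts to a small dispatch, sketched here under stated assumptions: `choose_strategy`, its parameters, and the `"passthrough"` outcome when the fallback is disabled are hypothetical and not taken from the Headroom source. Only `importlib.util.find_spec` is a real standard-library call.

```python
import importlib.util
from typing import Optional


def choose_strategy(
    language: Optional[str],
    tree_sitter_available: Optional[bool] = None,
    fallback_enabled: bool = True,
) -> str:
    """Pick "ast", "fallback", or "passthrough" for one piece of source code."""
    if tree_sitter_available is None:
        # Probe for the package without importing it.
        tree_sitter_available = importlib.util.find_spec("tree_sitter") is not None
    if tree_sitter_available and language is not None:
        return "ast"          # safe to run the AST-based pipeline
    if fallback_enabled:
        return "fallback"     # hand off to the generic compressor
    return "passthrough"      # assumed behavior: leave the source untouched
```

Keeping the check in one place means the AST path is only entered when both the parser and a known grammar are available.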
Implementation and Usage
The following example demonstrates basic usage of the CodeAwareCompressor:
from headroom.transforms import CodeAwareCompressor

compressor = CodeAwareCompressor()
result = compressor.compress('''
import os
from typing import List

def process_data(items: List[str]) -> List[str]:
    """Process a list of items."""
    results = []
    for item in items:
        if not item:
            continue
        processed = item.strip().lower()
        results.append(processed)
    return results
''')

print(result.compressed)    # Valid Python code, bodies compressed
print(result.syntax_valid)  # → True
Inspecting Compression Metadata
The result object provides detailed metadata about the compression process:
print(result.compression_ratio) # e.g. 0.42
print(result.preserved_imports) # 2
print(result.preserved_signatures) # 1
print(result.symbol_scores) # {'process_data': 0.87}
Activating via Command Line
Enable the compressor in the Headroom proxy pipeline using:
headroom proxy --code-aware
This flag activates the transformer in the content router (headroom/transforms/content_router.py), routing source code through the AST-based compression pipeline.
Summary
- AST-Based Processing: CodeAwareCompressor operates on the Abstract Syntax Tree in headroom/transforms/code_compressor.py rather than raw text, preventing syntax-breaking text manipulation.
- Dual-Parse Validation: The compressor validates output via a second Tree-sitter parse (_count_error_nodes) and discards invalid results before return.
- Structure Preservation: The _extract_structure method ensures imports, signatures, and type definitions remain intact while only function bodies are compressed.
- Intelligent Budgeting: Symbol importance scoring (_analyze_symbol_importance) and budget allocation (_allocate_body_budget) prioritize critical code sections.
- Thread-Safe Operation: Thread-local parser caching via _get_parser enables safe parallel execution without shared state mutation.
- Defensive Fallback: Unknown languages or missing dependencies trigger _fallback_compress to avoid syntax errors.
Frequently Asked Questions
How does CodeAwareCompressor handle multiple programming languages?
The compressor uses data-driven language configuration via _LANG_CONFIGS (lines 200-229), where each LangConfig defines AST node types for imports, functions, and classes. Language detection runs candidate parsers through detect_language and selects the grammar producing the fewest error nodes, ensuring correct parsing regardless of whether the source is Python, JavaScript, or another supported language.
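A data-driven table of this kind might look like the sketch below. The field names and the `classify` helper are hypothetical; the node type strings are the ones used by the Tree-sitter Python and JavaScript grammars.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangConfig:
    """Hypothetical shape: which node types count as imports, functions, classes."""
    import_types: frozenset[str]
    function_types: frozenset[str]
    class_types: frozenset[str]


LANG_CONFIGS: dict[str, LangConfig] = {
    "python": LangConfig(
        import_types=frozenset({"import_statement", "import_from_statement"}),
        function_types=frozenset({"function_definition"}),
        class_types=frozenset({"class_definition"}),
    ),
    "javascript": LangConfig(
        import_types=frozenset({"import_statement"}),
        function_types=frozenset({"function_declaration", "method_definition"}),
        class_types=frozenset({"class_declaration"}),
    ),
}


def classify(language: str, node_type: str) -> Optional[str]:
    """Map a node type to "import", "function", or "class" for the given language."""
    config = LANG_CONFIGS[language]
    if node_type in config.import_types:
        return "import"
    if node_type in config.function_types:
        return "function"
    if node_type in config.class_types:
        return "class"
    return None
```

With this layout, supporting another language means adding one table entry rather than writing new traversal code.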
What happens if the compressed code contains syntax errors?
The compressor runs post-compression validation using _count_error_nodes (lines 1012-1025) to check for ERROR or MISSING nodes in the Tree-sitter AST. If any errors are detected, the compressor discards the compressed result and returns the original source code unchanged, guaranteeing that result.syntax_valid is always True for any returned compressed output.
Can CodeAwareCompressor run safely in multi-threaded environments?
Yes. The implementation uses thread-local storage through _get_parser (lines 75-96) to cache Tree-sitter parsers per thread. This design avoids shared state mutation, making the compressor safe for parallel execution in thread pools while maintaining high performance through parser reuse within each thread.
Which parts of the code does the compressor preserve versus remove?
The compressor preserves all structural elements including import statements, function signatures, class definitions, decorators, and return type annotations. It removes or truncates only function and method bodies based on the allocated line budget from _allocate_body_budget (lines 970-1008), ensuring the public API and type signatures remain intact and callable.