# How LiteParse Detects and Handles Fonts with Buggy Encoding in PDFs

> Learn how LiteParse detects and handles buggy PDF font encoding by validating naming patterns and metadata, then sanitizing text. Improve your PDF text extraction.

- Repository: [LlamaIndex/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**LiteParse detects fonts with buggy encoding by validating PostScript naming patterns and FontType metadata in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs), then sanitizes extracted text by filtering characters from control and private-use Unicode ranges.**

Extracting clean text from PDFs requires defensive handling of malformed font embedding, where encoding tables map glyphs to incorrect Unicode points or reserved ranges. The **run-llama/liteparse** crate implements specific heuristics during the character-by-character extraction phase to identify these problematic fonts and prevent encoding errors from corrupting the output.

## Font Metadata Detection in extract.rs

The primary detection logic resides in [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs), specifically within the `is_buggy_font` helper function (lines 501–515). This function applies vendor-specific pattern matching to flag fonts likely to produce encoding errors as their characters are extracted.

### TrueType Subset Identification

TrueType subset fonts frequently exhibit encoding issues when embedded by PDF producers. LiteParse identifies these by examining the font’s PostScript name for two specific patterns:

- Names beginning with the prefix `TT`
- Names containing the substring `+TT`

These patterns indicate subsetted TrueType fonts that often lack proper **ToUnicode** mapping tables, causing character substitution errors during extraction.
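The two name patterns above can be sketched as a standalone predicate. This is a minimal illustration, not the crate's code: `looks_like_truetype_subset` is a hypothetical name, and the function takes the PostScript name as a plain `&str` rather than a LiteParse font type.

```rust
/// Returns true when a PostScript name matches either TrueType subset
/// pattern described above: a leading `TT` or an embedded `+TT`.
/// The function name is hypothetical and used for illustration only.
fn looks_like_truetype_subset(postscript_name: &str) -> bool {
    postscript_name.starts_with("TT") || postscript_name.contains("+TT")
}
```

Both comparisons are case-sensitive, and the check is purely name-based, so a well-behaved font whose name happens to start with `TT` would also be flagged. That trade-off is typical of this kind of heuristic.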

### Type 1 Font Validation

For PostScript Type 1 fonts, LiteParse applies a distinct heuristic based on naming conventions. The code checks for PostScript names matching the pattern of a **six-character prefix followed immediately by an underscore** (`_`). This structure typically indicates a problematic subsetted Type 1 font with incomplete or corrupted encoding vectors.
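As a rough sketch of the name test alone, the pattern can be checked on raw bytes. `has_six_char_underscore_prefix` is a hypothetical helper name, and the Type 1 font-type check that accompanies it in LiteParse is deliberately left out here.

```rust
/// Returns true when the name has an underscore at byte index 6 and at
/// least one more byte after it, i.e. "XXXXXX_rest".
/// Hypothetical helper: it covers only the naming pattern, not the
/// font-type check that would accompany it.
fn has_six_char_underscore_prefix(postscript_name: &str) -> bool {
    // Indexing bytes instead of slicing the &str avoids a panic when a
    // non-ASCII character straddles index 6.
    let bytes = postscript_name.as_bytes();
    bytes.len() > 7 && bytes[6] == b'_'
}
```

Under these assumptions, `"ABCDEF_Times"` matches, while `"ABCDEF_"` (nothing after the underscore) and `"Times_Roman"` (underscore in the wrong position) do not.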

## Character-Level Sanitization

Beyond metadata inspection, LiteParse performs runtime validation of individual code points to catch encoding errors that survive initial font detection.

### Control Character Filtering

The library identifies suspicious code points in the **ASCII control range** (≤ `0x1F`). PDF producers sometimes pack visual characters into these reserved slots when standard encoding tables are missing or corrupted, resulting in semantically incorrect text extraction.

### Private-Use Area Detection

LiteParse treats characters in the **Unicode Private-Use Area (PUA)**, specifically the range `U+E000`–`U+F8FF`, as potentially buggy. These code points have no standardized meaning, so their presence often indicates an embedding error where the font’s internal glyph encoding leaked into the extracted text.

## Integration with the Extraction Pipeline

These checks integrate directly into the character extraction iteration. When `is_buggy_font` returns true for a given font, or when character-level validation detects problematic Unicode values, LiteParse sanitizes the output stream by excluding the affected glyphs.

The detection functions operate as follows:

```rust
// Located in crates/liteparse/src/extract.rs (lines 501-515)
fn is_buggy_font(font: &Font) -> bool {
    let name = font.postscript_name().unwrap_or("");

    // TrueType subset detection: leading "TT" or an embedded "+TT"
    if name.starts_with("TT") || name.contains("+TT") {
        return true;
    }

    // Type 1 six-char prefix + underscore pattern.
    // Compare bytes rather than slicing the &str, so a non-ASCII name
    // cannot panic on a char boundary.
    if font.font_type() == FontType::Type1
        && name.len() > 7
        && name.as_bytes()[6] == b'_'
    {
        return true;
    }

    false
}

fn is_buggy_code_point(code: u32) -> bool {
    // Control characters or Private-Use Area
    code <= 0x1F || (0xE000..=0xF8FF).contains(&code)
}
```
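To show how such checks might feed the output stream, here is a self-contained sketch of the filtering step. It is one possible reading of "excluding the affected glyphs", not LiteParse's actual pipeline: `ExtractedChar` and `sanitize` are hypothetical, and the per-font verdict is represented as a precomputed boolean rather than a call into a real font object.

```rust
/// Hypothetical stand-in for one extracted character plus the verdict
/// already computed for the font it came from.
struct ExtractedChar {
    ch: char,
    font_is_buggy: bool,
}

/// Same ranges as the code-point check shown above.
fn is_buggy_code_point(code: u32) -> bool {
    code <= 0x1F || (0xE000..=0xF8FF).contains(&code)
}

/// Drops every character that comes from a flagged font or that falls
/// in a suspicious code-point range, and joins the rest into a String.
fn sanitize(chars: &[ExtractedChar]) -> String {
    chars
        .iter()
        .filter(|c| !c.font_is_buggy && !is_buggy_code_point(c.ch as u32))
        .map(|c| c.ch)
        .collect()
}
```

Keeping the font verdict and the code-point check as separate conditions means a character from a healthy font can still be dropped if it lands in a reserved range, which matches the "or" in the description above.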

## Summary

- **LiteParse** detects fonts with buggy encoding in [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) using the `is_buggy_font` helper at lines 501–515.
- **TrueType subsets** are flagged by `TT` prefixes or `+TT` substrings in PostScript names.
- **Type 1 fonts** with six-character prefixes followed by underscores are marked as unreliable.
- **Character-level filtering** removes control characters (≤ `0x1F`) and Private-Use Area code points (`U+E000`–`U+F8FF`).
- These heuristics execute during character-by-character extraction to sanitize PDF text output.

## Frequently Asked Questions

### What defines a "buggy" font in LiteParse?

A buggy font lacks proper Unicode mapping tables or embeds characters in non-standard code ranges. According to the run-llama/liteparse source code, LiteParse specifically targets TrueType subsets with `TT` naming patterns and Type 1 fonts whose PostScript names have a six-character prefix followed by an underscore, as these commonly produce garbled text during extraction.

### Why does LiteParse filter Private-Use Area characters?

PDF producers sometimes map visual glyphs to PUA code points (`U+E000`–`U+F8FF`) when standard encoding tables are corrupted or missing. Since these characters lack universal meaning and render unpredictably across systems, LiteParse treats them as encoding errors and removes them to ensure text consistency.

### How does LiteParse handle control characters during extraction?

Control characters (ASCII values ≤ `0x1F`) are filtered because PDF font bugs often place printable glyphs in these reserved slots. This prevents non-printable or semantically incorrect characters from appearing in the final extracted text.

### What is the difference between TrueType and Type 1 buggy font detection?

TrueType detection relies on PostScript name patterns (`TT` prefix or `+TT` substring), while Type 1 detection checks for a specific six-character prefix followed by an underscore. These distinct patterns reflect different subsetting strategies used by PDF creation tools for each font format.