How to Debug Why LiteParse Is Missing Text from a Specific PDF

LiteParse drops text when PDFs contain invisible OCR layers, zero-height glyphs, or overlapping items, and the projection.rs layout engine can discard rotated or anchored content during reconstruction.

Understanding why text disappears requires tracing through the aggressive filtering pipeline in the run-llama/liteparse repository. LiteParse intentionally removes noisy elements—such as phantom dots and duplicate OCR text—but these heuristics can also eliminate legitimate content. By enabling debug mode and inspecting the source code, you can identify exactly which stage filters your specific document.

Understanding the Text Extraction Pipeline

Text extraction follows a sequential pipeline defined in crates/liteparse/src/extract.rs and crates/liteparse/src/projection.rs. Data can be discarded at any of these stages:

  • Loading: Opens the PDF via pdfium::Document using load_document_from_input (lines 5-14 in extract.rs)
  • Raw Character Scan: extract_page_text_items (lines 50-150) iterates every TextChar, groups segments, and builds TextItem objects
  • Invisible Text Filtering: should_skip_invisible (lines 86-100) removes invisible (render-mode 3) characters when fewer than 30% of characters are visible
  • Zero-Height/Generated Glyph Skipping: Drops phantom dots and synthetic glyphs where vp_loose.bottom - vp_loose.top < 0.5 or is_generated is true (lines 57-69)
  • Ligature Expansion: Splits ligatures (ff, fi, etc.) into constituent letters
  • Deduplication: dedup_overlapping_items removes exact matches and overlapping items using a 0.5 intersection ratio
  • Projection: handle_rotation_reading_order and anchor detection in projection.rs reconstruct layout, handling rotations and flow-text detection
  • Output: Serializes LitePage to JSON or plain text via LiteParse::parse
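Because text can vanish at any of these stages, it helps to count survivors after each one. The following is a minimal sketch of that idea, not LiteParse code: Item, Stage, trace_stages, and the two example stages are hypothetical stand-ins that only mirror the rules described above.

```rust
/// Hypothetical stand-in for an extracted text item; not LiteParse's real TextItem.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    text: String,
    height: f32,
}

/// A pipeline stage takes ownership of the items and returns the survivors.
type Stage = fn(Vec<Item>) -> Vec<Item>;

/// Runs each named stage in order and records how many items survive it.
fn trace_stages(mut items: Vec<Item>, stages: &[(&str, Stage)]) -> Vec<(String, usize)> {
    let mut report = vec![("input".to_string(), items.len())];
    for (name, stage) in stages {
        items = stage(items);
        report.push((name.to_string(), items.len()));
    }
    report
}

/// Example stage: drop items shorter than 0.5 units (mirrors the zero-height rule).
fn drop_zero_height(items: Vec<Item>) -> Vec<Item> {
    items.into_iter().filter(|i| i.height >= 0.5).collect()
}

/// Example stage: drop items whose text is empty after trimming whitespace.
fn drop_empty(items: Vec<Item>) -> Vec<Item> {
    items.into_iter().filter(|i| !i.text.trim().is_empty()).collect()
}
```

The first stage where the count falls further than expected is the one to instrument in the real source.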

Common Causes of Missing Text

Invisible OCR Layers Discarded

When should_skip_invisible calculates that fewer than 30% of characters are visible (lines 124-136 in extract.rs), it assumes the page contains only invisible OCR text and returns true, causing the entire layer to be skipped. This frequently occurs with scanned PDFs that contain mixed native and OCR text.
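The decision described above can be restated as a small sketch. It assumes the only input is each character's PDF text render mode and that the rule is "skip when fewer than 30% of characters are visible". The names visible_ratio and would_skip_invisible are hypothetical, and the real function in extract.rs may weigh other signals.

```rust
/// Threshold taken from the description above; the real constant lives in extract.rs.
const VISIBLE_RATIO_THRESHOLD: f32 = 0.3;

/// Fraction of characters that are visible, given each character's PDF text
/// render mode (mode 3 neither fills nor strokes the glyph, so it is invisible).
fn visible_ratio(render_modes: &[u8]) -> f32 {
    if render_modes.is_empty() {
        return 1.0; // no characters: treat the page as fully visible
    }
    let visible = render_modes.iter().filter(|&&m| m != 3).count();
    visible as f32 / render_modes.len() as f32
}

/// Hypothetical restatement of the rule: skip when the visible fraction
/// falls below the threshold.
fn would_skip_invisible(render_modes: &[u8], threshold: f32) -> bool {
    visible_ratio(render_modes) < threshold
}
```

Feeding in the render modes of a problem page shows how close it sits to the cutoff, which indicates whether a threshold change is likely to help at all.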

Zero-Height and Generated Glyphs

The engine filters characters where the bounding box height is less than 0.5 points or where the glyph is algorithmically generated. These often include dot-leader decorations in tables of contents, but can also include legitimate small punctuation or footnote markers.
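To see which of your characters this rule would catch, the check can be modelled in isolation. This is a sketch under assumptions: CharBox and skip_reason are hypothetical, the field names only echo the ones mentioned above, and the order in which the real code tests the two conditions is not known here.

```rust
/// Hypothetical per-character record; field names echo the article, not the real struct.
#[derive(Debug, Clone, Copy)]
struct CharBox {
    top: f32,
    bottom: f32,
    is_generated: bool,
}

/// Returns why a character would be dropped, or None if it would be kept.
fn skip_reason(c: &CharBox, min_height: f32) -> Option<&'static str> {
    if c.is_generated {
        Some("generated")
    } else if c.bottom - c.top < min_height {
        Some("zero-height")
    } else {
        None
    }
}
```

Calling skip_reason with min_height set to 0.5 and then to a smaller value shows whether a footnote marker or small punctuation mark sits just under the cutoff.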

Over-Aggressive Deduplication

dedup_overlapping_items removes items where intersection / smaller_area > 0.5. Table cells that overlap with labels or headers may be incorrectly flagged as duplicates and removed, particularly when the size-ratio guard (lines 224-227) fails to distinguish between intentional duplicates and distinct overlapping elements.
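The intersection / smaller_area test is easy to reproduce on its own, which makes it clear why a small label inside a large cell looks like a duplicate. The sketch below assumes axis-aligned boxes. Rect, overlap_ratio, and would_dedup are hypothetical names, and the size-ratio guard is left out.

```rust
/// Axis-aligned bounding box; a hypothetical stand-in for the item geometry.
#[derive(Debug, Clone, Copy)]
struct Rect {
    left: f32,
    top: f32,
    right: f32,
    bottom: f32,
}

impl Rect {
    /// Area of the box, clamped so that inverted boxes count as empty.
    fn area(&self) -> f32 {
        (self.right - self.left).max(0.0) * (self.bottom - self.top).max(0.0)
    }
}

/// Intersection area divided by the area of the smaller box (0.0 if either is empty).
fn overlap_ratio(a: &Rect, b: &Rect) -> f32 {
    let w = (a.right.min(b.right) - a.left.max(b.left)).max(0.0);
    let h = (a.bottom.min(b.bottom) - a.top.max(b.top)).max(0.0);
    let smaller = a.area().min(b.area());
    if smaller <= 0.0 {
        0.0
    } else {
        (w * h) / smaller
    }
}

/// Mirrors the described rule: two items count as duplicates above the threshold.
fn would_dedup(a: &Rect, b: &Rect, threshold: f32) -> bool {
    overlap_ratio(a, b) > threshold
}
```

A label fully contained in a cell yields a ratio of 1.0, so it exceeds any threshold below 1.0. In that situation only a size-ratio guard, not the overlap threshold, can tell the two apart.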

Rotation Handling Errors

In projection.rs, handle_rotation_reading_order (lines 13-30) groups items by canonical_rotation. If global_overlap returns false at line 88 for a 90° or 270° group, the coordinate rewrite shifts the bounding box, potentially moving rotated text off-page where it is excluded from final output.
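Grouping by rotation can be pictured with a small sketch. It assumes that a canonical rotation means snapping each item's angle to the nearest of 0, 90, 180, or 270 degrees, which is a guess rather than something confirmed from projection.rs. snap_rotation and group_by_rotation are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Snaps an arbitrary angle in degrees to 0, 90, 180, or 270.
fn snap_rotation(degrees: f32) -> u16 {
    let normalized = degrees.rem_euclid(360.0); // map into [0, 360)
    let quarter_turns = (normalized / 90.0).round() as u16 % 4;
    quarter_turns * 90
}

/// Groups (text, angle) pairs by their snapped rotation, preserving input order
/// inside each group.
fn group_by_rotation<'a>(items: &[(&'a str, f32)]) -> BTreeMap<u16, Vec<&'a str>> {
    let mut groups: BTreeMap<u16, Vec<&'a str>> = BTreeMap::new();
    for (text, angle) in items {
        groups.entry(snap_rotation(*angle)).or_default().push(*text);
    }
    groups
}
```

Printing the groups for a problem page shows whether the missing text sits in a 90° or 270° group, which are the cases the coordinate rewrite affects.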

Anchor Filtering in Multi-Column Layouts

extract_block_anchors (around line 92 in projection.rs) applies delta_min_filter (lines 34-65) and intercept_filter (lines 84-131). If these filters determine that a column's left anchor is "intercepted" by other text, the entire column may collapse or merge into adjacent content.

Flow-Text Misclassification

is_flowing_text_block (lines 14-27 in projection.rs) uses FLOWING_WIDE_LINE_RATIO to detect paragraphs. Tables with wide cells may trigger this heuristic, causing detect_and_render_flowing_lines to concatenate distinct cells into a single flowing paragraph.
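A wide-line heuristic of this kind can be sketched as follows. Everything here is an assumption for illustration: the 0.8 cutoff for what counts as a wide line, the parameter names, and the direction of both comparisons are hypothetical, and the real is_flowing_text_block may combine its constants differently.

```rust
/// Fraction of lines whose width is at least `wide_fraction` of the block width.
fn wide_line_ratio(line_widths: &[f32], block_width: f32, wide_fraction: f32) -> f32 {
    if line_widths.is_empty() || block_width <= 0.0 {
        return 0.0;
    }
    let wide = line_widths
        .iter()
        .filter(|&&w| w >= wide_fraction * block_width)
        .count();
    wide as f32 / line_widths.len() as f32
}

/// Hypothetical classifier: a block is treated as flowing text when it has few
/// anchors and most of its lines are wide.
fn looks_flowing(
    line_widths: &[f32],
    block_width: f32,
    anchor_count: usize,
    min_wide_ratio: f32,
    max_anchors: usize,
) -> bool {
    anchor_count <= max_anchors
        && wide_line_ratio(line_widths, block_width, 0.8) >= min_wide_ratio
}
```

Under these assumptions, a table whose rows span most of the block width passes the wide-line test, so the anchor count is the only signal left to separate it from a paragraph.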

Step-by-Step Debugging Procedure

Enable Verbose Debug Output

Set the environment variable to expose filtering decisions at every stage:

export LITEPARSE_DEBUG=1
liteparse extract problem.pdf

The debug prints (lines 70-75 in extract.rs) reveal char_count, skip_invisible decisions, and specific rejection reasons including "sentinel," "zero-height," "generated," and "invisible."
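Debug output for a long document is easier to read as a tally. The sketch below counts how many log lines mention each rejection reason named above. It assumes only that the reason words appear somewhere in the line; the exact log format is not known here, so lines that mention a reason word in another context (for example a skip_invisible summary line) are counted too. count_skip_reasons is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Counts log lines that mention each rejection reason (plain substring match).
fn count_skip_reasons(log: &str) -> BTreeMap<&'static str, usize> {
    // Reason words as listed in the article; the surrounding log format is assumed.
    const REASONS: [&str; 4] = ["sentinel", "zero-height", "generated", "invisible"];
    let mut counts = BTreeMap::new();
    for line in log.lines() {
        for reason in REASONS {
            if line.contains(reason) {
                *counts.entry(reason).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

A reason whose count roughly matches the number of missing characters is a strong hint about which filter to inspect first.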

Inspect Raw TextItems Before Deduplication

Add instrumentation immediately after extract_page_text_items returns (line 69 in extract.rs) to see items before the deduplication stage removes them:

let items = extract_page_text_items(&page, &text_page, &view_box)?;
eprintln!("RAW items (pre-dedup): {:#?}", items);

Compare this output against the final result to determine if items disappear during deduplication or projection.
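The comparison itself can be automated once both dumps are reduced to lists of strings. The sketch below is a hypothetical helper, dropped_texts, that works on plain text rather than LiteParse's TextItem type. It treats the lists as multisets, so an item that appears twice in the raw dump and once in the final output is reported once.

```rust
use std::collections::HashMap;

/// Returns the raw texts that have no counterpart in the kept list,
/// matching duplicates one-for-one and preserving raw order.
fn dropped_texts<'a>(raw: &[&'a str], kept: &[&str]) -> Vec<&'a str> {
    // Count how many times each text survives in the final output.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for text in kept {
        *remaining.entry(*text).or_insert(0) += 1;
    }
    // Consume one surviving occurrence per raw item; whatever is left over was dropped.
    let mut dropped = Vec::new();
    for text in raw {
        match remaining.get_mut(*text) {
            Some(n) if *n > 0 => *n -= 1,
            _ => dropped.push(*text),
        }
    }
    dropped
}
```

Items reported here were removed somewhere between the raw scan and the final output. If the list is empty, the text was already absent before deduplication and the earlier character-level filters are the place to look.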

Validate Invisible Text Thresholds

Examine the ratio calculation at lines 124-136 in should_skip_invisible. If your document mixes native text and OCR layers and text is disappearing, experiment with the constant in invisible_ratio < 0.3. A comparison of this form fires less often as the constant gets smaller, so lowering it (for example to 0.1) makes the skip rarer, while raising it makes the skip more frequent. Alternatively, expose the parameter via LiteParseConfig.

Check Deduplication Logic

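Watch for [extract-debug] DEDUP exact-match drop messages in the debug output. If legitimate items are removed, raise the overlap ratio threshold at line 188 from 0.5 to a stricter value such as 0.7, so that only heavily overlapping items count as duplicates (lowering it would remove more items, not fewer). Alternatively, adjust the area-ratio guard at lines 224-227 to be less restrictive than the current >5.0 factor.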

Verify Rotation Grouping

Print each group's canonical_rotation and global_overlap result before the check at line 88 in handle_rotation_reading_order. If global_overlap is false unexpectedly, the coordinate transformation at lines 90-95 shifts the text bounding box. Increase the proximity margin at line 85 or skip the rewrite for specific rotation angles.

Examine Anchor Creation

Log the output from extract_block_anchors (around line 92 in projection.rs) to identify if delta_min_filter or intercept_filter removes column boundaries. Temporarily disable these filters by commenting out lines 84-131 to determine if anchor loss causes column collapse.

Review Flow-Text Heuristics

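If table rows merge into single lines, tighten the heuristic in is_flowing_text_block so that the table is no longer classified as a flowing paragraph and processed by detect_and_render_flowing_lines. If, as the names suggest, FLOWING_WIDE_LINE_RATIO is the minimum fraction of wide lines and FLOWING_MAX_TOTAL_ANCHORS is the maximum anchor count allowed for a flowing block, tightening means raising the former or lowering the latter. Confirm the direction of both comparisons in the source before editing.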

Practical Debugging Examples

CLI with Debug Logging

export LITEPARSE_DEBUG=1
cargo run --release --bin liteparse -- extract ./samples/problem.pdf

Node.js Wrapper

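// Static imports are hoisted above other statements, so load the module
// dynamically to guarantee the flag is set before it is evaluated.
process.env.LITEPARSE_DEBUG = '1';
const { LiteParse } = await import('liteparse');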

const parser = new LiteParse();
const result = await parser.parseFile('problem.pdf');
console.log(JSON.stringify(result, null, 2));

Python Wrapper

import os
os.environ["LITEPARSE_DEBUG"] = "1"
from liteparse import LiteParse

parser = LiteParse()
pages = parser.parse_file("problem.pdf")
print(pages)

Rust Test Harness for Raw Items

#[tokio::test]
async fn dump_raw_items() -> Result<(), LiteParseError> {
    let pdf_path = "tests/data/missing_text.pdf";
    let pages = LiteParse::parse_file(pdf_path).await?;
    for (i, page) in pages.iter().enumerate() {
        eprintln!("--- Page {} raw items ---", i + 1);
        for item in &page.text_items {
            eprintln!("{:?}", item);
        }
    }
    Ok(())
}

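Run with cargo test -- --nocapture to see the output. Because LiteParse::parse_file presumably runs the whole pipeline, this harness prints the items that survived filtering rather than the raw scan. Compare it against the pre-dedup dump from the instrumentation step above to find what was removed.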

Key Source Files to Examine

  • crates/liteparse/src/extract.rs: Contains should_skip_invisible, extract_page_text_items, and dedup_overlapping_items—the primary locations where characters are filtered during early extraction
  • crates/liteparse/src/projection.rs: Houses handle_rotation_reading_order, extract_block_anchors, and is_flowing_text_block—responsible for layout reconstruction and flow-text detection that can drop structured content
  • crates/liteparse/src/types.rs: Defines TextItem and ProjectedTextItem structures to understand which metadata fields are populated at each stage
  • crates/liteparse/src/config.rs: User-facing configuration; future versions may expose the invisible-text ratio threshold here

Summary

  • Enable LITEPARSE_DEBUG=1 to trace filtering decisions through extract.rs and identify the specific stage removing your text
  • Inspect should_skip_invisible when entire pages are blank; the default 30% visibility threshold may incorrectly discard mixed native/OCR documents
  • Verify zero-height filtering (lines 57-69 in extract.rs) when specific characters disappear; the 0.5 point minimum height may exclude legitimate small glyphs
  • Audit dedup_overlapping_items if table cells or labels vanish; relax the 0.5 intersection ratio or adjust the size-ratio guard at lines 224-227
  • Debug rotation with handle_rotation_reading_order in projection.rs when vertical text disappears due to failed global_overlap checks
  • Review anchor filters (intercept_filter, delta_min_filter) when multi-column layouts collapse into single streams
  • Adjust flow-text constants (FLOWING_WIDE_LINE_RATIO) if structured tables incorrectly merge into paragraphs

Frequently Asked Questions

Why does LiteParse skip entire pages of my scanned PDF?

LiteParse's should_skip_invisible function (lines 86-100 in extract.rs) compares the number of visible characters with the number of invisible (render-mode 3) characters. If fewer than 30% of characters are visible, the function returns true and the entire text layer is discarded. For scanned documents containing OCR layers over images, you may need to adjust the invisible_ratio constant at line 125 or disable invisible-text filtering via a custom configuration.

How do I stop LiteParse from removing small text or punctuation?

The engine filters characters where vp_loose.bottom - vp_loose.top < 0.5 points, approximately at lines 57-69 in extract.rs. This removes phantom glyphs and dot-leaders but can also discard legitimate small text. Modify the height threshold in the source code or check the debug output for "SKIP zero-height" messages to confirm this is the cause of your missing content.

Why is my table content merging into one paragraph?

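The is_flowing_text_block heuristic in projection.rs (lines 14-27) detects wide-line ratios typical of flowing paragraphs. If table cells trigger this heuristic, detect_and_render_flowing_lines merges the cells. Tighten the heuristic so that tables are treated as distinct structured blocks rather than flowing text. If FLOWING_WIDE_LINE_RATIO is a minimum fraction of wide lines and FLOWING_MAX_TOTAL_ANCHORS is a maximum anchor count, as the names suggest, that means raising the former or lowering the latter. Check the comparisons in the source to confirm the direction.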

Can I disable the duplicate removal entirely?

While there is no configuration flag to disable dedup_overlapping_items, you can effectively disable it by modifying the overlap ratio check at line 188 in extract.rs from intersection / smaller_area > 0.5 to > 1.0 (which is impossible), or by increasing the size-ratio guard threshold at lines 224-227. Recompile the crate after making these adjustments.
