# How to Debug Why LiteParse Is Missing Text from a Specific PDF

> Fix LiteParse missing PDF text by debugging invisible OCR layers, zero-height glyphs, or overlapping items. Learn how the projection.rs engine handles content.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**LiteParse drops text when PDFs contain invisible OCR layers, zero-height glyphs, or overlapping items, and the [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) layout engine can discard rotated or anchored content during reconstruction.**

Understanding why text disappears requires tracing through the aggressive filtering pipeline in the `run-llama/liteparse` repository. LiteParse intentionally removes noisy elements (such as phantom dots and duplicate OCR text), but the same heuristics can also eliminate legitimate content. By enabling debug mode and inspecting the source code, you can identify which stage drops text from your document.

## Understanding the Text Extraction Pipeline

Text extraction follows a sequential pipeline defined in [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) and [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs). Data can be discarded at any of the stages below (line numbers are approximate and shift between commits, so search by function name if a range does not match your checkout):

- **Loading**: Opens the PDF via `pdfium::Document` using `load_document_from_input` (lines 5-14 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs))
- **Raw Character Scan**: `extract_page_text_items` (lines 50-150) iterates every `TextChar`, groups segments, and builds `TextItem` objects
- **Invisible Text Filtering**: `should_skip_invisible` (lines 86-100) removes render-mode 3 characters when fewer than 30% of characters are visible
- **Zero-Height/Generated Glyph Skipping**: Drops phantom dots and synthetic glyphs where `vp_loose.bottom - vp_loose.top < 0.5` or `is_generated` is true (lines 57-69)
- **Ligature Expansion**: Splits ligatures (ff, fi, etc.) into constituent letters
- **Deduplication**: `dedup_overlapping_items` removes exact matches and overlapping items using a 0.5 intersection ratio
- **Projection**: `handle_rotation_reading_order` and anchor detection in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) reconstruct layout, handling rotations and flow-text detection
- **Output**: Serializes `LitePage` to JSON or plain text via `LiteParse::parse`

## Common Causes of Missing Text

### Invisible OCR Layers Discarded

When `should_skip_invisible` calculates that fewer than 30% of characters are visible (lines 124-136 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)), it assumes the page contains only invisible OCR text and returns `true`, causing the entire layer to be skipped. This frequently occurs with scanned PDFs that contain mixed native and OCR text.
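To reason about whether a page would trip this check, the following is a minimal sketch of the ratio test as the prose above describes it. It is not the crate's implementation: `visible_ratio` and `would_skip_invisible` are hypothetical names, the input is a plain slice of PDF text render modes, and the real function may compute or name the ratio differently.

```rust
/// PDF text render mode 3 means "neither fill nor stroke", i.e. invisible text.
const INVISIBLE_RENDER_MODE: u8 = 3;

/// Fraction of characters on a page that are visible (render mode != 3).
/// Returns `None` for an empty page to avoid dividing by zero.
fn visible_ratio(render_modes: &[u8]) -> Option<f64> {
    if render_modes.is_empty() {
        return None;
    }
    let visible = render_modes
        .iter()
        .filter(|&&mode| mode != INVISIBLE_RENDER_MODE)
        .count();
    Some(visible as f64 / render_modes.len() as f64)
}

/// Hypothetical stand-in for the decision described above: skip the
/// invisible layer when the visible share falls below `threshold`
/// (0.3 according to this guide).
fn would_skip_invisible(render_modes: &[u8], threshold: f64) -> bool {
    match visible_ratio(render_modes) {
        Some(ratio) => ratio < threshold,
        None => false,
    }
}
```

Feeding it the render modes of a problem page shows how far the page sits from the threshold, which helps decide whether a threshold change is enough or the filter needs to be bypassed.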

### Zero-Height and Generated Glyphs

The engine filters characters where the bounding box height is less than 0.5 points or where the glyph is algorithmically generated. These often include dot-leader decorations in tables of contents, but can also include legitimate small punctuation or footnote markers.
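The check can be pictured with a small sketch. Only the `vp_loose.bottom - vp_loose.top < 0.5` expression and the `is_generated` flag come from this guide; the `LooseBox` struct, the `skip_reason` function, and the order of the two checks are assumptions made for illustration.

```rust
/// Loose bounding box in viewport coordinates (hypothetical shape; only the
/// `top` and `bottom` names come from the `vp_loose` expression above).
#[derive(Debug, Clone, Copy)]
struct LooseBox {
    top: f32,
    bottom: f32,
}

/// Returns the label under which a character would be rejected, or `None`
/// if it passes both checks. Labels mirror the debug reasons named in
/// this guide.
fn skip_reason(vp_loose: LooseBox, is_generated: bool, min_height: f32) -> Option<&'static str> {
    if vp_loose.bottom - vp_loose.top < min_height {
        Some("zero-height")
    } else if is_generated {
        Some("generated")
    } else {
        None
    }
}
```

A footnote marker whose box is only 0.2 points tall would be rejected as `zero-height` with the 0.5 threshold, which is the failure mode to look for when small glyphs vanish.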

### Over-Aggressive Deduplication

`dedup_overlapping_items` removes items where `intersection / smaller_area > 0.5`. Table cells that overlap with labels or headers may be incorrectly flagged as duplicates and removed, particularly when the size-ratio guard (lines 224-227) fails to distinguish between intentional duplicates and distinct overlapping elements.
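The ratio itself is easy to reproduce outside the crate. The sketch below implements `intersection / smaller_area` for axis-aligned rectangles; `Rect`, `overlap_ratio`, and `flagged_as_duplicate` are hypothetical names, and the size-ratio guard is deliberately left out because its exact form is not described here.

```rust
/// Axis-aligned rectangle (hypothetical type, not the crate's `TextItem`).
#[derive(Debug, Clone, Copy)]
struct Rect {
    x: f64,
    y: f64,
    w: f64,
    h: f64,
}

impl Rect {
    fn area(&self) -> f64 {
        self.w * self.h
    }

    /// Area shared by both rectangles, or 0.0 when they do not overlap.
    fn intersection_area(&self, other: &Rect) -> f64 {
        let left = self.x.max(other.x);
        let right = (self.x + self.w).min(other.x + other.w);
        let top = self.y.max(other.y);
        let bottom = (self.y + self.h).min(other.y + other.h);
        (right - left).max(0.0) * (bottom - top).max(0.0)
    }
}

/// `intersection / smaller_area`, the ratio quoted above.
fn overlap_ratio(a: &Rect, b: &Rect) -> f64 {
    let smaller = a.area().min(b.area());
    if smaller <= 0.0 {
        return 0.0;
    }
    a.intersection_area(b) / smaller
}

/// True when the pair would exceed the duplicate threshold.
fn flagged_as_duplicate(a: &Rect, b: &Rect, threshold: f64) -> bool {
    overlap_ratio(a, b) > threshold
}
```

A 10x10 label that sits 60% inside a wide table cell yields a ratio of 0.6 and is flagged at the 0.5 threshold, even though the two items carry different text. Because the intersection can never exceed the smaller area, the ratio tops out at 1.0.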

### Rotation Handling Errors

In [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs), `handle_rotation_reading_order` (lines 13-30) groups items by `canonical_rotation`. If `global_overlap` returns `false` at line 88 for a 90° or 270° group, the coordinate rewrite shifts the bounding box, potentially moving rotated text off-page where it is excluded from final output.

### Anchor Filtering in Multi-Column Layouts

`extract_block_anchors` (around line 92 in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs)) applies `delta_min_filter` (lines 34-65) and `intercept_filter` (lines 84-131). If these filters determine that a column's left anchor is "intercepted" by other text, the entire column may collapse or merge into adjacent content.

### Flow-Text Misclassification

`is_flowing_text_block` (lines 14-27 in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs)) uses `FLOWING_WIDE_LINE_RATIO` to detect paragraphs. Tables with wide cells may trigger this heuristic, causing `detect_and_render_flowing_lines` to concatenate distinct cells into a single flowing paragraph.

## Step-by-Step Debugging Procedure

### Enable Verbose Debug Output

Set the environment variable to expose filtering decisions at every stage:

```bash
export LITEPARSE_DEBUG=1
liteparse extract problem.pdf
```

The debug prints (lines 70-75 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)) reveal `char_count`, `skip_invisible` decisions, and specific rejection reasons including "sentinel," "zero-height," "generated," and "invisible."
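On long documents the debug output is easier to read as a tally. The sketch below counts rejection reasons in captured log text. It assumes each rejection is logged on its own line as `SKIP <reason>`, a pattern inferred from the "SKIP zero-height" message mentioned in this guide; the exact log format is not confirmed, so adjust the needle if your output differs. `tally_rejections` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Rejection labels named in this guide.
const REASONS: [&str; 4] = ["sentinel", "zero-height", "generated", "invisible"];

/// Count the log lines that contain "SKIP <reason>" for each known reason.
/// Reasons that never occur are omitted from the result.
fn tally_rejections(log: &str) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    for reason in REASONS {
        // Assumed message shape; matching the bare word would also hit
        // unrelated fields such as `skip_invisible`.
        let needle = format!("SKIP {reason}");
        let hits = log.lines().filter(|line| line.contains(&needle)).count();
        if hits > 0 {
            counts.insert(reason, hits);
        }
    }
    counts
}
```

A page where one reason dominates the tally points directly at the filter to inspect first.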

### Inspect Raw TextItems Before Deduplication

Add instrumentation immediately after `extract_page_text_items` returns (line 69 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)) to see items before the deduplication stage removes them:

```rust
let items = extract_page_text_items(&page, &text_page, &view_box)?;
eprintln!("RAW items (pre-dedup): {:#?}", items);
```

Compare this output against the final result to determine if items disappear during deduplication or projection.
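The comparison can be automated with a multiset difference over item text. This is a minimal sketch that works on plain strings so it does not depend on the real `TextItem` layout; `dropped_texts` is a hypothetical helper, and extracting the text of each item from your two dumps is left to you.

```rust
use std::collections::HashMap;

/// Return the texts that occur in `raw` more often than in `final_items`,
/// in the order they appear in `raw`.
fn dropped_texts(raw: &[&str], final_items: &[&str]) -> Vec<String> {
    // Count how many copies of each text survived the pipeline.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for text in final_items {
        *remaining.entry(*text).or_insert(0) += 1;
    }

    // Consume one surviving copy per raw item; anything left over was dropped.
    let mut dropped = Vec::new();
    for text in raw {
        match remaining.get_mut(*text) {
            Some(count) if *count > 0 => *count -= 1,
            _ => dropped.push((*text).to_string()),
        }
    }
    dropped
}
```

Counting copies matters here: when deduplication removes the second of two identical strings, a plain set difference would report nothing missing.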

### Validate Invisible Text Thresholds

Examine the ratio calculation at lines 124-136 in `should_skip_invisible`. If your document contains a mix of native text and OCR layers and the native text is disappearing, adjust the constant `invisible_ratio < 0.3` to a higher threshold (such as `0.5`) or expose the parameter via `LiteParseConfig`.

### Check Deduplication Logic

Watch for `[extract-debug] DEDUP exact-match drop` messages in the debug output. If legitimate items are removed, raise the overlap threshold at line 188 from `0.5` to a stricter value such as `0.7`, so that only heavily overlapping items count as duplicates, or adjust the area-ratio guard at lines 224-227 to be less restrictive than the current `>5.0` factor.

### Verify Rotation Grouping

Print each group's `canonical_rotation` and `global_overlap` result before the check at line 88 in `handle_rotation_reading_order`. If `global_overlap` is `false` unexpectedly, the coordinate transformation at lines 90-95 shifts the text bounding box. Increase the proximity `margin` at line 85 or skip the rewrite for specific rotation angles.
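When logging the groups, two small checks help spot items that the rewrite pushed outside the page. The sketch below is illustrative only: `BBox`, `is_off_page`, and `nearest_quadrant` are hypothetical, the coordinate system is assumed to have its origin at the top-left corner with units in points, and `nearest_quadrant` is just a guess at the kind of normalisation `canonical_rotation` performs.

```rust
/// Bounding box with a top-left origin (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct BBox {
    x: f64,
    y: f64,
    w: f64,
    h: f64,
}

/// True when the box lies entirely outside a page of the given size.
fn is_off_page(b: BBox, page_w: f64, page_h: f64) -> bool {
    b.x + b.w <= 0.0 || b.y + b.h <= 0.0 || b.x >= page_w || b.y >= page_h
}

/// Snap an angle in degrees to the nearest of 0, 90, 180, or 270.
fn nearest_quadrant(deg: f64) -> u16 {
    let wrapped = deg.rem_euclid(360.0);
    ((wrapped / 90.0).round() as u16 % 4) * 90
}
```

Running `is_off_page` over each item after the coordinate rewrite, using the page dimensions (612 by 792 points for US Letter), flags rotated text that would silently disappear from the output.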

### Examine Anchor Creation

Log the output from `extract_block_anchors` (around line 92 in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs)) to identify if `delta_min_filter` or `intercept_filter` removes column boundaries. Temporarily disable these filters by commenting out lines 84-131 to determine if anchor loss causes column collapse.

### Review Flow-Text Heuristics

If table rows merge into single lines, lower `FLOWING_WIDE_LINE_RATIO` or increase `FLOWING_MAX_TOTAL_ANCHORS` in `is_flowing_text_block` to prevent the table from being classified as a flowing paragraph and processed by `detect_and_render_flowing_lines`.

## Practical Debugging Examples

### CLI with Debug Logging

```bash
export LITEPARSE_DEBUG=1
cargo run --release --bin liteparse -- extract ./samples/problem.pdf
```

### Node.js Wrapper

```javascript
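// Static imports are hoisted above other statements, so set the variable
// first and load the module dynamically to make sure the flag is in place.
process.env.LITEPARSE_DEBUG = '1';
const { LiteParse } = await import('liteparse');

const parser = new LiteParse();
const result = await parser.parseFile('problem.pdf');
console.log(JSON.stringify(result, null, 2));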
```

### Python Wrapper

```python
import os
os.environ["LITEPARSE_DEBUG"] = "1"
from liteparse import LiteParse

parser = LiteParse()
pages = parser.parse_file("problem.pdf")
print(pages)
```

### Rust Test Harness for Raw Items

```rust
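// Import paths are assumed; adjust them to match the crate's actual exports.
use liteparse::{LiteParse, LiteParseError};

#[tokio::test]
async fn dump_raw_items() -> Result<(), LiteParseError> {
    let pdf_path = "tests/data/missing_text.pdf";
    let pages = LiteParse::parse_file(pdf_path).await?;
    // Print every text item on every page, one per line.
    for (i, page) in pages.iter().enumerate() {
        eprintln!("--- Page {} raw items ---", i + 1);
        for item in &page.text_items {
            eprintln!("{:?}", item);
        }
    }
    Ok(())
}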
```

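Run with `cargo test -- --nocapture` to print each page's `text_items`. Because `parse_file` runs the whole pipeline, these are most likely the items that survived filtering; compare them against the pre-dedup dump from `extract_page_text_items` to see what was dropped.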

## Key Source Files to Examine

- **[`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)**: Contains `should_skip_invisible`, `extract_page_text_items`, and `dedup_overlapping_items`—the primary locations where characters are filtered during early extraction
- **[`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs)**: Houses `handle_rotation_reading_order`, `extract_block_anchors`, and `is_flowing_text_block`—responsible for layout reconstruction and flow-text detection that can drop structured content
- **[`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs)**: Defines `TextItem` and `ProjectedTextItem` structures to understand which metadata fields are populated at each stage
- **[`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs)**: User-facing configuration; future versions may expose the invisible-text ratio threshold here

## Summary

- **Enable `LITEPARSE_DEBUG=1`** to trace filtering decisions through [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) and identify the specific stage removing your text
- **Inspect `should_skip_invisible`** when entire pages are blank; the default 30% visibility threshold may incorrectly discard mixed native/OCR documents
- **Verify zero-height filtering** (lines 57-69 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)) when specific characters disappear; the 0.5 point minimum height may exclude legitimate small glyphs
- **Audit `dedup_overlapping_items`** if table cells or labels vanish; relax the 0.5 intersection ratio or adjust the size-ratio guard at lines 224-227
- **Debug rotation with `handle_rotation_reading_order`** in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) when vertical text disappears due to failed `global_overlap` checks
- **Review anchor filters** (`intercept_filter`, `delta_min_filter`) when multi-column layouts collapse into single streams
- **Adjust flow-text constants** (`FLOWING_WIDE_LINE_RATIO`) if structured tables incorrectly merge into paragraphs

## Frequently Asked Questions

### Why does LiteParse skip entire pages of my scanned PDF?

LiteParse's `should_skip_invisible` function (lines 86-100 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs)) compares the number of visible characters with the number of invisible (render-mode 3) characters. If fewer than 30% of characters are visible, the function returns `true` and the entire text layer is discarded. For scanned documents containing OCR layers over images, you may need to adjust the `invisible_ratio` constant at line 125 or disable invisible-text filtering via a custom configuration.

### How do I stop LiteParse from removing small text or punctuation?

The engine filters characters where `vp_loose.bottom - vp_loose.top < 0.5` points, approximately at lines 57-69 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs). This removes phantom glyphs and dot-leaders but can also discard legitimate small text. Check the debug output for "SKIP zero-height" messages to confirm this is the cause of your missing content, then modify the height threshold in the source code.

### Why is my table content merging into one paragraph?

The `is_flowing_text_block` heuristic in [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) (lines 14-27) detects wide-line ratios typical of flowing paragraphs. If table cells trigger this heuristic, `detect_and_render_flowing_lines` merges the cells. Lower the `FLOWING_WIDE_LINE_RATIO` constant or increase `FLOWING_MAX_TOTAL_ANCHORS` to ensure tables are treated as distinct structured blocks rather than flowing text.

### Can I disable the duplicate removal entirely?

While there is no configuration flag to disable `dedup_overlapping_items`, you can effectively disable the overlap check by changing the comparison at line 188 in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) from `intersection / smaller_area > 0.5` to `> 1.0`. The intersection can never be larger than the smaller item, so the ratio never exceeds 1.0 and the condition is never met. Alternatively, increase the size-ratio guard threshold at lines 224-227. Recompile the crate after making these adjustments.