# How LiteParse Detects and Handles Text‑Sparse Pages for Selective OCR

> LiteParse detects text-sparse pages with three heuristics and runs OCR only on those pages. It merges the results, discards duplicates, and cleans table artifacts.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-31

---

**LiteParse identifies text‑sparse pages using three heuristics—raw character count, text‑to‑page‑area coverage, and embedded images—and runs OCR only on those specific pages, merging results back while avoiding duplicates and cleaning table artifacts.**

The [run-llama/liteparse](https://github.com/run-llama/liteparse) library optimizes PDF parsing by skipping OCR on pages that already contain enough native text. Instead of sending every page through an OCR engine, LiteParse analyzes each page’s content during the preparation step to decide whether optical character recognition is required. This selective approach avoids OCR work on text‑rich pages while still capturing image‑based text and scanned documents.

## The Three Heuristics for Detecting Text‑Sparse Pages

LiteParse evaluates every extracted page against three specific metrics in [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs). If **any** of these conditions evaluates to true, the page is flagged as needing OCR and rendered to an RGB bitmap for processing.

### Raw Character Count

The system first calculates the total length of all extracted text strings on the page. If the combined character count falls below **20 characters**, the page is considered text‑sparse. This catches pages containing only headers, footers, or minimal incidental text that likely represents noise rather than substantive content.
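
The check can be sketched as follows. This is a minimal illustration, not LiteParse's code: `TextItem`, `MIN_TEXT_CHARS`, and both functions are hypothetical names, and it assumes the count is taken in Unicode characters (`chars().count()`) rather than bytes (`len()`), which differ for non-ASCII text.

```rust
/// Hypothetical stand-in for a native text item; LiteParse's real type may differ.
pub struct TextItem {
    pub text: String,
}

/// Threshold quoted in this article: pages with fewer characters are text-sparse.
pub const MIN_TEXT_CHARS: usize = 20;

/// Sum the number of Unicode scalar values across all text items on a page.
pub fn total_chars(items: &[TextItem]) -> usize {
    items.iter().map(|t| t.text.chars().count()).sum()
}

/// True when the page has fewer than `MIN_TEXT_CHARS` characters of native text.
pub fn is_below_char_threshold(items: &[TextItem]) -> bool {
    total_chars(items) < MIN_TEXT_CHARS
}
```

A page whose only native text is a footer such as `Page 3` has 6 characters and would be flagged, while a single full sentence of body text is usually enough to clear the threshold.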

### Text‑to‑Page Area Coverage

The second heuristic measures the ratio of the summed bounding‑box area of native text items to the total page area. If the coverage is less than **0.15** (15%), the page is marked for OCR. This threshold effectively identifies pages where text exists but represents a trivial portion of the layout, such as mostly image‑based slides or scanned diagrams with captions.
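
A coverage ratio of this kind could be computed roughly as below. The `BBox` type, the constant, and the function name are assumptions made for illustration; the sketch also assumes overlapping boxes are simply summed, so it caps the result at 1.0.

```rust
/// Hypothetical axis-aligned bounding box in page units; not LiteParse's real type.
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Threshold quoted in this article: below 15% coverage the page is text-sparse.
pub const MIN_TEXT_COVERAGE: f32 = 0.15;

/// Ratio of summed text bounding-box area to page area.
/// Overlapping boxes are counted twice, so the ratio is capped at 1.0.
pub fn text_coverage(boxes: &[BBox], page_width: f32, page_height: f32) -> f32 {
    let page_area = page_width * page_height;
    if page_area <= 0.0 {
        return 0.0; // degenerate page: treat as having no coverage
    }
    let text_area: f32 = boxes
        .iter()
        .map(|b| b.width.max(0.0) * b.height.max(0.0))
        .sum();
    (text_area / page_area).min(1.0)
}
```

As a worked example, a 300 × 20 caption on a 612 × 792 page covers 6,000 of 484,704 square units, roughly 1.2%, which is well under the 0.15 threshold.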

### Embedded Images

Finally, LiteParse checks for the presence of any raster images on the page. If **any** embedded image is detected, the page automatically qualifies for OCR review. This ensures that image‑based text—such as screenshots, scanned signatures, or infographic text—is captured even when the other heuristics might not trigger.

## Rendering and OCR Execution Pipeline

Once a page is identified as text‑sparse, the `render_pages_for_ocr` function in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) (lines 49‑66) renders the page to an RGB bitmap. This bitmap is then passed to the configured OCR backend.

The core decision logic resides at lines 28‑44 of [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), where the heuristics are calculated and evaluated:

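```rust
// Simplified logic from ocr_merge.rs#L28-L44
// Total length of native text on the page (annotated so `sum` knows its result type).
let text_len: usize = page.text_items.iter().map(|t| t.text.len()).sum();
// Fraction of the page area covered by native text bounding boxes.
let coverage = calculate_text_coverage(&page.text_items, page.width, page.height);
// Any embedded raster image qualifies the page for OCR.
let has_images = !page.images.is_empty();

// A page is flagged as soon as any one of the three heuristics fires.
if text_len < 20 || coverage < 0.15 || has_images {
    pages_needing_ocr.push(page);
}
```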
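
To see how the three conditions combine on concrete pages, here is a self-contained sketch that reports which heuristics fired. `PageStats`, `OcrReason`, and `ocr_reasons` are hypothetical names introduced for this example; only the thresholds (20 characters, 0.15 coverage, any image) come from the description above.

```rust
/// Why a page was flagged for OCR. Names are illustrative, not LiteParse's.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OcrReason {
    FewCharacters,
    LowCoverage,
    HasImages,
}

/// Hypothetical per-page summary computed during the preparation step.
pub struct PageStats {
    pub text_chars: usize,
    pub text_coverage: f32,
    pub image_count: usize,
}

/// Return every heuristic that fired; an empty list means the page skips OCR.
pub fn ocr_reasons(stats: &PageStats) -> Vec<OcrReason> {
    let mut reasons = Vec::new();
    if stats.text_chars < 20 {
        reasons.push(OcrReason::FewCharacters);
    }
    if stats.text_coverage < 0.15 {
        reasons.push(OcrReason::LowCoverage);
    }
    if stats.image_count > 0 {
        reasons.push(OcrReason::HasImages);
    }
    reasons
}
```

Under these rules a page with 2,400 characters, 42% coverage, and two embedded images is still sent to OCR because of the image condition alone, which is why duplicate filtering matters during the merge.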

The `ocr_and_merge_rendered` function (lines 71‑80) executes OCR concurrently using the selected engine—Tesseract (if compiled with the `tesseract` feature), an HTTP OCR server (if `ocr_server_url` is configured), or a custom implementation of the `OcrEngine` trait defined in [`crates/liteparse/src/ocr/mod.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/mod.rs).
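
The general shape of a pluggable engine plus concurrent execution might look like the sketch below. Everything in it is an assumption for illustration: the `Engine` trait is not the real `OcrEngine` signature, `RenderedPage` and `EchoEngine` are invented, and plain scoped threads from the standard library stand in for whatever concurrency mechanism LiteParse actually uses.

```rust
use std::thread;

/// Hypothetical rendered page: an RGB bitmap plus its page number.
pub struct RenderedPage {
    pub page_number: usize,
    pub width: u32,
    pub height: u32,
    pub rgb: Vec<u8>,
}

/// Illustrative engine trait; the real `OcrEngine` signature may differ.
pub trait Engine: Sync {
    fn recognize(&self, page: &RenderedPage) -> Vec<String>;
}

/// Toy engine used only to make the sketch runnable.
pub struct EchoEngine;

impl Engine for EchoEngine {
    fn recognize(&self, page: &RenderedPage) -> Vec<String> {
        vec![format!("page {}: {} bytes", page.page_number, page.rgb.len())]
    }
}

/// Run the engine on every rendered page, one scoped thread per page,
/// and return results in the same order as the input.
pub fn ocr_pages(engine: &dyn Engine, pages: &[RenderedPage]) -> Vec<(usize, Vec<String>)> {
    thread::scope(|s| {
        // Spawn all workers first so the pages are processed in parallel.
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || (p.page_number, engine.recognize(p))))
            .collect();
        // Join in spawn order to keep results aligned with the input pages.
        handles
            .into_iter()
            .map(|h| h.join().expect("OCR worker panicked"))
            .collect()
    })
}
```

Keeping the page number next to each result lets the merge step attach recognized text to the right page even when workers finish out of order.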

## Merging OCR Results Without Duplication

After OCR extracts text from the rendered bitmap, LiteParse merges these new text items back into the page’s existing `text_items` list. This process includes two critical cleanup steps to ensure data integrity:

**Overlap Detection:** The `overlaps_existing_text` function checks whether newly recognized text geometrically overlaps with existing native text elements. If overlap is detected, the duplicate OCR result is discarded to prevent double‑counting text that was already present in the PDF’s text layer.
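
One plausible way to implement such a geometric test is to compare the intersection area of two boxes against the area of the OCR box. The sketch below assumes that approach; `Rect`, `overlaps_native`, `drop_duplicates`, and the `min_ratio` parameter are hypothetical and not taken from `overlaps_existing_text`.

```rust
/// Hypothetical axis-aligned box; LiteParse's own geometry types may differ.
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

impl Rect {
    pub fn area(&self) -> f32 {
        self.width.max(0.0) * self.height.max(0.0)
    }

    /// Area of the intersection with `other`, or 0.0 if the boxes are disjoint.
    pub fn intersection_area(&self, other: &Rect) -> f32 {
        let left = self.x.max(other.x);
        let top = self.y.max(other.y);
        let right = (self.x + self.width).min(other.x + other.width);
        let bottom = (self.y + self.height).min(other.y + other.height);
        (right - left).max(0.0) * (bottom - top).max(0.0)
    }
}

/// True when at least `min_ratio` of the OCR box is covered by one native box.
/// The ratio is an assumed tuning knob, not a value taken from LiteParse.
pub fn overlaps_native(ocr: &Rect, native: &[Rect], min_ratio: f32) -> bool {
    let area = ocr.area();
    if area <= 0.0 {
        return false;
    }
    native
        .iter()
        .any(|n| ocr.intersection_area(n) / area >= min_ratio)
}

/// Keep only OCR boxes that do not duplicate native text.
pub fn drop_duplicates(ocr: Vec<Rect>, native: &[Rect], min_ratio: f32) -> Vec<Rect> {
    ocr.into_iter()
        .filter(|o| !overlaps_native(o, native, min_ratio))
        .collect()
}
```

With a ratio of 0.5, an OCR box that sits entirely inside a native text line is dropped, while a box elsewhere on the page (for example, text inside an embedded image) is kept.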

**Artifact Cleaning:** The `clean_ocr_table_artifacts` function removes common OCR noise such as table borders and grid lines that Tesseract often interprets as text characters (like pipes `|` or dashes `-`). This ensures that structured data extraction downstream receives clean, accurate text.
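
A simple version of this kind of cleanup can be written as a filter over recognized strings. The sketch below is an assumption about the general technique, not the body of `clean_ocr_table_artifacts`; the function names and the set of characters treated as border noise are chosen for illustration.

```rust
/// True when a recognized string consists only of characters that typically
/// come from table rules (pipes, dashes, and similar) or whitespace.
/// The character set is an assumption for illustration.
pub fn is_table_artifact(text: &str) -> bool {
    let trimmed = text.trim();
    !trimmed.is_empty()
        && trimmed
            .chars()
            .all(|c| matches!(c, '|' | '-' | '_' | '=' | '+') || c.is_whitespace())
}

/// Strip pipe characters and whitespace from the edges of a cell.
/// Dashes are left alone so that values such as "-42.50" survive.
pub fn strip_border_chars(text: &str) -> &str {
    text.trim_matches(|c: char| c == '|' || c.is_whitespace())
}

/// Drop pure-artifact strings, tidy the edges of the rest, and discard blanks.
pub fn clean_table_artifacts(items: Vec<String>) -> Vec<String> {
    items
        .into_iter()
        .filter(|t| !is_table_artifact(t))
        .map(|t| strip_border_chars(&t).to_string())
        .filter(|t| !t.is_empty())
        .collect()
}
```

Given the inputs `|`, `----`, `| Total |`, and `-42.50`, this keeps only `Total` and `-42.50`.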

## Configuration and Usage Examples

Selective OCR is enabled by default in `LiteParseConfig` (defined in [`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs)). The following examples demonstrate how to parse documents with the default selective OCR behavior across different language bindings.

### Rust

```rust
use liteparse::LiteParse;
use liteparse::config::LiteParseConfig;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // OCR is enabled by default; only text-sparse pages trigger it
    let config = LiteParseConfig::default();
    let parser = LiteParse::new(config);

    // Parse the PDF; OCR runs automatically on pages that meet the heuristics
    let result = parser.parse("sample.pdf").await?;
    println!("Processed {} pages", result.pages.len());
    Ok(())
}
```

### Node.js / TypeScript

```typescript
import { LiteParse, LiteParseConfig } from "liteparse";

async function extractDocument(): Promise<void> {
  const config = new LiteParseConfig(); // OCR enabled by default
  const parser = new LiteParse(config);

  const result = await parser.parse("sample.pdf");
  console.log(`Extracted text from ${result.pages.length} pages`);
  console.log(result.text);
}

// Surface parse failures instead of leaving the promise rejection unhandled
extractDocument().catch(console.error);
```

### Python

```python
from liteparse import LiteParse, LiteParseConfig

# Initialize with the default config: selective OCR is active
config = LiteParseConfig()
parser = LiteParse(config)

# Parse with automatic sparse-page detection
result = parser.parse("sample.pdf")
print(f"Pages processed: {len(result.pages)}")
print(result.text)
```

The [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs) file orchestrates the extraction flow, while [`projection.rs`](https://github.com/run-llama/liteparse/blob/main/projection.rs) handles the final text layout after OCR results are integrated, ensuring that text coordinates align correctly with the original document structure.

## Summary

- LiteParse uses **three heuristics** to detect text‑sparse pages: character count (< 20), text coverage (< 15%), and presence of embedded images.
- The decision logic lives in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) (lines 28‑44), with rendering handled at lines 49‑66.
- Only flagged pages are rendered to bitmap and processed by the OCR engine (Tesseract, HTTP server, or custom).
- Post‑OCR merging uses `overlaps_existing_text` to prevent duplication and `clean_ocr_table_artifacts` to remove noise.
- Selective OCR is **enabled by default** via `LiteParseConfig` in [`config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs).

## Frequently Asked Questions

### What determines if a page is considered text‑sparse in LiteParse?

A page is flagged as text‑sparse if it meets **any** of three criteria: fewer than 20 total characters, less than 15% of the page area covered by text bounding boxes, or the presence of any embedded raster images. These thresholds are evaluated in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) at lines 28‑44.

### How does LiteParse prevent duplicate text when merging OCR results?

The `overlaps_existing_text` function in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) performs geometric overlap detection between newly OCR’d text items and existing native text. If an OCR result occupies the same bounding‑box area as existing text, it is discarded to avoid duplication.

### Can I force OCR on every page instead of using selective OCR?

Selective OCR is the default, and OCR behavior is configured through `LiteParseConfig` in [`config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs). No dedicated option for forcing OCR on every page is described here; disabling the heuristics or adjusting their thresholds would require modifying the configuration or adding custom preprocessing. The default behavior targets only pages that meet at least one of the three heuristics.

### Which OCR engines are supported by LiteParse?

LiteParse supports **Tesseract** (when compiled with the appropriate feature flag), **HTTP OCR servers** (configured via `ocr_server_url`), and **custom engines** implementing the `OcrEngine` trait defined in [`crates/liteparse/src/ocr/mod.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/mod.rs). The engine is selected automatically based on availability and configuration.