How LiteParse Detects and Handles Text‑Sparse Pages for Selective OCR

LiteParse identifies text‑sparse pages using three heuristics—raw character count, text‑to‑page‑area coverage, and embedded images—and runs OCR only on those specific pages, merging results back while avoiding duplicates and cleaning table artifacts.

The run-llama/liteparse library optimizes PDF parsing by skipping OCR on pages that already contain native text. Instead of sending every page through an OCR engine, LiteParse analyzes each page’s content during the preparation step to decide whether optical character recognition is required. Because OCR runs only on flagged pages, documents that are mostly native text parse faster, while image‑based text and scanned pages are still captured.

The Three Heuristics for Detecting Text‑Sparse Pages

LiteParse evaluates every extracted page against three specific metrics in crates/liteparse/src/ocr_merge.rs. If any of these conditions evaluates to true, the page is flagged as needing OCR and rendered to an RGB bitmap for processing.

Raw Character Count

The system first calculates the total length of all extracted text strings on the page. If the combined character count falls below 20 characters, the page is considered text‑sparse. This catches pages containing only headers, footers, or minimal incidental text that likely represents noise rather than substantive content.
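
As a rough illustration of the character-count check, the sketch below sums the length of every text item and compares it to the threshold of 20. The TextItem struct and both helper functions are hypothetical names chosen for this example, not LiteParse's actual types. It counts Unicode scalar values with chars().count(); str::len() would count UTF-8 bytes instead, which gives a larger number for non-ASCII text.

```rust
/// Hypothetical stand-in for a native text item; not LiteParse's real type.
struct TextItem {
    text: String,
}

/// Threshold quoted in the article: fewer than 20 characters means text-sparse.
const MIN_CHARS: usize = 20;

/// Sum the number of Unicode scalar values across all text items.
fn total_chars(items: &[TextItem]) -> usize {
    items.iter().map(|t| t.text.chars().count()).sum()
}

/// True when the page's combined text is shorter than the threshold.
fn is_below_char_threshold(items: &[TextItem]) -> bool {
    total_chars(items) < MIN_CHARS
}
```

A page whose only text is a footer such as "Page 3 of 10" (12 characters) would fall below the threshold, while a single ordinary sentence would not.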

Text‑to‑Page Area Coverage

The second heuristic measures the ratio of the summed bounding‑box area of native text items to the total page area. If the coverage is less than 0.15 (15%), the page is marked for OCR. This threshold effectively identifies pages where text exists but represents a trivial portion of the layout, such as mostly image‑based slides or scanned diagrams with captions.
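
The coverage ratio can be sketched as follows. This is a minimal illustration under stated assumptions: TextBox, text_coverage, and is_below_coverage_threshold are hypothetical names, and the code takes "summed bounding‑box area" literally, so overlapping boxes are counted more than once. The actual implementation may handle overlaps or units differently.

```rust
/// Hypothetical size of one text bounding box in page units.
struct TextBox {
    width: f32,
    height: f32,
}

/// Threshold quoted in the article: under 15% coverage means text-sparse.
const MIN_COVERAGE: f32 = 0.15;

/// Ratio of summed text bounding-box area to total page area.
/// Overlapping boxes are counted twice in this simple version.
fn text_coverage(boxes: &[TextBox], page_width: f32, page_height: f32) -> f32 {
    let page_area = page_width * page_height;
    if page_area <= 0.0 {
        return 0.0; // Degenerate page: treat it as having no text coverage.
    }
    let text_area: f32 = boxes
        .iter()
        .map(|b| b.width.max(0.0) * b.height.max(0.0))
        .sum();
    text_area / page_area
}

/// True when text covers less than the threshold share of the page.
fn is_below_coverage_threshold(boxes: &[TextBox], page_width: f32, page_height: f32) -> bool {
    text_coverage(boxes, page_width, page_height) < MIN_COVERAGE
}
```

On a 100 × 100 page, one 50 × 20 caption covers 10% of the area and would be flagged; two such boxes cover 20% and would not.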

Embedded Images

Finally, LiteParse checks for the presence of any raster images on the page. If any embedded image is detected, the page automatically qualifies for OCR review. This ensures that image‑based text—such as screenshots, scanned signatures, or infographic text—is captured even when the other heuristics might not trigger.

Rendering and OCR Execution Pipeline

Once a page is identified as text‑sparse, the render_pages_for_ocr function in ocr_merge.rs (lines 49‑66) renders the page to an RGB bitmap. This bitmap is then passed to the configured OCR backend.

The core decision logic resides at lines 28‑44 of ocr_merge.rs, where the heuristics are calculated and evaluated:

// Simplified logic from ocr_merge.rs#L28-L44
// Note: in this simplified form, `len()` counts UTF-8 bytes rather than characters.
let text_len: usize = page.text_items.iter().map(|t| t.text.len()).sum();
let coverage = calculate_text_coverage(&page.text_items, page.width, page.height);
let has_images = !page.images.is_empty();

if text_len < 20 || coverage < 0.15 || has_images {
    pages_needing_ocr.push(page);
}
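
To see how the three conditions combine, here is a self-contained sketch that records which heuristic fired for a page. PageMetrics, OcrReason, ocr_reasons, and needs_ocr are hypothetical names for illustration; only the thresholds (20 characters, 0.15 coverage, any image) come from the description above.

```rust
/// Hypothetical per-page metrics; field names are illustrative only.
struct PageMetrics {
    text_len: usize,
    coverage: f32,
    has_images: bool,
}

/// Which heuristic flagged the page.
#[derive(Debug, PartialEq)]
enum OcrReason {
    FewCharacters,
    LowCoverage,
    HasImages,
}

/// Collect every heuristic that fires; an empty result means no OCR is needed.
fn ocr_reasons(m: &PageMetrics) -> Vec<OcrReason> {
    let mut reasons = Vec::new();
    if m.text_len < 20 {
        reasons.push(OcrReason::FewCharacters);
    }
    if m.coverage < 0.15 {
        reasons.push(OcrReason::LowCoverage);
    }
    if m.has_images {
        reasons.push(OcrReason::HasImages);
    }
    reasons
}

/// A page needs OCR when at least one heuristic fires (logical OR).
fn needs_ocr(m: &PageMetrics) -> bool {
    !ocr_reasons(m).is_empty()
}
```

Because the conditions are OR'd together, a text-dense page with a single embedded logo is still flagged, with HasImages as the only reason.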

The ocr_and_merge_rendered function (lines 71‑80) executes OCR concurrently using the selected engine—Tesseract (if compiled with the tesseract feature), an HTTP OCR server (if ocr_server_url is configured), or a custom implementation of the OcrEngine trait defined in crates/liteparse/src/ocr/mod.rs.
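
The pattern of running a pluggable engine over several rendered pages at once can be sketched with standard-library threads. Everything named here is an assumption for illustration: SketchOcrEngine is a stand-in trait (the real OcrEngine trait's signature is not shown in this article and may differ), RenderedPage and ByteCountEngine are toy types, and LiteParse's own concurrency mechanism may be async rather than thread-based.

```rust
use std::thread;

/// Hypothetical rendered page: RGB pixels plus the index of the source page.
struct RenderedPage {
    page_index: usize,
    rgb: Vec<u8>,
}

/// Stand-in engine trait for this sketch only; not LiteParse's real `OcrEngine`.
trait SketchOcrEngine: Sync {
    fn recognize(&self, page: &RenderedPage) -> Vec<String>;
}

/// Toy engine that reports the bitmap size instead of recognizing text.
struct ByteCountEngine;

impl SketchOcrEngine for ByteCountEngine {
    fn recognize(&self, page: &RenderedPage) -> Vec<String> {
        vec![format!("{} bytes", page.rgb.len())]
    }
}

/// Run the engine on every rendered page, one scoped thread per page,
/// and return (page_index, recognized lines) in input order.
fn ocr_pages(engine: &dyn SketchOcrEngine, pages: &[RenderedPage]) -> Vec<(usize, Vec<String>)> {
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = pages
            .iter()
            .map(|p| s.spawn(move || (p.page_index, engine.recognize(p))))
            .collect();
        // Join in spawn order so results line up with the input pages.
        handles
            .into_iter()
            .map(|h| h.join().expect("OCR thread panicked"))
            .collect()
    })
}
```

Spawning one thread per page keeps the sketch short; a production pipeline would more likely cap concurrency, since OCR is CPU- and memory-intensive.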

Merging OCR Results Without Duplication

After OCR extracts text from the rendered bitmap, LiteParse merges these new text items back into the page’s existing text_items list. This process includes two critical cleanup steps to ensure data integrity:

Overlap Detection: The overlaps_existing_text function checks whether newly recognized text geometrically overlaps with existing native text elements. If overlap is detected, the duplicate OCR result is discarded to prevent double‑counting text that was already present in the PDF’s text layer.
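
One common way to implement this kind of geometric check is an axis-aligned rectangle intersection test. The sketch below is an assumption about the general technique, not the body of overlaps_existing_text: Rect, overlaps_any, drop_duplicates, and the min_ratio tuning parameter are all hypothetical.

```rust
/// Hypothetical axis-aligned rectangle; not LiteParse's real type.
#[derive(Clone, Copy, Debug)]
struct Rect {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

impl Rect {
    fn area(&self) -> f32 {
        self.width * self.height
    }

    /// Area shared with `other`, or 0.0 when the rectangles do not intersect.
    fn intersection_area(&self, other: &Rect) -> f32 {
        let w = (self.x + self.width).min(other.x + other.width) - self.x.max(other.x);
        let h = (self.y + self.height).min(other.y + other.height) - self.y.max(other.y);
        if w > 0.0 && h > 0.0 { w * h } else { 0.0 }
    }
}

/// True when `candidate` overlaps any native box by more than `min_ratio`
/// of the candidate's own area. `min_ratio` is an assumed tuning knob.
fn overlaps_any(candidate: &Rect, native: &[Rect], min_ratio: f32) -> bool {
    let area = candidate.area();
    if area <= 0.0 {
        return false;
    }
    native
        .iter()
        .any(|n| candidate.intersection_area(n) / area > min_ratio)
}

/// Keep only the OCR boxes that do not duplicate native text.
fn drop_duplicates(ocr: Vec<Rect>, native: &[Rect], min_ratio: f32) -> Vec<Rect> {
    ocr.into_iter()
        .filter(|r| !overlaps_any(r, native, min_ratio))
        .collect()
}
```

Using a ratio rather than "any intersection at all" tolerates OCR boxes that merely touch a neighboring native line; whether LiteParse applies such a tolerance is not stated here.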

Artifact Cleaning: The clean_ocr_table_artifacts function removes common OCR noise such as table borders and grid lines that Tesseract often interprets as text characters (like pipes | or dashes -). This ensures that structured data extraction downstream receives clean, accurate text.
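
A simple version of this kind of cleanup can be written as a character filter. The sketch below is illustrative only: is_table_artifact and clean_ocr_lines are hypothetical helpers, and the set of characters treated as table rules is an assumption, not the rule set used by clean_ocr_table_artifacts.

```rust
/// True when a recognized string contains only characters typical of
/// table rules (pipes, dashes, underscores, equals, plus) and whitespace.
fn is_table_artifact(text: &str) -> bool {
    let trimmed = text.trim();
    !trimmed.is_empty()
        && trimmed
            .chars()
            .all(|c| matches!(c, '|' | '-' | '_' | '=' | '+') || c.is_whitespace())
}

/// Drop pure-artifact and empty lines, and strip border pipes and
/// whitespace from both ends of the lines that remain.
fn clean_ocr_lines(lines: &[&str]) -> Vec<String> {
    lines
        .iter()
        .copied()
        .filter(|l| !is_table_artifact(l))
        .map(|l| l.trim_matches(|c: char| c == '|' || c.is_whitespace()).to_string())
        .filter(|l| !l.is_empty())
        .collect()
}
```

A filter this blunt also discards a lone "-" used as a placeholder value in a table cell, so a real cleaner would need to weigh that trade-off.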

Configuration and Usage Examples

Selective OCR is enabled by default in LiteParseConfig (defined in crates/liteparse/src/config.rs). The following examples demonstrate how to parse documents with the default selective OCR behavior across different language bindings.

Rust

use liteparse::LiteParse;
use liteparse::config::LiteParseConfig;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // OCR enabled by default—only text‑sparse pages trigger OCR
    let config = LiteParseConfig::default();
    let parser = LiteParse::new(config);

    // Parse PDF; OCR runs automatically on pages meeting heuristics
    let result = parser.parse("sample.pdf").await?;
    println!("Processed {} pages", result.pages.len());
    Ok(())
}

Node.js / TypeScript

import { LiteParse, LiteParseConfig } from "liteparse";

async function extractDocument() {
  const config = new LiteParseConfig(); // OCR enabled by default
  const parser = new LiteParse(config);
  
  const result = await parser.parse("sample.pdf");
  console.log(`Extracted text from ${result.pages.length} pages`);
  console.log(result.text);
}

// Surface parse failures instead of leaving the promise rejection unhandled
extractDocument().catch(console.error);

Python

from liteparse import LiteParse, LiteParseConfig

# Initialize with default config; selective OCR is active
config = LiteParseConfig()
parser = LiteParse(config)

# Parse with automatic sparse-page detection
result = parser.parse("sample.pdf")
print(f"Pages processed: {len(result.pages)}")
print(result.text)

The parser.rs file orchestrates the extraction flow, while projection.rs handles the final text layout after OCR results are integrated, ensuring that text coordinates align correctly with the original document structure.

Summary

  • LiteParse uses three heuristics to detect text‑sparse pages: character count (< 20), text coverage (< 15%), and presence of embedded images.
  • The decision logic lives in ocr_merge.rs (lines 28‑44), with rendering handled at lines 49‑66.
  • Only flagged pages are rendered to bitmap and processed by the OCR engine (Tesseract, HTTP server, or custom).
  • Post‑OCR merging uses overlaps_existing_text to prevent duplication and clean_ocr_table_artifacts to remove noise.
  • Selective OCR is enabled by default via LiteParseConfig in config.rs.

Frequently Asked Questions

What determines if a page is considered text‑sparse in LiteParse?

A page is flagged as text‑sparse if it meets any of three criteria: fewer than 20 total characters, less than 15% of the page area covered by text bounding boxes, or the presence of any embedded raster images. These thresholds are evaluated in ocr_merge.rs at lines 28‑44.

How does LiteParse prevent duplicate text when merging OCR results?

The overlaps_existing_text function in ocr_merge.rs performs geometric overlap detection between newly OCR’d text items and existing native text. If an OCR result’s bounding box overlaps text already present in the PDF’s text layer, the OCR result is discarded to avoid duplication.

Can I force OCR on every page instead of using selective OCR?

Not through a dedicated setting described in this article. Selective OCR is the default, and OCR behavior is configured through LiteParseConfig in config.rs. Forcing OCR on every page, disabling the heuristics, or adjusting the thresholds would mean changing that configuration or adding custom preprocessing, so check config.rs for the options your version exposes.

Which OCR engines are supported by LiteParse?

LiteParse supports Tesseract (when compiled with the appropriate feature flag), HTTP OCR servers (configured via ocr_server_url), and custom engines implementing the OcrEngine trait defined in crates/liteparse/src/ocr/mod.rs. The engine is selected automatically based on availability and configuration.
