How LiteParse's OCR Merging Algorithm Combines Native PDF Text with OCR Results

LiteParse extracts native PDF text first, then selectively runs OCR on pages with insufficient coverage, merging results back into the document structure while filtering duplicates and cleaning artifacts.

The run-llama/liteparse repository implements a deterministic OCR merging algorithm that preserves existing text layout while filling gaps with recognized content. This approach minimizes unnecessary OCR processing and prevents duplicate content in the final output.

Step 1: Determining Which Pages Need OCR

The algorithm begins by analyzing each Page to calculate text coverage. In crates/liteparse/src/ocr_merge.rs, the code sums the native text length and computes the ratio of text bounding-box area to total page area. A page triggers OCR when the text is very short, covers less than 15% of the page, or contains images.

The determination happens in render_pages_for_ocr at line 44:

let needs_ocr = native_text_len < threshold ||
                text_coverage_ratio < 0.15 ||
                page_contains_images;

This selective approach ensures OCR only runs on pages that truly need it, reducing processing time and API costs.
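
The decision above can be sketched as a small standalone function. This is an illustrative reconstruction, not the repository's code: the `BBox` and `NativeText` types, the `MIN_TEXT_LEN` value of 50 characters, and the helper names are assumptions, since the article only states that the text must be "very short" and gives the 15% coverage figure.

```rust
/// Axis-aligned box in PDF points (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug)]
struct BBox {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

/// One native text run with its bounding box (hypothetical type).
struct NativeText {
    text: String,
    bbox: BBox,
}

/// Assumed value; the article does not state the actual length threshold.
const MIN_TEXT_LEN: usize = 50;
/// Coverage threshold taken from the article (15% of the page area).
const MIN_COVERAGE: f32 = 0.15;

/// Ratio of summed text-box area to page area, clamped to 1.0 because
/// overlapping boxes would otherwise be counted twice.
fn text_coverage_ratio(items: &[NativeText], page_w: f32, page_h: f32) -> f32 {
    let page_area = page_w * page_h;
    if page_area <= 0.0 {
        return 0.0;
    }
    let text_area: f32 = items.iter().map(|i| i.bbox.width * i.bbox.height).sum();
    (text_area / page_area).min(1.0)
}

/// A page needs OCR if ANY of the three conditions holds.
fn needs_ocr(items: &[NativeText], page_w: f32, page_h: f32, has_images: bool) -> bool {
    let native_text_len: usize = items.iter().map(|i| i.text.trim().chars().count()).sum();
    native_text_len < MIN_TEXT_LEN
        || text_coverage_ratio(items, page_w, page_h) < MIN_COVERAGE
        || has_images
}
```

Because the conditions are joined with a logical OR, a text-rich page that also embeds an image is still sent to OCR; the overlap check in Step 5 is what keeps that from producing duplicates.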

Step 2: Rendering Pages to Bitmaps

For pages requiring OCR, LiteParse uses PDFium to render the page at the configured DPI into an RGB bitmap. The rendering occurs between lines 49-60 in ocr_merge.rs:

let bitmap = page_obj.render(dpi)?;
let width = bitmap.width();
let height = bitmap.height();
let rgb_bytes = bitmap.to_rgb_bytes(); // Consumed by the OCR engine

The resulting byte buffer matches the dimensions and color space expected by the OCR backend, whether using local Tesseract or a remote HTTP OCR server.
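
To make the relationship between DPI, bitmap size, and buffer length concrete, here is a minimal sketch that does not call PDFium. The function names are hypothetical, and it assumes a tightly packed buffer with 3 bytes per pixel and no row padding, which the article does not confirm.

```rust
/// Pixel dimensions of a page rendered at `dpi`, given its size in PDF
/// points (1 point = 1/72 inch).
fn render_size_px(width_pt: f32, height_pt: f32, dpi: f32) -> (u32, u32) {
    let w = (width_pt * dpi / 72.0).round() as u32;
    let h = (height_pt * dpi / 72.0).round() as u32;
    (w, h)
}

/// Expected length of a tightly packed RGB buffer (assumption: no stride padding).
fn rgb_buffer_len(width_px: u32, height_px: u32) -> usize {
    width_px as usize * height_px as usize * 3
}

/// Rejects a buffer whose length does not match its claimed dimensions
/// before it is handed to an OCR engine.
fn validate_rgb_buffer(bytes: &[u8], width_px: u32, height_px: u32) -> Result<(), String> {
    let expected = rgb_buffer_len(width_px, height_px);
    if bytes.len() == expected {
        Ok(())
    } else {
        Err(format!("expected {} bytes, got {}", expected, bytes.len()))
    }
}
```

A US Letter page (612 x 792 points) at 300 DPI comes out at 2550 x 3300 pixels, roughly 25 MB of RGB data per page, which is why the DPI setting has a direct effect on memory usage.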

Step 3: Concurrent OCR Processing

To maximize throughput, LiteParse processes multiple pages concurrently using a Tokio semaphore to cap the number of parallel workers. The implementation spawns blocking tasks that run the actual OCR recognition, keeping the async runtime responsive while the CPU-intensive work executes on separate threads.

Between lines 88-106 in ocr_merge.rs, the code creates worker tasks:

let permit = semaphore.clone().acquire_owned().await?;
tokio::task::spawn_blocking(move || {
    let _permit = permit; // Held until recognition finishes, released on drop
    // Returns OCR results with pixel-space bounding boxes
    engine.recognize(&rgb_bytes, width, height)
})

This architecture ensures thread safety by never exposing PDFium's non-Send document pointers to the worker threads: each worker receives only the rendered pixel buffer, its dimensions, and the OCR engine.
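
The same bounded-worker idea can be shown without an async runtime. The sketch below deliberately swaps Tokio's semaphore and spawn_blocking for plain std::thread workers pulling from a shared queue, so it runs with the standard library alone. `RenderedPage`, `fake_recognize`, and `ocr_pages_bounded` are hypothetical names, and the recognizer is a stand-in rather than a real OCR engine.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Owned pixel data for one page; contains no PDFium handles, so it is Send.
struct RenderedPage {
    page_index: usize,
    width: u32,
    height: u32,
    rgb_bytes: Vec<u8>,
}

/// Stand-in for an OCR engine call (hypothetical).
fn fake_recognize(page: &RenderedPage) -> String {
    format!(
        "page {} ({}x{}, {} bytes)",
        page.page_index,
        page.width,
        page.height,
        page.rgb_bytes.len()
    )
}

/// Runs recognition on at most `max_workers` pages at a time and returns
/// results sorted by page index.
fn ocr_pages_bounded(pages: Vec<RenderedPage>, max_workers: usize) -> Vec<(usize, String)> {
    let queue = Arc::new(Mutex::new(pages.into_iter()));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for _ in 0..max_workers.max(1) {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only long enough to take the next page.
            let next = queue.lock().unwrap().next();
            match next {
                Some(page) => {
                    tx.send((page.page_index, fake_recognize(&page))).unwrap();
                }
                None => break,
            }
        }));
    }
    drop(tx); // The receiver finishes once every worker has dropped its sender.

    let mut results: Vec<(usize, String)> = rx.into_iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort_by_key(|(index, _)| *index);
    results
}
```

The worker count plays the role of the semaphore's permit count: it caps how many recognitions run at once regardless of how many pages are queued.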

Step 4: Coordinate Transformation

OCR engines return bounding boxes in pixel space, which must be converted to PDF points before insertion. A PDF point is 1/72 inch and the bitmap was rendered at dpi pixels per inch, so the algorithm calculates scale_factor = 72 / dpi and applies it to each coordinate at lines 135-138:

let scale_factor = 72.0 / dpi;
let ocr_x = result.bbox[0] * scale_factor;
let ocr_y = result.bbox[1] * scale_factor;
// Width and height receive the same scaling

This transformation ensures that OCR-derived text items align precisely with native PDF content in the final layout.
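
As a worked example of the scaling, the following sketch converts a full box in one call. It assumes the OCR box is laid out as [x, y, width, height] and that pixel space and page space share the same origin; the article does not mention a vertical flip, so none is applied here. `BBox` and `pixels_to_points` are hypothetical names.

```rust
/// Box in PDF points (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

/// Converts a pixel-space box, assumed to be [x, y, width, height], to PDF points.
fn pixels_to_points(bbox_px: [f32; 4], dpi: f32) -> BBox {
    // 72 points per inch divided by `dpi` pixels per inch gives points per pixel.
    let scale_factor = 72.0 / dpi;
    BBox {
        x: bbox_px[0] * scale_factor,
        y: bbox_px[1] * scale_factor,
        width: bbox_px[2] * scale_factor,
        height: bbox_px[3] * scale_factor,
    }
}
```

At 144 DPI the factor is exactly 0.5, so a box at pixel (100, 200) with size 300 x 40 lands at point (50, 100) with size 150 x 20. At 300 DPI the factor is 0.24.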

Step 5: Deduplication and Artifact Cleaning

Before inserting any OCR result, the algorithm checks for spatial overlap with existing native TextItem objects. The overlaps_existing_text function (lines 140-142) compares bounding boxes with a small tolerance, and any OCR item that would duplicate existing content is skipped:

if overlaps_existing_text(&page.text_items, &ocr_bbox) {
    continue; // Skip duplicate region
}
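
One plausible way to implement a tolerance-based overlap test is sketched below. The 2-point tolerance, the `BBox` type, and the function names are assumptions; the article only says the tolerance is small and does not describe how the real overlaps_existing_text measures overlap.

```rust
/// Box in PDF points (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
struct BBox {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

/// Assumed tolerance in points; the real value is not given in the article.
const OVERLAP_TOLERANCE: f32 = 2.0;

/// Length of the shared span of two 1-D intervals (negative if disjoint).
fn overlap_len(a_min: f32, a_max: f32, b_min: f32, b_max: f32) -> f32 {
    a_max.min(b_max) - a_min.max(b_min)
}

/// True when the boxes share more than `tol` points on both axes, so boxes
/// that merely touch or graze each other are not treated as duplicates.
fn boxes_overlap(a: &BBox, b: &BBox, tol: f32) -> bool {
    overlap_len(a.x, a.x + a.width, b.x, b.x + b.width) > tol
        && overlap_len(a.y, a.y + a.height, b.y, b.y + b.height) > tol
}

/// True if the candidate OCR box overlaps any native text box.
fn overlaps_any_native_box(native: &[BBox], candidate: &BBox) -> bool {
    native
        .iter()
        .any(|existing| boxes_overlap(existing, candidate, OVERLAP_TOLERANCE))
}
```

The tolerance matters because OCR boxes are rarely pixel-exact: without it, an OCR word sitting immediately next to a native word could be dropped just because the two boxes share an edge.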

Next, the clean_ocr_table_artifacts function removes common OCR noise characters like |, [, and ] that frequently appear when recognizing tabular data. This cleaning happens at lines 144-147, preventing spurious characters from polluting the final text stream.
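
A minimal sketch of this kind of cleaning follows. It handles only the three characters named above and is not the repository's clean_ocr_table_artifacts; the function name, the whitespace collapsing, and the choice to return None for text that ends up empty are all assumptions.

```rust
/// Removes '|', '[' and ']' and collapses the whitespace they leave behind.
/// Returns None when nothing meaningful remains, so the caller can skip the item.
fn strip_table_artifacts(text: &str) -> Option<String> {
    let stripped: String = text
        .chars()
        .filter(|c| !matches!(c, '|' | '[' | ']'))
        .collect();
    let cleaned = stripped.split_whitespace().collect::<Vec<_>>().join(" ");
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}
```

For example, the OCR string "| Total | [42] |" becomes "Total 42", while a string made up only of border characters is discarded. A blanket filter like this would also strip legitimate brackets, such as citation markers, so a real implementation may be more selective.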

Step 6: Merging Results into Page Structure

After validation and cleaning, LiteParse constructs a new TextItem containing the cleaned OCR text, transformed coordinates, inferred font size, and confidence score. The item is appended directly to the page's text_items vector at lines 149-158:

page.text_items.push(TextItem {
    text: cleaned_text,
    bbox: transformed_bbox,
    font_size: inferred_size,
    confidence: ocr_confidence,
    // ... additional metadata
});

Following insertion, the normal layout projection continues in parser.rs (line 172), treating native and OCR-derived items identically for downstream processing.
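
The final step can be sketched with simplified types. The field layout below mirrors the snippet above but is otherwise hypothetical, and two details are assumptions: the font size is approximated by the box height in points (the article does not say how it is inferred), and native items carry no confidence value.

```rust
/// Simplified stand-in for a text item; the real struct has more metadata.
#[derive(Debug, Clone, PartialEq)]
struct TextItem {
    text: String,
    bbox: [f32; 4], // [x, y, width, height] in PDF points (assumed layout)
    font_size: f32,
    confidence: Option<f32>, // None for native text in this sketch
}

/// Simplified stand-in for a page.
struct Page {
    text_items: Vec<TextItem>,
}

/// Appends an already cleaned, already transformed OCR result to the page.
fn push_ocr_item(page: &mut Page, cleaned_text: String, bbox_pt: [f32; 4], ocr_confidence: f32) {
    // Assumption: approximate the font size by the height of the box.
    let inferred_size = bbox_pt[3];
    page.text_items.push(TextItem {
        text: cleaned_text,
        bbox: bbox_pt,
        font_size: inferred_size,
        confidence: Some(ocr_confidence),
    });
}
```

Because OCR items land in the same vector and share the same shape as native items, later layout code can process both kinds without branching on where the text came from.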

Configuration and Usage Examples

Both the Node.js and Python packages expose configuration options that map directly to the Rust core's LiteParseConfig struct.

Node.js

import { LiteParse } from "@run-llama/liteparse";

async function parseWithOcr(path: string) {
  const parser = new LiteParse({
    ocrEnabled: true,
    ocrLanguage: "eng",
    dpi: 300,
    // Optional: ocrServerUrl: "http://localhost:8000/ocr"
  });

  const result = await parser.parse(path);
  console.log(result.text);
}

Python

from liteparse import LiteParse

def parse_with_ocr(file_path: str):
    lp = LiteParse(
        ocr_enabled=True,
        ocr_language="eng",
        dpi=300
    )
    result = lp.parse(file_path)
    return result.to_markdown()

Summary

  • Selective Processing: Only pages with very little native text, <15% text coverage, or detected images undergo OCR.
  • Safe Concurrency: Tokio semaphores manage parallel OCR workers while keeping PDFium pointers thread-local.
  • Coordinate Accuracy: Pixel-space OCR results scale to PDF points using 72/dpi for precise alignment.
  • Duplicate Prevention: Spatial overlap checks with tolerance ensure OCR text never duplicates native content.
  • Noise Reduction: Automatic cleaning removes table-border artifacts like pipes and brackets.

Frequently Asked Questions

How does LiteParse decide when to use OCR versus native text?

The algorithm in crates/liteparse/src/ocr_merge.rs evaluates three conditions: total native text length, the ratio of text bounding-box area to page area (threshold of 15%), and the presence of images. If the native text is too short, covers less than 15% of the page, or the page contains images, the page enters the OCR pipeline.

What prevents duplicate text when merging OCR results?

Before inserting any OCR item, the code calls overlaps_existing_text to check if the OCR bounding box intersects with existing native text items. Overlapping regions are skipped entirely, ensuring the final output contains only unique content from the most reliable source.

How does the algorithm handle table borders and OCR noise?

The clean_ocr_table_artifacts function in ocr_merge.rs strips common OCR misreadings like vertical bars (|), square brackets ([]), and other punctuation that frequently appear when recognizing tabular data. This prevents structural characters from appearing in the extracted text content.

Can I configure the DPI for OCR rendering?

Yes. Both the Node.js and Python APIs accept a dpi parameter in the configuration options. Higher values improve OCR accuracy for small text but increase processing time and memory usage. The default configuration provides a balance suitable for most documents.
