# How LiteParse's OCR Merging Algorithm Combines Native PDF Text with OCR Results

> Discover how LiteParse's OCR merging algorithm combines native PDF text and OCR results efficiently. Learn about selective OCR, duplicate filtering, and artifact cleaning for superior document processing.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: internals
- Published: 2026-05-31

---

**LiteParse extracts native PDF text first, then selectively runs OCR on pages with insufficient coverage, merging results back into the document structure while filtering duplicates and cleaning artifacts.**

The `run-llama/liteparse` repository implements a deterministic **OCR merging algorithm** that preserves existing text layout while filling gaps with recognized content. This approach minimizes unnecessary OCR processing and prevents duplicate content in the final output.

## Step 1: Determining Which Pages Need OCR

The algorithm begins by analyzing each `Page` to calculate text coverage. In [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), the code sums the native text length and computes the ratio of text bounding-box area to total page area. A page triggers OCR when the text is very short, covers less than **15%** of the page, or contains images.

The determination happens in `render_pages_for_ocr` at line 44:

```rust
let needs_ocr = native_text_len < threshold ||
                text_coverage_ratio < 0.15 ||
                page_contains_images;
```

This selective approach ensures OCR only runs on pages that truly need it, reducing processing time and API costs.
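The decision rule can be sketched as a small pure function. This is a minimal illustration, not the repository's code: the `PageStats` struct, its field names, and the `MIN_TEXT_LEN` value of 50 characters are assumptions made for the example. Only the 15% coverage threshold and the three conditions come from the description above.

```rust
/// Hypothetical per-page statistics; the field names are illustrative only.
struct PageStats {
    native_text_len: usize,
    text_bbox_area: f64, // summed area of native text boxes, in square points
    page_area: f64,      // page width times height, in square points
    contains_images: bool,
}

/// Assumed minimum character count; the real threshold is not stated here.
const MIN_TEXT_LEN: usize = 50;
/// Coverage threshold described above: 15% of the page area.
const MIN_COVERAGE: f64 = 0.15;

fn needs_ocr(stats: &PageStats) -> bool {
    // Guard against a degenerate page so the ratio never divides by zero.
    let coverage = if stats.page_area > 0.0 {
        stats.text_bbox_area / stats.page_area
    } else {
        0.0
    };
    stats.native_text_len < MIN_TEXT_LEN
        || coverage < MIN_COVERAGE
        || stats.contains_images
}
```

Because the conditions are joined with `||`, any single one is enough to send a page to OCR: a text-dense page that also embeds an image still qualifies.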

## Step 2: Rendering Pages to Bitmaps

For pages requiring OCR, LiteParse uses PDFium to render the page at the configured DPI into an RGB bitmap. The rendering occurs at lines 49-60 of [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs):

```rust
let bitmap = page_obj.render(dpi)?;
let width = bitmap.width();
let height = bitmap.height();
let rgb_bytes = bitmap.to_rgb_bytes(); // Consumed by the OCR engine
```

The resulting byte buffer matches the dimensions and color space expected by the OCR backend, whether using local Tesseract or a remote HTTP OCR server.
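To make the effect of the DPI setting concrete, the sketch below computes the bitmap size and buffer length for a page. The function names are hypothetical, rounding up to whole pixels is an assumption (PDFium's own rounding may differ), and the buffer is assumed to be tightly packed at 3 bytes per pixel with no row padding.

```rust
/// Pixel dimensions of a page rendered at `dpi`, given its size in PDF points
/// (1 point = 1/72 inch). Rounding up is an assumption of this sketch.
fn bitmap_dimensions(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    let width_px = (width_pt * dpi / 72.0).ceil() as u32;
    let height_px = (height_pt * dpi / 72.0).ceil() as u32;
    (width_px, height_px)
}

/// Length of a tightly packed RGB buffer: 3 bytes per pixel, no row padding.
fn rgb_buffer_len(width_px: u32, height_px: u32) -> usize {
    width_px as usize * height_px as usize * 3
}
```

A US Letter page (612 x 792 points) at 300 DPI comes out to 2550 x 3300 pixels, or about 25 MB of RGB data per page, which is why higher DPI settings raise memory use.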

## Step 3: Concurrent OCR Processing

To maximize throughput, LiteParse processes multiple pages concurrently using a Tokio semaphore to cap the number of parallel workers. The implementation spawns blocking tasks that run the actual OCR recognition, keeping the async runtime responsive while the CPU-intensive work executes on separate threads.

At lines 88-106 of [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), the code creates worker tasks. In simplified form:

```rust
// Wait for a free slot; the permit is released when it is dropped.
let permit = semaphore.clone().acquire_owned().await?;
let handle = tokio::task::spawn_blocking(move || {
    let _permit = permit; // Hold the slot until recognition finishes
    // Returns OCR results with pixel-space bounding boxes
    engine.recognize(&rgb_bytes, width, height)
});
```

This architecture ensures thread safety by never exposing PDFium's non-`Send` document pointers to the worker threads.
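The bounded-concurrency idea can also be shown with only the standard library. The sketch below is not how LiteParse does it (the article describes a Tokio semaphore with `spawn_blocking`); it swaps in a `Mutex`/`Condvar` counting semaphore and scoped threads so the example has no dependencies. `recognize_stub` is a stand-in for a real OCR engine call, and all names are hypothetical.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from a Mutex and a Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Stand-in for an OCR call: reports how many bytes it was given.
fn recognize_stub(rgb_bytes: &[u8]) -> String {
    format!("{} bytes", rgb_bytes.len())
}

/// Runs the stub over every page with at most `max_workers` recognitions
/// in flight at once. Results come back in page order.
fn ocr_all(pages: &[Vec<u8>], max_workers: usize) -> Vec<String> {
    let semaphore = Semaphore::new(max_workers);
    thread::scope(|scope| {
        let handles: Vec<_> = pages
            .iter()
            .map(|page| {
                let semaphore = &semaphore;
                scope.spawn(move || {
                    semaphore.acquire();
                    let text = recognize_stub(page);
                    semaphore.release();
                    text
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Only the rendered byte buffers cross into the worker threads here, which mirrors the point above: the workers never need a handle to the document itself.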

## Step 4: Coordinate Transformation

OCR engines return bounding boxes in pixel space, which must be converted to PDF points before insertion. A PDF point is 1/72 of an inch and the bitmap was rendered at `dpi` pixels per inch, so the scale factor is `scale_factor = 72 / dpi`. The code applies it to each coordinate at lines 135-138:

```rust
let scale_factor = 72.0 / dpi;
let ocr_x = result.bbox[0] * scale_factor;
let ocr_y = result.bbox[1] * scale_factor;
// Width and height receive the same scaling
```

This transformation ensures that OCR-derived text items align precisely with native PDF content in the final layout.
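As a worked example, the helper below scales a whole bounding box in one step. The function name and the `[x, y, width, height]` layout are assumptions of this sketch. It only scales; any origin or axis-direction adjustment is out of scope.

```rust
/// Converts an OCR bounding box from pixels to PDF points.
/// The `[x, y, width, height]` layout is an assumption of this sketch.
fn pixels_to_points(bbox_px: [f64; 4], dpi: f64) -> [f64; 4] {
    let scale_factor = 72.0 / dpi; // points per pixel
    bbox_px.map(|v| v * scale_factor)
}
```

At 300 DPI the scale factor is 0.24, so a box at pixel (300, 600) with a size of 150 x 30 pixels lands at roughly (72, 144) points with a size of about 36 x 7.2 points.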

## Step 5: Deduplication and Artifact Cleaning

Before inserting any OCR result, the algorithm checks for spatial overlap with existing native `TextItem` objects. The `overlaps_existing_text` function (lines 140-142) compares bounding boxes with a small tolerance and skips any OCR item that would duplicate existing content:

```rust
if overlaps_existing_text(&page.text_items, &ocr_bbox) {
    continue; // Skip duplicate region
}
```

Next, the `clean_ocr_table_artifacts` function removes common OCR noise characters like `|`, `[`, and `]` that frequently appear when recognizing tabular data. This cleaning happens at lines 144-147, preventing spurious characters from polluting the final text stream.
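Both checks can be sketched in a few lines. The names below (`BBox`, `boxes_overlap`, `overlaps_any`, `strip_table_artifacts`) are hypothetical and deliberately differ from the repository's functions, since the real signatures and the exact tolerance value are not shown in this article. The whitespace collapsing in the cleaner is also an assumption.

```rust
/// Axis-aligned box in PDF points. Field names are illustrative.
#[derive(Clone, Copy)]
struct BBox {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// True when `a` and `b` intersect by more than `tolerance` points on both
/// axes, so boxes that merely touch at an edge do not count as overlapping.
fn boxes_overlap(a: BBox, b: BBox, tolerance: f64) -> bool {
    a.x + tolerance < b.x + b.width
        && b.x + tolerance < a.x + a.width
        && a.y + tolerance < b.y + b.height
        && b.y + tolerance < a.y + a.height
}

/// True when the candidate overlaps any existing box.
fn overlaps_any(existing: &[BBox], candidate: BBox, tolerance: f64) -> bool {
    existing.iter().any(|&b| boxes_overlap(b, candidate, tolerance))
}

/// Drops the table-border characters named above and collapses the
/// whitespace they leave behind.
fn strip_table_artifacts(text: &str) -> String {
    let without: String = text
        .chars()
        .filter(|&c| !matches!(c, '|' | '[' | ']'))
        .collect();
    without.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

The tolerance matters for adjacent table cells and neighboring words: without it, an OCR box that shares an edge with a native text box would be discarded even though it carries new content.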

## Step 6: Merging Results into Page Structure

After validation and cleaning, LiteParse constructs a new `TextItem` containing the cleaned OCR text, transformed coordinates, inferred font size, and confidence score. The item is appended directly to the page's `text_items` vector at lines 149-158:

```rust
page.text_items.push(TextItem {
    text: cleaned_text,
    bbox: transformed_bbox,
    font_size: inferred_size,
    confidence: ocr_confidence,
    // ... additional metadata
});
```

Following insertion, the normal layout projection continues in [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs) (line 172), treating native and OCR-derived items identically for downstream processing.
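The append step can be illustrated with a stripped-down item type. Everything here is a sketch under stated assumptions: the struct mirrors only the fields named above, using the box height as the font size is a guess (the article does not say how the size is inferred), modeling confidence as an `Option` is this example's choice, and skipping items whose text is blank after cleaning is an assumed safeguard rather than documented behavior.

```rust
/// Illustrative text item; the fields mirror the ones named above.
#[derive(Debug, Clone, PartialEq)]
struct TextItem {
    text: String,
    bbox: [f64; 4], // [x, y, width, height] in PDF points (assumed layout)
    font_size: f64,
    confidence: Option<f64>, // None for native text in this sketch
}

/// Builds an OCR-derived item. Using the box height as the font size is an
/// assumption of this sketch.
fn ocr_text_item(text: &str, bbox: [f64; 4], confidence: f64) -> TextItem {
    TextItem {
        text: text.to_string(),
        bbox,
        font_size: bbox[3],
        confidence: Some(confidence),
    }
}

/// Appends OCR items to the page's list, skipping any whose text is blank.
fn append_ocr_items(text_items: &mut Vec<TextItem>, ocr_items: Vec<TextItem>) {
    text_items.extend(
        ocr_items
            .into_iter()
            .filter(|item| !item.text.trim().is_empty()),
    );
}
```

Because native and OCR-derived items share one type and one vector, later stages can sort or group them by position without caring where each item came from.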

## Configuration and Usage Examples

Both the Node.js and Python packages expose configuration options that map directly to the Rust core's `LiteParseConfig` struct.

### Node.js

```typescript
import { LiteParse } from "@run-llama/liteparse";

async function parseWithOcr(path: string) {
  const parser = new LiteParse({
    ocrEnabled: true,
    ocrLanguage: "eng",
    dpi: 300,
    // Optional: ocrServerUrl: "http://localhost:8000/ocr"
  });

  const result = await parser.parse(path);
  console.log(result.text);
}
```

### Python

```python
from liteparse import LiteParse

def parse_with_ocr(file_path: str):
    lp = LiteParse(
        ocr_enabled=True,
        ocr_language="eng",
        dpi=300
    )
    result = lp.parse(file_path)
    return result.to_markdown()
```

## Summary

- **Selective Processing**: Only pages with very short native text, less than 15% text coverage, or detected images undergo OCR.
- **Safe Concurrency**: A Tokio semaphore caps the number of parallel OCR workers, and PDFium's non-`Send` pointers never reach the worker threads.
- **Coordinate Accuracy**: Pixel-space OCR results are scaled to PDF points using `72/dpi` for precise alignment.
- **Duplicate Prevention**: Spatial overlap checks with a small tolerance ensure OCR text never duplicates native content.
- **Noise Reduction**: Automatic cleaning removes table-border artifacts like pipes and brackets.

## Frequently Asked Questions

### How does LiteParse decide when to use OCR versus native text?

The algorithm in [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) evaluates three conditions: total native text length, the ratio of text bounding-box area to page area (threshold of 15%), and the presence of images. If the native text is too short, covers too little of the page, or the page contains images, the page enters the OCR pipeline.

### What prevents duplicate text when merging OCR results?

Before inserting any OCR item, the code calls `overlaps_existing_text` to check if the OCR bounding box intersects with existing native text items. Overlapping regions are skipped entirely, ensuring the final output contains only unique content from the most reliable source.

### How does the algorithm handle table borders and OCR noise?

The `clean_ocr_table_artifacts` function in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) strips common OCR misreadings such as vertical bars (`|`) and square brackets (`[`, `]`) that frequently appear when recognizing tabular data. This prevents table-border characters from appearing in the extracted text content.

### Can I configure the DPI for OCR rendering?

Yes. Both the Node.js and Python APIs accept a `dpi` parameter in the configuration options. Higher values improve OCR accuracy for small text but increase processing time and memory usage. The default configuration provides a balance suitable for most documents.