# How LiteParse Handles Embedded Images Within PDFs During OCR

> How LiteParse handles embedded images in PDFs: OCR runs only on pages where native text extraction falls short or embedded images are present.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**LiteParse treats embedded images as a conditional OCR workload that only activates when native PDF text extraction is insufficient, rendering flagged pages to PNG and merging OCR output with existing text while removing overlaps and table artifacts.**

LiteParse, the open-source PDF parsing library from run-llama, implements a selective approach for handling **embedded images within PDFs during OCR**. Instead of processing every page through optical character recognition, the engine evaluates each page's native text content and image density to determine whether OCR is necessary, ensuring efficient processing of mixed-format documents.

## Detecting Image-Rich Pages and OCR Necessity

LiteParse evaluates every page through a two-phase detection system before invoking resource-intensive OCR operations.

### Image Boundary Detection

For each rendered page, LiteParse asks the PDFium wrapper for image bounding boxes via `page.image_bounds(…)`. If the call returns at least one bounding box, the page is flagged as containing embedded images. This check occurs in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) at line 44, where the boolean `has_images` records whether the page contains any embedded images.

### The Decision Logic for OCR Activation

OCR activation follows a threshold-based heuristic defined in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) lines 44-45. The engine sets `needs_ocr` to true when either of two conditions is met:

- The page contains minimal native text (`text_length < 20` or `text_coverage < 0.15`)
- The page contains embedded images detected in the previous step

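This logic ensures that purely scanned documents and pages with embedded images receive OCR treatment, while text-native pages with no embedded images bypass the rasterization stage entirely. Note that the image condition is independent of the text thresholds: a page with plenty of native text is still flagged if it contains even one embedded image.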
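
The decision rule can be sketched in a few lines of Python. This is a minimal illustration of the heuristic as described above, not the code in `ocr_merge.rs`: the `PageStats` type, its field names, and the constant names are hypothetical, and only the two thresholds and the `has_images` / `needs_ocr` names come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Thresholds quoted in the text above; the constant names are illustrative.
MIN_TEXT_LENGTH = 20
MIN_TEXT_COVERAGE = 0.15


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a LiteParse type."""
    text_length: int        # characters of native text on the page
    text_coverage: float    # fraction of the page area covered by text (0.0 to 1.0)
    # Bounding boxes of embedded images as (x0, y0, x1, y1); empty if none.
    image_bounds: List[Tuple[float, float, float, float]] = field(default_factory=list)


def needs_ocr(page: PageStats) -> bool:
    """Return True when native text is sparse or any embedded image is present."""
    has_images = len(page.image_bounds) > 0
    sparse_text = (
        page.text_length < MIN_TEXT_LENGTH
        or page.text_coverage < MIN_TEXT_COVERAGE
    )
    return sparse_text or has_images
```

Under these assumptions, a text-rich page with no images returns `False`, while the same page with a single image bounding box returns `True`.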

## Rendering and OCR Processing Pipeline

Once a page is flagged for OCR, LiteParse converts the PDF content into a rasterized format suitable for text recognition engines.

### Rasterizing Pages to PNG

Flagged pages are rendered to PNG byte buffers through the [`render.rs`](https://github.com/run-llama/liteparse/blob/main/render.rs) module. The system creates a `RenderedPage` struct whose `png_bytes` field is produced by the `encode_png` function. The rasterized page preserves the visual layout of embedded images in a format that OCR engines can process, so the PNG buffer acts as the bridge between the PDF's vector content and pixel-based OCR analysis.
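
To make the "PNG byte buffer" concrete, here is a minimal, standard-library-only sketch of what an `encode_png`-style helper does for 8-bit grayscale pixels. It is an illustration under stated assumptions, not LiteParse's implementation: the function signature, the extra `RenderedPage` fields, and `render_blank_page` (a stand-in for real PDFium rendering) are hypothetical. Only the names `RenderedPage`, `png_bytes`, and `encode_png` come from the text.

```python
import struct
import zlib
from dataclasses import dataclass

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def encode_png(width: int, height: int, gray: bytes) -> bytes:
    """Encode row-major 8-bit grayscale pixels as a PNG byte buffer."""
    if len(gray) != width * height:
        raise ValueError("pixel buffer does not match width * height")
    # IHDR: width, height, bit depth 8, color type 0 (grayscale),
    # compression 0, filter 0, interlace 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    rows = b"".join(
        b"\x00" + gray[y * width:(y + 1) * width] for y in range(height)
    )
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


@dataclass
class RenderedPage:
    """Only `png_bytes` is named in the text; the other fields are assumed."""
    page_index: int
    width: int
    height: int
    png_bytes: bytes


def render_blank_page(page_index: int, width: int, height: int) -> RenderedPage:
    """Stand-in for real rendering: produce an all-white page of the given size."""
    pixels = bytes([255]) * (width * height)
    return RenderedPage(page_index, width, height, encode_png(width, height, pixels))
```

Whatever the real renderer does internally, the OCR stage only ever sees the resulting bytes, which is why any engine that accepts PNG input can be plugged in.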

### Engine Execution and Result Merging

The PNG bytes are passed to the configured `OcrEngine` implementation. LiteParse supports two primary engines:

- **Built-in Tesseract**: Located in [`ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/tesseract.rs), this implementation requires the Tesseract binary to be installed on the system
- **HTTP Client**: Found in [`ocr/http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/http_simple.rs), this client enables integration with remote OCR services

After the engine returns OCR results, the `ocr_and_merge_rendered` function (lines 72-80 in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs)) coordinates text integration. This process includes:

1. **Overlap detection**: The `overlaps_existing_text` function (lines 167-170) identifies and removes OCR text that duplicates existing PDF text layers
2. **Table artifact cleaning**: The `clean_ocr_table_artifacts` function (lines 191-196) sanitizes formatting characters that commonly appear in OCR output from tables
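
The two merge steps can be sketched as follows. This is a simplified illustration, not the logic in `ocr_merge.rs`: the `TextItem` type, the 0.5 overlap threshold, the set of characters treated as table noise, and the `merge_ocr` helper are all assumptions made for the example. Only the function names `overlaps_existing_text` and `clean_ocr_table_artifacts` come from the text.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class TextItem:
    """Hypothetical text fragment with an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_ratio(a: TextItem, b: TextItem) -> float:
    """Intersection area divided by the area of `a` (0.0 if there is none)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    area = (a.x1 - a.x0) * (a.y1 - a.y0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def overlaps_existing_text(item: TextItem, native: List[TextItem],
                           threshold: float = 0.5) -> bool:
    """True when the OCR box is mostly covered by some native text box."""
    return any(overlap_ratio(item, n) >= threshold for n in native)


# Assumed noise set: whitespace, pipes, underscores, equals signs, hyphens.
_EDGE_NOISE = re.compile(r"^[\s|_=\-]+|[\s|_=\-]+$")


def clean_ocr_table_artifacts(text: str) -> str:
    """Strip border-like characters from both ends and collapse whitespace."""
    return re.sub(r"\s+", " ", _EDGE_NOISE.sub("", text))


def merge_ocr(native: List[TextItem], ocr_items: List[TextItem]) -> List[TextItem]:
    """Keep native text, then append cleaned OCR items that add new content."""
    merged = list(native)
    for item in ocr_items:
        text = clean_ocr_table_artifacts(item.text)
        if not text or overlaps_existing_text(item, native):
            continue  # pure border noise, or a duplicate of the PDF text layer
        merged.append(TextItem(text, item.x0, item.y0, item.x1, item.y1))
    return merged
```

In this sketch native text always wins over overlapping OCR output, on the assumption that the PDF text layer is more reliable than recognized text for the same region.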

## OCR Engine Architecture and Configuration

LiteParse abstracts OCR functionality behind the `OcrEngine` trait defined in [`ocr/mod.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/mod.rs), so the parser can switch between local and remote engines without changes to the core pipeline.

### The OcrEngine Trait

The trait defines a standard interface where implementations receive a PNG byte slice (`&[u8]`) and return a vector of `OcrItem` structs. This abstraction enables the parser core to remain engine-agnostic while supporting diverse deployment scenarios from edge devices to cloud-based services.
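
The shape of this interface can be mirrored in Python with a `typing.Protocol`. This is an analogy to the Rust trait, not a binding to it: the `OcrItem` fields, the `FixedTextEngine` stand-in, and the `run_ocr` helper are hypothetical, and only the "PNG bytes in, list of `OcrItem` out" contract is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class OcrItem:
    """Assumed layout: recognized text, pixel bounding box, confidence."""
    text: str
    bbox: Tuple[float, float, float, float]
    confidence: float


class OcrEngine(Protocol):
    """Anything with this method can be used as an engine."""

    def recognize(self, png_bytes: bytes) -> List[OcrItem]: ...


class FixedTextEngine:
    """Stand-in engine that returns a canned result; useful for tests."""

    def __init__(self, text: str) -> None:
        self.text = text

    def recognize(self, png_bytes: bytes) -> List[OcrItem]:
        # Reject input that does not start with the PNG signature.
        if not png_bytes.startswith(b"\x89PNG\r\n\x1a\n"):
            raise ValueError("expected PNG bytes")
        return [OcrItem(self.text, (0.0, 0.0, 0.0, 0.0), 1.0)]


def run_ocr(engine: OcrEngine, png_bytes: bytes) -> str:
    """Engine-agnostic caller: depends only on the `recognize` contract."""
    return " ".join(item.text for item in engine.recognize(png_bytes))
```

Because `run_ocr` depends only on the protocol, swapping a local engine for a remote one requires no change to the calling code, which is the property the trait provides on the Rust side.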

### Configuration Activation

OCR behavior is controlled through the `ocr.enabled` flag in [`config.rs`](https://github.com/run-llama/liteparse/blob/main/config.rs). When enabled, the [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs) module instantiates the appropriate engine during initialization and routes flagged pages through the complete pipeline. The flag toggles OCR globally; the per-page selection logic still applies when it is on, so enabling OCR does not force every page through recognition.
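
The interaction between the global flag and the per-page decision can be sketched like this. The `Config` and `OcrConfig` classes below are simplified stand-ins for the real configuration types, the default of `False` is an assumption, and `pages_to_ocr` is a hypothetical helper written for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OcrConfig:
    enabled: bool = False  # default value is assumed for this sketch


@dataclass
class Config:
    ocr: OcrConfig = field(default_factory=OcrConfig)


def pages_to_ocr(config: Config, page_flags: List[bool]) -> List[int]:
    """Return indices of pages that would be sent through OCR.

    `page_flags[i]` is the per-page decision (True when page i was flagged).
    The global switch is checked first; when it is off, no page is processed.
    """
    if not config.ocr.enabled:
        return []
    return [i for i, flagged in enumerate(page_flags) if flagged]
```

With OCR disabled the result is always empty; with it enabled, only the flagged indices are returned.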

## Implementation Examples

The following examples demonstrate enabling OCR for documents containing embedded images across different language bindings.

### Node.js

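```typescript
import { LiteParse } from "liteparse";

(async () => {
  // Enable OCR globally; the per-page heuristic still decides which pages are processed.
  const parser = new LiteParse({
    ocr: { enabled: true },
    ocrEngine: "tesseract", // use the built-in Tesseract engine
  });

  // Parse a mixed document and print the merged native and OCR text.
  const result = await parser.parse("invoice-with-scanned-pages.pdf");
  console.log(result.text);
})();
```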

### Python

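```python
from liteparse import LiteParse

# Enable OCR globally and select the built-in Tesseract engine.
parser = LiteParse(
    ocr={"enabled": True},
    ocr_engine="tesseract",
)

# Pages flagged by the per-page heuristic are rasterized and sent through OCR.
result = parser.parse("contract_scanned.pdf")
print(result.text)
```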

### Rust

```rust
use liteparse::{LiteParse, LiteParseConfig};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Turn on OCR and keep every other setting at its default.
    let cfg = LiteParseConfig {
        ocr: liteparse::config::OcrConfig { enabled: true, ..Default::default() },
        ..Default::default()
    };
    let parser = LiteParse::new(cfg);
    // Parse the file; flagged pages go through the OCR pipeline.
    let parse_result = parser.parse_path("handwritten_form.pdf").await?;
    println!("{}", parse_result.text);
    Ok(())
}
```

## Summary

- LiteParse detects embedded images using `page.image_bounds(…)` in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) before deciding whether to invoke OCR
- OCR triggers only when native text is insufficient (`text_length < 20` or `text_coverage < 0.15`) or images are present
- Flagged pages rasterize to PNG via `RenderedPage::png_bytes` in [`render.rs`](https://github.com/run-llama/liteparse/blob/main/render.rs) for processing by the selected engine
- The `OcrEngine` trait abstracts both local Tesseract ([`ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/tesseract.rs)) and remote HTTP ([`ocr/http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/http_simple.rs)) implementations
- Post-OCR merging removes duplicate text (`overlaps_existing_text`) and cleans table artifacts (`clean_ocr_table_artifacts`)

## Frequently Asked Questions

### When does LiteParse decide to run OCR on a PDF page?

LiteParse runs OCR when a page contains embedded images detected via `page.image_bounds(…)` or when native text extraction yields fewer than 20 characters or covers less than 15% of the page area. This selective approach ensures OCR only processes pages where visual text extraction is necessary.

### What image format does LiteParse use for OCR processing?

LiteParse renders flagged PDF pages to PNG byte buffers using the `encode_png` function in [`render.rs`](https://github.com/run-llama/liteparse/blob/main/render.rs). The `RenderedPage` struct stores these bytes in `png_bytes`, which the `OcrEngine` implementation receives as a `&[u8]` slice for text recognition.

### How does LiteParse prevent duplicate text from OCR and native PDF layers?

The `ocr_and_merge_rendered` function in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) calls `overlaps_existing_text` (lines 167-170) to detect and filter out OCR results that duplicate existing PDF text. Additionally, `clean_ocr_table_artifacts` (lines 191-196) removes formatting noise common in scanned tables.

### Can I use a custom OCR service instead of Tesseract?

Yes. LiteParse supports custom OCR services through the HTTP client implementation in [`ocr/http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/http_simple.rs). The `OcrEngine` trait in [`ocr/mod.rs`](https://github.com/run-llama/liteparse/blob/main/ocr/mod.rs) abstracts the interface, allowing you to implement custom clients that conform to the `recognize(&[u8]) -> Vec<OcrItem>` signature while the core logic in [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs) handles the orchestration.