
How LiteParse Handles Embedded Images Within PDFs During OCR

May 30, 2026 run-llama/liteparse

LiteParse treats embedded images as a conditional OCR workload: OCR runs only on pages where native PDF text extraction is insufficient or embedded images are present. Those pages are rendered to PNG, and the OCR output is merged with the existing text after overlaps and table artifacts are removed.

LiteParse, the open-source PDF parsing library from run-llama, takes a selective approach to embedded images during OCR. Instead of running every page through optical character recognition, the engine checks each page's native text content and whether it contains embedded images, then runs OCR only where needed. This keeps processing of mixed-format documents efficient.

Detecting Image-Rich Pages and OCR Necessity

LiteParse evaluates every page through a two-phase detection system before invoking resource-intensive OCR operations.

Image Boundary Detection

For each rendered page, LiteParse queries the PDFium wrapper for image bounding boxes via page.image_bounds(…). If the call returns any bounding boxes, the page is flagged as containing embedded images. This check occurs in ocr_merge.rs at line 44, where the boolean has_images records whether the page holds visual content that may contain text.

The Decision Logic for OCR Activation

OCR activation follows a threshold-based heuristic defined in ocr_merge.rs lines 44-45. The engine sets needs_ocr to true when either of two conditions is met:

The page contains minimal native text (text_length < 20 or text_coverage < 0.15)
The page contains embedded images detected in the previous step

This logic ensures that purely scanned documents and image-heavy pages receive OCR treatment, while text-native pages without embedded images bypass the rasterization stage entirely.
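The decision rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in ocr_merge.rs: the PageStats struct, its field names, and the constant names are assumptions made for the example, and only the thresholds (20 characters, 0.15 coverage) come from the description above.

```rust
/// Hypothetical per-page statistics. The field names mirror the article's
/// wording and are not the real LiteParse types.
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    /// Number of characters in the native text layer.
    pub text_length: usize,
    /// Fraction of the page area covered by native text, from 0.0 to 1.0.
    pub text_coverage: f64,
    /// Number of image bounding boxes reported for the page.
    pub image_count: usize,
}

/// Thresholds quoted in the article (constant names are assumptions).
const MIN_TEXT_LENGTH: usize = 20;
const MIN_TEXT_COVERAGE: f64 = 0.15;

/// Returns true when the page should be rasterized and sent to OCR.
pub fn needs_ocr(stats: &PageStats) -> bool {
    // Any reported bounding box marks the page as containing images.
    let has_images = stats.image_count > 0;
    // Too little native text by either measure counts as sparse.
    let sparse_text =
        stats.text_length < MIN_TEXT_LENGTH || stats.text_coverage < MIN_TEXT_COVERAGE;
    sparse_text || has_images
}
```

Under this rule a page with 500 characters, 60% coverage, and no images skips OCR, while the same page with a single embedded image is sent through it.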

Rendering and OCR Processing Pipeline

Once a page is flagged for OCR, LiteParse converts the PDF content into a rasterized format suitable for text recognition engines.

Rasterizing Pages to PNG

Flagged pages are rendered to PNG byte buffers through the render.rs module. The system creates a RenderedPage struct containing png_bytes, which is generated by the encode_png function. This rasterized representation preserves the visual layout of embedded images while creating a format that OCR engines can process. The PNG buffer acts as the bridge between the PDF's vector content and the pixel-based OCR analysis.
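As a rough illustration of the data handed to the OCR stage, the sketch below models a rendered page and adds a cheap sanity check on the buffer. Only the png_bytes field name comes from the description above; the page_index field and the looks_like_png helper are hypothetical additions for the example. The eight-byte signature is the standard PNG file header.

```rust
/// The eight-byte signature that begins every PNG file.
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Hypothetical stand-in for the RenderedPage struct.
pub struct RenderedPage {
    /// Zero-based page number (assumed field, not named in the article).
    pub page_index: usize,
    /// Encoded PNG data for the rasterized page.
    pub png_bytes: Vec<u8>,
}

impl RenderedPage {
    /// Checks that the buffer starts with the PNG signature.
    /// It does not validate the rest of the file.
    pub fn looks_like_png(&self) -> bool {
        self.png_bytes.starts_with(&PNG_SIGNATURE)
    }
}
```

A check like this can catch an empty or mis-encoded buffer before it is sent to a local binary or a remote service.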

Engine Execution and Result Merging

The PNG bytes are passed to the configured OcrEngine implementation. LiteParse supports two primary engines:

Built-in Tesseract: Located in ocr/tesseract.rs, this implementation requires the Tesseract binary installed on the system
HTTP Client: Found in ocr/http_simple.rs, this client enables integration with remote OCR services

After the engine returns OCR results, the ocr_and_merge_rendered function (lines 72-80 in ocr_merge.rs) coordinates text integration. This process includes:

Overlap detection: The overlaps_existing_text function (lines 167-170) identifies and removes OCR text that duplicates existing PDF text layers
Table artifact cleaning: The clean_ocr_table_artifacts function (lines 191-196) sanitizes formatting characters that commonly appear in OCR output from tables
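The two merge steps can be sketched as follows. The function names match those mentioned above, but the bodies are illustrative re-implementations under stated assumptions: the Rect type, the rule that an OCR box is a duplicate when more than half of its area is covered by one native text box, and the choice of '|' and '_' as table-border characters are all hypothetical, since the actual rules in ocr_merge.rs are not shown here.

```rust
/// Axis-aligned bounding box in page coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

impl Rect {
    /// Area of the intersection of two boxes, or 0.0 if they do not intersect.
    fn intersection_area(&self, other: &Rect) -> f64 {
        let w = (self.x + self.width).min(other.x + other.width) - self.x.max(other.x);
        let h = (self.y + self.height).min(other.y + other.height) - self.y.max(other.y);
        if w > 0.0 && h > 0.0 { w * h } else { 0.0 }
    }
}

/// Assumed rule: an OCR box duplicates native text when more than half of
/// its area is covered by a single native text box.
pub fn overlaps_existing_text(ocr_box: &Rect, native_boxes: &[Rect]) -> bool {
    let area = ocr_box.width * ocr_box.height;
    area > 0.0
        && native_boxes
            .iter()
            .any(|n| ocr_box.intersection_area(n) / area > 0.5)
}

/// Assumed cleanup: replace table-border characters with spaces,
/// then collapse runs of whitespace into single spaces.
pub fn clean_ocr_table_artifacts(text: &str) -> String {
    let stripped: String = text
        .chars()
        .map(|c| if matches!(c, '|' | '_') { ' ' } else { c })
        .collect();
    stripped.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With these assumptions, the OCR string "| Total |  42 |" is cleaned to "Total 42", and an OCR box lying entirely inside a native text box is dropped as a duplicate.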

OCR Engine Architecture and Configuration

LiteParse abstracts OCR functionality behind the OcrEngine trait defined in ocr/mod.rs, so the parser can switch between local and remote engines without changes to the core pipeline.

The OcrEngine Trait

The trait defines a standard interface where implementations receive a PNG byte slice (&[u8]) and return a vector of OcrItem structs. This abstraction enables the parser core to remain engine-agnostic while supporting diverse deployment scenarios from edge devices to cloud-based services.
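A simplified version of this interface might look like the sketch below. The trait and method names follow the recognize(&[u8]) -> Vec<OcrItem> shape described in this article, but the OcrItem fields, the FixedEngine stub, and the page_text helper are assumptions for illustration. The real trait may differ, for example by being async or by returning a Result.

```rust
/// One recognized text fragment. The fields are assumptions; the article
/// names only the struct.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrItem {
    pub text: String,
    pub confidence: f32,
}

/// Sketch of the trait shape: PNG bytes in, OCR items out.
pub trait OcrEngine {
    fn recognize(&self, png_bytes: &[u8]) -> Vec<OcrItem>;
}

/// Hypothetical stub engine that returns a canned result, useful in tests.
pub struct FixedEngine {
    pub canned: Vec<OcrItem>,
}

impl OcrEngine for FixedEngine {
    fn recognize(&self, png_bytes: &[u8]) -> Vec<OcrItem> {
        // An empty buffer cannot contain text, so return nothing.
        if png_bytes.is_empty() {
            return Vec::new();
        }
        self.canned.clone()
    }
}

/// Engine-agnostic caller: it sees only the trait object, so a local or
/// remote engine can be passed in without changing this function.
pub fn page_text(engine: &dyn OcrEngine, png_bytes: &[u8]) -> String {
    engine
        .recognize(png_bytes)
        .into_iter()
        .map(|item| item.text)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because page_text depends only on the trait, swapping the stub for a Tesseract-backed or HTTP-backed implementation would not require changes to the caller.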

Configuration Activation

OCR behavior is controlled through the ocr.enabled flag in config.rs. When it is enabled, the parser.rs module instantiates the appropriate engine during initialization and routes flagged pages through the complete pipeline. The flag toggles OCR globally, while the selective per-page logic still limits OCR to the pages that need it.
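The interaction between the global flag and the per-page decision can be sketched as below. The OcrConfig struct here carries only the enabled field mentioned above; the PageFlag type and the pages_to_ocr function are hypothetical names introduced for the example and are not taken from config.rs or parser.rs.

```rust
/// Hypothetical mirror of the OCR section of the configuration.
#[derive(Debug, Clone, Default)]
pub struct OcrConfig {
    /// Global switch, described in the article as ocr.enabled.
    pub enabled: bool,
}

/// Per-page flag computed earlier in the pipeline (assumed shape).
#[derive(Debug, Clone, Copy)]
pub struct PageFlag {
    pub page_index: usize,
    pub needs_ocr: bool,
}

/// Returns the indices of pages to OCR: none when OCR is disabled,
/// otherwise only the pages that were flagged.
pub fn pages_to_ocr(cfg: &OcrConfig, flags: &[PageFlag]) -> Vec<usize> {
    if !cfg.enabled {
        return Vec::new();
    }
    flags
        .iter()
        .filter(|f| f.needs_ocr)
        .map(|f| f.page_index)
        .collect()
}
```

In this sketch the global switch is checked first, so a disabled configuration never rasterizes a page, and an enabled one still skips pages that were not flagged.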

Implementation Examples

The following examples demonstrate enabling OCR for documents containing embedded images across different language bindings.

Node.js

import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({
    ocr: { enabled: true },
    ocrEngine: "tesseract",
  });

  const result = await parser.parse("invoice-with-scanned-pages.pdf");
  console.log(result.text);
})();

Python

from liteparse import LiteParse

parser = LiteParse(
    ocr={"enabled": True},
    ocr_engine="tesseract",
)

result = parser.parse("contract_scanned.pdf")
print(result.text)

Rust

use liteparse::{LiteParse, LiteParseConfig};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let cfg = LiteParseConfig {
        ocr: liteparse::config::OcrConfig { enabled: true, ..Default::default() },
        ..Default::default()
    };
    let parser = LiteParse::new(cfg);
    let parse_result = parser.parse_path("handwritten_form.pdf").await?;
    println!("{}", parse_result.text);
    Ok(())
}

Summary

LiteParse detects embedded images using page.image_bounds(…) in ocr_merge.rs before deciding whether to invoke OCR
OCR triggers only when native text is insufficient (text_length < 20 or text_coverage < 0.15) or images are present
Flagged pages rasterize to PNG via RenderedPage::png_bytes in render.rs for processing by the selected engine
The OcrEngine trait abstracts both local Tesseract (ocr/tesseract.rs) and remote HTTP (ocr/http_simple.rs) implementations
Post-OCR merging removes duplicate text (overlaps_existing_text) and cleans table artifacts (clean_ocr_table_artifacts)

Frequently Asked Questions

When does LiteParse decide to run OCR on a PDF page?

LiteParse runs OCR when a page contains embedded images detected via page.image_bounds(…) or when native text extraction yields fewer than 20 characters or covers less than 15% of the page area. This selective approach ensures OCR only processes pages where visual text extraction is necessary.

What image format does LiteParse use for OCR processing?

LiteParse renders flagged PDF pages to PNG byte buffers using the encode_png function in render.rs. The RenderedPage struct stores these bytes in png_bytes, which the OcrEngine implementation receives as a &[u8] slice for text recognition.

How does LiteParse prevent duplicate text from OCR and native PDF layers?

The ocr_and_merge_rendered function in ocr_merge.rs calls overlaps_existing_text (lines 167-170) to detect and filter out OCR results that duplicate existing PDF text. Additionally, clean_ocr_table_artifacts (lines 191-196) removes formatting noise common in scanned tables.

Can I use a custom OCR service instead of Tesseract?

Yes. LiteParse supports custom OCR services through the HTTP client implementation in ocr/http_simple.rs. The OcrEngine trait in ocr/mod.rs abstracts the interface, allowing you to implement custom clients that conform to the recognize(&[u8]) -> Vec<OcrItem> signature while the core logic in parser.rs handles the orchestration.
