How LiteParse Handles Embedded Images Within PDFs During OCR
LiteParse treats embedded images as a conditional OCR workload: a page is sent to OCR when its native text is sparse or when it contains embedded images. Flagged pages are rendered to PNG, and the OCR output is merged with the existing text after overlaps and table artifacts are removed.
LiteParse, the open-source PDF parsing library from run-llama, implements a selective approach for handling embedded images within PDFs during OCR. Instead of processing every page through optical character recognition, the engine evaluates each page's native text content and checks for embedded images to determine whether OCR is necessary, which avoids unnecessary OCR work on mixed-format documents.
Detecting Image-Rich Pages and OCR Necessity
LiteParse evaluates every page through a two-phase detection system before invoking resource-intensive OCR operations.
Image Boundary Detection
For each page, LiteParse queries the PDFium wrapper for image bounding boxes via page.image_bounds(…). If the function returns any bounding boxes, the page is flagged as containing embedded images. This check occurs in ocr_merge.rs at line 44, where the boolean has_images records whether the page contains at least one embedded image.
The Decision Logic for OCR Activation
OCR activation follows a threshold-based heuristic defined in ocr_merge.rs lines 44-45. The engine sets needs_ocr to true when either of two conditions is met:
- The page contains minimal native text (text_length < 20 or text_coverage < 0.15)
- The page contains embedded images detected in the previous step
This logic ensures that purely scanned documents and image-heavy pages receive OCR treatment, while text-native pages without embedded images skip OCR.
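The heuristic described above can be sketched as a small predicate. This is a minimal illustration, not the code in ocr_merge.rs: the PageStats struct, its field names, and the constant names are assumptions made for the example. Only the two thresholds (20 characters, 0.15 coverage) and the "or images present" rule come from the description above.

```rust
/// Hypothetical per-page statistics. The struct and its field names are
/// illustrative and are not taken from the LiteParse source.
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    /// Number of characters recovered from the native PDF text layer.
    pub text_length: usize,
    /// Fraction of the page area covered by native text, in 0.0..=1.0.
    pub text_coverage: f32,
    /// Number of embedded-image bounding boxes reported for the page.
    pub image_count: usize,
}

/// Thresholds quoted in the article (the constant names are assumptions).
const MIN_TEXT_LENGTH: usize = 20;
const MIN_TEXT_COVERAGE: f32 = 0.15;

/// Returns true when the page should be sent to OCR: either the native
/// text is sparse, or the page contains at least one embedded image.
pub fn needs_ocr(stats: &PageStats) -> bool {
    let sparse_text =
        stats.text_length < MIN_TEXT_LENGTH || stats.text_coverage < MIN_TEXT_COVERAGE;
    let has_images = stats.image_count > 0;
    sparse_text || has_images
}
```

Under these assumptions, a page with 1,200 characters, 60% coverage, and no images is skipped, while the same page with a single embedded image is flagged.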
Rendering and OCR Processing Pipeline
Once a page is flagged for OCR, LiteParse converts the PDF content into a rasterized format suitable for text recognition engines.
Rasterizing Pages to PNG
Flagged pages are rendered to PNG byte buffers through the render.rs module. The system creates a RenderedPage struct containing png_bytes, which is generated by the encode_png function. This rasterized representation preserves the visual layout of embedded images while creating a format that OCR engines can process. The PNG buffer acts as the bridge between the PDF's vector content and the pixel-based OCR analysis.
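To make the shape of that buffer concrete, here is a hedged sketch of a page wrapper that checks whether its bytes begin with the standard 8-byte PNG signature before they are handed to an OCR engine. Only the names RenderedPage and png_bytes come from the article; the page_index field and the looks_like_png helper are hypothetical, and the real struct in render.rs may look different.

```rust
/// The fixed 8-byte signature that starts every PNG file.
pub const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Illustrative stand-in for a rendered page. `png_bytes` is named in the
/// article; `page_index` is an assumption added for the example.
pub struct RenderedPage {
    pub page_index: usize,
    pub png_bytes: Vec<u8>,
}

impl RenderedPage {
    /// Hypothetical sanity check: does the buffer start like a PNG file?
    /// This inspects the signature only and does not validate the chunks.
    pub fn looks_like_png(&self) -> bool {
        self.png_bytes.starts_with(&PNG_SIGNATURE)
    }
}
```

A check like this is cheap and can catch a buffer that was never encoded before it reaches the OCR stage.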
Engine Execution and Result Merging
The PNG bytes are passed to the configured OcrEngine implementation. LiteParse supports two primary engines:
- Built-in Tesseract: Located in ocr/tesseract.rs, this implementation requires the Tesseract binary installed on the system
- HTTP Client: Found in ocr/http_simple.rs, this client enables integration with remote OCR services
After the engine returns OCR results, the ocr_and_merge_rendered function (lines 72-80 in ocr_merge.rs) coordinates text integration. This process includes:
- Overlap detection: The overlaps_existing_text function (lines 167-170) identifies and removes OCR text that duplicates existing PDF text layers
- Table artifact cleaning: The clean_ocr_table_artifacts function (lines 191-196) sanitizes formatting characters that commonly appear in OCR output from tables
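The two merge steps can be sketched as follows. This is an assumption-laden illustration rather than LiteParse's implementation: the TextBox type, the area-ratio overlap test, the min_ratio parameter, and the set of characters treated as table residue are all choices made for the example, and the function names are deliberately different from the real ones.

```rust
/// Hypothetical text item with an axis-aligned bounding box.
#[derive(Debug, Clone, PartialEq)]
pub struct TextBox {
    pub text: String,
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Area shared by two boxes, or 0.0 when they do not intersect.
fn intersection_area(a: &TextBox, b: &TextBox) -> f32 {
    let w = (a.x + a.width).min(b.x + b.width) - a.x.max(b.x);
    let h = (a.y + a.height).min(b.y + b.height) - a.y.max(b.y);
    if w > 0.0 && h > 0.0 { w * h } else { 0.0 }
}

/// True when at least `min_ratio` of the OCR box is covered by one native box.
pub fn overlaps_native(ocr: &TextBox, native: &[TextBox], min_ratio: f32) -> bool {
    let area = ocr.width * ocr.height;
    if area <= 0.0 {
        return false;
    }
    native.iter().any(|n| intersection_area(ocr, n) / area >= min_ratio)
}

/// Removes pipe characters and tokens made only of ruling-line characters.
/// Note that this simplification also drops a standalone "-" token.
pub fn strip_table_artifacts(text: &str) -> String {
    text.replace('|', " ")
        .split_whitespace()
        .filter(|tok| !tok.chars().all(|c| matches!(c, '_' | '-' | '=')))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Keeps all native text, then appends cleaned OCR items that do not
/// duplicate it and that still contain text after cleaning.
pub fn merge_ocr(native: &[TextBox], ocr: Vec<TextBox>, min_ratio: f32) -> Vec<TextBox> {
    let mut merged = native.to_vec();
    for mut item in ocr {
        if overlaps_native(&item, native, min_ratio) {
            continue;
        }
        item.text = strip_table_artifacts(&item.text);
        if !item.text.is_empty() {
            merged.push(item);
        }
    }
    merged
}
```

Measuring overlap relative to the OCR box's own area means a small OCR fragment that sits inside a larger native text run is still treated as a duplicate.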
OCR Engine Architecture and Configuration
LiteParse abstracts OCR functionality behind the OcrEngine trait defined in ocr/mod.rs, allowing seamless switching between local and remote processing capabilities.
The OcrEngine Trait
The trait defines a standard interface where implementations receive a PNG byte slice (&[u8]) and return a vector of OcrItem structs. This abstraction enables the parser core to remain engine-agnostic while supporting diverse deployment scenarios from edge devices to cloud-based services.
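A simplified version of such an interface might look like the sketch below. The names OcrEngine and OcrItem come from the article, but the fields of OcrItem, the synchronous method signature, and the FixedEngine stub are assumptions for illustration. The real trait in ocr/mod.rs may be async or return a Result.

```rust
/// Illustrative OCR result. The field set is an assumption, not the real struct.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrItem {
    pub text: String,
    /// Bounding box as [x, y, width, height] in page pixels (assumed layout).
    pub bbox: [f32; 4],
    pub confidence: f32,
}

/// Simplified, synchronous stand-in for the engine abstraction.
pub trait OcrEngine {
    fn recognize(&self, png_bytes: &[u8]) -> Vec<OcrItem>;
}

/// Hypothetical stub engine that returns canned items, useful in tests.
pub struct FixedEngine {
    pub items: Vec<OcrItem>,
}

impl OcrEngine for FixedEngine {
    fn recognize(&self, png_bytes: &[u8]) -> Vec<OcrItem> {
        // An empty buffer yields no text; otherwise return the canned items.
        if png_bytes.is_empty() {
            Vec::new()
        } else {
            self.items.clone()
        }
    }
}

/// Caller code depends only on the trait, not on a concrete engine.
pub fn recognized_text(engine: &dyn OcrEngine, png_bytes: &[u8]) -> Vec<String> {
    engine
        .recognize(png_bytes)
        .into_iter()
        .map(|item| item.text)
        .collect()
}
```

Because recognized_text takes a &dyn OcrEngine, a local Tesseract wrapper or an HTTP client could be swapped in without touching the caller.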
Configuration Activation
OCR behavior is controlled through the ocr.enabled flag in config.rs. When enabled, the parser.rs module instantiates the appropriate engine during initialization and routes image-heavy pages through the complete pipeline. The configuration system allows developers to toggle OCR globally while maintaining the selective per-page logic that preserves performance.
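The interaction between the global flag and the per-page selection can be sketched like this. Only the enabled field mirrors the ocr.enabled flag named above. PageSummary, pages_to_ocr, and the reduced per-page test (text length and image presence only, with coverage omitted for brevity) are hypothetical.

```rust
/// Minimal stand-in for the OCR section of the configuration.
#[derive(Debug, Clone, Default)]
pub struct OcrConfig {
    /// Global switch; defaults to false in this sketch.
    pub enabled: bool,
}

/// Hypothetical per-page summary used only for this example.
pub struct PageSummary {
    pub index: usize,
    pub text_length: usize,
    pub has_images: bool,
}

/// Returns the indices of pages that would be routed through OCR.
/// When OCR is disabled globally, no page is selected at all.
pub fn pages_to_ocr(config: &OcrConfig, pages: &[PageSummary]) -> Vec<usize> {
    if !config.enabled {
        return Vec::new();
    }
    pages
        .iter()
        .filter(|p| p.has_images || p.text_length < 20)
        .map(|p| p.index)
        .collect()
}
```

With OCR enabled, only the sparse or image-bearing pages are selected; with the flag off, the per-page checks never run.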
Implementation Examples
The following examples demonstrate enabling OCR for documents containing embedded images across different language bindings.
Node.js
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({
    ocr: { enabled: true },
    ocrEngine: "tesseract",
  });
  const result = await parser.parse("invoice-with-scanned-pages.pdf");
  console.log(result.text);
})();
Python
from liteparse import LiteParse

parser = LiteParse(
    ocr={"enabled": True},
    ocr_engine="tesseract",
)
result = parser.parse("contract_scanned.pdf")
print(result.text)
Rust
use liteparse::{LiteParse, LiteParseConfig};

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let cfg = LiteParseConfig {
        ocr: liteparse::config::OcrConfig { enabled: true, ..Default::default() },
        ..Default::default()
    };
    let parser = LiteParse::new(cfg);
    let parse_result = parser.parse_path("handwritten_form.pdf").await?;
    println!("{}", parse_result.text);
    Ok(())
}
Summary
- LiteParse detects embedded images using page.image_bounds(…) in ocr_merge.rs before deciding whether to invoke OCR
- OCR triggers only when native text is insufficient (text_length < 20 or text_coverage < 0.15) or images are present
- Flagged pages rasterize to PNG via RenderedPage::png_bytes in render.rs for processing by the selected engine
- The OcrEngine trait abstracts both local Tesseract (ocr/tesseract.rs) and remote HTTP (ocr/http_simple.rs) implementations
- Post-OCR merging removes duplicate text (overlaps_existing_text) and cleans table artifacts (clean_ocr_table_artifacts)
Frequently Asked Questions
When does LiteParse decide to run OCR on a PDF page?
LiteParse runs OCR when a page contains embedded images detected via page.image_bounds(…) or when native text extraction yields fewer than 20 characters or covers less than 15% of the page area. This selective approach ensures OCR only processes pages where visual text extraction is necessary.
What image format does LiteParse use for OCR processing?
LiteParse renders flagged PDF pages to PNG byte buffers using the encode_png function in render.rs. The RenderedPage struct stores these bytes in png_bytes, which the OcrEngine implementation receives as a &[u8] slice for text recognition.
How does LiteParse prevent duplicate text from OCR and native PDF layers?
The ocr_and_merge_rendered function in ocr_merge.rs calls overlaps_existing_text (lines 167-170) to detect and filter out OCR results that duplicate existing PDF text. Additionally, clean_ocr_table_artifacts (lines 191-196) removes formatting noise common in scanned tables.
Can I use a custom OCR service instead of Tesseract?
Yes. LiteParse supports custom OCR services through the HTTP client implementation in ocr/http_simple.rs. The OcrEngine trait in ocr/mod.rs abstracts the interface, allowing you to implement custom clients that conform to the recognize(&[u8]) -> Vec<OcrItem> signature while the core logic in parser.rs handles the orchestration.