LiteParse OCR Performance Impact: Enabling vs Disabling Text Recognition

Enabling OCR in LiteParse makes parsing 2–5× slower on scanned documents by adding rasterization, inference, and merge steps; disabling it skips these steps and extracts only native PDF text.

LiteParse is an open-source PDF parsing library (run-llama/liteparse) that extracts structured text from documents with optional optical character recognition. The OCR pipeline, controlled by the ocr_enabled flag in LiteParseConfig, introduces significant computational overhead through image rasterization and model inference, whereas disabling it processes only native PDF text for maximum speed.

The Three-Stage OCR Pipeline

When ocr_enabled is set to true (the default in config.rs at lines 44–45), the parser executes three additional, CPU-intensive stages for each page requiring OCR.

Page Rendering

First, LiteParse renders the PDF page to a bitmap image at the configured DPI (default 150). In parser.rs (lines 102–115), the code checks self.config.ocr_enabled and calls ocr_merge::render_pages_for_ocr to generate the rasterized image. Higher DPI settings produce larger bitmaps, directly increasing both memory pressure and processing time during this rasterization phase.
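
To make the effect of DPI concrete, the sketch below estimates the bitmap dimensions and uncompressed size for a single page. It is illustrative arithmetic rather than LiteParse code: bitmap_size is a hypothetical helper, and the 4 bytes per pixel (RGBA) buffer layout is an assumption.

```python
def bitmap_size(width_pt: float, height_pt: float, dpi: int, bytes_per_pixel: int = 4):
    """Return (width_px, height_px, size_in_bytes) for a page rendered at `dpi`.

    PDF page dimensions are expressed in points, with 72 points per inch.
    bytes_per_pixel=4 assumes an RGBA buffer (an assumption for illustration).
    """
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300):
    w, h, size = bitmap_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

For a US Letter page this gives 1275 x 1650 pixels at the default 150 DPI. Doubling the DPI to 300 quadruples the pixel count, which is why resolution has such a visible effect on memory use.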

Engine Execution

The rendered images are passed to ocr_merge::ocr_and_merge_rendered in ocr_merge.rs (lines 72–89), which spawns asynchronous workers (defaulting to CPU cores minus one) to execute the OCR engine. The engine can be the built-in Tesseract (CPU-bound local inference) or an HTTP OCR server (potentially GPU-accelerated but network-dependent). Each page requires complete model inference or a network round-trip, making this the most expensive stage.
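
The worker model described above can be approximated in a few lines. This is a minimal sketch, not the Rust implementation: it swaps in a Python thread pool for LiteParse's asynchronous workers, fake_ocr is a hypothetical stand-in for the real engine call, and the floor of one worker on single-core machines is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count() -> int:
    """CPU cores minus one, but never fewer than one worker (the floor is assumed)."""
    return max(1, (os.cpu_count() or 1) - 1)


def fake_ocr(page_index: int) -> str:
    """Hypothetical stand-in for an OCR engine call (local inference or HTTP request)."""
    return f"text from page {page_index}"


def ocr_pages(page_indices, num_workers=None):
    """Run the OCR stand-in over the pages concurrently, preserving page order."""
    workers = num_workers or default_worker_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even if pages finish out of order.
        return list(pool.map(fake_ocr, page_indices))


print(ocr_pages([0, 1, 2]))
```

Because each page is an independent unit of work, throughput scales with the worker count until the CPU (for local inference) or the remote server (for HTTP OCR) becomes the bottleneck.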

Result Merging

Finally, OCR text blocks must be merged back into the document structure. The implementation in ocr_merge.rs (lines 130–156) filters out overlapping regions between native PDF text and OCR results, discarding artifacts that exceed a small tolerance threshold. This prevents duplicate text entries but adds processing time proportional to page complexity.
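
One way to picture this deduplication step is a bounding-box overlap filter. The sketch below is an interpretation under stated assumptions, not the code in ocr_merge.rs: the Box type, the merge_ocr function, the area-ratio overlap measure, and the 0.1 tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection rectangle, or 0.0 if the boxes do not intersect."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def merge_ocr(native, ocr, tolerance=0.1):
    """Keep all native blocks, plus OCR blocks that do not duplicate them.

    An OCR block is dropped when more than `tolerance` of its area is covered
    by any single native block. The 0.1 default is illustrative only.
    """
    kept = [
        o for o in ocr
        if o.area > 0
        and all(overlap_area(o, n) / o.area <= tolerance for n in native)
    ]
    return list(native) + kept


native = [Box(0, 0, 100, 20, "Title")]
ocr = [Box(2, 1, 98, 19, "Tit1e"), Box(0, 50, 100, 70, "Scanned caption")]
print([b.text for b in merge_ocr(native, ocr)])
```

Here the OCR reading "Tit1e" lies entirely inside the native "Title" block and is discarded, while the caption recognized from an image region survives. The pairwise comparison also shows why merge cost grows with the number of text blocks on a page.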

Selective OCR and Performance Thresholds

LiteParse does not indiscriminately OCR every page. According to the logic in ocr_merge.rs (lines 44–45), a page is flagged for OCR only if it meets at least one of these criteria: native text extraction yields fewer than 20 characters, text coverage is below 15%, or the page contains embedded images. As a result, born-digital PDFs with full native text incur minimal overhead, because only the few flagged pages are rendered, whereas scanned documents trigger the full pipeline on every page.
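
The per-page decision can be expressed as a single predicate. This is a minimal sketch using the thresholds quoted above; the function name, parameter names, and the representation of coverage as a fraction between 0 and 1 are assumptions, not the signatures used in ocr_merge.rs.

```python
def needs_ocr(
    native_chars: int,
    text_coverage: float,
    has_images: bool,
    min_chars: int = 20,
    min_coverage: float = 0.15,
) -> bool:
    """Return True if a page meets at least one of the OCR criteria.

    text_coverage is the fraction of the page covered by native text (0.0 to 1.0).
    """
    return native_chars < min_chars or text_coverage < min_coverage or has_images


print(needs_ocr(3000, 0.60, False))  # text-rich page with no images
print(needs_ocr(5, 0.00, False))     # scanned page with almost no native text
print(needs_ocr(3000, 0.60, True))   # text-rich page with an embedded image
```

Only the first page skips OCR. The third case shows that a single embedded image is enough to send an otherwise text-rich page through the full pipeline.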

Performance Factors and Bottlenecks

Several configuration variables determine the magnitude of the performance penalty when OCR is active:

  • DPI Setting (config.dpi): Higher resolution produces larger images. Pixel count grows with the square of the DPI (doubling the DPI quadruples the pixels), which increases both rendering time and OCR inference cost.
  • Page Complexity: PDFs made up of scanned images trigger OCR on every page, incurring the full 2–5× slowdown, while text-based documents see only marginal delays.
  • OCR Engine Choice: Local Tesseract execution is CPU-bound and blocks during inference, whereas HTTP servers add network latency and bandwidth constraints.
  • Parallel Workers (--num-workers): Increasing the worker count (default: CPU cores minus one) improves throughput for large documents but can saturate CPU resources.
  • Network Conditions: Remote OCR endpoints introduce variable latency that can dominate total processing time for small documents.
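
Because these factors interact, the most reliable way to find the slowdown for a given workload is to measure it. The harness below is a generic sketch: time_call and slowdown are hypothetical helper names, and the time.sleep calls are stand-ins so the example runs on its own.

```python
import time
from statistics import median


def time_call(fn, repeats=3):
    """Return the median wall-clock seconds over `repeats` calls of fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def slowdown(with_ocr, without_ocr, repeats=3):
    """Ratio of median runtimes; a value above 1 means the OCR run is slower."""
    return time_call(with_ocr, repeats) / time_call(without_ocr, repeats)


# Stand-ins so the sketch is self-contained; swap in real parse calls.
ratio = slowdown(lambda: time.sleep(0.05), lambda: time.sleep(0.01))
print(f"measured slowdown: {ratio:.1f}x")
```

To benchmark real documents, replace the stand-ins with callables that parse the same file with OCR on and off, for example using the Python wrapper shown under Configuration Examples. Taking the median over several runs reduces the influence of cold caches and network jitter.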

Disabling OCR for Maximum Speed

When ocr_enabled is false, the parser skips the entire rendering-and-OCR block in parser.rs. Execution jumps directly from native PDF text extraction to spatial projection, eliminating rasterization, inference, and merging overhead entirely. The CLI provides the --no-ocr flag (defined in main.rs at lines 51–55) to toggle this quickly.
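
The resulting control flow can be summarized as a short sketch. The stage names and the stages_for function are illustrative labels for the steps described in this article, not identifiers taken from parser.rs.

```python
def stages_for(ocr_enabled: bool, page_needs_ocr: bool) -> list:
    """List the stages a page passes through for a given configuration."""
    stages = ["native_text_extraction"]
    # The OCR block runs only when the flag is on AND the page was flagged.
    if ocr_enabled and page_needs_ocr:
        stages += ["render_page", "run_ocr_engine", "merge_results"]
    stages.append("spatial_projection")
    return stages


print(stages_for(ocr_enabled=True, page_needs_ocr=True))
print(stages_for(ocr_enabled=False, page_needs_ocr=True))
```

With the flag off, a scanned page goes through the same two stages as a text-only page, which is why it parses quickly but yields no recovered text.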

Configuration Examples

Command Line Interface


# OCR enabled (default) - slower for scanned documents
liteparse parse document.pdf --output json

# Disable OCR for maximum speed on text-based PDFs
liteparse parse document.pdf --no-ocr --output json

The --no-ocr flag sets ocr_enabled: false in the configuration struct.

Node.js SDK

import { LiteParse } from "liteparse";

// OCR enabled (default)
const parser = new LiteParse({
  ocrEnabled: true,
  dpi: 150,
  numWorkers: 4,
});
await parser.parse("document.pdf");

// Disable OCR for faster parsing
const fastParser = new LiteParse({ ocrEnabled: false });
await fastParser.parse("document.pdf");

The ocrEnabled option maps to config.ocr_enabled in the Rust core.

Python Wrapper

from liteparse import LiteParse

# OCR enabled (default)
lp = LiteParse(ocr_enabled=True)
result = lp.parse("document.pdf")

# Disable OCR for native-text-only extraction
lp_fast = LiteParse(ocr_enabled=False)
result_fast = lp_fast.parse("document.pdf")

The Python wrapper forwards the flag to the underlying Rust config through the Python bindings.

Summary

  • Enabling OCR triggers three expensive stages (rendering, engine execution, merging) that make parsing 2–5× slower for scanned PDFs.
  • Disabling OCR removes all rasterization and inference overhead, processing only native PDF text at maximum speed.
  • LiteParse uses selective OCR (only pages with little native text, low text coverage, or embedded images) to minimize unnecessary processing.
  • Configuration occurs via the ocr_enabled flag (default true) in LiteParseConfig, controllable through CLI (--no-ocr), Node.js (ocrEnabled), or Python (ocr_enabled).

Frequently Asked Questions

How do I know if my PDF needs OCR enabled?

Check whether the document contains scanned images or lacks selectable text. Born-digital PDFs with native text extract correctly with ocr_enabled: false, while scanned documents require OCR to recover text. With OCR enabled, LiteParse detects this automatically in ocr_merge.rs by flagging pages that have fewer than 20 characters of native text, text coverage below 15%, or embedded images.

Does disabling OCR affect output accuracy for scanned documents?

Yes. When ocr_enabled is false, LiteParse only extracts native PDF text streams. Scanned pages without embedded text produce empty output, as the parser skips the rasterization and OCR stages entirely.

Can I adjust the DPI to balance speed and accuracy?

Yes. Lowering config.dpi from the default 150 to 100 or 72 reduces bitmap size and rendering time, though it may decrease OCR accuracy for small fonts. This trade-off is managed through the dpi parameter in your SDK or CLI configuration.

Why does LiteParse use CPU cores minus one for OCR workers by default?

The default worker count (num_workers: CPU cores - 1) prevents total CPU saturation, leaving one core available for system processes and the main event loop. Increasing this value accelerates OCR throughput for batches but may cause contention on resource-constrained systems.
