# LiteParse OCR Performance Impact: Enabling vs Disabling Text Recognition

> Discover the performance impact of enabling or disabling OCR in LiteParse. See how OCR affects parsing speed and text extraction for scanned documents and native PDFs.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: performance
- Published: 2026-05-30

---

**Enabling OCR in LiteParse makes parsing 2–5× slower on scanned documents because it adds rasterization, inference, and merge steps; disabling it skips all three and extracts only native PDF text.**

LiteParse is an open-source PDF parsing library (run-llama/liteparse) that extracts structured text from documents with optional optical character recognition. The **OCR pipeline**, controlled by the `ocr_enabled` flag in `LiteParseConfig`, introduces significant computational overhead through image rasterization and model inference, whereas disabling it processes only native PDF text for maximum speed.

## The Three-Stage OCR Pipeline

When `ocr_enabled` is set to `true` (the default in [`config.rs`](https://github.com/run-llama/liteparse/blob/main/config.rs) at lines 44–45), the parser executes three additional, CPU-intensive stages for each page requiring OCR.

### Page Rendering

First, LiteParse renders the PDF page to a bitmap image at the configured DPI (default 150). In [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs) (lines 102–115), the code checks `self.config.ocr_enabled` and calls `ocr_merge::render_pages_for_ocr` to generate the rasterized image. Higher DPI settings produce larger bitmaps, directly increasing both memory pressure and processing time during this rasterization phase.
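To make the DPI cost concrete, here is a minimal sketch of how bitmap size scales with resolution. It relies only on the fact that PDF page dimensions are expressed in points (1/72 inch). The helper names and the 4-bytes-per-pixel (RGBA) buffer are assumptions for illustration, not LiteParse's actual rendering code or pixel format.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are measured in points (1/72 inch)


def bitmap_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def bitmap_bytes(width_pt: float, height_pt: float, dpi: int, bytes_per_pixel: int = 4) -> int:
    """Estimate the raw buffer size; 4 bytes per pixel (RGBA) is an assumption."""
    width_px, height_px = bitmap_size(width_pt, height_pt, dpi)
    return width_px * height_px * bytes_per_pixel


US_LETTER = (612, 792)  # 8.5 x 11 inches in points

for dpi in (72, 150, 300):
    width_px, height_px = bitmap_size(*US_LETTER, dpi)
    megabytes = bitmap_bytes(*US_LETTER, dpi) / 1_000_000
    print(f"{dpi} DPI: {width_px} x {height_px} px, ~{megabytes:.1f} MB uncompressed")
```

A US Letter page at the default 150 DPI comes out at 1275 × 1650 pixels. Doubling the DPI to 300 doubles both dimensions, so the pixel count (and the raw buffer) grows fourfold.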

### Engine Execution

The rendered images are passed to `ocr_merge::ocr_and_merge_rendered` in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) (lines 72–89), which spawns asynchronous workers (defaulting to CPU cores minus one) to execute the OCR engine. The engine can be the built-in **Tesseract** (CPU-bound local inference) or an **HTTP OCR server** (potentially GPU-accelerated but network-dependent). Each page requires complete model inference or a network round-trip, making this the most expensive stage.
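The fan-out pattern described above can be sketched in Python. This is an illustration only: it uses a thread pool in place of the asynchronous workers in the Rust core, `fake_ocr` is a hypothetical stand-in for Tesseract inference or an HTTP round-trip, and none of these names come from LiteParse.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count() -> int:
    """CPU cores minus one, but never fewer than one worker."""
    return max(1, (os.cpu_count() or 1) - 1)


def fake_ocr(page_index: int) -> str:
    """Hypothetical stand-in for per-page inference or a network call."""
    return f"text for page {page_index}"


def ocr_pages(page_indices, num_workers=None):
    """Run OCR over the given pages in parallel, preserving page order."""
    workers = num_workers or default_worker_count()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which worker finishes first.
        return list(pool.map(fake_ocr, page_indices))


print(ocr_pages(range(3), num_workers=2))
```

Because every page is an independent unit of work, throughput scales with the worker count until the CPU (for local inference) or the remote server (for HTTP OCR) becomes the limit.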

### Result Merging

Finally, OCR text blocks must be merged back into the document structure. The implementation in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) (lines 130–156) compares OCR results against native PDF text and discards OCR results whose overlap with native text exceeds a small tolerance threshold. This prevents duplicate text entries but adds processing time proportional to page complexity.
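One way to picture the merge step is as a bounding-box overlap filter. The sketch below is an assumption about the general technique, not a port of `ocr_merge.rs`: the `Box` type, the overlap measure, and the `0.1` tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(a: Box, b: Box) -> float:
    """Fraction of `a`'s area that is covered by `b`."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0 or a.area() == 0:
        return 0.0
    return (width * height) / a.area()


def merge_ocr(native: list[Box], ocr: list[Box], tolerance: float = 0.1) -> list[Box]:
    """Keep all native blocks, plus OCR blocks that do not duplicate them.

    `tolerance` is a hypothetical value; the real threshold is not stated here.
    """
    kept = [
        block
        for block in ocr
        if all(overlap_ratio(block, n) <= tolerance for n in native)
    ]
    return native + kept


native = [Box(0, 0, 100, 20, "Hello")]
ocr = [Box(1, 1, 99, 19, "He11o"), Box(0, 50, 100, 70, "Scanned line")]
print([b.text for b in merge_ocr(native, ocr)])
```

Each OCR block is compared against every native block, so the cost grows with the number of text blocks on the page, which matches the "proportional to page complexity" behavior noted above.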

## Selective OCR and Performance Thresholds

LiteParse does not indiscriminately OCR every page. According to the logic in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs) (lines 44–45), a page is flagged for OCR only if it meets specific criteria: native text extraction yields fewer than 20 characters, coverage is below 15%, or the page contains embedded images. This selectivity means digitally-born PDFs with full native text experience minimal overhead (rendering only a few pages) compared to scanned documents where every page triggers the full pipeline.
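The selection rule can be expressed as a small predicate. The thresholds (20 characters, 15% coverage) come from the description above; the `PageStats` type, its field names, and the function name are hypothetical and do not mirror the Rust source.

```python
from dataclasses import dataclass

MIN_NATIVE_CHARS = 20     # fewer native characters than this triggers OCR
MIN_TEXT_COVERAGE = 0.15  # text coverage below 15% triggers OCR


@dataclass
class PageStats:
    native_char_count: int  # characters found by native text extraction
    text_coverage: float    # coverage as a fraction between 0.0 and 1.0
    has_images: bool        # whether the page contains embedded images


def needs_ocr(page: PageStats) -> bool:
    """A page is flagged for OCR if any one of the three criteria holds."""
    return (
        page.native_char_count < MIN_NATIVE_CHARS
        or page.text_coverage < MIN_TEXT_COVERAGE
        or page.has_images
    )


pages = [
    PageStats(native_char_count=2400, text_coverage=0.62, has_images=False),  # born-digital
    PageStats(native_char_count=0, text_coverage=0.0, has_images=True),       # full-page scan
    PageStats(native_char_count=1800, text_coverage=0.55, has_images=True),   # text plus a figure
]
print([needs_ocr(p) for p in pages])
```

Note that the criteria are combined with "or": a text-heavy page with a single embedded figure is still flagged, which is why born-digital documents with many figures can see more OCR overhead than plain-text ones.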

## Performance Factors and Bottlenecks

Several configuration variables determine the magnitude of the performance penalty when OCR is active:

- **DPI Setting** (`config.dpi`): Higher resolution produces larger images. Pixel count grows with the square of the DPI, so doubling the DPI roughly quadruples the bitmap size, raising both rendering time and OCR inference cost.
- **Page Complexity**: PDFs containing mostly scanned images trigger OCR on every page, incurring the full 2–5× slowdown, while text-based documents see only marginal delays.
- **OCR Engine Choice**: Local Tesseract execution is CPU-bound and blocks during inference, whereas HTTP servers add network latency and bandwidth constraints.
- **Parallel Workers** (`--num-workers`): Increasing the worker count (default: CPU cores minus one) improves throughput for large documents but can saturate the CPU and starve other processes.
- **Network Conditions**: Remote OCR endpoints introduce variable latency that can dominate total processing time for small documents.

## Disabling OCR for Maximum Speed

When `ocr_enabled` is `false`, the parser skips the entire rendering-and-OCR block in [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/parser.rs). Execution jumps directly from native PDF text extraction to spatial projection, eliminating rasterization, inference, and merging overhead entirely. The CLI provides the `--no-ocr` flag (defined in [`main.rs`](https://github.com/run-llama/liteparse/blob/main/main.rs) at lines 51–55) to toggle this quickly.

## Configuration Examples

### Command Line Interface

```bash
# OCR enabled (default) - slower for scanned documents
liteparse parse document.pdf --output json

# Disable OCR for maximum speed on text-based PDFs
liteparse parse document.pdf --no-ocr --output json
```

*The `--no-ocr` flag sets `ocr_enabled: false` in the configuration struct.*

### Node.js SDK

```typescript
import { LiteParse } from "liteparse";

// OCR enabled (default)
const parser = new LiteParse({
  ocrEnabled: true,
  dpi: 150,
  numWorkers: 4,
});
await parser.parse("document.pdf");

// Disable OCR for faster parsing
const fastParser = new LiteParse({ ocrEnabled: false });
await fastParser.parse("document.pdf");
```

*The `ocrEnabled` option maps to `config.ocr_enabled` in the Rust core.*

### Python Wrapper

```python
from liteparse import LiteParse

# OCR enabled (default)
lp = LiteParse(ocr_enabled=True)
result = lp.parse("document.pdf")

# Disable OCR for native-text-only extraction
lp_fast = LiteParse(ocr_enabled=False)
result_fast = lp_fast.parse("document.pdf")
```

*The Python wrapper forwards the flag to the underlying Rust config as implemented in the Python bindings.*
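The 2–5× figure depends on the document mix, DPI, and OCR engine, so it is worth measuring on your own files. The following is a minimal, library-agnostic timing sketch: `time_call` and `slowdown` are hypothetical helpers that accept any zero-argument callables, so nothing here assumes a particular LiteParse API.

```python
import time
from statistics import median


def time_call(fn, repeats: int = 3) -> float:
    """Return the median wall-clock time of `fn()` over `repeats` runs, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def slowdown(with_ocr, without_ocr, repeats: int = 3) -> float:
    """Ratio of the OCR-enabled time to the OCR-disabled time."""
    baseline = time_call(without_ocr, repeats)
    if baseline == 0:
        raise ValueError("baseline ran too quickly to measure")
    return time_call(with_ocr, repeats) / baseline


if __name__ == "__main__":
    # Placeholder workloads; swap in real parse calls to measure a document.
    ratio = slowdown(lambda: time.sleep(0.05), lambda: time.sleep(0.01))
    print(f"OCR-enabled run took {ratio:.1f}x as long as the baseline")
```

With the Python example above, the two callables would be `lambda: lp.parse("document.pdf")` and `lambda: lp_fast.parse("document.pdf")`. Using the median of several runs reduces the influence of one-off effects such as cold caches or model loading on the first call.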

## Summary

- **Enabling OCR** triggers three expensive stages (rendering, engine execution, merging) that make parsing **2–5× slower** for scanned PDFs.
- **Disabling OCR** removes all rasterization and inference overhead, processing only native PDF text at maximum speed.
- LiteParse uses **selective OCR**, running it only on pages with little native text, low text coverage, or embedded images, to minimize unnecessary processing.
- Configuration occurs via the `ocr_enabled` flag (default `true`) in `LiteParseConfig`, controllable through CLI (`--no-ocr`), Node.js (`ocrEnabled`), or Python (`ocr_enabled`).

## Frequently Asked Questions

### How do I know if my PDF needs OCR enabled?

Check whether the document contains scanned images or lacks selectable text. Digitally-born PDFs with native text extract correctly with `ocr_enabled: false`, while scanned documents require OCR to recover text. With OCR enabled, LiteParse makes this decision per page in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/ocr_merge.rs), flagging pages that have fewer than 20 native characters, text coverage below 15%, or embedded images.

### Does disabling OCR affect output accuracy for scanned documents?

Yes. When `ocr_enabled` is `false`, LiteParse only extracts native PDF text streams. Scanned pages without embedded text produce empty output, as the parser skips the rasterization and OCR stages entirely.

### Can I adjust the DPI to balance speed and accuracy?

Yes. Lowering `config.dpi` from the default 150 to 100 or 72 reduces bitmap size and rendering time, though it may decrease OCR accuracy for small fonts. This trade-off is managed through the `dpi` parameter in your SDK or CLI configuration.

### Why does LiteParse use CPU cores minus one for OCR workers by default?

The default worker count (`num_workers: CPU cores - 1`) prevents total CPU saturation, leaving one core available for system processes and the main event loop. Increasing this value accelerates OCR throughput for batches but may cause contention on resource-constrained systems.