How to Configure LiteParse for Batch Parsing with Concurrent Workers

Set the num_workers parameter in LiteParseConfig to limit concurrent OCR tasks via an internal semaphore. It defaults to your logical CPU core count minus one, with a minimum of one worker.

When processing large document collections with the run-llama/liteparse library, configuring concurrent workers is essential for optimizing throughput. LiteParse processes each document through three distinct stages—PDF rendering, optical character recognition (OCR), and spatial projection—but only the OCR phase benefits from parallelization. By adjusting the num_workers configuration, you control how many pages undergo simultaneous OCR processing, preventing resource exhaustion while maximizing batch parsing performance.

Understanding the Worker Pool Architecture

LiteParse operates sequentially for rendering and spatial projection, but uses an asynchronous worker pool specifically for OCR operations. According to the source code in ocr_merge.rs, the library initializes a tokio::sync::Semaphore where the number of permits equals your configured num_workers value.

When OCR is enabled, the parsing flow follows this pattern:

  1. Render: Each page is converted to a raster image, one page at a time
  2. OCR: Each rendered image acquires a semaphore permit before its request is sent to the OCR engine
  3. Merge: The OCR result is merged back and the permit is released for the next queued page

This architecture ensures that CPU-intensive rendering remains sequential while I/O-bound OCR requests execute concurrently up to your specified limit.
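The flow above can be sketched in Python, with asyncio.Semaphore standing in for tokio::sync::Semaphore. This is an illustrative analogue under stated assumptions, not LiteParse's code: render_page, ocr_page, and parse_pages are hypothetical names, and the OCR request is simulated with a short sleep.

```python
import asyncio


async def render_page(page_number: int) -> bytes:
    """Hypothetical stand-in for rendering: returns fake image bytes."""
    return f"image-{page_number}".encode()


async def ocr_page(image: bytes) -> str:
    """Hypothetical stand-in for an I/O-bound OCR request."""
    await asyncio.sleep(0.01)  # simulate waiting on the OCR engine
    return image.decode().replace("image", "text")


async def parse_pages(page_count: int, num_workers: int) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(num_workers)  # one permit per worker
    in_flight = 0
    peak_in_flight = 0

    async def ocr_with_permit(image: bytes) -> str:
        nonlocal in_flight, peak_in_flight
        async with semaphore:  # waits here while all permits are taken
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            try:
                return await ocr_page(image)
            finally:
                in_flight -= 1  # permit is released when the block exits

    tasks = []
    for page_number in range(page_count):
        image = await render_page(page_number)  # step 1: sequential render
        tasks.append(asyncio.create_task(ocr_with_permit(image)))  # step 2: gated OCR

    texts = await asyncio.gather(*tasks)  # step 3: merge results in page order
    return list(texts), peak_in_flight


texts, peak = asyncio.run(parse_pages(page_count=10, num_workers=3))
print(texts[:2], "peak concurrent OCR calls:", peak)
```

In this sketch the tracked peak never exceeds the permit count, and asyncio.gather returns results in page order regardless of which simulated OCR call finishes first.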

Configuration Methods

You can configure num_workers through four interfaces: the CLI, the Python SDK, the Node.js SDK, and the Rust API.

Command Line Interface (CLI)

The Rust binary accepts a --num-workers flag that propagates directly to the internal configuration. In main.rs, the CLI argument parses into cmd.num_workers and maps to config.num_workers before initialization.


# Process a directory with 8 concurrent OCR workers
liteparse batch ./documents --num-workers 8 --output-format json

Python SDK

The Python wrapper accepts num_workers as a constructor argument in parser.py, forwarding the value to the underlying Rust configuration via kwargs.

from liteparse import LiteParse

parser = LiteParse(
    ocr_enabled=True,
    num_workers=4,  # Limit to 4 concurrent OCR operations
    output_format="json"
)

results = parser.parse_folder("./invoices")
print(results)

Node.js SDK

The Node.js wrapper exposes the same parameter through the LiteParse class options. According to cli.ts, the option is forwarded to the native binary as a flag.

import { LiteParse } from "liteparse";

const parser = new LiteParse({
  ocrEnabled: true,
  numWorkers: 6,  // Controls the OCR concurrency limit
  outputFormat: "json",
});

const results = await parser.parseFolder("./documents");
console.log(results);

Rust (Programmatic)

When using LiteParse directly in Rust, set the field on the LiteParseConfig struct before instantiating the parser. The field definition resides in config.rs.

use liteparse::{LiteParse, LiteParseConfig, OutputFormat};

let config = LiteParseConfig {
    ocr_enabled: true,
    num_workers: 5,  // Custom worker count
    output_format: OutputFormat::Json,
    ..Default::default()
};

let parser = LiteParse::new(config);
let document = parser.parse("report.pdf").await?;

Default Behavior and Semaphore Implementation

If you omit the num_workers parameter, LiteParse automatically calculates an optimal default. The function default_num_workers() in config.rs returns the number of logical CPU cores minus one, with a minimum of one worker to prevent deadlock.
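As a rough Python analogue of that calculation (the Rust function body is not reproduced here, so this only mirrors the behavior described above; the optional logical_cores argument is an addition for illustration):

```python
import os
from typing import Optional


def default_num_workers(logical_cores: Optional[int] = None) -> int:
    """Logical cores minus one, never fewer than one worker."""
    if logical_cores is None:
        # os.cpu_count() can return None on some platforms; fall back to 1
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores - 1)


print(default_num_workers())
```

The floor matters on single-core machines: a semaphore with zero permits would never let any page proceed.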

The semaphore implementation in ocr_merge.rs creates the concurrency boundary:

  • Each page requiring OCR attempts to acquire a permit from the semaphore
  • If all permits are active, the task yields until one becomes available
  • Upon OCR completion, the permit releases back to the pool

This bounded concurrency prevents memory bloat from buffering too many rendered images while keeping the OCR pipeline saturated.

Performance Considerations

Adjust num_workers based on your OCR provider capacity. While the default leverages available CPU cores, external OCR services (like cloud APIs) may impose rate limits that necessitate lower concurrency values.
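One way to pick a starting value for a rate-limited provider is a Little's law estimate: sustained concurrency is roughly request rate times average latency. The helper below is a hypothetical sketch, not part of LiteParse, and it assumes a steady per-request latency that you have measured yourself.

```python
import math


def choose_num_workers(
    cpu_default: int,
    rate_limit_per_sec: float,
    avg_latency_sec: float,
) -> int:
    """Cap the CPU-based default at what the provider's rate limit can sustain."""
    # Little's law: requests in flight ~= arrival rate * time per request
    sustainable = math.floor(rate_limit_per_sec * avg_latency_sec)
    return max(1, min(cpu_default, sustainable))


# Example: 7 workers by default, provider allows 10 requests/s, OCR takes ~0.5 s
workers = choose_num_workers(cpu_default=7, rate_limit_per_sec=10, avg_latency_sec=0.5)
print(workers)
```

Treat the result as an upper bound to start tuning from, then lower it if the provider still returns rate-limit errors.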

Memory scales linearly with worker count. Each concurrent worker holds one rendered page image in memory until OCR completes. Reducing workers lowers peak memory usage during batch processing of high-resolution documents.
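A back-of-the-envelope estimate makes the linear relationship concrete. The sketch below assumes uncompressed RGB buffers (3 bytes per pixel) and one held image per worker; the actual pixel format and overhead inside LiteParse are not stated in this article, so treat the numbers as illustrative.

```python
def estimated_peak_image_bytes(
    width_px: int,
    height_px: int,
    bytes_per_pixel: int,
    num_workers: int,
) -> int:
    """Peak bytes held by in-flight page images, one image per worker."""
    return width_px * height_px * bytes_per_pixel * num_workers


# An A4 page at 300 DPI is roughly 2480 x 3508 pixels
for workers in (1, 4, 8):
    total = estimated_peak_image_bytes(2480, 3508, 3, workers)
    print(f"{workers} workers: {total / 2**20:.1f} MiB")
```

Halving the worker count halves this estimate, which is the lever to reach for when high-resolution batches run out of memory.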

Rendering remains single-threaded. Increasing workers beyond CPU count provides diminishing returns since the rendering stage acts as the bottleneck upstream of the OCR pool.

Summary

  • Configure num_workers in LiteParseConfig to control OCR concurrency for batch parsing
  • The default value equals logical CPU cores minus one, with a minimum of one, calculated in config.rs
  • Internal implementation uses a tokio::sync::Semaphore in ocr_merge.rs to limit simultaneous OCR requests
  • Set the parameter via CLI (--num-workers), Python (num_workers), Node.js (numWorkers), or Rust struct initialization
  • Balance throughput against memory usage and external API rate limits when tuning worker counts

Frequently Asked Questions

What is the default number of workers if I don't specify one?

LiteParse automatically sets num_workers to the number of logical CPU cores minus one, with a floor of one worker. This calculation occurs in the default_num_workers() function within config.rs, ensuring baseline parallelism without overwhelming system resources.

Does increasing workers affect PDF rendering speed?

No. PDF rendering and spatial projection run sequentially regardless of the num_workers setting. The semaphore only governs the OCR stage in ocr_merge.rs, meaning rendering acts as the upstream bottleneck while OCR executes concurrently.

How does the semaphore prevent resource exhaustion?

The tokio::sync::Semaphore allocates a fixed number of permits equal to num_workers. Each page must acquire a permit before initiating an OCR request and releases it upon completion. This backpressure mechanism ensures memory usage remains bounded by preventing unlimited accumulation of rendered images awaiting OCR processing.

Can I disable concurrent processing for debugging purposes?

Set num_workers to 1 to force sequential OCR processing. This configuration still uses the async runtime but limits the semaphore to a single permit, effectively disabling parallelism. This is useful for deterministic debugging or when integrating with OCR services that require strict sequential access.
