How to Configure the num_workers Parameter for Concurrent OCR Processing in LiteParse
Set the num_workers parameter in LiteParse to control how many OCR tasks run concurrently. You can set it through the --num-workers CLI flag, the num_workers constructor argument in Python and Node.js, or the LiteParseConfig struct in Rust. The default value is max(1, num_cpus() - 1).
LiteParse, an open-source document parsing library from run-llama/liteparse, uses optical character recognition (OCR) only when native PDF text extraction fails, making the num_workers parameter essential for controlling CPU-intensive processing concurrency. This setting determines how many pages are processed simultaneously across the library's Rust core, bindings, and deployment targets.
Understanding the num_workers Parameter
The num_workers setting defines the size of the internal worker pool that executes OCR operations when LiteParse encounters scanned pages or image-based content. According to the source code in crates/liteparse/src/config.rs, the LiteParseConfig struct exposes this value as a public field:
pub struct LiteParseConfig {
    pub num_workers: usize,
    // ... additional configuration fields
}
When you do not specify a value, LiteParse automatically calculates the default as max(1, num_cpus() - 1), reserving one CPU core for system operations while utilizing the remainder for OCR tasks. This logic is implemented in the default_num_workers() helper function within lines 59-60 of the configuration file.
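To make the default formula concrete, here is a minimal Python sketch of the same arithmetic. It is not LiteParse's code: the function below only borrows the helper's name for illustration, and it uses os.cpu_count() as a stand-in for the Rust num_cpus() call.

```python
import os


def default_num_workers(cpu_count=None):
    """Illustrative re-statement of max(1, num_cpus() - 1); not LiteParse's code."""
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        cpu_count = os.cpu_count() or 1
    # Leave one core free, but never go below a single worker.
    return max(1, cpu_count - 1)


print(default_num_workers(8))  # 8 cores -> 7 workers
print(default_num_workers(1))  # 1 core  -> 1 worker (the floor applies)
```

The max(1, ...) floor matters on single-core machines, where subtracting one core would otherwise yield zero workers.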
Configuration Methods Across Runtimes
Rust Core and CLI
In the Rust implementation, you can configure concurrency programmatically or via command-line arguments. The binary entry point in crates/liteparse/src/main.rs accepts an optional --num-workers flag that overrides the default (lines 89-101):
# Use default (CPU cores - 1)
liteparse input.pdf --output output.json
# Limit to 2 concurrent workers
liteparse input.pdf --output output.json --num-workers 2
For library consumers, modify the LiteParseConfig struct before instantiating the parser:
use liteparse::{LiteParse, LiteParseConfig};
let mut cfg = LiteParseConfig::default();
cfg.num_workers = 4; // Explicit concurrency limit
let parser = LiteParse::new("document.pdf", cfg);
let result = parser.parse().await?;
Python Bindings
The Python API exposes num_workers as a constructor parameter in crates/liteparse-python/src/lib.rs (lines 216-333). Pass an integer to the LiteParse class to override the default:
from liteparse import LiteParse
# Default behavior (CPU cores - 1)
parser = LiteParse("input.pdf")
result = parser.parse()
# Custom worker count
parser = LiteParse("input.pdf", num_workers=3)
result = parser.parse()
Node.js Bindings
For JavaScript and TypeScript applications using the N-API bindings defined in crates/liteparse-napi/src/types.rs (lines 36-40), supply the num_workers option in the constructor:
const { LiteParse } = require("liteparse");
// Default concurrency
const parser = new LiteParse("input.pdf");
// Constrained processing
const parserLimited = new LiteParse("input.pdf", { num_workers: 2 });
WebAssembly Constraints
When using LiteParse in browser environments via WebAssembly, concurrency is forcibly limited to a single worker regardless of the configuration value. The WASM shim in crates/liteparse-wasm/src/lib.rs (lines 98-100) explicitly sets this restriction due to browser threading limitations and SharedArrayBuffer constraints.
Internal Concurrency Implementation
The concurrency limit is enforced in crates/liteparse/src/ocr_merge.rs by a Tokio semaphore initialized with the num_workers value (lines 78-84). When the parser in crates/liteparse/src/parser.rs (lines 170-182) encounters pages that require OCR, it acquires permits from this semaphore before spawning OCR tasks, so the number of OCR tasks running at once never exceeds the configured limit.
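The semaphore pattern described here can be sketched in Python with asyncio.Semaphore. This is an analogy to the Tokio mechanism, not a port of LiteParse's implementation: ocr_page, run_ocr, and the sleep that stands in for OCR work are all hypothetical.

```python
import asyncio


async def ocr_page(page, sem, state):
    # Each task must hold a permit while it does its (simulated) OCR work.
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for CPU-heavy OCR
        state["active"] -= 1
        return f"text-{page}"


async def run_ocr(pages, num_workers):
    # The semaphore starts with num_workers permits, like the worker limit.
    sem = asyncio.Semaphore(num_workers)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(ocr_page(p, sem, state) for p in pages))
    return results, state["peak"]


results, peak = asyncio.run(run_ocr(range(10), num_workers=3))
print(len(results), "pages processed; peak concurrency:", peak)
```

All ten tasks are created up front, but the recorded peak never exceeds num_workers because tasks without a permit wait at the semaphore.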
Performance Considerations
Adjusting the num_workers value involves balancing speed against resource utilization:
- Higher values increase throughput on multi-core machines but raise memory consumption and CPU saturation risk. This configuration suits dedicated servers processing large document batches.
- Lower values provide predictable resource usage and prevent system thrashing, making them ideal for containerized environments with CPU quotas or CI/CD pipelines.
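For containerized or CI environments, one option is to derive the worker count from the CPUs actually available to the process rather than the host total. The helper below is a hypothetical sketch, not part of LiteParse. Note that os.sched_getaffinity (Linux only) reflects CPU pinning but not CFS quotas such as a fractional CPU limit, so in that case the limit is best passed explicitly.

```python
import os


def pick_num_workers(cpu_limit=None, reserve=1):
    """Hypothetical helper: choose a worker count from an effective CPU limit."""
    if cpu_limit is None:
        if hasattr(os, "sched_getaffinity"):
            # CPUs this process may run on (respects cpusets and pinning).
            cpu_limit = len(os.sched_getaffinity(0))
        else:
            cpu_limit = os.cpu_count() or 1
    # Keep `reserve` cores free, but always return at least one worker.
    return max(1, int(cpu_limit) - reserve)


# Example: a container limited to 4 CPUs, leaving one core for everything else.
workers = pick_num_workers(cpu_limit=4)
print(workers)
# The result could then be passed along, e.g. LiteParse("input.pdf", num_workers=workers)
```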
Summary
- The num_workers parameter controls OCR concurrency in LiteParse, defaulting to max(1, num_cpus() - 1) as implemented in crates/liteparse/src/config.rs.
- Configuration varies by interface: --num-workers for CLI, constructor arguments for Python and Node.js, and the LiteParseConfig struct for Rust.
- The runtime enforces limits via a Tokio semaphore in ocr_merge.rs after receiving the value from parser.rs.
- WebAssembly builds ignore this setting and always use a single worker due to browser limitations.
- Adjust this value based on available CPU cores and memory constraints to optimize document processing throughput.
Frequently Asked Questions
What is the default value of num_workers in LiteParse?
The default value is automatically calculated as the number of logical CPU cores minus one, with a minimum of one worker. This calculation occurs in crates/liteparse/src/config.rs and ensures LiteParse leaves one core available for system operations while maximizing OCR throughput.
Can I use multiple OCR workers in the browser with LiteParse WASM?
No. The WebAssembly build in crates/liteparse-wasm/src/lib.rs hardcodes the worker count to one due to browser threading limitations and the lack of true parallelism in standard WASM environments, regardless of the num_workers value passed in configuration.
Why does LiteParse show high memory usage when I increase num_workers?
Each concurrent OCR worker maintains independent memory buffers for image processing and text recognition. Increasing num_workers creates more simultaneous OCR contexts, which multiplies memory consumption proportionally. Reduce this value if you encounter out-of-memory errors in resource-constrained environments.
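If memory scales roughly linearly with the worker count, a simple cap can be computed from a memory budget. The sketch below is illustrative only: cap_workers_by_memory is a hypothetical helper, and the per-worker figure is an assumed number that you would need to measure for your own documents, not a value published by LiteParse.

```python
def cap_workers_by_memory(requested, memory_budget_mb, per_worker_mb):
    """Hypothetical helper: limit workers so the estimated OCR memory fits a budget."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    # How many workers fit if each one needs roughly per_worker_mb.
    affordable = memory_budget_mb // per_worker_mb
    # Never exceed the requested count, and never drop below one worker.
    return max(1, min(requested, affordable))


# Assumed figures for illustration: 2048 MB budget, ~512 MB per OCR worker.
print(cap_workers_by_memory(requested=8, memory_budget_mb=2048, per_worker_mb=512))
```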
How do I check the actual number of workers being used at runtime?
When initializing LiteParse programmatically in Rust, inspect the config.num_workers field after construction. For CLI usage, verbose logging modes may expose configuration details, or you can monitor system process activity to observe the actual concurrency level during PDF processing.