How to Configure LiteParse for Batch Parsing with Concurrent Workers
Set the num_workers parameter in LiteParseConfig to limit concurrent OCR tasks via an internal semaphore, defaulting to your CPU core count minus one.
When processing large document collections with the run-llama/liteparse library, configuring concurrent workers is essential for optimizing throughput. LiteParse processes each document through three distinct stages—PDF rendering, optical character recognition (OCR), and spatial projection—but only the OCR phase benefits from parallelization. By adjusting the num_workers configuration, you control how many pages undergo simultaneous OCR processing, preventing resource exhaustion while maximizing batch parsing performance.
Understanding the Worker Pool Architecture
LiteParse operates sequentially for rendering and spatial projection, but uses an asynchronous worker pool specifically for OCR operations. According to the source code in ocr_merge.rs, the library initializes a tokio::sync::Semaphore where the number of permits equals your configured num_workers value.
When OCR is enabled, the parsing flow follows this pattern:
- Render: Each page converts to a raster image sequentially
- OCR: Each rendered image acquires a semaphore permit before sending the request to the OCR engine
- Merge: Results return and the permit releases for the next queued page
This architecture ensures that CPU-intensive rendering remains sequential while I/O-bound OCR requests execute concurrently up to your specified limit.
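The flow above can be approximated with a small asyncio sketch. It is not LiteParse's code: fake_render and fake_ocr are hypothetical stand-ins, and the text does not say whether the permit is acquired in the render loop or inside each OCR task. This sketch assumes the render loop, so rendered images cannot pile up ahead of the OCR pool.

```python
import asyncio


def fake_render(page: int) -> str:
    # Hypothetical stand-in for CPU-bound page rasterization.
    return f"image-{page}"


async def fake_ocr(image: str) -> str:
    # Hypothetical stand-in for an I/O-bound OCR request.
    await asyncio.sleep(0.01)
    return f"text<{image}>"


async def parse_pages(page_count: int, num_workers: int) -> list:
    permits = asyncio.Semaphore(num_workers)

    async def ocr_then_release(image: str) -> str:
        try:
            return await fake_ocr(image)
        finally:
            permits.release()  # hand the permit to the next queued page

    tasks = []
    for page in range(page_count):
        image = fake_render(page)   # 1. render one page at a time
        await permits.acquire()     # 2. pause here while the OCR pool is full
        tasks.append(asyncio.create_task(ocr_then_release(image)))

    # 3. gather returns results in submission order, so pages stay in sequence
    return await asyncio.gather(*tasks)


results = asyncio.run(parse_pages(page_count=5, num_workers=2))
```

Because the loop waits on the permit before rendering the next page, at most num_workers images are in OCR at once, plus the one just rendered.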
Configuration Methods
You can configure num_workers through four interfaces: the command line, the Python SDK, the Node.js SDK, or the Rust API.
Command Line Interface (CLI)
The Rust binary accepts a --num-workers flag that propagates directly to the internal configuration. In main.rs, the CLI argument parses into cmd.num_workers and maps to config.num_workers before initialization.
# Process a directory with 8 concurrent OCR workers
liteparse batch ./documents --num-workers 8 --output-format json
Python SDK
The Python wrapper accepts num_workers as a constructor argument in parser.py, forwarding the value to the underlying Rust configuration via kwargs.
from liteparse import LiteParse

parser = LiteParse(
    ocr_enabled=True,
    num_workers=4,  # Limit to 4 concurrent OCR operations
    output_format="json"
)
results = parser.parse_folder("./invoices")
print(results)
Node.js SDK
The Node.js wrapper exposes the same parameter through the LiteParse class options. The TypeScript definitions in cli.ts show the flag forwarding to the native binary.
import { LiteParse } from "liteparse";

const parser = new LiteParse({
  ocrEnabled: true,
  numWorkers: 6, // Controls the OCR concurrency limit
  outputFormat: "json",
});
const results = await parser.parseFolder("./documents");
console.log(results);
Rust (Programmatic)
When using LiteParse directly in Rust, set the field on the LiteParseConfig struct before instantiating the parser. The field definition resides in config.rs.
use liteparse::{LiteParse, LiteParseConfig, OutputFormat};

// The statements below belong inside an async function that returns a Result.
let config = LiteParseConfig {
    ocr_enabled: true,
    num_workers: 5, // Custom worker count
    output_format: OutputFormat::Json,
    ..Default::default()
};
let parser = LiteParse::new(config);
let document = parser.parse("report.pdf").await?;
Default Behavior and Semaphore Implementation
If you omit the num_workers parameter, LiteParse computes a default for you. The function default_num_workers() in config.rs returns the number of logical CPU cores minus one, with a minimum of one worker, because a semaphore with zero permits would never let an OCR task start.
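As a rough illustration of that rule, here is a Python sketch of the "cores minus one, floor of one" calculation. The function name sketch_default_workers is hypothetical, and the real logic lives in Rust and may differ in detail.

```python
import os
from typing import Optional


def sketch_default_workers(logical_cores: Optional[int] = None) -> int:
    """Approximate the described default: logical cores minus one, at least one."""
    if logical_cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        logical_cores = os.cpu_count() or 1
    return max(1, logical_cores - 1)


eight_core_default = sketch_default_workers(8)   # 7
single_core_default = sketch_default_workers(1)  # 1, never 0
```

The floor matters on single-core machines, where a plain subtraction would give zero workers.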
The semaphore implementation in ocr_merge.rs creates the concurrency boundary:
- Each page requiring OCR attempts to acquire a permit from the semaphore
- If all permits are active, the task yields until one becomes available
- Upon OCR completion, the permit releases back to the pool
This bounded concurrency prevents memory bloat from buffering too many rendered images while keeping the OCR pipeline saturated.
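To see the bound in action, the following sketch counts how many simulated OCR calls are in flight at once. It uses Python's asyncio.Semaphore as an analogue of tokio::sync::Semaphore, and the sleep is a placeholder for a real OCR request, so treat it as an illustration of the pattern only.

```python
import asyncio


async def measure_peak(page_count: int, num_workers: int) -> int:
    permits = asyncio.Semaphore(num_workers)
    in_flight = 0
    peak = 0

    async def ocr_page() -> None:
        nonlocal in_flight, peak
        async with permits:            # waits here when all permits are taken
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated OCR latency
            in_flight -= 1
        # leaving the block returns the permit to the pool

    await asyncio.gather(*(ocr_page() for _ in range(page_count)))
    return peak


peak_with_three = asyncio.run(measure_peak(page_count=10, num_workers=3))
peak_with_one = asyncio.run(measure_peak(page_count=10, num_workers=1))
```

With ten pages queued, the observed peak matches the permit count: three in the first run, one in the second. The single-permit run processes pages strictly one at a time.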
Performance Considerations
Adjust num_workers based on your OCR provider capacity. While the default leverages available CPU cores, external OCR services (like cloud APIs) may impose rate limits that necessitate lower concurrency values.
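One way to pick a starting value for a rate-limited service is Little's law: the average number of requests in flight equals the request rate times the average latency. The helper below is a hypothetical sketch, and the rate limit and latency figures are made-up inputs you would replace with your provider's numbers.

```python
import math


def workers_for_rate_limit(max_requests_per_sec: float, avg_latency_sec: float) -> int:
    """Largest worker count whose steady-state request rate stays within the limit.

    N workers that each take avg_latency_sec per request issue about
    N / avg_latency_sec requests per second, so N <= rate * latency.
    """
    workers = math.floor(max_requests_per_sec * avg_latency_sec)
    # With a floor of one worker, a very strict limit may still be exceeded;
    # in that case extra throttling is needed on top of num_workers.
    return max(1, workers)


suggested = workers_for_rate_limit(max_requests_per_sec=10, avg_latency_sec=0.5)  # 5
```

Treat the result as an upper bound to test against, since real latencies vary from page to page.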
Memory scales linearly with worker count. Each concurrent worker holds one rendered page image in memory until OCR completes. Reducing workers lowers peak memory usage during batch processing of high-resolution documents.
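A back-of-the-envelope estimate makes the linear scaling concrete. The sketch assumes uncompressed 8-bit RGB buffers and a US Letter page rendered at 300 DPI. Both are illustrative assumptions, not LiteParse's documented rendering settings.

```python
def page_image_bytes(width_in: float, height_in: float, dpi: int,
                     bytes_per_pixel: int = 3) -> int:
    """Size of one uncompressed raster page (assumes 3 bytes per RGB pixel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def peak_ocr_buffer_bytes(num_workers: int, per_page_bytes: int) -> int:
    """Each concurrent worker is assumed to hold exactly one page image."""
    return num_workers * per_page_bytes


letter_300dpi = page_image_bytes(8.5, 11, 300)               # 2550 * 3300 * 3 = 25_245_000
four_workers = peak_ocr_buffer_bytes(4, letter_300dpi)       # 100_980_000 bytes
```

Under these assumptions, four workers hold roughly 96 MiB of page images at peak, and eight workers about twice that.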
Rendering remains single-threaded. Increasing workers beyond CPU count provides diminishing returns since the rendering stage acts as the bottleneck upstream of the OCR pool.
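The diminishing returns can be illustrated with a deliberately simplified steady-state model: pages reach the OCR pool no faster than they are rendered, and the pool drains them at num_workers pages per OCR latency. The function names and timings below are hypothetical, and the model ignores scheduling overhead and variance.

```python
import math


def estimated_pages_per_sec(render_sec: float, ocr_sec: float, num_workers: int) -> float:
    """Throughput is capped by whichever stage is slower."""
    render_rate = 1.0 / render_sec       # single-threaded rendering
    ocr_rate = num_workers / ocr_sec     # pool of concurrent OCR requests
    return min(render_rate, ocr_rate)


def saturating_workers(render_sec: float, ocr_sec: float) -> int:
    """Worker count past which rendering becomes the bottleneck in this model."""
    return max(1, math.ceil(ocr_sec / render_sec))


# Example timings (assumed): 0.25 s to render a page, 1.0 s per OCR request.
with_two = estimated_pages_per_sec(0.25, 1.0, num_workers=2)    # 2.0 pages/s
with_four = estimated_pages_per_sec(0.25, 1.0, num_workers=4)   # 4.0 pages/s
with_eight = estimated_pages_per_sec(0.25, 1.0, num_workers=8)  # still 4.0 pages/s
```

In this example the pool saturates at four workers. Doubling to eight changes nothing because rendering cannot feed the pool any faster.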
Summary
- Configure num_workers in LiteParseConfig to control OCR concurrency for batch parsing
- The default value equals CPU cores minus one, calculated in config.rs
- Internal implementation uses a tokio::sync::Semaphore in ocr_merge.rs to limit simultaneous OCR requests
- Set the parameter via CLI (--num-workers), Python (num_workers), Node.js (numWorkers), or Rust struct initialization
- Balance throughput against memory usage and external API rate limits when tuning worker counts
Frequently Asked Questions
What is the default number of workers if I don't specify one?
LiteParse automatically sets num_workers to the number of logical CPU cores minus one, with a floor of one worker. This calculation occurs in the default_num_workers() function within config.rs, ensuring baseline parallelism without overwhelming system resources.
Does increasing workers affect PDF rendering speed?
No. PDF rendering and spatial projection run sequentially regardless of the num_workers setting. The semaphore only governs the OCR stage in ocr_merge.rs, meaning rendering acts as the upstream bottleneck while OCR executes concurrently.
How does the semaphore prevent resource exhaustion?
The tokio::sync::Semaphore allocates a fixed number of permits equal to num_workers. Each page must acquire a permit before initiating an OCR request and releases it upon completion. This backpressure mechanism ensures memory usage remains bounded by preventing unlimited accumulation of rendered images awaiting OCR processing.
Can I disable concurrent processing for debugging purposes?
Set num_workers to 1 to force sequential OCR processing. This configuration still uses the async runtime but limits the semaphore to a single permit, effectively disabling parallelism. This is useful for deterministic debugging or when integrating with OCR services that require strict sequential access.