# How to Configure LiteParse for Batch Parsing with Concurrent Workers

> Configure LiteParse for efficient batch parsing with concurrent workers. Set `num_workers` to cap the number of simultaneous OCR tasks and tune throughput.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**Set the `num_workers` parameter in `LiteParseConfig` to limit concurrent OCR tasks via an internal semaphore, defaulting to your CPU core count minus one.**

When processing large document collections with the `run-llama/liteparse` library, configuring concurrent workers is essential for optimizing throughput. LiteParse processes each document through three distinct stages: PDF rendering, optical character recognition (OCR), and spatial projection. Only the OCR stage is parallelized. By adjusting the `num_workers` configuration, you control how many pages undergo OCR at the same time, which prevents resource exhaustion while keeping batch parsing fast.

## Understanding the Worker Pool Architecture

LiteParse operates sequentially for rendering and spatial projection, but uses an asynchronous worker pool specifically for OCR operations. According to the source code in [ocr_merge.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), the library initializes a `tokio::sync::Semaphore` where the number of permits equals your configured `num_workers` value.

When OCR is enabled, the parsing flow follows this pattern:

1. **Render**: Each page converts to a raster image sequentially
2. **OCR**: Each rendered image acquires a semaphore permit before sending the request to the OCR engine
3. **Merge**: Results return and the permit releases for the next queued page

This architecture ensures that CPU-intensive rendering remains sequential while I/O-bound OCR requests execute concurrently up to your specified limit.
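The flow above can be sketched in Python with `asyncio.Semaphore`, which plays the same role as the `tokio::sync::Semaphore` described here. This is only an illustration of the bounded-concurrency pattern, not LiteParse's Rust code: `fake_ocr` and `parse_pages` are hypothetical names, and the OCR call is simulated with a short sleep.

```python
import asyncio


async def fake_ocr(page: int, delay: float = 0.01) -> str:
    """Hypothetical stand-in for an OCR request; a real engine call would go here."""
    await asyncio.sleep(delay)
    return f"text for page {page}"


async def parse_pages(num_pages: int, num_workers: int) -> tuple[list[str], int]:
    """Run OCR for every page with at most `num_workers` requests in flight.

    Returns the OCR results in page order and the peak concurrency observed.
    """
    semaphore = asyncio.Semaphore(num_workers)  # one permit per worker
    in_flight = 0
    peak = 0

    async def ocr_with_permit(page: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # acquire a permit; it is released on exit
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_ocr(page)
            finally:
                in_flight -= 1

    tasks = []
    for page in range(num_pages):
        # Sequential rendering would happen here, one page at a time,
        # before the page is handed to the OCR pool.
        tasks.append(asyncio.create_task(ocr_with_permit(page)))

    results = await asyncio.gather(*tasks)  # preserves page order
    return list(results), peak


if __name__ == "__main__":
    texts, peak = asyncio.run(parse_pages(num_pages=10, num_workers=3))
    print(f"pages parsed: {len(texts)}, peak concurrent OCR calls: {peak}")
```

Whatever the page count, the observed peak never exceeds `num_workers`, which is the property the semaphore is there to guarantee.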

## Configuration Methods

You can configure `num_workers` through four interfaces: the CLI and the Python, Node.js, and Rust APIs.

### Command Line Interface (CLI)

The Rust binary accepts a `--num-workers` flag that propagates directly to the internal configuration. In [main.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/main.rs), the CLI argument parses into `cmd.num_workers` and maps to `config.num_workers` before initialization.

```bash
# Process a directory with 8 concurrent OCR workers
liteparse batch ./documents --num-workers 8 --output-format json
```

### Python SDK

The Python wrapper accepts `num_workers` as a constructor argument in [parser.py](https://github.com/run-llama/liteparse/blob/main/packages/python/liteparse/parser.py), forwarding the value to the underlying Rust configuration via `kwargs`.

```python
from liteparse import LiteParse

parser = LiteParse(
    ocr_enabled=True,
    num_workers=4,  # Limit to 4 concurrent OCR operations
    output_format="json",
)

results = parser.parse_folder("./invoices")
print(results)
```

### Node.js SDK

The Node.js wrapper exposes the same parameter through the `LiteParse` class options. The TypeScript definitions in [cli.ts](https://github.com/run-llama/liteparse/blob/main/packages/node/src/cli.ts) show the flag forwarding to the native binary.

```typescript
import { LiteParse } from "liteparse";

const parser = new LiteParse({
  ocrEnabled: true,
  numWorkers: 6,  // Controls the OCR concurrency limit
  outputFormat: "json",
});

const results = await parser.parseFolder("./documents");
console.log(results);
```

### Rust (Programmatic)

When using LiteParse directly in Rust, set the field on the `LiteParseConfig` struct before instantiating the parser. The field definition resides in [config.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs).

```rust
use liteparse::{LiteParse, LiteParseConfig, OutputFormat};

let config = LiteParseConfig {
    ocr_enabled: true,
    num_workers: 5,  // Custom worker count
    output_format: OutputFormat::Json,
    ..Default::default()
};

let parser = LiteParse::new(config);
let document = parser.parse("report.pdf").await?;
```

## Default Behavior and Semaphore Implementation

If you omit the `num_workers` parameter, LiteParse falls back to a computed default. The function `default_num_workers()` in [config.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs) returns the number of logical CPU cores minus one, with a minimum of one worker. The floor matters on single-core machines: a semaphore with zero permits would never let an OCR task start, so parsing would hang.
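As a rough illustration of that rule, the sketch below reproduces the "cores minus one, floor of one" calculation in Python. It mirrors the behavior described here rather than the Rust source; the optional `logical_cores` parameter is an assumption added so the function can be exercised with fixed inputs.

```python
import os
from typing import Optional


def default_num_workers(logical_cores: Optional[int] = None) -> int:
    """Return logical cores minus one, never less than one.

    `logical_cores` is a hypothetical override for demonstration;
    when omitted, the count is read from the operating system.
    """
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


if __name__ == "__main__":
    for cores in (1, 2, 8):
        print(f"{cores} logical cores -> {default_num_workers(cores)} workers")
```

With 8 logical cores this gives 7 workers; with 1 or 2 cores it gives 1.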

The semaphore implementation in [ocr_merge.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) creates the concurrency boundary:

- Each page requiring OCR attempts to acquire a permit from the semaphore
- If all permits are active, the task yields until one becomes available
- Upon OCR completion, the permit releases back to the pool

This bounded concurrency prevents memory bloat from buffering too many rendered images while keeping the OCR pipeline saturated.

## Performance Considerations

**Adjust `num_workers` based on your OCR provider capacity.** While the default leverages available CPU cores, external OCR services (like cloud APIs) may impose rate limits that necessitate lower concurrency values.
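One way to pick a starting value for a rate-limited provider is Little's law: the average number of requests in flight equals the request rate multiplied by the average latency. The sketch below applies that to cap the worker count. The function name, the rate limit, and the latency figures are all hypothetical; measure your own provider before relying on the result.

```python
import math


def workers_for_rate_limit(
    requests_per_second: float,
    avg_latency_seconds: float,
    cpu_default: int,
) -> int:
    """Cap the worker count so steady-state traffic stays under a rate limit.

    By Little's law, `w` workers with average latency `L` issue roughly
    `w / L` requests per second, so the sustainable concurrency is
    about `rate * L`.
    """
    sustainable = math.floor(requests_per_second * avg_latency_seconds)
    return max(1, min(cpu_default, sustainable))


if __name__ == "__main__":
    # Assumed figures: 2 requests/second allowed, 1.5 s per OCR call, 7 default workers.
    print(workers_for_rate_limit(2, 1.5, cpu_default=7))
```

With the assumed figures the cap works out to 3 workers, well below the CPU-based default of 7.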

**Memory scales linearly with worker count.** Each concurrent worker holds one rendered page image in memory until OCR completes. Reducing workers lowers peak memory usage during batch processing of high-resolution documents.
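A back-of-the-envelope estimate makes the linear relationship concrete. The sketch below assumes uncompressed images at 4 bytes per pixel and a US Letter page rendered at 300 DPI. Both are illustrative assumptions; the pixel format and resolution LiteParse actually uses are not stated here.

```python
def page_image_bytes(
    width_in: float, height_in: float, dpi: int, bytes_per_pixel: int = 4
) -> int:
    """Size of one uncompressed rendered page (assumed raw pixel buffer)."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


def peak_ocr_buffer_bytes(num_workers: int, per_page_bytes: int) -> int:
    """Upper bound on image memory held by in-flight OCR tasks."""
    return num_workers * per_page_bytes


if __name__ == "__main__":
    per_page = page_image_bytes(8.5, 11, dpi=300)
    for workers in (1, 4, 8):
        total_mb = peak_ocr_buffer_bytes(workers, per_page) / 1_000_000
        print(f"{workers} workers -> about {total_mb:.1f} MB of page images")
```

Under these assumptions one page occupies 33,660,000 bytes, so 8 workers hold roughly 269 MB of page images at peak, while 4 workers hold about half that.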

**Rendering remains single-threaded.** Increasing workers beyond CPU count provides diminishing returns since the rendering stage acts as the bottleneck upstream of the OCR pool.
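A simplified pipeline model shows why extra workers stop helping once rendering is the bottleneck. If rendering takes `render_ms` per page on one thread and each OCR call takes `ocr_ms`, the steady-state cost per page is roughly the larger of `render_ms` and `ocr_ms / num_workers`. The timings below are hypothetical, and the model ignores startup, scheduling overhead, and variance between pages.

```python
def useful_ocr_workers(render_ms: int, ocr_ms: int) -> int:
    """Smallest worker count at which sequential rendering becomes the bottleneck."""
    return max(1, -(-ocr_ms // render_ms))  # ceiling division on integers


def steady_state_ms_per_page(render_ms: int, ocr_ms: int, num_workers: int) -> float:
    """Approximate per-page cost: the slower of rendering and pooled OCR."""
    return max(render_ms, ocr_ms / num_workers)


if __name__ == "__main__":
    render_ms, ocr_ms = 200, 1000  # assumed timings, for illustration only
    print("useful workers:", useful_ocr_workers(render_ms, ocr_ms))
    for workers in (2, 5, 8):
        cost = steady_state_ms_per_page(render_ms, ocr_ms, workers)
        print(f"{workers} workers -> {cost:.0f} ms per page")
```

With these assumed timings, going from 2 to 5 workers cuts the per-page cost from 500 ms to 200 ms, while going from 5 to 8 changes nothing because rendering already sets the pace.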

## Summary

- Configure `num_workers` in `LiteParseConfig` to control OCR concurrency for batch parsing
- The default value equals CPU cores minus one, calculated in [config.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs)
- Internal implementation uses a `tokio::sync::Semaphore` in [ocr_merge.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) to limit simultaneous OCR requests
- Set the parameter via CLI (`--num-workers`), Python (`num_workers`), Node.js (`numWorkers`), or Rust struct initialization
- Balance throughput against memory usage and external API rate limits when tuning worker counts

## Frequently Asked Questions

### What is the default number of workers if I don't specify one?

LiteParse automatically sets `num_workers` to the number of logical CPU cores minus one, with a floor of one worker. This calculation occurs in the `default_num_workers()` function within [config.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs), ensuring baseline parallelism without overwhelming system resources.

### Does increasing workers affect PDF rendering speed?

No. PDF rendering and spatial projection run sequentially regardless of the `num_workers` setting. The semaphore only governs the OCR stage in [ocr_merge.rs](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), meaning rendering acts as the upstream bottleneck while OCR executes concurrently.

### How does the semaphore prevent resource exhaustion?

The `tokio::sync::Semaphore` allocates a fixed number of permits equal to `num_workers`. Each page must acquire a permit before initiating an OCR request and releases it upon completion. This backpressure mechanism ensures memory usage remains bounded by preventing unlimited accumulation of rendered images awaiting OCR processing.

### Can I disable concurrent processing for debugging purposes?

Set `num_workers` to `1` to force sequential OCR processing. This configuration still uses the async runtime but limits the semaphore to a single permit, effectively disabling parallelism. This is useful for deterministic debugging or when integrating with OCR services that require strict sequential access.