How to Optimize Tesseract Performance for Large Document Batches

March 2, 2026 · tesseract-ocr/tesseract

Enable OpenMP support at compile time and set the tessedit_parallelize variable to a value greater than 1 (for example, 4) to activate parallel word pre-classification alongside multi-threaded LSTM inference. Depending on the workload, this can reduce batch processing time by roughly 2–5× on modern multi-core systems.

When processing thousands of pages from scanned archives or digitized books, the default single-threaded configuration of the tesseract-ocr/tesseract engine creates significant bottlenecks during image decoding, page segmentation, and neural network inference. By leveraging the internal OpenMP directives and batch-processing APIs found in the source tree, you can optimize Tesseract performance for large document batches while maintaining accuracy. This guide covers the specific compiler flags, runtime variables, and architectural patterns implemented in files such as src/ccmain/par_control.cpp and src/api/baseapi.cpp to maximize throughput.

Enable OpenMP Parallelism at Compile Time

Tesseract’s parallel processing capabilities rely on OpenMP directives embedded throughout the C++ source. Before you can utilize multi-threading, you must ensure the engine is built with OpenMP support enabled.

The build system detects OpenMP availability automatically in modern CMake configurations, but you should verify that support was actually compiled in. According to src/tesseract.cpp, the version banner (tesseract --version) reports the OpenMP version when the _OPENMP macro is defined at compile time.

Compile from source with OpenMP enabled:

git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
mkdir build && cd build
cmake .. -DENABLE_OPENMP=ON -DLSTM_NUM_THREADS=4
make -j$(nproc)
sudo make install

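For GCC or Clang, this adds the -fopenmp flag. On MSVC, the equivalent /openmp flag is applied. CMake option names have changed between releases (recent trees define OPENMP_BUILD), so run cmake -LH in the build directory or check CMakeLists.txt to confirm which options your checkout accepts. Without OpenMP compiled in, the parallelism variables discussed below have no effect.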
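One way to confirm that a given binary was built with OpenMP is to scan its version banner. The following is a minimal sketch, assuming that OpenMP-enabled builds include a line mentioning OpenMP in the output of tesseract --version; the helper name has_openmp is hypothetical.

```shell
#!/usr/bin/env bash
# has_openmp: reads a version banner on stdin and succeeds if it mentions OpenMP.
# Assumption: OpenMP-enabled builds print an "OpenMP" line in `tesseract --version`.
has_openmp() {
  grep -qi 'openmp'
}

# Example (requires tesseract on PATH):
#   if tesseract --version 2>&1 | has_openmp; then
#     echo "OpenMP build"
#   else
#     echo "no OpenMP support detected"
#   fi
```

Reading the banner from stdin keeps the check independent of where the binary lives, so the same helper works for a system package and a freshly built copy.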

Configure the Global Parallelism Level

The tessedit_parallelize integer variable acts as the master switch for parallel execution paths. It is declared in src/ccmain/tesseractclass.h and registered in src/ccmain/tesseractclass.cpp with a default value of 0, and it controls whether the engine uses serial or parallel word pre-classification.

Set the variable via command line:

tesseract input.tif output -l eng -c tessedit_parallelize=4

The behavior follows these rules:

0 – Forces serial execution (default)
1 – Enables optional parallel paths but maintains serial execution for word pre-classification
>1 – Activates OpenMP loops in PrerecAllWordsPar within src/ccmain/par_control.cpp, distributing blob classification across the specified thread count

For production batch jobs, set this to the number of physical cores available, typically between 4 and 16.
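Because nproc reports logical CPUs, hyper-threaded machines need a different source for the physical core count. The sketch below assumes a Linux host with lscpu available; the helper names count_physical_cores and parallelize_flag are hypothetical.

```shell
#!/usr/bin/env bash
# count_physical_cores: reads `lscpu --parse=CORE,SOCKET` output on stdin and
# prints the number of distinct (core, socket) pairs, i.e. physical cores.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# parallelize_flag CORES: prints the -c option for the given core count,
# falling back to 1 if the count is missing or not a positive integer.
parallelize_flag() {
  local cores=${1:-1}
  case $cores in
    ''|*[!0-9]*|0) cores=1 ;;
  esac
  printf -- '-c tessedit_parallelize=%s\n' "$cores"
}

# Example (Linux, requires lscpu and tesseract):
#   cores=$(lscpu --parse=CORE,SOCKET | count_physical_cores)
#   tesseract input.tif output -l eng $(parallelize_flag "$cores")
```

The command substitution in the example is left unquoted on purpose so that the shell splits it into the two arguments -c and tessedit_parallelize=N.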

Optimize LSTM Inference Threading

While tessedit_parallelize controls high-level parallelism, the LSTM neural network engine uses a separate static constant kNumThreads defined in src/lstm/fullyconnected.cpp. This parameter determines how many threads execute the #pragma omp parallel for blocks during matrix operations.

By default, kNumThreads is set to 4. Override it at compile time for machines with higher core counts:

cmake .. -DLSTM_NUM_THREADS=8

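Alternatively, cap thread usage at runtime with OpenMP environment variables. OMP_NUM_THREADS sets the default team size for parallel regions, but a region that requests an explicit thread count through a num_threads clause is not reduced by it. OMP_THREAD_LIMIT imposes an upper bound on every region, so set both when you need a hard cap:

export OMP_NUM_THREADS=8
export OMP_THREAD_LIMIT=8
tesseract batch_list.txt stdout -l eng -c tessedit_parallelize=8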

Implement Efficient Batch Processing Strategies

For true high-throughput batch OCR, minimize per-process initialization overhead by using the file list API rather than invoking Tesseract once per image.

Process Documents via File Lists

The ProcessPagesFileList function in src/api/baseapi.cpp reads a newline-delimited text file containing image paths and processes them sequentially within a single Tesseract instance. This avoids reloading language models and allocating caches for every page.

Create a list of files:

ls -1 /path/to/scanned_pages/*.tif > filelist.txt

Execute batch processing:

tesseract filelist.txt output -l eng -c tessedit_parallelize=4
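For very large directories, a shell glob can exceed the argument-length limit, and unsorted listings make the output order hard to reproduce. The following sketch builds the list with find instead; the helper name make_filelist is hypothetical, and the .tif/.tiff extensions are an assumption about the input set.

```shell
#!/usr/bin/env bash
# make_filelist DIR OUT: write one image path per line, sorted, into OUT.
# Matches *.tif and *.tiff case-insensitively, without descending into subdirectories.
make_filelist() {
  local dir=$1 out=$2
  find "$dir" -maxdepth 1 -type f \( -iname '*.tif' -o -iname '*.tiff' \) | sort > "$out"
}

# Example:
#   make_filelist /path/to/scanned_pages filelist.txt
#   tesseract filelist.txt output -l eng -c tessedit_parallelize=4
```

Because the list is newline-delimited, file names that themselves contain newlines are not supported by this approach.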

Parallelize with External Workers

When CPU utilization remains below capacity, launch multiple Tesseract processes operating on disjoint file list segments. Each worker process loads its own copy of the language data, so budget memory per worker and keep the product of workers and threads per worker at or below the core count.

Split the workload and run in parallel:

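# Split the list into 500-line chunks named part_aa, part_ab, ...
split -l 500 filelist.txt part_
# Match only the two-letter chunk names so earlier *.out files are not re-read as lists
for part in part_??; do
  (tesseract "$part" stdout -l eng -c tessedit_parallelize=4 > "${part}.out") &
done
wait
cat part_??.out > final_output.txt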
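The loop above starts one process per chunk simultaneously, which can oversubscribe the CPU when there are many chunks. A bounded pool is one alternative. This is a sketch under the assumption that xargs supports the -P option (GNU and BSD versions do); the helper names workers_for and run_pool are hypothetical.

```shell
#!/usr/bin/env bash
# workers_for CORES THREADS: number of concurrent processes such that
# workers * threads does not exceed the core count (never below 1).
workers_for() {
  local cores=$1 threads=$2
  local n=$(( cores / threads ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# run_pool WORKERS CMD...: read one chunk name per line on stdin and run CMD
# once per chunk, replacing every {} in CMD with the chunk name, with at most
# WORKERS commands running at a time.
run_pool() {
  local workers=$1
  shift
  xargs -P "$workers" -I{} "$@"
}

# Example (16 cores, 4 threads per process -> 4 concurrent workers).
# Each chunk part_xx produces part_xx.txt:
#   ls part_?? | run_pool "$(workers_for 16 4)" \
#     tesseract {} {} -l eng -c tessedit_parallelize=4
```

Using the chunk name as the output base lets each worker write its own text file, so no shell redirection is needed inside xargs.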

Handle Multi-Page TIFFs Efficiently

Multi-page TIFFs force Tesseract to seek through the file for each page, which can create I/O bottlenecks. The ProcessPagesInternal logic in src/api/baseapi.cpp handles these pages sequentially, but performance can improve when you split the TIFF into individual pages first (tiffsplit takes the input file followed by an output prefix):

tiffsplit input.tif page_
ls page_*.tif > pages.txt
tesseract pages.txt stdout -l eng -c tessedit_parallelize=4

Fine-Tune Runtime Parameters for Speed

Beyond parallelism, disable unnecessary linguistic analysis to reduce per-page CPU cycles. Set these parameters via the -c flag or a config file:

tessedit_pageseg_mode – Use mode 3 (fully automatic, no OSD) or 6 (single uniform block) instead of modes that run orientation and script detection, such as 0 or 1. Pass as --psm 3 or --psm 6.
tessedit_ocr_engine_mode – Use mode 1 (LSTM only) for modern fonts. Pass as --oem 1.
load_system_dawg / load_freq_dawg – Set to 0 to disable dictionary lookups if your documents contain specialized vocabulary lacking dictionary coverage.
classify_bln_numeric_mode – Set to 1 only if the documents consist of numeric data. This parameter tells the classifier to assume the input is numbers, so enabling it on general text harms accuracy.
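Rather than repeating -c flags on every invocation, the settings can be collected in a config file. The sketch below assumes the usual Tesseract config format of one "name value" pair per line and that a config file path may be passed as a trailing command-line argument; the helper name write_speed_config and the file name batch_fast.cfg are hypothetical.

```shell
#!/usr/bin/env bash
# write_speed_config FILE: write a Tesseract config file ("name value" per line)
# that disables the system and frequent-word dictionaries.
# The values mirror the list above; adjust them to your documents.
write_speed_config() {
  cat > "$1" <<'EOF'
load_system_dawg 0
load_freq_dawg 0
EOF
}

# Example (page segmentation and engine mode stay on the command line):
#   write_speed_config ./batch_fast.cfg
#   tesseract filelist.txt output -l eng --psm 6 --oem 1 ./batch_fast.cfg
```

Keeping the file under version control alongside the batch scripts makes it easier to reproduce a run later.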

C++ API Implementation with Parallel Flags

For custom applications, enable parallelism programmatically using the base API:

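#include <cstdio>

#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main() {
  tesseract::TessBaseAPI api;
  // Init returns 0 on success; nullptr selects the default tessdata location
  if (api.Init(nullptr, "eng") != 0) {
    std::fprintf(stderr, "Could not initialize Tesseract\n");
    return 1;
  }

  // Enable parallel word pre-classification
  api.SetVariable("tessedit_parallelize", "4");

  // Fully automatic page segmentation without OSD (use "6" for a single uniform block)
  api.SetVariable("tessedit_pageseg_mode", "3");

  // Write recognized text to output.txt; with a null renderer no output file is written
  tesseract::TessTextRenderer renderer("output");

  // A text file of image paths is detected as a file list and handed to ProcessPagesFileList
  if (!api.ProcessPages("filelist.txt", nullptr, 0, &renderer)) {
    std::fprintf(stderr, "Batch processing failed\n");
    return 1;
  }
  api.End();
  return 0;
}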

This implementation honors the tessedit_parallelize setting within ProcessPages, using the same OpenMP paths as the CLI.

Summary

To optimize Tesseract performance for large document batches, implement these architectural adjustments:

Compile with OpenMP support using -DENABLE_OPENMP=ON to unlock parallel primitives in src/ccmain/par_control.cpp and src/lstm/fullyconnected.cpp
Set tessedit_parallelize to your core count (values > 1) to enable PrerecAllWordsPar parallelization
Adjust LSTM threads via -DLSTM_NUM_THREADS or OMP_NUM_THREADS for matrix operation scaling
Use file lists with ProcessPagesFileList to amortize model initialization across thousands of images
Split multi-page TIFFs before processing to eliminate seek latency
Tune PSM/OEM modes to skip unnecessary layout analysis when document structure is known

Frequently Asked Questions

What is the optimal value for `tessedit_parallelize` in batch processing?

Set tessedit_parallelize equal to the number of physical CPU cores on your machine, typically between 4 and 16. Values greater than 1 activate the OpenMP parallel for loops in PrerecAllWordsPar, while the LSTM layer respects the separate kNumThreads constant or OMP_NUM_THREADS environment variable. Setting this too high on I/O-bound systems may degrade performance due to thread contention.

Does enabling OpenMP increase memory usage?

Yes, but modestly. Each OpenMP thread allocates private buffers for blob classification and LSTM inference. In practice, a worker process uses approximately 500 MB with the default English LSTM model, scaling sub-linearly when processing file lists because language data is shared across pages. External parallelization via multiple processes provides better isolation if memory is constrained.

Can I process multiple PDF files in parallel using the Tesseract CLI?

The Tesseract CLI processes file lists sequentially within ProcessPagesFileList. To parallelize across multiple PDFs or TIFFs, split your file list into chunks and spawn separate Tesseract processes, or use the C++ API to implement a thread pool in which each thread owns its own TessBaseAPI instance and calls ProcessPages on a different file list segment. The engine does not natively support multiple simultaneous document streams within a single instance.

Should I use LSTM-only mode (`--oem 1`) for batch speed?

Generally yes. The LSTM-only engine mode (--oem 1) is typically more accurate for modern printed text than the legacy Tesseract engine (--oem 0), and it avoids the cost of running both engines in the combined mode (--oem 2). Note that --oem 3 is the default setting, which selects an engine based on what the installed language data supports. According to the implementation in src/lstm/fullyconnected.cpp, the LSTM paths use the kNumThreads parallelism, making them well suited to batch optimization when accuracy requirements permit neural network-only recognition.
