
How to Train Custom Language Models for Domain-Specific OCR with Tesseract

March 2, 2026 · tesseract-ocr/tesseract

Training a domain-specific OCR model with Tesseract requires creating a unicharset, building a starter traineddata file, and fine-tuning an LSTM network using the unicharset_extractor, combine_lang_model, and lstmtraining utilities.

The tesseract-ocr/tesseract repository provides a complete training pipeline for domain-specific OCR. By leveraging the source utilities in src/training/, you can build custom language models that recognize specialized vocabularies, scripts, or visual styles that differ from standard languages.

Understanding the Tesseract Training Pipeline

The training process follows four stages, each implemented in a separate utility:

Character set extraction – unicharset_extractor generates a .unicharset file containing all Unicode characters in your domain.
Starter model creation – combine_lang_model packages the unicharset with optional word lists and DAWGs into a *.traineddata file.
Training data generation – text2image renders text into images, which tesseract then converts to .lstmf files.
LSTM fine-tuning – lstmtraining optimizes the neural network using synthetic or real training images.

Step 1: Extract the Character Set with unicharset_extractor

The unicharset_extractor utility, implemented in src/training/unicharset_extractor.cpp, reads box files or plain text and normalizes the content before building the character set.

How unicharset_extractor Works

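The core function AddStringsToUnicharset passes each input string through NormalizeCleanAndSegmentUTF8. The --norm_mode flag selects how graphemes are normalized:

--norm_mode 1 combines graphemes, suggested here for Latin scripts
--norm_mode 2 splits graphemes, suggested here for Indic scripts
--norm_mode 3 keeps pure Unicode, suggested here for Arabic/Thai scripts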


# Collect domain-specific text files in ./corpus/ and make sure the output directory exists

mkdir -p ./langdata/custom
unicharset_extractor \
    --output_unicharset ./langdata/custom/custom.unicharset \
    ./corpus/*.txt

This produces custom.unicharset, which lists every unique character required for your domain-specific OCR model.
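Before training, it can help to compare the extracted unicharset against what the corpus actually contains. The sketch below is a hypothetical helper (char_inventory is not part of Tesseract) that lists every distinct non-whitespace character in the given files with its count, rarest first. It assumes the shell runs in a UTF-8 locale, so that grep treats a multi-byte character as one character.

```shell
# Hypothetical helper, not a Tesseract tool: list each distinct
# non-whitespace character in the given files with its count, rarest first.
# Assumes a UTF-8 locale when the corpus is not plain ASCII.
char_inventory() {
    cat -- "$@" \
        | grep -o '[^[:space:]]' \
        | sort \
        | uniq -c \
        | sort -n
}
```

Running `char_inventory ./corpus/*.txt | head` shows the rarest characters first, which are the ones most likely to be under-represented in training.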

Step 2: Build the Starter Language Model with combine_lang_model

The combine_lang_model utility in src/training/combine_lang_model.cpp orchestrates the creation of the initial traineddata file. It relies on helper functions in src/training/unicharset/lang_model_helpers.cpp for DAWG generation and recoder creation.

Creating DAWGs and Recoders

The utility processes three optional word list files:

words.txt – vocabulary list for the language model DAWG
puncs.txt – punctuation characters
numbers.txt – numeric patterns

The WriteRecoder function writes the recoder, which compresses the character encoding unless --pass_through_recoder is set, while WriteDawgs builds the directed acyclic word graphs (DAWGs) used for fast dictionary lookups.
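If no curated vocabulary exists yet, a first words.txt can be derived from the corpus itself. The following is a minimal sketch, not a Tesseract utility: build_wordlist is a hypothetical helper that keeps whitespace-separated tokens occurring at least a given number of times, after stripping leading and trailing punctuation. The frequency threshold is an assumption to tune per domain.

```shell
# Hypothetical helper: print words that occur at least MIN times in the
# corpus, one per line, with leading/trailing punctuation stripped.
# Usage: build_wordlist MIN file...
build_wordlist() {
    min_count=$1
    shift
    cat -- "$@" \
        | tr -s '[:space:]' '\n' \
        | sed -e 's/^[[:punct:]]*//' -e 's/[[:punct:]]*$//' \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | awk -v min="$min_count" '$1 + 0 >= min + 0 { print $2 }'
}
```

For example, `build_wordlist 3 ./corpus/*.txt > ./wordlists/words.txt` keeps only tokens seen three or more times, which filters out most one-off typos.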


# Prepare word lists (one entry per line)

mkdir -p ./langdata/custom
cp ./wordlists/words.txt ./langdata/custom/words.txt
cp ./wordlists/punc.txt ./langdata/custom/puncs.txt
cp ./wordlists/numbers.txt ./langdata/custom/numbers.txt

# Build the starter traineddata

combine_lang_model \
    --input_unicharset ./langdata/custom/custom.unicharset \
    --script_dir ./langdata \
    --output_dir ./output \
    --lang custom \
    --words ./langdata/custom/words.txt \
    --puncs ./langdata/custom/puncs.txt \
    --numbers ./langdata/custom/numbers.txt \
    --pass_through_recoder=false

The output ./output/custom/custom.traineddata contains the unicharset, recoder, and DAWGs, serving as the foundation for LSTM training.
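A quick way to confirm the starter file was assembled as expected is to list its components with combine_tessdata -d. The sketch below wraps that call in a hypothetical check_starter function. The component names it looks for (lstm-unicharset, lstm-recoder) are assumptions about how the listing labels them, so adjust the list if your build prints something different.

```shell
# Sketch: confirm the starter traineddata lists the LSTM components.
# Returns 1 if the file is missing, 2 if combine_tessdata is not installed,
# 3 if an expected component name does not appear in the listing.
check_starter() {
    starter=$1
    [ -f "$starter" ] || { echo "missing: $starter" >&2; return 1; }
    command -v combine_tessdata >/dev/null 2>&1 || {
        echo "combine_tessdata not found on PATH" >&2
        return 2
    }
    # The listing may go to stderr, so capture both streams.
    listing=$(combine_tessdata -d "$starter" 2>&1)
    for part in lstm-unicharset lstm-recoder; do
        printf '%s\n' "$listing" | grep -q "$part" || {
            echo "component not listed: $part" >&2
            return 3
        }
    done
    echo "starter looks complete: $starter"
}
```

Typical use would be `check_starter ./output/custom/custom.traineddata` right after combine_lang_model finishes.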

Step 3: Generate Synthetic Training Data with text2image

For domain-specific OCR, synthetic training data often provides the most consistent ground truth. The text2image utility in src/training/text2image.cpp renders text corpora into image files with corresponding box data.

Rendering Text to Images

The tool can degrade the rendered images to simulate real-world scanning artifacts, controlled by the --degrade_image flag. Documentation in doc/text2image.1.asc details additional parameters for resolution and page layout.


# Generate synthetic training pages

text2image \
    --text ./corpus/all_text.txt \
    --outputbase ./train_data/custom_page \
    --fonts_dir /usr/share/fonts/truetype \
    --font "DejaVu Sans" \
    --resolution 300 \
    --degrade_image=true
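Domain documents rarely use a single typeface, so rendering the same text in several fonts is a common way to broaden the training set. The sketch below loops text2image over a list of fonts. Both function names are hypothetical, the output naming scheme is an assumption, and the font names you pass should come from what `text2image --fonts_dir DIR --list_available_fonts` reports on your system.

```shell
# Hypothetical helper: turn a font name into a file-safe name fragment.
font_to_base() {
    printf '%s' "$1" | tr ' ' '_'
}

# Render the same text once per font.
# Usage: render_fonts TEXT_FILE FONTS_DIR OUT_DIR FONT...
render_fonts() {
    text=$1
    fonts_dir=$2
    out_dir=$3
    shift 3
    command -v text2image >/dev/null 2>&1 || {
        echo "text2image not found on PATH" >&2
        return 2
    }
    mkdir -p "$out_dir"
    for font in "$@"; do
        text2image \
            --text "$text" \
            --outputbase "$out_dir/custom.$(font_to_base "$font").exp0" \
            --fonts_dir "$fonts_dir" \
            --font "$font" \
            --resolution 300
    done
}
```

A call such as `render_fonts ./corpus/all_text.txt /usr/share/fonts/truetype ./train_data "DejaVu Sans" "DejaVu Serif"` would write one image and box file pair per font.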

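Convert these images to the LSTM training format. The -l custom argument assumes Tesseract can locate custom.traineddata, for example through --tessdata-dir or TESSDATA_PREFIX: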


# Create lstmf files from the generated images

tesseract ./train_data/custom_page.tif ./train_data/custom_page \
    -l custom --psm 6 lstm.train

# Collect all training files

find ./train_data -name "*.lstmf" > ./train_data/custom.training_files.txt

The .lstmf files contain serialized training data optimized for the lstmtraining utility.
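Holding back part of the data makes it possible to measure progress on lines the network has not trained on; lstmtraining accepts such a list through --eval_listfile. The sketch below is a hypothetical helper that sends every Nth entry of a list file to an evaluation list and the rest to a training list. The every-Nth rule is an assumption chosen for reproducibility, not something Tesseract prescribes.

```shell
# Hypothetical helper: deterministic train/eval split of a list file.
# Every Nth line goes to EVAL_OUT, all other lines go to TRAIN_OUT.
# Usage: split_listfile ALL_LIST TRAIN_OUT EVAL_OUT N
split_listfile() {
    # Create or empty both outputs so they exist even if one gets no lines.
    : > "$2"
    : > "$3"
    awk -v train_out="$2" -v eval_out="$3" -v n="$4" '
        NR % n == 0 { print > eval_out; next }
                    { print > train_out }
    ' "$1"
}
```

For instance, `split_listfile ./train_data/custom.training_files.txt ./train_data/train.txt ./train_data/eval.txt 10` reserves roughly 10% of the files for evaluation.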

Step 4: Fine-Tune the LSTM Network with lstmtraining

The final stage uses lstmtraining in src/training/lstmtraining.cpp to optimize the neural network weights. This utility initializes the LSTMTrainer class from src/training/unicharset/lstmtrainer.cpp to handle the gradient descent loop.

Training Loop and Checkpoints

The trainer loads the starter traineddata and iterates through the .lstmf files, adjusting weights to reduce recognition error. Key parameters include --max_iterations, which caps training duration, and --target_error_rate, which is expressed in percent and stops training early once the error rate falls below it.


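# Fine-tune the LSTM network.
# ./model/base_lstm is the LSTM component extracted from an existing model.
# ./model/base.traineddata (hypothetical path) is that model's traineddata,
# needed because the custom unicharset differs from the base model's.

lstmtraining \
    --model_output ./model/custom_lstm \
    --continue_from ./model/base_lstm \
    --old_traineddata ./model/base.traineddata \
    --traineddata ./output/custom/custom.traineddata \
    --train_listfile ./train_data/custom.training_files.txt \
    --max_iterations 5000 \
    --target_error_rate 0.01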

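During training, the utility writes checkpoint files under the ./model/custom_lstm prefix rather than a finished model. To obtain a traineddata file ready for inference, run lstmtraining again with --stop_training, pointing --continue_from at a checkpoint and --model_output at the final .traineddata path.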
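A checkpoint written under the --model_output prefix can be packaged into a traineddata file with lstmtraining --stop_training. The sketch below wraps that step in a hypothetical finalize_model function. It assumes the rolling checkpoint is named PREFIX_checkpoint and that the same starter traineddata used for training is passed again.

```shell
# Sketch: package the latest checkpoint as a traineddata file.
# Assumes lstmtraining wrote its rolling checkpoint as PREFIX_checkpoint.
# Usage: finalize_model PREFIX STARTER_TRAINEDDATA OUTPUT_TRAINEDDATA
finalize_model() {
    prefix=$1
    starter=$2
    out=$3
    checkpoint="${prefix}_checkpoint"
    [ -f "$checkpoint" ] || { echo "no checkpoint: $checkpoint" >&2; return 1; }
    command -v lstmtraining >/dev/null 2>&1 || {
        echo "lstmtraining not found on PATH" >&2
        return 2
    }
    lstmtraining \
        --stop_training \
        --continue_from "$checkpoint" \
        --traineddata "$starter" \
        --model_output "$out"
}
```

With the paths used in this guide, the call would be `finalize_model ./model/custom_lstm ./output/custom/custom.traineddata ./model/custom.traineddata`.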

Key Source Files and Implementation Details

Understanding the underlying source code helps debug training issues and customize the pipeline:

Component	Source File	Core Function
Character extraction	`src/training/unicharset_extractor.cpp`	`AddStringsToUnicharset` processes text through `NormalizeCleanAndSegmentUTF8`
Model packaging	`src/training/combine_lang_model.cpp`	`CombineLangModel` orchestrates unicharset, recoder, and DAWG creation
Helper utilities	`src/training/unicharset/lang_model_helpers.cpp`	`WriteRecoder` and `WriteDawgs` handle compression and dictionary automata
Training driver	`src/training/lstmtraining.cpp`	Command-line interface and checkpoint management
Training engine	`src/training/unicharset/lstmtrainer.cpp`	`LSTMTrainer` class implementing gradient descent and error calculation
Synthetic data	`src/training/text2image.cpp`	Renders text corpora to degraded images for training data augmentation

All utilities share the common flag-parsing infrastructure in commandlineflags.h and validate version compatibility through tesseract::CheckSharedLibraryVersion().

Summary

Extract the character set using unicharset_extractor from src/training/unicharset_extractor.cpp to define all Unicode characters in your domain.
Build a starter model with combine_lang_model to package the unicharset, optional word lists, and DAWGs into a traineddata file.
Generate training data using text2image to create synthetic images with ground truth boxes, then convert them to lstmf format.
Fine-tune the LSTM via lstmtraining, specifying iteration limits and target error rates to optimize the neural network for your specific domain.

Frequently Asked Questions

What is the minimum amount of training data needed for Tesseract LSTM training?

There is no hard minimum enforced by the tools. As a rough guide, fine-tuning an existing model typically needs at least 4,000 to 5,000 lines of text, while training from scratch needs far more: the Tesseract training documentation cites roughly 400,000 text lines for its own Latin-script models. The LSTMTrainer class in src/training/unicharset/lstmtrainer.cpp processes these lines through serialized lstmf files, and convergence depends on character diversity rather than just line count.

Can I fine-tune an existing Tesseract model instead of training from scratch?

Yes, fine-tuning is the recommended approach for domain-specific OCR. Use the --continue_from flag with lstmtraining to load an existing *.lstm model file. The trainer in src/training/lstmtraining.cpp initializes the network weights from this checkpoint before processing your domain-specific lstmf files, allowing the model to retain general recognition capabilities while adapting to your specialized vocabulary.
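The *.lstm file for --continue_from is usually extracted from an existing traineddata with combine_tessdata -e, which picks the component to extract from the extension of the output file name. The sketch below wraps that in a hypothetical extract_base_lstm function. It assumes the base model is a float model (such as those published in tessdata_best), since integer models are not suitable as a starting point for training.

```shell
# Sketch: extract the LSTM component from an existing traineddata.
# Returns 3 if the output name does not end in .lstm, 1 if the base model
# is missing, 2 if combine_tessdata is not installed.
# Usage: extract_base_lstm BASE_TRAINEDDATA OUTPUT.lstm
extract_base_lstm() {
    base=$1
    out=$2
    case "$out" in
        *.lstm) ;;
        *) echo "output name must end in .lstm: $out" >&2; return 3 ;;
    esac
    [ -f "$base" ] || { echo "missing: $base" >&2; return 1; }
    command -v combine_tessdata >/dev/null 2>&1 || {
        echo "combine_tessdata not found on PATH" >&2
        return 2
    }
    mkdir -p "$(dirname "$out")"
    combine_tessdata -e "$base" "$out"
}
```

For example, `extract_base_lstm ./tessdata_best/eng.traineddata ./model/eng.lstm` (both paths are placeholders) produces a file that can be passed to --continue_from.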

How do I handle rare characters or symbols in my domain-specific OCR model?

Rare characters require explicit inclusion in the unicharset during the extraction phase. Ensure your training corpus in corpus/*.txt contains examples of these symbols so that unicharset_extractor (specifically the AddStringsToUnicharset function in src/training/unicharset_extractor.cpp) includes them in the output. For extremely rare glyphs, consider oversampling their frequency in the training text or using text2image to generate additional synthetic examples with specific fonts that render these characters clearly.
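Oversampling can be done on the text side before rendering. The sketch below is a hypothetical helper that copies the corpus and then repeats every line containing a given string a chosen number of extra times. The number of copies is an assumption to tune, and heavy repetition of identical lines may encourage memorization, so varying fonts alongside it is advisable.

```shell
# Hypothetical helper: print the corpus, then repeat every line containing
# the literal string NEEDLE an extra COPIES times.
# Usage: oversample_lines NEEDLE COPIES FILE > boosted.txt
oversample_lines() {
    needle=$1
    copies=$2
    src=$3
    cat -- "$src"
    i=0
    while [ "$i" -lt "$copies" ]; do
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -F -e "$needle" -- "$src" || true
        i=$((i + 1))
    done
    return 0
}
```

The boosted file can then be passed to text2image through --text in place of the original corpus.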

What file format does Tesseract use for training data?

Tesseract uses the lstmf format for LSTM training data. These files are binary containers in Tesseract's own serialization format (not protocol buffers), created when you run tesseract with the lstm.train config on image files. The lstmtraining utility in src/training/lstmtraining.cpp consumes these files through a list file (specified via --train_listfile), loading each lstmf entry into the LSTMTrainer for processing. Unlike the legacy box file format used in Tesseract 3.x, lstmf files bundle the image data, ground truth text, and character bounding boxes in a single binary file.
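Because lstmtraining reads its inputs through a list file, a stale or empty entry can waste a training run. The following sketch is a hypothetical pre-flight check, not part of Tesseract: it prints every listed path that is missing or zero-length and returns non-zero if it finds any.

```shell
# Hypothetical helper: report list-file entries that are missing or empty.
# Prints the offending paths and returns 1 if any are found, 0 otherwise.
# Usage: validate_listfile LIST_FILE
validate_listfile() {
    bad=0
    while IFS= read -r path; do
        # Skip blank lines in the list file.
        [ -n "$path" ] || continue
        if [ ! -s "$path" ]; then
            echo "$path"
            bad=1
        fi
    done < "$1"
    return "$bad"
}
```

Running `validate_listfile ./train_data/custom.training_files.txt` before training gives a quick list of files to regenerate.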
