How to Train Custom Language Models for Domain-Specific OCR with Tesseract
Training a domain-specific OCR model with Tesseract requires creating a unicharset, building a starter traineddata file, and fine-tuning an LSTM network using the unicharset_extractor, combine_lang_model, and lstmtraining utilities.
The tesseract-ocr/tesseract repository provides a complete training pipeline for domain-specific OCR. By leveraging the source utilities in src/training/, you can build custom language models that recognize specialized vocabularies, scripts, or visual styles that differ from standard languages.
Understanding the Tesseract Training Pipeline
The training process follows three distinct stages implemented in separate utilities:
- Character set extraction: unicharset_extractor generates a .unicharset file containing all Unicode characters in your domain.
- Starter model creation: combine_lang_model packages the unicharset with optional word lists and DAWGs into a *.traineddata file.
- LSTM fine-tuning: lstmtraining optimizes the neural network using synthetic or real training images.
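Before starting, it can help to confirm that the training utilities are actually on PATH, since they are often packaged separately from the tesseract binary. The following is a minimal sketch; missing_tools is a hypothetical helper, and the tool list simply mirrors the utilities named in this article.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Utilities used across the stages (text2image is only needed for synthetic data).
missing=$(missing_tools unicharset_extractor combine_lang_model lstmtraining text2image tesseract)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on one line.
    echo "Missing training tools:" $missing >&2
fi
```

The check only reports missing tools; it does not stop the shell, so it can be pasted at the top of a longer script and extended as needed.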
Step 1: Extract the Character Set with unicharset_extractor
The unicharset_extractor utility, implemented in src/training/unicharset_extractor.cpp, reads box files or plain text and normalizes the content before building the character set.
How unicharset_extractor Works
The core function AddStringsToUnicharset processes input through NormalizeCleanAndSegmentUTF8 to handle different normalization modes:
- --norm_mode 1 for Latin scripts
- --norm_mode 2 for Indic scripts
- --norm_mode 3 for Arabic/Thai scripts
# Collect domain-specific text files in ./corpus/
unicharset_extractor \
--output_unicharset ./langdata/custom/custom.unicharset \
./corpus/*.txt
This produces custom.unicharset, which lists every unique character required for your domain-specific OCR model.
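For a non-Latin corpus, the normalization mode has to be passed explicitly. The sketch below maps a script family to the --norm_mode values listed above and feeds the result to unicharset_extractor. The helper norm_mode_for and its family names are hypothetical, and the paths are the ones assumed throughout this article.

```shell
# Hypothetical helper: map a script family to a --norm_mode value.
norm_mode_for() {
    case "$1" in
        latin)        echo 1 ;;
        indic)        echo 2 ;;
        arabic|thai)  echo 3 ;;
        *)            echo "unknown script family: $1" >&2; return 1 ;;
    esac
}

# Only run the extractor when it is installed and a corpus is present.
if command -v unicharset_extractor >/dev/null 2>&1 && ls ./corpus/*.txt >/dev/null 2>&1; then
    mkdir -p ./langdata/custom
    unicharset_extractor \
        --norm_mode "$(norm_mode_for indic)" \
        --output_unicharset ./langdata/custom/custom.unicharset \
        ./corpus/*.txt
fi
```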
Step 2: Build the Starter Language Model with combine_lang_model
The combine_lang_model utility in src/training/combine_lang_model.cpp orchestrates the creation of the initial traineddata file. It relies on helper functions in src/training/unicharset/lang_model_helpers.cpp for DAWG generation and recoder creation.
Creating DAWGs and Recoders
The utility processes three optional word list files:
- words.txt: vocabulary list for the language model DAWG
- puncs.txt: punctuation characters
- numbers.txt: numeric patterns
The WriteRecoder function handles optional character encoding compression, while WriteDawgs builds the directed acyclic word graphs (DAWGs) used for fast dictionary lookups.
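If no curated vocabulary exists yet, a first words list can be derived from the training corpus itself. This is a rough sketch rather than a recommended tokenizer: build_wordlist is a hypothetical helper that splits on whitespace only, so punctuation stays attached to words, and the ./corpus and ./wordlists paths are assumptions.

```shell
# Hypothetical helper: print the unique whitespace-separated tokens of the
# given files, most frequent first (ties broken alphabetically).
build_wordlist() {
    cat "$@" \
        | tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2 }'
}

# Write the list where the following step expects to find it.
if ls ./corpus/*.txt >/dev/null 2>&1; then
    mkdir -p ./wordlists
    build_wordlist ./corpus/*.txt > ./wordlists/words.txt
fi
```

Ordering by frequency makes it easy to truncate the list with head if the vocabulary turns out to be too noisy.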
# Prepare word lists (one entry per line)
cat ./wordlists/words.txt > ./langdata/custom/words.txt
cat ./wordlists/punc.txt > ./langdata/custom/puncs.txt
cat ./wordlists/numbers.txt > ./langdata/custom/numbers.txt
# Build the starter traineddata
combine_lang_model \
--input_unicharset ./langdata/custom/custom.unicharset \
--script_dir ./langdata \
--output_dir ./output \
--lang custom \
--words ./langdata/custom/words.txt \
--puncs ./langdata/custom/puncs.txt \
--numbers ./langdata/custom/numbers.txt \
--pass_through_recoder=false
The output ./output/custom/custom.traineddata contains the unicharset, recoder, and DAWGs, serving as the foundation for LSTM training.
Step 3: Generate Synthetic Training Data with text2image
For domain-specific OCR, synthetic training data often provides the most consistent ground truth. The text2image utility in src/training/text2image.cpp renders text corpora into image files with corresponding box data.
Rendering Text to Images
The tool supports font degradation to simulate real-world scanning artifacts, controlled by the --degrade_image flag. Documentation in doc/text2image.1.asc details additional parameters for resolution and page layout.
# Generate synthetic training pages
text2image \
--text ./corpus/all_text.txt \
--outputbase ./train_data/custom_page \
--fonts_dir /usr/share/fonts/truetype \
--font "DejaVu Sans" \
--resolution 300 \
--degrade_image=true
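Rendering the same corpus in several fonts usually gives the model more visual variety. The loop below is a sketch under assumptions: the font names are examples that may not be installed (text2image --list_available_fonts shows what is available under --fonts_dir), and font_slug is a hypothetical helper for building output names.

```shell
# Hypothetical helper: turn a font name into a file-name-safe suffix.
font_slug() {
    printf '%s' "$1" | tr -s ' ' '_'
}

# Render one page set per font; skipped when text2image or the corpus is absent.
if command -v text2image >/dev/null 2>&1 && [ -f ./corpus/all_text.txt ]; then
    mkdir -p ./train_data
    for font in "DejaVu Sans" "DejaVu Serif" "DejaVu Sans Mono"; do
        text2image \
            --text ./corpus/all_text.txt \
            --outputbase "./train_data/custom_page.$(font_slug "$font")" \
            --fonts_dir /usr/share/fonts/truetype \
            --font "$font" \
            --resolution 300
    done
fi
```

Each rendered .tif then needs its own lstm.train pass, so the conversion step is repeated once per output base.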
Convert these images to the LSTM training format:
# Create lstmf files from the generated images
tesseract ./train_data/custom_page.tif ./train_data/custom_page \
-l custom --psm 6 lstm.train
# Collect all training files
find ./train_data -name "*.lstmf" > ./train_data/custom.training_files.txt
The .lstmf files contain serialized training data optimized for the lstmtraining utility.
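Holding back part of the data makes it possible to measure error on lines the network has not trained on; lstmtraining accepts such a list through its --eval_listfile flag. The following sketch splits the collected paths deterministically. split_listfile and the custom.eval_files.txt name are hypothetical, and a one-in-ten split is only an assumption.

```shell
# Hypothetical helper: read file paths on stdin and write every Nth one
# (after sorting) to the evaluation list, the rest to the training list.
split_listfile() {
    train_list=$1
    eval_list=$2
    every=${3:-10}
    : > "$train_list"
    : > "$eval_list"
    sort | awk -v n="$every" -v train="$train_list" -v ev="$eval_list" '
        NR % n == 0 { print > ev; next }
                    { print > train }
    '
}

if [ -d ./train_data ]; then
    find ./train_data -name '*.lstmf' \
        | split_listfile ./train_data/custom.training_files.txt \
                         ./train_data/custom.eval_files.txt 10
fi
```

Sorting before splitting keeps the two lists stable between runs, which makes error rates comparable across experiments.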
Step 4: Fine-Tune the LSTM Network with lstmtraining
The final stage uses lstmtraining in src/training/lstmtraining.cpp to optimize the neural network weights. This utility initializes the LSTMTrainer class from src/training/unicharset/lstmtrainer.cpp to handle the gradient descent loop.
Training Loop and Checkpoints
The trainer loads the starter traineddata and iterates through the .lstmf files, adjusting weights to reduce recognition error. Key parameters include --max_iterations for training duration and --target_error_rate for early stopping; the target is expressed in percent, so 0.01 means 0.01%.
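Fine-tuning needs an existing LSTM network to continue from. One way to obtain it is to extract the lstm component from a float ("best") traineddata file with combine_tessdata -e, then give the resulting path to --continue_from. The sketch below assumes an English base model stored at ./tessdata_best/eng.traineddata; that path and the has_lstm_suffix helper are hypothetical.

```shell
# Paths are assumptions: a "best" (float) eng.traineddata downloaded separately.
BASE_TRAINEDDATA=${BASE_TRAINEDDATA:-./tessdata_best/eng.traineddata}
BASE_LSTM=./model/base.lstm

# Hypothetical helper: succeed only if the path ends in .lstm, because
# combine_tessdata -e picks the component to extract from the file extension.
has_lstm_suffix() {
    case "$1" in
        *.lstm) return 0 ;;
        *)      return 1 ;;
    esac
}

if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$BASE_TRAINEDDATA" ] \
        && has_lstm_suffix "$BASE_LSTM"; then
    mkdir -p ./model
    combine_tessdata -e "$BASE_TRAINEDDATA" "$BASE_LSTM"
fi
```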
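# Fine-tune the LSTM network
# base.lstm is the LSTM component extracted from an existing model.
# --old_traineddata names that model so its character set can be mapped
# onto the new one (the eng.traineddata path is an assumption).
lstmtraining \
--model_output ./model/custom_lstm \
--continue_from ./model/base.lstm \
--old_traineddata ./tessdata_best/eng.traineddata \
--traineddata ./output/custom/custom.traineddata \
--train_listfile ./train_data/custom.training_files.txt \
--max_iterations 5000 \
--target_error_rate 0.01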
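During training, lstmtraining writes checkpoint files under the --model_output prefix (for example custom_lstm_checkpoint); it does not emit a finished traineddata file on its own. To obtain a model ready for inference, run lstmtraining again with --stop_training, pointing --continue_from at the checkpoint and --traineddata at the starter file, and set --model_output to the final traineddata path.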
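Converting a training checkpoint into a traineddata file is done with the --stop_training flag of lstmtraining. A minimal sketch, assuming the most recent checkpoint carries the --model_output prefix plus _checkpoint and reusing the paths from this article; checkpoint_for is a hypothetical helper.

```shell
# Hypothetical helper: derive the latest-checkpoint path from the
# --model_output prefix used during training.
checkpoint_for() {
    printf '%s_checkpoint' "$1"
}

CHECKPOINT=$(checkpoint_for ./model/custom_lstm)

# Package the checkpoint together with the starter traineddata.
if command -v lstmtraining >/dev/null 2>&1 && [ -f "$CHECKPOINT" ]; then
    lstmtraining \
        --stop_training \
        --continue_from "$CHECKPOINT" \
        --traineddata ./output/custom/custom.traineddata \
        --model_output ./model/custom.traineddata
fi
```

The resulting custom.traineddata can be copied into a tessdata directory and selected with -l custom.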
Key Source Files and Implementation Details
Understanding the underlying source code helps debug training issues and customize the pipeline:
| Component | Source File | Core Function |
|---|---|---|
| Character extraction | src/training/unicharset_extractor.cpp | AddStringsToUnicharset processes text through NormalizeCleanAndSegmentUTF8 |
| Model packaging | src/training/combine_lang_model.cpp | CombineLangModel orchestrates unicharset, recoder, and DAWG creation |
| Helper utilities | src/training/unicharset/lang_model_helpers.cpp | WriteRecoder and WriteDawgs handle compression and dictionary automata |
| Training driver | src/training/lstmtraining.cpp | Command-line interface and checkpoint management |
| Training engine | src/training/unicharset/lstmtrainer.cpp | LSTMTrainer class implementing gradient descent and error calculation |
| Synthetic data | src/training/text2image.cpp | Renders text corpora to degraded images for training data augmentation |
All utilities share the common flag-parsing infrastructure in commandlineflags.h and validate version compatibility through tesseract::CheckSharedLibraryVersion().
Summary
- Extract the character set using unicharset_extractor from src/training/unicharset_extractor.cpp to define all Unicode characters in your domain.
- Build a starter model with combine_lang_model to package the unicharset, optional word lists, and DAWGs into a traineddata file.
- Generate training data using text2image to create synthetic images with ground truth boxes, then convert them to lstmf format.
- Fine-tune the LSTM via lstmtraining, specifying iteration limits and target error rates to optimize the neural network for your specific domain.
Frequently Asked Questions
What is the minimum amount of training data needed for Tesseract LSTM training?
As a rule of thumb, fine-tuning an existing model works with roughly 4,000 to 5,000 lines of text, while training from scratch may need 400,000+ lines. These figures are guidance rather than limits enforced by the lstmtraining code. The LSTMTrainer class in src/training/unicharset/lstmtrainer.cpp processes the data through serialized lstmf files, and convergence depends on character diversity rather than just line count.
Can I fine-tune an existing Tesseract model instead of training from scratch?
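Yes, fine-tuning is the recommended approach for domain-specific OCR. Use the --continue_from flag with lstmtraining to load an existing *.lstm model file, which can be extracted from a traineddata file with combine_tessdata -e. The trainer in src/training/lstmtraining.cpp initializes the network weights from this checkpoint before processing your domain-specific lstmf files, allowing the model to retain general recognition capabilities while adapting to your specialized vocabulary. If your unicharset differs from the base model's, also pass --old_traineddata so the trainer can map the old character set onto the new one.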
How do I handle rare characters or symbols in my domain-specific OCR model?
Rare characters require explicit inclusion in the unicharset during the extraction phase. Ensure your training corpus in corpus/*.txt contains examples of these symbols so that unicharset_extractor (specifically the AddStringsToUnicharset function in src/training/unicharset_extractor.cpp) includes them in the output. For extremely rare glyphs, consider oversampling their frequency in the training text or using text2image to generate additional synthetic examples with specific fonts that render these characters clearly.
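To find out which characters are underrepresented before training, the corpus can be profiled with standard tools. This is only a sketch: rare_chars is a hypothetical helper, the threshold of 20 is an arbitrary assumption, and for non-ASCII text it relies on the shell running in a UTF-8 locale so that grep treats multibyte characters as single units.

```shell
# Hypothetical helper: print non-space characters that occur fewer than N
# times in the given files, as "count char" lines, rarest first.
rare_chars() {
    threshold=$1
    shift
    cat "$@" \
        | grep -o '[^[:space:]]' \
        | sort \
        | uniq -c \
        | awk -v t="$threshold" '$1 < t { print $1, $2 }' \
        | sort -k1,1n -k2,2
}

# Report characters seen fewer than 20 times in the corpus.
if ls ./corpus/*.txt >/dev/null 2>&1; then
    rare_chars 20 ./corpus/*.txt
fi
```

Characters that show up in this report are candidates for oversampling in the training text.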
What file format does Tesseract use for training data?
Tesseract uses the lstmf format for LSTM training data. These files are binary containers in Tesseract's own serialization format, created when you run tesseract with the lstm.train config on image files. The lstmtraining utility in src/training/lstmtraining.cpp consumes these files through a list file (specified via --train_listfile), loading each lstmf entry into the LSTMTrainer for training. Unlike the legacy box file format used in Tesseract 3.x, lstmf files bundle the image data, ground truth text, and character bounding boxes in a single efficient binary format.