# How to Train Custom Language Models for Domain-Specific OCR with Tesseract

> Train custom Tesseract language models for domain-specific OCR. Learn to fine-tune LSTM networks with unicharset and starter traineddata files for improved accuracy.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Training a domain-specific OCR model with Tesseract requires creating a unicharset, building a starter traineddata file, and fine-tuning an LSTM network using the `unicharset_extractor`, `combine_lang_model`, and `lstmtraining` utilities.**

The `tesseract-ocr/tesseract` repository provides a complete training pipeline for domain-specific OCR. By leveraging the source utilities in `src/training/`, you can build custom language models that recognize specialized vocabularies, scripts, or visual styles that differ from standard languages.

## Understanding the Tesseract Training Pipeline

The training process follows four stages, each implemented by a separate utility:

1. **Character set extraction** – `unicharset_extractor` generates a `.unicharset` file listing every character that appears in your training text.
2. **Starter model creation** – `combine_lang_model` packages the unicharset, a recoder, and DAWGs built from optional word lists into a starter `*.traineddata` file.
3. **Training data generation** – `text2image` renders text into page images and box files, which `tesseract` then converts to `.lstmf` training files.
4. **LSTM fine-tuning** – `lstmtraining` adjusts the network weights using those synthetic (or real) training images.
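
Before starting, it is worth confirming that the training tools are installed, since some packages ship only the `tesseract` binary. The check below is a minimal sketch that assumes the tools are expected on `PATH`; the `missing` variable is just an illustrative name.

```bash
# Report which of the utilities used in this guide are not on PATH.
missing=""
for tool in tesseract unicharset_extractor combine_lang_model text2image lstmtraining; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        missing="$missing $tool"
    fi
done

if [ -n "$missing" ]; then
    echo "Not found on PATH:$missing"
else
    echo "All training tools found"
fi
```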

## Step 1: Extract the Character Set with unicharset_extractor

The `unicharset_extractor` utility, implemented in [`src/training/unicharset_extractor.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset_extractor.cpp), reads box files or plain text and normalizes the content before building the character set.

### How unicharset_extractor Works

The core function `AddStringsToUnicharset` processes input through `NormalizeCleanAndSegmentUTF8` to handle different normalization modes:

- `--norm_mode 1` for Latin scripts
- `--norm_mode 2` for Indic scripts
- `--norm_mode 3` for Arabic/Thai scripts

```bash
# Collect domain-specific text files in ./corpus/ and create the output directory
mkdir -p ./langdata/custom

unicharset_extractor \
    --output_unicharset ./langdata/custom/custom.unicharset \
    ./corpus/*.txt
```

This produces `custom.unicharset`, which lists every unique character required for your domain-specific OCR model.
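
To sanity-check the corpus before extraction, you can preview its character inventory with standard shell tools. This is only an approximation of what `unicharset_extractor` does: it applies no normalization, and a real unicharset file also holds a few special entries. The temporary directory and sample text are stand-ins for `./corpus/`.

```bash
# Build a tiny sample corpus in a temporary directory (stand-in for ./corpus/).
corpus_dir=$(mktemp -d)
printf 'Rx 5mg\nRx 10mg\n' > "$corpus_dir/sample.txt"

# Split the text into one character per line, then keep each distinct one.
# Assumption: a UTF-8 locale, so that "grep -o ." matches whole characters.
cat "$corpus_dir"/*.txt | grep -o . | LC_ALL=C sort -u > "$corpus_dir/chars.list"

char_count=$(wc -l < "$corpus_dir/chars.list" | tr -d ' ')
echo "Distinct characters in corpus: $char_count"
```

For the two sample lines this reports eight distinct characters, including the space. An unexpectedly large count often points at stray symbols or mixed encodings in the corpus.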

## Step 2: Build the Starter Language Model with combine_lang_model

The `combine_lang_model` utility in [`src/training/combine_lang_model.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/combine_lang_model.cpp) orchestrates the creation of the initial traineddata file. It relies on helper functions in [`src/training/unicharset/lang_model_helpers.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset/lang_model_helpers.cpp) for DAWG generation and recoder creation.

### Creating DAWGs and Recoders

The utility processes three optional word list files:

- `words.txt` – vocabulary list for the language model DAWG
- `puncs.txt` – punctuation characters
- `numbers.txt` – numeric patterns

These are files you supply yourself; they are not part of the repository.

The `WriteRecoder` function handles optional character encoding compression, while `WriteDawgs` builds the DAWGs (directed acyclic word graphs) that allow fast dictionary lookups.
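
If no curated vocabulary exists yet, a frequency-sorted word list can be derived from the corpus itself. The pipeline below is a sketch using whitespace tokenization, which is an assumption that suits space-delimited languages only; the sample text and temporary paths are illustrative.

```bash
# Sample corpus (stand-in for ./corpus/*.txt).
work_dir=$(mktemp -d)
printf 'dose dose tablet\ndose capsule tablet\n' > "$work_dir/corpus.txt"

# One token per line, count occurrences, most frequent first, keep only the word.
tr -s '[:space:]' '\n' < "$work_dir/corpus.txt" \
    | sed '/^$/d' \
    | LC_ALL=C sort \
    | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{print $2}' > "$work_dir/words.txt"

cat "$work_dir/words.txt"
```

The result has one entry per line, the layout the word list files use, and can be reviewed by hand before it is copied into `./langdata/custom/words.txt`.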

```bash
# Prepare word lists (one entry per line)
mkdir -p ./langdata/custom
cp ./wordlists/words.txt ./langdata/custom/words.txt
cp ./wordlists/punc.txt ./langdata/custom/puncs.txt
cp ./wordlists/numbers.txt ./langdata/custom/numbers.txt

# Build the starter traineddata (boolean flags are written as --flag=value)
combine_lang_model \
    --input_unicharset ./langdata/custom/custom.unicharset \
    --script_dir ./langdata \
    --output_dir ./output \
    --lang custom \
    --words ./langdata/custom/words.txt \
    --puncs ./langdata/custom/puncs.txt \
    --numbers ./langdata/custom/numbers.txt \
    --pass_through_recoder=false
```

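The output `./output/custom/custom.traineddata` contains the unicharset, recoder, and DAWGs, and serves as the foundation for LSTM training. The `--script_dir` directory must hold the langdata script files, including `radical-stroke.txt`, which `combine_lang_model` reads when building the recoder.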

## Step 3: Generate Synthetic Training Data with text2image

For domain-specific OCR, synthetic training data often provides the most consistent ground truth. The `text2image` utility in [`src/training/text2image.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/text2image.cpp) renders text corpora into image files with corresponding box data.

### Rendering Text to Images

The tool can degrade the rendered images to simulate real-world scanning artifacts, controlled by the `--degrade_image` flag. Documentation in `doc/text2image.1.asc` details additional parameters for resolution and page layout.

```bash
# Generate synthetic training pages (writes custom_page.tif and custom_page.box)
text2image \
    --text ./corpus/all_text.txt \
    --outputbase ./train_data/custom_page \
    --fonts_dir /usr/share/fonts/truetype \
    --font "DejaVu Sans" \
    --resolution 300 \
    --degrade_image=true
```
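
Domain documents rarely use a single typeface, so it is common to render the same text in several fonts. The loop below is a dry-run sketch: it only prints one `text2image` command per font. The font names and the `custom.<font>` output naming are assumptions; `text2image --list_available_fonts --fonts_dir <dir>` shows which names are usable on your system.

```bash
# Dry run: print one text2image command per font instead of executing it.
# The printed line is a preview only (shell quoting is not shown).
# Remove the leading "echo" to render the pages for real.
render_count=0
for font in "DejaVu Sans" "DejaVu Serif" "Liberation Mono"; do
    # Turn "DejaVu Sans" into "DejaVu_Sans" for use in file names.
    tag=$(printf '%s' "$font" | tr ' ' '_')
    echo text2image \
        --text ./corpus/all_text.txt \
        --outputbase "./train_data/custom.$tag" \
        --fonts_dir /usr/share/fonts/truetype \
        --font "$font" \
        --resolution 300
    render_count=$((render_count + 1))
done
echo "Prepared $render_count render commands"
```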

Convert these images to the LSTM training format:

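```bash
# Create lstmf files from the generated images.
# Each .tif needs its matching .box file next to it, and -l custom requires
# custom.traineddata to be discoverable (via TESSDATA_PREFIX or --tessdata-dir).
tesseract ./train_data/custom_page.tif ./train_data/custom_page \
    -l custom --psm 6 lstm.train

# Collect all training files
find ./train_data -name "*.lstmf" > ./train_data/custom.training_files.txt
```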

The `.lstmf` files contain serialized training data optimized for the `lstmtraining` utility.
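
Holding back some `.lstmf` files for evaluation makes it easier to spot overfitting, and `lstmtraining` accepts such a list through `--eval_listfile`. The split below is a sketch: the 90/10 ratio, the file names, and the empty placeholder files are all assumptions for illustration.

```bash
# Stand-in training directory with ten empty .lstmf files.
data_dir=$(mktemp -d)
for i in 0 1 2 3 4 5 6 7 8 9; do
    : > "$data_dir/page_$i.lstmf"
done

# Sort for a reproducible order, then send every tenth file to the eval list.
find "$data_dir" -name '*.lstmf' | LC_ALL=C sort > "$data_dir/all_files.txt"
awk 'NR % 10 == 0' "$data_dir/all_files.txt" > "$data_dir/custom.eval_files.txt"
awk 'NR % 10 != 0' "$data_dir/all_files.txt" > "$data_dir/custom.training_files.txt"

wc -l "$data_dir/custom.training_files.txt" "$data_dir/custom.eval_files.txt"
```

With many fonts in play, splitting by font rather than by position gives a stricter test of generalization.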

## Step 4: Fine-Tune the LSTM Network with lstmtraining

The final stage uses `lstmtraining` in [`src/training/lstmtraining.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/lstmtraining.cpp) to optimize the neural network weights. This utility initializes the `LSTMTrainer` class from [`src/training/unicharset/lstmtrainer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset/lstmtrainer.cpp) to handle the gradient descent loop.

### Training Loop and Checkpoints

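The trainer loads the starter `traineddata` and iterates through the `.lstmf` files, adjusting weights to minimize character error rate. Key parameters include `--max_iterations`, which caps the training duration, and `--target_error_rate`, which stops training early once the error rate falls to the given value. The target is expressed in percent, so `0.01` means 0.01%, not 1%.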

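```bash
# Fine-tune the LSTM network.
# base.lstm is the network extracted from an existing model, for example:
#   combine_tessdata -e ./tessdata_best/eng.traineddata ./model/base.lstm
# (eng.traineddata is an illustrative choice of base model)
# --old_traineddata lets the trainer map the base character set onto the new one.
lstmtraining \
    --model_output ./model/custom_lstm \
    --continue_from ./model/base.lstm \
    --old_traineddata ./tessdata_best/eng.traineddata \
    --traineddata ./output/custom/custom.traineddata \
    --train_listfile ./train_data/custom.training_files.txt \
    --max_iterations 5000 \
    --target_error_rate 0.01
```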

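Training does not emit a finished model directly: `lstmtraining` writes checkpoint files under the `--model_output` prefix (here `./model/custom_lstm_checkpoint`), and a final run with `--stop_training` converts a checkpoint into a `.traineddata` file ready for inference.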
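
The conversion with `--stop_training` might look like the following. It is a sketch that assumes the checkpoint follows the `<model_output>_checkpoint` naming and uses `./model/custom.traineddata` as a hypothetical output path; the guard simply skips the step when no checkpoint or tool is available.

```bash
checkpoint=./model/custom_lstm_checkpoint
starter=./output/custom/custom.traineddata
final_model=./model/custom.traineddata   # hypothetical output path

# Only attempt the conversion when the tool and a checkpoint are present.
if command -v lstmtraining >/dev/null 2>&1 && [ -f "$checkpoint" ]; then
    lstmtraining \
        --stop_training \
        --continue_from "$checkpoint" \
        --traineddata "$starter" \
        --model_output "$final_model"
else
    echo "No checkpoint at $checkpoint (or lstmtraining missing); nothing to convert."
fi
```

Copy the resulting file into your tessdata directory to select it with `-l custom`.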

## Key Source Files and Implementation Details

Understanding the underlying source code helps debug training issues and customize the pipeline:

| Component | Source File | Core Function |
|-----------|-------------|---------------|
| **Character extraction** | [`src/training/unicharset_extractor.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset_extractor.cpp) | `AddStringsToUnicharset` processes text through `NormalizeCleanAndSegmentUTF8` |
| **Model packaging** | [`src/training/combine_lang_model.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/combine_lang_model.cpp) | `main` parses the flags and calls `CombineLangModel` to orchestrate unicharset, recoder, and DAWG creation |
| **Helper utilities** | [`src/training/unicharset/lang_model_helpers.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset/lang_model_helpers.cpp) | `CombineLangModel`, plus `WriteRecoder` and `WriteDawgs`, which handle compression and dictionary automata |
| **Training driver** | [`src/training/lstmtraining.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/lstmtraining.cpp) | Command-line interface and checkpoint management |
| **Training engine** | [`src/training/unicharset/lstmtrainer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset/lstmtrainer.cpp) | `LSTMTrainer` class implementing gradient descent and error calculation |
| **Synthetic data** | [`src/training/text2image.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/text2image.cpp) | Renders text corpora to degraded images for training data augmentation |

All utilities share the common flag-parsing infrastructure in [`src/training/common/commandlineflags.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/common/commandlineflags.h) and validate version compatibility through `tesseract::CheckSharedLibraryVersion()`.

## Summary

- **Extract the character set** using `unicharset_extractor` from [`src/training/unicharset_extractor.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset_extractor.cpp) to define all Unicode characters in your domain.
- **Build a starter model** with `combine_lang_model` to package the unicharset, optional word lists, and DAWGs into a `traineddata` file.
- **Generate training data** using `text2image` to create synthetic images with ground truth boxes, then convert them to `lstmf` format.
- **Fine-tune the LSTM** via `lstmtraining`, specifying iteration limits and target error rates to optimize the neural network for your specific domain.

## Frequently Asked Questions

### What is the minimum amount of training data needed for Tesseract LSTM training?

As a rough rule of thumb, fine-tuning an existing model calls for at least **4,000 to 5,000 lines** of text, while training from scratch may need **400,000+ lines**. These figures are guidance, not limits enforced by `lstmtraining`. The `LSTMTrainer` class in [`src/training/unicharset/lstmtrainer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset/lstmtrainer.cpp) processes the lines through serialized `lstmf` files, and convergence depends on character diversity rather than just line count.

### Can I fine-tune an existing Tesseract model instead of training from scratch?

Yes, fine-tuning is the recommended approach for domain-specific OCR. Use the `--continue_from` flag with `lstmtraining` to load an existing `*.lstm` model file, which can be extracted from a `.traineddata` file with `combine_tessdata -e`. The trainer in [`src/training/lstmtraining.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/lstmtraining.cpp) initializes the network weights from this checkpoint before processing your domain-specific `lstmf` files, allowing the model to retain general recognition capabilities while adapting to your specialized vocabulary.

### How do I handle rare characters or symbols in my domain-specific OCR model?

Rare characters require explicit inclusion in the unicharset during the extraction phase. Ensure your training corpus in `corpus/*.txt` contains examples of these symbols so that `unicharset_extractor` (specifically the `AddStringsToUnicharset` function in [`src/training/unicharset_extractor.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/unicharset_extractor.cpp)) includes them in the output. For extremely rare glyphs, consider oversampling their frequency in the training text or using `text2image` to generate additional synthetic examples with specific fonts that render these characters clearly.
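
To find which characters are candidates for oversampling, you can count how often each one occurs in the training text. The report below is a sketch: the `min_count` threshold and the sample text are hypothetical, and a UTF-8 locale is assumed so that `grep -o .` splits on whole characters.

```bash
# Sample training text (stand-in for ./corpus/*.txt); "%" plays the rare symbol.
work_dir=$(mktemp -d)
printf 'ab ab ab\nab %%\n' > "$work_dir/corpus.txt"
min_count=2   # hypothetical threshold; tune for your corpus size

# Count each character, then list those seen fewer than min_count times.
# The awk step strips the count prefix so that a rare space would survive too.
grep -o . "$work_dir/corpus.txt" \
    | LC_ALL=C sort \
    | uniq -c \
    | awk -v min="$min_count" '{ count = $1 + 0; sub(/^ *[0-9]+ /, ""); if (count < min) print }' \
    > "$work_dir/rare.txt"

cat "$work_dir/rare.txt"
```

Here only `%` is reported, since every other character appears at least twice.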

### What file format does Tesseract use for training data?

Tesseract uses the **`lstmf`** format for LSTM training data. These are binary files written in Tesseract's own serialization format, created when you run `tesseract` with the `lstm.train` config on image files. The `lstmtraining` utility in [`src/training/lstmtraining.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/lstmtraining.cpp) consumes these files through a list file (specified via `--train_listfile`), loading each `lstmf` entry into the `LSTMTrainer` for training. Unlike the legacy box file format used in Tesseract 3.x, `lstmf` files bundle the image data, ground truth text, and character bounding boxes in a single efficient binary format.