How to Migrate from the Legacy Tesseract OCR Engine to the LSTM Neural Network Engine

To migrate from the legacy Tesseract OCR engine to the LSTM neural network engine, pass OEM_LSTM_ONLY (value 1) to the Init function or use the --oem 1 command-line flag, and make sure your .traineddata file contains the LSTM component.

Tesseract 4.0 and later versions in the tesseract-ocr/tesseract repository ship with two distinct recognition back-ends: the legacy pattern-matching engine and the Long Short-Term Memory (LSTM) neural network engine. This guide explains the architectural differences, the specific source code paths that handle engine selection, and the practical steps to migrate your applications and workflows to the LSTM engine.

Understanding the OcrEngineMode Enumeration

The engine selection is controlled by the OcrEngineMode enum defined in include/tesseract/publictypes.h (lines 63–70):

enum OcrEngineMode {
  OEM_TESSERACT_ONLY,          // legacy engine only (deprecated)
  OEM_LSTM_ONLY,               // LSTM line recognizer only
  OEM_TESSERACT_LSTM_COMBINED, // LSTM with fallback to legacy (deprecated)
  OEM_DEFAULT,                 // infer from config / availability
  OEM_COUNT
};

Because the enumerators carry no explicit values, C++ numbers them from 0 in declaration order, which is where the --oem integers come from. OEM_LSTM_ONLY (value 1) forces the LSTM neural network engine and bypasses the legacy classifier entirely. OEM_DEFAULT (value 3) lets Tesseract infer the mode from the components present in the language data file.
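For scripts that build command lines, the integer values can be mirrored outside C++. The sketch below is a hypothetical Python mirror: the OcrEngineMode class and the oem_flag helper are illustrative names, not part of Tesseract or pytesseract, and the values simply follow the declaration order shown above.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Hypothetical Python mirror of the C++ enum; values follow declaration order."""
    OEM_TESSERACT_ONLY = 0           # legacy engine only (deprecated)
    OEM_LSTM_ONLY = 1                # LSTM line recognizer only
    OEM_TESSERACT_LSTM_COMBINED = 2  # LSTM with fallback to legacy (deprecated)
    OEM_DEFAULT = 3                  # infer from config / availability


def oem_flag(mode: OcrEngineMode) -> str:
    """Return the command-line form of an engine mode, e.g. '--oem 1'."""
    return f"--oem {int(mode)}"


print(oem_flag(OcrEngineMode.OEM_LSTM_ONLY))
```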

How Engine Selection Works in the Source Code

The API Entry Point in baseapi.cpp

When initializing Tesseract, the TessBaseAPI::Init function in src/api/baseapi.cpp (line 296) accepts the OcrEngineMode parameter:

int TessBaseAPI::Init(const char *datapath,
                      const char *language,
                      OcrEngineMode oem,
                      char **configs,
                      int configs_size,
                      std::vector<std::string> *vars_vec,
                      std::vector<std::string> *vars_values) {
  return Init(datapath, 0, language, oem, configs, configs_size,
              vars_vec, vars_values);
}

The oem argument is stored internally and passed to the initialization routine in src/ccmain/tessedit.cpp.

The Decision Logic in tessedit.cpp

The function Tesseract::init_tesseract_lang_data in src/ccmain/tessedit.cpp (lines 100–108) determines the final engine mode:

if (oem == OEM_DEFAULT) {
  // Infer mode from what components are present in the traineddata.
  if (!mgr->IsLSTMAvailable())
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
  else if (!mgr->IsBaseAvailable())
    tessedit_ocr_engine_mode.set_value(OEM_LSTM_ONLY);
  else
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_LSTM_COMBINED);
}

If you explicitly request any mode other than OEM_DEFAULT, such as OEM_LSTM_ONLY, the inference is skipped and the code at lines 49–53 stores the requested value directly:

if (oem != OEM_DEFAULT) {
  tessedit_ocr_engine_mode.set_value(oem);
}
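The two branches above can be read as a single decision function. The following is a minimal Python sketch of that logic, not Tesseract code: resolve_engine_mode is a hypothetical name, the two booleans stand in for IsLSTMAvailable() and IsBaseAvailable(), and values set by config files are deliberately left out.

```python
# Integer values follow the enum's declaration order.
OEM_TESSERACT_ONLY, OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED, OEM_DEFAULT = range(4)


def resolve_engine_mode(oem: int, lstm_available: bool, base_available: bool) -> int:
    """Infer a mode for OEM_DEFAULT; otherwise keep the explicit request."""
    if oem != OEM_DEFAULT:
        return oem  # explicit request is stored as-is
    if not lstm_available:
        return OEM_TESSERACT_ONLY
    if not base_available:
        return OEM_LSTM_ONLY
    return OEM_TESSERACT_LSTM_COMBINED


# An LSTM-only traineddata file with OEM_DEFAULT resolves to OEM_LSTM_ONLY.
print(resolve_engine_mode(OEM_DEFAULT, lstm_available=True, base_available=False))
```

Note that with OEM_DEFAULT and a traineddata file containing both component sets, this logic lands on the deprecated combined mode, which is one reason to pass OEM_LSTM_ONLY explicitly.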

Loading the LSTM Recognizer

When OEM_LSTM_ONLY is active, the initialization code in src/ccmain/tessedit.cpp (lines 66–73) instantiates the neural network recognizer:

if (tessedit_ocr_engine_mode == OEM_LSTM_ONLY ||
    tessedit_ocr_engine_mode == OEM_TESSERACT_LSTM_COMBINED) {
  if (mgr->IsComponentAvailable(TESSDATA_LSTM)) {
    lstm_recognizer_ = new LSTMRecognizer(language_data_path_prefix.c_str());
    ASSERT_HOST(lstm_recognizer_->Load(this->params(),
                                       lstm_use_matrix ? language : "",
                                       mgr));
  } else {
    tprintf("Error: LSTM requested, but not present!! Loading tesseract.\n");
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
  }
}

In pure LSTM mode (lines 81–84), the unicharset is sourced directly from the LSTM model rather than legacy components:

if (tessedit_ocr_engine_mode == OEM_LSTM_ONLY) {
  unicharset.CopyFrom(lstm_recognizer_->GetUnicharset());
}
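To make the fallback explicit, here is a small Python model of the load step shown above. It is a sketch under stated assumptions rather than the real implementation: apply_lstm_load_step is a hypothetical name, and the boolean parameter stands in for mgr->IsComponentAvailable(TESSDATA_LSTM).

```python
OEM_TESSERACT_ONLY, OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED = 0, 1, 2


def apply_lstm_load_step(mode: int, lstm_component_present: bool):
    """Return (final_mode, copies_lstm_unicharset, message) for a requested mode."""
    message = None
    wants_lstm = mode in (OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED)
    if wants_lstm and not lstm_component_present:
        # Same text as the tprintf call in the snippet above.
        message = "Error: LSTM requested, but not present!! Loading tesseract."
        mode = OEM_TESSERACT_ONLY
    # Only pure LSTM mode copies the unicharset from the LSTM recognizer.
    copies_lstm_unicharset = mode == OEM_LSTM_ONLY
    return mode, copies_lstm_unicharset, message


print(apply_lstm_load_step(OEM_LSTM_ONLY, lstm_component_present=False))
```

The model highlights that a missing LSTM component does not stop at the error message: the mode is rewritten to legacy, so a later check of the active mode will no longer report OEM_LSTM_ONLY.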

Prerequisites for LSTM Migration

Before migrating, verify that your .traineddata file contains the LSTM component. The TessdataManager::IsLSTMAvailable() function checks for the presence of the TESSDATA_LSTM component within the archive. Modern language packs from the official tessdata repository include this component by default.

If you are using custom-trained models, ensure you have run combine_tessdata to package the LSTM file into the final .traineddata archive.
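One way to automate this check is to scan the component listing printed by combine_tessdata -d for an lstm entry. The sketch below assumes each listing line has the shape index:name:size=..., offset=...; that format, the helper name has_lstm_component, and the sample listings are assumptions for illustration, so compare them with the output of your own Tesseract version first.

```python
import re


def has_lstm_component(directory_listing: str) -> bool:
    """Return True if a listing line names the 'lstm' component exactly.

    Assumes each component line looks like '<index>:<name>:size=<n>, offset=<m>'.
    """
    pattern = re.compile(r"^\s*\d+:lstm:", re.MULTILINE)
    return bool(pattern.search(directory_listing))


# Illustrative listings only; real output depends on the file and Tesseract version.
sample_with_lstm = (
    "17:lstm:size=401636, offset=192\n"
    "18:lstm-punc-dawg:size=4322, offset=401828\n"
)
sample_legacy_only = (
    "1:unicharset:size=7477, offset=192\n"
    "3:inttemp:size=976552, offset=7669\n"
)

print(has_lstm_component(sample_with_lstm))
print(has_lstm_component(sample_legacy_only))
```

The pattern requires a colon directly after lstm, so auxiliary entries such as lstm-punc-dawg do not count as the recognizer itself. To feed it real data, capture the tool's output with subprocess.run and check both stdout and stderr, since diagnostic output may go to either stream.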

Step-by-Step Migration Guide

Command-Line Migration

The simplest migration path uses the --oem flag. In src/tesseract.cpp, the command-line parser maps --oem 1 to OEM_LSTM_ONLY:

# Before: no --oem flag, so OEM_DEFAULT infers the engine from the traineddata
tesseract input.png output -l eng

# Migrate to LSTM only
tesseract input.png output -l eng --oem 1

# LSTM with specific page segmentation mode
tesseract input.png output -l eng --oem 1 --psm 6

The second argument is an output base name. Tesseract appends .txt to it, so these commands write output.txt.
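When the command is assembled by a script, it helps to build the argument list in one place so the engine mode is never omitted by accident. This is a minimal sketch: build_tesseract_command is a hypothetical helper, the file names are placeholders, and only the flags already shown above are used.

```python
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str, lang: str = "eng",
                            oem: int = 1, psm: Optional[int] = None) -> List[str]:
    """Assemble an argv list for the tesseract binary with an explicit --oem value."""
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"unsupported --oem value: {oem}")
    command = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    if psm is not None:
        command += ["--psm", str(psm)]
    return command


print(build_tesseract_command("input.png", "output", psm=6))
```

The resulting list can be passed to subprocess.run(command, check=True). Defaulting oem to 1 means callers get the LSTM engine unless they ask for something else.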

C++ API Migration

Update your TessBaseAPI::Init call to pass the OEM_LSTM_ONLY value declared in include/tesseract/publictypes.h:

#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;

  // Migrate by passing OEM_LSTM_ONLY (value 1)
  if (api.Init("/usr/local/share/tessdata", "eng",
               tesseract::OEM_LSTM_ONLY) != 0) {
    fprintf(stderr, "Failed to initialize Tesseract with LSTM engine.\n");
    return 1;
  }

  // pixRead returns nullptr if the file cannot be read
  Pix *image = pixRead("document.png");
  if (image == nullptr) {
    fprintf(stderr, "Failed to read document.png.\n");
    api.End();
    return 1;
  }

  api.SetImage(image);
  char *outText = api.GetUTF8Text();  // caller owns the returned buffer
  printf("LSTM Output: %s\n", outText ? outText : "");

  // Release resources: the text buffer with delete[], the image with pixDestroy
  delete[] outText;
  api.End();
  pixDestroy(&image);
  return 0;
}

Python Migration (pytesseract)

The Python wrapper forwards configuration strings directly to the binary. Pass --oem 1 via the config parameter:

import pytesseract
from PIL import Image

img = Image.open('invoice.png')

# Migrate to LSTM engine
text = pytesseract.image_to_string(
    img,
    lang='eng',
    config='--oem 1'
)
print(text)

Compile-Time Legacy Disabling

For production deployments that require only the LSTM engine, you can disable the legacy code entirely at compile time. This reduces binary size and prevents accidental fallback to the legacy engine.

Add the definition to your build configuration:

cmake -DDISABLED_LEGACY_ENGINE=ON ..

This flag activates preprocessor guards throughout the codebase (e.g., in src/ccmain/tessedit.cpp and src/classify/classify.cpp), removing legacy-specific code paths. When compiled with this flag, attempting to use OEM_TESSERACT_ONLY will result in an initialization error.

Troubleshooting Common Migration Issues

Error: "LSTM requested, but not present!!"

Symptom: Console output indicates fallback to legacy engine despite --oem 1.

Cause: The .traineddata file lacks the TESSDATA_LSTM component.

Fix: Download updated language data from the official tessdata repository. Verify that the archive contains the LSTM component by listing its contents with combine_tessdata -d eng.traineddata and looking for an lstm entry.

Legacy Config Overrides LSTM Mode

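Symptom: The legacy engine still runs when custom config files are loaded.

Cause: A config file sets tessedit_ocr_engine_mode to 0 (legacy). According to the selection logic shown earlier, an explicit mode is written to tessedit_ocr_engine_mode whenever oem is not OEM_DEFAULT, so a config value is most likely to take effect when the mode is left at OEM_DEFAULT.

Fix: Remove tessedit_ocr_engine_mode from your config files or set it to 1 for LSTM-only operation, and pass OEM_LSTM_ONLY (or --oem 1) explicitly.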
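A quick scan of a config file can confirm whether it sets the engine mode. The sketch below assumes the config format is one whitespace-separated name and value per line, with lines starting with # treated as comments; find_engine_mode_override is a hypothetical helper, not a Tesseract API.

```python
from typing import Optional


def find_engine_mode_override(config_text: str) -> Optional[int]:
    """Return the last tessedit_ocr_engine_mode value set in the text, if any."""
    found = None
    for line in config_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "tessedit_ocr_engine_mode" and len(parts) >= 2:
            try:
                found = int(parts[1])
            except ValueError:
                continue  # ignore values this sketch cannot parse
    return found


sample_config = "tessedit_char_whitelist 0123456789\ntessedit_ocr_engine_mode 0\n"
print(find_engine_mode_override(sample_config))
```

A return value of 0 or 2 points to a config file that pulls in the legacy engine, while None means the file leaves the engine mode alone.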

Mixed Engine Artifacts in Output

Symptom: Inconsistent recognition results when using OEM_TESSERACT_LSTM_COMBINED.

Cause: The combined mode (value 2) is deprecated and can produce unpredictable fallback behavior.

Fix: Explicitly use OEM_LSTM_ONLY (value 1) for consistent neural network processing, or OEM_TESSERACT_ONLY (value 0) for legacy-only processing.

Summary

  • Tesseract 4+ supports two distinct engines: the legacy pattern matcher (OEM_TESSERACT_ONLY) and the LSTM neural network (OEM_LSTM_ONLY).
  • Migration is configuration-driven: Pass OEM_LSTM_ONLY (value 1) to TessBaseAPI::Init in src/api/baseapi.cpp, or use --oem 1 on the command line.
  • Data requirement: Ensure your .traineddata contains the LSTM component; the engine selection logic in src/ccmain/tessedit.cpp checks IsLSTMAvailable() and falls back to legacy if the component is missing.
  • Compile-time optimization: Use -DDISABLED_LEGACY_ENGINE=ON to remove legacy code entirely and reduce binary size.
  • API consistency: The OcrEngineMode enum in include/tesseract/publictypes.h provides the canonical values used across the C++ API, command-line interface, and language bindings.

Frequently Asked Questions

What is the difference between OEM_LSTM_ONLY and OEM_TESSERACT_LSTM_COMBINED?

OEM_LSTM_ONLY (value 1) makes Tesseract use the LSTM neural network engine exclusively. If the LSTM component is absent from the language data, the initialization code shown earlier logs "Error: LSTM requested, but not present!!" and switches to OEM_TESSERACT_ONLY instead of running the neural network. OEM_TESSERACT_LSTM_COMBINED (value 2) runs the LSTM recognizer but may fall back to the legacy engine; this combined mode is marked deprecated in the source code and can produce inconsistent results. For modern deployments, use OEM_LSTM_ONLY so that all recognition goes through the neural network engine.

How do I verify that my traineddata file supports LSTM?

Run the combine_tessdata utility with the -d flag to list the components inside your .traineddata archive. If the output includes an entry for lstm, the file contains the LSTM neural network component required for OEM_LSTM_ONLY. Alternatively, attempt initialization with OEM_LSTM_ONLY; if the console logs "Error: LSTM requested, but not present!!", the LSTM component is missing from your language data.

Can I disable the legacy engine completely when building Tesseract?

Yes. Pass -DDISABLED_LEGACY_ENGINE=ON to CMake when compiling. This define activates preprocessor guards throughout the codebase, particularly in src/ccmain/tessedit.cpp and the classifier modules, and removes the legacy Tesseract engine code from the binary. When compiled this way, attempting to use OEM_TESSERACT_ONLY or OEM_TESSERACT_LSTM_COMBINED will result in an initialization error, ensuring only the LSTM neural network engine is available.

Why does my Python code still use the legacy engine after setting --oem 1?

The pytesseract wrapper passes the config string directly to the Tesseract binary, so the result depends on that binary and its language data. Make sure the binary is Tesseract 4.0 or later and that the language data file contains the LSTM component. A 3.x installation has no LSTM engine, so --oem 1 cannot select it there. Also check that the TESSDATA_PREFIX environment variable is not pointing to an old language data directory that lacks the LSTM neural network files.
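The version requirement can be checked before any OCR call. The sketch below parses the banner printed by tesseract --version, assuming its first line looks like "tesseract 4.1.1" (optionally with a leading v on the number); the helper names and sample strings are illustrative, not part of pytesseract.

```python
import re
from typing import Optional


def tesseract_major_version(version_output: str) -> Optional[int]:
    """Extract the major version from text such as 'tesseract 4.1.1'."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def supports_lstm(version_output: str) -> bool:
    """Return True when the reported major version is 4 or later."""
    major = tesseract_major_version(version_output)
    return major is not None and major >= 4


print(supports_lstm("tesseract 4.1.1\n leptonica-1.79.0"))
print(supports_lstm("tesseract 3.04.01"))
```

In practice the banner text can be captured with subprocess.run(["tesseract", "--version"], capture_output=True, text=True), reading both stdout and stderr because the stream used may differ between builds.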
