# How to Migrate from the Legacy Tesseract OCR Engine to the LSTM Neural Network Engine

> Migrate Tesseract OCR from the legacy engine to the LSTM neural network engine by changing one `Init` argument or one command-line flag, provided your language data includes the LSTM component.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: migration-guide
- Published: 2026-03-02

---

**To migrate from the legacy Tesseract OCR engine to the LSTM neural network engine, pass `OEM_LSTM_ONLY` (value `1`) to the `Init` function or use the `--oem 1` command-line flag, ensuring your `.traineddata` file contains the LSTM component.**

Tesseract 4.0 and later versions in the `tesseract-ocr/tesseract` repository ship with two distinct recognition back-ends: the legacy pattern-matching engine and the Long Short-Term Memory (LSTM) neural network engine. This guide explains the architectural differences, the specific source code paths that handle engine selection, and the practical steps to migrate your applications and workflows to the LSTM engine.

## Understanding the OcrEngineMode Enumeration

The engine selection is controlled by the `OcrEngineMode` enum defined in **[`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h)** (lines 63–70):

```cpp
enum OcrEngineMode {
  OEM_TESSERACT_ONLY,          // legacy engine only (deprecated)
  OEM_LSTM_ONLY,               // LSTM line recognizer only
  OEM_TESSERACT_LSTM_COMBINED, // LSTM with fallback to legacy (deprecated)
  OEM_DEFAULT,                 // infer from config / availability
  OEM_COUNT
};

```

**`OEM_LSTM_ONLY`** (integer value `1`) requests the LSTM neural network engine without the legacy classifier. **`OEM_DEFAULT`** (value `3`) infers the mode from the components present in the language data file: legacy only when the LSTM component is missing, LSTM only when the legacy components are missing, and the combined mode when both are present.
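The integer values follow from the declaration order of the enum. As a minimal sketch, the mapping can be mirrored in Python; the `OcrEngineMode` class and the `oem_flag` helper below are illustrative names introduced here, not part of Tesseract or any binding.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Illustrative mirror of the C++ enum (OEM_COUNT omitted)."""
    OEM_TESSERACT_ONLY = 0           # legacy engine only
    OEM_LSTM_ONLY = 1                # LSTM line recognizer only
    OEM_TESSERACT_LSTM_COMBINED = 2  # LSTM with fallback to legacy
    OEM_DEFAULT = 3                  # infer from config / availability


def oem_flag(mode: OcrEngineMode) -> str:
    """Return the command-line spelling of an engine mode."""
    return f"--oem {int(mode)}"
```

For example, `oem_flag(OcrEngineMode.OEM_LSTM_ONLY)` yields the `--oem 1` flag used throughout this guide.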

## How Engine Selection Works in the Source Code

### The API Entry Point in baseapi.cpp

When initializing Tesseract, the `TessBaseAPI::Init` function in **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp)** (line 296) accepts the `OcrEngineMode` parameter:

```cpp
int TessBaseAPI::Init(const char *datapath,
                      const char *language,
                      OcrEngineMode oem,
                      char **configs,
                      int configs_size,
                      std::vector<std::string> *vars_vec,
                      std::vector<std::string> *vars_values) {
  return Init(datapath, 0, language, oem, configs, configs_size,
              vars_vec, vars_values);
}

```

The `oem` argument is stored internally and passed to the initialization routine in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp).

### The Decision Logic in tessedit.cpp

The language-data initialization routine (`Tesseract::init_tesseract_lang_data`) in **[`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp)** (lines 100–108) determines the engine mode when `OEM_DEFAULT` is requested:

```cpp
if (oem == OEM_DEFAULT) {
  // Infer mode from what components are present in the traineddata.
  if (!mgr->IsLSTMAvailable())
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
  else if (!mgr->IsBaseAvailable())
    tessedit_ocr_engine_mode.set_value(OEM_LSTM_ONLY);
  else
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_LSTM_COMBINED);
}

```

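If you explicitly request any mode other than `OEM_DEFAULT`, such as `OEM_LSTM_ONLY`, the code at lines 49–53 writes that mode to `tessedit_ocr_engine_mode` directly, so the inferred value is not used: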

```cpp
if (oem != OEM_DEFAULT) {
  tessedit_ocr_engine_mode.set_value(oem);
}

```
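The two snippets above amount to a small decision table. The following Python sketch restates it for illustration only; `resolve_engine_mode` and its boolean parameters are hypothetical stand-ins for `IsLSTMAvailable()` and `IsBaseAvailable()`, not Tesseract functions.

```python
# Integer values follow the OcrEngineMode enum order.
OEM_TESSERACT_ONLY, OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED, OEM_DEFAULT = range(4)


def resolve_engine_mode(requested: int, lstm_available: bool,
                        base_available: bool) -> int:
    """Model of the mode selection described in this section."""
    if requested != OEM_DEFAULT:
        # An explicit request is taken as-is.
        return requested
    if not lstm_available:
        return OEM_TESSERACT_ONLY
    if not base_available:
        return OEM_LSTM_ONLY
    return OEM_TESSERACT_LSTM_COMBINED
```

Under this model, a traineddata file that carries both the legacy and LSTM components resolves `OEM_DEFAULT` to the combined mode, which is why passing `OEM_LSTM_ONLY` explicitly is the reliable way to migrate.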

### Loading the LSTM Recognizer

When `OEM_LSTM_ONLY` is active, the initialization code in **[`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp)** (lines 66–73) instantiates the neural network recognizer:

```cpp
if (tessedit_ocr_engine_mode == OEM_LSTM_ONLY ||
    tessedit_ocr_engine_mode == OEM_TESSERACT_LSTM_COMBINED) {
  if (mgr->IsComponentAvailable(TESSDATA_LSTM)) {
    lstm_recognizer_ = new LSTMRecognizer(language_data_path_prefix.c_str());
    ASSERT_HOST(lstm_recognizer_->Load(this->params(),
                                       lstm_use_matrix ? language : "",
                                       mgr));
  } else {
    tprintf("Error: LSTM requested, but not present!! Loading tesseract.\n");
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
  }
}

```
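Note the `else` branch: when the LSTM component is missing, the snippet logs an error and switches to the legacy engine instead of aborting. A minimal Python model of that branch, using the hypothetical helper name `mode_after_loading`, makes the consequence explicit.

```python
OEM_TESSERACT_ONLY, OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED = 0, 1, 2

FALLBACK_MESSAGE = "Error: LSTM requested, but not present!! Loading tesseract."


def mode_after_loading(mode: int, has_lstm_component: bool):
    """Return (effective_mode, logged_message) per the snippet above."""
    wants_lstm = mode in (OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED)
    if wants_lstm and not has_lstm_component:
        # Mirrors the else branch: log, then downgrade to legacy.
        return OEM_TESSERACT_ONLY, FALLBACK_MESSAGE
    return mode, None
```

A caller that must never run the legacy engine therefore needs to watch for this message or check the language data beforehand.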

In pure LSTM mode (lines 81–84), the unicharset is sourced directly from the LSTM model rather than legacy components:

```cpp
if (tessedit_ocr_engine_mode == OEM_LSTM_ONLY) {
  unicharset.CopyFrom(lstm_recognizer_->GetUnicharset());
}

```

## Prerequisites for LSTM Migration

Before migrating, verify that your **`.traineddata`** file contains the LSTM component. The `TessdataManager::IsLSTMAvailable()` function checks for the presence of the `TESSDATA_LSTM` component within the archive. Modern language packs from the official `tessdata` repository include this component by default.

If you are using custom-trained models, ensure you have run `combine_tessdata` to package the LSTM file into the final `.traineddata` archive.
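One way to automate this check is to parse a component listing of the archive. The sketch below assumes the listing comes from `combine_tessdata -d <file>` and that each entry looks like `17:lstm:size=401636, offset=192`; that line format is an assumption, so adjust the pattern if your build prints something different. The function names are hypothetical.

```python
import re

# Assumed entry format: "<index>:<name>:size=<bytes>, offset=<bytes>"
_ENTRY = re.compile(r"^\s*\d+:([A-Za-z0-9_-]+):size=\d+")


def listed_components(listing: str) -> set:
    """Collect component names from a combine_tessdata directory listing."""
    names = set()
    for line in listing.splitlines():
        match = _ENTRY.match(line)
        if match:
            names.add(match.group(1))
    return names


def has_lstm_component(listing: str) -> bool:
    """True only if a component named exactly 'lstm' is listed."""
    return "lstm" in listed_components(listing)
```

Matching the exact name matters because auxiliary entries such as `lstm-punc-dawg` share the prefix without being the network itself.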

## Step-by-Step Migration Guide

### Command-Line Migration

The simplest migration path uses the `--oem` flag. In [`src/tesseract.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/tesseract.cpp), the command-line parser maps `--oem 1` to `OEM_LSTM_ONLY`:

```bash
# Default behavior (OEM_DEFAULT): engine inferred from the traineddata contents.
# "output" is a base name; Tesseract writes output.txt.
tesseract input.png output -l eng

# Migrate to LSTM only
tesseract input.png output -l eng --oem 1

# LSTM with a specific page segmentation mode
tesseract input.png output -l eng --oem 1 --psm 6
```
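When the command line is driven from a script, building the argument list in one place keeps the engine mode from being forgotten. This is a small sketch with hypothetical helper names; it assumes a `tesseract` binary on `PATH` and treats the second positional argument as an output base name, to which Tesseract appends the extension.

```python
import subprocess


def tesseract_command(image, output_base, lang="eng", oem=1, psm=None):
    """Build the argument list for a tesseract CLI call (LSTM by default)."""
    cmd = ["tesseract", str(image), str(output_base),
           "-l", lang, "--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_lstm_ocr(image, output_base, **kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(tesseract_command(image, output_base, **kwargs),
                          check=True)
```

Calling `run_lstm_ocr("input.png", "output", psm=6)` corresponds to the last shell example above.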

### C++ API Migration

Update your `TessBaseAPI::Init` call to include the `OEM_LSTM_ONLY` parameter from **[`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h)**:

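```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;

  // Migrate by passing OEM_LSTM_ONLY (value 1)
  if (api.Init("/usr/local/share/tessdata", "eng",
               tesseract::OEM_LSTM_ONLY) != 0) {
    fprintf(stderr, "Failed to initialize Tesseract with LSTM engine.\n");
    return 1;
  }

  Pix *image = pixRead("document.png");
  if (image == nullptr) {  // pixRead returns NULL if the file cannot be read
    fprintf(stderr, "Failed to read document.png.\n");
    api.End();
    return 1;
  }
  api.SetImage(image);
  char *outText = api.GetUTF8Text();  // caller owns the returned buffer
  printf("LSTM Output: %s\n", outText ? outText : "");

  delete[] outText;  // release the text before tearing down the API
  api.End();
  pixDestroy(&image);
  return 0;
}
```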

### Python Migration (pytesseract)

The Python wrapper forwards configuration strings directly to the binary. Pass `--oem 1` via the `config` parameter:

```python
import pytesseract
from PIL import Image

img = Image.open('invoice.png')

# Migrate to LSTM engine
text = pytesseract.image_to_string(
    img,
    lang='eng',
    config='--oem 1'
)
print(text)

```
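If several call sites need the same settings, the config string can be assembled by a small helper. The sketch below is one possible shape; `lstm_config` and `ocr_with_lstm` are hypothetical names, and `pytesseract` and Pillow are imported lazily so the string builder can be used and tested without them installed.

```python
def lstm_config(psm=None, extra=""):
    """Build a pytesseract config string that always requests --oem 1."""
    parts = ["--oem 1"]
    if psm is not None:
        parts.append(f"--psm {psm}")
    if extra:
        parts.append(extra)
    return " ".join(parts)


def ocr_with_lstm(image_path, lang="eng", psm=None):
    """Run OCR on one image with the LSTM engine requested."""
    import pytesseract          # imported here so lstm_config has no dependencies
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang,
                                           config=lstm_config(psm))
```

Centralizing the string this way means a later change, such as adding a page segmentation mode, happens in one place.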

## Compile-Time Legacy Disabling

For production deployments that require only the LSTM engine, you can disable the legacy code entirely at compile time. This reduces binary size and prevents accidental fallback to the legacy engine.

Add the definition to your build configuration:

```bash
cmake -DDISABLED_LEGACY_ENGINE=ON ..

```

This flag activates preprocessor guards throughout the codebase (e.g., in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp) and [`src/classify/classify.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/classify/classify.cpp)), removing legacy-specific code paths. When compiled with this flag, attempting to use `OEM_TESSERACT_ONLY` will result in an initialization error.

## Troubleshooting Common Migration Issues

### Error: "LSTM requested, but not present!!"

**Symptom:** Console output indicates fallback to legacy engine despite `--oem 1`.

**Cause:** The `.traineddata` file lacks the `TESSDATA_LSTM` component.

**Fix:** Download updated language data from the official `tessdata` repository. Verify that the archive contains the LSTM component by listing its contents with `combine_tessdata -d eng.traineddata` and looking for an `lstm` entry.

### Legacy Config Overrides LSTM Mode

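**Symptom:** The legacy engine runs when custom config files are loaded and the engine mode is left at `OEM_DEFAULT`.

**Cause:** A config file contains `tessedit_ocr_engine_mode` set to `0` (legacy). An explicit mode such as `--oem 1` is written to `tessedit_ocr_engine_mode` directly, so the config value matters only when no explicit mode is passed.

**Fix:** Remove `tessedit_ocr_engine_mode` from your config files, or ensure it is set to `1` for LSTM-only operation. Passing `OEM_LSTM_ONLY` or `--oem 1` explicitly also avoids relying on the config value.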
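To audit config files for this setting, a short scan is usually enough. The sketch below assumes the usual Tesseract config layout of one whitespace-separated `name value` pair per line; `find_engine_mode_overrides` is a hypothetical helper name.

```python
def find_engine_mode_overrides(config_text: str):
    """Return (line_number, value) for each tessedit_ocr_engine_mode entry."""
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        fields = raw.split()
        # Lines that are blank, commented out, or set other variables are skipped.
        if len(fields) >= 2 and fields[0] == "tessedit_ocr_engine_mode":
            hits.append((lineno, fields[1]))
    return hits
```

Any hit whose value is not `1` is a candidate for removal when the goal is LSTM-only operation.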

### Mixed Engine Artifacts in Output

**Symptom:** Inconsistent recognition results when using `OEM_TESSERACT_LSTM_COMBINED`.

**Cause:** The combined mode (value `2`) is deprecated and can produce unpredictable fallback behavior.

**Fix:** Explicitly use `OEM_LSTM_ONLY` (value `1`) for consistent neural network processing, or `OEM_TESSERACT_ONLY` (value `0`) for legacy-only processing.

## Summary

- **Tesseract 4+** supports two distinct engines: the legacy pattern matcher (`OEM_TESSERACT_ONLY`) and the LSTM neural network (`OEM_LSTM_ONLY`).
- **Migration is configuration-driven**: Pass `OEM_LSTM_ONLY` (value `1`) to `TessBaseAPI::Init` in **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp)**, or use `--oem 1` on the command line.
- **Data requirement**: Ensure your `.traineddata` contains the LSTM component; the engine selection logic in **[`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp)** checks `IsLSTMAvailable()` and falls back to legacy if the component is missing.
- **Compile-time optimization**: Pass `-DDISABLED_LEGACY_ENGINE=ON` to CMake to remove legacy code entirely and reduce binary size.
- **API consistency**: The `OcrEngineMode` enum in **[`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h)** provides the canonical values used across the C++ API, command-line interface, and language bindings.

## Frequently Asked Questions

### What is the difference between OEM_LSTM_ONLY and OEM_TESSERACT_LSTM_COMBINED?

`OEM_LSTM_ONLY` (value `1`) requests the LSTM neural network engine alone. If the LSTM component is absent from the language data, the loading code shown earlier logs "Error: LSTM requested, but not present!!" and switches to the legacy engine instead of failing. `OEM_TESSERACT_LSTM_COMBINED` (value `2`) runs the LSTM recognizer but allows fallback to the legacy engine; this combined mode is marked deprecated in the source code and can produce inconsistent results. For modern deployments, use `OEM_LSTM_ONLY` together with language data that contains the LSTM component so that only the neural network engine runs.

### How do I verify that my traineddata file supports LSTM?

Run the `combine_tessdata` utility with the `-d` flag to list the components inside your `.traineddata` archive. If the output includes an entry for `lstm`, the file contains the LSTM neural network component required for `OEM_LSTM_ONLY`. Alternatively, attempt initialization with `OEM_LSTM_ONLY`; if the console logs "Error: LSTM requested, but not present!!", the LSTM component is missing from your language data and Tesseract has switched to the legacy engine.

### Can I disable the legacy engine completely when building Tesseract?

Yes. Pass `-DDISABLED_LEGACY_ENGINE=ON` to CMake when compiling. This define activates preprocessor guards throughout the codebase, particularly in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp) and the classifier modules, removing the legacy Tesseract engine code entirely from the binary. When compiled this way, attempting to use `OEM_TESSERACT_ONLY` or `OEM_TESSERACT_LSTM_COMBINED` will result in an initialization error, ensuring only the LSTM neural network engine is available.

### Why does my Python code still use the legacy engine after setting --oem 1?

The `pytesseract` wrapper passes the `config` string directly to the Tesseract binary, so the result depends on that binary and its language data. Check that the installed binary is Tesseract 4.0 or later; a 3.x installation has no LSTM engine, so `--oem 1` cannot select it. Check also that the language data file contains the LSTM component, and that the `TESSDATA_PREFIX` environment variable does not point to an old language data directory that lacks the LSTM files.
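The version check can be scripted by parsing the first line of `tesseract --version`. The sketch below assumes that output starts with something like `tesseract 5.3.0` (optionally with a `v` before the number); both function names are hypothetical.

```python
import re


def tesseract_major_version(version_output: str):
    """Extract the major version from `tesseract --version` text, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def supports_lstm(version_output: str) -> bool:
    """The LSTM engine ships with Tesseract 4.0 and later."""
    major = tesseract_major_version(version_output)
    return major is not None and major >= 4
```

The text to parse can be captured with `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`; checking both `stdout` and `stderr` is a safe choice, since which stream carries the banner may vary between builds.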