# Difference Between OEM_TESSERACT_ONLY and OEM_LSTM_ONLY in Tesseract

> Understand the Tesseract OCR difference between OEM_TESSERACT_ONLY for speed and OEM_LSTM_ONLY for accuracy. Choose the best engine for your OCR needs.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: deep-dive
- Published: 2026-03-02

---

**`OEM_TESSERACT_ONLY` runs only the legacy (pre-LSTM) OCR engine, which is the fastest mode, while `OEM_LSTM_ONLY` runs only the LSTM neural-network line recognizer, which is generally more accurate on modern text.**

Tesseract, the open-source OCR engine maintained in the `tesseract-ocr/tesseract` repository, supports multiple recognition pipelines selectable through the **OcrEngineMode** (OEM) enum. Understanding the distinction between the legacy and LSTM modes is critical for optimizing accuracy and performance in production OCR workflows.

## What Are Tesseract OCR Engine Modes?

The OCR engine mode determines which recognition pipeline Tesseract uses to process images. These modes are defined in [`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h) as an enum that allows developers to explicitly choose between the legacy engine, the LSTM neural network, or combined approaches.

When initializing Tesseract via `Init()`, you pass an OEM value that locks the API into a specific recognition strategy for that session. This decision impacts everything from model loading requirements to recognition speed and accuracy characteristics.
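To make the enum concrete, the sketch below mirrors the `OcrEngineMode` values from `publictypes.h` in Python and turns a mode into the `--oem N` option accepted by the `tesseract` command line. The Python class and the `oem_flag` helper are illustrative names only; they are not part of Tesseract or `pytesseract`.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Illustrative Python mirror of tesseract::OcrEngineMode."""
    OEM_TESSERACT_ONLY = 0           # legacy engine only (deprecated)
    OEM_LSTM_ONLY = 1                # LSTM line recognizer only
    OEM_TESSERACT_LSTM_COMBINED = 2  # LSTM with legacy fallback (deprecated)
    OEM_DEFAULT = 3                  # infer the mode from config and data


def oem_flag(mode: OcrEngineMode) -> str:
    """Build the `--oem N` command-line option for the given mode."""
    return f"--oem {int(mode)}"


print(oem_flag(OcrEngineMode.OEM_TESSERACT_ONLY))  # --oem 0
print(oem_flag(OcrEngineMode.OEM_LSTM_ONLY))       # --oem 1
```

Keeping the numeric values behind named constants avoids off-by-one mistakes when the mode is passed as a bare integer.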

## OEM_TESSERACT_ONLY vs OEM_LSTM_ONLY: Key Differences

### Legacy Engine (OEM_TESSERACT_ONLY)

**`OEM_TESSERACT_ONLY`** activates the original Tesseract engine, which segments text into characters and classifies them with feature-based static and adaptive classifiers. This mode does not use the LSTM neural network at all.

- **Performance**: Fastest processing speed with minimal memory overhead
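- **Dependencies**: Requires a `.traineddata` file that contains the legacy components, such as the files in the `tessdata` repository; the LSTM components are not needed. The `tessdata_fast` and `tessdata_best` files contain only LSTM models and do not work in this mode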
- **Status**: Marked as deprecated in current versions; the legacy engine is being phased out
- **Best for**: Environments where LSTM models are unavailable or when processing simple, high-contrast documents requiring maximum throughput

### LSTM Neural Network (OEM_LSTM_ONLY)

**`OEM_LSTM_ONLY`** engages the Long Short-Term Memory neural network recognizer. This modern pipeline uses deep learning to recognize text lines as sequences rather than individual characters.

- **Accuracy**: Superior recognition rates for modern fonts, degraded documents, and cursive text
- **Requirements**: Needs a `.traineddata` file that contains the LSTM components; LSTM recognition cannot run if they are missing
- **Processing**: Computationally intensive with higher memory usage than legacy mode
- **Best for**: Production applications requiring maximum accuracy on complex or low-quality inputs
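The speed and accuracy trade-off above depends heavily on the input, so it is worth measuring on your own documents. The following is a minimal timing sketch, assuming a caller-supplied `run_ocr` callable that accepts an OEM integer and returns the recognized text; `time_engine_modes` and `run_ocr` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict


def time_engine_modes(run_ocr: Callable[[int], str],
                      modes: Dict[str, int],
                      repeats: int = 3) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each engine mode."""
    results: Dict[str, float] = {}
    for name, oem in modes.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_ocr(oem)  # output is discarded; only the duration matters
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Possible usage with pytesseract (assumes it is installed and that the
# traineddata in use contains both legacy and LSTM components):
#   run = lambda oem: pytesseract.image_to_string(image, config=f"--oem {oem}")
#   print(time_engine_modes(run, {"legacy": 0, "lstm": 1}))
```

Taking the best of several runs reduces noise from disk caching and process start-up; comparing accuracy still requires checking the output against known ground truth.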

## How Tesseract Selects the Engine Internally

When you specify `OEM_DEFAULT`, Tesseract resolves the actual engine mode based on compiled-in capabilities and available training data. This logic resides in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp):

```cpp
if (!mgr->IsLSTMAvailable()) {
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
} else if (!mgr->IsBaseAvailable()) {
    tessedit_ocr_engine_mode.set_value(OEM_LSTM_ONLY);
} else {
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_LSTM_COMBINED);
}
```

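This branch shows how `OEM_DEFAULT` is resolved from the components present in the traineddata: legacy-only data selects `OEM_TESSERACT_ONLY`, LSTM-only data selects `OEM_LSTM_ONLY`, and data containing both selects the combined mode. Requesting a single-engine mode explicitly skips this inference: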

- **`OEM_TESSERACT_ONLY`** forces the *base* (legacy) engine exclusively
- **`OEM_LSTM_ONLY`** forces the *LSTM* engine exclusively

The combined mode (`OEM_TESSERACT_LSTM_COMBINED`) historically attempted LSTM first with legacy fallback, but this approach is also deprecated in favor of explicit single-engine selection.
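The resolution branch can be restated as a small Python function to make its truth table explicit. This is a sketch that only mirrors the C++ snippet quoted in this section; `resolve_default_oem` and its boolean parameters are illustrative stand-ins for `IsLSTMAvailable()` and `IsBaseAvailable()`.

```python
OEM_TESSERACT_ONLY = 0
OEM_LSTM_ONLY = 1
OEM_TESSERACT_LSTM_COMBINED = 2


def resolve_default_oem(lstm_available: bool, base_available: bool) -> int:
    """Mirror the OEM_DEFAULT branch: pick a mode from available components."""
    if not lstm_available:
        return OEM_TESSERACT_ONLY
    if not base_available:
        return OEM_LSTM_ONLY
    return OEM_TESSERACT_LSTM_COMBINED


# Print the resolved mode for every combination of available components.
for lstm in (False, True):
    for base in (False, True):
        print(f"lstm={lstm}, base={base} -> {resolve_default_oem(lstm, base)}")
```

Note that the first branch also covers the case where neither component is present, so the function returns `OEM_TESSERACT_ONLY` there as well; whether the engine can then actually load is a separate question.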

## Practical Implementation Examples

When integrating Tesseract via the C++ API, explicitly set the engine mode during initialization to control which recognition pipeline activates:

```cpp
#include <cstdio>

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    // Initialize with the legacy Tesseract engine only.
    // Requires traineddata that contains the legacy components.
    tesseract::TessBaseAPI api_legacy;
    if (api_legacy.Init(nullptr, "eng", tesseract::OEM_TESSERACT_ONLY) != 0) {
        fprintf(stderr, "Failed to initialize legacy engine\n");
        return 1;
    }

    // Process image with legacy engine...
    // api_legacy.SetImage(image);
    // char* text = api_legacy.GetUTF8Text();  // free with delete[]

    // Initialize with the LSTM neural network only.
    // Requires traineddata that contains the LSTM components.
    tesseract::TessBaseAPI api_lstm;
    if (api_lstm.Init(nullptr, "eng", tesseract::OEM_LSTM_ONLY) != 0) {
        fprintf(stderr, "Failed to initialize LSTM engine\n");
        return 1;
    }

    // api_lstm.SetImage(image);
    // char* text_lstm = api_lstm.GetUTF8Text();  // free with delete[]

    // Release engine resources.
    api_legacy.End();
    api_lstm.End();
    return 0;
}
```

From Python using `pytesseract`, pass the OEM value as an integer in the `config` string:

```python
import pytesseract
from PIL import Image

# Values from the OcrEngineMode enum:
# OEM_TESSERACT_ONLY = 0
# OEM_LSTM_ONLY = 1

image = Image.open('document.png')

# Legacy engine (needs traineddata with legacy components)
text_legacy = pytesseract.image_to_string(
    image,
    config='--oem 0'
)

# LSTM only
text_lstm = pytesseract.image_to_string(
    image,
    config='--oem 1'
)
```

## Summary

- **`OEM_TESSERACT_ONLY`** activates the legacy OCR engine. The mode is defined in [`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h); it is the fastest option, but it is deprecated and less accurate on complex documents.

- **`OEM_LSTM_ONLY`** engages the neural network LSTM recognizer, requiring compatible traineddata files and providing superior accuracy for modern fonts and degraded text at the cost of higher computational overhead.

- The engine selection logic in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp) resolves `OEM_DEFAULT` from the components available in the traineddata, whereas an explicitly requested single-engine mode needs that engine's components to be present.

- Modern production deployments should prefer `OEM_LSTM_ONLY` for accuracy-critical applications, while `OEM_TESSERACT_ONLY` remains available for legacy compatibility or resource-constrained environments.

## Frequently Asked Questions

### What happens if I use OEM_LSTM_ONLY but only have legacy traineddata files?

LSTM recognition cannot run. When you specify `OEM_LSTM_ONLY`, Tesseract tries to load the LSTM model component from the traineddata file. If that component is missing because the file contains only legacy data, the outcome depends on the Tesseract version and build: `Init()` may return a non-zero error code, or Tesseract may log an error and fall back to the legacy engine. Check both the return value and the log output. To fix the problem, download LSTM-capable traineddata from the `tesseract-ocr` organization's `tessdata`, `tessdata_fast`, or `tessdata_best` repositories, or use `OEM_TESSERACT_ONLY` instead.
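If the requested engine cannot be loaded, one defensive option is to retry with another mode. The sketch below tries each OEM value in order and returns the first result; `ocr_with_fallback` and the caller-supplied `run_ocr` callable are hypothetical names, and the code assumes that a failed run raises an exception (with `pytesseract` this would typically be `pytesseract.TesseractError`).

```python
from typing import Callable, List, Sequence, Tuple


def ocr_with_fallback(run_ocr: Callable[[int], str],
                      modes: Sequence[int] = (1, 0)) -> Tuple[int, str]:
    """Try each OEM value in order; return (mode, text) of the first success."""
    errors: List[str] = []
    for oem in modes:
        try:
            return oem, run_ocr(oem)
        except Exception as exc:  # e.g. pytesseract.TesseractError
            errors.append(f"oem {oem}: {exc}")
    raise RuntimeError("all engine modes failed: " + "; ".join(errors))


# Possible usage with pytesseract (assumes it is installed):
#   run = lambda oem: pytesseract.image_to_string(image, config=f"--oem {oem}")
#   mode_used, text = ocr_with_fallback(run)
```

Returning the mode that succeeded lets the caller log when the fallback was used, which matters because the two engines can produce noticeably different output.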

### Is OEM_TESSERACT_ONLY still supported in Tesseract 5.x?

While `OEM_TESSERACT_ONLY` remains functional in Tesseract 5.x, it is officially marked as deprecated in the source code. The Tesseract development team is phasing out the legacy engine in favor of the LSTM neural network architecture. You can still compile and run the legacy engine by setting the appropriate OEM value, but future releases may remove this capability entirely, and bug fixes for the legacy engine are no longer prioritized.

### Which OEM mode should I choose for handwritten text recognition?

For handwritten text, you should use `OEM_LSTM_ONLY`. The legacy Tesseract engine was designed primarily for printed text with clear character segmentation, and it performs poorly on connected handwriting or cursive scripts. The LSTM neural network processes entire text lines as sequences rather than individual characters, making it significantly more robust for handwritten content. Ensure you use traineddata files specifically fine-tuned for handwriting when available, as standard models optimized for printed text may still struggle with heavy cursive writing.