Difference Between OEM_TESSERACT_ONLY and OEM_LSTM_ONLY in Tesseract

OEM_TESSERACT_ONLY runs the legacy character-classifier OCR engine, which is typically the faster option, while OEM_LSTM_ONLY runs the LSTM neural network recognizer, which is generally more accurate on modern text.

Tesseract, the open-source OCR engine maintained in the tesseract-ocr/tesseract repository, supports multiple recognition pipelines selectable through the OcrEngineMode (OEM) enum. Understanding the distinction between the legacy and LSTM modes is critical for optimizing accuracy and performance in production OCR workflows.

What Are Tesseract OCR Engine Modes?

The OCR engine mode determines which recognition pipeline Tesseract uses to process images. These modes are defined in include/tesseract/publictypes.h as an enum that allows developers to explicitly choose between the legacy engine, the LSTM neural network, or combined approaches.

When initializing Tesseract via Init(), you pass an OEM value that locks the API into a specific recognition strategy for that session. This decision impacts everything from model loading requirements to recognition speed and accuracy characteristics.
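
As a quick reference, the enum values can be mirrored in Python. This is only a sketch: the OcrEngineMode class and the oem_config helper below are illustrative names, not part of pytesseract or the Tesseract API. The integer values follow the order of the enum in publictypes.h and are the same numbers the tesseract command line accepts for --oem.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Illustrative mirror of tesseract::OcrEngineMode (not a pytesseract type)."""
    OEM_TESSERACT_ONLY = 0           # legacy engine only
    OEM_LSTM_ONLY = 1                # LSTM line recognizer only
    OEM_TESSERACT_LSTM_COMBINED = 2  # LSTM with legacy fallback
    OEM_DEFAULT = 3                  # let Tesseract choose from what is available


def oem_config(mode: OcrEngineMode) -> str:
    """Build the `--oem N` fragment used in a Tesseract config string."""
    return f"--oem {int(mode)}"


print(oem_config(OcrEngineMode.OEM_LSTM_ONLY))  # prints: --oem 1
```

Using named constants instead of bare integers makes it harder to mix up the legacy and LSTM values when building config strings.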

OEM_TESSERACT_ONLY vs OEM_LSTM_ONLY: Key Differences

Legacy Engine (OEM_TESSERACT_ONLY)

OEM_TESSERACT_ONLY activates the traditional Tesseract engine, which segments text into characters and recognizes them with a pattern classifier that adapts to the document as it goes. This mode operates entirely without LSTM neural networks.

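  • Performance: Typically the fastest mode, with lower memory overhead than the LSTM recognizer
  • Dependencies: Requires a .traineddata file that contains the legacy engine components; LSTM components are not needed. The tessdata_fast and tessdata_best model sets ship LSTM models only, so legacy mode needs files from the main tessdata repository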
  • Status: Marked as deprecated in current versions; the legacy engine is being phased out
  • Best for: Environments where LSTM models are unavailable or when processing simple, high-contrast documents requiring maximum throughput

LSTM Neural Network (OEM_LSTM_ONLY)

OEM_LSTM_ONLY engages the Long Short-Term Memory neural network recognizer. This modern pipeline uses deep learning to recognize text lines as sequences rather than individual characters.

  • Accuracy: Superior recognition rates for modern fonts, degraded documents, and cursive text
  • Requirements: Needs .traineddata files that include LSTM components; initialization fails if these are missing
  • Processing: Computationally intensive with higher memory usage than legacy mode
  • Best for: Production applications requiring maximum accuracy on complex or low-quality inputs
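
Speed and accuracy differences depend on the model files and the input, so it is worth measuring both modes on your own documents. Below is a minimal timing sketch. It assumes you supply your own run_ocr(image, oem) callable (for example a thin wrapper around pytesseract); time_engine and fake_ocr are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Tuple


def time_engine(run_ocr: Callable[[Any, int], str], image: Any, oem: int,
                repeats: int = 3) -> Tuple[str, float]:
    """Return the last OCR result and the best wall-clock time in seconds."""
    best = float("inf")
    text = ""
    for _ in range(repeats):
        start = time.perf_counter()
        text = run_ocr(image, oem)
        best = min(best, time.perf_counter() - start)
    return text, best


# Stand-in OCR function so the sketch runs without Tesseract installed.
def fake_ocr(image: Any, oem: int) -> str:
    return f"text from oem {oem}"


for oem in (0, 1):  # 0 = OEM_TESSERACT_ONLY, 1 = OEM_LSTM_ONLY
    text, seconds = time_engine(fake_ocr, "document.png", oem)
    print(oem, repr(text), f"{seconds:.6f}s")
```

To measure real behavior, replace fake_ocr with a wrapper such as lambda img, oem: pytesseract.image_to_string(img, config=f"--oem {oem}") and compare the returned text against a known transcription.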

How Tesseract Selects the Engine Internally

When you specify OEM_DEFAULT, Tesseract resolves the actual engine mode based on compiled-in capabilities and available training data. This logic resides in src/ccmain/tessedit.cpp:

if (!mgr->IsLSTMAvailable()) {
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
} else if (!mgr->IsBaseAvailable()) {
    tessedit_ocr_engine_mode.set_value(OEM_LSTM_ONLY);
} else {
    tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_LSTM_COMBINED);
}

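This branch only runs for OEM_DEFAULT: Tesseract picks whichever engine the loaded traineddata supports and selects the combined mode when both are present. Passing one of the single-engine modes explicitly bypasses this detection: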

  • OEM_TESSERACT_ONLY forces the base (legacy) engine exclusively
  • OEM_LSTM_ONLY forces the LSTM engine exclusively

The combined mode (OEM_TESSERACT_LSTM_COMBINED) historically attempted LSTM first with legacy fallback, but this approach is also deprecated in favor of explicit single-engine selection.
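
The OEM_DEFAULT resolution can be modelled in a few lines of Python. This is a simplified illustration, not Tesseract code: resolve_default_oem and its two boolean parameters are hypothetical stand-ins for the IsLSTMAvailable() and IsBaseAvailable() checks in the excerpt.

```python
def resolve_default_oem(lstm_available: bool, base_available: bool) -> str:
    """Mirror the branch order of the tessedit.cpp excerpt for OEM_DEFAULT."""
    if not lstm_available:
        return "OEM_TESSERACT_ONLY"
    if not base_available:
        return "OEM_LSTM_ONLY"
    return "OEM_TESSERACT_LSTM_COMBINED"


# Walk through the three typical traineddata situations.
for lstm, base in [(False, True), (True, False), (True, True)]:
    print(f"lstm={lstm} base={base} -> {resolve_default_oem(lstm, base)}")
```

Note that the first check wins: a traineddata file without LSTM components always resolves to the legacy engine, regardless of the second flag.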

Practical Implementation Examples

When integrating Tesseract via the C++ API, explicitly set the engine mode during initialization to control which recognition pipeline activates:

#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    // Initialize with legacy Tesseract engine only.
    // Requires traineddata that contains the legacy components.
    tesseract::TessBaseAPI api_legacy;
    if (api_legacy.Init(nullptr, "eng", tesseract::OEM_TESSERACT_ONLY) != 0) {
        fprintf(stderr, "Failed to initialize legacy engine\n");
        return 1;
    }

    // Process image with legacy engine...
    // api_legacy.SetImage(image);
    // char* text = api_legacy.GetUTF8Text();
    // delete[] text;  // the caller owns the returned buffer

    // Initialize with LSTM neural network only.
    // Requires traineddata that contains the LSTM components.
    tesseract::TessBaseAPI api_lstm;
    if (api_lstm.Init(nullptr, "eng", tesseract::OEM_LSTM_ONLY) != 0) {
        fprintf(stderr, "Failed to initialize LSTM engine\n");
        return 1;
    }

    // api_lstm.SetImage(image);
    // char* text_lstm = api_lstm.GetUTF8Text();
    // delete[] text_lstm;

    // Release engine resources before exiting.
    api_legacy.End();
    api_lstm.End();
    return 0;
}

From Python using pytesseract, pass the OEM value as an integer configuration:

import pytesseract
from PIL import Image

# OEM_TESSERACT_ONLY = 0
# OEM_LSTM_ONLY = 1

image = Image.open('document.png')

# Legacy engine (needs traineddata that contains the legacy components)
text_legacy = pytesseract.image_to_string(
    image,
    config='--oem 0'
)

# LSTM only
text_lstm = pytesseract.image_to_string(
    image,
    config='--oem 1'
)

Summary

  • OEM_TESSERACT_ONLY, one of the OcrEngineMode values defined in include/tesseract/publictypes.h, activates the legacy character-classifier engine. It is typically the faster option, but it is deprecated and less accurate on complex documents.

  • OEM_LSTM_ONLY engages the neural network LSTM recognizer, requiring compatible traineddata files and providing superior accuracy for modern fonts and degraded text at the cost of higher computational overhead.

  • The OEM_DEFAULT logic in src/ccmain/tessedit.cpp picks an engine from whatever components the traineddata provides. Requesting a single engine explicitly bypasses that detection, and initialization fails if the requested engine's components are unavailable.

  • Modern production deployments should prefer OEM_LSTM_ONLY for accuracy-critical applications, while OEM_TESSERACT_ONLY remains available for legacy compatibility or resource-constrained environments.

Frequently Asked Questions

What happens if I use OEM_LSTM_ONLY but only have legacy traineddata files?

The initialization will fail. When you specify OEM_LSTM_ONLY, Tesseract attempts to load LSTM-specific model components from the traineddata file. If these neural network components are missing—meaning you only have legacy-format traineddata—the Init() call returns a non-zero error code and OCR cannot proceed. You must download LSTM-compatible traineddata files from the Tesseract GitHub repository or use OEM_TESSERACT_ONLY instead.
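
One way to handle this in application code is to try the LSTM engine first and fall back to the legacy engine when it cannot run. The sketch below makes some assumptions: ocr_with_fallback is a hypothetical helper, and run_ocr is a callable you provide that raises an exception when Tesseract fails (pytesseract surfaces such failures as pytesseract.TesseractError).

```python
from typing import Any, Callable, Sequence, Tuple


def ocr_with_fallback(run_ocr: Callable[[Any, int], str], image: Any,
                      oem_order: Sequence[int] = (1, 0)) -> Tuple[int, str]:
    """Try each OEM value in order; return (oem_used, text) for the first success."""
    last_error = None
    for oem in oem_order:
        try:
            return oem, run_ocr(image, oem)
        except Exception as exc:  # narrow this to your OCR wrapper's error type
            last_error = exc
    raise RuntimeError(f"no engine in {tuple(oem_order)} could run") from last_error


# Stand-in that behaves like a legacy-only traineddata file: LSTM (1) fails.
def legacy_only_ocr(image: Any, oem: int) -> str:
    if oem == 1:
        raise RuntimeError("LSTM components missing")
    return "recognized text"


print(ocr_with_fallback(legacy_only_ocr, "document.png"))  # (0, 'recognized text')
```

Returning the OEM value that actually ran lets the caller log when the fallback was used, which is usually a sign that the installed traineddata should be replaced.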

Is OEM_TESSERACT_ONLY still supported in Tesseract 5.x?

While OEM_TESSERACT_ONLY remains functional in Tesseract 5.x, it is officially marked as deprecated in the source code. The Tesseract development team is phasing out the legacy engine in favor of the LSTM neural network architecture. You can still compile and run the legacy engine by setting the appropriate OEM value, but future releases may remove this capability entirely, and bug fixes for the legacy engine are no longer prioritized.

Which OEM mode should I choose for handwritten text recognition?

For handwritten text, you should use OEM_LSTM_ONLY. The legacy Tesseract engine was designed primarily for printed text with clear character segmentation, and it performs poorly on connected handwriting or cursive scripts. The LSTM neural network processes entire text lines as sequences rather than individual characters, making it significantly more robust for handwritten content. Ensure you use traineddata files specifically fine-tuned for handwriting when available, as standard models optimized for printed text may still struggle with heavy cursive writing.
