How to Perform Multilingual OCR with Tesseract Using Language Stacking

March 2, 2026 tesseract-ocr/tesseract

Tesseract performs multilingual OCR in a single pass by stacking language models using plus-separated syntax like eng+hin+deu, where the first language acts as the primary engine and subsequent languages are loaded as sub-languages into a dedicated vector.

The tesseract-ocr/tesseract repository enables recognition of mixed-language documents without multiple execution passes. By leveraging language stacking, you combine several .traineddata files into one OCR session, allowing the engine to query dictionaries and classifiers across all specified languages simultaneously.

How Language Stacking Works in Tesseract

The stacking mechanism is implemented in the core C++ engine and exposed through the C and C++ APIs. When you provide a language string such as "eng+hin+deu", Tesseract parses the first token as the primary language and treats every subsequent token as a sub-language.
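
The split into a primary language and sub-languages can be illustrated with a small standalone sketch. It does not call Tesseract and leaves out details of the real parser; LanguageStack and SplitLanguageStack are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; it is not a Tesseract type.
struct LanguageStack {
  std::string primary;                 // first token, e.g. "eng"
  std::vector<std::string> sub_langs;  // remaining tokens, in input order
};

// Splits a plus-separated stack string. Empty tokens and repeated codes are
// skipped, so "eng+eng++deu" yields primary "eng" and sub-language "deu".
LanguageStack SplitLanguageStack(const std::string &stack) {
  LanguageStack result;
  std::vector<std::string> seen;
  std::size_t start = 0;
  while (start <= stack.size()) {
    std::size_t end = stack.find('+', start);
    if (end == std::string::npos) {
      end = stack.size();
    }
    const std::string token = stack.substr(start, end - start);
    start = end + 1;
    if (token.empty() ||
        std::find(seen.begin(), seen.end(), token) != seen.end()) {
      continue;
    }
    seen.push_back(token);
    if (result.primary.empty()) {
      result.primary = token;
    } else {
      result.sub_langs.push_back(token);
    }
  }
  return result;
}
```

Calling SplitLanguageStack("eng+hin+deu") gives a primary of eng and the sub-languages hin and deu, in that order.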

Initialization and the Language Stack

The process begins in src/api/baseapi.cpp when TessBaseAPI::Init receives the language string and stores it in the internal language_ member. This method forwards the request to Tesseract::init_tesseract, located in src/ccmain/tessedit.cpp, which orchestrates the loading sequence.

init_tesseract splits the stack with ParseLanguageString and then, once per language, calls init_tesseract_internal, which in turn calls init_tesseract_lang_data to load that language's data. The first language that loads successfully is initialized in the current Tesseract object, while every subsequent entry triggers the creation of a new Tesseract instance appended to the std::vector<Tesseract *> sub_langs_ declared in src/ccmain/tesseractclass.h. Each instance in this vector keeps its own classifier and dictionary context.

Sub-language Dependencies

Languages can declare additional required languages through the configuration variable tessedit_load_sublangs. After a language loads, the engine calls ParseLanguageString on that language's tessedit_load_sublangs value to extend the load list, potentially adding more entries to sub_langs_ automatically. For example, if the German configuration contains the line tessedit_load_sublangs fra, loading German implicitly loads French as an additional sub-language regardless of the command-line input.
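
The expansion behaves like a worklist that can grow while it is being processed. The following sketch models that idea without calling Tesseract; SublangConfig and ExpandLanguageStack are hypothetical names, and the map stands in for each language's tessedit_load_sublangs value.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the tessedit_load_sublangs value of each language.
using SublangConfig = std::map<std::string, std::vector<std::string>>;

// Returns the load order. Element 0 plays the role of the primary language;
// the remaining elements correspond to entries appended to sub_langs_.
std::vector<std::string> ExpandLanguageStack(std::vector<std::string> to_load,
                                             const SublangConfig &config) {
  // Index-based loop on purpose: to_load may grow while it is traversed.
  for (std::size_t i = 0; i < to_load.size(); ++i) {
    const std::string lang = to_load[i];  // copy, push_back may reallocate
    const auto it = config.find(lang);
    if (it == config.end()) {
      continue;  // this language declares no dependencies
    }
    for (const std::string &dep : it->second) {
      // Skip languages that are already queued, which also breaks cycles.
      if (std::find(to_load.begin(), to_load.end(), dep) == to_load.end()) {
        to_load.push_back(dep);
      }
    }
  }
  return to_load;
}
```

With a config in which deu requires fra, the request deu+eng expands to deu, eng, fra. Because already-queued languages are skipped, two languages that require each other do not cause an endless loop.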

Parameter Synchronization

To keep word ratings comparable across all loaded languages, Tesseract gives every language in a stack the same language-model weights. If tessedit_use_primary_params_model is set to true, the primary language’s ParamsModel is copied to every sub-language instance after initialization. If it is false, which is the default in the current source, the params model of every language in the stack is cleared so that all of them fall back to the same built-in default weights. Either way, confidence scores and word ratings use consistent weights across the entire stack, which prevents bias toward any single language’s statistical model.
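
A simplified model of this synchronization step is sketched below. The types are hypothetical stand-ins, not Tesseract classes, and the false branch (resetting every model so that default weights apply) follows the comment in init_tesseract; check it against the Tesseract version you build.

```cpp
#include <vector>

// Hypothetical simplification: a params model is just a list of weights.
// An empty list means "use the built-in default weights".
struct ParamsModelSketch {
  std::vector<float> weights;
};

// Hypothetical stand-in for one loaded language.
struct LanguageSketch {
  ParamsModelSketch params;
};

// Makes every language in the stack use the same weights.
void SyncParamsModels(LanguageSketch &primary,
                      std::vector<LanguageSketch> &sub_langs,
                      bool use_primary_params_model) {
  if (sub_langs.empty()) {
    return;  // single-language mode: the language keeps its own model
  }
  if (use_primary_params_model) {
    // Copy the primary language's weights to every sub-language.
    for (LanguageSketch &sub : sub_langs) {
      sub.params = primary.params;
    }
  } else {
    // Assumed behavior: clear all models so that defaults apply everywhere.
    primary.params.weights.clear();
    for (LanguageSketch &sub : sub_langs) {
      sub.params.weights.clear();
    }
  }
}
```

In both branches all languages end up with identical weights, which is the property that makes ratings comparable across the stack.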

Configuration Variables for Multilingual OCR

Tesseract exposes several knobs to control stacking behavior:

tessedit_load_sublangs – Specifies additional languages that must load alongside the primary language, typically defined inside a language-specific configuration file to force dependencies.
tessedit_use_primary_params_model – When true, copies the primary language’s model parameters to all sub-languages for consistent scoring. When false (the default in the current source), the params model of every language in the stack is cleared and the built-in default weights are used instead.
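tessedit_allow_ambig_words – Described as controlling how the engine handles ambiguous words that appear in multiple languages’ dictionaries. Confirm that this variable exists in your build with tesseract --print-parameters before relying on it.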

Runtime Recognition Process

During recognition, the primary Tesseract instance drives the OCR pipeline. Each instance in the stack has its own dictionary, classifier, and language_model_. When a word is evaluated, the primary instance can run recognition with the sub-languages as well (see classify_word_and_language in src/ccmain/control.cpp) and keeps the best result. The main instance coordinates results across the stack without requiring separate API calls for each language.
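
As a rough mental model of that coordination, each language can be pictured as producing one candidate per word, with the stack keeping the best one. The sketch below is a deliberate simplification and not the engine's actual selection logic; WordCandidate and PickBestCandidate are hypothetical, and the rating is assumed to be a cost where lower is better.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-language result for one word; not a Tesseract type.
struct WordCandidate {
  std::string lang;  // language that produced the candidate
  std::string text;  // recognized text
  float rating;      // assumed cost-like score: lower is better
};

// Returns the index of the lowest-rated candidate, or -1 for an empty list.
// Ties keep the earliest candidate, so the primary language wins a tie when
// it is listed first.
int PickBestCandidate(const std::vector<WordCandidate> &candidates) {
  int best = -1;
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    if (best < 0 || candidates[i].rating < candidates[best].rating) {
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

This comparison is only meaningful when all candidates were scored with the same weights, which is why the parameter synchronization described earlier matters.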

For introspection, the API provides:

GetInitLanguagesAsString() – Returns the original stack string exactly as passed to Init (e.g., "eng+hin+deu").
GetLoadedLanguagesAsVector() – Returns a vector containing the primary language and every sub-language actually loaded, including those introduced by tessedit_load_sublangs dependencies.

Both methods are implemented in src/api/baseapi.cpp, with C wrappers available in src/api/capi.cpp.
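
One practical use of these two calls is to compare what was requested with what was loaded. The helper below is a standalone sketch that takes the init string and the loaded vector as plain values, so it compiles without Tesseract; StackDiff and CompareStack are hypothetical names.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct StackDiff {
  std::vector<std::string> implicit;  // loaded but not named in the init string
  std::vector<std::string> missing;   // named in the init string but not loaded
};

// init_languages: the value returned by GetInitLanguagesAsString().
// loaded: the vector filled by GetLoadedLanguagesAsVector().
StackDiff CompareStack(const std::string &init_languages,
                       const std::vector<std::string> &loaded) {
  // Split the init string on '+', ignoring empty tokens.
  std::vector<std::string> requested;
  std::istringstream stream(init_languages);
  std::string token;
  while (std::getline(stream, token, '+')) {
    if (!token.empty()) {
      requested.push_back(token);
    }
  }

  auto contains = [](const std::vector<std::string> &list,
                     const std::string &item) {
    return std::find(list.begin(), list.end(), item) != list.end();
  };

  StackDiff diff;
  for (const std::string &lang : loaded) {
    if (!contains(requested, lang)) {
      diff.implicit.push_back(lang);
    }
  }
  for (const std::string &lang : requested) {
    if (!contains(loaded, lang)) {
      diff.missing.push_back(lang);
    }
  }
  return diff;
}
```

A non-empty missing list usually points to a .traineddata file that could not be found, while implicit lists the languages pulled in through tessedit_load_sublangs.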

Practical Implementation Examples

Command-Line Usage

Use the -l flag with plus signs separating language codes:

tesseract input.png stdout -l eng+hin+deu --psm 3

This command loads English as the primary language, with Hindi and German as sub-languages. Tesseract searches all three dictionaries during recognition.

C++ API Implementation

#include <cstdio>
#include <string>
#include <vector>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;
  
  // Initialize with English (primary), Hindi, and German
  if (api.Init("/usr/share/tessdata", "eng+hin+deu", tesseract::OEM_DEFAULT) != 0) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }

  // Inspect which languages were actually loaded
  std::vector<std::string> loaded;
  api.GetLoadedLanguagesAsVector(&loaded);
  for (const auto &lang : loaded) {
    printf("Loaded language: %s\n", lang.c_str());
  }

  // Perform OCR
  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png.\n");
    return 1;
  }
  api.SetImage(image);
  char *text = api.GetUTF8Text();
  printf("%s\n", text);

  // Cleanup
  delete [] text;
  pixDestroy(&image);
  api.End();
  return 0;
}

The Init method handles the stack construction, while GetLoadedLanguagesAsVector reveals the complete set of loaded models including implicit dependencies.

C API Implementation

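#include <stdio.h>

#include <leptonica/allheaders.h>
#include <tesseract/capi.h>

int main(void) {
  TessBaseAPI *handle = TessBaseAPICreate();

  // Initialize with language stack
  if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "eng+hin+deu") != 0) {
    fprintf(stderr, "Init failed\n");
    TessBaseAPIDelete(handle);
    return 1;
  }

  // Load the image with Leptonica and hand it to Tesseract
  PIX *image = pixRead("input.png");
  if (image == NULL) {
    fprintf(stderr, "Could not read input.png\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  TessBaseAPISetImage2(handle, image);
  char *out = TessBaseAPIGetUTF8Text(handle);
  printf("%s\n", out);

  // Show loaded languages (the returned array is NULL-terminated)
  char **langs = TessBaseAPIGetLoadedLanguagesAsVector(handle);
  for (int i = 0; langs[i] != NULL; ++i) {
    printf("Loaded: %s\n", langs[i]);
  }

  // Cleanup
  TessDeleteTextArray(langs);
  TessDeleteText(out);
  pixDestroy(&image);
  TessBaseAPIEnd(handle);
  TessBaseAPIDelete(handle);
  return 0;
}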

Advanced Configuration with tessedit_load_sublangs

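Create a custom configuration file custom.cfg. Config files hold one parameter per line, written as the name followed by whitespace and the value, with no equals sign:

tessedit_load_sublangs fra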

Then invoke Tesseract:

tesseract input.png stdout -l deu+eng --psm 3 custom.cfg

This loads German (primary), English (explicitly stacked), and French (pulled in implicitly by the tessedit_load_sublangs setting in custom.cfg).
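
Tesseract config files are read line by line as a parameter name followed by whitespace and a value. The sketch below mimics that splitting rule to show why an equals sign does not belong in the file; SplitConfigLine is a hypothetical helper, not part of the Tesseract API, and it ignores comment lines and other details of the real reader.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Splits one config line into (name, value) at the first run of spaces or
// tabs. Everything after that run is kept verbatim as the value.
std::pair<std::string, std::string> SplitConfigLine(const std::string &line) {
  const std::size_t name_end = line.find_first_of(" \t");
  if (name_end == std::string::npos) {
    return {line, ""};  // name only, empty value
  }
  const std::size_t value_start = line.find_first_not_of(" \t", name_end);
  if (value_start == std::string::npos) {
    return {line.substr(0, name_end), ""};  // trailing whitespace only
  }
  return {line.substr(0, name_end), line.substr(value_start)};
}
```

Under this rule the line tessedit_load_sublangs fra yields the value fra, whereas tessedit_load_sublangs = fra would yield the value "= fra", which is not a usable language list.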

Key Source Files and Architecture

The language stacking implementation spans several core files:

src/api/baseapi.cpp – Contains TessBaseAPI::Init, GetInitLanguagesAsString, and GetLoadedLanguagesAsVector, serving as the entry point for language string processing.
src/ccmain/tessedit.cpp – Implements init_tesseract, which parses the language stack and populates the sub_langs_ vector, along with init_tesseract_internal and init_tesseract_lang_data, which load each individual language.
src/ccmain/tesseractclass.h – Declares the sub_langs_ member variable that stores pointers to sub-language Tesseract instances.
src/api/capi.cpp – Provides C-language wrappers exposing the stacking functionality to external applications.
src/ccmain/tesseractclass.cpp – Implements helper methods for managing sub-language lifecycles and adaptive classifier resets.

Summary

Language stacking uses plus-separated codes (eng+hin+deu) to load multiple languages in one OCR session.
The first language in the string becomes the primary engine; subsequent entries load into the sub_langs_ vector as sub-languages.
tessedit_load_sublangs allows language configs to force additional automatic dependencies beyond the command-line specification.
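tessedit_use_primary_params_model copies the primary language’s scoring weights to all sub-languages when true; when false (the default in the current source), every language in the stack uses the built-in default weights. Both settings keep confidence ratings comparable.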
Use GetLoadedLanguagesAsVector to inspect the complete set of loaded models, including implicit dependencies.

Frequently Asked Questions

What is the correct syntax for language stacking in Tesseract?

Separate language codes with plus signs: eng+hin+deu. The first code is the primary language; all others are sub-languages. Do not include spaces between codes or plus signs.

How does Tesseract handle the primary language versus sub-languages?

The primary language loads into the main Tesseract object and drives the recognition pipeline. Sub-languages instantiate separate Tesseract objects stored in the sub_langs_ vector, each with its own dictionary, classifier, and language model. The primary instance coordinates the OCR process and chooses among the results the languages produce.

Can I force additional languages to load beyond what I specify in the command?

Yes. Set the tessedit_load_sublangs variable inside a language’s configuration file (e.g., deu.config) to list extra languages that must load whenever that language is initialized. These appear in the loaded vector even if omitted from the command-line language string.

Why are my word confidence scores different when using multiple languages?

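Word ratings depend on the language-model weights in use, and Tesseract gives every language in a stack the same weights. With tessedit_use_primary_params_model set to true, the primary language’s ParamsModel is copied to all sub-languages; with false (the default in the current source), every language falls back to the built-in default weights. Scores from a stacked run can therefore differ from a single-language run on the same image, and they can change when you toggle this variable or, when it is true, reorder the stack so that a different language becomes primary.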
