# How to Perform Multilingual OCR with Tesseract Using Language Stacking

> Learn how to perform multilingual OCR with Tesseract using language stacking. Master single-pass recognition with plus-separated syntax and optimize your OCR workflow today.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract performs multilingual OCR in a single pass by stacking language models using plus-separated syntax like `eng+hin+deu`, where the first language acts as the primary engine and subsequent languages are loaded as sub-languages into a dedicated vector.**

The tesseract-ocr/tesseract repository enables recognition of mixed-language documents without multiple execution passes. By leveraging **language stacking**, you combine several `.traineddata` files into one OCR session, allowing the engine to query dictionaries and classifiers across all specified languages simultaneously.

## How Language Stacking Works in Tesseract

The stacking mechanism is implemented in the core C++ engine and exposed through the C and C++ APIs. When you provide a language string such as `"eng+hin+deu"`, Tesseract parses the first token as the primary language and treats every subsequent token as a sub-language.

### Initialization and the Language Stack

The process begins in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) when `TessBaseAPI::Init` receives the language string and stores it in the internal `language_` member. This method forwards the request to `Tesseract::init_tesseract`, located in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp), which orchestrates the loading sequence.

`init_tesseract` calls `ParseLanguageString` to split the stack, then calls `init_tesseract_internal` once per language; that function in turn calls `init_tesseract_lang_data` to load the `.traineddata` contents. The first entry that loads successfully goes into the current `Tesseract` object, while every subsequent entry creates a new `Tesseract` instance that is appended to the `std::vector<Tesseract *> sub_langs_` declared in [`src/ccmain/tesseractclass.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.h). Each instance in this vector keeps its own classifier and dictionary context.
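The split into a primary language and sub-languages can be pictured with a small standalone sketch. It does not reproduce Tesseract's own parser; `LanguageStack` and `SplitLanguageStack` are hypothetical names introduced only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical result type; not part of the Tesseract API.
struct LanguageStack {
  std::string primary;                 // first non-empty token
  std::vector<std::string> sub_langs;  // remaining tokens, in order
};

// Splits a spec such as "eng+hin+deu" on '+'.
// Empty tokens (for example from "eng++deu") are skipped.
LanguageStack SplitLanguageStack(const std::string &spec) {
  LanguageStack stack;
  std::size_t start = 0;
  while (start <= spec.size()) {
    std::size_t end = spec.find('+', start);
    if (end == std::string::npos) {
      end = spec.size();
    }
    if (end > start) {
      std::string token = spec.substr(start, end - start);
      if (stack.primary.empty()) {
        stack.primary = token;
      } else {
        stack.sub_langs.push_back(token);
      }
    }
    start = end + 1;
  }
  return stack;
}
```

Calling `SplitLanguageStack("eng+hin+deu")` yields `eng` as the primary entry and `hin`, `deu` as sub-languages, mirroring the order in which the engine attempts to load them.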

### Sub-language Dependencies

Languages can declare **additional required languages** through the configuration variable `tessedit_load_sublangs`. After a language loads, the engine passes that variable's value to `ParseLanguageString` to extend the list, potentially adding more entries to `sub_langs_` automatically. For example, if the configuration bundled with the German model contains the line `tessedit_load_sublangs fra` (Tesseract config files use whitespace-separated `name value` pairs), loading German implicitly loads French as an additional sub-language regardless of the command-line input.
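The expansion step behaves like a worklist that can grow while it is being walked. The sketch below models that idea only; `SublangTable` and `ExpandLanguages` are hypothetical, and the table stands in for whatever each language's `tessedit_load_sublangs` value would contain.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-language tessedit_load_sublangs values.
using SublangTable = std::map<std::string, std::vector<std::string>>;

// Returns the load order: requested languages first, followed by any
// dependencies discovered while walking the list. Duplicates are skipped,
// so cyclic declarations (deu -> fra -> deu) still terminate.
std::vector<std::string> ExpandLanguages(std::vector<std::string> to_load,
                                         const SublangTable &declared) {
  // Index-based loop because to_load may grow during iteration.
  for (std::size_t i = 0; i < to_load.size(); ++i) {
    auto it = declared.find(to_load[i]);
    if (it == declared.end()) {
      continue;  // this language declares no dependencies
    }
    for (const std::string &dep : it->second) {
      if (std::find(to_load.begin(), to_load.end(), dep) == to_load.end()) {
        to_load.push_back(dep);
      }
    }
  }
  return to_load;
}
```

With a table in which `deu` declares `fra`, expanding `{"deu", "eng"}` produces `deu`, `eng`, `fra`: the dependency is appended after the explicitly requested entries.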

### Parameter Synchronization

To keep word ratings comparable across all loaded languages, Tesseract aligns the language-model weights whenever sub-languages are present. If `tessedit_use_primary_params_model` is **true**, the primary language’s `ParamsModel` is copied to every sub-language instance after initialization. If it is false (the default declared in `src/ccmain/tesseractclass.cpp` at the time of writing), the `ParamsModel` of every instance is cleared so that all languages fall back to the same default weights. Either way, no single language’s tuned weights bias the comparison.
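A toy model of the synchronization step follows. `ToyLanguage` and `SyncWeights` are hypothetical, a plain vector of floats stands in for `ParamsModel`, and the behavior of the disabled branch (resetting every language to shared defaults) is an assumption about `init_tesseract` that is worth verifying against the source for your version.

```cpp
#include <vector>

// Toy stand-in for one loaded language; `weights` plays the role of a
// ParamsModel. Empty weights mean "use the built-in defaults".
struct ToyLanguage {
  std::vector<float> weights;
};

// Makes ratings comparable across the stack: either every sub-language
// adopts the primary's weights, or all weights are cleared so that every
// language falls back to the same defaults (assumed behavior).
void SyncWeights(ToyLanguage &primary, std::vector<ToyLanguage> &subs,
                 bool use_primary) {
  if (subs.empty()) {
    return;  // single-language mode: nothing to synchronize
  }
  if (use_primary) {
    for (ToyLanguage &sub : subs) {
      sub.weights = primary.weights;
    }
  } else {
    primary.weights.clear();
    for (ToyLanguage &sub : subs) {
      sub.weights.clear();
    }
  }
}
```

In both branches every language ends up scoring with the same weights, which is the property the synchronization is meant to provide.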

## Configuration Variables for Multilingual OCR

Tesseract exposes several knobs to control stacking behavior:

- **`tessedit_load_sublangs`** – Specifies additional languages that must load alongside the primary language, typically defined inside a language-specific configuration file to force dependencies.
- **`tessedit_use_primary_params_model`** – When true, copies the primary language’s model parameters to all sub-languages for consistent scoring. When false (the default), the parameter models of all loaded languages are cleared so that each one scores with the same default weights.
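- **`tessedit_allow_ambig_words`** – Described as controlling how the engine handles ambiguous words that appear in multiple languages’ dictionaries. Parameter names vary between releases, so confirm that your build exposes it by checking the output of `tesseract --print-parameters`.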

## Runtime Recognition Process

During recognition, the primary `Tesseract` instance drives the OCR pipeline. For each word it can run recognition with its own model and with each instance in `sub_langs_`, every instance using its own dictionary and classifier, and it keeps the best-rated result. The main instance coordinates results across the stack without requiring separate API calls for each language.
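One way to picture the coordination step is a loop that asks every loaded language for a candidate and keeps the best one. This is a toy model rather than Tesseract's code: `WordResult`, `Recognizer`, and `BestAcrossLanguages` are hypothetical, and the sketch assumes a distance-like rating where lower is better.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-word result.
struct WordResult {
  std::string text;
  float rating;  // assumption: lower is better (distance-like score)
};

// Hypothetical recognizer: maps an opaque word id to a candidate result.
using Recognizer = std::function<WordResult(int word_id)>;

// Tries the primary recognizer first, then each sub-language, and keeps
// the candidate with the lowest rating. Ties keep the earlier language.
WordResult BestAcrossLanguages(const Recognizer &primary,
                               const std::vector<Recognizer> &subs,
                               int word_id) {
  WordResult best = primary(word_id);
  for (const Recognizer &sub : subs) {
    WordResult candidate = sub(word_id);
    if (candidate.rating < best.rating) {
      best = candidate;
    }
  }
  return best;
}
```

Because ties keep the earlier language, the order of the stack matters in this sketch: the primary language wins whenever a sub-language does not produce a strictly better rating.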

For introspection, the API provides:

- `GetInitLanguagesAsString()` – Returns the original stack string exactly as passed to `Init` (e.g., `"eng+hin+deu"`).
- `GetLoadedLanguagesAsVector()` – Returns a vector containing the primary language and every sub-language actually loaded, including those introduced by `tessedit_load_sublangs` dependencies.

Both methods are implemented in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp), with C wrappers available in [`src/api/capi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp).
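Comparing the two introspection results shows which languages were pulled in implicitly. The helper below is a sketch that works on plain strings so it can be read without a Tesseract installation; `ImplicitlyLoaded` is a hypothetical name, and its inputs are assumed to come from `GetInitLanguagesAsString()` and `GetLoadedLanguagesAsVector()`.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Returns the entries of `loaded` that do not appear in the plus-separated
// `init_spec`, i.e. languages that were loaded without being requested.
std::vector<std::string> ImplicitlyLoaded(
    const std::string &init_spec, const std::vector<std::string> &loaded) {
  // Split the original spec on '+', skipping empty tokens.
  std::vector<std::string> requested;
  std::istringstream in(init_spec);
  for (std::string token; std::getline(in, token, '+');) {
    if (!token.empty()) {
      requested.push_back(token);
    }
  }
  // Keep every loaded language that was not explicitly requested.
  std::vector<std::string> implicit;
  for (const std::string &lang : loaded) {
    if (std::find(requested.begin(), requested.end(), lang) ==
        requested.end()) {
      implicit.push_back(lang);
    }
  }
  return implicit;
}
```

For an init string of `deu+eng` and a loaded vector of `deu`, `eng`, `fra`, the helper returns `fra`, which points to a `tessedit_load_sublangs` dependency.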

## Practical Implementation Examples

### Command-Line Usage

Use the `-l` flag with plus signs separating language codes:

```bash
tesseract input.png stdout -l eng+hin+deu --psm 3
```

This command loads English as the primary language, with Hindi and German as sub-languages. Tesseract searches all three dictionaries during recognition.
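When the language list comes from application code rather than being typed by hand, the `-l` argument can be assembled programmatically. A minimal sketch, with `JoinLanguages` as a hypothetical helper name:

```cpp
#include <string>
#include <vector>

// Joins language codes with '+', preserving order so that the first
// entry becomes the primary language. Empty entries are skipped.
std::string JoinLanguages(const std::vector<std::string> &langs) {
  std::string spec;
  for (const std::string &lang : langs) {
    if (lang.empty()) {
      continue;
    }
    if (!spec.empty()) {
      spec += '+';
    }
    spec += lang;
  }
  return spec;
}
```

The resulting string can be passed to `-l` on the command line or used as the language argument of `Init`.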

### C++ API Implementation

```cpp
#include <cstdio>
#include <string>
#include <vector>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;

  // Initialize with English (primary), Hindi, and German
  if (api.Init("/usr/share/tessdata", "eng+hin+deu", tesseract::OEM_DEFAULT) != 0) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }

  // Inspect which languages were actually loaded
  std::vector<std::string> loaded;
  api.GetLoadedLanguagesAsVector(&loaded);
  for (const auto &lang : loaded) {
    printf("Loaded language: %s\n", lang.c_str());
  }

  // Load the image; pixRead returns nullptr on failure
  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png\n");
    api.End();
    return 1;
  }

  // Perform OCR
  api.SetImage(image);
  char *text = api.GetUTF8Text();
  if (text != nullptr) {
    printf("%s\n", text);
  }

  // Cleanup: GetUTF8Text returns memory that must be freed with delete[]
  delete[] text;
  pixDestroy(&image);
  api.End();
  return 0;
}
```

The `Init` method handles the stack construction, while `GetLoadedLanguagesAsVector` reveals the complete set of loaded models including implicit dependencies.

### C API Implementation

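```c
#include <stdio.h>

#include <leptonica/allheaders.h>
#include <tesseract/capi.h>

int main(void) {
  TessBaseAPI *handle = TessBaseAPICreate();

  // Initialize with language stack
  if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "eng+hin+deu") != 0) {
    fprintf(stderr, "Init failed\n");
    TessBaseAPIDelete(handle);
    return 1;
  }

  // Load the image with Leptonica and hand it to Tesseract
  PIX *image = pixRead("input.png");
  if (image == NULL) {
    fprintf(stderr, "Could not read input.png\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  TessBaseAPISetImage2(handle, image);

  char *out = TessBaseAPIGetUTF8Text(handle);
  if (out != NULL) {
    printf("%s\n", out);
  }

  // Show loaded languages (NULL-terminated array)
  char **langs = TessBaseAPIGetLoadedLanguagesAsVector(handle);
  for (int i = 0; langs[i] != NULL; ++i) {
    printf("Loaded: %s\n", langs[i]);
  }

  // Cleanup
  TessDeleteTextArray(langs);
  TessDeleteText(out);
  pixDestroy(&image);
  TessBaseAPIEnd(handle);
  TessBaseAPIDelete(handle);
  return 0;
}
```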

### Advanced Configuration with tessedit_load_sublangs

Create a custom configuration file named `custom.cfg`. Tesseract config files hold one `name value` pair per line, separated by whitespace:

```
tessedit_load_sublangs fra
```

Then invoke Tesseract:

```bash
tesseract input.png stdout -l deu+eng --psm 3 custom.cfg
```

This loads German (primary), English (explicitly stacked), and French, which is pulled in implicitly because `custom.cfg` sets `tessedit_load_sublangs`.

## Key Source Files and Architecture

The language stacking implementation spans several core files:

- **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp)** – Contains `TessBaseAPI::Init`, `GetInitLanguagesAsString`, and `GetLoadedLanguagesAsVector`, serving as the entry point for language string processing.
- **[`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp)** – Implements `init_tesseract`, which parses the language stack and populates the `sub_langs_` vector, along with the per-language loaders `init_tesseract_internal` and `init_tesseract_lang_data`.
- **[`src/ccmain/tesseractclass.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.h)** – Declares the `sub_langs_` member variable that stores pointers to sub-language `Tesseract` instances.
- **[`src/api/capi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp)** – Provides C-language wrappers exposing the stacking functionality to external applications.
- **[`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp)** – Implements helper methods for managing sub-language lifecycles and adaptive classifier resets.

## Summary

- **Language stacking** uses plus-separated codes (`eng+hin+deu`) to load multiple languages in one OCR session.
- The **first language** in the string becomes the primary engine; subsequent entries load into the `sub_langs_` vector as sub-languages.
- **`tessedit_load_sublangs`** allows language configs to force additional automatic dependencies beyond the command-line specification.
- **`tessedit_use_primary_params_model`** selects which shared weights the stack scores with: the primary language’s `ParamsModel` when true, the built-in defaults when false (the default). Either setting keeps ratings comparable across languages.
- Use **`GetLoadedLanguagesAsVector`** to inspect the complete set of loaded models, including implicit dependencies.

## Frequently Asked Questions

### What is the correct syntax for language stacking in Tesseract?

Separate language codes with plus signs: `eng+hin+deu`. The first code is the primary language; all others are sub-languages. Do not include spaces between codes or plus signs.

### How does Tesseract handle the primary language versus sub-languages?

The primary language loads into the main `Tesseract` object and drives the recognition pipeline. Sub-languages instantiate separate `Tesseract` objects stored in the `sub_langs_` vector. Each keeps its own dictionary and classifier, while the primary instance coordinates the OCR process and keeps the best-rated result for each word.

### Can I force additional languages to load beyond what I specify in the command?

Yes. Set the `tessedit_load_sublangs` variable inside a language’s configuration file (e.g., `deu.config`) to list extra languages that must load whenever that language is initialized. These appear in the loaded vector even if omitted from the command-line language string.

### Why are my word confidence scores different when using multiple languages?

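In multilingual mode Tesseract replaces each language’s own `ParamsModel` weights with a shared set so that ratings are comparable across the stack, which means scores can differ from a single-language run on the same image. With `tessedit_use_primary_params_model` set to true, the shared set is the primary language’s `ParamsModel`; with it set to false (the default), all parameter models are cleared and the built-in default weights apply. When the variable is enabled, changing the primary language also changes the rating scale.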