How to Perform Multilingual OCR with Tesseract Using Language Stacking
Tesseract performs multilingual OCR in a single pass by stacking language models using plus-separated syntax like eng+hin+deu, where the first language acts as the primary engine and subsequent languages are loaded as sub-languages into a dedicated vector.
The tesseract-ocr/tesseract repository enables recognition of mixed-language documents without multiple execution passes. By leveraging language stacking, you combine several .traineddata files into one OCR session, allowing the engine to query dictionaries and classifiers across all specified languages simultaneously.
How Language Stacking Works in Tesseract
The stacking mechanism is implemented in the core C++ engine and exposed through the C and C++ APIs. When you provide a language string such as "eng+hin+deu", Tesseract parses the first token as the primary language and treats every subsequent token as a sub-language.
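The primary/sub-language split described above can be illustrated with a small standalone sketch. `LanguageStack` and `SplitLanguageStack` are hypothetical names invented for this example, not part of the Tesseract API, and the code only approximates the splitting step; it does not reproduce the engine's own parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for illustration only (not a Tesseract type).
struct LanguageStack {
  std::string primary;                 // first code in the string
  std::vector<std::string> sub_langs;  // every later code, in order
};

// Splits a stack string such as "eng+hin+deu" on '+'.
// Empty tokens (e.g. from "eng++deu") are skipped and a repeated code is
// kept only once. Both choices are assumptions made for this sketch.
LanguageStack SplitLanguageStack(const std::string &spec) {
  LanguageStack stack;
  std::vector<std::string> seen;
  std::size_t pos = 0;
  while (pos <= spec.size()) {
    std::size_t plus = spec.find('+', pos);
    if (plus == std::string::npos) {
      plus = spec.size();
    }
    const std::string code = spec.substr(pos, plus - pos);
    pos = plus + 1;
    if (code.empty()) {
      continue;
    }
    bool duplicate = false;
    for (const auto &s : seen) {
      if (s == code) {
        duplicate = true;
      }
    }
    if (duplicate) {
      continue;
    }
    seen.push_back(code);
    if (stack.primary.empty()) {
      stack.primary = code;  // first token becomes the primary language
    } else {
      stack.sub_langs.push_back(code);
    }
  }
  // Invariant: a primary exists exactly when at least one code was seen.
  assert(stack.primary.empty() == seen.empty());
  return stack;
}
```

Calling `SplitLanguageStack("eng+hin+deu")` yields `eng` as the primary and `hin`, `deu` as sub-languages, in that order.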
Initialization and the Language Stack
The process begins in src/api/baseapi.cpp, where TessBaseAPI::Init stores the language string in the internal language_ member and forwards the request to Tesseract::init_tesseract in src/ccmain/tessedit.cpp, which orchestrates the loading sequence.
init_tesseract calls ParseLanguageString to split the stack, then calls Tesseract::init_tesseract_internal (which in turn calls init_tesseract_lang_data) once per entry. The first entry that loads successfully is loaded into the current Tesseract object, while every subsequent entry gets a new Tesseract instance that is appended to the std::vector<Tesseract *> sub_langs_ declared in src/ccmain/tesseractclass.h. Each instance in this vector keeps its own classifier and dictionary context.
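The loading sequence can be modeled in a few lines. This is a simplified sketch with hypothetical types (`Engine`, `Stack`, `LoadStack`) and a fake "installed languages" set in place of real .traineddata files. It also assumes that a language which fails to load is skipped, and that the first language that loads successfully takes the primary slot.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one per-language engine instance.
struct Engine {
  std::string lang;
  // Pretends to load <code>.traineddata; succeeds only for installed codes.
  bool Load(const std::string &code, const std::set<std::string> &installed) {
    if (installed.count(code) == 0) {
      return false;
    }
    lang = code;
    return true;
  }
};

// Hypothetical model of the stack: one primary plus owned sub-languages.
struct Stack {
  Engine primary;
  std::vector<std::unique_ptr<Engine>> sub_langs;
  bool primary_loaded = false;
};

// Loads the codes in order. Returns false if no language could be loaded.
bool LoadStack(const std::vector<std::string> &codes,
               const std::set<std::string> &installed, Stack *stack) {
  assert(stack != nullptr);
  for (const auto &code : codes) {
    if (!stack->primary_loaded) {
      // No primary yet: try to load this code into the main instance.
      stack->primary_loaded = stack->primary.Load(code, installed);
    } else {
      // Primary exists: every further language gets its own instance.
      auto sub = std::make_unique<Engine>();
      if (sub->Load(code, installed)) {
        stack->sub_langs.push_back(std::move(sub));
      }
    }
  }
  return stack->primary_loaded;
}
```

With `eng` and `deu` installed, loading `{"eng", "hin", "deu"}` leaves `eng` as the primary and a single sub-language, `deu`, because the missing `hin` model is skipped.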
Sub-language Dependencies
Languages can declare additional required languages through the configuration variable tessedit_load_sublangs. After each language loads, the engine passes that variable’s value to ParseLanguageString, which can append further entries to the load list and therefore to sub_langs_. For example, if the German configuration contains the line tessedit_load_sublangs fra, loading German also loads French as a sub-language, even when fra is absent from the command-line language string.
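Dependency expansion of this kind is essentially a worklist that grows while it is being processed. The sketch below models it with a hypothetical `DependencyMap` standing in for the per-language tessedit_load_sublangs values. The names and the duplicate-skipping rule are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: language code -> languages it asks to load with it.
using DependencyMap = std::map<std::string, std::vector<std::string>>;

// Expands to_load in place. An index-based loop is used on purpose: the
// vector grows during iteration, which would invalidate iterators.
void ExpandStack(const DependencyMap &deps, std::vector<std::string> *to_load) {
  assert(to_load != nullptr);
  for (std::size_t i = 0; i < to_load->size(); ++i) {
    const auto it = deps.find((*to_load)[i]);
    if (it == deps.end()) {
      continue;  // this language declares no dependencies
    }
    for (const auto &dep : it->second) {
      // Skipping codes that are already listed also makes cycles harmless.
      if (std::find(to_load->begin(), to_load->end(), dep) == to_load->end()) {
        to_load->push_back(dep);
      }
    }
  }
}
```

Starting from `{"deu", "eng"}` with `deu` depending on `fra`, the list becomes `{"deu", "eng", "fra"}`; dependencies are appended after the explicitly requested languages.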
Parameter Synchronization
To keep word ratings comparable across all loaded languages, Tesseract makes every instance in the stack use the same language-model weights. If tessedit_use_primary_params_model is set to true, the primary language’s ParamsModel is copied to every sub-language instance after initialization. If it is false (the default), the ParamsModel of every loaded language is cleared so that all of them fall back to the default weights. Either way, confidence scores and word ratings use consistent weights across the entire stack, preventing bias toward any single language’s statistical model.
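A minimal sketch of the synchronization step follows. `ParamsModel` and `LangInstance` here are simplified hypothetical structs, not Tesseract's classes. The sketch assumes that when the flag is off, all instances drop their own weights and fall back to defaults, and that nothing happens when only one language is loaded.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified, hypothetical model of per-language scoring weights.
// An empty vector stands for "use the built-in default weights".
struct ParamsModel {
  std::vector<float> weights;
};

struct LangInstance {
  std::string lang;
  ParamsModel params;
};

// Makes all instances score with the same weights.
void SyncParams(bool use_primary_params_model, LangInstance *primary,
                std::vector<LangInstance> *subs) {
  assert(primary != nullptr && subs != nullptr);
  if (subs->empty()) {
    return;  // single-language mode: keep the language's own weights
  }
  if (use_primary_params_model) {
    // Copy the primary language's weights to every sub-language.
    for (auto &sub : *subs) {
      sub.params = primary->params;
    }
  } else {
    // Drop all language-specific weights so everyone uses the defaults.
    primary->params.weights.clear();
    for (auto &sub : *subs) {
      sub.params.weights.clear();
    }
  }
}
```

In both branches every instance ends up with identical weights, which is the property that makes ratings from different languages comparable.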
Configuration Variables for Multilingual OCR
Tesseract exposes several knobs to control stacking behavior:
- tessedit_load_sublangs – Specifies additional languages that must load alongside the primary language, typically defined inside a language-specific configuration file to force dependencies.
- tessedit_use_primary_params_model – When true, copies the primary language’s model parameters to all sub-languages. When false (the default), the parameter models of all loaded languages are cleared so that every language scores with the default weights.
- tessedit_allow_ambig_words – Controls how the engine handles ambiguous words that appear in multiple languages’ dictionaries.
Runtime Recognition Process
During recognition, the primary Tesseract instance drives the OCR pipeline. Each instance in the stack keeps its own classifier and dictionary, so when a word is evaluated the primary instance can run it through more than one language and keep the best-rated result (see classify_word_and_language in src/ccmain/control.cpp). The main instance coordinates results across the stack without requiring separate API calls for each language.
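The idea of coordinating results across a stack can be sketched as "try one language first, fall back to the others, keep the best". The code below is a deliberately simplified model with hypothetical names (`WordResult`, `Recognizer`, `RecognizeWithStack`) and a made-up confidence scale. It is not Tesseract's actual control flow.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical per-word result; confidence is in [0, 1], higher is better.
struct WordResult {
  std::string lang;
  std::string text;
  float confidence;
};

// One callable per loaded language; index 0 plays the role of the primary.
using Recognizer = std::function<WordResult()>;

// Tries the most recently successful language first and returns early if
// its result clears accept_threshold. Otherwise tries every other language
// and keeps the highest-confidence result, remembering which one won.
WordResult RecognizeWithStack(const std::vector<Recognizer> &stack,
                              std::size_t *most_recent,
                              float accept_threshold) {
  assert(most_recent != nullptr && *most_recent < stack.size());
  const std::size_t first = *most_recent;
  WordResult best = stack[first]();
  if (best.confidence >= accept_threshold) {
    return best;
  }
  std::size_t best_index = first;
  for (std::size_t i = 0; i < stack.size(); ++i) {
    if (i == first) {
      continue;  // already tried
    }
    WordResult candidate = stack[i]();
    if (candidate.confidence > best.confidence) {
      best = candidate;
      best_index = i;
    }
  }
  *most_recent = best_index;
  return best;
}
```

Remembering the winning language means a run of words in the same language usually needs only one recognition attempt per word.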
For introspection, the API provides:
- GetInitLanguagesAsString() – Returns the original stack string exactly as passed to Init (e.g., "eng+hin+deu").
- GetLoadedLanguagesAsVector() – Returns a vector containing the primary language and every sub-language actually loaded, including those introduced by tessedit_load_sublangs dependencies.
Both methods are implemented in src/api/baseapi.cpp, with C wrappers available in src/api/capi.cpp.
Practical Implementation Examples
Command-Line Usage
Use the -l flag with plus signs separating language codes:
tesseract input.png stdout -l eng+hin+deu --psm 3
This command loads English as the primary language, with Hindi and German as sub-languages. Tesseract searches all three dictionaries during recognition.
C++ API Implementation
#include <cstdio>
#include <string>
#include <vector>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  // Initialize with English (primary), Hindi, and German.
  // The tessdata path is system-specific; adjust it for your installation.
  if (api.Init("/usr/share/tessdata", "eng+hin+deu", tesseract::OEM_DEFAULT) != 0) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }
  // Inspect which languages were actually loaded
  std::vector<std::string> loaded;
  api.GetLoadedLanguagesAsVector(&loaded);
  for (const auto &lang : loaded) {
    printf("Loaded language: %s\n", lang.c_str());
  }
  // Perform OCR
  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png\n");
    api.End();
    return 1;
  }
  api.SetImage(image);
  char *text = api.GetUTF8Text();
  if (text != nullptr) {
    printf("%s\n", text);
  }
  // Cleanup: GetUTF8Text returns memory that must be freed with delete[]
  delete[] text;
  pixDestroy(&image);
  api.End();
  return 0;
}
The Init method handles the stack construction, while GetLoadedLanguagesAsVector reveals the complete set of loaded models including implicit dependencies.
C API Implementation
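#include <stdio.h>
#include <leptonica/allheaders.h>
#include <tesseract/capi.h>

int main(void) {
  TessBaseAPI *handle = TessBaseAPICreate();
  // Initialize with language stack
  if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "eng+hin+deu") != 0) {
    fprintf(stderr, "Init failed\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  // Read the image with Leptonica and hand the Pix to Tesseract
  PIX *image = pixRead("input.png");
  if (image == NULL) {
    fprintf(stderr, "Could not read input.png\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  TessBaseAPISetImage2(handle, image);
  char *out = TessBaseAPIGetUTF8Text(handle);
  if (out != NULL) {
    printf("%s\n", out);
  }
  // Show loaded languages (NULL-terminated array of strings)
  char **langs = TessBaseAPIGetLoadedLanguagesAsVector(handle);
  for (int i = 0; langs != NULL && langs[i] != NULL; ++i) {
    printf("Loaded: %s\n", langs[i]);
  }
  // Cleanup
  if (langs != NULL) {
    TessDeleteTextArray(langs);
  }
  TessDeleteText(out);
  pixDestroy(&image);
  TessBaseAPIEnd(handle);
  TessBaseAPIDelete(handle);
  return 0;
}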
Advanced Configuration with tessedit_load_sublangs
Create a custom configuration file custom.cfg. Tesseract config files use whitespace-separated name/value pairs, with no equals sign:
tessedit_load_sublangs fra
Then invoke Tesseract:
tesseract input.png stdout -l deu+eng --psm 3 custom.cfg
This loads German (primary), English (explicitly stacked), and French (pulled in implicitly because custom.cfg sets tessedit_load_sublangs).
Key Source Files and Architecture
The language stacking implementation spans several core files:
- src/api/baseapi.cpp – Contains TessBaseAPI::Init, GetInitLanguagesAsString, and GetLoadedLanguagesAsVector, serving as the entry point for language string processing.
- src/ccmain/tessedit.cpp – Implements init_tesseract, which parses the language stack with ParseLanguageString and populates the sub_langs_ vector, along with init_tesseract_internal and init_tesseract_lang_data, which load each individual language.
- src/ccmain/tesseractclass.h – Declares the sub_langs_ member variable that stores pointers to sub-language Tesseract instances.
- src/api/capi.cpp – Provides C-language wrappers exposing the stacking functionality to external applications.
- src/ccmain/tesseractclass.cpp – Implements helper methods for managing sub-language lifecycles and adaptive classifier resets.
Summary
- Language stacking uses plus-separated codes (eng+hin+deu) to load multiple languages in one Tesseract instance.
- The first language in the string becomes the primary engine; subsequent entries load into the sub_langs_ vector as sub-languages.
- tessedit_load_sublangs allows language configs to force additional automatic dependencies beyond the command-line specification.
- tessedit_use_primary_params_model, when true, copies the primary language’s scoring weights to all sub-languages; when false (the default), all languages use the default weights. Either way, ratings stay comparable across the stack.
- Use GetLoadedLanguagesAsVector to inspect the complete set of loaded models, including implicit dependencies.
Frequently Asked Questions
What is the correct syntax for language stacking in Tesseract?
Separate language codes with plus signs: eng+hin+deu. The first code is the primary language; all others are sub-languages. Do not include spaces between codes or plus signs.
How does Tesseract handle the primary language versus sub-languages?
The primary language loads into the main Tesseract object and drives the recognition pipeline. Sub-languages are separate Tesseract objects stored in the sub_langs_ vector, each with its own dictionary and classifier. The primary instance coordinates the OCR process and decides which language’s result to keep for each word.
Can I force additional languages to load beyond what I specify in the command?
Yes. Set the tessedit_load_sublangs variable inside a language’s configuration file (e.g., deu.config) to list extra languages that must load whenever that language is initialized. These appear in the loaded vector even if omitted from the command-line language string.
Why are my word confidence scores different when using multiple languages?
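Word ratings are only comparable when every language in the stack scores with the same language-model weights, and Tesseract enforces this at initialization. If tessedit_use_primary_params_model is true, the primary language’s ParamsModel is copied to all sub-languages. If it is false (the default), every instance’s ParamsModel is cleared and all languages use the default weights. In both cases a language’s own weights may no longer be in effect, so scores from a multilingual run can differ from those of a single-language run with the same model.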