How to Implement Bilingual OCR with Language Priority Overrides in Tesseract
Tesseract processes bilingual documents by loading multiple language models in a single initialization call. The order of the language codes determines priority, and a tilde (~) prefix marks a language that must not be loaded, even when another model's config requests it as a sub-language.
Bilingual OCR with language priority overrides is essential for accurately recognizing documents containing mixed-language content, such as English-Spanish forms or French-German technical manuals. The tesseract-ocr/tesseract engine implements this capability through a specialized language string syntax parsed during the Init sequence, allowing developers to specify exactly which traineddata models load and in what order of precedence.
Understanding Tesseract’s Language String Syntax
Tesseract accepts a compact language string that defines both the languages to load and their priority hierarchy. The syntax follows this pattern:
[~]lang1[+ [~]lang2[+ …]]
- lang: ISO-639-3 language code (e.g., eng, spa, deu, fra).
- +: Separator indicating additional languages.
- ~: Prefix that excludes the language from loading, even when another loaded language requests it as a sub-language (via tessedit_load_sublangs).
The first language in the string becomes the primary language, whose character set, dictionary, and language-specific parameters take precedence during glyph recognition. Subsequent languages serve as secondary fallbacks.
How Language Priority Overrides Work Internally
Parsing the Language String
The core parsing logic resides in src/ccmain/tessedit.cpp within the ParseLanguageString function. When TessBaseAPI::Init (declared in include/tesseract/baseapi.h) is invoked, it forwards the language string to Tesseract::init_tesseract, which calls ParseLanguageString.
This parser iterates through the string token by token:
- Strips leading + characters.
- Checks for the ~ prefix.
- If present, adds the language to langs_not_to_load; otherwise adds to langs_to_load.
- Appends the token (minus the ~) to the appropriate vector for subsequent model loading.
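The steps above can be sketched as a small standalone function. This is a simplified illustration, not the Tesseract source: the names ParsedLanguages and ParseLanguageSketch are hypothetical, and skipping empty tokens and duplicates is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the two lists the parsing step produces.
struct ParsedLanguages {
  std::vector<std::string> to_load;
  std::vector<std::string> not_to_load;
};

// Simplified stand-in for the parsing step (not the actual Tesseract code).
// Splits a string such as "eng+spa+~fra" into codes to load and codes
// that must not be loaded.
ParsedLanguages ParseLanguageSketch(const std::string &lang_str) {
  ParsedLanguages result;
  std::size_t pos = 0;
  while (pos < lang_str.size()) {
    // Skip any leading '+' separators.
    while (pos < lang_str.size() && lang_str[pos] == '+') {
      ++pos;
    }
    if (pos >= lang_str.size()) {
      break;
    }
    // A '~' prefix routes the code to the exclusion list.
    std::vector<std::string> *target = &result.to_load;
    if (lang_str[pos] == '~') {
      target = &result.not_to_load;
      ++pos;
    }
    // The code runs up to the next '+' or the end of the string.
    std::size_t end = lang_str.find('+', pos);
    if (end == std::string::npos) {
      end = lang_str.size();
    }
    const std::string code = lang_str.substr(pos, end - pos);
    pos = end;
    // Assumption of this sketch: ignore empty tokens and duplicates.
    if (!code.empty() &&
        std::find(target->begin(), target->end(), code) == target->end()) {
      target->push_back(code);
    }
  }
  return result;
}
```

Calling ParseLanguageSketch("eng+spa+~fra") yields eng and spa in to_load and fra in not_to_load, which mirrors how the order of the string carries through to the load order.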
Loading and Exclusion Logic
After parsing, init_tesseract (in src/ccmain/tessedit.cpp) iterates over langs_to_load, loading each model via TessdataManager. Crucially, when loading sub-languages (defined in a language’s config file via tessedit_load_sublangs), the engine checks against langs_not_to_load. If a sub-language appears in the exclusion list (prefixed with ~), it is skipped.
Because the exclusion list is checked before any model is loaded, a language prefixed with ~ is never loaded, whether a model's config requested it or it was listed elsewhere in the string. This keeps the loaded set, and therefore the priority order, exactly as you specified it.
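A minimal sketch of that load-and-exclude loop follows. It is not the engine's code: SublangConfig is a hypothetical stand-in for each model's tessedit_load_sublangs setting, ResolveLoadOrder is an illustrative name, and actual model loading is reduced to recording the language code.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for tessedit_load_sublangs: maps a language code
// to the codes its config asks to load alongside it.
using SublangConfig = std::map<std::string, std::vector<std::string>>;

static bool Contains(const std::vector<std::string> &list,
                     const std::string &item) {
  return std::find(list.begin(), list.end(), item) != list.end();
}

// Returns the languages that end up loaded, in load order (primary first).
std::vector<std::string> ResolveLoadOrder(
    std::vector<std::string> to_load,
    const std::vector<std::string> &not_to_load,
    const SublangConfig &sublangs) {
  std::vector<std::string> loaded;
  // Index-based loop because to_load can grow while we iterate.
  for (std::size_t i = 0; i < to_load.size(); ++i) {
    const std::string lang = to_load[i];  // copy: push_back may reallocate
    if (Contains(not_to_load, lang)) {
      continue;  // excluded with '~', so it is never loaded
    }
    loaded.push_back(lang);
    // Queue any sub-languages this model requests, at the end of the list.
    auto it = sublangs.find(lang);
    if (it == sublangs.end()) {
      continue;
    }
    for (const std::string &sub : it->second) {
      if (!Contains(to_load, sub)) {
        to_load.push_back(sub);
      }
    }
  }
  return loaded;
}
```

As a hypothetical case, suppose the hin config requested eng. Resolving {hin} with an empty exclusion list would load hin then eng, while resolving {hin} with eng excluded would load hin alone.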
Practical Implementation Examples
Command Line Interface (CLI)
The simplest way to implement bilingual OCR with language priority overrides is via the -l flag:
# Spanish only: ~eng keeps English from loading, even if the spa config requests it
tesseract document.png output -l spa+~eng
# English primary, French secondary (normal sub-language loading)
tesseract document.png output -l eng+fra
C++ API Implementation
Using include/tesseract/baseapi.h, initialize with priority control:
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <string>
#include <vector>
int main() {
  tesseract::TessBaseAPI api;
  // English primary, Spanish secondary; ~fra keeps French from loading
  // even if a loaded model's config requests it
  if (api.Init("/usr/share/tessdata", "eng+spa+~fra") != 0) {
    fprintf(stderr, "Failed to initialize Tesseract\n");
    return 1;
  }
  // Verify loaded languages (primary first)
  std::vector<std::string> loaded_langs;
  api.GetLoadedLanguagesAsVector(&loaded_langs);
  printf("Priority order of loaded languages:\n");
  for (size_t i = 0; i < loaded_langs.size(); ++i) {
    printf(" %zu: %s\n", i + 1, loaded_langs[i].c_str());
  }
  Pix *image = pixRead("bilingual_document.png");
  if (image == nullptr) {
    fprintf(stderr, "Failed to read image\n");
    api.End();
    return 1;
  }
  api.SetImage(image);
  char *text = api.GetUTF8Text();
  if (text != nullptr) {
    printf("\nOCR Output:\n%s\n", text);
    delete[] text;  // GetUTF8Text allocates with new[]
  }
  pixDestroy(&image);
  api.End();
  return 0;
}
C API Implementation
For applications using include/tesseract/capi.h:
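#include <tesseract/capi.h>
#include <leptonica/allheaders.h>
#include <stdio.h>
int main() {
  TessBaseAPI *handle = TessBaseAPICreate();
  // French primary, German secondary; ~eng keeps English from loading
  // even if a loaded model's config requests it
  if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "fra+deu+~eng") != 0) {
    fprintf(stderr, "Initialization failed\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  // Read the image with Leptonica and hand it to Tesseract
  PIX *image = pixRead("mixed_language.png");
  if (image == NULL) {
    fprintf(stderr, "Failed to read image\n");
    TessBaseAPIEnd(handle);
    TessBaseAPIDelete(handle);
    return 1;
  }
  TessBaseAPISetImage2(handle, image);
  char *text = TessBaseAPIGetUTF8Text(handle);
  if (text != NULL) {
    printf("Recognized text:\n%s\n", text);
    TessDeleteText(text);
  }
  pixDestroy(&image);
  TessBaseAPIEnd(handle);
  TessBaseAPIDelete(handle);
  return 0;
}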
Verifying Loaded Languages
After initialization, confirm your bilingual OCR configuration using GetLoadedLanguagesAsVector:
std::vector<std::string> loaded;
api.GetLoadedLanguagesAsVector(&loaded);
// Verify primary language is first
if (!loaded.empty() && loaded[0] == "eng") {
  printf("English correctly set as primary\n");
}
This check confirms that the primary language loaded first. Extending it to scan the whole vector also confirms that no language excluded with ~ was loaded, whether it was requested directly or pulled in as a sub-language.
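The same check can be wrapped in a reusable helper. The sketch below works on a plain vector of language codes, such as the one filled by GetLoadedLanguagesAsVector; the function name LoadedSetIsValid and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true when `loaded` starts with `expected_primary` and contains
// none of the codes in `excluded`. Hypothetical helper for illustration.
bool LoadedSetIsValid(const std::vector<std::string> &loaded,
                      const std::string &expected_primary,
                      const std::vector<std::string> &excluded) {
  if (loaded.empty() || loaded.front() != expected_primary) {
    return false;  // nothing loaded, or the wrong language is primary
  }
  for (const std::string &lang : excluded) {
    if (std::find(loaded.begin(), loaded.end(), lang) != loaded.end()) {
      return false;  // an excluded language was loaded anyway
    }
  }
  return true;
}
```

For an initialization string such as eng+spa+~fra, the call would be LoadedSetIsValid(loaded, "eng", {"fra"}).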
Summary
- Language string syntax: Use lang1+lang2 where order establishes priority; prefix a language with ~ to keep it from loading, even when another model's config requests it as a sub-language.
- Implementation location: The parser resides in src/ccmain/tessedit.cpp (ParseLanguageString), invoked by Tesseract::init_tesseract and exposed through include/tesseract/baseapi.h.
- API availability: Supported in C++ (TessBaseAPI::Init), C (TessBaseAPIInit3), and CLI via the -l flag.
- Priority control: The first language in the string becomes primary, using its character set and dictionary first; subsequent languages serve as fallbacks.
- Verification: Use GetLoadedLanguagesAsVector to confirm the final loaded set matches your intended priority order.
Frequently Asked Questions
What happens if I don't use the ~ prefix when loading multiple languages?
Without the ~ prefix, Tesseract loads each language you list plus any sub-languages those models request through tessedit_load_sublangs in their configs. These automatically loaded sub-languages add recognition work and can introduce scripts or variants that compete with your intended languages, reducing accuracy for your target text. For strict bilingual OCR with language priority overrides, list the two languages you want and add ~lang for any sub-language you need to keep out.
How do I check which languages are actually loaded during OCR?
After calling Init (C++) or TessBaseAPIInit3 (C), call GetLoadedLanguagesAsVector (TessBaseAPIGetLoadedLanguagesAsVector in the C API) to retrieve the loaded language codes, primary first. This lets you verify that no language excluded with ~ was loaded and that your primary language appears first in the list. In the CLI, you can infer the configuration by watching for warning messages about missing traineddata files, though the API method provides definitive confirmation of your bilingual OCR setup.
Can I use more than two languages with priority overrides?
Yes, the language string syntax supports multiple languages: lang1+lang2+lang3+.... The same priority rules apply: the first language is primary, and subsequent languages are secondary. You can also add ~lang entries for any language that must stay out. For example, eng+spa+fra loads English as primary with Spanish and French as secondary, while eng+spa+~fra loads only English and Spanish and blocks French even if a loaded model's config requests it.
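When the language lists come from application settings, the string can be assembled programmatically. The helper below is a sketch with a hypothetical name (BuildLanguageString); it joins codes with + and writes each excluded code with a ~ prefix.

```cpp
#include <string>
#include <vector>

// Builds a language string such as "eng+spa+~fra" from an ordered list of
// languages to load (primary first) and a list of languages to exclude.
// Hypothetical helper for illustration.
std::string BuildLanguageString(const std::vector<std::string> &to_load,
                                const std::vector<std::string> &not_to_load) {
  std::string result;
  // Appends one token, inserting the '+' separator when needed.
  auto append = [&result](const std::string &token) {
    if (!result.empty()) {
      result += '+';
    }
    result += token;
  };
  for (const std::string &lang : to_load) {
    append(lang);
  }
  for (const std::string &lang : not_to_load) {
    append("~" + lang);
  }
  return result;
}
```

The result can be passed as the language argument of TessBaseAPI::Init or to the CLI -l flag.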
Does the order of languages in the string affect OCR accuracy?
Yes, the order directly impacts accuracy because Tesseract uses the primary language's character set, dictionary, and language model parameters as the default for ambiguous glyph recognition. If you place a language with a limited character set first, Tesseract might misrecognize characters that belong to the secondary language's script. For optimal bilingual OCR with language priority overrides, place the language with the most complex script or the one dominating the document content first in the string, and use ~ to exclude any sub-language you do not want loaded.