
How to Implement Bilingual OCR with Language Priority Overrides in Tesseract

March 2, 2026 tesseract-ocr/tesseract

Tesseract processes bilingual documents by loading multiple language models in a single initialization call. The order of language codes determines priority, and a tilde (~) prefix marks a language that must not be loaded, even when another language's configuration requests it as a sub-language.

Bilingual OCR with language priority overrides is essential for accurately recognizing documents containing mixed-language content, such as English-Spanish forms or French-German technical manuals. The tesseract-ocr/tesseract engine implements this capability through a specialized language string syntax parsed during the Init sequence, allowing developers to specify exactly which traineddata models load and in what order of precedence.

Understanding Tesseract’s Language String Syntax

Tesseract accepts a compact language string that defines both the languages to load and their priority hierarchy. The syntax follows this pattern:


[~]lang1[+ [~]lang2[+ …]]

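lang: Name of a traineddata model, typically a three-letter ISO 639-2 language code (e.g., eng, spa, deu, fra).
+: Separator indicating additional languages.
~: Prefix that blocks the named language from loading, overriding any request for it in another language's tessedit_load_sublangs setting. For example, if hin were configured to load eng by default, hin+~eng would load only hin.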

The first language in the string that actually loads becomes the primary language: its model is loaded into the main engine instance and its parameters apply by default. Subsequent languages are loaded as additional models that the recognizer can fall back to.
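To make the syntax concrete, here is a minimal sketch of a helper that assembles a language string from an ordered list of languages to load and a list of languages to block. BuildLanguageString is a hypothetical name used only for illustration; it is not part of the Tesseract API.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Tesseract): joins language codes in
// priority order, then appends ~-prefixed codes that must never be loaded.
std::string BuildLanguageString(const std::vector<std::string> &load_in_order,
                                const std::vector<std::string> &never_load) {
    std::string result;
    for (const std::string &code : load_in_order) {
        if (!result.empty()) result += '+';
        result += code;
    }
    for (const std::string &code : never_load) {
        if (!result.empty()) result += '+';
        result += '~';
        result += code;
    }
    return result;
}
```

Under these assumptions, BuildLanguageString({"eng", "spa"}, {}) yields eng+spa, and the resulting string can be passed to Init or to the -l flag.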

How Language Priority Overrides Work Internally

Parsing the Language String

The core parsing logic resides in src/ccmain/tessedit.cpp within the ParseLanguageString function. When TessBaseAPI::Init (declared in include/tesseract/baseapi.h) is invoked, it forwards the language string to Tesseract::init_tesseract, which calls ParseLanguageString.

This parser iterates through the string token by token:

Skips any leading + characters.
Checks for the ~ prefix.
Reads the language code up to the next + or the end of the string.
Adds the code (without the ~) to langs_not_to_load if the prefix was present; otherwise adds it to langs_to_load.
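The steps above can be sketched as a small standalone function. This is a simplified illustration, not the upstream code: ParsedLanguages and ParseLanguageStringSketch are hypothetical names, and skipping empty tokens and duplicates is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for the parser's two output lists.
struct ParsedLanguages {
    std::vector<std::string> to_load;      // codes in priority order
    std::vector<std::string> not_to_load;  // codes that carried a ~ prefix
};

// Simplified sketch of the tokenizing steps; not the upstream implementation.
ParsedLanguages ParseLanguageStringSketch(const std::string &lang_str) {
    ParsedLanguages out;
    std::size_t pos = 0;
    while (pos < lang_str.size()) {
        // Skip any leading '+' separators.
        while (pos < lang_str.size() && lang_str[pos] == '+') ++pos;
        if (pos >= lang_str.size()) break;
        // A '~' prefix routes the code to the exclusion list.
        std::vector<std::string> *target = &out.to_load;
        if (lang_str[pos] == '~') {
            target = &out.not_to_load;
            ++pos;
        }
        // The code runs up to the next '+' or the end of the string.
        std::size_t end = lang_str.find('+', pos);
        if (end == std::string::npos) end = lang_str.size();
        const std::string code = lang_str.substr(pos, end - pos);
        pos = end;
        // Skip empty tokens and codes already present in the target list.
        if (!code.empty() &&
            std::find(target->begin(), target->end(), code) == target->end()) {
            target->push_back(code);
        }
    }
    return out;
}
```

Keeping two separate lists means the exclusion decision can be applied later, when sub-language requests are discovered during loading.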

Loading and Exclusion Logic

After parsing, init_tesseract (in src/ccmain/tessedit.cpp) iterates over langs_to_load, loading each model via TessdataManager. Crucially, when loading sub-languages (defined in a language’s config file via tessedit_load_sublangs), the engine checks against langs_not_to_load. If a sub-language appears in the exclusion list (prefixed with ~), it is skipped.

This mechanism ensures that a language named with the ~ prefix is never loaded, so an automatically requested sub-language cannot join the loaded set and compete with the languages you listed explicitly.
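The interaction between sub-language requests and the exclusion list can be modeled in a few lines. The sketch below is an assumption-laden simplification: ResolveLoadOrder is a hypothetical function, and requested_sublangs stands in for each model's tessedit_load_sublangs value, already split into codes. No traineddata files are read.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using LangList = std::vector<std::string>;

static bool Contains(const LangList &list, const std::string &code) {
    return std::find(list.begin(), list.end(), code) != list.end();
}

// Sketch: computes the final load order. requested_sublangs is a hypothetical
// stand-in for the sub-languages each model asks for in its configuration.
LangList ResolveLoadOrder(LangList to_load, const LangList &not_to_load,
                          const std::map<std::string, LangList> &requested_sublangs) {
    LangList loaded;
    // Index-based loop because to_load can grow while we iterate over it.
    for (std::size_t i = 0; i < to_load.size(); ++i) {
        const std::string code = to_load[i];  // copy: push_back may reallocate
        if (Contains(not_to_load, code)) continue;  // blocked with ~
        loaded.push_back(code);
        auto it = requested_sublangs.find(code);
        if (it == requested_sublangs.end()) continue;
        // Queue any sub-languages this model requests; they are still
        // subject to the exclusion check when their turn comes.
        for (const std::string &sub : it->second) {
            if (!Contains(to_load, sub)) to_load.push_back(sub);
        }
    }
    return loaded;
}
```

With a hypothetical hin model that requests eng, resolving {"hin"} with an empty exclusion list gives hin followed by eng, while adding eng to the exclusion list gives hin alone.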

Practical Implementation Examples

Command Line Interface (CLI)

The simplest way to implement bilingual OCR with language priority overrides is via the -l flag:


# Spanish only: ~eng blocks English from loading, even if the spa model requests it as a sub-language
tesseract document.png output -l spa+~eng

# English primary, French secondary (normal sub-language loading)
tesseract document.png output -l eng+fra

C++ API Implementation

Using include/tesseract/baseapi.h, initialize with priority control:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    tesseract::TessBaseAPI api;

    // English primary, Spanish secondary. A ~ prefix (as in "eng+~spa")
    // would instead block spa from loading at all.
    if (api.Init("/usr/share/tessdata", "eng+spa") != 0) {
        fprintf(stderr, "Failed to initialize Tesseract\n");
        return 1;
    }

    // Verify loaded languages (primary first)
    std::vector<std::string> loaded_langs;
    api.GetLoadedLanguagesAsVector(&loaded_langs);

    printf("Priority order of loaded languages:\n");
    for (size_t i = 0; i < loaded_langs.size(); ++i) {
        printf("  %zu: %s\n", i + 1, loaded_langs[i].c_str());
    }

    Pix *image = pixRead("bilingual_document.png");
    if (image == nullptr) {
        fprintf(stderr, "Could not read image\n");
        api.End();
        return 1;
    }
    api.SetImage(image);
    char *text = api.GetUTF8Text();
    if (text != nullptr) {
        printf("\nOCR Output:\n%s\n", text);
    }

    delete[] text;  // GetUTF8Text returns a buffer allocated with new[]
    pixDestroy(&image);
    api.End();
    return 0;
}

C API Implementation

For applications using include/tesseract/capi.h:

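#include <tesseract/capi.h>
#include <leptonica/allheaders.h>
#include <stdio.h>

int main(void) {
    TessBaseAPI *handle = TessBaseAPICreate();

    // French primary, German secondary. A ~ prefix (as in "fra+~deu")
    // would instead block deu from loading at all.
    if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "fra+deu") != 0) {
        fprintf(stderr, "Initialization failed\n");
        TessBaseAPIDelete(handle);
        return 1;
    }

    // Read the image with Leptonica and hand the Pix to Tesseract
    struct Pix *image = pixRead("mixed_language.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read image\n");
        TessBaseAPIEnd(handle);
        TessBaseAPIDelete(handle);
        return 1;
    }
    TessBaseAPISetImage2(handle, image);

    char *text = TessBaseAPIGetUTF8Text(handle);
    if (text != NULL) {
        printf("Recognized text:\n%s\n", text);
        TessDeleteText(text);
    }

    pixDestroy(&image);
    TessBaseAPIEnd(handle);
    TessBaseAPIDelete(handle);
    return 0;
}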

Verifying Loaded Languages

After initialization, confirm your bilingual OCR configuration using GetLoadedLanguagesAsVector:

std::vector<std::string> loaded;
api.GetLoadedLanguagesAsVector(&loaded);

// Verify primary language is first
if (!loaded.empty() && loaded[0] == "eng") {
    printf("English correctly set as primary\n");
}

This verification step confirms that your primary language loaded first and lets you check that any language you prefixed with ~ is absent from the loaded set.
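The check above can be generalized into a small predicate that also asserts the absence of blocked languages. LoadedSetMatches is a hypothetical helper written for this article, not a Tesseract function; it operates on the vector filled by GetLoadedLanguagesAsVector.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: true when `loaded` starts with `expected_primary`
// and contains none of the codes listed in `must_be_absent`.
bool LoadedSetMatches(const std::vector<std::string> &loaded,
                      const std::string &expected_primary,
                      const std::vector<std::string> &must_be_absent) {
    if (loaded.empty() || loaded.front() != expected_primary) return false;
    for (const std::string &code : must_be_absent) {
        if (std::find(loaded.begin(), loaded.end(), code) != loaded.end()) {
            return false;
        }
    }
    return true;
}
```

For instance, after initializing with hin+~eng you might call LoadedSetMatches(loaded, "hin", {"eng"}) and treat a false result as a configuration error.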

Summary

Language string syntax: Use lang1+lang2 where order establishes priority; prefix a language with ~ to block it from loading, even if another language requests it as a sub-language.
Implementation location: The parser resides in src/ccmain/tessedit.cpp (ParseLanguageString), invoked by Tesseract::init_tesseract and exposed through include/tesseract/baseapi.h.
API availability: Supported in C++ (TessBaseAPI::Init), C (TessBaseAPIInit3), and CLI via the -l flag.
Priority control: The first language in the string becomes primary and supplies the default parameters; subsequent languages load as additional models that serve as fallbacks.
Verification: Use GetLoadedLanguagesAsVector to confirm the final loaded set matches your intended priority order.

Frequently Asked Questions

What does the ~ prefix do?

The ~ prefix tells Tesseract not to load the named language. Without it, Tesseract loads every language you list plus any sub-languages that those languages request through tessedit_load_sublangs in their configuration. Automatically loaded sub-languages can compete with the languages you intended, which may reduce accuracy for your target text. To keep the loaded set strict, add ~lang for each language you want blocked; for example, if hin were configured to load eng, hin+~eng would load only hin.

How do I check which languages are actually loaded during OCR?

After calling Init (C++), invoke GetLoadedLanguagesAsVector to retrieve a vector of language codes with the primary language first; the C API offers TessBaseAPIGetLoadedLanguagesAsVector for use after TessBaseAPIInit3. This lets you verify that your primary language appears first and that no language you prefixed with ~ is present. In the CLI, you can infer the configuration by watching for warning messages about missing traineddata files, though the API method provides definitive confirmation of your bilingual OCR setup.

Can I use more than two languages with priority overrides?

Yes, the language string syntax supports multiple languages: lang1+lang2+lang3+.... The same priority rules apply: the first language is primary, and subsequent languages are secondary. You can prefix any language with ~ to block it from loading. For example, eng+spa+fra loads English as primary with Spanish and French as secondary, while eng+spa+~fra loads English and Spanish and guarantees that French is not loaded even if another model requests it.

Does the order of languages in the string affect OCR accuracy?

Yes. The primary language's parameters serve as the defaults, and the documentation for Init in baseapi.h cautions that each additional language affects both speed and accuracy, because the engine has more work to do in deciding which language applies and more opportunity to produce incorrect words. As a starting point, place the language that dominates the document content (or the one with the more complex script) first, list only the languages that actually appear in the document, and use ~ to block sub-languages you do not want loaded.
