# How to Implement Bilingual OCR with Language Priority Overrides in Tesseract

> Implement bilingual OCR in Tesseract with language priority overrides. Control language loading precisely for faster, more accurate text recognition.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract processes bilingual documents by loading multiple language models in a single initialization call. The order of the language codes determines priority, and a tilde (`~`) prefix marks a language as excluded, which stops it from being loaded automatically as a sub-language of another model.**

Bilingual OCR with language priority overrides is essential for accurately recognizing documents containing mixed-language content, such as English-Spanish forms or French-German technical manuals. The tesseract-ocr/tesseract engine implements this capability through a specialized language string syntax parsed during the `Init` sequence, allowing developers to specify exactly which traineddata models load and in what order of precedence.

## Understanding Tesseract’s Language String Syntax

Tesseract accepts a compact language string that defines both the languages to load and their priority hierarchy. The syntax follows this pattern:

```
[~]lang1[+[~]lang2[+…]]
```

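- **lang**: Three-letter language code matching an installed `<lang>.traineddata` file (e.g., `eng`, `spa`, `deu`, `fra`).
- **+**: Separator between language codes. The string contains no spaces.
- **~**: Prefix that marks the language as excluded. It is not loaded, even if another model's configuration requests it through `tessedit_load_sublangs`.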

The first language in the string becomes the **primary** language, whose character set, dictionary, and language-specific parameters take precedence during glyph recognition. Subsequent languages serve as secondary fallbacks. A code prefixed with `~` is excluded rather than loaded, so it takes no part in this ordering.
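When the language list is assembled at runtime, a small helper can keep the string well-formed. The following is a minimal sketch: `BuildLangString` is a hypothetical name, not a Tesseract function, and it assumes the caller passes codes in the intended priority order.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Tesseract): joins language codes into the
// "[~]lang1+[~]lang2" form. Codes in `load` keep their order, so load[0] is
// the intended primary language; codes in `exclude` get a '~' prefix.
std::string BuildLangString(const std::vector<std::string> &load,
                            const std::vector<std::string> &exclude) {
    std::string result;
    // Append one token, inserting '+' between tokens but not before the first.
    auto append = [&result](const std::string &token) {
        if (!result.empty()) {
            result += '+';
        }
        result += token;
    };
    for (const auto &code : load) {
        append(code);
    }
    for (const auto &code : exclude) {
        append("~" + code);
    }
    return result;
}
```

The returned string can be passed as the language argument of `TessBaseAPI::Init`. Placing the exclusions last is a choice made for readability in this sketch.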

## How Language Priority Overrides Work Internally

### Parsing the Language String

The core parsing logic resides in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp) within the `ParseLanguageString` function. When `TessBaseAPI::Init` (declared in [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h)) is invoked, it forwards the language string to `Tesseract::init_tesseract`, which calls `ParseLanguageString`.

This parser iterates through the string token by token:

1. Skips any leading `+` characters.
2. Checks for a `~` prefix and selects the target vector: `langs_not_to_load` if the prefix is present, `langs_to_load` otherwise.
3. Reads the language code up to the next `+` or the end of the string.
4. Appends the code (without the `~`) to the selected vector for subsequent model loading.
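The steps above can be sketched in standalone C++. This is an illustration of the described behavior, not the Tesseract source: `ParsedLangs`, `AddUnique`, and `ParseLangString` are hypothetical names, and skipping duplicate codes is an assumption of this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical result type: the two lists the parser fills.
struct ParsedLangs {
    std::vector<std::string> to_load;      // codes without a '~' prefix
    std::vector<std::string> not_to_load;  // codes that had a '~' prefix
};

// Append `code` to `list` unless it is already present (assumed behavior).
static void AddUnique(const std::string &code, std::vector<std::string> &list) {
    if (std::find(list.begin(), list.end(), code) == list.end()) {
        list.push_back(code);
    }
}

ParsedLangs ParseLangString(const std::string &lang_str) {
    ParsedLangs out;
    size_t pos = 0;
    while (pos < lang_str.size()) {
        // Step 1: skip leading '+' separators.
        while (pos < lang_str.size() && lang_str[pos] == '+') {
            ++pos;
        }
        if (pos >= lang_str.size()) {
            break;
        }
        // Step 2: a '~' prefix selects the exclusion list.
        bool excluded = false;
        if (lang_str[pos] == '~') {
            excluded = true;
            ++pos;
        }
        // Step 3: the code runs to the next '+' or the end of the string.
        size_t end = lang_str.find('+', pos);
        if (end == std::string::npos) {
            end = lang_str.size();
        }
        const std::string code = lang_str.substr(pos, end - pos);
        // Step 4: store the code, without the '~', in the selected list.
        if (!code.empty()) {
            AddUnique(code, excluded ? out.not_to_load : out.to_load);
        }
        pos = end;
    }
    return out;
}
```

Calling `ParseLangString("spa+~eng+fra")` yields `spa` and `fra` in `to_load` and `eng` in `not_to_load`.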

### Loading and Exclusion Logic

After parsing, `init_tesseract` (in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp)) iterates over `langs_to_load`, loading each model via `TessdataManager`. Crucially, when loading sub-languages (defined in a language’s config file via `tessedit_load_sublangs`), the engine checks against `langs_not_to_load`. If a sub-language appears in the exclusion list (prefixed with `~`), it is skipped.

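Because the exclusion list is checked for every candidate, a `~`-prefixed code overrides both the explicit language string and any `tessedit_load_sublangs` request. This is what **keeps an automatically requested model out of the recognition pass**, where it would otherwise compete with the languages you listed.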
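The loading pass can be modeled without Tesseract itself. In the sketch below, `ResolveLoadOrder` and the `sublangs` map are hypothetical: the map stands in for each model's `tessedit_load_sublangs` setting, and model-loading failures are ignored for brevity.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using LangList = std::vector<std::string>;

static bool Contains(const LangList &list, const std::string &code) {
    return std::find(list.begin(), list.end(), code) != list.end();
}

// Hypothetical model of the load loop. `sublangs` maps a language code to the
// codes its config would request. Returns the codes in the order they would
// be loaded; the first entry plays the role of the primary language.
LangList ResolveLoadOrder(LangList to_load, const LangList &not_to_load,
                          const std::map<std::string, LangList> &sublangs) {
    LangList loaded;
    // Index-based loop: to_load can grow while it is being walked.
    for (size_t i = 0; i < to_load.size(); ++i) {
        const std::string lang = to_load[i];  // copy, push_back may reallocate
        if (Contains(not_to_load, lang)) {
            continue;  // excluded with '~', never loaded
        }
        loaded.push_back(lang);
        // Queue any sub-languages this model requests, after existing entries.
        auto it = sublangs.find(lang);
        if (it == sublangs.end()) {
            continue;
        }
        for (const std::string &sub : it->second) {
            if (!Contains(to_load, sub)) {
                to_load.push_back(sub);
            }
        }
    }
    return loaded;
}
```

With a map in which `hin` requests `eng`, resolving `hin` alone yields `hin` followed by `eng`, while adding `eng` to the exclusion list yields only `hin`.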

## Practical Implementation Examples

### Command Line Interface (CLI)

The simplest way to implement bilingual OCR with language priority overrides is via the `-l` flag:

```bash
# Spanish only: ~eng keeps English from loading, even if the Spanish
# model's config requests it as a sub-language
tesseract document.png output -l spa+~eng

# English primary, French secondary (normal sub-language loading)
tesseract document.png output -l eng+fra
```

### C++ API Implementation

Using [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h), initialize with priority control:

```cpp
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    tesseract::TessBaseAPI api;

    // English primary, Spanish secondary. Append "+~<code>" to keep a
    // sub-language from loading if either model's config requests it.
    if (api.Init("/usr/share/tessdata", "eng+spa") != 0) {
        fprintf(stderr, "Failed to initialize Tesseract\n");
        return 1;
    }

    // List the loaded languages: primary first, then secondary languages
    std::vector<std::string> loaded_langs;
    api.GetLoadedLanguagesAsVector(&loaded_langs);

    printf("Priority order of loaded languages:\n");
    for (size_t i = 0; i < loaded_langs.size(); ++i) {
        printf("  %zu: %s\n", i + 1, loaded_langs[i].c_str());
    }

    // pixRead returns nullptr if the file is missing or unreadable
    Pix *image = pixRead("bilingual_document.png");
    if (image == nullptr) {
        fprintf(stderr, "Failed to read image\n");
        api.End();
        return 1;
    }
    api.SetImage(image);
    char *text = api.GetUTF8Text();
    if (text != nullptr) {
        printf("\nOCR Output:\n%s\n", text);
    }

    delete[] text;
    pixDestroy(&image);
    api.End();
    return 0;
}
```

### C API Implementation

For applications using [`include/tesseract/capi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h):

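```c
#include <tesseract/capi.h>
#include <leptonica/allheaders.h>
#include <stdio.h>

int main(void) {
    TessBaseAPI *handle = TessBaseAPICreate();

    // French primary, German secondary
    if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "fra+deu") != 0) {
        fprintf(stderr, "Initialization failed\n");
        TessBaseAPIDelete(handle);
        return 1;
    }

    // Load the image with Leptonica and hand it to Tesseract
    PIX *image = pixRead("mixed_language.png");
    if (image == NULL) {
        fprintf(stderr, "Failed to read image\n");
        TessBaseAPIEnd(handle);
        TessBaseAPIDelete(handle);
        return 1;
    }
    TessBaseAPISetImage2(handle, image);

    char *text = TessBaseAPIGetUTF8Text(handle);
    if (text != NULL) {
        printf("Recognized text:\n%s\n", text);
        TessDeleteText(text);
    }

    pixDestroy(&image);
    TessBaseAPIEnd(handle);
    TessBaseAPIDelete(handle);
    return 0;
}
```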

### Verifying Loaded Languages

After initialization, confirm your bilingual OCR configuration using `GetLoadedLanguagesAsVector`:

```cpp
std::vector<std::string> loaded;
api.GetLoadedLanguagesAsVector(&loaded);

// Verify primary language is first
if (!loaded.empty() && loaded[0] == "eng") {
    printf("English correctly set as primary\n");
}
```

This check confirms that the primary language comes first. Extending it to scan the whole list also confirms that every language you excluded with `~` is absent, which shows that your overrides are active.
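To spot models that were pulled in automatically, the loaded list can be compared against the codes you asked for. The helper below is a sketch with a hypothetical name, `UnexpectedLanguages`. It works on plain string vectors, so it assumes you pass it the output of `GetLoadedLanguagesAsVector` yourself.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns every code in `loaded` that does not appear
// in `requested`, preserving the order in which the codes were loaded.
std::vector<std::string> UnexpectedLanguages(
    const std::vector<std::string> &loaded,
    const std::vector<std::string> &requested) {
    std::vector<std::string> extra;
    for (const auto &code : loaded) {
        if (std::find(requested.begin(), requested.end(), code) ==
            requested.end()) {
            extra.push_back(code);
        }
    }
    return extra;
}
```

An empty result means nothing beyond the requested languages was loaded. Any code it returns is a candidate for a `~` exclusion in the language string.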

## Summary

- **Language string syntax**: Use `lang1+lang2`, where order establishes priority; prefix a language with `~` to exclude it, which also blocks it from loading as another model's sub-language.
- **Implementation location**: The parser resides in [`src/ccmain/tessedit.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tessedit.cpp) (`ParseLanguageString`), invoked by `Tesseract::init_tesseract` and exposed through [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h).
- **API availability**: Supported in C++ (`TessBaseAPI::Init`), C (`TessBaseAPIInit3`), and CLI via the `-l` flag.
- **Priority control**: The first language in the string becomes primary, using its character set and dictionary first; subsequent languages serve as fallbacks.
- **Verification**: Use `GetLoadedLanguagesAsVector` to confirm the final loaded set matches your intended priority order.

## Frequently Asked Questions

### What happens if I don't use the ~ prefix when loading multiple languages?

Without a `~` exclusion, Tesseract loads every language you list plus any sub-languages that those models request through `tessedit_load_sublangs` in their configuration. An automatically loaded model takes part in recognition alongside the ones you chose, which costs time and can introduce wrong words from a language that is not in the document. To keep such a model out, add it to the language string with a `~` prefix. For example, if the `hin` model were configured to load `eng`, then `hin+~eng` would load only `hin`.

### How do I check which languages are actually loaded during OCR?

After calling `Init` (C++) or `TessBaseAPIInit3` (C), call `GetLoadedLanguagesAsVector` (`TessBaseAPIGetLoadedLanguagesAsVector` in the C API) to retrieve the loaded language codes, primary language first. This lets you verify that no language excluded with `~` was loaded and that your primary language appears first in the list. In the CLI, you can infer the configuration by watching for warning messages about missing traineddata files, though the API call provides definitive confirmation of your bilingual OCR setup.

### Can I use more than two languages with priority overrides?

Yes, the language string syntax supports multiple languages: `lang1+lang2+lang3+...`. The same priority rules apply: the first language is primary, and subsequent languages are secondary. You can prefix any language with `~` to exclude it. For example, `eng+spa+~fra` loads English as primary and Spanish as secondary, and keeps French from loading even if one of the other models requests it as a sub-language.

### Does the order of languages in the string affect OCR accuracy?

Yes, the order directly impacts accuracy because Tesseract uses the primary language's character set, dictionary, and language model parameters as the default for ambiguous glyph recognition. If you place a language with a limited character set first, Tesseract might misrecognize characters that belong to the secondary language's script. For optimal bilingual OCR with language priority overrides, place the language with the most complex script or the one dominating the document content first in the string, and use `~` to exclude any sub-language you do not want loaded.