# How to Configure Character Whitelists and Blacklists for Focused OCR in Tesseract

> Learn to configure character whitelists and blacklists in Tesseract OCR with SetVariable. Focus your OCR results by specifying allowed or disallowed characters for improved accuracy and efficiency.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Use `tessedit_char_whitelist` to restrict recognition to specific characters or `tessedit_char_blacklist` to exclude unwanted symbols, setting these parameters after initializing the Tesseract engine via `SetVariable()`.**

Tesseract lets you constrain optical character recognition to a specific character set. Configuring character whitelists and blacklists reduces recognition errors between similar-looking glyphs (like `O` versus `0`) and keeps the engine focused on the symbols relevant to your use case.

## Understanding Tesseract Whitelist and Blacklist Parameters

Tesseract exposes two string parameters that control character admissibility:

- **`tessedit_char_whitelist`** — Defines the exclusive set of characters the OCR engine is permitted to recognize.
- **`tessedit_char_blacklist`** — Defines characters that must be excluded from recognition results.

Both parameters are classified as **non-init variables** in the Tesseract source code. This distinction is critical: these values cannot be set before the engine initializes. You must call `Init()` first, then apply the constraints using `SetVariable()`. Attempting to configure them prior to initialization will result in the settings being ignored.

## How Whitelists and Blacklists Work Internally

The implementation spans several core files in the tesseract-ocr/tesseract codebase. Understanding this flow helps debug configuration issues and optimize performance.

### Parameter Declaration

In [`src/ccmain/tesseractclass.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.h), the whitelist and blacklist are declared using the `STRING_VAR_H` macro (lines 73-75). This registers them as runtime-configurable string parameters within the `Tesseract` class's `Params` object.

### Runtime Application

When you call `TessBaseAPI::SetVariable()` (declared in [`include/tesseract/baseapi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) and implemented in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) at line 207), the method forwards the name-value pair to `ParamUtils::SetParam()`. This updates the internal `Params` storage.

Before each recognition pass, [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) (line 771) invokes `Tesseract::SetBlackAndWhitelist()`. This method (implemented in [`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp) around line 535) retrieves the current values of `tessedit_char_blacklist` and `tessedit_char_whitelist` from the `Params` object. It then calls `UNICHARSET::set_black_and_whitelist()` on the main unicharset and every loaded sub-language's unicharset.

The `UNICHARSET` class uses these lists to filter admissible Unicode symbols during the classification phase. Any glyph whose code point is excluded by the whitelist or explicitly listed in the blacklist is filtered out, ensuring the recognizer only produces text consisting of allowed characters.
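The admissibility rule described above can be sketched as a small standalone function. This is a simplified model, not Tesseract's code: the real logic lives in `UNICHARSET::set_black_and_whitelist()` and operates on unichar IDs rather than bytes. The name `is_admissible` is hypothetical, and the sketch assumes single-byte (ASCII) characters.

```cpp
#include <string>

// Simplified model of the whitelist/blacklist rule (hypothetical helper,
// not part of Tesseract; assumes single-byte characters).
bool is_admissible(char ch, const std::string& whitelist,
                   const std::string& blacklist) {
    // An empty whitelist means "no restriction": every character starts enabled.
    bool enabled = whitelist.empty() ||
                   whitelist.find(ch) != std::string::npos;

    // The blacklist is applied afterwards, so it overrides the whitelist.
    if (blacklist.find(ch) != std::string::npos) {
        enabled = false;
    }
    return enabled;
}
```

Under this model, a character listed in both strings ends up disabled, and leaving both strings empty disables nothing.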

## Practical Implementation Methods

You can configure these constraints through multiple interfaces depending on your integration requirements.

### C++ API Approach

The C++ API provides the most direct control. Initialize the engine, then apply your constraints before processing images.

```cpp
#include <cstdio>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
    tesseract::TessBaseAPI api;

    // Initialize with English language data
    if (api.Init("/usr/share/tessdata", "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        return 1;
    }

    // Restrict recognition to digits only (must come after Init())
    api.SetVariable("tessedit_char_whitelist", "0123456789");

    // Alternatively, blacklist specific problematic characters
    // api.SetVariable("tessedit_char_blacklist", "OI");

    Pix *image = pixRead("input.png");
    if (image == nullptr) {
        fprintf(stderr, "Could not read input.png.\n");
        api.End();
        return 1;
    }
    api.SetImage(image);

    char *text = api.GetUTF8Text();
    printf("Recognized text: %s\n", text);

    // GetUTF8Text() returns a buffer the caller must free with delete[]
    delete [] text;
    pixDestroy(&image);
    api.End();

    return 0;
}
```
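Because the filters are re-read before each recognition pass, they can be changed between images without calling `Init()` again. The following is a minimal sketch of that pattern. The helper name `apply_char_filters` is hypothetical, and it is written as a template so the snippet compiles without the Tesseract headers; it only assumes the object exposes `bool SetVariable(const char*, const char*)`, which is how `tesseract::TessBaseAPI` declares it.

```cpp
#include <string>

// Hypothetical helper (not part of Tesseract). Api is any type exposing
//   bool SetVariable(const char* name, const char* value)
// such as an already initialized tesseract::TessBaseAPI.
template <typename Api>
bool apply_char_filters(Api& api, const std::string& whitelist,
                        const std::string& blacklist) {
    // SetVariable() returns false when the parameter name is not found.
    bool ok = api.SetVariable("tessedit_char_whitelist", whitelist.c_str());

    // Always attempt the second call, then combine both results.
    ok = api.SetVariable("tessedit_char_blacklist", blacklist.c_str()) && ok;
    return ok;
}
```

With an initialized `api`, calling `apply_char_filters(api, "0123456789", "")` before one image and `apply_char_filters(api, "", ".,;")` before the next would switch from a digits-only whitelist to a punctuation blacklist. Passing an empty string leaves the corresponding filter unset.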

### C API Integration

For applications using the C wrapper, the workflow is identical: create the handle, initialize, set variables, then process.

```c
#include <stdio.h>

#include <leptonica/allheaders.h>
#include <tesseract/capi.h>

int main() {
    TessBaseAPI *handle = TessBaseAPICreate();

    if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "eng") != 0) {
        fprintf(stderr, "Initialization failed\n");
        TessBaseAPIDelete(handle);
        return 1;
    }

    /* Restrict to uppercase letters A-Z (must come after initialization) */
    TessBaseAPISetVariable(handle, "tessedit_char_whitelist",
                           "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

    Pix *image = pixRead("input.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read input.png\n");
        TessBaseAPIDelete(handle);
        return 1;
    }
    TessBaseAPISetImage2(handle, image);

    char *text = TessBaseAPIGetUTF8Text(handle);
    printf("%s\n", text);

    /* Text returned by the C API is released with TessDeleteText() */
    TessDeleteText(text);
    pixDestroy(&image);
    TessBaseAPIDelete(handle);

    return 0;
}
```

### Command-Line Usage

For quick testing or batch processing, use the `-c` flag to set configuration variables directly.

```bash
# Whitelist only digits and the dash character
tesseract input.png out -c tessedit_char_whitelist=0123456789-

# Blacklist specific punctuation to prevent confusion
# (quoted so the shell does not interpret ; ! and ?)
tesseract input.png out -c 'tessedit_char_blacklist=.,;:!?'
```

### Configuration File Method

For reusable configurations, create a text file with your constraints. The Tesseract repository includes an example in `tessdata/configs/digits`.

Create a file named `mydigits.cfg`:

```
tessedit_char_whitelist 0123456789-
```

Run Tesseract with the config file:

```bash
tesseract input.png out mydigits.cfg
```

This approach is ideal for deployment scenarios where you want to version-control your OCR constraints separately from application code.

## Summary

- **Use `tessedit_char_whitelist`** to define an exclusive set of allowed characters, or **`tessedit_char_blacklist`** to exclude specific symbols.
- **Set these parameters after initialization** via `SetVariable()` (C++) or `TessBaseAPISetVariable()` (C); they are non-init variables that take effect only post-`Init()`.
- **Internal implementation** routes these values through [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) to `Tesseract::SetBlackAndWhitelist()` in [`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp), which updates the `UNICHARSET` constraints before recognition.
- **Apply constraints dynamically** without reinitializing the engine, enabling runtime switching between different character sets for batch processing diverse document types.

## Frequently Asked Questions

### Can I set whitelists before calling Init()?

No. The `tessedit_char_whitelist` and `tessedit_char_blacklist` parameters are classified as **non-init variables** in the Tesseract source code. You must initialize the engine with `Init()` first, then call `SetVariable()` to apply these constraints. Setting them before initialization will result in the values being ignored during recognition.

### Do whitelists work with all Tesseract language models?

Yes. The whitelist and blacklist constraints are applied at the `UNICHARSET` level in [`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp). Since every language model loads its own unicharset, the filtering logic affects all loaded languages and sub-languages equally. However, ensure your whitelist characters exist in the target language's character set; otherwise, the engine may return empty results.
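When results come back empty or surprising, one way to narrow down the cause is to compare the recognized text against the whitelist you configured. The sketch below assumes a hypothetical helper, `unexpected_chars`, that is not part of Tesseract and works on single-byte (ASCII) text only. It ignores whitespace, since recognized text normally contains spaces and newlines between words and lines.

```cpp
#include <string>

// Hypothetical post-check (not a Tesseract API): returns every character in
// `text` that is neither whitespace nor present in `whitelist`, listed once
// each in order of first appearance. Assumes single-byte characters.
std::string unexpected_chars(const std::string& text,
                             const std::string& whitelist) {
    std::string found;
    for (char ch : text) {
        const bool is_space = ch == ' ' || ch == '\n' || ch == '\t' ||
                              ch == '\r' || ch == '\f';
        if (is_space) {
            continue;  // layout characters are not subject to the check
        }
        if (whitelist.find(ch) == std::string::npos &&
            found.find(ch) == std::string::npos) {
            found.push_back(ch);
        }
    }
    return found;
}
```

An empty return value means the output stayed within the configured set. A non-empty one suggests the whitelist was not in effect for that run, for example because it was set before initialization.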

### How do I whitelist special characters or spaces?

Include the literal characters directly in the whitelist string. For spaces, include the space character within the quotes. The whitelist is a plain list of characters, not a regular expression, so the only escaping required is what your programming language's string literals demand for characters such as backslashes or double quotes. For example, in C++: `api.SetVariable("tessedit_char_whitelist", "ABC 123-");` allows the letters `A`, `B`, and `C`, the space, the digits `1`, `2`, and `3`, and the hyphen.
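Since the parameter takes a literal list of characters, longer sets such as "all digits plus all uppercase letters" have to be spelled out in full. A small helper can generate that string. The function `build_whitelist` below is a hypothetical convenience, not a Tesseract API, and it assumes ASCII ranges.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Tesseract): expands inclusive ASCII
// ranges into the flat character list that tessedit_char_whitelist expects,
// then appends any extra literal characters.
std::string build_whitelist(const std::vector<std::pair<char, char>>& ranges,
                            const std::string& extras = "") {
    std::string out;
    for (const auto& range : ranges) {
        // Iterate with int so the loop cannot wrap around at the top of char.
        for (int c = range.first; c <= range.second; ++c) {
            out.push_back(static_cast<char>(c));
        }
    }
    out += extras;
    return out;
}
```

For instance, `build_whitelist({{'0', '9'}}, "-")` produces the same string as the digits-and-dash whitelist used in the command-line example, and the result's `c_str()` can be passed to `SetVariable()`.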

### Can I combine whitelists and blacklists simultaneously?

Yes. Tesseract processes both parameters through `Tesseract::SetBlackAndWhitelist()`, which calls `UNICHARSET::set_black_and_whitelist()`. When both are specified, the engine applies both constraints: the whitelist restricts recognition to the allowed set, while the blacklist removes specific characters from that set (or from the full set if no whitelist is defined). This allows fine-grained control, such as whitelisting all alphanumeric characters while blacklisting one member of an easily confused pair, for example `O` when the input should only contain `0`.