How to Configure Character Whitelists and Blacklists for Focused OCR in Tesseract
Use tessedit_char_whitelist to restrict recognition to specific characters or tessedit_char_blacklist to exclude unwanted symbols, setting these parameters after initializing the Tesseract engine via SetVariable().
The tesseract-ocr/tesseract repository provides mechanisms to constrain optical character recognition to specific character sets. By configuring character whitelists and blacklists, you can reduce recognition errors caused by similar-looking glyphs (such as O versus 0) and make the engine focus only on the symbols relevant to your use case.
Understanding Tesseract Whitelist and Blacklist Parameters
Tesseract exposes two string parameters that control character admissibility:
- tessedit_char_whitelist: Defines the exclusive set of characters the OCR engine is permitted to recognize.
- tessedit_char_blacklist: Defines characters that must be excluded from recognition results.
Both parameters are classified as non-init variables in the Tesseract source code. This distinction is critical: these values cannot be set before the engine initializes. You must call Init() first, then apply the constraints using SetVariable(). Attempting to configure them prior to initialization will result in the settings being ignored.
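The ordering can be sketched as a small helper. This is an illustration rather than code from the Tesseract repository: InitThenRestrict is a hypothetical name, and it is written as a template so that it compiles without the Tesseract headers. It assumes only that the engine object offers Init(datapath, language) returning 0 on success and SetVariable(name, value) returning true when the parameter was accepted, which is how tesseract::TessBaseAPI declares them.

```cpp
#include <string>

// Hypothetical helper: initialize first, then apply the whitelist.
// Api is expected to look like tesseract::TessBaseAPI, i.e. to provide
//   int  Init(const char* datapath, const char* language);  // 0 on success
//   bool SetVariable(const char* name, const char* value);  // false on failure
template <typename Api>
bool InitThenRestrict(Api& api, const std::string& datapath,
                      const std::string& language,
                      const std::string& whitelist) {
  if (api.Init(datapath.c_str(), language.c_str()) != 0) {
    return false;  // engine is not ready, so no variable is set
  }
  // Only after a successful Init() is the whitelist applied.
  return api.SetVariable("tessedit_char_whitelist", whitelist.c_str());
}
```

With the real API this would be called as InitThenRestrict(api, "/usr/share/tessdata", "eng", "0123456789"). Checking the boolean result catches a misspelled parameter name, which would otherwise fail silently.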
How Whitelists and Blacklists Work Internally
The implementation spans several core files in the tesseract-ocr/tesseract codebase. Understanding this flow helps debug configuration issues and optimize performance.
Parameter Declaration
In src/ccmain/tesseractclass.h, the whitelist and blacklist are declared using the STRING_VAR_H macro (lines 73-75). This registers them as runtime-configurable string parameters within the Tesseract class's Params object.
Runtime Application
When you call TessBaseAPI::SetVariable() (declared in include/tesseract/baseapi.h and implemented in src/api/baseapi.cpp at line 207), the method forwards the name-value pair to ParamUtils::SetParam(). This updates the internal Params storage.
Before each recognition pass, src/api/baseapi.cpp (line 771) invokes Tesseract::SetBlackAndWhitelist(). This method (implemented in src/ccmain/tesseractclass.cpp around line 535) retrieves the current values of tessedit_char_blacklist and tessedit_char_whitelist from the Params object. It then calls UNICHARSET::set_black_and_whitelist() on the main unicharset and every loaded sub-language's unicharset.
The UNICHARSET class uses these lists to filter admissible Unicode symbols during the classification phase. Any glyph whose code point is excluded by the whitelist or explicitly listed in the blacklist is filtered out, ensuring the recognizer only produces text consisting of allowed characters.
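As a mental model, the enable/disable logic can be pictured roughly as follows. This is a simplified sketch, not the code in UNICHARSET::set_black_and_whitelist(): AllowedCharacters is a hypothetical function, it works on single-byte characters rather than the multi-byte unichars Tesseract handles, and it treats an empty whitelist as "everything is allowed".

```cpp
#include <set>
#include <string>

// Simplified model of whitelist/blacklist filtering (hypothetical helper).
// charset:   every character the language model knows about
// whitelist: if non-empty, only these characters stay enabled
// blacklist: these characters are always disabled, even if whitelisted
std::set<char> AllowedCharacters(const std::string& charset,
                                 const std::string& whitelist,
                                 const std::string& blacklist) {
  std::set<char> allowed;
  for (char c : charset) {
    const bool whitelisted =
        whitelist.empty() || whitelist.find(c) != std::string::npos;
    const bool blacklisted = blacklist.find(c) != std::string::npos;
    if (whitelisted && !blacklisted) {
      allowed.insert(c);
    }
  }
  return allowed;
}
```

In this model, a whitelist made up only of characters that the charset does not contain leaves nothing enabled, which is one way a recognition run can end up producing empty output.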
Practical Implementation Methods
You can configure these constraints through multiple interfaces depending on your integration requirements.
C++ API Approach
The C++ API provides the most direct control. Initialize the engine, then apply your constraints before processing images.
#include <cstdio>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main() {
  tesseract::TessBaseAPI api;
  // Initialize with English language data; Init() returns 0 on success
  if (api.Init("/usr/share/tessdata", "eng")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    return 1;
  }
  // Restrict recognition to digits only (must come after Init())
  api.SetVariable("tessedit_char_whitelist", "0123456789");
  // Alternatively, blacklist specific problematic characters
  // api.SetVariable("tessedit_char_blacklist", "OI");
  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png\n");
    api.End();
    return 1;
  }
  api.SetImage(image);
  char *text = api.GetUTF8Text();
  printf("Recognized text: %s\n", text ? text : "");
  // GetUTF8Text() allocates with new[], so release with delete[]
  delete[] text;
  pixDestroy(&image);
  api.End();
  return 0;
}
C API Integration
For applications using the C wrapper, the workflow is identical: create the handle, initialize, set variables, then process.
#include <leptonica/allheaders.h>
#include <stdio.h>
#include <tesseract/capi.h>

int main() {
  TessBaseAPI *handle = TessBaseAPICreate();
  if (TessBaseAPIInit3(handle, "/usr/share/tessdata", "eng") != 0) {
    fprintf(stderr, "Initialization failed\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  /* Restrict to uppercase letters A-Z (must come after Init) */
  TessBaseAPISetVariable(handle, "tessedit_char_whitelist",
                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ");
  Pix *image = pixRead("input.png");
  if (image == NULL) {
    fprintf(stderr, "Could not read input.png\n");
    TessBaseAPIDelete(handle);
    return 1;
  }
  TessBaseAPISetImage2(handle, image);
  char *text = TessBaseAPIGetUTF8Text(handle);
  printf("%s\n", text ? text : "");
  /* Text returned by the C API is released with TessDeleteText */
  TessDeleteText(text);
  pixDestroy(&image);
  TessBaseAPIDelete(handle);
  return 0;
}
Command-Line Usage
For quick testing or batch processing, use the -c flag to set configuration variables directly.
# Whitelist only digits and the dash character
tesseract input.png out -c tessedit_char_whitelist=0123456789-
# Blacklist specific punctuation to prevent confusion
# (quoted so the shell does not interpret ; ! and ? itself)
tesseract input.png out -c 'tessedit_char_blacklist=.,;:!?'
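When the command line is assembled by a program rather than typed by hand, the value needs quoting so that the shell passes it through unchanged, because characters such as ;, ! and ? are otherwise interpreted by most shells. A minimal sketch, assuming a POSIX shell; ShellQuote and ConfigArg are hypothetical helper names, not part of Tesseract:

```cpp
#include <string>

// Wrap a value in single quotes for a POSIX shell. An embedded single quote
// is written as '\'' (close the quote, add an escaped quote, reopen).
std::string ShellQuote(const std::string& value) {
  std::string quoted = "'";
  for (char c : value) {
    if (c == '\'') {
      quoted += "'\\''";
    } else {
      quoted += c;
    }
  }
  quoted += "'";
  return quoted;
}

// Build a "-c name=value" argument for the tesseract command line.
std::string ConfigArg(const std::string& name, const std::string& value) {
  return "-c " + ShellQuote(name + "=" + value);
}
```

For example, ConfigArg("tessedit_char_blacklist", ".,;:!?") yields -c 'tessedit_char_blacklist=.,;:!?', which can be appended to the rest of the command string. Programs that launch Tesseract without a shell (for instance via an argument vector) do not need this quoting at all.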
Configuration File Method
For reusable configurations, create a text file with your constraints. The Tesseract repository includes an example in tessdata/configs/digits.
Create a file named mydigits.cfg:
tessedit_char_whitelist 0123456789-
Run Tesseract with the config file:
tesseract input.png out mydigits.cfg
This approach is ideal for deployment scenarios where you want to version-control your OCR constraints separately from application code.
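If the config file is generated by a build or deployment step, the layout used by mydigits.cfg (one parameter per line, with the name and value separated by a space) is simple enough to emit directly. A small sketch; FormatConfig and ConfigEntries are hypothetical names, and the entries passed in are only an example:

```cpp
#include <string>
#include <utility>
#include <vector>

// Ordered list of (parameter name, value) pairs.
using ConfigEntries = std::vector<std::pair<std::string, std::string>>;

// Produce one "name value" line per entry, matching the mydigits.cfg layout.
std::string FormatConfig(const ConfigEntries& entries) {
  std::string text;
  for (const auto& entry : entries) {
    text += entry.first + " " + entry.second + "\n";
  }
  return text;
}
```

The returned string can be written to mydigits.cfg with std::ofstream and the file passed to the tesseract command as before.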
Summary
- Use tessedit_char_whitelist to define an exclusive set of allowed characters, or tessedit_char_blacklist to exclude specific symbols.
- Set these parameters after initialization via SetVariable() (C++) or TessBaseAPISetVariable() (C); they are non-init variables that take effect only post-Init().
- Internal implementation routes these values through src/api/baseapi.cpp to Tesseract::SetBlackAndWhitelist() in src/ccmain/tesseractclass.cpp, which updates the UNICHARSET constraints before recognition.
- Apply constraints dynamically without reinitializing the engine, enabling runtime switching between different character sets for batch processing diverse document types.
Frequently Asked Questions
Can I set whitelists before calling Init()?
No. The tessedit_char_whitelist and tessedit_char_blacklist parameters are classified as non-init variables in the Tesseract source code. You must initialize the engine with Init() first, then call SetVariable() to apply these constraints. Setting them before initialization will result in the values being ignored during recognition.
Do whitelists work with all Tesseract language models?
Yes. The whitelist and blacklist constraints are applied at the UNICHARSET level in src/ccmain/tesseractclass.cpp. Since every language model loads its own unicharset, the filtering logic affects all loaded languages and sub-languages equally. However, ensure your whitelist characters exist in the target language's character set; otherwise, the engine may return empty results.
How do I whitelist special characters or spaces?
Include the literal characters directly in the whitelist string. For spaces, include the space character within the quotes. The whitelist is a plain list of characters, not a regular expression, so the only escaping required is what your programming language demands inside a string literal, for example for backslashes or double quotes. For example, in C++: api.SetVariable("tessedit_char_whitelist", "ABC 123-"); allows the letters A, B and C, the space, the digits 1, 2 and 3, and the hyphen.
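Longer whitelists are tedious to type and easy to get wrong, so it can help to assemble them from ranges. A minimal sketch for contiguous ASCII ranges only; CharRange is a hypothetical helper, not a Tesseract function:

```cpp
#include <string>

// Return every character from first to last inclusive, e.g. 'A'..'Z'.
// Intended for contiguous ASCII ranges; returns "" if first > last.
std::string CharRange(char first, char last) {
  std::string range;
  for (int c = first; c <= last; ++c) {
    range += static_cast<char>(c);
  }
  return range;
}
```

A whitelist of all uppercase letters, the space, all digits, and the hyphen could then be built as CharRange('A', 'Z') + " " + CharRange('0', '9') + "-" and passed to SetVariable() through c_str().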
Can I combine whitelists and blacklists simultaneously?
Yes. Tesseract processes both parameters through Tesseract::SetBlackAndWhitelist(), which calls UNICHARSET::set_black_and_whitelist(). When both are specified, the engine applies both constraints: the whitelist restricts recognition to the allowed set, while the blacklist removes specific characters from that set (or from the full set if no whitelist is defined). This allows fine-grained control, such as whitelisting all alphanumeric characters but blacklisting one member of an easily confused pair, for example the letter O when only the digit 0 is expected.