
Image Preprocessing Techniques to Improve Tesseract OCR Accuracy: A Complete Technical Guide

March 2, 2026 tesseract-ocr/tesseract ↗

Tesseract OCR achieves maximum accuracy when processing binary, deskewed images scaled to approximately 300 DPI, using configurable thresholding methods like Sauvola or Otsu to handle varying illumination conditions.

The tesseract-ocr/tesseract repository implements a robust preprocessing pipeline directly within its C++ core. Understanding these internal mechanisms—available through both the API and command-line interface—allows you to optimize recognition accuracy without external dependencies. This guide examines the specific preprocessing stages implemented in the source code, including rescaling, deskewing, binarization, and noise reduction.

Why Image Preprocessing Matters for Tesseract OCR

Tesseract’s LSTM recognition engine was trained on images with specific characteristics: clean binary text, horizontal baselines, and consistent resolution near 300 pixels per inch (PPI). When input images deviate from these parameters—due to low resolution, skewed scanning, or uneven lighting—character error rates increase significantly. The preprocessing pipeline bridges this gap by normalizing input conditions before the LSTM layer processes the data.

Core Preprocessing Pipeline Overview

The Tesseract engine applies the following preprocessing stages, implemented in src/ccmain/thresholder.cpp, src/ccstruct/imagedata.cpp, and src/textord/tabfind.cpp:

Rescaling – Adjusts image resolution to target DPI
Deskewing – Rotates text lines to horizontal
Binarization – Converts grayscale to 1-bit using thresholding algorithms
Smoothing – Optional smoothing of the threshold map to suppress noise (Leptonica-Otsu method)
Inversion – Handles photonegatives when detected

Each stage exposes configuration variables that control behavior without requiring code modification.

Rescaling and DPI Normalization

Tesseract requires text height to correspond approximately to 300 DPI for optimal recognition. The ImageData::PreScale function in src/ccstruct/imagedata.cpp (lines 217-267) implements intelligent rescaling:

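// Simplified sketch of the function in src/ccstruct/imagedata.cpp.
// The parameter list is abbreviated; consult the source for the exact signature.
Image ImageData::PreScale(int target_height, int max_height,
                          float *scale_factor, int *scaled_width,
                          int *scaled_height, /* ... */) const {
  // Computes the scale factor that brings the image to target_height
  // (capped by max_height) while preserving the aspect ratio,
  // then resizes with Leptonica's pixScale and returns the scaled image.
}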

Key implementation details:

The function calculates a scale factor from the target height, which is the text-line height the recognition network expects (on the order of tens of pixels)
It preserves aspect ratio using Leptonica’s pixScale
Maximum height constraints prevent memory issues with oversized images
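The scale computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in imagedata.cpp: the helper name prescale_dims and its rounding behavior are assumptions made for the example.

```python
def prescale_dims(width, height, target_height, max_height):
    """Return (scale, scaled_width, scaled_height) for an aspect-preserving resize.

    Hypothetical helper: it mirrors the idea of scaling to a target height
    capped by a maximum, not the exact logic in imagedata.cpp.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never exceed the maximum height, even if the caller asks for more.
    effective_target = min(target_height, max_height)
    scale = effective_target / height
    # The width follows the same factor, so the aspect ratio is preserved.
    scaled_width = max(1, round(width * scale))
    return scale, scaled_width, effective_target


scale, new_w, new_h = prescale_dims(2000, 72, target_height=36, max_height=48)
print(scale, new_w, new_h)
```

Capping the target before computing the factor keeps the output height bounded regardless of the input size, which is the memory safeguard the list above refers to.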

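Configuration approach: user_defined_dpi tells Tesseract which resolution to assume when the image metadata is missing or wrong. It does not resample the image itself: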

tesseract input.png output -l eng --psm 6 -c user_defined_dpi=300

Deskewing and Page Orientation

Skewed text lines severely impact line segmentation accuracy. The deskewing system in src/textord/tabfind.cpp (lines 1287-1294) uses vector analysis to compute rotation angles:

// Simplified sketch of src/textord/tabfind.cpp; arguments are elided
bool TabFind::Deskew(/* ... */, FCOORD *reskew) {
  // ComputeDeskewVectors derives the deskew and reskew vectors
  // from the vertical_skew_ member
  FCOORD deskew;
  ComputeDeskewVectors(&deskew, reskew);
  // Rotate all blobs and grid structures by deskew so text lines are horizontal
}

Technical details:

ComputeDeskewVectors analyzes text line angles across the page
The system handles both page-level skew and column-level variations
The rotation is applied to the blobs and grid structures, so the source image itself is not resampled

Command-line control: While deskewing runs automatically, you can influence it through page segmentation modes:

tesseract skewed.png output -l eng --psm 3  # Fully automatic page segmentation, no OSD (default); use --psm 1 to add OSD
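To make the rotate-to-horizontal step concrete, the sketch below estimates a skew angle from points along a text baseline and rotates them back. It is an illustration in plain Python: estimate_skew and rotate_points are hypothetical names, and the least-squares fit stands in for Tesseract's own tab-vector analysis.

```python
import math


def estimate_skew(points):
    """Least-squares slope of (x, y) baseline points, returned as an angle in radians."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.atan2(sxy, sxx)


def rotate_points(points, angle):
    """Rotate points about the origin by `angle` radians (counter-clockwise)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


# Synthetic baseline with a slope of 0.1 (an assumed example, not real OCR data).
baseline = [(x, 0.1 * x + 5.0) for x in range(0, 100, 10)]
skew = estimate_skew(baseline)
deskewed = rotate_points(baseline, -skew)
print(math.degrees(skew))
```

Rotating coordinates instead of pixels avoids interpolation artifacts, which is one reason to operate on blob geometry rather than on the raster.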

Binarization Methods: Otsu, Leptonica-Otsu, and Sauvola

Binarization converts grayscale images to black-and-white bitmaps required by the OCR engine. The ImageThresholder::Threshold function in src/ccmain/thresholder.cpp (lines 82-89) supports three algorithms:

// Simplified sketch of src/ccmain/thresholder.cpp (not the verbatim signature)
ImageThresholder::Threshold(/* ... */, ThresholdMethod method) {
  // thresholding_method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola
  switch (method) {
    case ThresholdMethod::Otsu:           // global threshold
      break;
    case ThresholdMethod::LeptonicaOtsu:  // Leptonica's tiled Otsu variant
      break;
    case ThresholdMethod::Sauvola:        // local adaptive threshold
      break;
  }
}

Algorithm selection guide:

Otsu (0): Global thresholding, optimal for high-contrast, uniform lighting
Leptonica-Otsu (1): Leptonica's tiled Otsu implementation, with optional smoothing of the threshold map
Sauvola (2): Local adaptive thresholding, ideal for uneven illumination or shadows
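For readers who want to see what a global Otsu threshold computes, here is a minimal NumPy sketch. It is a textbook formulation written for this guide, not Tesseract's or Leptonica's implementation, and otsu_threshold is a hypothetical name.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the threshold t maximizing between-class variance.

    Pixels with value <= t form one class and pixels > t the other.
    Expects an 8-bit grayscale array.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    w0 = np.cumsum(hist)               # pixel count at or below each level
    w1 = hist.sum() - w0               # pixel count above each level
    sum0 = np.cumsum(hist * levels)    # intensity sum at or below each level
    valid = (w0 > 0) & (w1 > 0)        # both classes must be non-empty
    mu0 = sum0 / np.maximum(w0, 1.0)
    mu1 = (sum0[-1] - sum0) / np.maximum(w1, 1.0)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, 0.0)
    return int(np.argmax(between))


# Synthetic two-level image: dark strokes (50) on a light page (200).
page = np.full((20, 20), 200, dtype=np.uint8)
page[5:8, 2:18] = 50
t = otsu_threshold(page)
light_mask = page > t
```

Because a single threshold is chosen for the whole image, a shadow that darkens one region shifts the histogram and can push text and background into the same class, which is the case where Sauvola is the better choice.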

Configuration variables (defined in src/ccmain/tesseractclass.cpp, lines 84-102; shown in simplified form with their defaults):

// Control parameters exposed to users (trailing arguments elided)
INT_MEMBER(thresholding_method, 0, "Thresholding method: 0 = Otsu, 1 = LeptonicaOtsu, 2 = Sauvola", ...)
double_MEMBER(thresholding_window_size, 0.33, "Window size for Sauvola, multiplied by image DPI", ...)
double_MEMBER(thresholding_kfactor, 0.34, "Sauvola k factor", ...)

Practical configuration:

# For scanned documents with shadows
tesseract uneven.png output -l eng \
  -c thresholding_method=2 \
  -c thresholding_window_size=0.33 \
  -c thresholding_kfactor=0.34
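The two Sauvola parameters are easier to tune once their arithmetic is visible. The sketch below uses the standard Sauvola formula T = m * (1 + k * (s / R - 1)) with R = 128 for 8-bit images; the function names are hypothetical and the choice of R is an assumption, so treat it as an illustration rather than the code path Tesseract runs.

```python
def sauvola_window_pixels(window_size, dpi):
    """Window side in pixels: the window_size fraction multiplied by the image DPI."""
    return max(1, round(window_size * dpi))


def sauvola_threshold(local_mean, local_std, k=0.34, r=128.0):
    """Standard Sauvola threshold for one pixel, given its local statistics."""
    return local_mean * (1.0 + k * (local_std / r - 1.0))


window = sauvola_window_pixels(0.33, 300)      # window for a 300 DPI scan
flat = sauvola_threshold(200.0, 0.0)           # flat paper, no local contrast
busy = sauvola_threshold(200.0, 128.0)         # strong local contrast
print(window, flat, busy)
```

In a flat region the threshold drops well below the local mean, so paper texture stays background. Where local contrast is high the threshold rises toward the mean, so strokes are separated cleanly. A larger k lowers the threshold in flat areas further.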

Noise Reduction and Smoothing

Before final binarization, Tesseract can apply smoothing to reduce high-frequency noise. The thresholding_smooth_kernel_size parameter controls this behavior:

Implementation location: src/ccmain/tesseractclass.cpp (lines 107-110) defines the smoothing kernel size used by the Leptonica-Otsu method.

Effect:

Removes isolated pixel spikes that could be misinterpreted as characters
Applies a small convolution kernel to the threshold score map
Particularly useful for scanned documents with paper texture or JPEG artifacts

Configuration:

tesseract noisy.png output -l eng \
  -c thresholding_method=1 \
  -c thresholding_smooth_kernel_size=1.0
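As a rough picture of what smoothing a threshold map does, the sketch below applies a box (mean) filter to a small grid of per-tile thresholds. The kernel shape, the edge handling, and the name smooth_threshold_map are assumptions made for illustration; Leptonica's actual kernel may differ.

```python
import numpy as np


def smooth_threshold_map(thresholds, half_width):
    """Mean-filter a 2D threshold map with a (2*half_width + 1) square kernel."""
    values = thresholds.astype(np.float64)
    if half_width <= 0:
        return values
    # Replicate edge values so border tiles are averaged over a full window.
    padded = np.pad(values, half_width, mode="edge")
    size = 2 * half_width + 1
    rows, cols = values.shape
    total = np.zeros_like(values)
    for dy in range(size):
        for dx in range(size):
            total += padded[dy:dy + rows, dx:dx + cols]
    return total / (size * size)


# One outlier tile (for example, a stain) among otherwise uniform thresholds.
tile_thresholds = np.full((5, 5), 100.0)
tile_thresholds[2, 2] = 200.0
smoothed = smooth_threshold_map(tile_thresholds, half_width=1)
```

Averaging neighbouring tiles pulls the outlier back toward its surroundings, which prevents a single noisy tile from producing a visibly different binarization than the tiles around it.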

Handling Photonegatives and Inversion

When processing photonegatives (white text on a black background), the image must be inverted so that foreground and background keep their expected polarity. A reference for the operation appears in the training tools at src/training/degradeimage.cpp (lines 13-15):

// From src/training/degradeimage.cpp
if (invert) {
  pixInvert(pix, pix);  // Leptonica function to flip black/white
}

Detection and handling:

Tesseract automatically detects inversion in some modes, but explicit control ensures accuracy
Inversion must occur before binarization to ensure proper threshold calculations
Critical for microfilm, photographic negatives, or certain medical imaging formats

Command-line approach: Tesseract has no command-line flag that inverts the whole input image (the tessedit_do_invert variable only controls whether the recognizer retries text inverted), so invert beforehand with Leptonica or ImageMagick:

# Using ImageMagick to invert before Tesseract
convert negative.png -negate positive.png
tesseract positive.png output -l eng
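If the pipeline is already in Python, the same inversion can be done with NumPy. The sketch below uses a simple heuristic, inverting when more than half of the pixels are dark. Both the heuristic and the name ensure_dark_text_on_light are assumptions for illustration and are not how Tesseract detects negatives.

```python
import numpy as np


def ensure_dark_text_on_light(gray, dark_fraction=0.5):
    """Invert an 8-bit grayscale image when most of its pixels are dark."""
    if gray.dtype != np.uint8:
        raise TypeError("expected an 8-bit grayscale array")
    # Share of pixels darker than mid-gray; a mostly dark page is treated as a negative.
    is_negative = np.mean(gray < 128) > dark_fraction
    return (255 - gray) if is_negative else gray


# Synthetic negative: white strokes on a black page.
negative = np.zeros((10, 10), dtype=np.uint8)
negative[4:6, 2:8] = 255
positive = ensure_dark_text_on_light(negative)
```

The heuristic fails on pages dominated by dark photographs or heavy borders, so for mixed content an explicit per-document flag is safer than automatic detection.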

Practical Implementation: Code Examples

C++ API: Explicit Preprocessing Control

For applications requiring fine-grained control, the C++ API exposes the internal preprocessing pipeline:

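#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  // Init returns non-zero on failure (for example, missing traineddata)
  if (api.Init(nullptr, "eng") != 0) {
    fprintf(stderr, "Could not initialize Tesseract\n");
    return 1;
  }

  // Load image; the public API takes a Leptonica Pix*
  Pix *pix = pixRead("document.tif");
  if (pix == nullptr) {
    fprintf(stderr, "Could not read document.tif\n");
    api.End();
    return 1;
  }

  // Configure preprocessing parameters
  api.SetVariable("thresholding_method", "2");          // Sauvola
  api.SetVariable("thresholding_window_size", "0.33");  // 33% of DPI
  api.SetVariable("thresholding_kfactor", "0.34");      // Sauvola k parameter
  // Only used by the Leptonica-Otsu method (thresholding_method=1)
  api.SetVariable("thresholding_smooth_kernel_size", "1.0");

  // Process image
  api.SetImage(pix);
  char *text = api.GetUTF8Text();  // caller owns the returned buffer
  if (text != nullptr) {
    printf("%s\n", text);
    delete[] text;
  }

  api.End();
  pixDestroy(&pix);
  return 0;
}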

Command-Line: Configuration Variables

For batch processing or scripting, use the -c flag to override preprocessing defaults:

# Standard processing
tesseract scan.png output -l eng

# Optimized for uneven lighting (Sauvola thresholding)
tesseract scan.png output -l eng \
  -c thresholding_method=2 \
  -c thresholding_window_size=0.33 \
  -c thresholding_kfactor=0.34

# With noise reduction for scanned documents
tesseract noisy_scan.png output -l eng \
  -c thresholding_smooth_kernel_size=1.0 \
  -c thresholding_method=1
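For batch jobs it can help to build these command lines programmatically instead of concatenating strings. The helper below is a small sketch using only the Python standard library; build_tesseract_command is a hypothetical name, and the variables shown are the ones discussed in this guide.

```python
import shlex


def build_tesseract_command(image, output_base, lang="eng", variables=None):
    """Return the argv list for a tesseract call with -c overrides."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    # Each configuration variable becomes its own "-c name=value" pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


cmd = build_tesseract_command(
    "scan.png",
    "output",
    variables={"thresholding_method": 2, "thresholding_kfactor": 0.34},
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with file names that contain spaces.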

Python: Custom Preprocessing with pytesseract

When you need preprocessing beyond Tesseract's built-in capabilities, combine OpenCV or scikit-image with pytesseract:

import cv2
import numpy as np
import pytesseract
from skimage.filters import threshold_sauvola

# Load image as grayscale
img = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE)
if img is None:
    raise FileNotFoundError('document.jpg')

# 1. Rescale to a working height of 1000 px (adjust so text lands near the ~300 DPI equivalent)
h, w = img.shape
scale_factor = 1000 / h
img_resized = cv2.resize(img, (int(w * scale_factor), 1000),
                         interpolation=cv2.INTER_LINEAR)

# 2. Deskew using the minimum-area rectangle around the dark (text) pixels
_, text_mask = cv2.threshold(img_resized, 0, 255,
                             cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
coords = np.column_stack(np.where(text_mask > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
# Normalize to [-45, 45] so the result does not depend on which
# angle range the installed OpenCV version reports
if angle > 45:
    angle -= 90
elif angle < -45:
    angle += 90
angle = -angle

(h, w) = img_resized.shape[:2]
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
img_deskew = cv2.warpAffine(img_resized, M, (w, h),
                            flags=cv2.INTER_CUBIC,
                            borderMode=cv2.BORDER_REPLICATE)

# 3. Sauvola binarization (adaptive for uneven lighting)
window_size = int(0.33 * 300)  # 99 px: 33% of 300 DPI (must be odd)
thresh = threshold_sauvola(img_deskew, window_size=window_size, k=0.34)
binary = (img_deskew > thresh).astype(np.uint8) * 255

# 4. OCR with pytesseract
text = pytesseract.image_to_string(binary, lang='eng')
print(text)

This approach mirrors the internal C++ pipeline while giving you explicit control over each preprocessing stage using Python's computer vision ecosystem.

Summary

Effective image preprocessing techniques improve Tesseract OCR accuracy by normalizing input to match the conditions of the training data. The key optimizations include:

Rescaling to 300 DPI using ImageData::PreScale in src/ccstruct/imagedata.cpp to ensure proper character height for the LSTM models
Deskewing text lines via TabFind::Deskew in src/textord/tabfind.cpp to correct rotational errors that break line segmentation
Adaptive binarization through ImageThresholder::Threshold in src/ccmain/thresholder.cpp, selecting Sauvola for uneven lighting or Otsu for high-contrast documents
Noise reduction by configuring thresholding_smooth_kernel_size to eliminate high-frequency artifacts before character recognition
Inversion handling for photonegatives using pixInvert logic from src/training/degradeimage.cpp to ensure correct foreground/background detection

Frequently Asked Questions

What is the optimal DPI for Tesseract OCR images?

Tesseract’s LSTM models were trained on images with approximately 300 pixels per inch (PPI). The ImageData::PreScale function in src/ccstruct/imagedata.cpp automatically scales input images to achieve this effective resolution. For best results, ensure your source material scales to a text line height of roughly 30-40 pixels, which corresponds to the 300 DPI training standard.
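The relation between DPI and pixel height is simple arithmetic: one typographic point is 1/72 inch. The sketch below shows the conversion and the resize factor needed to bring a scan to 300 DPI. The function names are hypothetical, and the mapping from nominal point size to actual text-line height depends on the font, so the numbers are approximate.

```python
def points_to_pixels(point_size, dpi):
    """Convert a type size in points (1/72 inch) to pixels at the given DPI."""
    return point_size / 72.0 * dpi


def dpi_scale_factor(source_dpi, target_dpi=300):
    """Resize factor that brings a scan from source_dpi to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi


small = points_to_pixels(8, 300)    # 8 pt text at 300 DPI
body = points_to_pixels(10, 300)    # 10 pt text at 300 DPI
factor = dpi_scale_factor(150)      # upscale needed for a 150 DPI scan
print(round(small, 1), round(body, 1), factor)
```

Under these assumptions, body text of 8 to 10 points at 300 DPI lands at roughly 33 to 42 pixels, in line with the 30-40 pixel figure above.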

How does Tesseract handle uneven lighting or shadows?

For documents with uneven illumination, configure Sauvola thresholding instead of the default Otsu method. Set thresholding_method=2 and adjust thresholding_window_size (typically 0.33 for 33% of DPI) and thresholding_kfactor (default 0.34). The ImageThresholder::Threshold implementation in src/ccmain/thresholder.cpp applies these local adaptive thresholds, calculating mean and standard deviation within sliding windows to compensate for lighting variations.

Can Tesseract automatically deskew rotated documents?

Yes, Tesseract automatically computes deskew vectors during page layout analysis. The TabFind::Deskew function in src/textord/tabfind.cpp obtains the deskew vector from ComputeDeskewVectors, then rotates all text blobs and grid structures to horizontal. This occurs when using the automatic page segmentation modes that run layout analysis (such as PSM 1 or 3), though severe rotations may require manual correction or preprocessing with Leptonica’s pixDeskew before passing the image to Tesseract.

What preprocessing should I apply before calling Tesseract?

Ideally, minimal external preprocessing is needed if you configure Tesseract’s internal pipeline correctly. Ensure you: (1) provide images with sufficient resolution (Tesseract will rescale via PreScale if needed), (2) set appropriate thresholding parameters for your lighting conditions, and (3) enable smoothing for noisy scans (thresholding_smooth_kernel_size=1 together with thresholding_method=1, since the kernel applies to the Leptonica-Otsu method). For specialized cases like photonegatives, invert the image beforehand, using the training tools in src/training/degradeimage.cpp as a reference for the inversion logic.
