# Image Preprocessing Techniques to Improve Tesseract OCR Accuracy: A Complete Technical Guide

> Boost Tesseract OCR accuracy with essential preprocessing techniques. Learn about deskewing, scaling, and thresholding for optimal results in this comprehensive guide.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract OCR achieves maximum accuracy when processing binary, deskewed images scaled to approximately 300 DPI, using configurable thresholding methods like Sauvola or Otsu to handle varying illumination conditions.**

The `tesseract-ocr/tesseract` repository implements a robust preprocessing pipeline directly within its C++ core. Understanding these internal mechanisms—available through both the API and command-line interface—allows you to optimize recognition accuracy without external dependencies. This guide examines the specific preprocessing stages implemented in the source code, including rescaling, deskewing, binarization, and noise reduction.

## Why Image Preprocessing Matters for Tesseract OCR

Tesseract’s LSTM recognition engine was trained on images with specific characteristics: clean binary text, horizontal baselines, and consistent resolution near 300 pixels per inch (PPI). When input images deviate from these parameters—due to low resolution, skewed scanning, or uneven lighting—character error rates increase significantly. The preprocessing pipeline bridges this gap by normalizing input conditions before the LSTM layer processes the data.

## Core Preprocessing Pipeline Overview

The Tesseract engine executes preprocessing in a specific sequence defined in [`src/ccmain/thresholder.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/thresholder.cpp) and [`src/ccstruct/imagedata.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/imagedata.cpp):

1. **Rescaling** – Adjusts image resolution to target DPI
2. **Deskewing** – Rotates text lines to horizontal
3. **Binarization** – Converts grayscale to 1-bit using thresholding algorithms
4. **Smoothing** – Optional smoothing of the threshold map to suppress noise (used with the Leptonica-Otsu method)
5. **Inversion** – Handles photonegatives when detected

Each stage exposes configuration variables that control behavior without requiring code modification.

## Rescaling and DPI Normalization

Tesseract works best when character height corresponds roughly to a 300 DPI scan. The `ImageData::PreScale` function in [`src/ccstruct/imagedata.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/imagedata.cpp) (lines 217-267) computes an aspect-preserving rescale of an image to a requested height:

```cpp
// Simplified sketch of src/ccstruct/imagedata.cpp, not the verbatim source.
// The exact signature varies between Tesseract versions.
Image ImageData::PreScale(int target_height, int max_height,
                          float* scale_factor, int* scaled_width,
                          int* scaled_height, std::vector<TBOX>* boxes) const {
  // Computes the scale factor needed to reach target_height while
  // preserving aspect ratio, resizes with Leptonica's pixScale,
  // and returns the scaled image.
}
```

**Key implementation details:**
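- The function derives a single scale factor from the requested target height, which is the input height the recognition network expects (tens of pixels per text line rather than a full page height)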
- It preserves aspect ratio using Leptonica’s `pixScale`
- Maximum height constraints prevent memory issues with oversized images
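
The scale computation described in these points can be sketched in a few lines. This is a minimal illustration, not Tesseract's code: `prescale_dims` is a hypothetical helper, and the rule of capping the target at `max_height` is an assumption made for the example.

```python
def prescale_dims(width, height, target_height, max_height):
    """Hypothetical helper: aspect-preserving resize dimensions.

    Returns (scale, scaled_width, scaled_height).
    """
    if width <= 0 or height <= 0 or target_height <= 0 or max_height <= 0:
        raise ValueError("all dimensions must be positive")
    # Cap the requested height so oversized inputs cannot exhaust memory.
    scaled_height = min(target_height, max_height)
    # One factor for both axes keeps the aspect ratio intact.
    scale = scaled_height / height
    scaled_width = max(1, round(width * scale))
    return scale, scaled_width, scaled_height


scale, new_w, new_h = prescale_dims(1200, 60, target_height=48, max_height=100)
print(f"scale={scale:.2f}, size={new_w}x{new_h}")
```

The same factor is applied to width and height, so glyph proportions are unchanged; only the sampling density differs.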

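**Configuration approach:**

The `user_defined_dpi` variable tells Tesseract which resolution to assume when the image metadata is missing or wrong. It does not resample the pixels, so low-resolution scans still need to be upscaled beforehand.

```bash
tesseract input.png output -l eng --psm 6 -c user_defined_dpi=300
```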
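
If the scan resolution is known, the resampling needed to approximate 300 DPI is simple arithmetic. The sketch below uses two hypothetical helpers (`resample_factor` and `glyph_height_px` are names chosen for illustration) and relies only on the typographic convention that one point is 1/72 inch.

```python
def resample_factor(source_dpi, target_dpi=300):
    """Factor by which to resize an image so it behaves like a target_dpi scan."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return target_dpi / source_dpi


def glyph_height_px(point_size, dpi):
    """Nominal em height in pixels for a font of point_size at the given DPI."""
    return point_size / 72 * dpi


factor = resample_factor(150)  # a 150 DPI scan needs to be doubled in size
for dpi in (150, 300):
    print(f"10 pt text at {dpi} DPI is about {glyph_height_px(10, dpi):.1f} px tall")
```

Upscaling cannot restore detail that was never captured, so rescanning at a higher resolution is preferable when the original document is available.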

## Deskewing and Page Orientation

Skewed text lines severely degrade line segmentation. The deskewing logic in [`src/textord/tabfind.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/textord/tabfind.cpp) (lines 1287-1294) derives a rotation from the estimated page skew and applies it to the detected blobs:

```cpp
// Simplified sketch of src/textord/tabfind.cpp, not the verbatim source;
// the real method takes additional arguments.
bool TabFind::Deskew(/* tab vectors, blobs, block, */
                     FCOORD* deskew, FCOORD* reskew) {
  // Derive the deskew rotation (and its inverse, reskew) from the
  // estimated vertical skew of the page.
  ComputeDeskewVectors(deskew, reskew);
  // Rotate all blobs and grid structures so text lines become horizontal.
}
```

**Technical details:**
- `ComputeDeskewVectors` analyzes text line angles across the page
- The system handles both page-level skew and column-level variations
- The rotation is applied to blob coordinates and grid structures, so the page image itself is not resampled at this stage
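
To make the idea of a deskew rotation concrete, the sketch below estimates a skew angle from points sampled along one text baseline and rotates them back to horizontal. It is a plain least-squares fit, not Tesseract's tab-vector analysis, and both function names are hypothetical.

```python
import math


def fit_baseline_angle(points):
    """Least-squares slope of (x, y) baseline points, as an angle in radians."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned")
    return math.atan2(sxy, sxx)


def deskew_points(points, angle):
    """Rotate points by -angle about the origin so the fitted line is horizontal."""
    c, s = math.cos(-angle), math.sin(-angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


# Synthetic baseline with a slope of 0.05 (a skew of a few degrees).
baseline = [(float(x), 0.05 * x + 10.0) for x in range(0, 500, 50)]
angle = fit_baseline_angle(baseline)
flat = deskew_points(baseline, angle)
print(f"estimated skew: {math.degrees(angle):.2f} degrees")
```

After the rotation every point shares the same y coordinate, which is the property line segmentation depends on.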

**Command-line control:**
While deskewing runs automatically, you can influence it through page segmentation modes:

```bash
tesseract skewed.png output -l eng --psm 3  # Fully automatic page segmentation, no OSD (default)
```

## Binarization Methods: Otsu, Leptonica-Otsu, and Sauvola

Binarization converts grayscale images to black-and-white bitmaps required by the OCR engine. The `ImageThresholder::Threshold` function in [`src/ccmain/thresholder.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/thresholder.cpp) (lines 82-89) supports three algorithms:

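```cpp
// Simplified sketch of src/ccmain/thresholder.cpp, not the verbatim source;
// the exact signature and return type vary between Tesseract versions.
std::tuple<bool, Image, Image, Image> ImageThresholder::Threshold(
    TessBaseAPI* api, ThresholdMethod method) {
  // method mirrors thresholding_method: 0=Otsu, 1=LeptonicaOtsu, 2=Sauvola
  switch (method) {
    case ThresholdMethod::Otsu:           // global threshold (default)
      break;
    case ThresholdMethod::LeptonicaOtsu:  // tiled Otsu via Leptonica
      break;
    case ThresholdMethod::Sauvola:        // local adaptive threshold
      break;
    default:
      break;
  }
}
```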

**Algorithm selection guide:**
- **Otsu (0):** Global thresholding, optimal for high-contrast, uniform lighting
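- **Leptonica-Otsu (1):** Otsu computed per tile through Leptonica, with optional smoothing of the resulting thresholds; a middle ground for moderately uneven backgrounds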
- **Sauvola (2):** Local adaptive thresholding, ideal for uneven illumination or shadows
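
For readers unfamiliar with Otsu's method, the following is a minimal pure-Python sketch of the global variant: it picks the gray level that maximizes the between-class variance of the histogram. It illustrates the algorithm only and is not the code Tesseract or Leptonica runs; `otsu_threshold` is a hypothetical name.

```python
def otsu_threshold(pixels):
    """Return the 8-bit level t maximizing between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue  # no dark pixels yet
        if w_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Synthetic bimodal sample: dark ink around 30-40, paper around 200-220.
sample = [30] * 100 + [40] * 100 + [200] * 100 + [220] * 100
t = otsu_threshold(sample)
binary = [0 if p <= t else 255 for p in sample]
```

Because a single threshold is used for the whole image, a shadow that darkens part of the page pushes paper pixels below the threshold there, which is the situation the tiled and Sauvola variants address.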

**Configuration variables** (defined in [`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp), lines 84-102):

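```cpp
// Control parameters exposed to users (simplified; help strings abridged)
INT_MEMBER(thresholding_method, 0,
           "Thresholding method: 0=Otsu, 1=LeptonicaOtsu, 2=Sauvola", this->params());
DOUBLE_MEMBER(thresholding_window_size, 0.33,
              "Sauvola window size (multiplied by image DPI)", this->params());
DOUBLE_MEMBER(thresholding_kfactor, 0.34,
              "Sauvola k factor", this->params());
```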

**Practical configuration:**

```bash
# For scanned documents with shadows
tesseract uneven.png output -l eng \
  -c thresholding_method=2 \
  -c thresholding_window_size=0.33 \
  -c thresholding_kfactor=0.34
```
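
The two Sauvola parameters are easier to tune once their roles are visible. The sketch below shows the standard Sauvola formula, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation. The helper names are hypothetical, the dynamic range R = 128 is the conventional value for 8-bit images, and forcing the window to an odd size is a convenience assumed here rather than something the source states.

```python
def sauvola_window_px(window_size_factor, dpi):
    """Window side in pixels: factor * DPI, forced odd so it has a center pixel."""
    side = max(3, int(round(window_size_factor * dpi)))
    return side | 1


def sauvola_threshold(local_mean, local_std, k=0.34, dynamic_range=128.0):
    """Sauvola's local threshold: T = m * (1 + k * (s / R - 1))."""
    return local_mean * (1.0 + k * (local_std / dynamic_range - 1.0))


window = sauvola_window_px(0.33, 300)

# Flat paper region: low contrast pulls the threshold well below the mean,
# so mild texture or shading stays classified as background.
flat_region = sauvola_threshold(local_mean=200.0, local_std=0.0)

# Region containing text edges: high contrast keeps the threshold near the mean.
text_region = sauvola_threshold(local_mean=100.0, local_std=128.0)
print(window, round(flat_region, 1), round(text_region, 1))
```

A larger `k` lowers the threshold more aggressively in low-contrast regions, which suppresses background noise at the risk of thinning faint strokes.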

## Noise Reduction and Smoothing

Before final binarization, Tesseract can apply smoothing to reduce high-frequency noise. The `thresholding_smooth_kernel_size` parameter controls this behavior:

**Implementation location:** [`src/ccmain/tesseractclass.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/tesseractclass.cpp) (lines 107-110) defines the smoothing kernel size used by the Leptonica-Otsu method.

**Effect:**
- Removes isolated pixel spikes that could be misinterpreted as characters
- Applies a small convolution kernel to the threshold score map
- Particularly useful for scanned documents with paper texture or JPEG artifacts

**Configuration:**

```bash
# The smoothing kernel only takes effect with the Leptonica-Otsu method
tesseract noisy.png output -l eng \
  -c thresholding_method=1 \
  -c thresholding_smooth_kernel_size=1.0
```
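
Smoothing a threshold map means averaging each local threshold with its neighbors so that one unusual region does not produce an abrupt jump. The following sketch uses a plain box filter over a small grid of per-tile thresholds; Leptonica's actual kernel handling may differ, and `smooth_threshold_map` is a hypothetical name.

```python
def smooth_threshold_map(grid, radius=1):
    """Mean-filter a 2D list of thresholds; border cells use available neighbors."""
    rows, cols = len(grid), len(grid[0])
    smoothed = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            neighbors = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(sum(neighbors) / len(neighbors))
        smoothed.append(out_row)
    return smoothed


tile_thresholds = [
    [120, 122, 121],
    [119, 200, 120],  # one outlier tile, e.g. a stain or a dark graphic
    [121, 120, 122],
]
smoothed = smooth_threshold_map(tile_thresholds)
print([round(v, 1) for v in smoothed[1]])
```

The outlier tile is pulled toward its neighbors, at the cost of slightly raising the thresholds around it.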

## Handling Photonegatives and Inversion

When processing photonegatives (white text on a black background), the image should be inverted so that text is dark on a light background. The underlying Leptonica call appears in [`src/training/degradeimage.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/degradeimage.cpp) (lines 13-15), which belongs to the training tools rather than the recognition path:

```cpp
// From src/training/degradeimage.cpp
if (invert) {
  pixInvert(pix, pix);  // Leptonica function to flip black/white in place
}
```

**Detection and handling:**
- Tesseract automatically detects inversion in some modes, but explicit control ensures accuracy
- Inversion must occur before binarization to ensure proper threshold calculations
- Critical for microfilm, photographic negatives, or certain medical imaging formats

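**Command-line approach:**
Tesseract does not offer a dedicated command-line flag that inverts the input image (the `tessedit_do_invert` variable only controls whether the LSTM recognizer retries inverted text lines), so the most predictable option is to invert beforehand with Leptonica or ImageMagick:

```bash
# Using ImageMagick to invert before Tesseract
convert negative.png -negate positive.png
tesseract positive.png output -l eng
```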
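
When batches mix normal scans and negatives, a simple heuristic can decide which images to invert. The sketch below assumes that text pages are mostly background, so an image dominated by dark pixels is probably white-on-black. This is an illustrative heuristic with hypothetical names and cutoffs, not Tesseract's detection logic, and it will misfire on pages with large dark graphics.

```python
def looks_inverted(pixels, dark_cutoff=128, majority=0.5):
    """Guess whether 8-bit grayscale pixels show light text on a dark background."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    dark = sum(1 for p in pixels if p < dark_cutoff)
    return dark / len(pixels) > majority


def invert(pixels):
    """Flip 8-bit intensities, the same effect as ImageMagick's -negate."""
    return [255 - p for p in pixels]


# Synthetic negative: 90% dark background, 10% bright text.
negative = [10] * 90 + [245] * 10
page = invert(negative) if looks_inverted(negative) else negative
```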

## Practical Implementation: Code Examples

### C++ API: Explicit Preprocessing Control

For applications requiring fine-grained control, the C++ API exposes the internal preprocessing pipeline:

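```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) {  // Init returns non-zero on failure
    std::fprintf(stderr, "Could not initialize Tesseract\n");
    return 1;
  }

  // Load image (the public API works with Leptonica's Pix*)
  Pix* pix = pixRead("document.tif");
  if (pix == nullptr) {
    std::fprintf(stderr, "Could not read document.tif\n");
    return 1;
  }

  // Configure preprocessing parameters
  api.SetVariable("thresholding_method", "2");          // Sauvola
  api.SetVariable("thresholding_window_size", "0.33");  // window = 0.33 x DPI
  api.SetVariable("thresholding_kfactor", "0.34");      // Sauvola k parameter
  // Only has an effect with thresholding_method 1 (Leptonica-Otsu)
  api.SetVariable("thresholding_smooth_kernel_size", "1.0");

  // Process image
  api.SetImage(pix);
  char* text = api.GetUTF8Text();
  if (text != nullptr) {
    std::printf("%s\n", text);
    delete[] text;  // GetUTF8Text allocates with new[]
  }

  api.End();
  pixDestroy(&pix);
  return 0;
}
```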

### Command-Line: Configuration Variables

For batch processing or scripting, use the `-c` flag to override preprocessing defaults:

```bash
# Standard processing
tesseract scan.png output -l eng

# Optimized for uneven lighting (Sauvola thresholding)
tesseract scan.png output -l eng \
  -c thresholding_method=2 \
  -c thresholding_window_size=0.33 \
  -c thresholding_kfactor=0.34

# With noise reduction for scanned documents
tesseract noisy_scan.png output -l eng \
  -c thresholding_smooth_kernel_size=1.0 \
  -c thresholding_method=1
```
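
For batch jobs it can help to assemble these commands programmatically instead of concatenating strings. The sketch below builds the argument list for the same flags used above (`-l`, `--psm`, `-c name=value`); `build_tesseract_cmd` is a hypothetical helper, and the command is only printed here, not executed.

```python
import shlex


def build_tesseract_cmd(image, output_base, lang="eng", psm=None, variables=None):
    """Return an argv list for the tesseract CLI with -c overrides appended."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


cmd = build_tesseract_cmd(
    "scan.png",
    "output",
    variables={
        "thresholding_method": 2,
        "thresholding_window_size": 0.33,
        "thresholding_kfactor": 0.34,
    },
)
print(shlex.join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True);
# this requires the tesseract binary to be on PATH.
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.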

### Python: Custom Preprocessing with pytesseract

When you need preprocessing beyond Tesseract's built-in capabilities, combine OpenCV or scikit-image with pytesseract:

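```python
import pytesseract
import cv2
import numpy as np
from skimage.filters import threshold_sauvola

# Load image as grayscale
img = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE)
if img is None:
    raise FileNotFoundError('document.jpg could not be read')

# 1. Rescale to a page height of 1000 px (pick the height that gives
#    roughly 300 DPI for your physical page size)
h, w = img.shape
scale_factor = 1000 / h
img_resized = cv2.resize(img, (int(w * scale_factor), 1000),
                         interpolation=cv2.INTER_LINEAR)

# 2. Deskew using the minimum-area rectangle around the text pixels.
#    Otsu with THRESH_BINARY_INV marks dark text as foreground (non-zero).
_, text_mask = cv2.threshold(img_resized, 0, 255,
                             cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
coords = np.column_stack(np.where(text_mask > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
# The angle convention of minAreaRect differs between OpenCV versions; this
# branch assumes the older [-90, 0) range, so verify it on your installation.
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle

(h, w) = img_resized.shape[:2]
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, angle, 1.0)
img_deskew = cv2.warpAffine(img_resized, M, (w, h),
                            flags=cv2.INTER_CUBIC,
                            borderMode=cv2.BORDER_REPLICATE)

# 3. Sauvola binarization (adaptive for uneven lighting)
window_size = int(0.33 * 300) | 1  # 0.33 x 300 DPI; scikit-image needs an odd size
thresh = threshold_sauvola(img_deskew, window_size=window_size, k=0.34)
binary = (img_deskew > thresh).astype(np.uint8) * 255  # text stays black on white

# 4. OCR with pytesseract
text = pytesseract.image_to_string(binary, lang='eng')
print(text)
```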

This approach approximates the stages described above while giving you explicit control over each preprocessing step using Python's computer vision ecosystem.

## Summary

Effective **image preprocessing techniques improve Tesseract OCR accuracy** by normalizing input to match the conditions of the training data. The key optimizations include:

- **Rescaling to 300 DPI** using `ImageData::PreScale` in [`src/ccstruct/imagedata.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/imagedata.cpp) to ensure proper character height for the LSTM models
- **Deskewing text lines** via `TabFind::Deskew` in [`src/textord/tabfind.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/textord/tabfind.cpp) to correct rotational errors that break line segmentation
- **Adaptive binarization** through `ImageThresholder::Threshold` in [`src/ccmain/thresholder.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/thresholder.cpp), selecting **Sauvola** for uneven lighting or **Otsu** for high-contrast documents
- **Noise reduction** by configuring `thresholding_smooth_kernel_size` together with `thresholding_method=1` to smooth the threshold map before character recognition
- **Inversion handling** for photonegatives by inverting the image before OCR, using the same Leptonica `pixInvert` primitive seen in [`src/training/degradeimage.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/degradeimage.cpp), so that text is dark on a light background

## Frequently Asked Questions

### What is the optimal DPI for Tesseract OCR images?

Tesseract’s LSTM models were trained on images with approximately 300 pixels per inch (PPI). The `ImageData::PreScale` function in [`src/ccstruct/imagedata.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/imagedata.cpp) automatically scales input images to achieve this effective resolution. For best results, ensure your source material scales to a text line height of roughly 30-40 pixels, which corresponds to the 300 DPI training standard.

### How does Tesseract handle uneven lighting or shadows?

For documents with uneven illumination, configure **Sauvola thresholding** instead of the default Otsu method. Set `thresholding_method=2` and adjust `thresholding_window_size` (default 0.33, interpreted as a fraction of the image DPI, so roughly 99 pixels at 300 DPI) and `thresholding_kfactor` (default 0.34). The `ImageThresholder::Threshold` implementation in [`src/ccmain/thresholder.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/thresholder.cpp) applies these local adaptive thresholds, calculating mean and standard deviation within sliding windows to compensate for lighting variations.

### Can Tesseract automatically deskew rotated documents?

Yes, Tesseract automatically computes deskew vectors during page layout analysis. The `TabFind::Deskew` function in [`src/textord/tabfind.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/textord/tabfind.cpp) calculates the vertical skew vector using `ComputeDeskewVectors`, then rotates all text blobs and grid structures to horizontal. This happens in page segmentation modes that run full layout analysis (such as PSM 1 or 3), though severe rotations may require manual correction or preprocessing with Leptonica’s `pixDeskew` before passing the image to Tesseract.

### What preprocessing should I apply before calling Tesseract?

Ideally, minimal external preprocessing is needed if you configure Tesseract’s internal pipeline correctly. Ensure you: (1) provide images with sufficient resolution, ideally close to 300 DPI, (2) set appropriate thresholding parameters for your lighting conditions, and (3) enable threshold smoothing (`thresholding_method=1` with `thresholding_smooth_kernel_size=1`) for noisy scans. For specialized cases like photonegatives, invert the image beforehand; the `pixInvert` call in [`src/training/degradeimage.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/training/degradeimage.cpp) shows the Leptonica primitive involved.