How to Implement Page-by-Page OCR Processing for Multi-Page Documents with Tesseract

Tesseract processes multi-page documents by iterating through individual images using the ProcessPage method, with high-level helpers like ProcessPages and ProcessPagesMultipageTiff in src/api/baseapi.cpp handling the pagination logic automatically.

The Tesseract OCR engine recognizes one image at a time. Multi-page support comes from thin wrapper methods in the public API that manage iteration and memory around that single-image core. Understanding these wrappers lets developers implement custom pagination, error handling, and output formatting while still using the recognition pipeline in the tesseract-ocr/tesseract repository.

How Tesseract Detects and Routes Multi-Page Input

The entry point TessBaseAPI::ProcessPages forwards to ProcessPagesInternal (lines 29‑31 of [src/api/baseapi.cpp](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp#L29)). This internal method takes one of three mutually exclusive branches, chosen by input format detection (lines 30‑44):

| Branch | Condition | Implementation |
| --- | --- | --- |
| File list (stdin) | stream_filelist flag is true | Passes the list to ProcessPagesFileList, which reads file names line by line and runs ProcessPage on each one. |
| Multipage TIFF | Format detected as an IFF_TIFF* variant | Calls ProcessPagesMultipageTiff, which extracts each page and invokes ProcessPage. |
| Single image | Any other readable image format (PNG, JPEG, etc.) | Loads the image into a Pix and calls ProcessPage once. |
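The routing decision can be modeled in a few lines of Python. This is an illustrative sketch, not Tesseract's code: the function name `route_input` is hypothetical, the returned strings are just labels, and the TIFF check looks only at the classic 4-byte TIFF signatures rather than running Leptonica's full format detection.

```python
# Illustrative model of the three-way dispatch; not Tesseract's actual code.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")  # little- and big-endian classic TIFF headers


def route_input(header: bytes, stream_filelist: bool = False) -> str:
    """Return a label naming the handler that would process the input."""
    if stream_filelist:
        # The file-list flag wins before any format detection happens.
        return "ProcessPagesFileList"
    if header[:4] in TIFF_MAGIC:
        # TIFFs are always treated as potentially multi-page.
        return "ProcessPagesMultipageTiff"
    # Everything else is loaded as a single image.
    return "ProcessPage"
```

Calling `route_input` with the first bytes of a file shows which branch the input would fall into under these simplified assumptions.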

The Per-Page OCR Pipeline in ProcessPage

TessBaseAPI::ProcessPage implements the actual OCR steps for one page (lines 95‑119 of [baseapi.cpp](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp#L95-L119)):

  1. Set image – SetImage(pix) copies the Leptonica Pix into the internal Thresholder.
  2. Run layout analysis – AnalyseLayout() (or FindLines()) builds the spatial hierarchy (PAGE_RES).
  3. Run recognition – Recognize() executes the neural‑network or legacy recognizer.
  4. Optional retry – If a retry_config file is supplied and the first attempt fails, the method saves the current variable set, applies the retry configuration, runs OCR again, and then restores the saved variables.
  5. Render output – If a TessResultRenderer is passed, renderer->AddImage(this) writes the result in the requested format (hOCR, PDF, TSV, etc.).
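The control flow of these steps can be sketched as a small Python model. It is a simplified illustration of the list above, not a transcription of the C++ source: `process_page`, `recognize`, and `add_image` are hypothetical stand-ins, and the real method may differ in details such as how the outcome of the retry is reported.

```python
from typing import Callable, Optional


def process_page(
    recognize: Callable[[Optional[str]], bool],
    retry_config: Optional[str] = None,
    add_image: Optional[Callable[[], bool]] = None,
) -> bool:
    """Model of the per-page flow; every callable is a hypothetical stand-in.

    recognize(config) returns True on success; config is None on the first attempt.
    add_image() stands in for renderer->AddImage(this).
    """
    ok = recognize(None)              # steps 1-3: set image, layout, recognition
    if not ok and retry_config:
        ok = recognize(retry_config)  # step 4: second attempt with the retry config
    if ok and add_image is not None:
        ok = add_image()              # step 5: hand the result to the renderer
    return ok
```

Passing a stub that fails on the first attempt and succeeds with the retry config exercises the second-chance path without involving the OCR engine.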

All of this is wrapped by a C interface in [src/api/capi.cpp](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp) – e.g. TessBaseAPIProcessPage (lines 400‑408) – so the same behaviour is available from C, C++, Python, Java, etc.

Implementing Manual Page-by-Page Control

Because the high‑level ProcessPages* helpers are thin wrappers, you can drive the per‑page loop yourself:

#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    tesseract::TessBaseAPI api;
    // Adjust the tessdata path to match your installation.
    if (api.Init("/usr/share/tesseract-ocr/4.00/tessdata", "eng") != 0) {
        fprintf(stderr, "Could not initialize Tesseract\n");
        return 1;
    }
    api.SetPageSegMode(tesseract::PSM_AUTO);

    const char *filename = "multi.tif";
    size_t offset = 0;  // Leptonica advances this to the next page; 0 means done
    int page = 0;

    while (Pix *pix = pixReadFromMultipageTiff(filename, &offset)) {
        // ProcessPage returns false when recognition fails for this page.
        if (api.ProcessPage(pix, page, filename, nullptr, 0, nullptr)) {
            char *utf8 = api.GetUTF8Text();
            printf("=== Page %d ===\n%s\n", page + 1, utf8);
            delete[] utf8;  // text from the C++ API is released with delete[]
        } else {
            fprintf(stderr, "OCR failed on page %d\n", page + 1);
        }
        pixDestroy(&pix);
        if (!offset) break;  // no more pages
        ++page;
    }
    api.End();
    return 0;
}

Using the C API for Language Bindings

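If you prefer the C API (e.g. from Python’s ctypes), the equivalent loop is:

import ctypes
from ctypes import POINTER, byref, c_char_p, c_int, c_size_t, c_void_p

tess = ctypes.CDLL('libtesseract.so')
# pixReadFromMultipageTiff and pixDestroy come from Leptonica, not libtesseract.
# The library name varies by platform (it may be libleptonica.so on newer installs).
lept = ctypes.CDLL('liblept.so')

# Function signatures. Declaring pointers as c_void_p prevents ctypes from
# truncating 64-bit addresses to a C int.
tess.TessBaseAPICreate.restype = c_void_p
tess.TessBaseAPIInit3.argtypes = [c_void_p, c_char_p, c_char_p]
tess.TessBaseAPIProcessPage.argtypes = [c_void_p, c_void_p, c_int,
                                        c_char_p, c_char_p, c_int, c_void_p]
# Use c_void_p rather than c_char_p here: ctypes would copy a c_char_p result
# into bytes and lose the pointer that TessDeleteText needs.
tess.TessBaseAPIGetUTF8Text.argtypes = [c_void_p]
tess.TessBaseAPIGetUTF8Text.restype = c_void_p
tess.TessDeleteText.argtypes = [c_void_p]
tess.TessBaseAPIEnd.argtypes = [c_void_p]
tess.TessBaseAPIDelete.argtypes = [c_void_p]
lept.pixReadFromMultipageTiff.argtypes = [c_char_p, POINTER(c_size_t)]
lept.pixReadFromMultipageTiff.restype = c_void_p
lept.pixDestroy.argtypes = [POINTER(c_void_p)]

handle = tess.TessBaseAPICreate()
datapath = b"/usr/share/tesseract-ocr/4.00/tessdata"  # adjust to your installation
if tess.TessBaseAPIInit3(handle, datapath, b"eng") != 0:
    raise RuntimeError("Could not initialize Tesseract")

filename = b"multi.tif"
page = 0
offset = c_size_t(0)

while True:
    pix = c_void_p(lept.pixReadFromMultipageTiff(filename, byref(offset)))
    if not pix:
        break
    if tess.TessBaseAPIProcessPage(handle, pix, page, filename, None, 0, None):
        txt = tess.TessBaseAPIGetUTF8Text(handle)
        print(f"--- Page {page+1} ---\n{ctypes.string_at(txt).decode('utf-8')}")
        tess.TessDeleteText(txt)
    lept.pixDestroy(byref(pix))
    if offset.value == 0:
        break
    page += 1

tess.TessBaseAPIEnd(handle)
tess.TessBaseAPIDelete(handle)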

(See TessBaseAPIProcessPage in [capi.cpp](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp#L404-L408).)

Automatic Multi-Page Output with Renderers

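If you hand a TessResultRenderer to ProcessPages, you get automatic multi‑page output in PDF, hOCR, TSV, etc., without writing any loop logic. (ProcessPage also accepts a renderer, but then the loop and the renderer's BeginDocument and EndDocument calls are your responsibility.)

#include <cstdio>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main() {
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "eng") != 0) {
        fprintf(stderr, "Could not initialize Tesseract\n");
        return 1;
    }
    bool ok;
    {
        // The renderer appends ".pdf" itself, so "out" produces "out.pdf".
        tesseract::TessPDFRenderer renderer("out", api.GetDatapath(), false);
        // ProcessPages iterates over the pages and begins and ends the document.
        ok = api.ProcessPages("multi.tif", nullptr, 0, &renderer);
    }  // the renderer is destroyed here, which closes the output file
    api.End();
    return ok ? 0 : 1;
}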

Renderer implementations live under src/api, for example the PDF renderer in pdfrenderer.cpp and the PAGE XML renderer in [src/api/pagerenderer.cpp](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pagerenderer.cpp). The corresponding factory functions are exposed in the C API (TessPDFRendererCreate, etc.).

Summary

  • Tesseract’s core engine processes single images, but the API layer in src/api/baseapi.cpp provides ProcessPages, ProcessPagesMultipageTiff, and ProcessPage to handle multi-page documents.
  • Input format detection happens in ProcessPagesInternal, which routes TIFFs to specialized multi-page handlers and single images directly to ProcessPage.
  • The per-page pipeline consists of SetImage, AnalyseLayout, Recognize, optional retry logic, and renderer output, all encapsulated in ProcessPage.
  • Manual iteration using pixReadFromMultipageTiff and ProcessPage gives you fine-grained control over memory, error handling, and custom output formats.
  • Renderers (TessPDFRenderer, etc.) automate multi-page output generation when passed to high-level methods.

Frequently Asked Questions

How does Tesseract handle memory between pages when processing multi-page documents?

Tesseract initializes the TessBaseAPI object once, and the language models and dictionaries it loads persist across page iterations. For each page, SetImage copies the new Pix data into the internal Thresholder, so the caller can release its own copy with pixDestroy once the page has been processed. The recognition results (PAGE_RES) are rebuilt on every call to ProcessPage, so nothing carries over from one page to the next while the engine state is reused.

Can I process PDF documents directly using the page-by-page API?

Tesseract does not natively parse PDF files, and ProcessPagesInternal has no PDF branch. To process PDFs page by page, first convert the pages to images with a tool such as pdftoppm or pdf2image, then either feed each image to ProcessPage or list the image paths in a text file and pass that file to ProcessPages. The renderer system can then recombine the OCR results into a searchable PDF via TessPDFRenderer.
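One way to prepare a PDF is to rasterize it with pdftoppm and then write the resulting image paths into a file list. The sketch below only builds the command and the list file. The helper names `pdftoppm_command` and `write_file_list` are hypothetical, and the 300 DPI default is an assumption rather than a Tesseract requirement.

```python
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: str, prefix: str, dpi: int = 300) -> List[str]:
    """Build a pdftoppm command that rasterizes every page to PNG at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def write_file_list(image_paths: List[str], list_path: str) -> str:
    """Write one image path per line, the plain-text file-list form."""
    Path(list_path).write_text("\n".join(image_paths) + "\n", encoding="utf-8")
    return list_path
```

In practice the command would be run with `subprocess.run(pdftoppm_command("scan.pdf", "page"), check=True)`, the generated page images collected in sorted order, and the list file handed to ProcessPages or to the command line as `tesseract pages.txt out pdf`.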

What is the difference between ProcessPages and ProcessPage in the Tesseract API?

ProcessPages is a high-level convenience method that handles entire documents, automatically detecting input formats (single image, multi-page TIFF, or file list) and managing the iteration loop internally. In contrast, ProcessPage is the low-level primitive that performs OCR on a single Pix image, requiring the caller to handle pagination, memory management, and result retrieval manually. Use ProcessPages for standard batch processing and ProcessPage when you need custom error handling, per-page logic, or integration with external image sources.

How do I implement retry logic for failed page recognitions?

The ProcessPage method in src/api/baseapi.cpp supports automatic retry through the retry_config parameter. When the first attempt fails and a retry file was provided, Tesseract saves the current variable set, applies the alternative configuration from the retry file, runs OCR again, and then restores the saved variables. For a manual implementation, check the return value of Recognize() (nonzero indicates failure), adjust tesseract::TessBaseAPI variables using SetVariable(), and re-run Recognize() without reinitializing the API object.
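A retry file is an ordinary Tesseract config file with one `name value` pair per line. The helper below is a minimal sketch for generating one; `write_retry_config` is a hypothetical name, and the choice of `tessedit_pageseg_mode 6` in the usage note is only an example of a fallback setting, not a recommendation from the Tesseract project.

```python
from pathlib import Path
from typing import Dict


def write_retry_config(path: str, variables: Dict[str, str]) -> str:
    """Write a Tesseract config file: one 'name value' pair per line."""
    lines = [f"{name} {value}" for name, value in variables.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For instance, `write_retry_config("retry.cfg", {"tessedit_pageseg_mode": "6"})` produces a file whose path can be passed as the retry_config argument of ProcessPage or TessBaseAPIProcessPage.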
