# How to Implement Page-by-Page OCR Processing for Multi-Page Documents with Tesseract

> Master page-by-page OCR processing for multi-page documents with Tesseract. Learn how to iterate through images using ProcessPage and leverage powerful helpers for seamless pagination.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract processes multi-page documents by iterating through individual images using the `ProcessPage` method, with high-level helpers like `ProcessPages` and `ProcessPagesMultipageTiff` in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) handling the pagination logic automatically.**

The Tesseract OCR engine is designed to process single images, but the public API supports page-by-page processing of multi-page documents through thin wrapper methods that manage iteration and memory. Understanding these internal mechanisms lets developers implement custom pagination, error handling, and output formatting while still using the recognition pipeline in the `tesseract-ocr/tesseract` repository.

## How Tesseract Detects and Routes Multi-Page Input

The entry point `TessBaseAPI::ProcessPages` forwards to `ProcessPagesInternal` (lines 29‑31 of **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp#L29)**). This internal method takes one of three mutually exclusive branches based on input format detection (lines 30‑44):

| Branch | Condition | Implementation |
|--------|-----------|----------------|
| **File-list (stdin)** | `stream_filelist` flag is true | Reads file names line‑by‑line and calls `ProcessPagesFileList` for each entry. |
| **Multipage TIFF** | `format` detected as `IFF_TIFF*` variant | Calls `ProcessPagesMultipageTiff`, which extracts each image page and invokes `ProcessPage`. |
| **Single image** | Any other image format Leptonica can read (PNG, JPEG, BMP, etc.) | Loads the image into a `Pix` and calls `ProcessPage` once. |
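
The routing in the table can be modelled in a few lines of Python. This is a simplified sketch, not Tesseract's code: `route_input` and the branch labels it returns are hypothetical names, and it assumes the format check can be approximated by looking at the classic TIFF signature bytes, whereas Tesseract relies on Leptonica for format detection.

```python
# Classic TIFF files start with one of these 4-byte signatures:
# little-endian "II" or big-endian "MM", followed by the number 42.
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def route_input(header: bytes, stream_filelist: bool = False) -> str:
    """Return a label for the branch an input would take (hypothetical helper)."""
    if stream_filelist:
        # File names arrive one per line; each is processed in turn.
        return "file_list"
    if header[:4] in TIFF_MAGICS:
        # TIFF containers may hold several pages, so they get the paging loop.
        return "multipage_tiff"
    # Everything else is treated as exactly one page.
    return "single_image"


if __name__ == "__main__":
    print(route_input(b"II*\x00" + b"\x00" * 4))
    print(route_input(b"\x89PNG\r\n\x1a\n"))
    print(route_input(b"", stream_filelist=True))
```

The point of the model is that the branches are checked in order and only one is taken per call, which is why a TIFF passed on a file list is still handled through the file-list path first.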

## The Per-Page OCR Pipeline in ProcessPage

`TessBaseAPI::ProcessPage` implements the actual OCR steps for one page (lines 95‑119 of **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp#L95-L119)**):

1. **Set image** – `SetImage(pix)` copies the Leptonica `Pix` into the internal `Thresholder`.
2. **Run layout analysis** – In the layout-only page segmentation modes, `AnalyseLayout()` (or `FindLines()` for OSD-only) runs without character recognition; in all other modes layout analysis happens as part of `Recognize()`.
3. **Run recognition** – `Recognize()` executes the neural‑network (LSTM) or legacy recognizer and builds the result hierarchy (`PAGE_RES`).
4. **Optional retry** – If a `retry_config` file is supplied and the first attempt fails, the method saves the current variable set, applies the retry configuration, runs OCR again, and then restores the saved variables.
5. **Render output** – If a `TessResultRenderer` is passed and the page did not fail, `renderer->AddImage(this)` writes the result in the requested format (hOCR, PDF, TSV, etc.).
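
The control flow of these steps can be sketched as a small Python model. It is only an illustration under stated assumptions: `process_page`, `recognize`, `render`, and `apply_retry_config` are hypothetical callables standing in for the Tesseract calls, and the model assumes a successful retry counts as success for rendering, which should be checked against the actual source before relying on it.

```python
from typing import Callable, Optional


def process_page(
    recognize: Callable[[], bool],
    render: Optional[Callable[[], bool]] = None,
    apply_retry_config: Optional[Callable[[], None]] = None,
) -> bool:
    """Model one page: recognize, optionally retry once, then render."""
    ok = recognize()
    if not ok and apply_retry_config is not None:
        apply_retry_config()  # switch to the alternate settings
        ok = recognize()      # second and final attempt
    if ok and render is not None:
        ok = render()         # a renderer failure also fails the page
    return ok


if __name__ == "__main__":
    attempts = []

    def flaky() -> bool:
        # Fails on the first call, succeeds on the second.
        attempts.append(1)
        return len(attempts) > 1

    print(process_page(flaky, apply_retry_config=lambda: None))
    print(len(attempts))
```

Keeping the retry to a single extra attempt mirrors the "optional retry" step: the page either succeeds with the alternate settings or is reported as failed.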

All of this is wrapped by a C interface in **[`src/api/capi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp)** – e.g. `TessBaseAPIProcessPage` (lines 400‑408) – so the same behaviour is available from C, C++, Python, Java, etc.

## Implementing Manual Page-by-Page Control

Because the high‑level `ProcessPages*` helpers are thin wrappers, you can drive the per‑page loop yourself:

```cpp
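#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
    tesseract::TessBaseAPI api;
    // Init returns 0 on success; adjust the tessdata path for your install.
    if (api.Init("/usr/share/tesseract-ocr/4.00/tessdata", "eng") != 0) {
        fprintf(stderr, "Could not initialize Tesseract\n");
        return 1;
    }
    api.SetPageSegMode(tesseract::PSM_AUTO);

    const char *filename = "multi.tif";
    Pix *pix = nullptr;
    size_t offset = 0;  // Leptonica updates this to locate the next page
    int page = 0;

    while ((pix = pixReadFromMultipageTiff(filename, &offset)) != nullptr) {
        // No retry config, no timeout, no renderer: text is read back below.
        if (api.ProcessPage(pix, page, filename, nullptr, 0, nullptr)) {
            char *utf8 = api.GetUTF8Text();
            printf("=== Page %d ===\n%s\n", page + 1, utf8);
            delete[] utf8;  // GetUTF8Text allocates with new[]
        } else {
            fprintf(stderr, "OCR failed on page %d\n", page + 1);
        }
        pixDestroy(&pix);
        if (!offset) break;  // no more pages
        ++page;
    }
    api.End();
    return 0;
}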
```

* The loop mirrors the implementation in `ProcessPagesMultipageTiff` (lines 54‑86 of **[`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp#L54-L86)**).
* `offset` is the internal byte offset that Leptonica returns for the next page; when it becomes zero the TIFF is exhausted.
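
The termination rule for `offset` can be isolated in a small generator. The sketch below is a model rather than a binding: `iter_pages` and `fake_reader` are hypothetical names, and `read_page(offset)` is assumed to return a `(page, next_offset)` pair in place of Leptonica's pointer-based update.

```python
from typing import Callable, Iterator, Optional, Tuple

# read_page(offset) -> (page_or_None, next_offset); a stand-in for
# pixReadFromMultipageTiff, which updates the offset through a pointer.
ReadPage = Callable[[int], Tuple[Optional[object], int]]


def iter_pages(read_page: ReadPage) -> Iterator[Tuple[int, object]]:
    """Yield (page_index, page) until the reader reports no more pages."""
    offset = 0
    index = 0
    while True:
        page, offset = read_page(offset)
        if page is None:   # read failed or file is empty
            return
        yield index, page
        if offset == 0:    # reader signalled that this was the last page
            return
        index += 1


def fake_reader(offset: int) -> Tuple[Optional[object], int]:
    # Three pages; the last one reports a next offset of 0.
    table = {0: ("page-1", 10), 10: ("page-2", 20), 20: ("page-3", 0)}
    return table[offset]


if __name__ == "__main__":
    for index, page in iter_pages(fake_reader):
        print(index, page)
```

Checking the offset after yielding, not before, matters: the final page is still delivered even though its read already set the offset back to zero.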

## Using the C API for Language Bindings

If you prefer the C API (e.g. from Python’s `ctypes`), the equivalent call is:

```python
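import ctypes
import ctypes.util
from ctypes import POINTER, byref, c_char_p, c_int, c_size_t, c_void_p

# Leptonica is a separate shared library; names vary by platform and version.
tess = ctypes.CDLL(ctypes.util.find_library("tesseract") or "libtesseract.so")
lept = ctypes.CDLL(ctypes.util.find_library("leptonica")
                   or ctypes.util.find_library("lept") or "liblept.so")

# Function signatures. Pointers are declared as c_void_p so they are not
# truncated to 32-bit ints on 64-bit platforms.
tess.TessBaseAPICreate.restype = c_void_p
tess.TessBaseAPIInit3.argtypes = [c_void_p, c_char_p, c_char_p]
tess.TessBaseAPIProcessPage.argtypes = [c_void_p, c_void_p, c_int,
                                        c_char_p, c_char_p, c_int, c_void_p]
# Keep the raw pointer (not c_char_p) so it can be freed with TessDeleteText.
tess.TessBaseAPIGetUTF8Text.argtypes = [c_void_p]
tess.TessBaseAPIGetUTF8Text.restype = c_void_p
tess.TessDeleteText.argtypes = [c_void_p]
tess.TessBaseAPIEnd.argtypes = [c_void_p]
tess.TessBaseAPIDelete.argtypes = [c_void_p]
lept.pixReadFromMultipageTiff.argtypes = [c_char_p, POINTER(c_size_t)]
lept.pixReadFromMultipageTiff.restype = c_void_p
lept.pixDestroy.argtypes = [POINTER(c_void_p)]

handle = tess.TessBaseAPICreate()
datapath = b"/usr/share/tesseract-ocr/4.00/tessdata"
if tess.TessBaseAPIInit3(handle, datapath, b"eng") != 0:
    raise RuntimeError("Could not initialize Tesseract")

filename = b"multi.tif"
page = 0
offset = c_size_t(0)

while True:
    pix = lept.pixReadFromMultipageTiff(filename, byref(offset))
    if not pix:
        break
    tess.TessBaseAPIProcessPage(handle, pix, page, filename, None, 0, None)
    txt = tess.TessBaseAPIGetUTF8Text(handle)
    if txt:
        print(f"--- Page {page+1} ---\n{ctypes.string_at(txt).decode()}")
        tess.TessDeleteText(txt)
    pix_ref = c_void_p(pix)  # pixDestroy expects a Pix** it can null out
    lept.pixDestroy(byref(pix_ref))
    if offset.value == 0:
        break
    page += 1

tess.TessBaseAPIEnd(handle)
tess.TessBaseAPIDelete(handle)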
```

(See `TessBaseAPIProcessPage` in **[`src/api/capi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/capi.cpp#L404-L408)**.)

## Automatic Multi-Page Output with Renderers

If you hand a `TessResultRenderer` to `ProcessPages` (or `ProcessPage`) you get automatic multi‑page output in PDF, hOCR, TSV, etc., without writing any loop logic:

```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main() {
    tesseract::TessBaseAPI api;
    if (api.Init(nullptr, "eng") != 0) {
        fprintf(stderr, "Could not initialize Tesseract\n");
        return 1;
    }
    bool ok = false;
    {
        // The renderer takes an output base name and appends ".pdf" itself,
        // so this writes "out.pdf".
        tesseract::TessPDFRenderer renderer("out", api.GetDatapath(), false);
        // Let Tesseract handle iteration and rendering.
        ok = api.ProcessPages("multi.tif", nullptr, 0, &renderer);
    }  // renderer goes out of scope here, closing the output file
    api.End();
    return ok ? 0 : 1;
}
```

The renderer base class lives in **[`src/api/renderer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/renderer.cpp)** and the PDF renderer in **[`src/api/pdfrenderer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdfrenderer.cpp)**; equivalent factory functions are exposed in the C API (`TessPDFRendererCreate`, etc.).

## Summary

- **Tesseract’s core engine processes single images**, but the API layer in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) provides `ProcessPages`, `ProcessPagesMultipageTiff`, and `ProcessPage` to handle multi-page documents.
- **Input format detection** happens in `ProcessPagesInternal`, which routes TIFFs to specialized multi-page handlers and single images directly to `ProcessPage`.
- **The per-page pipeline** consists of `SetImage`, `AnalyseLayout`, `Recognize`, optional retry logic, and renderer output, all encapsulated in `ProcessPage`.
- **Manual iteration** using `pixReadFromMultipageTiff` and `ProcessPage` gives you fine-grained control over memory, error handling, and custom output formats.
- **Renderers** (`TessPDFRenderer`, etc.) automate multi-page output generation when passed to high-level methods.

## Frequently Asked Questions

### How does Tesseract handle memory between pages when processing multi-page documents?

Tesseract initializes the `TessBaseAPI` object once, loading language models and dictionaries into memory that persist across page iterations. For each page, `SetImage` copies the new `Pix` data into the internal `Thresholder`, while the previous page's image data is released via `pixDestroy`. The recognition results (`PAGE_RES`) are regenerated for each call to `ProcessPage`, ensuring no cross-page contamination while maintaining efficient memory reuse for the engine state.

### Can I process PDF documents directly using the page-by-page API?

Tesseract does not natively parse PDF files, so a PDF cannot be used directly as OCR input. To process PDFs page-by-page, first convert the pages to images using tools like `pdftoppm` or `pdf2image`, then feed each image to `ProcessPage` or list the image paths in a text file for `ProcessPagesFileList`. The renderer system can then recombine the OCR results into a searchable PDF via `TessPDFRenderer`.
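
As a rough illustration of that conversion step, the sketch below builds a `pdftoppm` command line and writes a file list. The helper names `pdftoppm_command` and `write_file_list` are hypothetical, the 300 DPI default is an assumption rather than a recommendation from the source, and the command is only printed here so the block runs without Poppler installed.

```python
import shlex
from pathlib import Path
from typing import Iterable, List


def pdftoppm_command(pdf: str, prefix: str, dpi: int = 300) -> List[str]:
    """Build a pdftoppm invocation that renders every page to PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def write_file_list(images: Iterable[str], list_path: str) -> None:
    """Write one image path per line, the layout a file list expects."""
    text = "".join(f"{name}\n" for name in images)
    Path(list_path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Pass the list to subprocess.run(cmd, check=True) to actually convert.
    cmd = pdftoppm_command("scan.pdf", "page")
    print(shlex.join(cmd))
```

After running the command, the generated PNGs can be collected (for example with a sorted glob on the prefix) and written out with `write_file_list`; the resulting text file can then stand in for an image path on the file-list route.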

### What is the difference between `ProcessPages` and `ProcessPage` in the Tesseract API?

`ProcessPages` is a high-level convenience method that handles entire documents, automatically detecting input formats (single image, multi-page TIFF, or file list) and managing the iteration loop internally. In contrast, `ProcessPage` is the low-level primitive that performs OCR on a single `Pix` image, requiring the caller to handle pagination, memory management, and result retrieval manually. Use `ProcessPages` for standard batch processing and `ProcessPage` when you need custom error handling, per-page logic, or integration with external image sources.

### How do I implement retry logic for failed page recognitions?

The `ProcessPage` method in [`src/api/baseapi.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/baseapi.cpp) supports automatic retry through the `retry_config` parameter. When it is provided and the first recognition attempt fails, Tesseract saves the current variable set, applies the alternative configuration from the retry file, runs OCR again, and then restores the saved variables. For a manual implementation, check the return value of `Recognize()` (non-zero indicates failure), adjust `tesseract::TessBaseAPI` variables using `SetVariable()`, and re-run `Recognize()` without reinitializing the API object.