# How to Build Searchable PDFs with Tesseract: A Complete Guide to TessPDFRenderer

> Learn how to build searchable PDFs with Tesseract using TessPDFRenderer. Embed invisible text layers over images for selectable text without external libraries. Complete guide.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract generates searchable PDFs by using the `TessPDFRenderer` class to embed an invisible text layer over raster images, creating PDFs that contain both the original image and selectable text without requiring external PDF libraries.**

The tesseract-ocr/tesseract repository includes a native PDF rendering pipeline that converts scanned documents into searchable PDF files. This functionality is implemented through the `TessPDFRenderer` class, which streams PDF objects directly to disk while embedding a hidden text layer derived from OCR results. Understanding this architecture enables developers to build searchable PDFs with Tesseract via command-line interfaces, C++ APIs, or automated scripts.

## How Tesseract Generates Searchable PDFs

Tesseract’s PDF generation relies on the **`TessPDFRenderer`** class, which inherits from the abstract **`TessResultRenderer`** interface defined in [`include/tesseract/renderer.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/renderer.h). This design follows a handler pattern where the renderer processes OCR results page-by-page and streams PDF objects to an output file.

The renderer overrides three critical virtual hooks from the base class:

- **`BeginDocumentHandler()`** – Initializes the PDF structure, writes the header, and establishes the document catalog.
- **`AddImageHandler()`** – Processes each input image, optionally embeds the raster data, and generates the invisible text layer using OCR word bounding boxes.
- **`EndDocumentHandler()`** – Finalizes the cross-reference table, writes the trailer, and closes the file stream.

This architecture allows Tesseract to produce PDFs without linking external libraries like LibPDF, keeping the dependency footprint minimal.
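The hook structure can be modelled in a few lines of Python. This is an illustrative sketch only, not Tesseract's code: the class names `ResultRenderer` and `LoggingRenderer`, the `events` list, and the `page_count` attribute are hypothetical, and only the three hook names are borrowed from the C++ interface.

```python
from abc import ABC, abstractmethod


class ResultRenderer(ABC):
    """Hypothetical stand-in for the abstract renderer base class."""

    def begin_document(self, title):
        # Public entry point: reset state, then delegate to the hook.
        self.page_count = 0
        return self.begin_document_handler(title)

    def add_image(self, page):
        # Count the page only if the subclass hook reports success.
        ok = self.add_image_handler(page)
        if ok:
            self.page_count += 1
        return ok

    def end_document(self):
        return self.end_document_handler()

    @abstractmethod
    def begin_document_handler(self, title):
        """Write any document preamble."""

    @abstractmethod
    def add_image_handler(self, page):
        """Write the output for one page."""

    @abstractmethod
    def end_document_handler(self):
        """Write any document trailer."""


class LoggingRenderer(ResultRenderer):
    """Toy subclass that records which hooks were called, in order."""

    def __init__(self):
        self.events = []

    def begin_document_handler(self, title):
        self.events.append(f"begin:{title}")
        return True

    def add_image_handler(self, page):
        self.events.append(f"page:{page}")
        return True

    def end_document_handler(self):
        self.events.append("end")
        return True
```

A caller drives the renderer with `begin_document`, one `add_image` per page, then `end_document`; a subclass only has to fill in the three hooks.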

## Configuration and Renderer Selection

Tesseract activates PDF generation through the boolean configuration variable **`tessedit_create_pdf`**. When this variable is set to `1`, the main driver instantiates `TessPDFRenderer` instead of the default text renderer.

The configuration flow works as follows:

1. **Config File** – The file `tessdata/configs/pdf` contains the line `tessedit_create_pdf 1`, allowing users to trigger PDF mode by passing `pdf` as a config argument.

2. **Driver Logic** – In [`src/tesseract.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/tesseract.cpp) (around lines 562–570), the code checks the variable using `api.GetBoolVariable("tessedit_create_pdf", &b)`. When true, it constructs the renderer:

   ```cpp
   renderer = new tesseract::TessPDFRenderer(output_base, api.GetDatapath(), textonly);
   ```

3. **Text-Only Mode** – The optional boolean **`textonly_pdf`** (set via `-c textonly_pdf=1`, in addition to enabling PDF output) instructs the renderer to omit images and include only the invisible text layer, significantly reducing file size.
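The configuration flow above can be approximated in Python. This is a rough model under stated assumptions, not the driver's actual logic: `parse_config` and `select_renderers` are hypothetical helpers, the `name value` line format is taken from the `tessedit_create_pdf 1` example, and skipping `#` comment lines is an assumption.

```python
def parse_config(text):
    """Parse 'name value' lines in the style of tessdata/configs/pdf."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        parts = line.split(None, 1)
        variables[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return variables


def select_renderers(variables):
    """Return (renderer name, options) pairs chosen from the variables."""
    renderers = []
    if variables.get("tessedit_create_pdf") == "1":
        textonly = variables.get("textonly_pdf") == "1"
        renderers.append(("pdf", {"textonly": textonly}))
    if not renderers:
        # Fall back to plain text when no other output was requested.
        renderers.append(("txt", {}))
    return renderers
```

Note that in this model `textonly_pdf` has no effect on its own: it only changes how the PDF renderer is configured once `tessedit_create_pdf` is enabled.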

## The PDF Rendering Pipeline

The `TessPDFRenderer` implementation in [`src/api/pdfrenderer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdfrenderer.cpp) generates the PDF document structure itself and streams content to the output as it goes, which keeps memory usage low.

### Document Lifecycle Hooks

The renderer manages PDF state through three overridden methods:

- **`BeginDocumentHandler()`** writes the PDF header (`%PDF-1.5`), creates the document catalog, and initializes the object counter (`obj_`) and offset tracker (`offsets_`).

- **`AddImageHandler()`** processes each page sequentially. It retrieves the OCR results from the `TessBaseAPI` object and coordinates image embedding with text layer generation.

- **`EndDocumentHandler()`** writes the cross-reference table using the stored `offsets_`, followed by the trailer dictionary containing the `/Root` and `/Size` entries.

### Object Management and Streaming

To keep the memory footprint low when processing multi-page documents, the renderer uses a streaming architecture:

- **`obj_`** – An integer counter tracking PDF object numbers as they are written sequentially to the output stream.
- **`offsets_`** – A vector storing byte positions of each object, enabling the cross-reference table to be built at the end of the file without random access seeks.

This design allows Tesseract to process arbitrarily large documents without loading all pages into memory simultaneously.
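The counter-plus-offsets idea can be shown with a toy writer. This is a minimal sketch, assuming a plain uncompressed cross-reference table; `StreamingPdfWriter` and its methods are hypothetical names, and the three objects written at the end form only a bare one-page skeleton rather than anything Tesseract emits.

```python
import io


class StreamingPdfWriter:
    """Toy writer: append objects in order and remember each byte offset."""

    def __init__(self, out):
        self.out = out
        self.offsets = [0]  # slot 0 is the reserved free entry
        self.written = 0
        self._write(b"%PDF-1.5\n")

    def _write(self, data):
        self.out.write(data)
        self.written += len(data)

    def add_object(self, body):
        """Write one indirect object and return its object number."""
        num = len(self.offsets)
        self.offsets.append(self.written)
        self._write(b"%d 0 obj\n" % num + body + b"\nendobj\n")
        return num

    def finish(self, root_num):
        """Write the xref table and trailer from the recorded offsets."""
        xref_pos = self.written
        self._write(b"xref\n0 %d\n" % len(self.offsets))
        self._write(b"0000000000 65535 f \n")
        for off in self.offsets[1:]:
            self._write(b"%010d 00000 n \n" % off)
        self._write(b"trailer\n<< /Size %d /Root %d 0 R >>\n"
                    % (len(self.offsets), root_num))
        self._write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)


buf = io.BytesIO()
w = StreamingPdfWriter(buf)
catalog = w.add_object(b"<< /Type /Catalog /Pages 2 0 R >>")
pages = w.add_object(b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
page = w.add_object(b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>")
w.finish(catalog)
```

Because every offset is recorded at write time, the writer never needs to seek backwards or keep earlier objects in memory.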

### Invisible Text Layer Generation

The searchable capability relies on embedding invisible text positioned precisely over the original image content:

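- **`GetPDFTextObjects()`** – Converts OCR word bounding boxes into PDF text objects. In simplified form, a text object looks like `BT 3 Tr /F1 12 Tf x y Td (text) Tj ET`: `BT`/`ET` begin and end the text object, `3 Tr` selects text rendering mode 3 (neither filled nor stroked, which is what makes the text invisible), `Tf` sets the font and size, and `Td` positions the text using the word coordinates from the OCR engine. The operands the renderer actually writes differ in detail.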

- **Custom Font** – The renderer embeds a minimal TrueType font defined in [`src/api/pdf_ttf.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdf_ttf.h) (the `pdf.ttf` binary data). This ensures the hidden text is selectable in any PDF viewer without external font dependencies.
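To make the coordinate handling concrete, here is a small Python sketch that turns word boxes into an invisible text object. It is an approximation under several assumptions: `invisible_text_object` and `escape_pdf_string` are hypothetical helpers, boxes are `(text, left, top, right, bottom)` in image pixels with the origin at the top left, the font name `/F1` is a placeholder, and literal strings are used, which only suits simple Latin text. It positions each word with an absolute `Tm` matrix rather than a relative `Td` offset so the words do not depend on one another.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_object(words, image_height_px, ppi, font="/F1"):
    """Build one PDF text object from pixel-space word boxes."""
    scale = 72.0 / ppi  # PDF user space is measured in points (1/72 inch)
    lines = ["BT", "3 Tr"]  # 3 Tr: render mode 3, neither fill nor stroke
    for text, left, top, right, bottom in words:
        size = max((bottom - top) * scale, 1.0)
        x = left * scale
        # PDF's origin is the bottom-left corner, so flip the y axis.
        y = (image_height_px - bottom) * scale
        lines.append(f"{font} {size:.2f} Tf")
        lines.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)
```

For a 300 ppi scan, a word whose box starts 300 pixels from the left edge lands 72 points (one inch) from the left edge of the page.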

### Image Embedding

For each processed page, the renderer optionally embeds the source raster image:

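- **`imageToPDFObj()`** – Converts the Leptonica `Pix` structure into a PDF image XObject. The encoding depends on the source image: JPEG data is stored under the `/DCTDecode` filter, while other content is typically stored with a lossless filter such as `/FlateDecode` or, for bitonal images, `/CCITTFaxDecode`. The function writes the compressed image data directly into the PDF object stream, avoiding intermediate file creation.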

When the `textonly_pdf` flag is enabled, this step is skipped, producing a document containing only the invisible text layer.
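As a rough illustration of what an image XObject contains, the following Python sketch wraps raw 8-bit grayscale pixels in a Flate-compressed stream. It is not how Tesseract or Leptonica builds the object: `gray_image_xobject` is a hypothetical helper, and the fixed `/DeviceGray` colour space and `/FlateDecode` filter are simplifying assumptions.

```python
import zlib


def gray_image_xobject(pixels, width, height):
    """Return the body of an 8-bit grayscale image XObject."""
    if len(pixels) != width * height:
        raise ValueError("pixel buffer does not match dimensions")
    compressed = zlib.compress(pixels)
    # The dictionary describes the image; /Length counts the stream bytes.
    header = (
        b"<< /Type /XObject /Subtype /Image"
        b" /Width %d /Height %d"
        b" /ColorSpace /DeviceGray /BitsPerComponent 8"
        b" /Filter /FlateDecode /Length %d >>\n"
        % (width, height, len(compressed))
    )
    return header + b"stream\n" + compressed + b"\nendstream"
```

In a text-only document this object is simply never written, and the page's content stream contains only the text operators.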

## Practical Implementation Examples

### Command-Line Usage

The simplest method to build searchable PDFs with Tesseract uses the built-in `pdf` configuration:

```bash
# Generate searchable PDF from image
tesseract input.png output -l eng pdf

# Explicitly enable PDF creation via configuration variable
tesseract input.png output -c tessedit_create_pdf=1

# Create text-only PDF (smaller file, no images); PDF output must still be enabled
tesseract input.png output -c textonly_pdf=1 pdf
```

The `pdf` argument loads the configuration file `tessdata/configs/pdf`, which sets `tessedit_create_pdf=1` automatically.

### C++ API Integration

For applications requiring programmatic control, instantiate `TessPDFRenderer` directly after initializing the API:

```cpp
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <leptonica/allheaders.h>

#include <memory>

int main() {
  tesseract::TessBaseAPI api;
  // Init returns 0 on success.
  if (api.Init(nullptr, "eng") != 0) {
    return 1;
  }
  api.SetVariable("tessedit_create_pdf", "1");
  api.SetVariable("textonly_pdf", "0");

  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    return 1;
  }
  // Tell the API which file the image came from, then hand over the pixels.
  api.SetInputName("input.png");
  api.SetImage(image);

  // Construct renderer: output base name, data path for fonts, textonly flag
  std::unique_ptr<tesseract::TessPDFRenderer> renderer(
      new tesseract::TessPDFRenderer("output", api.GetDatapath(), false));

  // Run OCR, then write a one-page document to output.pdf.
  bool ok = api.Recognize(nullptr) == 0 &&
            renderer->BeginDocument("OCR Document") &&
            renderer->AddImage(&api) &&
            renderer->EndDocument();

  pixDestroy(&image);
  return ok ? 0 : 1;
}
```

This approach gives direct access to the `TessPDFRenderer` constructor defined in [`include/tesseract/renderer.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/renderer.h), allowing control over output paths and font data locations.

### Python Automation

While `pytesseract` does not expose the renderer directly, you can invoke the Tesseract binary with PDF parameters:

```python
import subprocess


def create_searchable_pdf(image_path, output_base, lang="eng", textonly=False):
    """
    Convert image to searchable PDF using Tesseract CLI.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]

    # Options go before config file names such as "pdf".
    if textonly:
        cmd.extend(["-c", "textonly_pdf=1"])
    cmd.append("pdf")

    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
    return f"{output_base}.pdf"


# Usage
create_searchable_pdf("scan.png", "document", textonly=False)
```

This script relies on the `tessedit_create_pdf` configuration triggered by the `pdf` argument and returns the path of the generated PDF.

## Key Source Files and Architecture Reference

The searchable PDF functionality is implemented across the following source files:

- **[`include/tesseract/renderer.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/renderer.h)** – Declares the `TessResultRenderer` abstract base class and the `TessPDFRenderer` implementation, including constructor signatures and virtual hook methods.

- **[`src/api/pdfrenderer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdfrenderer.cpp)** – Contains the complete PDF generation logic, including `TessPDFRenderer::BeginDocumentHandler()`, `AddImageHandler()`, and `EndDocumentHandler()`, plus helper functions `GetPDFTextObjects()` and `imageToPDFObj()`.

- **[`src/tesseract.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/tesseract.cpp)** – Main driver application that checks the `tessedit_create_pdf` boolean variable (around lines 562–570) and conditionally instantiates `TessPDFRenderer`.

- **`tessdata/configs/pdf`** – Configuration file that sets `tessedit_create_pdf 1`, enabling PDF output mode when specified on the command line.

- **[`src/api/pdf_ttf.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdf_ttf.h)** – Header file containing the embedded binary data for `pdf.ttf`, the custom TrueType font used for the invisible text layer.

These components work together to stream PDF objects directly to disk, eliminating dependencies on external PDF libraries.

## Summary

- **Tesseract builds searchable PDFs natively** using the `TessPDFRenderer` class, which implements the `TessResultRenderer` interface without requiring external PDF libraries.
- **Configuration is controlled** by the `tessedit_create_pdf` variable, typically set via the `pdf` config file or command-line flags.
- **The rendering pipeline** uses three lifecycle hooks (`BeginDocumentHandler`, `AddImageHandler`, `EndDocumentHandler`) to stream PDF objects, manage cross-reference tables, and embed both images and invisible text layers.
- **Text searchability** is achieved by positioning OCR word boxes as invisible PDF text objects using a custom embedded font (`pdf.ttf`) defined in [`src/api/pdf_ttf.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdf_ttf.h).
- **Implementation options** include direct command-line usage, C++ API integration via [`include/tesseract/renderer.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/renderer.h), or wrapper scripts that invoke the Tesseract binary.

## Frequently Asked Questions

### Does Tesseract require LibPDF or external libraries to create searchable PDFs?

No. Tesseract generates searchable PDFs using its own native `TessPDFRenderer` class implemented in [`src/api/pdfrenderer.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdfrenderer.cpp). The renderer streams PDF objects directly to disk and manages the document structure internally, eliminating dependencies on external libraries like LibPDF or PDFium. The font used for the invisible text layer (`pdf.ttf`) is compiled into Tesseract from [`src/api/pdf_ttf.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdf_ttf.h), so no separate font installation is needed; language data for OCR itself is still required.

### What is the difference between a regular PDF and a searchable PDF created by Tesseract?

In this context, a regular PDF means an image-only scan: it contains only raster image data, so its text cannot be selected or indexed. A searchable PDF created by Tesseract contains both the original image (unless `textonly_pdf` is enabled) and an invisible text layer positioned precisely over the image content. This hidden layer uses standard PDF text operators (`BT`/`ET` blocks) with coordinates derived from OCR word bounding boxes, allowing users to select, copy, and search the text while viewing the original scanned image.

### How can I create a text-only PDF without images using Tesseract?

Set the boolean configuration variable `textonly_pdf` to `1` before processing, with PDF output enabled. On the command line, pass `-c textonly_pdf=1` before the `pdf` config argument. In the C++ API, call `api.SetVariable("textonly_pdf", "1")` when the driver reads the variable, or pass `true` as the third constructor argument when you construct `TessPDFRenderer` yourself. When enabled, the `TessPDFRenderer` skips the `imageToPDFObj` step in `AddImageHandler()`, omitting the raster image data and including only the invisible text layer. This significantly reduces file size but removes the visual representation of the original document.

### What font does Tesseract use for the invisible text layer in PDFs?

Tesseract embeds a custom minimal TrueType font named `pdf.ttf`, which is compiled directly into the binary as a byte array in [`src/api/pdf_ttf.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/api/pdf_ttf.h). This font is used exclusively for the invisible text layer to ensure text selectability across all PDF viewers without relying on system fonts. The renderer references this font by its resource name within each text block (written schematically as `/F1` in this guide), allowing the hidden text to occupy the coordinates determined by the OCR engine during recognition.