How to Build Searchable PDFs with Tesseract: A Complete Guide to TessPDFRenderer

Tesseract generates searchable PDFs by using the TessPDFRenderer class to embed an invisible text layer over raster images, creating PDFs that contain both the original image and selectable text without requiring external PDF libraries.

The tesseract-ocr/tesseract repository includes a native PDF rendering pipeline that converts scanned documents into searchable PDF files. This functionality is implemented through the TessPDFRenderer class, which streams PDF objects directly to disk while embedding a hidden text layer derived from OCR results. Understanding this architecture enables developers to build searchable PDFs with Tesseract via command-line interfaces, C++ APIs, or automated scripts.

How Tesseract Generates Searchable PDFs

Tesseract’s PDF generation relies on the TessPDFRenderer class, which inherits from the abstract TessResultRenderer interface defined in include/tesseract/renderer.h. This design follows a handler pattern where the renderer processes OCR results page-by-page and streams PDF objects to an output file.

The renderer overrides three critical virtual hooks from the base class:

  • BeginDocumentHandler() – Initializes the PDF structure, writes the header, and establishes the document catalog.
  • AddImageHandler() – Processes each input image, optionally embeds the raster data, and generates the invisible text layer using OCR word bounding boxes.
  • EndDocumentHandler() – Finalizes the cross-reference table, writes the trailer, and closes the file stream.

This architecture allows Tesseract to produce PDFs without linking external libraries like LibPDF, keeping the dependency footprint minimal.
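The hook structure described above can be pictured with a small Python analogue. This is only a sketch of the pattern, not Tesseract's code: the class names ResultRenderer and LoggingRenderer and all method names are hypothetical, chosen to mirror the three hooks.

```python
from abc import ABC, abstractmethod


class ResultRenderer(ABC):
    """Hypothetical base class: public methods manage state, hooks do the work."""

    def __init__(self):
        self.pages = 0
        self.active = False

    def begin_document(self, title):
        self.active = True
        return self.begin_document_handler(title)

    def add_image(self, page):
        if not self.active:
            return False  # pages are only accepted inside an open document
        self.pages += 1
        return self.add_image_handler(page)

    def end_document(self):
        if not self.active:
            return False
        self.active = False
        return self.end_document_handler()

    @abstractmethod
    def begin_document_handler(self, title):
        ...

    @abstractmethod
    def add_image_handler(self, page):
        ...

    @abstractmethod
    def end_document_handler(self):
        ...


class LoggingRenderer(ResultRenderer):
    """Toy subclass that records which hook ran, in order."""

    def __init__(self):
        super().__init__()
        self.log = []

    def begin_document_handler(self, title):
        self.log.append(f"begin:{title}")
        return True

    def add_image_handler(self, page):
        self.log.append(f"page {self.pages}:{page}")
        return True

    def end_document_handler(self):
        self.log.append(f"end:{self.pages} pages")
        return True


renderer = LoggingRenderer()
renderer.begin_document("OCR Document")
for page in ["scan1.png", "scan2.png"]:
    renderer.add_image(page)
renderer.end_document()
print(renderer.log)
```

The base class owns the bookkeeping (open/closed state, page count), so each concrete renderer only has to say what happens at the start, per page, and at the end.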

Configuration and Renderer Selection

Tesseract activates PDF generation through the boolean configuration variable tessedit_create_pdf. When this variable is set to 1, the main driver instantiates TessPDFRenderer instead of the default text renderer.

The configuration flow works as follows:

  1. Config File – The file tessdata/configs/pdf contains the line tessedit_create_pdf 1, allowing users to trigger PDF mode by passing pdf as a config argument.

  2. Driver Logic – In src/tesseract.cpp (around lines 562–570), the code checks the variable using api.GetBoolVariable("tessedit_create_pdf", &b). When true, it constructs the renderer:

    renderer = new tesseract::TessPDFRenderer(output_base, api.GetDatapath(), textonly);

  3. Text-Only Mode – The optional boolean textonly_pdf (set via -c textonly_pdf=1) instructs the renderer to omit images and include only the invisible text layer, significantly reducing file size.
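To make the flow concrete, here is a simplified Python model of how a config file and -c overrides could combine to pick a renderer. The functions parse_config and select_renderer are hypothetical illustrations; the real driver logic is C++ and handles more variables and value formats than this sketch does.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict, skipping blanks and # comments."""
    variables = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0].startswith("#"):
            continue
        variables[parts[0]] = parts[1].strip()
    return variables


def select_renderer(variables):
    """Return a (kind, options) pair based on the two variables from the text.

    Simplification: only the literal value "1" counts as true here.
    """
    if variables.get("tessedit_create_pdf") != "1":
        return ("text", {})
    return ("pdf", {"textonly": variables.get("textonly_pdf") == "1"})


# Content of the pdf config file as described above
variables = parse_config("tessedit_create_pdf 1\n")
# Equivalent of passing -c textonly_pdf=1 on the command line
variables.update({"textonly_pdf": "1"})
print(select_renderer(variables))
```

Note that textonly_pdf only matters once tessedit_create_pdf is on: with an empty variable set, the sketch falls back to text output regardless of the text-only flag.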

The PDF Rendering Pipeline

The TessPDFRenderer implementation in src/api/pdfrenderer.cpp handles the complex task of generating valid PDF documents while streaming content to minimize memory usage.

Document Lifecycle Hooks

The renderer manages PDF state through three overridden methods:

  • BeginDocumentHandler() writes the PDF header (%PDF-1.5), creates the document catalog, and initializes the object counter (obj_) and offset tracker (offsets_).

  • AddImageHandler() processes each page sequentially. It retrieves the OCR results from the TessBaseAPI object and coordinates image embedding with text layer generation.

  • EndDocumentHandler() writes the cross-reference table using the stored offsets_, followed by the trailer dictionary containing the /Root and /Size entries.

Object Management and Streaming

To maintain low memory footprint when processing multi-page documents, the renderer uses a streaming architecture:

  • obj_ – An integer counter tracking PDF object numbers as they are written sequentially to the output stream.
  • offsets_ – A vector storing byte positions of each object, enabling the cross-reference table to be built at the end of the file without random access seeks.

This design allows Tesseract to process arbitrarily large documents without loading all pages into memory simultaneously.
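The counter-plus-offsets idea can be sketched in a few lines of Python. This is an illustrative toy, not a port of pdfrenderer.cpp: the StreamingPdfWriter class and its method names are assumptions made for the example, and it writes an empty document with no pages.

```python
import io


class StreamingPdfWriter:
    """Toy writer: append objects in order, remember offsets, emit xref last."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = [0]  # index 0 is the reserved free entry of the xref table
        self.position = 0   # bytes written so far; avoids calling tell() or seek()

    def _write(self, data):
        self.stream.write(data)
        self.position += len(data)

    def begin(self):
        self._write(b"%PDF-1.5\n")

    def add_object(self, body):
        number = len(self.offsets)          # next object number
        self.offsets.append(self.position)  # where this object starts
        self._write(b"%d 0 obj\n" % number + body + b"\nendobj\n")
        return number

    def end(self, root):
        xref_start = self.position
        self._write(b"xref\n0 %d\n" % len(self.offsets))
        self._write(b"0000000000 65535 f \n")
        for offset in self.offsets[1:]:
            # Each xref entry is exactly 20 bytes long
            self._write(b"%010d 00000 n \n" % offset)
        self._write(b"trailer\n<< /Size %d /Root %d 0 R >>\n"
                    % (len(self.offsets), root))
        self._write(b"startxref\n%d\n%%%%EOF\n" % xref_start)


buffer = io.BytesIO()
writer = StreamingPdfWriter(buffer)
writer.begin()
catalog = writer.add_object(b"<< /Type /Catalog /Pages 2 0 R >>")
pages = writer.add_object(b"<< /Type /Pages /Kids [] /Count 0 >>")
writer.end(root=catalog)
data = buffer.getvalue()
print(data.decode("ascii"))
```

Because the writer keeps its own byte counter, it never needs to seek backwards, which is what makes the approach suitable for streaming output one page at a time.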

Invisible Text Layer Generation

The searchable capability relies on embedding invisible text positioned precisely over the original image content:

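  • GetPDFTextObjects() – Converts OCR word bounding boxes into PDF text objects. In simplified form, each word becomes something like BT /F1 12 Tf x y Td (text) Tj ET, where BT/ET mark the beginning and end of a text object, Tf selects the font and size, Td positions the text using the word coordinates from the OCR engine, and Tj shows the string. The text stays invisible because it is drawn in PDF text rendering mode 3 (the 3 Tr operator), which neither fills nor strokes the glyphs.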

  • Custom Font – The renderer embeds a minimal TrueType font defined in src/api/pdf_ttf.h (the pdf.ttf binary data). This ensures the hidden text is selectable in any PDF viewer without external font dependencies.
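As a rough illustration of the coordinate arithmetic involved, the sketch below turns one word box into a simplified invisible text object. It is not what GetPDFTextObjects() actually emits: the function names, the /F1 font name, the use of the box bottom as the baseline, and the ASCII-only literal string are all assumptions made to keep the example short.

```python
def escape_pdf_literal(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def word_to_text_object(word, box, page_height_px, dpi=300, font="/F1"):
    """Build a simplified invisible text object for one OCR word.

    box is (left, top, right, bottom) in image pixels with a top-left origin.
    PDF user space uses points (1/72 inch) with a bottom-left origin, so the
    y coordinate is flipped and both axes are scaled by 72 / dpi.
    """
    left, top, right, bottom = box
    scale = 72.0 / dpi
    x = left * scale
    y = (page_height_px - bottom) * scale  # box bottom used as the baseline
    size = (bottom - top) * scale          # font size approximated by box height
    return (f"BT 3 Tr {font} {size:.2f} Tf {x:.2f} {y:.2f} Td "
            f"({escape_pdf_literal(word)}) Tj ET")


# A word one inch from the left edge on an 11-inch page scanned at 300 dpi
print(word_to_text_object("Hello", (300, 600, 600, 650), page_height_px=3300))
```

With these numbers the scale factor is 0.24, so the 50-pixel-tall box becomes 12-point text placed at x = 72 and y = 636 points. The 3 Tr operator is what keeps the glyphs from being painted.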

Image Embedding

For each processed page, the renderer optionally embeds the source raster image:

  • imageToPDFObj() – Converts the Leptonica Pix structure into a PDF image XObject. Compression is delegated to Leptonica, which picks an encoding suited to the image, typically JPEG for photographic content and a lossless encoding for bitonal content. The compressed image data is written directly into the PDF object stream, avoiding intermediate file creation.

When the textonly_pdf flag is enabled, this step is skipped, producing a document containing only the invisible text layer.

Practical Implementation Examples

Command-Line Usage

The simplest method to build searchable PDFs with Tesseract uses the built-in pdf configuration:


# Generate searchable PDF from image
tesseract input.png output -l eng pdf

# Explicitly enable PDF creation via configuration variable
tesseract input.png output -c tessedit_create_pdf=1

# Create text-only PDF (smaller file, no images); PDF output must still be enabled
tesseract input.png output -c textonly_pdf=1 pdf

The pdf argument loads the configuration file tessdata/configs/pdf, which sets tessedit_create_pdf=1 automatically.

C++ API Integration

For applications requiring programmatic control, instantiate TessPDFRenderer directly after initializing the API:

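#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <leptonica/allheaders.h>

#include <cstdio>
#include <memory>

int main() {
  tesseract::TessBaseAPI api;
  // Init returns non-zero on failure (for example, missing language data)
  if (api.Init(nullptr, "eng") != 0) {
    std::fprintf(stderr, "Could not initialize Tesseract\n");
    return 1;
  }
  api.SetVariable("tessedit_create_pdf", "1");
  api.SetVariable("textonly_pdf", "0");

  Pix *image = pixRead("input.png");
  if (image == nullptr) {
    std::fprintf(stderr, "Could not read input.png\n");
    return 1;
  }
  api.SetImage(image);

  // Construct renderer: output base name, data path for fonts, textonly flag
  std::unique_ptr<tesseract::TessPDFRenderer> renderer(
      new tesseract::TessPDFRenderer("output", api.GetDatapath(), false));

  // Run OCR, then write a one-page document to output.pdf
  api.Recognize(nullptr);
  renderer->BeginDocument("OCR Document");
  renderer->AddImage(&api);
  renderer->EndDocument();

  // Release OCR resources before freeing the image
  api.End();
  pixDestroy(&image);
  return 0;
}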

This approach gives direct access to the TessPDFRenderer constructor defined in include/tesseract/renderer.h, allowing control over output paths and font data locations.

Python Automation

While pytesseract does not expose the renderer directly, you can invoke the Tesseract binary with PDF parameters:

import subprocess

def create_searchable_pdf(image_path, output_base, lang="eng", textonly=False):
    """
    Convert image to searchable PDF using Tesseract CLI.
    """
    cmd = [
        "tesseract",
        image_path,
        output_base,
        "-l", lang,
        "pdf"
    ]
    
    if textonly:
        cmd.extend(["-c", "textonly_pdf=1"])
    
    subprocess.run(cmd, check=True)
    return f"{output_base}.pdf"

# Usage

create_searchable_pdf("scan.png", "document", textonly=False)

This script leverages the tessedit_create_pdf configuration triggered by the pdf argument and returns the path of the searchable PDF that Tesseract writes.
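For multi-page output, one option is to pass Tesseract a plain-text file that lists one image path per line. The sketch below assumes your Tesseract build accepts such a list file as its input argument; the helper name build_multipage_command is hypothetical, and the example only builds and prints the command rather than running it.

```python
import os
import tempfile


def build_multipage_command(image_paths, output_base, lang="eng"):
    """Write the image paths to a temporary list file and build the CLI command.

    Returns (cmd, list_path). The caller runs cmd and removes list_path.
    """
    fd, list_path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(image_paths) + "\n")
    cmd = ["tesseract", list_path, output_base, "-l", lang, "pdf"]
    return cmd, list_path


cmd, list_path = build_multipage_command(["page1.png", "page2.png"], "book")
with open(list_path) as handle:
    listed = handle.read().split()
print(cmd)
print(listed)
# With Tesseract installed and the images present, the next step would be:
#   subprocess.run(cmd, check=True)   # expected to write book.pdf
os.remove(list_path)
```

Keeping command construction separate from execution makes the helper easy to test on machines where Tesseract is not installed.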

Key Source Files and Architecture Reference

The searchable PDF functionality is implemented across the following source files:

  • include/tesseract/renderer.h – Declares the TessResultRenderer abstract base class and the TessPDFRenderer implementation, including constructor signatures and virtual hook methods.

  • src/api/pdfrenderer.cpp – Contains the complete PDF generation logic, including TessPDFRenderer::BeginDocumentHandler(), AddImageHandler(), and EndDocumentHandler(), plus helper functions GetPDFTextObjects() and imageToPDFObj().

  • src/tesseract.cpp – Main driver application that checks the tessedit_create_pdf boolean variable (around lines 562–570) and conditionally instantiates TessPDFRenderer.

  • tessdata/configs/pdf – Configuration file that sets tessedit_create_pdf 1, enabling PDF output mode when specified on the command line.

  • src/api/pdf_ttf.h – Header file containing the embedded binary data for pdf.ttf, the custom TrueType font used for the invisible text layer.

These components work together to stream PDF objects directly to disk, eliminating dependencies on external PDF libraries.

Summary

  • Tesseract builds searchable PDFs natively using the TessPDFRenderer class, which implements the TessResultRenderer interface without requiring external PDF libraries.
  • Configuration is controlled by the tessedit_create_pdf variable, typically set via the pdf config file or command-line flags.
  • The rendering pipeline uses three lifecycle hooks (BeginDocumentHandler, AddImageHandler, EndDocumentHandler) to stream PDF objects, manage cross-reference tables, and embed both images and invisible text layers.
  • Text searchability is achieved by positioning OCR word boxes as invisible PDF text objects using a custom embedded font (pdf.ttf) defined in src/api/pdf_ttf.h.
  • Implementation options include direct command-line usage, C++ API integration via include/tesseract/renderer.h, or wrapper scripts that invoke the Tesseract binary.

Frequently Asked Questions

Does Tesseract require LibPDF or external libraries to create searchable PDFs?

No. Tesseract generates searchable PDFs using its own native TessPDFRenderer class implemented in src/api/pdfrenderer.cpp. The renderer streams PDF objects directly to disk and manages the document structure internally, eliminating dependencies on external libraries like LibPDF or PDFium. The renderer's constructor takes the Tesseract data directory as an argument, and the font used for the invisible text layer (pdf.ttf) is available as embedded data in src/api/pdf_ttf.h.

What is the difference between a regular PDF and a searchable PDF created by Tesseract?

A regular PDF contains only raster image data, making text non-selectable and non-indexable. A searchable PDF created by Tesseract contains both the original image (unless textonly_pdf is enabled) and an invisible text layer positioned precisely over the image content. This hidden layer uses standard PDF text operators (BT/ET blocks) with coordinates derived from OCR word bounding boxes, allowing users to select, copy, and search the text while viewing the original scanned image.

How can I create a text-only PDF without images using Tesseract?

Set the boolean configuration variable textonly_pdf to 1 before processing. This can be done via the command line using -c textonly_pdf=1 or programmatically via the C++ API using api.SetVariable("textonly_pdf", "1"). When enabled, the TessPDFRenderer skips the imageToPDFObj step in AddImageHandler(), omitting the raster image data and including only the invisible text layer. This significantly reduces file size but removes the visual representation of the original document.

What font does Tesseract use for the invisible text layer in PDFs?

Tesseract embeds a custom minimal TrueType font named pdf.ttf, which is compiled directly into the binary as a byte array in src/api/pdf_ttf.h. This font is used exclusively for the invisible text layer to ensure text selectability across PDF viewers without relying on system fonts. The renderer references this font by its resource name inside each text block (written as /F1 in the simplified syntax shown earlier), allowing the hidden text to occupy the coordinates determined by the OCR engine during recognition.
