How to Build Searchable PDFs with Tesseract: A Complete Guide to TessPDFRenderer
Tesseract generates searchable PDFs by using the TessPDFRenderer class to embed an invisible text layer over raster images, creating PDFs that contain both the original image and selectable text without requiring external PDF libraries.
The tesseract-ocr/tesseract repository includes a native PDF rendering pipeline that converts scanned documents into searchable PDF files. This functionality is implemented through the TessPDFRenderer class, which streams PDF objects directly to disk while embedding a hidden text layer derived from OCR results. Understanding this architecture enables developers to build searchable PDFs with Tesseract via command-line interfaces, C++ APIs, or automated scripts.
How Tesseract Generates Searchable PDFs
Tesseract’s PDF generation relies on the TessPDFRenderer class, which inherits from the abstract TessResultRenderer interface defined in include/tesseract/renderer.h. This design follows a handler pattern where the renderer processes OCR results page-by-page and streams PDF objects to an output file.
The renderer overrides three critical virtual hooks from the base class:
- BeginDocumentHandler() – Initializes the PDF structure, writes the header, and establishes the document catalog.
- AddImageHandler() – Processes each input image, optionally embeds the raster data, and generates the invisible text layer using OCR word bounding boxes.
- EndDocumentHandler() – Finalizes the cross-reference table, writes the trailer, and closes the file stream.
This architecture allows Tesseract to produce PDFs without linking external libraries like LibPDF, keeping the dependency footprint minimal.
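The handler pattern described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Tesseract's code: the class names ResultRenderer and LoggingRenderer and the snake_case method names are hypothetical stand-ins for the C++ interface.

```python
from abc import ABC, abstractmethod


class ResultRenderer(ABC):
    """Hypothetical stand-in for the TessResultRenderer handler pattern."""

    def __init__(self):
        self.image_count = 0
        self._open = False

    def begin_document(self, title):
        # Public entry point: set up state, then delegate to the hook.
        self._open = True
        self.image_count = 0
        return self.begin_document_handler(title)

    def add_image(self, page):
        # Pages are only accepted between begin_document and end_document.
        if not self._open:
            return False
        ok = self.add_image_handler(page)
        if ok:
            self.image_count += 1
        return ok

    def end_document(self):
        if not self._open:
            return False
        self._open = False
        return self.end_document_handler()

    @abstractmethod
    def begin_document_handler(self, title): ...

    @abstractmethod
    def add_image_handler(self, page): ...

    @abstractmethod
    def end_document_handler(self): ...


class LoggingRenderer(ResultRenderer):
    """Toy subclass that records which hooks ran, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def begin_document_handler(self, title):
        self.events.append(f"begin:{title}")
        return True

    def add_image_handler(self, page):
        self.events.append(f"page:{page}")
        return True

    def end_document_handler(self):
        self.events.append("end")
        return True
```

A PDF renderer would fill the three hooks with the header, per-page, and trailer logic; the base class only enforces the order in which they run.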
Configuration and Renderer Selection
Tesseract activates PDF generation through the boolean configuration variable tessedit_create_pdf. When this variable is set to 1, the main driver instantiates TessPDFRenderer instead of the default text renderer.
The configuration flow works as follows:
- Config File – The file tessdata/configs/pdf contains the line tessedit_create_pdf 1, allowing users to trigger PDF mode by passing pdf as a config argument.
- Driver Logic – In src/tesseract.cpp (around lines 562–570), the code checks the variable using api.GetBoolVariable("tessedit_create_pdf", &b). When true, it constructs the renderer: renderer = new tesseract::TessPDFRenderer(output_base, api.GetDatapath(), textonly);
- Text-Only Mode – The optional boolean textonly_pdf (set via -c textonly_pdf=1, in addition to enabling PDF output) instructs the renderer to omit images and include only the invisible text layer, significantly reducing file size.
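The selection logic above can be modeled roughly as follows. This is a sketch under assumptions: parse_config and select_renderers are hypothetical helpers, the "name value" line layout is taken from the config line quoted above, and only the literal value 1 is treated as true.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict (hypothetical helper)."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        variables[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return variables


def select_renderers(variables):
    """Return (kind, options) pairs describing which renderer would be built.

    Mirrors the flow in the list above: PDF output is chosen only when
    tessedit_create_pdf is set; textonly_pdf merely adjusts that renderer.
    """
    if variables.get("tessedit_create_pdf") == "1":
        textonly = variables.get("textonly_pdf") == "1"
        return [("pdf", {"textonly": textonly})]
    # Fallback used by this sketch: plain text output.
    return [("txt", {})]
```

Note that in this model textonly_pdf has no effect on its own; it only changes how the PDF renderer behaves once PDF output has been requested.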
The PDF Rendering Pipeline
The TessPDFRenderer implementation in src/api/pdfrenderer.cpp handles the complex task of generating valid PDF documents while streaming content to minimize memory usage.
Document Lifecycle Hooks
The renderer manages PDF state through three overridden methods:
- BeginDocumentHandler() writes the PDF header (%PDF-1.5), creates the document catalog, and initializes the object counter (obj_) and offset tracker (offsets_).
- AddImageHandler() processes each page sequentially. It retrieves the OCR results from the TessBaseAPI object and coordinates image embedding with text layer generation.
- EndDocumentHandler() writes the cross-reference table using the stored offsets_, followed by the trailer dictionary containing the /Root and /Size entries.
Object Management and Streaming
To maintain low memory footprint when processing multi-page documents, the renderer uses a streaming architecture:
- obj_ – An integer counter tracking PDF object numbers as they are written sequentially to the output stream.
- offsets_ – A vector storing byte positions of each object, enabling the cross-reference table to be built at the end of the file without random access seeks.
This design allows Tesseract to process arbitrarily large documents without loading all pages into memory simultaneously.
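To make the counter-and-offsets idea concrete, here is a minimal Python sketch of a sequential PDF writer. It is not Tesseract's implementation: StreamingPdfWriter and build_demo are hypothetical names, and the three objects written are a bare one-page skeleton with no content.

```python
import io


class StreamingPdfWriter:
    """Append-only PDF writer that records each object's byte offset."""

    def __init__(self, out):
        self.out = out
        self.position = 0    # bytes written so far
        self.offsets = [0]   # index 0 is the reserved free entry
        self._write(b"%PDF-1.5\n")

    def _write(self, data):
        self.out.write(data)
        self.position += len(data)

    def add_object(self, body):
        # The next object number is simply the count of offsets so far.
        obj_num = len(self.offsets)
        self.offsets.append(self.position)
        self._write(b"%d 0 obj\n" % obj_num + body + b"\nendobj\n")
        return obj_num

    def finish(self, root_obj):
        # The cross-reference table is built from the stored offsets,
        # so nothing already written has to be revisited.
        xref_pos = self.position
        count = len(self.offsets)
        self._write(b"xref\n0 %d\n" % count)
        self._write(b"0000000000 65535 f \n")
        for offset in self.offsets[1:]:
            self._write(b"%010d 00000 n \n" % offset)
        self._write(
            b"trailer\n<< /Size %d /Root %d 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (count, root_obj, xref_pos)
        )


def build_demo():
    """Write a one-page skeleton and return (writer, pdf_bytes)."""
    buf = io.BytesIO()
    writer = StreamingPdfWriter(buf)
    # Object numbers are sequential, so forward references are predictable.
    catalog = writer.add_object(b"<< /Type /Catalog /Pages 2 0 R >>")
    writer.add_object(b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
    writer.add_object(b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>")
    writer.finish(catalog)
    return writer, buf.getvalue()
```

Because every object is written once and only its offset is remembered, memory use in this scheme grows with the number of objects rather than with the size of their contents.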
Invisible Text Layer Generation
The searchable capability relies on embedding invisible text positioned precisely over the original image content:
- GetPDFTextObjects() – Converts OCR word bounding boxes into PDF text objects. Schematically, each one has the form BT /F1 12 Tf x y Td (text) Tj ET, where BT/ET mark the beginning and end of a text object, Tf selects the font and size, Td positions the text using the word coordinates from the OCR engine, and Tj draws the string. The text stays invisible because it is drawn in PDF text rendering mode 3 (3 Tr), which neither fills nor strokes the glyphs.
- Custom Font – The renderer embeds a minimal TrueType font defined in src/api/pdf_ttf.h (the pdf.ttf binary data). This ensures the hidden text is selectable in any PDF viewer without external font dependencies.
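As a rough illustration of how a word box could become a text object, the sketch below converts a pixel-space bounding box into the schematic BT ... ET form. It is a simplified model, not the output of GetPDFTextObjects(): the function names, the /F1 font name, the 300 DPI default, and the ASCII-only literal string are all assumptions of the example.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def word_to_text_object(text, box, page_height_px, dpi=300.0, font="/F1"):
    """Build an invisible text object for one word.

    box is (left, top, right, bottom) in pixels with the origin at the
    top-left corner, as image coordinates usually are. PDF user space has
    its origin at the bottom-left and is measured in points (1/72 inch).
    Non-ASCII text needs a different string encoding, which this sketch
    does not attempt.
    """
    left, top, right, bottom = box
    scale = 72.0 / dpi
    x = left * scale
    y = (page_height_px - bottom) * scale   # flip the vertical axis
    size = max((bottom - top) * scale, 1.0)  # font size from box height
    return "BT 3 Tr {} {:.2f} Tf {:.2f} {:.2f} Td ({}) Tj ET".format(
        font, size, x, y, pdf_escape(text)
    )
```

For a 3300-pixel-tall page scanned at 300 DPI, a word whose box is (300, 600, 900, 700) lands at x = 72 pt, y = 624 pt with a 24 pt font size.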
Image Embedding
For each processed page, the renderer optionally embeds the source raster image:
- imageToPDFObj() – Converts the Leptonica Pix structure into a PDF image XObject, choosing an encoding suited to the content (for example JPEG for photographs and a lossless encoding for bitonal or PNG-sourced content). The function streams the compressed image data directly into the PDF object stream, avoiding intermediate file creation.
When the textonly_pdf flag is enabled, this step is skipped, producing a document containing only the invisible text layer.
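The following Python sketch shows what embedding a page image amounts to at the PDF level, using standard PDF constructs (an image XObject with the DCTDecode filter, and a content stream that scales it to the page). It does not reproduce imageToPDFObj(); the function names, the /Im1 resource name, and the 8-bit RGB assumption are hypothetical.

```python
def jpeg_image_xobject(jpeg_bytes, width_px, height_px, color_space="/DeviceRGB"):
    """Wrap already-compressed JPEG data in a PDF image XObject body."""
    header = (
        "<< /Type /XObject /Subtype /Image "
        f"/Width {width_px} /Height {height_px} "
        f"/ColorSpace {color_space} /BitsPerComponent 8 "
        f"/Filter /DCTDecode /Length {len(jpeg_bytes)} >>\n"
    ).encode("ascii")
    # The JPEG bytes are copied as-is; DCTDecode tells the viewer how to read them.
    return header + b"stream\n" + jpeg_bytes + b"\nendstream"


def page_size_points(width_px, height_px, dpi):
    """Convert pixel dimensions to PDF points (72 per inch)."""
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


def page_content(width_pt, height_pt, image_name="/Im1", textonly=False):
    """Content-stream fragment that paints the image across the whole page.

    In text-only mode the image is not painted at all, which models the
    skipped embedding step described above.
    """
    if textonly:
        return ""
    return f"q {width_pt:.2f} 0 0 {height_pt:.2f} 0 0 cm {image_name} Do Q"
```

A 2550 by 3300 pixel scan at 300 DPI maps to a 612 by 792 point page, which is US Letter.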
Practical Implementation Examples
Command-Line Usage
The simplest method to build searchable PDFs with Tesseract uses the built-in pdf configuration:
# Generate searchable PDF from image
tesseract input.png output -l eng pdf
# Explicitly enable PDF creation via configuration variable
tesseract input.png output -c tessedit_create_pdf=1
# Create text-only PDF (smaller file, no images); PDF output must still be enabled
tesseract input.png output -c textonly_pdf=1 pdf
The pdf argument loads the configuration file tessdata/configs/pdf, which sets tessedit_create_pdf=1 automatically.
C++ API Integration
For applications requiring programmatic control, instantiate TessPDFRenderer directly after initializing the API:
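#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <leptonica/allheaders.h>
#include <memory>

int main() {
    tesseract::TessBaseAPI api;
    // Init returns non-zero on failure (for example, missing language data).
    if (api.Init(nullptr, "eng") != 0) {
        return 1;
    }
    // These mirror the CLI variables; the renderer below is configured
    // through its constructor arguments.
    api.SetVariable("tessedit_create_pdf", "1");
    api.SetVariable("textonly_pdf", "0");

    Pix *image = pixRead("input.png");
    if (image == nullptr) {
        api.End();
        return 1;
    }
    api.SetImage(image);

    // Construct renderer: output base name, data path for fonts, textonly flag
    std::unique_ptr<tesseract::TessPDFRenderer> renderer(
        new tesseract::TessPDFRenderer("output", api.GetDatapath(), false));

    // Run OCR, then hand the results to the renderer.
    api.Recognize(nullptr);
    renderer->BeginDocument("OCR Document");
    renderer->AddImage(&api);
    renderer->EndDocument();  // finalizes output.pdf

    pixDestroy(&image);
    api.End();
    return 0;
}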
This approach gives direct access to the TessPDFRenderer constructor defined in include/tesseract/renderer.h, allowing control over output paths and font data locations.
Python Automation
While pytesseract does not expose the renderer directly, you can invoke the Tesseract binary with PDF parameters:
import subprocess


def create_searchable_pdf(image_path, output_base, lang="eng", textonly=False):
    """
    Convert image to searchable PDF using Tesseract CLI.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if textonly:
        # Options such as -c are placed before the config file name.
        cmd.extend(["-c", "textonly_pdf=1"])
    cmd.append("pdf")  # loads tessdata/configs/pdf
    subprocess.run(cmd, check=True)
    return f"{output_base}.pdf"


# Usage
create_searchable_pdf("scan.png", "document", textonly=False)
This script leverages the tessedit_create_pdf configuration triggered by the pdf argument, producing a searchable PDF named after the output base.
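For multi-page documents, the Tesseract CLI also accepts a text file listing one image path per line in place of a single image. The helper below is a sketch of how a wrapper might prepare such a call; build_multipage_command, the pages.txt file name, and the list_dir parameter are hypothetical choices for this example.

```python
import tempfile
from pathlib import Path


def build_multipage_command(image_paths, output_base, lang="eng", list_dir=None):
    """Write an image list file and return the tesseract command that uses it.

    The command is returned rather than executed so the caller decides how
    to run it (for example with subprocess.run(cmd, check=True)).
    """
    list_dir = Path(list_dir) if list_dir else Path(tempfile.mkdtemp())
    list_file = list_dir / "pages.txt"
    # One image path per line, in page order.
    list_file.write_text(
        "\n".join(str(p) for p in image_paths) + "\n", encoding="utf-8"
    )
    return ["tesseract", str(list_file), output_base, "-l", lang, "pdf"]
```

Calling build_multipage_command(["page1.png", "page2.png"], "book") and running the result would be expected to produce a single book.pdf containing both pages.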
Key Source Files and Architecture Reference
The searchable PDF functionality is implemented across the following source files:
- include/tesseract/renderer.h – Declares the TessResultRenderer abstract base class and the TessPDFRenderer implementation, including constructor signatures and virtual hook methods.
- src/api/pdfrenderer.cpp – Contains the complete PDF generation logic, including TessPDFRenderer::BeginDocumentHandler(), AddImageHandler(), and EndDocumentHandler(), plus helper functions GetPDFTextObjects() and imageToPDFObj().
- src/tesseract.cpp – Main driver application that checks the tessedit_create_pdf boolean variable (around lines 562–570) and conditionally instantiates TessPDFRenderer.
- tessdata/configs/pdf – Configuration file that sets tessedit_create_pdf 1, enabling PDF output mode when specified on the command line.
- src/api/pdf_ttf.h – Header file containing the embedded binary data for pdf.ttf, the custom TrueType font used for the invisible text layer.
These components work together to stream PDF objects directly to disk, eliminating dependencies on external PDF libraries while producing searchable documents that standard PDF viewers can open.
Summary
- Tesseract builds searchable PDFs natively using the TessPDFRenderer class, which implements the TessResultRenderer interface without requiring external PDF libraries.
- Configuration is controlled by the tessedit_create_pdf variable, typically set via the pdf config file or command-line flags.
- The rendering pipeline uses three lifecycle hooks (BeginDocumentHandler, AddImageHandler, EndDocumentHandler) to stream PDF objects, manage cross-reference tables, and embed both images and invisible text layers.
- Text searchability is achieved by positioning OCR word boxes as invisible PDF text objects using a custom embedded font (pdf.ttf) defined in src/api/pdf_ttf.h.
- Implementation options include direct command-line usage, C++ API integration via include/tesseract/renderer.h, or wrapper scripts that invoke the Tesseract binary.
Frequently Asked Questions
Does Tesseract require LibPDF or external libraries to create searchable PDFs?
No. Tesseract generates searchable PDFs using its own native TessPDFRenderer class implemented in src/api/pdfrenderer.cpp. The renderer streams PDF objects directly to disk and manages the document structure internally, so no external PDF library such as LibPDF or PDFium is needed. The one extra resource is the pdf.ttf font used for the invisible text layer: the source tree carries it as embedded data in src/api/pdf_ttf.h, and the renderer constructor also takes the data path (api.GetDatapath()), which is where a build that ships pdf.ttf as a separate file would be expected to find it.
What is the difference between a regular PDF and a searchable PDF created by Tesseract?
An image-only PDF, such as a raw scan, contains only raster image data, making its text non-selectable and non-indexable. A searchable PDF created by Tesseract contains both the original image (unless textonly_pdf is enabled) and an invisible text layer positioned precisely over the image content. This hidden layer uses standard PDF text operators (BT/ET blocks) with coordinates derived from OCR word bounding boxes, allowing users to select, copy, and search the text while viewing the original scanned image.
How can I create a text-only PDF without images using Tesseract?
Set the boolean configuration variable textonly_pdf to 1 in addition to enabling PDF output. On the command line, pass -c textonly_pdf=1 together with the pdf config (for example, tesseract input.png output -c textonly_pdf=1 pdf). In C++ code that constructs the renderer itself, pass true as the textonly argument of the TessPDFRenderer constructor. When enabled, the TessPDFRenderer skips the imageToPDFObj step in AddImageHandler(), omitting the raster image data and including only the invisible text layer. This significantly reduces file size but removes the visual representation of the original document.
What font does Tesseract use for the invisible text layer in PDFs?
Tesseract embeds a custom minimal TrueType font named pdf.ttf, which is compiled directly into the binary as a byte array in src/api/pdf_ttf.h. This font is used exclusively for the invisible text layer to ensure text selectability across all PDF viewers without relying on system fonts. The renderer references this font in PDF objects as /F1 within text blocks, allowing the hidden text to occupy the exact coordinates determined by the OCR engine during recognition.