How to Work with Paragraph Detection and Justification in Tesseract OCR
Tesseract detects paragraph boundaries during layout analysis and exposes justification properties (left, center, right, or unknown) through the ResultIterator C++ API or the TessPageIteratorParagraphInfo C API function.
The tesseract-ocr/tesseract engine constructs a geometric model for every paragraph it discovers during page layout analysis. This model captures justification alignment, first-line indentation, and body-line offsets, enabling you to extract structured text with preserved formatting. Understanding how to access these paragraph-level properties is essential for document layout analysis and text reflow applications.
Understanding the Paragraph Geometric Model
During layout analysis, the detector in src/ccmain/paragraphs.cpp examines contiguous blocks of text lines and creates a ParagraphModel instance for each paragraph. This model is defined in src/ccstruct/ocrpara.h and classifies lines as either START (first line) or BODY (continuation lines) via the RowScratchRegisters structure.
The model tracks three geometric properties:
- Justification: Alignment classification as JUSTIFICATION_LEFT, JUSTIFICATION_CENTER, JUSTIFICATION_RIGHT, or JUSTIFICATION_UNKNOWN (defined in include/tesseract/publictypes.h)
- First-line indent: Horizontal offset of the paragraph's first line relative to the block margin
- Body-line indent: Horizontal offset applied to all subsequent lines in the paragraph
A pixel tolerance value is used when comparing measured indents against the model to handle minor variations in document scanning.
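To make the role of the tolerance concrete, here is a minimal Python sketch of how alignment could be inferred from per-line margins. It is not the algorithm in paragraphs.cpp: the function name `guess_justification`, the `(left, right)` margin tuples, and the default tolerance of 5 pixels are assumptions made purely for illustration.

```python
def guess_justification(margins, tolerance=5):
    """Guess alignment from (left, right) pixel margins of the body lines.

    Illustrative only, not Tesseract's detector. A side counts as aligned
    when all of its margins agree within `tolerance` pixels.
    """
    if len(margins) < 2:
        return "unknown"  # a single line gives no alignment evidence
    lefts = [left for left, _ in margins]
    rights = [right for _, right in margins]
    if max(lefts) - min(lefts) <= tolerance:
        return "left"  # fully justified text is also reported as left here
    if max(rights) - min(rights) <= tolerance:
        return "right"
    # Centered lines have roughly equal margins on both sides.
    if all(abs(left - right) <= tolerance for left, right in margins):
        return "center"
    return "unknown"
```

Raising the tolerance makes the sketch more forgiving of skewed or noisy scans, at the cost of misreading ragged text as aligned.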
Retrieving Paragraph Justification via the C++ API
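The ResultIterator class in include/tesseract/resultiterator.h exposes paragraph metadata after OCR completes. It inherits ParagraphInfo() from PageIterator, which is declared in include/tesseract/pageiterator.h. To access justification information, iterate over the result at the RIL_PARA (paragraph) level and call ParagraphInfo(). The call takes four output pointers: the justification, a list-item flag, a crown flag (the first line is flush with the body lines, which typically marks a continuation paragraph), and the first-line indent in pixels.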
Prerequisites for Paragraph Detection
You must initialize TessBaseAPI with a layout-aware page segmentation mode before recognizing text:
- PSM_AUTO: Fully automatic page segmentation, without orientation and script detection (OSD)
- PSM_AUTO_OSD: Automatic page segmentation with orientation and script detection
- PSM_SINGLE_BLOCK: Treat the image as a single uniform block of text
Modes like PSM_SINGLE_LINE or PSM_RAW_LINE bypass layout analysis and will return JUSTIFICATION_UNKNOWN for all text.
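As a guard against picking a mode that skips layout analysis, a caller could validate the mode before recognition. The sketch below uses the numeric values of the tesseract::PageSegMode enum for the modes named above. The `LAYOUT_AWARE_MODES` set and the `check_layout_aware` helper are hypothetical names and not part of Tesseract.

```python
# Numeric values of tesseract::PageSegMode for the modes named above.
PSM_AUTO_OSD = 1
PSM_AUTO = 3
PSM_SINGLE_BLOCK = 6
PSM_SINGLE_LINE = 7
PSM_RAW_LINE = 13

# Modes this article lists as layout-aware (hypothetical helper set).
LAYOUT_AWARE_MODES = frozenset({PSM_AUTO_OSD, PSM_AUTO, PSM_SINGLE_BLOCK})


def check_layout_aware(mode):
    """Return `mode` unchanged, or raise ValueError if it is not listed above."""
    if mode not in LAYOUT_AWARE_MODES:
        raise ValueError(
            f"page segmentation mode {mode} is not listed as layout-aware"
        )
    return mode
```

The validated value can then be passed straight to SetPageSegMode() or, in the C API, to TessBaseAPISetPageSegMode().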
Complete C++ Implementation
#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>
#include <tesseract/publictypes.h>
#include <leptonica/allheaders.h>
#include <iostream>
#include <memory>

int main() {
  tesseract::TessBaseAPI api;
  // Initialize with a layout-aware page segmentation mode
  if (api.Init(nullptr, "eng")) return 1;
  api.SetPageSegMode(tesseract::PSM_AUTO);
  // Load the image and perform OCR
  Pix* image = pixRead("document.png");
  if (!image) return 1;
  api.SetImage(image);
  api.Recognize(nullptr);
  // Iterate over paragraphs; Next() returns false after the last one
  std::unique_ptr<tesseract::ResultIterator> it(api.GetIterator());
  if (it) {
    do {
      // ParagraphInfo() fills four outputs; only justification is printed
      tesseract::ParagraphJustification just = tesseract::JUSTIFICATION_UNKNOWN;
      bool is_list_item = false, is_crown = false;
      int first_line_indent = 0;
      it->ParagraphInfo(&just, &is_list_item, &is_crown, &first_line_indent);
      // Map enum to string representation
      const char* just_str = "unknown";
      switch (just) {
        case tesseract::JUSTIFICATION_LEFT: just_str = "left"; break;
        case tesseract::JUSTIFICATION_RIGHT: just_str = "right"; break;
        case tesseract::JUSTIFICATION_CENTER: just_str = "center"; break;
        default: break;
      }
      // Retrieve the paragraph text with preserved line breaks;
      // GetUTF8Text() returns a new[] buffer owned by the caller
      std::unique_ptr<char[]> para_text(it->GetUTF8Text(tesseract::RIL_PARA));
      std::cout << "Paragraph (" << just_str << " justified):\n"
                << (para_text ? para_text.get() : "") << "\n---\n";
    } while (it->Next(tesseract::RIL_PARA));
  }
  it.reset();  // release the iterator before ending the API
  pixDestroy(&image);
  api.End();
  return 0;
}
Key source references for this implementation:
- ParagraphModel class definition: src/ccstruct/ocrpara.h
- Paragraph detection logic: src/ccmain/paragraphs.cpp
- ParagraphInfo() declaration: include/tesseract/pageiterator.h (declared on PageIterator and inherited by ResultIterator)
- ParagraphJustification enum values: include/tesseract/publictypes.h
Accessing Paragraph Data Through the C API
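For applications using the C interface or Python via ctypes, Tesseract exposes paragraph justification through TessPageIteratorParagraphInfo() declared in include/tesseract/capi.h. The function takes a page iterator, which you obtain from a result iterator with TessResultIteratorGetPageIterator(). It writes the justification as a TessParagraphJustification value, along with the list-item flag, the crown flag, and the first-line indent.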
Python Example Using ctypes
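import ctypes
from ctypes import POINTER, byref, c_char_p, c_int, c_void_p

# Load the shared libraries (file names vary by platform and version)
tess = ctypes.cdll.LoadLibrary("libtesseract.so")
lept = ctypes.cdll.LoadLibrary("liblept.so")

# Declare signatures so 64-bit pointers are not truncated to int
lept.pixRead.restype = c_void_p
lept.pixRead.argtypes = [c_char_p]
lept.pixDestroy.argtypes = [POINTER(c_void_p)]
tess.TessBaseAPICreate.restype = c_void_p
tess.TessBaseAPIInit3.argtypes = [c_void_p, c_char_p, c_char_p]
tess.TessBaseAPISetPageSegMode.argtypes = [c_void_p, c_int]
tess.TessBaseAPISetImage2.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIRecognize.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIGetIterator.restype = c_void_p
tess.TessBaseAPIGetIterator.argtypes = [c_void_p]
tess.TessResultIteratorGetPageIterator.restype = c_void_p
tess.TessResultIteratorGetPageIterator.argtypes = [c_void_p]
tess.TessResultIteratorNext.argtypes = [c_void_p, c_int]
tess.TessResultIteratorDelete.argtypes = [c_void_p]
tess.TessPageIteratorParagraphInfo.argtypes = [
    c_void_p, POINTER(c_int), POINTER(c_int), POINTER(c_int), POINTER(c_int)]
tess.TessBaseAPIEnd.argtypes = [c_void_p]
tess.TessBaseAPIDelete.argtypes = [c_void_p]

# Initialize API with English language data (None uses the default tessdata path)
api = tess.TessBaseAPICreate()
if tess.TessBaseAPIInit3(api, None, b"eng") != 0:
    raise RuntimeError("Failed to initialize Tesseract")
# Set PSM_AUTO (enum value 3) to enable layout analysis
tess.TessBaseAPISetPageSegMode(api, 3)
# Load the image through Leptonica and recognize
pix = c_void_p(lept.pixRead(b"document.png"))
if not pix:
    raise RuntimeError("Could not read document.png")
tess.TessBaseAPISetImage2(api, pix)
if tess.TessBaseAPIRecognize(api, None) != 0:
    raise RuntimeError("Recognition failed")

# Map integer values to justification types
# 0=UNKNOWN, 1=LEFT, 2=CENTER, 3=RIGHT
just_map = {0: "unknown", 1: "left", 2: "center", 3: "right"}
RIL_PARA = 1  # paragraph level in TessPageIteratorLevel

# Iterate paragraphs and extract justification
it = tess.TessBaseAPIGetIterator(api)
while it:
    just, is_list, is_crown, indent = c_int(), c_int(), c_int(), c_int()
    # The page iterator is a view of the result iterator; do not delete it separately
    page_it = tess.TessResultIteratorGetPageIterator(it)
    tess.TessPageIteratorParagraphInfo(
        page_it, byref(just), byref(is_list), byref(is_crown), byref(indent))
    print(f"Paragraph justification: {just_map.get(just.value, 'unknown')}")
    # Advance to the next paragraph; a zero return means there are no more
    if not tess.TessResultIteratorNext(it, RIL_PARA):
        break
if it:
    tess.TessResultIteratorDelete(it)
lept.pixDestroy(byref(pix))
tess.TessBaseAPIEnd(api)
tess.TessBaseAPIDelete(api)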
The C wrapper exposes TessParagraphJustification constants in include/tesseract/capi.h, mapping directly to the C++ enum values in publictypes.h.
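Because the C API reports justification as a plain integer, it can help to decode it through a named enum instead of a bare dictionary. The sketch below mirrors the enum order from publictypes.h (UNKNOWN, LEFT, CENTER, RIGHT). The Python class and the `justification_name` helper are illustrative names, not part of any Tesseract binding.

```python
from enum import IntEnum


class ParagraphJustification(IntEnum):
    """Python mirror of the justification enum order in publictypes.h."""
    UNKNOWN = 0
    LEFT = 1
    CENTER = 2
    RIGHT = 3


def justification_name(value):
    """Return a lowercase name for the integer written by the C API."""
    try:
        return ParagraphJustification(value).name.lower()
    except ValueError:
        # Treat out-of-range values as unknown instead of failing
        return "unknown"
```

With ctypes, the integer would come from the `.value` attribute of the `c_int` passed by reference to the C function.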
Extracting Raw Geometric Models (Advanced)
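For applications requiring precise indent measurements or custom text reflow, you can access the underlying ParagraphModel through internal page structures. The PAGE_RES object (defined in src/ccstruct/pageres.h) holds a list of block results. Each block carries a list of PARA objects, and each PARA points to the ParagraphModel that stores the raw model data, including exact pixel values for first-line and body-line indents. The public TessBaseAPI does not hand out a PAGE_RES pointer, so this approach assumes you build against Tesseract internals.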
// Sketch against Tesseract internals (requires including pageres.h).
// page_res is assumed to be a PAGE_RES* obtained inside an internal build;
// the public TessBaseAPI does not expose one.
// Iterate over block results using the ELIST iterator idiom
BLOCK_RES_IT block_it(&page_res->block_res_list);
for (block_it.mark_cycle_pt(); !block_it.cycled_list(); block_it.forward()) {
  BLOCK* block = block_it.data()->block;
  // Paragraph objects reside in the block's PARA list
  PARA_IT para_it(block->para_list());
  for (para_it.mark_cycle_pt(); !para_it.cycled_list(); para_it.forward()) {
    const tesseract::ParagraphModel* model = para_it.data()->model;
    if (model) {
      // ToString() outputs formatted indent and justification data
      std::cout << "Paragraph model: " << model->ToString() << "\n";
    }
  }
}
Key internal structures for advanced paragraph geometry:
- ParagraphTheory: Manages model creation and lookup during detection, declared in src/ccmain/paragraphs_internal.h
- PARA class: Container for paragraph lines and the associated model in src/ccstruct/ocrpara.h
- PAGE_RES: Top-level page result structure in src/ccstruct/pageres.h
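As an illustration of how first-line and body-line indents could drive text reflow, the following Python sketch converts pixel indents into leading spaces and rewraps a paragraph with the standard `textwrap` module. The function `reflow_paragraph`, the `px_per_char` conversion factor, and the default width are hypothetical. They are not part of Tesseract, and a real application would derive the conversion from the measured font size.

```python
import textwrap


def reflow_paragraph(text, first_indent_px, body_indent_px,
                     width=60, px_per_char=10):
    """Rewrap `text`, turning pixel indents into leading spaces.

    `px_per_char` is an assumed average glyph width used to convert
    pixel offsets into character columns.
    """
    first = " " * round(first_indent_px / px_per_char)
    body = " " * round(body_indent_px / px_per_char)
    # Collapse the OCR line breaks so the text can be wrapped afresh
    collapsed = " ".join(text.split())
    return textwrap.fill(collapsed, width=width,
                         initial_indent=first, subsequent_indent=body)
```

For example, `reflow_paragraph(para_text, 20, 0, width=72)` would indent the first line by two columns and leave the body lines flush with the margin.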
Summary
- Enable layout analysis by setting PSM_AUTO, PSM_AUTO_OSD, or PSM_SINGLE_BLOCK via SetPageSegMode() before calling Recognize()
- Query justification using ResultIterator::ParagraphInfo() in C++ or TessPageIteratorParagraphInfo() in C to receive JUSTIFICATION_LEFT, JUSTIFICATION_CENTER, JUSTIFICATION_RIGHT, or JUSTIFICATION_UNKNOWN
- Extract paragraph text with preserved line breaks through GetUTF8Text(RIL_PARA) rather than iterating word-by-word
- Access raw geometry by inspecting ParagraphModel instances through PAGE_RES and each block's PARA list when exact indent values are required
- Source files governing this functionality are src/ccmain/paragraphs.cpp, src/ccstruct/ocrpara.h, include/tesseract/pageiterator.h, include/tesseract/resultiterator.h, and include/tesseract/publictypes.h
Frequently Asked Questions
Which page segmentation modes support paragraph detection?
Only modes performing full layout analysis enable paragraph detection. Use PSM_AUTO (fully automatic), PSM_AUTO_OSD (with orientation and script detection), or PSM_SINGLE_BLOCK (single uniform block). Modes like PSM_SINGLE_LINE, PSM_SINGLE_WORD, or PSM_RAW_LINE skip geometric analysis and return JUSTIFICATION_UNKNOWN for all text blocks.
Can I get the exact pixel indent values for paragraphs?
Partly through the public API. ParagraphInfo() returns the first-line indent in pixels, measured relative to the body lines, through its first_line_indent argument. For the absolute first-line and body-line indents you must use the internal API: the ParagraphModel class in src/ccstruct/ocrpara.h exposes them through its first_indent() and body_indent() accessors. Reach the models by traversing the PAGE_RES structure from src/ccstruct/pageres.h and following the model pointer of each PARA in a block's paragraph list. This requires linking against Tesseract internals and including the private headers.
What does the paragraph justification "unknown" mean?
JUSTIFICATION_UNKNOWN indicates that the geometric detector in src/ccmain/paragraphs.cpp could not determine alignment with sufficient confidence, or that layout analysis was disabled. This commonly occurs with narrow text columns, irregular character spacing, centered text in small widths, or when using page segmentation modes that treat the image as a single line or word.
Is paragraph detection available in the Tesseract command line tool?
The command line tesseract binary outputs plain text, hOCR, or TSV formats, but does not expose paragraph justification metadata in these outputs. To programmatically access ParagraphInfo() and justification properties, you must use the C++ API (tesseract::ResultIterator) or the C API (TessPageIteratorParagraphInfo) as shown in the code examples above.