How to Work with Paragraph Detection and Justification in Tesseract OCR

Tesseract detects paragraph boundaries during layout analysis and exposes justification properties (left, center, right, or unknown) through ParagraphInfo(), which the C++ ResultIterator inherits from PageIterator, or through the TessPageIteratorParagraphInfo C API function.

The tesseract-ocr/tesseract engine constructs a geometric model for every paragraph it discovers during page layout analysis. This model captures justification alignment, first-line indentation, and body-line offsets, enabling you to extract structured text with preserved formatting. Understanding how to access these paragraph-level properties is essential for document layout analysis and text reflow applications.

Understanding the Paragraph Geometric Model

During layout analysis, the detector in src/ccmain/paragraphs.cpp examines contiguous blocks of text lines and creates a ParagraphModel instance for each paragraph. This model is defined in src/ccstruct/ocrpara.h and classifies lines as either START (first line) or BODY (continuation lines) via the RowScratchRegisters structure.

The model tracks three geometric properties:

  • Justification: Alignment classification as JUSTIFICATION_LEFT, JUSTIFICATION_CENTER, JUSTIFICATION_RIGHT, or JUSTIFICATION_UNKNOWN (defined in include/tesseract/publictypes.h)
  • First-line indent: Horizontal offset of the paragraph's first line relative to the block margin
  • Body-line indent: Horizontal offset applied to all subsequent lines in the paragraph

A pixel tolerance value is used when comparing measured indents against the model to handle minor variations in document scanning.
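
To make the role of the tolerance concrete, here is a minimal sketch of such a model. The class name ParagraphModelSketch, its fields, and the classify() rule are hypothetical illustrations of the properties listed above, not Tesseract's own ParagraphModel implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParagraphModelSketch:
    """Hypothetical stand-in for the model described above (not Tesseract's class)."""
    justification: str  # "left", "center", "right", or "unknown"
    first_indent: int   # pixels: first line relative to the block margin
    body_indent: int    # pixels: continuation lines relative to the block margin
    tolerance: int      # allowed pixel deviation when matching a measured indent

    def classify(self, measured_indent: int) -> str:
        """Label a line START or BODY by comparing its indent within tolerance."""
        fits_start = abs(measured_indent - self.first_indent) <= self.tolerance
        fits_body = abs(measured_indent - self.body_indent) <= self.tolerance
        if fits_start and not fits_body:
            return "START"
        if fits_body and not fits_start:
            return "BODY"
        return "UNKNOWN"  # matches both indents or neither


model = ParagraphModelSketch("left", first_indent=40, body_indent=0, tolerance=5)
labels = [model.classify(indent) for indent in (38, 3, 20)]
print(labels)
```

A line measured at 38 pixels falls within the tolerance of the first-line indent, one at 3 pixels matches the body indent, and one at 20 pixels matches neither.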

Retrieving Paragraph Justification via the C++ API

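The ResultIterator class in include/tesseract/resultiterator.h exposes paragraph metadata after OCR completes. It inherits ParagraphInfo() from PageIterator (include/tesseract/pageiterator.h). To access justification information, iterate over the result at the RIL_PARA (paragraph) level and call ParagraphInfo(), which takes four output pointers: the justification, a list-item flag, a crown flag, and the first-line indent.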

Prerequisites for Paragraph Detection

You must initialize TessBaseAPI with a layout-aware page segmentation mode before recognizing text:

  • PSM_AUTO: Fully automatic page segmentation, without orientation and script detection (OSD)
  • PSM_AUTO_OSD: Automatic page segmentation with orientation and script detection
  • PSM_SINGLE_BLOCK: Treat the image as a single uniform text block

Modes like PSM_SINGLE_LINE or PSM_RAW_LINE bypass layout analysis and will return JUSTIFICATION_UNKNOWN for all text.
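
When calling Tesseract from a language without the C++ enum (for example through ctypes), the modes have to be passed as integers. The sketch below lists the integer values of tesseract::PageSegMode; the helper supports_paragraph_detection() and the LAYOUT_AWARE set are hypothetical names that simply encode the three modes recommended above, not a check performed by Tesseract itself.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Integer values of tesseract::PageSegMode."""
    PSM_OSD_ONLY = 0
    PSM_AUTO_OSD = 1
    PSM_AUTO_ONLY = 2
    PSM_AUTO = 3
    PSM_SINGLE_COLUMN = 4
    PSM_SINGLE_BLOCK_VERT_TEXT = 5
    PSM_SINGLE_BLOCK = 6
    PSM_SINGLE_LINE = 7
    PSM_SINGLE_WORD = 8
    PSM_CIRCLE_WORD = 9
    PSM_SINGLE_CHAR = 10
    PSM_SPARSE_TEXT = 11
    PSM_SPARSE_TEXT_OSD = 12
    PSM_RAW_LINE = 13


# Assumption: the set mirrors the modes this article recommends.
LAYOUT_AWARE = {
    PageSegMode.PSM_AUTO,
    PageSegMode.PSM_AUTO_OSD,
    PageSegMode.PSM_SINGLE_BLOCK,
}


def supports_paragraph_detection(mode: PageSegMode) -> bool:
    """Return True for the modes the article recommends for paragraph work."""
    return mode in LAYOUT_AWARE


print(supports_paragraph_detection(PageSegMode.PSM_AUTO))
print(supports_paragraph_detection(PageSegMode.PSM_RAW_LINE))
```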

Complete C++ Implementation

#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>
#include <tesseract/publictypes.h>
#include <leptonica/allheaders.h>  // pixRead, pixDestroy
#include <iostream>

int main() {
  tesseract::TessBaseAPI api;

  // Initialize with layout-aware page segmentation mode
  if (api.Init(nullptr, "eng")) return 1;
  api.SetPageSegMode(tesseract::PSM_AUTO);

  // Load image and perform OCR
  Pix* image = pixRead("document.png");
  if (!image) return 1;
  api.SetImage(image);
  if (api.Recognize(nullptr) != 0) {
    pixDestroy(&image);
    return 1;
  }

  // Iterate over paragraphs; Next() returns false when no paragraph is left
  tesseract::ResultIterator* it = api.GetIterator();
  if (it) {
    do {
      // ParagraphInfo() fills four outputs; initialize them because they
      // may be left untouched when no paragraph model is attached
      tesseract::ParagraphJustification just = tesseract::JUSTIFICATION_UNKNOWN;
      bool is_list_item = false;
      bool is_crown = false;
      int first_line_indent = 0;
      it->ParagraphInfo(&just, &is_list_item, &is_crown, &first_line_indent);

      // Map enum to string representation
      const char* just_str = "unknown";
      switch (just) {
        case tesseract::JUSTIFICATION_LEFT:   just_str = "left";   break;
        case tesseract::JUSTIFICATION_RIGHT:  just_str = "right";  break;
        case tesseract::JUSTIFICATION_CENTER: just_str = "center"; break;
        default: break;
      }

      // Retrieve the paragraph text; the caller owns the returned buffer
      char* para_text = it->GetUTF8Text(tesseract::RIL_PARA);

      std::cout << "Paragraph (" << just_str << " justified):\n"
                << (para_text ? para_text : "") << "\n---\n";
      delete[] para_text;
    } while (it->Next(tesseract::RIL_PARA));
    delete it;
  }

  pixDestroy(&image);
  api.End();
  return 0;
}


Accessing Paragraph Data Through the C API

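For applications using the C interface or Python via ctypes, Tesseract exposes paragraph justification through TessPageIteratorParagraphInfo() declared in include/tesseract/capi.h. The function takes a page iterator handle, which you can obtain from a result iterator with TessResultIteratorGetPageIterator(), and populates four outputs: an integer corresponding to the TessParagraphJustification enum, two BOOL flags (list item and crown), and the first-line indent.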

Python Example Using ctypes

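import ctypes
import ctypes.util
from ctypes import c_int, c_char_p, c_void_p, POINTER, byref

# Load the Tesseract and Leptonica shared libraries (file names vary by platform)
tess = ctypes.cdll.LoadLibrary('libtesseract.so')
lept_name = ctypes.util.find_library('leptonica') or ctypes.util.find_library('lept')
lept = ctypes.cdll.LoadLibrary(lept_name)

# Define function signatures (handles are opaque pointers)
tess.TessBaseAPICreate.restype = c_void_p
tess.TessBaseAPIInit3.argtypes = [c_void_p, c_char_p, c_char_p]
tess.TessBaseAPISetPageSegMode.argtypes = [c_void_p, c_int]
tess.TessBaseAPISetImage2.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIRecognize.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIRecognize.restype = c_int
tess.TessBaseAPIGetIterator.argtypes = [c_void_p]
tess.TessBaseAPIGetIterator.restype = c_void_p
tess.TessResultIteratorGetPageIterator.argtypes = [c_void_p]
tess.TessResultIteratorGetPageIterator.restype = c_void_p
tess.TessPageIteratorParagraphInfo.argtypes = [
    c_void_p, POINTER(c_int), POINTER(c_int), POINTER(c_int), POINTER(c_int)]
tess.TessResultIteratorNext.argtypes = [c_void_p, c_int]
tess.TessResultIteratorNext.restype = c_int
tess.TessResultIteratorDelete.argtypes = [c_void_p]
tess.TessBaseAPIEnd.argtypes = [c_void_p]
tess.TessBaseAPIDelete.argtypes = [c_void_p]
lept.pixRead.argtypes = [c_char_p]
lept.pixRead.restype = c_void_p
lept.pixDestroy.argtypes = [POINTER(c_void_p)]

RIL_PARA = 1   # TessPageIteratorLevel: paragraph
PSM_AUTO = 3   # TessPageSegMode: fully automatic, no OSD

# Initialize API with English language data (None uses the default tessdata path)
api = tess.TessBaseAPICreate()
if tess.TessBaseAPIInit3(api, None, b"eng") != 0:
    raise RuntimeError("Failed to initialize Tesseract")

# Set PSM_AUTO to enable layout analysis
tess.TessBaseAPISetPageSegMode(api, PSM_AUTO)

# Load image with Leptonica and recognize
pix = lept.pixRead(b"document.png")
if not pix:
    raise RuntimeError("Failed to read image")
tess.TessBaseAPISetImage2(api, pix)
if tess.TessBaseAPIRecognize(api, None) != 0:
    raise RuntimeError("Recognition failed")

# Map integer values to justification types: 0=UNKNOWN, 1=LEFT, 2=CENTER, 3=RIGHT
just_map = {0: "unknown", 1: "left", 2: "center", 3: "right"}

# Iterate paragraphs and extract justification
it = tess.TessBaseAPIGetIterator(api)
if it:
    page_it = tess.TessResultIteratorGetPageIterator(it)  # same object, not freed separately
    while True:
        just, is_list, is_crown, indent = c_int(0), c_int(0), c_int(0), c_int(0)
        tess.TessPageIteratorParagraphInfo(
            page_it, byref(just), byref(is_list), byref(is_crown), byref(indent))
        print(f"Paragraph justification: {just_map.get(just.value, 'unknown')}")

        # Advance to the next paragraph; a zero return means none is left
        if not tess.TessResultIteratorNext(it, RIL_PARA):
            break
    tess.TessResultIteratorDelete(it)

tess.TessBaseAPIEnd(api)
tess.TessBaseAPIDelete(api)
lept.pixDestroy(byref(c_void_p(pix)))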

The C wrapper exposes TessParagraphJustification constants in include/tesseract/capi.h, mapping directly to the C++ enum values in publictypes.h.
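
Because the C function writes a plain integer, it can help to convert that value into a named constant before using it. The following is a small sketch of one way to do that in Python; the ParagraphJustification class and the decode_justification() helper are illustrative names, and the integer values follow the 0 to 3 mapping used in the ctypes example.

```python
from enum import IntEnum


class ParagraphJustification(IntEnum):
    """Mirrors the 0..3 mapping used by the justification enum."""
    UNKNOWN = 0
    LEFT = 1
    CENTER = 2
    RIGHT = 3


def decode_justification(raw: int) -> ParagraphJustification:
    """Convert the raw integer from the C API, treating unexpected values as UNKNOWN."""
    try:
        return ParagraphJustification(raw)
    except ValueError:
        return ParagraphJustification.UNKNOWN


print(decode_justification(2).name)
print(decode_justification(99).name)
```

Falling back to UNKNOWN for out-of-range values keeps calling code from failing if the output variable was never written.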

Extracting Raw Geometric Models (Advanced)

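For applications requiring precise indent measurements or custom text reflow, you can read the underlying ParagraphModel through internal page structures. The PAGE_RES object (defined in src/ccstruct/pageres.h) holds a list of blocks; each BLOCK carries a list of PARA objects, and each PARA points to the ParagraphModel that stores the pixel values for first-line and body-line indents. The snippet below is a sketch under stated assumptions rather than a supported interface.

// Sketch only: relies on internal headers (src/ccstruct/pageres.h, ocrblock.h,
// ocrpara.h) that are not part of the installed public API. Names follow the
// Tesseract 5 source tree and may differ in other releases.
#include <tesseract/baseapi.h>
#include <iostream>
#include "pageres.h"

// TessBaseAPI has no public PAGE_RES accessor; page_res_ is a protected member,
// so this hypothetical subclass exposes it after Recognize() has run.
class ParagraphModelAPI : public tesseract::TessBaseAPI {
 public:
  tesseract::PAGE_RES* page_res() { return page_res_; }
};

void PrintParagraphModels(tesseract::PAGE_RES* page_res) {
  // Iterate over blocks to find paragraph models
  tesseract::BLOCK_RES_IT b(&page_res->block_res_list);
  for (b.mark_cycle_pt(); !b.cycled_list(); b.forward()) {
    tesseract::BLOCK* block = b.data()->block;

    // Paragraph objects reside in the block's paragraph list
    tesseract::PARA_IT p(block->para_list());
    for (p.mark_cycle_pt(); !p.cycled_list(); p.forward()) {
      const tesseract::ParagraphModel* model = p.data()->model;
      if (model) {
        // ToString() outputs formatted indent and justification data
        std::cout << "Paragraph model: " << model->ToString() << "\n";
      }
    }
  }
}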


Summary

  • Enable layout analysis by setting PSM_AUTO, PSM_AUTO_OSD, or PSM_SINGLE_BLOCK via SetPageSegMode() before calling Recognize()
  • Query justification using ParagraphInfo() on a ResultIterator in C++ or TessPageIteratorParagraphInfo() in C to receive JUSTIFICATION_LEFT, JUSTIFICATION_CENTER, JUSTIFICATION_RIGHT, or JUSTIFICATION_UNKNOWN
  • Extract paragraph text with preserved line breaks through GetUTF8Text(RIL_PARA) rather than iterating word-by-word
  • Access raw geometry by inspecting the ParagraphModel attached to each PARA, reached through PAGE_RES and its blocks, when exact indent values are required
  • Source files governing this functionality are src/ccmain/paragraphs.cpp, src/ccstruct/ocrpara.h, include/tesseract/pageiterator.h, include/tesseract/resultiterator.h, and include/tesseract/publictypes.h

Frequently Asked Questions

Which page segmentation modes support paragraph detection?

Only modes performing full layout analysis enable paragraph detection. Use PSM_AUTO (fully automatic), PSM_AUTO_OSD (with orientation and script detection), or PSM_SINGLE_BLOCK (single uniform block). Modes like PSM_SINGLE_LINE, PSM_SINGLE_WORD, or PSM_RAW_LINE skip geometric analysis and return JUSTIFICATION_UNKNOWN for all text blocks.

Can I get the exact pixel indent values for paragraphs?

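Partly. The public ParagraphInfo() call fills a first_line_indent output that describes the first line's indent relative to the body lines. For the separate first-line and body-line indents you must use the internal API rather than the public ResultIterator. The ParagraphModel class in src/ccstruct/ocrpara.h exposes them through its first_indent() and body_indent() accessors. Reach the models by traversing the PAGE_RES structure from src/ccstruct/pageres.h, walking each block's paragraph list, and reading the model pointer on each PARA. This requires building against Tesseract internals and including the private headers.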

What does the paragraph justification "unknown" mean?

JUSTIFICATION_UNKNOWN indicates that the geometric detector in src/ccmain/paragraphs.cpp could not determine alignment with sufficient confidence, or that layout analysis was disabled. This commonly occurs with narrow text columns, irregular character spacing, centered text in small widths, or when using page segmentation modes that treat the image as a single line or word.
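
For text reflow, one reasonable way to handle an "unknown" result is to fall back to left alignment. The sketch below illustrates that choice. The reflow() function is a hypothetical helper, and it measures width and indent in character columns rather than the pixel units Tesseract reports, so a pixel-to-column conversion is assumed to have happened beforehand.

```python
import textwrap


def reflow(text, width, justification="unknown", first_line_indent=0):
    """Re-wrap paragraph text to `width` columns.

    Unknown justification falls back to left alignment. The first-line
    indent only applies to left-aligned output.
    """
    indent = " " * max(first_line_indent, 0)
    words = " ".join(text.split())  # collapse the original line breaks
    lines = textwrap.wrap(words, width=width, initial_indent=indent)
    if justification == "right":
        return [line.strip().rjust(width) for line in lines]
    if justification == "center":
        return [line.strip().center(width).rstrip() for line in lines]
    return lines  # "left" and "unknown"


for line in reflow("one two\nthree", 7, "right"):
    print(repr(line))
```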

Is paragraph detection available in the Tesseract command line tool?

The command line tesseract binary outputs plain text, hOCR, or TSV formats, but does not expose paragraph justification metadata in these outputs. To programmatically access ParagraphInfo() and justification properties, you must use the C++ API (tesseract::ResultIterator) or the C API (TessPageIteratorParagraphInfo).
