# How to Work with Paragraph Detection and Justification in Tesseract OCR

> Learn how Tesseract OCR detects paragraph boundaries and exposes justification properties via C++ and C APIs. Optimize your text analysis today.

- Repository: [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
- Tags: how-to-guide
- Published: 2026-03-02

---

**Tesseract detects paragraph boundaries during layout analysis and exposes justification properties (left, center, right, or unknown) through `ResultIterator::ParagraphInfo()` in the C++ API (inherited from `PageIterator`) or the `TessPageIteratorParagraphInfo()` function in the C API.**

The tesseract-ocr/tesseract engine constructs a geometric model for every paragraph it discovers during page layout analysis. This model captures justification alignment, first-line indentation, and body-line offsets, enabling you to extract structured text with preserved formatting. Understanding how to access these paragraph-level properties is essential for document layout analysis and text reflow applications.

## Understanding the Paragraph Geometric Model

During layout analysis, the detector in [`src/ccmain/paragraphs.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/paragraphs.cpp) examines contiguous blocks of text lines and creates a `ParagraphModel` instance for each paragraph. This model is defined in [`src/ccstruct/ocrpara.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/ocrpara.h) and classifies lines as either **START** (first line) or **BODY** (continuation lines) via the `RowScratchRegisters` structure.

The model tracks three geometric properties:

- **Justification**: Alignment classification as `JUSTIFICATION_LEFT`, `JUSTIFICATION_CENTER`, `JUSTIFICATION_RIGHT`, or `JUSTIFICATION_UNKNOWN` (defined in [`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h))
- **First-line indent**: Horizontal offset of the paragraph's first line relative to the block margin
- **Body-line indent**: Horizontal offset applied to all subsequent lines in the paragraph

Each model also carries a tolerance, in pixels, that is applied when measured indents are compared against the model, so small variations introduced by scanning do not break the match.
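The tolerance comparison can be pictured with a small sketch. The `ParagraphModelSketch` class and its `classify` method are hypothetical names used only for illustration; they are written in Python and are not taken from Tesseract's C++ source. The sketch assumes a line counts as a first line or a body line when its measured indent lies within the tolerance of the corresponding model value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParagraphModelSketch:
    """Illustrative stand-in for a paragraph model (not Tesseract's class)."""
    justification: str   # "left", "center", "right", or "unknown"
    first_indent: int    # expected indent of the first line, in pixels
    body_indent: int     # expected indent of continuation lines, in pixels
    tolerance: int       # allowed deviation, in pixels

    def classify(self, measured_indent: int) -> Optional[str]:
        """Return "start", "body", or None if the indent fits neither."""
        if abs(measured_indent - self.first_indent) <= self.tolerance:
            return "start"
        if abs(measured_indent - self.body_indent) <= self.tolerance:
            return "body"
        return None


model = ParagraphModelSketch("left", first_indent=40, body_indent=0, tolerance=5)
print(model.classify(38))  # start
print(model.classify(3))   # body
print(model.classify(20))  # None
```

When the first-line and body indents are closer together than the tolerance, this simple check cannot tell the two line types apart, which is one reason a real detector needs more evidence than a single indent.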

## Retrieving Paragraph Justification via the C++ API

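The `ResultIterator` class in [`include/tesseract/resultiterator.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/resultiterator.h) exposes paragraph metadata after OCR completes. To access justification information, iterate over the result at the `RIL_PARA` (paragraph) level and call `ParagraphInfo()`, which `ResultIterator` inherits from `PageIterator` ([`include/tesseract/pageiterator.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/pageiterator.h)). The method takes four output pointers: the justification, an `is_list_item` flag, an `is_crown` flag, and the first-line indent.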

### Prerequisites for Paragraph Detection

You must initialize `TessBaseAPI` with a layout-aware page segmentation mode before recognizing text:

- `PSM_AUTO`: Fully automatic page segmentation, without orientation and script detection (OSD)
- `PSM_AUTO_OSD`: Automatic page segmentation with orientation and script detection
- `PSM_SINGLE_BLOCK`: Treat the image as a single uniform text block

Modes like `PSM_SINGLE_LINE` or `PSM_RAW_LINE` bypass layout analysis and will return `JUSTIFICATION_UNKNOWN` for all text.
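When the mode arrives as a bare integer, for example from a configuration file or a `ctypes` call, a small lookup helps avoid magic numbers. The sketch below mirrors the numeric values of the `PageSegMode` enum in `publictypes.h`; the `PARAGRAPH_FRIENDLY_MODES` set and the `is_paragraph_friendly` helper are hypothetical names that simply encode the three modes recommended in this guide.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Numeric values mirror tesseract::PageSegMode in publictypes.h."""
    PSM_OSD_ONLY = 0
    PSM_AUTO_OSD = 1
    PSM_AUTO_ONLY = 2
    PSM_AUTO = 3
    PSM_SINGLE_COLUMN = 4
    PSM_SINGLE_BLOCK_VERT_TEXT = 5
    PSM_SINGLE_BLOCK = 6
    PSM_SINGLE_LINE = 7
    PSM_SINGLE_WORD = 8
    PSM_CIRCLE_WORD = 9
    PSM_SINGLE_CHAR = 10
    PSM_SPARSE_TEXT = 11
    PSM_SPARSE_TEXT_OSD = 12
    PSM_RAW_LINE = 13


# The modes this guide recommends for paragraph detection (an editorial
# choice for this sketch, not a list exported by Tesseract).
PARAGRAPH_FRIENDLY_MODES = frozenset({
    PageSegMode.PSM_AUTO,
    PageSegMode.PSM_AUTO_OSD,
    PageSegMode.PSM_SINGLE_BLOCK,
})


def is_paragraph_friendly(mode: int) -> bool:
    """Return True if the mode is one of the recommended ones.

    Raises ValueError for integers outside the enum.
    """
    return PageSegMode(mode) in PARAGRAPH_FRIENDLY_MODES


print(is_paragraph_friendly(3))                            # True
print(is_paragraph_friendly(PageSegMode.PSM_SINGLE_LINE))  # False
```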

### Complete C++ Implementation

```cpp
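#include <tesseract/baseapi.h>
#include <tesseract/publictypes.h>
#include <tesseract/resultiterator.h>
#include <leptonica/allheaders.h>  // pixRead, pixDestroy

#include <iostream>
#include <memory>

int main() {
  tesseract::TessBaseAPI api;

  // Init() returns 0 on success; nullptr selects the default tessdata location
  if (api.Init(nullptr, "eng") != 0) return 1;
  // Use a layout-aware page segmentation mode
  api.SetPageSegMode(tesseract::PSM_AUTO);

  // Load the image with Leptonica and perform OCR
  Pix* image = pixRead("document.png");
  if (image == nullptr) return 1;
  api.SetImage(image);
  if (api.Recognize(nullptr) != 0) {
    pixDestroy(&image);
    return 1;
  }

  // GetIterator() returns a heap-allocated iterator owned by the caller.
  // Next() returns a bool, so advance the same iterator in a do/while loop.
  std::unique_ptr<tesseract::ResultIterator> it(api.GetIterator());
  if (it) {
    do {
      // Initialize every output so the values are defined even when no
      // paragraph model is attached to the current position
      tesseract::ParagraphJustification just = tesseract::JUSTIFICATION_UNKNOWN;
      bool is_list_item = false;
      bool is_crown = false;
      int first_line_indent = 0;
      it->ParagraphInfo(&just, &is_list_item, &is_crown, &first_line_indent);

      // Map enum to string representation
      const char* just_str = "unknown";
      switch (just) {
        case tesseract::JUSTIFICATION_LEFT:   just_str = "left";   break;
        case tesseract::JUSTIFICATION_RIGHT:  just_str = "right";  break;
        case tesseract::JUSTIFICATION_CENTER: just_str = "center"; break;
        default: break;
      }

      // GetUTF8Text() returns a new[]-allocated string with line breaks
      // preserved; unique_ptr<char[]> frees it with delete[]
      std::unique_ptr<char[]> para_text(it->GetUTF8Text(tesseract::RIL_PARA));

      std::cout << "Paragraph (" << just_str << " justified):\n"
                << (para_text ? para_text.get() : "") << "\n---\n";
    } while (it->Next(tesseract::RIL_PARA));
  }

  it.reset();  // release the iterator before shutting down the API
  pixDestroy(&image);
  api.End();
  return 0;
}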
```

Key source references for this implementation:

- `ParagraphModel` class definition: [`src/ccstruct/ocrpara.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/ocrpara.h)
- Paragraph detection logic: [`src/ccmain/paragraphs.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/paragraphs.cpp)
- `ParagraphInfo()` declaration (in `PageIterator`, inherited by `ResultIterator`): [`include/tesseract/pageiterator.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/pageiterator.h)
- `ParagraphJustification` enum values: [`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h)

## Accessing Paragraph Data Through the C API

For applications using the C interface or Python via `ctypes`, Tesseract exposes paragraph justification through `TessPageIteratorParagraphInfo()`, declared in [`include/tesseract/capi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h). This function fills a `TessParagraphJustification` value along with three further outputs: a list-item flag, a crown flag, and the first-line indent.

### Python Example Using ctypes

```python
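import ctypes
import ctypes.util
from ctypes import POINTER, c_char_p, c_int, c_void_p

# Load the Tesseract and Leptonica shared libraries.
# Library names vary by platform; the fallbacks below assume Linux.
tess = ctypes.CDLL(ctypes.util.find_library("tesseract") or "libtesseract.so")
lept = ctypes.CDLL(ctypes.util.find_library("lept")
                   or ctypes.util.find_library("leptonica")
                   or "liblept.so")

# Declare signatures so 64-bit handles are not truncated to a C int
lept.pixRead.argtypes = [c_char_p]
lept.pixRead.restype = c_void_p
lept.pixDestroy.argtypes = [POINTER(c_void_p)]
lept.pixDestroy.restype = None
tess.TessBaseAPICreate.restype = c_void_p
tess.TessBaseAPIInit3.argtypes = [c_void_p, c_char_p, c_char_p]
tess.TessBaseAPISetPageSegMode.argtypes = [c_void_p, c_int]
tess.TessBaseAPISetImage2.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIRecognize.argtypes = [c_void_p, c_void_p]
tess.TessBaseAPIRecognize.restype = c_int
tess.TessBaseAPIGetIterator.argtypes = [c_void_p]
tess.TessBaseAPIGetIterator.restype = c_void_p
tess.TessResultIteratorGetPageIterator.argtypes = [c_void_p]
tess.TessResultIteratorGetPageIterator.restype = c_void_p
tess.TessPageIteratorParagraphInfo.argtypes = [
    c_void_p, POINTER(c_int), POINTER(c_int), POINTER(c_int), POINTER(c_int)]
tess.TessPageIteratorParagraphInfo.restype = None
tess.TessResultIteratorNext.argtypes = [c_void_p, c_int]
tess.TessResultIteratorNext.restype = c_int
tess.TessResultIteratorDelete.argtypes = [c_void_p]
tess.TessBaseAPIEnd.argtypes = [c_void_p]
tess.TessBaseAPIDelete.argtypes = [c_void_p]

PSM_AUTO = 3  # enables layout analysis
RIL_PARA = 1  # paragraph level of the page iterator
# 0=UNKNOWN, 1=LEFT, 2=CENTER, 3=RIGHT
JUST_NAMES = {0: "unknown", 1: "left", 2: "center", 3: "right"}

# Initialize with English data; None selects the default tessdata location
api = tess.TessBaseAPICreate()
if tess.TessBaseAPIInit3(api, None, b"eng") != 0:
    raise RuntimeError("Failed to initialize Tesseract")
tess.TessBaseAPISetPageSegMode(api, PSM_AUTO)

# Load the image with Leptonica and recognize
pix = c_void_p(lept.pixRead(b"document.png"))
if not pix.value:
    raise RuntimeError("Could not read document.png")
tess.TessBaseAPISetImage2(api, pix)
if tess.TessBaseAPIRecognize(api, None) != 0:
    raise RuntimeError("Recognition failed")

# Iterate paragraphs and extract justification
it = tess.TessBaseAPIGetIterator(api)
if it:
    page_it = tess.TessResultIteratorGetPageIterator(it)
    while True:
        just, is_list, is_crown, indent = c_int(0), c_int(0), c_int(0), c_int(0)
        tess.TessPageIteratorParagraphInfo(
            page_it, ctypes.byref(just), ctypes.byref(is_list),
            ctypes.byref(is_crown), ctypes.byref(indent))
        print(f"Paragraph justification: {JUST_NAMES.get(just.value, 'unknown')}")
        # Advance to the next paragraph; a zero return means no more paragraphs
        if not tess.TessResultIteratorNext(it, RIL_PARA):
            break
    tess.TessResultIteratorDelete(it)

lept.pixDestroy(ctypes.byref(pix))
tess.TessBaseAPIEnd(api)
tess.TessBaseAPIDelete(api)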
```

The C wrapper exposes `TessParagraphJustification` constants in [`include/tesseract/capi.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h), mapping directly to the C++ enum values in [`publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h).
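Because the C function hands back plain integers, it can help to convert them into typed Python values right at the boundary. The sketch below is one way to do that: the enum values follow the order in `publictypes.h` (unknown, left, center, right), while the `ParagraphInfo` dataclass and the `decode_paragraph_info` helper are hypothetical names introduced here for illustration. The three extra fields correspond to the `is_list_item`, `is_crown`, and `first_line_indent` outputs of the C function.

```python
from dataclasses import dataclass
from enum import IntEnum


class ParagraphJustification(IntEnum):
    """Values mirror the ParagraphJustification enum in publictypes.h."""
    UNKNOWN = 0
    LEFT = 1
    CENTER = 2
    RIGHT = 3


@dataclass(frozen=True)
class ParagraphInfo:
    """Typed container for the four outputs of the paragraph-info call."""
    justification: ParagraphJustification
    is_list_item: bool
    is_crown: bool
    first_line_indent: int


def decode_paragraph_info(just: int, is_list_item: int, is_crown: int,
                          first_line_indent: int) -> ParagraphInfo:
    """Convert raw C integers into a ParagraphInfo record."""
    try:
        justification = ParagraphJustification(just)
    except ValueError:
        # Treat values outside the enum as unknown instead of failing
        justification = ParagraphJustification.UNKNOWN
    return ParagraphInfo(justification, bool(is_list_item), bool(is_crown),
                         first_line_indent)


info = decode_paragraph_info(2, 0, 1, 12)
print(info.justification.name.lower())  # center
```

With `ctypes`, the arguments would be the `.value` attributes of the `c_int` objects passed by reference to the C function.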

## Extracting Raw Geometric Models (Advanced)

For applications requiring precise indent measurements or custom text reflow, you can access the underlying `ParagraphModel` through internal page structures. The `PAGE_RES` object (defined in [`src/ccstruct/pageres.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/pageres.h)) lists the recognized blocks; each `BLOCK` holds its `PARA` objects, and each `PARA` points to the `ParagraphModel` it was fitted to, which stores the first-line and body-line indents in pixels. These are internal structures, so the public `TessBaseAPI` interface does not hand out a `PAGE_RES` directly.

```cpp
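// Sketch against Tesseract 5.x internals (private headers, not a stable API).
// How you obtain the PAGE_RES* depends on your build: TessBaseAPI keeps it in
// an internal member and does not return it through the public interface.
#include <iostream>

#include "ocrpara.h"  // PARA, PARA_IT, ParagraphModel
#include "pageres.h"  // PAGE_RES, BLOCK_RES, BLOCK_RES_IT

void DumpParagraphModels(tesseract::PAGE_RES* page_res) {
  // Walk the block results; each one points at its BLOCK
  tesseract::BLOCK_RES_IT block_it(&page_res->block_res_list);
  for (block_it.mark_cycle_pt(); !block_it.cycled_list(); block_it.forward()) {
    tesseract::BLOCK* block = block_it.data()->block;

    // PARA objects for the block live in the list returned by para_list()
    tesseract::PARA_IT para_it(block->para_list());
    for (para_it.mark_cycle_pt(); !para_it.cycled_list(); para_it.forward()) {
      const tesseract::ParagraphModel* model = para_it.data()->model;
      if (model != nullptr) {
        // ToString() outputs formatted indent and justification data
        std::cout << "Paragraph model: " << model->ToString() << "\n";
      }
    }
  }
}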
```

Key internal structures for advanced paragraph geometry:

- `ParagraphTheory`: Manages model creation and lookup during detection, in [`src/ccmain/paragraphs_internal.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/paragraphs_internal.h)
- `PARA`: Container for paragraph properties and the associated model, in [`src/ccstruct/ocrpara.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/ocrpara.h)
- `PAGE_RES`: Top-level page result structure, in [`src/ccstruct/pageres.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/pageres.h)
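Once the indent values are known, a reflow step can reuse them. The following is a minimal sketch, not part of Tesseract: `reflow_paragraph` is a hypothetical helper, and the `px_per_col` conversion from pixels to character columns is an assumption you would replace with a value derived from your own font metrics. It uses Python's standard `textwrap` module to apply a first-line indent and a body indent while re-wrapping the text.

```python
import textwrap


def reflow_paragraph(text: str, first_indent_px: int, body_indent_px: int,
                     width_cols: int = 60, px_per_col: int = 10) -> str:
    """Re-wrap OCR text, converting pixel indents to character columns."""
    first_cols = max(0, round(first_indent_px / px_per_col))
    body_cols = max(0, round(body_indent_px / px_per_col))
    # Collapse the original OCR line breaks and repeated spaces
    joined = " ".join(text.split())
    return textwrap.fill(joined, width=width_cols,
                         initial_indent=" " * first_cols,
                         subsequent_indent=" " * body_cols)


ocr_text = "The quick brown\nfox jumps over\nthe lazy dog."
print(reflow_paragraph(ocr_text, first_indent_px=40, body_indent_px=0,
                       width_cols=24))
```

Here a 40-pixel first-line indent becomes four spaces, and the remaining lines start at the left margin.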

## Summary

- **Enable layout analysis** by setting `PSM_AUTO`, `PSM_AUTO_OSD`, or `PSM_SINGLE_BLOCK` via `SetPageSegMode()` before calling `Recognize()`
- **Query justification** using `ResultIterator::ParagraphInfo()` in C++ or `TessPageIteratorParagraphInfo()` in C to receive `JUSTIFICATION_LEFT`, `JUSTIFICATION_CENTER`, `JUSTIFICATION_RIGHT`, or `JUSTIFICATION_UNKNOWN`
- **Extract paragraph text** with preserved line breaks through `GetUTF8Text(RIL_PARA)` rather than iterating word-by-word
- **Access raw geometry** by inspecting `ParagraphModel` instances, reached through `PAGE_RES` and the `PARA` objects of each block, when exact indent values are required
- **Source files** governing this functionality are [`src/ccmain/paragraphs.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/paragraphs.cpp), [`src/ccstruct/ocrpara.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/ocrpara.h), [`include/tesseract/resultiterator.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/resultiterator.h), and [`include/tesseract/publictypes.h`](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/publictypes.h)

## Frequently Asked Questions

### Which page segmentation modes support paragraph detection?

Only modes performing full layout analysis enable paragraph detection. Use `PSM_AUTO` (fully automatic), `PSM_AUTO_OSD` (with orientation and script detection), or `PSM_SINGLE_BLOCK` (single uniform block). Modes like `PSM_SINGLE_LINE`, `PSM_SINGLE_WORD`, or `PSM_RAW_LINE` skip geometric analysis and return `JUSTIFICATION_UNKNOWN` for all text blocks.

### Can I get the exact pixel indent values for paragraphs?

Partly. The public `ParagraphInfo()` call already reports a first-line indent through its `first_line_indent` output, expressed relative to the body lines. For the full model you must use the internal API rather than the public `ResultIterator`. The `ParagraphModel` class in [`src/ccstruct/ocrpara.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/ocrpara.h) exposes `first_indent()` and `body_indent()` accessors. Reach these by traversing the `PAGE_RES` structure from [`src/ccstruct/pageres.h`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccstruct/pageres.h) and following the `PARA` objects attached to each block to their `model` pointer. This requires linking against Tesseract internals and including the private headers.

### What does the paragraph justification "unknown" mean?

`JUSTIFICATION_UNKNOWN` indicates that the geometric detector in [`src/ccmain/paragraphs.cpp`](https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/paragraphs.cpp) could not determine alignment with sufficient confidence, or that layout analysis was disabled. This commonly occurs with narrow text columns, irregular character spacing, centered text in small widths, or when using page segmentation modes that treat the image as a single line or word.

### Is paragraph detection available in the Tesseract command line tool?

The command line `tesseract` binary outputs plain text, hOCR, or TSV formats, but does not expose paragraph justification metadata in these outputs. To programmatically access `ParagraphInfo()` and justification properties, you must use the C++ API (`tesseract::ResultIterator`) or the C API (`TessPageIteratorParagraphInfo`) as shown in the code examples above.
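Paragraph boundaries themselves are still visible in hOCR output (for example from `tesseract document.png out hocr`), where each paragraph is an element with class `ocr_par` and a `bbox` in its `title` attribute. The sketch below collects those bounding boxes using Python's standard `html.parser`; the `HocrParagraphParser` class is a hypothetical helper, and the `sample` markup is hand-written for illustration, not real Tesseract output.

```python
from html.parser import HTMLParser


class HocrParagraphParser(HTMLParser):
    """Collect (id, bbox) for every element whose class includes "ocr_par"."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocr_par" not in (attr.get("class") or "").split():
            return
        # The title attribute holds ';'-separated properties,
        # e.g. "bbox 36 92 618 184"
        for prop in (attr.get("title") or "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                bbox = tuple(int(p) for p in parts[1:])
                self.paragraphs.append((attr.get("id"), bbox))


# Hand-written sample in the shape of hOCR markup (illustrative only)
sample = """<div class='ocr_carea' id='block_1_1' title="bbox 36 92 618 184">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 36 92 618 184">text</p>
</div>"""

parser = HocrParagraphParser()
parser.feed(sample)
print(parser.paragraphs)  # [('par_1_1', (36, 92, 618, 184))]
```

This recovers where each paragraph sits on the page, but not its justification, which still requires the API calls described earlier.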