# How LiteParse Reconstructs Reading Order for Tables and Complex PDF Layouts

> Discover how LiteParse reconstructs reading order for tables and complex PDFs. It uses a spatial text grid, column anchors, and heuristics for accurate logical output.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: deep-dive
- Published: 2026-05-31

---

**LiteParse reconstructs reading order by projecting raw PDF coordinates into a spatial text grid that normalizes rotation, extracts column anchors, and applies table-specific heuristics to produce logical left-to-right, top-to-bottom output.**

The `run-llama/liteparse` library solves the fundamental challenge of PDF text extraction: the order of characters in the file rarely matches the order a human reader expects. By implementing a multi-stage **spatial text grid** in Rust, LiteParse transforms raw PDFium output into semantically ordered content that preserves the logical flow of tables, multi-column layouts, and even rotated diagrams.

## Rotation Normalization

PDFs often contain text rotated at 90°, 180°, or 270° for labels, headers, or diagram annotations. LiteParse handles this in `handle_rotation_reading_order` within [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs).

The algorithm detects rotated bounding boxes and rewrites their coordinates so they can be processed as standard horizontal text. Critically, overlapping rotated groups are kept together as units, while non-intersecting rotated content is isolated into separate rows. This prevents vertical labels from disrupting the reading order of surrounding paragraphs.
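
The grouping rule described above can be illustrated with a small sketch. This is not LiteParse's Rust code: the `Box` type, the axis-aligned `intersects` test, and `group_rotated` are hypothetical names chosen for this example, and the coordinate rewrite itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float
    rotation: int = 0  # degrees: 0, 90, 180, or 270


def intersects(a: Box, b: Box) -> bool:
    """Axis-aligned overlap test between two bounding boxes."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def group_rotated(boxes: list[Box]) -> list[list[Box]]:
    """Cluster rotated boxes: intersecting boxes share a group,
    non-intersecting ones stay alone (one group per future row)."""
    groups: list[list[Box]] = []
    for box in (b for b in boxes if b.rotation % 360 != 0):
        touching = [g for g in groups if any(intersects(box, m) for m in g)]
        untouched = [g for g in groups if not any(intersects(box, m) for m in g)]
        # Merge every group the new box touches into a single unit.
        merged = [m for g in touching for m in g] + [box]
        groups = untouched + [merged]
    return groups


label_a = Box(0, 0, 10, 40, rotation=90)
label_b = Box(5, 10, 10, 40, rotation=90)     # overlaps label_a
side_note = Box(100, 0, 10, 40, rotation=270)  # isolated
body_text = Box(0, 0, 50, 10)                  # not rotated, ignored here
groups = group_rotated([label_a, label_b, side_note, body_text])
```

Here `label_a` and `label_b` end up in one group while `side_note` gets its own, which mirrors the "kept together as units" versus "isolated into separate rows" distinction.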

## Anchor Extraction and Cleaning

To identify columns, LiteParse creates **anchor keys** for each non-rotated bounding box. The `extract_block_anchors` function generates left, right, and center anchors using quarter-point granularity on the x-axis, effectively marking where columns begin and end.
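
As a rough illustration of quarter-point anchors, the sketch below quantizes x-coordinates to 0.25 pt steps and derives left, center, and right keys for a box. The function signatures and the integer key representation are assumptions made for this example; the source only names `extract_block_anchors` and `anchor_key`, not their exact shapes.

```python
from collections import Counter


def anchor_key(x: float) -> int:
    """Quantize an x-coordinate to quarter-point units (assumed: 4 keys per point)."""
    return round(x * 4)


def extract_anchors(x: float, width: float) -> dict[str, int]:
    """Left, center, and right anchor keys for one non-rotated box."""
    return {
        "left": anchor_key(x),
        "center": anchor_key(x + width / 2),
        "right": anchor_key(x + width),
    }


# Three boxes on different lines whose left edges differ by sub-quarter-point jitter.
boxes = [(72.10, 40.0), (72.05, 55.0), (72.12, 38.0)]
left_votes = Counter(extract_anchors(x, w)["left"] for x, w in boxes)
```

Because all three left edges quantize to the same key, `left_votes` records three votes for one anchor, which is the kind of repeated alignment that suggests a column start.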

Raw anchors can include noise from decorative elements or graphics. The pipeline cleans them with two filters in [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs):

- **`delta_min_filter`** removes isolated anchors that lack neighboring structural support
- **`intercept_filter`** eliminates anchors crossed by other text lines, which usually indicate stray graphical elements rather than column boundaries

This cleaning ensures that only meaningful vertical alignments influence the final layout.
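
The two filters can be sketched as follows. The data shapes (anchors as a map from x-key to the row indices where they occur, text spans as `(row, x_start, x_end)` tuples) and the `max_row_gap` parameter are assumptions for illustration only; the real Rust functions may use different inputs and thresholds.

```python
def delta_min_filter(anchors: dict[int, list[int]],
                     max_row_gap: int = 2) -> dict[int, list[int]]:
    """Keep anchors that recur on nearby rows; drop isolated ones."""
    kept = {}
    for x, rows in anchors.items():
        rows = sorted(rows)
        # "Supported" here means: at least two occurrences close together.
        if any(b - a <= max_row_gap for a, b in zip(rows, rows[1:])):
            kept[x] = rows
    return kept


def intercept_filter(anchors: dict[int, list[int]],
                     spans: list[tuple[int, float, float]]) -> dict[int, list[int]]:
    """Drop anchors that a text span crosses within the anchor's row range."""
    kept = {}
    for x, rows in anchors.items():
        top, bottom = min(rows), max(rows)
        crossed = any(top <= row <= bottom and x0 < x < x1
                      for row, x0, x1 in spans)
        if not crossed:
            kept[x] = rows
    return kept


anchors = {100: [0, 1, 2], 250: [0, 1, 2], 400: [7]}
spans = [(1, 200.0, 300.0)]  # a line of text running straight through x=250
cleaned = intercept_filter(delta_min_filter(anchors), spans)
```

In this toy input the anchor at 400 is dropped for lacking support and the anchor at 250 is dropped because text runs across it, leaving only the anchor at 100.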

## Block Segmentation and Flow Detection

Before applying table logic, LiteParse segments the page into **blocks**: contiguous groups of lines that `segment_blocks` splits apart using double-blank-line detection. Segmenting first lets the engine classify and render each region on its own, so a table and the paragraph next to it are not forced through the same path.
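
A minimal sketch of this kind of segmentation, assuming lines are already rendered as strings and that "double-blank" means two or more consecutive empty lines (the source does not spell out the exact rule):

```python
def segment_blocks(lines: list[str]) -> list[list[str]]:
    """Split lines into blocks wherever two or more blank lines occur in a row."""
    blocks: list[list[str]] = []
    current: list[str] = []
    blank_run = 0
    for line in lines:
        if not line.strip():
            blank_run += 1
            continue
        if blank_run >= 2 and current:
            # A double blank closes the current block.
            blocks.append(current)
            current = []
        elif blank_run == 1 and current:
            # A single blank is ordinary spacing and stays inside the block.
            current.append("")
        current.append(line)
        blank_run = 0
    if current:
        blocks.append(current)
    return blocks


page = ["Intro line", "", "still intro", "", "", "Col A   Col B", "1       2"]
blocks = segment_blocks(page)
```

The single blank line stays inside the first block, while the double blank separates the prose from the table-like lines that follow.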

The `is_flowing_text_block` function analyzes anchor density, line width, column-gap frequency, and line count to distinguish between ordinary prose and structured tables. When flowing text is detected, `render_flowing_block` preserves indentation and normal spacing, ensuring that paragraphs read naturally rather than being forced into a rigid grid.
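
To make the classification idea concrete, here is a simplified sketch that uses three of the four signals named above (column-gap frequency, line width, and line count) on rendered text lines. The thresholds, the regex for a column gap, and the `page_width_chars` parameter are all invented for this example, and anchor density is omitted; the actual heuristic in LiteParse may weigh these signals quite differently.

```python
import re

# Assumed: three or more spaces between two tokens counts as a column gap.
COLUMN_GAP = re.compile(r"\S {3,}\S")


def is_flowing_text_block(lines: list[str], page_width_chars: int = 80) -> bool:
    """Guess whether a block is prose (True) or table-like (False)."""
    text_lines = [line for line in lines if line.strip()]
    if len(text_lines) < 2:
        return True  # too little evidence to call it a table
    gap_ratio = sum(bool(COLUMN_GAP.search(line)) for line in text_lines) / len(text_lines)
    avg_width = sum(len(line.rstrip()) for line in text_lines) / len(text_lines)
    return gap_ratio < 0.3 and avg_width > 0.5 * page_width_chars


prose = [
    "LiteParse projects raw PDF coordinates onto a grid so that text reads",
    "in the order a person would expect, even when the file stores it oddly.",
]
table = [
    "Region      Q1      Q2",
    "North   12,345   9,870",
    "South    8,210   7,455",
]
prose_is_flowing = is_flowing_text_block(prose)
table_is_flowing = is_flowing_text_block(table)
```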

## Line Formation and Table Heuristics

The core ordering logic resides in `form_lines` inside [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs). This stage snaps items to a y-grid and sorts them by (y, x) coordinates. Adjacent items that share a baseline and are separated by a negligible gap merge into single words, and a second pass handles numeric table formatting.
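
The snap, sort, and merge steps can be sketched in a few lines. The `Item` type, the `y_step` grid size, and the `merge_gap` threshold are hypothetical values picked for illustration, and the numeric second pass is not shown.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    x: float
    y: float
    w: float


def form_lines(items: list[Item], y_step: float = 2.0,
               merge_gap: float = 1.0) -> list[list[Item]]:
    """Snap items to a y-grid, sort by (row, x), and merge near-touching neighbors."""
    ordered = sorted(items, key=lambda it: (round(it.y / y_step), it.x))
    lines: list[list[Item]] = []
    current_row = None
    for it in ordered:
        row = round(it.y / y_step)
        if row != current_row:
            lines.append([Item(it.text, it.x, it.y, it.w)])
            current_row = row
            continue
        prev = lines[-1][-1]
        gap = it.x - (prev.x + prev.w)
        if gap <= merge_gap:
            # Fragments of the same word: extend the previous item.
            prev.text += it.text
            prev.w = it.x + it.w - prev.x
        else:
            lines[-1].append(Item(it.text, it.x, it.y, it.w))
    return lines


fragments = [
    Item("lo", 20, 100.4, 10),
    Item("Hel", 5, 100.0, 15),
    Item("world", 40, 99.8, 25),
    Item("next", 5, 112.0, 20),
]
lines = form_lines(fragments)
texts = [[it.text for it in line] for line in lines]
```

Even though the fragments arrive out of order and with slightly different baselines, `texts` comes out as two lines, with "Hel" and "lo" merged into one word.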

LiteParse employs three critical table-specific heuristics to prevent incorrect merging:

- **`looks_like_table_number`** identifies numeric patterns (e.g., "12,345") and prevents separate table cells from being concatenated into single tokens
- **`line_has_column_gap`** detects large horizontal gaps that indicate column boundaries in tabular data
- **Height and vertical overlap checks** stop "snowball" merging where content from adjacent columns might otherwise collapse into single rows

These heuristics ensure that financial tables, scientific data, and multi-column reports maintain their cell boundaries while still reading in the correct sequence.
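
The three heuristics above might look roughly like this in simplified form. The regex, the gap and overlap thresholds, and the helper names `can_merge` and `same_row` are assumptions for this sketch; only `looks_like_table_number` and `line_has_column_gap` are names taken from the source, and their real signatures may differ.

```python
import re

# Assumed pattern: optional sign and dollar sign, grouped or plain digits,
# optional decimals, optional percent sign.
TABLE_NUMBER = re.compile(r"[-+]?\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?")


def looks_like_table_number(token: str) -> bool:
    return TABLE_NUMBER.fullmatch(token) is not None


def can_merge(left: str, right: str) -> bool:
    """Refuse to glue two tokens that each already look like a complete number."""
    return not (looks_like_table_number(left) and looks_like_table_number(right))


def line_has_column_gap(spans: list[tuple[float, float]], min_gap: float = 12.0) -> bool:
    """spans are (x, width) pairs; a wide horizontal gap suggests a column boundary."""
    ordered = sorted(spans)
    return any(nx - (px + pw) >= min_gap
               for (px, pw), (nx, _) in zip(ordered, ordered[1:]))


def same_row(a_top: float, a_h: float, b_top: float, b_h: float,
             min_overlap: float = 0.5, max_height_ratio: float = 1.5) -> bool:
    """Height and vertical-overlap check that guards against snowball merging."""
    overlap = min(a_top + a_h, b_top + b_h) - max(a_top, b_top)
    shorter, taller = sorted((a_h, b_h))
    return overlap >= min_overlap * shorter and taller <= max_height_ratio * shorter


cells_stay_separate = not can_merge("12,345", "9,870")
words_may_merge = can_merge("Reve", "nue")
```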

## Final Ordering and Integration

After processing, lines receive a final sort by their top-most y-value, with items within each line sorted by x-value. The `items.sort_by(|a, b| a.item.y.total_cmp(&b.item.y))` call in `handle_rotation_reading_order` produces a **stable, reading-order list** that populates `ParsedPage.text` and `ParsedPage.textItems`.
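
For readers more familiar with Python than Rust, the same two-level ordering can be sketched as below. Items are modeled as plain `(x, y, text)` tuples, which is an assumption for this example rather than the library's data model. Python's `sorted` is stable, as is Rust's `sort_by`, so items with equal keys keep their earlier relative order.

```python
def order_lines(lines: list[list[tuple[float, float, str]]]) -> list[str]:
    """Sort items in each line by x, then sort lines by their top-most y."""
    by_x = [sorted(line, key=lambda it: it[0]) for line in lines if line]
    by_top = sorted(by_x, key=lambda line: min(it[1] for it in line))
    return [text for line in by_top for _, _, text in line]


unordered = [
    [(200.0, 50.0, "B2"), (20.0, 50.5, "A2")],
    [(20.0, 30.0, "A1"), (200.0, 30.2, "B1")],
]
reading_order = order_lines(unordered)
```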

The orchestrator in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) invokes `projection::project_pages_to_grid` at lines 88-90, after any OCR merging, which makes this grid logic the single source of truth for text ordering throughout the library.

## Usage Examples

Both the Node.js and Python bindings expose this reconstructed reading order through the same underlying Rust pipeline.

### Node.js

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({ ocrEnabled: false });
  const result = await parser.parse("sample-with-tables.pdf");

  // Full document text with solved reading order
  console.log(result.text);

  // Individual items with reconstructed coordinates
  for (const page of result.pages) {
    console.log(`--- page ${page.pageNum} ---`);
    for (const item of page.textItems) {
      console.log(`[${item.x.toFixed(1)},${item.y.toFixed(1)}] ${item.text}`);
    }
  }
})();
```

### Python

```python
from liteparse import LiteParse

parser = LiteParse(ocr_enabled=False)
result = parser.parse("sample-with-tables.pdf")

print(result.text)  # logical reading order

for page in result.pages:
    print(f"--- page {page.page_num} ---")
    for ti in page.text_items:
        print(f"[{ti.x:.1f},{ti.y:.1f}] {ti.text}")
```

Both examples call `LiteParse::parse_input` → `projection::project_pages_to_grid`, ensuring the reading order matches exactly what the Rust spatial grid algorithm produces.

## Summary

- **Spatial text grid** transforms raw PDF coordinates into logical reading order through nine distinct stages implemented in [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs)
- **Rotation handling** via `handle_rotation_reading_order` normalizes 90°/180°/270° text while preserving group relationships
- **Anchor extraction** (`extract_block_anchors`, `anchor_key`) identifies column boundaries with quarter-point precision, filtered by `delta_min_filter` and `intercept_filter` to remove noise
- **Block segmentation** (`segment_blocks`) separates flowing prose from tabular regions, with `is_flowing_text_block` applying density heuristics to distinguish layout types
- **Table heuristics** including `looks_like_table_number` and `line_has_column_gap` prevent incorrect cell merging while maintaining numeric formatting
- **Final sorting** produces stable left-to-right, top-to-bottom output exposed through `ParsedPage.text` and `ParsedPage.textItems`

## Frequently Asked Questions

### How does LiteParse handle rotated text in PDFs?

LiteParse detects text rotated at 90°, 180°, or 270° during the rotation normalization stage in `handle_rotation_reading_order`. It rewrites the coordinates of rotated items so they can be processed as standard horizontal text. Overlapping rotated groups remain grouped together, while isolated rotated elements are placed in separate rows to prevent them from disrupting the main content flow.

### What prevents LiteParse from merging separate table cells?

The library employs the `looks_like_table_number` heuristic to recognize numeric patterns that should remain distinct (such as "12,345"), along with `line_has_column_gap` to detect the large horizontal gaps that separate columns. Additionally, height and vertical overlap checks in `form_lines` prevent content from adjacent columns from collapsing into single rows, ensuring table cells maintain their boundaries.

### How does LiteParse distinguish between flowing text and tables?

The `is_flowing_text_block` function in [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) analyzes anchor density, line width, column-gap frequency, and line count to classify blocks. High anchor density with frequent column gaps indicates tabular structure, while consistent line widths with normal spacing trigger the `render_flowing_block` path that preserves paragraph indentation and prose formatting.

### Can I access the reconstructed coordinates programmatically?

Yes. After parsing, each page exposes an array of text items (`textItems` in Node.js, `text_items` in Python) containing every element's reconstructed x and y coordinates. These values reflect the final sorted positions after rotation normalization and grid snapping have completed, so you can map the logical reading order back to precise spatial locations on the page.