How LiteParse's Spatial Text Projection Handles Multi-Column PDF Layouts

LiteParse uses a spatial projection algorithm that extracts geometric anchors from PDFium text boxes, detects column gaps using median-character-width heuristics, and classifies text blocks to preserve left-to-right reading order across multiple columns.

When processing PDF documents with complex layouts, run-llama/liteparse transforms raw text boxes into logically ordered content through spatial text projection. This Rust-based engine specifically addresses the challenge of multi-column layouts by combining geometric analysis with flow-text heuristics, ensuring that text from the left column is not erroneously merged with the right column during extraction.

The Challenge of Multi-Column PDF Layouts

PDF documents store text as disconnected bounding boxes without inherent reading order. In multi-column layouts, spatial proximity can trick simple extraction engines into concatenating text from column one with column two. LiteParse solves this by treating the page as a geometric coordinate system where horizontal gaps serve as structural delimiters.

Core Spatial Projection Pipeline

The spatial projection engine in crates/liteparse/src/projection.rs processes multi-column documents through four distinct phases. Each phase builds upon quantized spatial data to determine whether text flows as a single paragraph or spans multiple columns.

Step 1: Anchor Extraction and Quantization

LiteParse begins by extracting anchor keys from every non-rotated text box. The extract_block_anchors function generates three anchors per box—left, right, and center—quantized to discrete x-coordinates. These anchors populate hash maps keyed by anchor_key, creating a spatial index that serves as the backbone for column detection.

According to the source code, this quantization reduces noise from minor positional variations while preserving the structural alignment of column boundaries.

Step 2: Column-Gap Detection via line_has_column_gap

The engine identifies column separators through the line_has_column_gap function. For any line of projected items, this function searches for a large empty space that straddles the page midpoint. A gap qualifies as a column separator when it exceeds twice the median character width and the text items on either side cross the center of the page.

This heuristic prevents false positives from intra-column spacing while reliably detecting the void between distinct columns.

Step 3: Flowing-Text Classification

Multi-column rejection occurs in is_flowing_text_block. This function classifies a block as "flowing" (single-column prose) only if it contains few anchors, minimal left anchors, and critically—does not contain multiple lines with column gaps. When the function detects two or more lines exhibiting column gaps, it rejects the block as flowing text, marking it instead as a structured multi-column block requiring special handling.

Step 4: Per-Line Rendering with detect_and_render_flowing_lines

During final text assembly, detect_and_render_flowing_lines re-examines each line using has_relative_column_gap (which wraps the column-gap detection logic). If a gap qualifies as a column separator, the renderer treats it as a hard break (should_space = 1) and skips the "snowball" effect that could otherwise merge items from alternating columns. This ensures each column remains an independent line stream while text within the column merges naturally.

Implementation in projection.rs

The spatial projection logic resides primarily in crates/liteparse/src/projection.rs. Key functional blocks include:

  • Line 1009: extract_block_anchors creates the spatial backbone
  • Line 1229: line_has_column_gap implements the gap heuristic
  • Line 1248: is_flowing_text_block classifies block structure
  • Line 1902: detect_and_render_flowing_lines handles final rendering

The algorithm's reliance on median character width for threshold calculation makes it resilient to varying font sizes and page dimensions.

Usage Example: Parsing Two-Column PDFs

You can access this spatial projection pipeline through the Node.js or Python wrappers. Both interfaces call the underlying Rust core implementing the algorithm described above.

TypeScript Example

import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({ outputFormat: "text" });
  const result = await parser.parse("samples/two-column.pdf");
  console.log(result.text);   // Text appears in logical left-to-right column order
})();

Python Example

from liteparse import LiteParse

parser = LiteParse(output_format="text")
result = parser.parse("samples/two-column.pdf")
print(result.text)   # Columns are correctly ordered

Summary

  • LiteParse's spatial text projection converts raw PDFium boxes into logical reading order using geometric anchors and gap detection.
  • The line_has_column_gap function identifies column boundaries by detecting large horizontal voids exceeding twice the median character width.
  • is_flowing_text_block rejects multi-column layouts from flowing-text classification, preventing horizontal concatenation.
  • Per-line rendering in detect_and_render_flowing_lines treats column gaps as hard breaks, preserving independent column streams.
  • All spatial logic is implemented in crates/liteparse/src/projection.rs within the run-llama/liteparse repository.

Frequently Asked Questions

What is spatial text projection in LiteParse?

Spatial text projection is the algorithmic process that transforms geometric text boxes extracted from PDFium into a logical reading order. It analyzes x-coordinates, detects column structures, and determines whether text blocks flow as single columns or span multiple columns, ensuring the output reflects the document's visual layout.

How does LiteParse detect column boundaries?

LiteParse detects column boundaries through the line_has_column_gap function in projection.rs. This function looks for horizontal gaps that straddle the page midpoint and exceed twice the median character width. When text appears on both sides of such a gap, the engine marks it as a column separator.

Why does LiteParse use anchor maps instead of simple sorting?

Anchor maps provide quantized spatial indices that normalize minor positional variations in PDF text boxes. By extracting left, right, and center anchors via extract_block_anchors, LiteParse creates a stable geometric backbone that survives font changes and slight misalignments, something simple x-coordinate sorting cannot handle reliably.

Can LiteParse handle more than two columns?

Yes. While the documentation emphasizes two-column layouts, the underlying heuristic—detecting gaps that straddle the midpoint and cross the center—scales to any layout where columns create distinct horizontal voids. The algorithm treats any qualifying gap as a structural delimiter regardless of the total column count.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →