LiteParse JSON vs Text Output Formats: Complete Technical Comparison

May 30, 2026 run-llama/liteparse

LiteParse provides JSON output for structured data with full metadata including bounding boxes and OCR confidence, while Text output delivers a flattened plain-text string optimized for human readability and simple pipelining.

The run-llama/liteparse repository offers two distinct output representations for parsed documents. Users select between JSON and Text formats through the output_format field in LiteParseConfig or the --output-format CLI argument. Both formats process the same underlying ParsedPage data structures but serve fundamentally different consumption patterns.

Structural Data Hierarchy

JSON Format Architecture

In crates/liteparse/src/output/json.rs, the format_json function serializes ParsedResult into a nested JSON document using serde_json. The hierarchy is ParsedResult → pages[] → items[], where each item carries its extracted text alongside spatial and quality metadata.
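To make the hierarchy concrete, here is a minimal sketch of how a consumer might walk such a document. The key names pages, items, text, bbox, confidence, and source follow the names used in this article; the page key, the [x0, y0, x1, y1] bbox layout, and all sample values are assumptions for illustration, not the confirmed LiteParse schema.

```python
import json

# Hypothetical sample document; the real schema emitted by LiteParse may differ.
SAMPLE_JSON = """
{
  "pages": [
    {
      "page": 1,
      "items": [
        {"text": "Invoice 1042", "bbox": [72, 40, 210, 58], "confidence": 1.0, "source": "native"},
        {"text": "Total: $18.50", "bbox": [72, 90, 190, 106], "confidence": 0.62, "source": "ocr"}
      ]
    },
    {
      "page": 2,
      "items": [
        {"text": "Thank you", "bbox": [72, 40, 160, 58], "confidence": 0.97, "source": "ocr"}
      ]
    }
  ]
}
"""


def iter_items(document):
    """Yield (page_index, item) pairs by walking pages[] and then items[]."""
    for page_index, page in enumerate(document.get("pages", [])):
        for item in page.get("items", []):
            yield page_index, item


document = json.loads(SAMPLE_JSON)
for page_index, item in iter_items(document):
    print(page_index, item["source"], item["bbox"], item["text"])
```

Keeping the traversal in one generator means downstream code (filters, overlays, exports) does not need to repeat the nested loops.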

Text Format Architecture

The format_text function in crates/liteparse/src/output/text.rs flattens the same ParsedPage list into a single string. It walks the parsed structure and joins the text fields with \n delimiters to preserve natural reading order across lines and pages.
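The flattening step described above can be sketched in a few lines of Python. This is an illustrative approximation operating on plain dictionaries, not a port of format_text; the items and text keys are assumed from the hierarchy described in this article, and the real formatter may treat page boundaries or empty items differently.

```python
def flatten_to_text(pages):
    """Join every item's text with newlines, in page order then item order."""
    lines = []
    for page in pages:
        for item in page.get("items", []):
            # Assumed key name; missing text is treated as an empty line.
            lines.append(item.get("text", ""))
    return "\n".join(lines)


# Hypothetical parsed pages, reduced to the one field the text format keeps.
pages = [
    {"items": [{"text": "Chapter 1"}, {"text": "It was a quiet morning."}]},
    {"items": [{"text": "Chapter 2"}]},
]
print(flatten_to_text(pages))
```

The same function is one way to derive a plain-text view from JSON output after the fact, since the text format contains nothing the JSON format lacks.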

Metadata Retention and Richness

JSON output includes per-item metadata essential for programmatic processing:

bbox: Bounding box coordinates for spatial positioning
confidence: OCR confidence scores for quality auditing
source: Flags distinguishing native text from ocr results

Text output strips all metadata, emitting only the concatenated human-readable content. This makes it ideal for quick inspection or standard Unix toolchains like grep and cat.
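As an example of what the metadata enables, the sketch below flags OCR items that fall under a confidence threshold. It assumes confidence is a float between 0 and 1 and that source holds the strings "native" or "ocr"; neither detail is confirmed here, and low_confidence_ocr is a hypothetical helper rather than part of LiteParse.

```python
def low_confidence_ocr(items, threshold=0.8):
    """Return OCR-sourced items whose confidence is below the threshold."""
    return [
        item
        for item in items
        # Assumed conventions: source == "ocr" marks OCR text, and a missing
        # confidence is treated as fully confident (1.0).
        if item.get("source") == "ocr" and item.get("confidence", 1.0) < threshold
    ]


# Hypothetical items as they might appear inside one page of JSON output.
items = [
    {"text": "Invoice 1042", "confidence": 1.0, "source": "native"},
    {"text": "Tota1: $18.5O", "confidence": 0.55, "source": "ocr"},
    {"text": "Thank you", "confidence": 0.97, "source": "ocr"},
]

for item in low_confidence_ocr(items):
    print(f"review needed ({item['confidence']:.2f}): {item['text']}")
```

None of this is possible with Text output, because the confidence and source fields are gone by the time the string is produced.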

Configuration and Implementation

The OutputFormat enum defined in crates/liteparse/src/config.rs declares both variants: OutputFormat::Json and OutputFormat::Text. The CLI dispatcher in crates/liteparse/src/main.rs matches the configuration variant to invoke the appropriate formatter:

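// Pick the formatter that matches the configured output format.
match config.output_format {
    // The JSON formatter returns a Result, so errors propagate with `?`.
    OutputFormat::Json => json::format_json(&result.pages)?,
    // The text formatter returns its string directly.
    OutputFormat::Text => text::format_text(&result.pages),
}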

All language bindings (Node.js, Python, WASM) expose these variants as string options "json" and "text", returning the formatted result to the respective runtime.

Practical Usage Examples

Command Line Interface

Select your format using the --output-format flag:

liteparse input.pdf --output-format json > result.json
liteparse input.pdf --output-format text > result.txt

Node.js

Configure the output format during parser initialization:

import { LiteParse } from "liteparse";

const jsonParser = new LiteParse({ output_format: "json" });
const structured = await jsonParser.parse("sample.pdf");  // JSON string

const textParser = new LiteParse({ output_format: "text" });
const plain = await textParser.parse("sample.pdf");     // Plain text

Python

The Python bindings follow the same configuration pattern:

from liteparse import LiteParse

json_result = LiteParse(output_format="json").parse("sample.pdf")
text_result = LiteParse(output_format="text").parse("sample.pdf")

WebAssembly (Browser)

In browser environments using the WASM build:

import { LiteParse } from "liteparse-wasm";

const parser = new LiteParse({ output_format: "json" });
const result = await parser.parse(file);  // Returns JSON string

Summary

JSON output in LiteParse provides machine-readable structured data with full metadata hierarchy (ParsedResult → pages → items), enabling spatial analysis and OCR quality auditing.
Text output produces a flattened plain-text string by joining extracted text with newline characters, optimized for human reading and simple command-line pipelines.
Both formats are controlled through the OutputFormat enum in crates/liteparse/src/config.rs and dispatched from crates/liteparse/src/main.rs.
The JSON formatter lives in crates/liteparse/src/output/json.rs and uses serde_json, while the text formatter resides in crates/liteparse/src/output/text.rs.

Frequently Asked Questions

When should I use JSON over Text output in LiteParse?

Use JSON when your application needs to perform spatial analysis, filter low-confidence OCR results, or render visual overlays using bounding box coordinates. Use Text for simple content extraction, full-text search, or when piping results to other command-line tools like grep or awk.

Does Text output preserve document layout?

Text output preserves natural reading order by inserting line breaks between text items and pages, but it discards spatial metadata. If you need to reconstruct the original visual layout or extract tables with positional awareness, use the JSON format, which retains bbox coordinates for every text fragment.
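A rough sketch of positional reconstruction from JSON items: group fragments into rows by their top coordinate, then order each row left to right. It assumes bbox is [x0, y0, x1, y1] with y increasing down the page, which this article does not confirm; group_into_rows and y_tolerance are hypothetical names introduced for illustration.

```python
def group_into_rows(items, y_tolerance=3.0):
    """Group items into visual rows by top edge, each row sorted left to right."""
    rows = []
    # Visit items top to bottom (assumed: bbox[1] is the top edge, y grows downward).
    for item in sorted(items, key=lambda it: it["bbox"][1]):
        top = item["bbox"][1]
        if rows and abs(top - rows[-1]["top"]) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"top": top, "items": [item]})
    # Within each row, order by the left edge (assumed: bbox[0]).
    return [
        [it["text"] for it in sorted(row["items"], key=lambda it: it["bbox"][0])]
        for row in rows
    ]


# Hypothetical fragments from a two-column table, listed out of visual order.
items = [
    {"text": "Qty", "bbox": [200, 100, 230, 112]},
    {"text": "Item", "bbox": [72, 101, 110, 113]},
    {"text": "2", "bbox": [200, 120, 210, 132]},
    {"text": "Widget", "bbox": [72, 121, 130, 133]},
]

for row in group_into_rows(items):
    print("\t".join(row))
```

Real documents usually need more than a fixed tolerance (rotated text, multi-line cells, varying font sizes), so treat this as a starting point rather than a table extractor.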

Can I convert between JSON and Text formats after parsing?

LiteParse does not provide a built-in conversion utility between formats. Since Text output is a lossy representation that discards metadata, you cannot reconstruct the full JSON structure from a Text file. If you require both representations, parse the document once with output_format: "json" and extract the text fields programmatically from the resulting structure.

Which output format has better performance?

Text output should be marginally faster because it skips JSON serialization and produces smaller payloads. Both formats work from the same underlying ParsedPage data structures, though, so the difference is negligible next to the initial PDF parsing and OCR work. Choose based on data requirements rather than performance.
