LiteParse JSON vs Text Output Formats: Complete Technical Comparison
LiteParse provides JSON output for structured data with full metadata including bounding boxes and OCR confidence, while Text output delivers a flattened plain-text string optimized for human readability and simple pipelining.
The run-llama/liteparse repository offers two distinct output representations for parsed documents. Users select between JSON and Text formats through the output_format field in LiteParseConfig or the --output-format CLI argument. Both formats process the same underlying ParsedPage data structures but serve fundamentally different consumption patterns.
Structural Data Hierarchy
JSON Format Architecture
In crates/liteparse/src/output/json.rs, the format_json function serializes ParsedResult into a deeply nested JSON document using serde_json. The hierarchy follows ParsedResult → pages[] → items[], where each item contains the extracted text alongside spatial and quality metadata.
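To make the ParsedResult → pages[] → items[] hierarchy concrete, here is a minimal Python sketch that walks a document of that shape. The key names (`pages`, `items`, `text`) and the sample content are assumptions for illustration, not the confirmed LiteParse schema, and `iter_items` is a hypothetical helper. Inspect the JSON your own build emits before relying on these names.

```python
import json

# Hypothetical sample mirroring ParsedResult -> pages[] -> items[].
# The key names are assumed; verify them against real LiteParse output.
SAMPLE = """
{"pages": [
  {"items": [{"text": "Invoice"}, {"text": "Total: 42"}]},
  {"items": [{"text": "Page two"}]}
]}
"""


def iter_items(document):
    """Yield (page_index, item) pairs in document order."""
    for page_index, page in enumerate(document.get("pages", [])):
        for item in page.get("items", []):
            yield page_index, item


document = json.loads(SAMPLE)
walked = [(page_index, item["text"]) for page_index, item in iter_items(document)]
print(walked)
```

Keeping the traversal in one generator means later steps (filtering, layout analysis, text extraction) can share the same walk instead of repeating the nested loops.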
Text Format Architecture
The format_text function in crates/liteparse/src/output/text.rs flattens the same ParsedPage list into a single string. It walks the parsed structure and joins the text fields with \n delimiters to preserve natural reading order across lines and pages.
Metadata Retention and Richness
JSON output includes per-item metadata essential for programmatic processing:
- bbox: Bounding box coordinates for spatial positioning
- confidence: OCR confidence scores for quality auditing
- source: Flags distinguishing native text from ocr results
Text output strips all metadata, emitting only the concatenated human-readable content. This makes it ideal for quick inspection or standard Unix toolchains like grep and cat.
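As an illustration of what the retained metadata makes possible, the sketch below filters items by OCR confidence while always keeping native text. It assumes that `confidence` is a number between 0 and 1 (possibly absent for native text) and that `source` holds the strings "native" or "ocr". These representations, the sample items, and the `keep_reliable` helper are hypothetical, so adjust them to the real output.

```python
# Hypothetical items using the bbox / confidence / source fields.
# Exact key names, value types, and ranges are assumptions.
items = [
    {"text": "Header", "bbox": [10, 10, 200, 30], "confidence": None, "source": "native"},
    {"text": "T0tal", "bbox": [10, 40, 80, 60], "confidence": 0.41, "source": "ocr"},
    {"text": "Total", "bbox": [10, 70, 80, 90], "confidence": 0.97, "source": "ocr"},
]


def keep_reliable(items, min_confidence=0.8):
    """Keep native text, plus OCR text at or above the threshold."""
    kept = []
    for item in items:
        if item.get("source") != "ocr":
            # Native text carries no OCR uncertainty, so keep it as-is.
            kept.append(item)
        elif (item.get("confidence") or 0.0) >= min_confidence:
            kept.append(item)
    return kept


reliable = keep_reliable(items)
print([item["text"] for item in reliable])
```

None of this is possible with Text output, since the confidence and source fields are gone by the time the string is produced.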
Configuration and Implementation
The OutputFormat enum defined in crates/liteparse/src/config.rs declares both variants: OutputFormat::Json and OutputFormat::Text. The CLI dispatcher in crates/liteparse/src/main.rs matches the configuration variant to invoke the appropriate formatter:
match config.output_format {
    OutputFormat::Json => json::format_json(&result.pages)?,
    OutputFormat::Text => text::format_text(&result.pages),
}
All language bindings (Node.js, Python, WASM) expose these variants as string options "json" and "text", returning the formatted result to the respective runtime.
Practical Usage Examples
Command Line Interface
Select your format using the --output-format flag:
liteparse input.pdf --output-format json > result.json
liteparse input.pdf --output-format text > result.txt
Node.js
Configure the output format during parser initialization:
import { LiteParse } from "liteparse";
const jsonParser = new LiteParse({ output_format: "json" });
const structured = await jsonParser.parse("sample.pdf"); // JSON string
const textParser = new LiteParse({ output_format: "text" });
const plain = await textParser.parse("sample.pdf"); // Plain text
Python
The Python bindings follow the same configuration pattern:
from liteparse import LiteParse
json_result = LiteParse(output_format="json").parse("sample.pdf")
text_result = LiteParse(output_format="text").parse("sample.pdf")
WebAssembly (Browser)
In browser environments using the WASM build:
import { LiteParse } from "liteparse-wasm";
const parser = new LiteParse({ output_format: "json" });
const result = await parser.parse(file); // Returns JSON string
Summary
- JSON output in LiteParse provides machine-readable structured data with full metadata hierarchy (ParsedResult → pages → items), enabling spatial analysis and OCR quality auditing.
- Text output produces a flattened plain-text string by joining extracted text with newline characters, optimized for human reading and simple command-line pipelines.
- Both formats are controlled through the OutputFormat enum in crates/liteparse/src/config.rs and dispatched from crates/liteparse/src/main.rs.
- The JSON formatter lives in crates/liteparse/src/output/json.rs and uses serde_json, while the text formatter resides in crates/liteparse/src/output/text.rs.
Frequently Asked Questions
When should I use JSON over Text output in LiteParse?
Use JSON when your application needs to perform spatial analysis, filter low-confidence OCR results, or render visual overlays using bounding box coordinates. Use Text for simple content extraction, full-text search, or when piping results to other command-line tools like grep or awk.
Does Text output preserve document layout?
Text output preserves natural reading order by inserting line breaks between text items and pages, but it discards spatial metadata. To reconstruct the original visual layout or extract tables with positional awareness, use the JSON format, which retains bbox coordinates for every text fragment.
Can I convert between JSON and Text formats after parsing?
LiteParse does not provide a built-in conversion utility between formats. Since Text output is a lossy representation that discards metadata, you cannot reconstruct the full JSON structure from a Text file. If you require both representations, parse the document once with output_format: "json" and extract the text fields programmatically from the resulting structure.
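A rough sketch of that extraction step in Python follows. It assumes the JSON string has top-level `pages`, each with `items` carrying a `text` field. Those key names and the `json_to_text` helper are hypothetical, and the result may not match the built-in Text output byte for byte.

```python
import json


def json_to_text(json_output):
    """Join every item's text with newlines, page by page."""
    document = json.loads(json_output)
    lines = []
    for page in document.get("pages", []):
        for item in page.get("items", []):
            # Items without a text field contribute an empty line.
            lines.append(item.get("text", ""))
    return "\n".join(lines)


# Stand-in for the string returned by a JSON-configured parser.
sample = json.dumps(
    {
        "pages": [
            {"items": [{"text": "Hello"}, {"text": "world"}]},
            {"items": [{"text": "Page 2"}]},
        ]
    }
)
plain = json_to_text(sample)
print(plain)
```

Parsing once and deriving the text this way avoids running the PDF parsing and OCR stages a second time just to get a plain-text view.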
Which output format has better performance?
Text output is marginally faster because it avoids JSON serialization overhead and produces smaller payloads. However, both formats process the same underlying ParsedPage data structures, so the performance difference is negligible compared to the initial PDF parsing and OCR operations. Choose based on data requirements rather than performance constraints.