Understanding the TextItem Confidence Field and OCR Source Tracking in LiteParse
The confidence field in LiteParse's TextItem struct serves as both a reliability metric and a source indicator. None denotes native PDF text extraction, while Some(f32) values from 0.0 to 1.0 represent OCR confidence scores. The final JSON output defaults native items to 1.0 and preserves OCR-specific scores.
LiteParse, the Rust-based PDF parsing library from run-llama/liteparse, extracts text as structured TextItem objects that include provenance metadata. The confidence field within each TextItem functions as a dual-purpose discriminator, allowing developers to distinguish between digitally embedded PDF text and optically recognized characters while evaluating extraction reliability. This field implements Option<f32> semantics where the presence or absence of a value directly indicates the text extraction source.
How the Confidence Field Represents Text Provenance
The confidence field defined in crates/liteparse/src/types.rs uses Rust's Option<f32> type to encode origin information. When confidence is None, the text originated directly from the PDF's native text layer, extracted without optical character recognition. Conversely, when confidence contains Some(value) where value ranges between 0.0 and 1.0, the text was produced by an OCR engine, with the floating-point number representing the engine's certainty about the character recognition accuracy.
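The presence-or-absence rule can be illustrated with a minimal Python sketch. The TextItem dataclass and the describe_source helper below are hypothetical stand-ins written for this article, not part of the LiteParse API; Optional[float] plays the role of Rust's Option<f32>.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextItem:
    """Hypothetical Python stand-in for the Rust TextItem struct."""
    text: str
    # None mirrors Rust's Option::None; a float mirrors Some(f32).
    confidence: Optional[float] = None


def describe_source(item: TextItem) -> str:
    """Classify an item purely by whether a confidence value is present."""
    # Compare against None explicitly: a score of 0.0 is falsy in Python
    # but still means the item came from OCR.
    if item.confidence is None:
        return "native"
    return "ocr"


native = TextItem(text="Invoice total")
scanned = TextItem(text="Inv0ice total", confidence=0.87)
print(describe_source(native), describe_source(scanned))
```

The explicit None check matters because a valid OCR score of 0.0 would otherwise be mistaken for a missing value.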
Source Code Implementation Details
Native PDF Extraction
In crates/liteparse/src/extract.rs, the native text extraction pipeline creates TextItem instances with confidence: None to indicate direct PDF text retrieval. At line 806, the implementation explicitly sets this field to None for all native extractions:
    TextItem {
        text: …,
        x: …,
        y: …,
        width: …,
        height: …,
        font_size: None,
        confidence: None, // ← native PDF text
    }
OCR Engine Processing
OCR implementations in crates/liteparse/src/ocr/tesseract.rs and crates/liteparse/src/ocr/http_simple.rs normalize confidence scores from 0-100 scales to 0.0-1.0 floating-point values. The Tesseract implementation (lines 93-104) divides the raw confidence by 100.0 before constructing the TextItem:
    TextItem {
        …,
        confidence: Some(conf), // ← OCR-derived confidence (0.0-1.0)
    }
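The scaling step is simple enough to sketch in Python. The function name normalize_confidence is hypothetical, and the clamping to the 0.0-1.0 range is an assumption added here for safety; the source only describes dividing the raw score by 100.0.

```python
def normalize_confidence(raw_score: float) -> float:
    """Map a 0-100 engine score onto the 0.0-1.0 range."""
    scaled = raw_score / 100.0
    # Clamping is an assumption of this sketch, not something the
    # article attributes to LiteParse.
    return max(0.0, min(1.0, scaled))


for raw in (0, 87, 100):
    print(raw, normalize_confidence(raw))
```

Keeping every engine on the same 0.0-1.0 scale lets later stages compare scores without knowing which OCR backend produced them.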
Merge Logic and Confidence Filtering
The merge module at crates/liteparse/src/ocr_merge.rs preserves OCR confidence values during the reconciliation of native and OCR text layers. At line 131, the system filters out OCR results with confidence scores less than or equal to 0.1, so only OCR results scoring above that floor survive the merge process while native items remain unaltered. The threshold removes near-certain misrecognitions rather than guaranteeing high confidence in what remains.
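As a rough illustration of the threshold step only, the following Python sketch drops OCR items at or below 0.1 and leaves native items alone. The names MIN_OCR_CONFIDENCE, keep_after_merge, and merge_layers are hypothetical, and the real merge presumably also reconciles the two layers in ways the article does not describe and this sketch does not attempt.

```python
from typing import List, Optional, TypedDict


class Item(TypedDict):
    """Hypothetical minimal shape for a text item in this sketch."""
    text: str
    confidence: Optional[float]


MIN_OCR_CONFIDENCE = 0.1  # threshold quoted in the article


def keep_after_merge(item: Item) -> bool:
    """Return True if the item should survive the threshold filter."""
    conf = item["confidence"]
    if conf is None:
        return True  # native text is never filtered by confidence
    # Strictly greater than: a score of exactly 0.1 is dropped.
    return conf > MIN_OCR_CONFIDENCE


def merge_layers(native_items: List[Item], ocr_items: List[Item]) -> List[Item]:
    """Concatenate native items with the OCR items that pass the filter."""
    return native_items + [it for it in ocr_items if keep_after_merge(it)]
```

Note that the boundary is exclusive: an OCR item scoring exactly 0.1 is removed, matching the "less than or equal to" wording above.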
JSON Serialization Behavior
When serializing to JSON in crates/liteparse/src/output/json.rs (line 54), the output layer handles the None case by defaulting native text confidence to 1.0, creating a consistent numeric interface for downstream consumers:
    confidence: item.confidence.or(Some(1.0)),
This transformation ensures that JSON consumers always receive a float value: 1.0 for native PDF text and the original 0.0-1.0 score for OCR text.
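The same defaulting behaviour can be mimicked in Python for illustration. The to_json_item helper is hypothetical and only mirrors the effect of the Rust expression quoted above; it is not LiteParse code.

```python
import json
from typing import Optional


def to_json_item(text: str, confidence: Optional[float]) -> dict:
    """Build a JSON-ready dict, defaulting a missing confidence to 1.0."""
    # Rust's Option::or only substitutes when the value is None. Python's
    # `or` operator would also replace a legitimate 0.0, so test for None.
    value = 1.0 if confidence is None else confidence
    return {"text": text, "confidence": value}


payload = json.dumps([
    to_json_item("native line", None),
    to_json_item("scanned line", 0.62),
])
print(payload)
```

After this step the output no longer carries a null, which is why downstream code sees a number for every item.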
Practical Usage Examples
The confidence field is exposed identically across LiteParse's language bindings, allowing you to implement source-aware processing logic.
Python
    from liteparse import LiteParse

    parser = LiteParse()
    result = parser.parse("sample.pdf")

    for page in result.pages:
        for item in page.text_items:
            print(f"Text: {item.text}")
            # Native items default to 1.0; OCR items keep their 0.0-1.0 score
            print(f"Confidence: {item.confidence}")
Node.js / TypeScript
    import { LiteParse } from "liteparse";

    (async () => {
      const parser = new LiteParse();
      const result = await parser.parse("sample.pdf");
      for (const page of result.pages) {
        for (const ti of page.text_items) {
          console.log(`Text: ${ti.text}`);
          // Native items default to 1.0; OCR items keep their 0.0-1.0 score
          console.log(`Confidence: ${ti.confidence}`);
        }
      }
    })();
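In both bindings, the confidence field corresponds to the Rust core's Option<f32>, typed as float in Python and number | null in TypeScript. Once the JSON serialization default has been applied, native items carry 1.0 instead of an empty value, so a score of exactly 1.0 does not by itself prove that the text is native.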
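One way a downstream consumer might use these scores is to route low-confidence items to manual review. The sketch below works on plain dictionaries shaped like the JSON output described above; the split_for_review function and the 0.6 cut-off are hypothetical choices for illustration, not LiteParse defaults.

```python
import json
from typing import List, Tuple

REVIEW_THRESHOLD = 0.6  # hypothetical cut-off chosen for this example


def split_for_review(items: List[dict],
                     threshold: float = REVIEW_THRESHOLD) -> Tuple[list, list]:
    """Partition items into (trusted, needs_review) by confidence."""
    trusted, needs_review = [], []
    for item in items:
        conf = item.get("confidence")
        if conf is None:
            conf = 1.0  # treat a missing score like native text
        (trusted if conf >= threshold else needs_review).append(item)
    return trusted, needs_review


sample = json.loads(
    '[{"text": "Total", "confidence": 1.0},'
    ' {"text": "T0tal", "confidence": 0.35}]'
)
trusted, needs_review = split_for_review(sample)
print(len(trusted), len(needs_review))
```

The right threshold depends on the documents and the OCR engine, so it is best treated as a tunable parameter rather than a fixed constant.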
Summary
- The confidence field uses Option<f32>, where None indicates native PDF text and Some(f32) indicates OCR origin.
- Native extraction in extract.rs sets confidence: None, while OCR engines in tesseract.rs and http_simple.rs populate normalized 0.0-1.0 scores.
- The merge logic in ocr_merge.rs preserves confidence values and filters out OCR results at or below 0.1 confidence.
- JSON output in json.rs defaults native text confidence to 1.0 while preserving original OCR scores.
- This architecture enables downstream applications to differentiate between digitally embedded text and optically recognized characters while assessing extraction reliability.
Frequently Asked Questions
What does a confidence value of None mean in LiteParse?
A None value in the Rust core (appearing as null in TypeScript or 1.0 after JSON serialization) indicates that the text was extracted directly from the PDF's native text layer without optical character recognition. This occurs when crates/liteparse/src/extract.rs creates TextItem instances during native PDF parsing.
How is OCR confidence normalized in LiteParse?
OCR engines typically return confidence scores on a 0-100 scale. LiteParse normalizes these to 0.0-1.0 floating-point values in crates/liteparse/src/ocr/tesseract.rs and crates/liteparse/src/ocr/http_simple.rs by dividing the raw score by 100.0 before assigning it to the TextItem struct.
Why do native PDF text items receive a confidence of 1.0 in the JSON output?
The JSON serialization layer in crates/liteparse/src/output/json.rs explicitly defaults None confidence values to Some(1.0) using item.confidence.or(Some(1.0)). This design choice provides a consistent numeric interface for API consumers and treats native text extraction as fully trusted. One consequence is that an OCR item scoring exactly 1.0 cannot be told apart from native text in the JSON output.
How does LiteParse handle low-confidence OCR results?
During the merge phase implemented in crates/liteparse/src/ocr_merge.rs at line 131, LiteParse filters out OCR results with confidence scores less than or equal to 0.1. This threshold removes the least reliable recognitions before they reach the final text layer, which helps prevent garbage characters from polluting the extraction results. Items scoring above 0.1 are kept with their original confidence, so consumers can still apply a stricter cut-off of their own.