# Understanding the TextItem Confidence Field and OCR Source Tracking in LiteParse

> Unlock LiteParse's TextItem confidence field and OCR source tracking. Understand how native text extraction and OCR scores are represented and interpreted for reliable data.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: deep-dive
- Published: 2026-05-31

---

**The `confidence` field in LiteParse's `TextItem` struct serves as both a reliability metric and a source indicator: `None` denotes native PDF text extraction, while `Some(f32)` values from 0.0 to 1.0 represent OCR confidence scores. The final JSON output defaults native items to 1.0 and preserves OCR-specific scores.**

LiteParse, the Rust-based PDF parsing library hosted at run-llama/liteparse, extracts text as structured `TextItem` objects that include provenance metadata. The **confidence field** within each `TextItem` serves two purposes: it lets developers distinguish digitally embedded PDF text from optically recognized characters, and it reports how reliable an OCR extraction is. The field is typed as `Option<f32>`, so the presence or absence of a value directly indicates the extraction source.

## How the Confidence Field Represents Text Provenance

The `confidence` field defined in [`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs) uses Rust's `Option<f32>` type to encode origin information. When `confidence` is `None`, the text originated directly from the PDF's native text layer, extracted without optical character recognition. Conversely, when `confidence` contains `Some(value)` where value ranges between 0.0 and 1.0, the text was produced by an OCR engine, with the floating-point number representing the engine's certainty about the character recognition accuracy.
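
The `None` versus `Some(value)` distinction can be made explicit with a small pattern match. The sketch below uses a simplified stand-in for `TextItem` (only two fields) and a hypothetical `text_source` helper and `TextSource` enum; none of these names are taken from the LiteParse API.

```rust
/// Simplified stand-in for LiteParse's `TextItem` (illustrative only;
/// the real struct carries position and font fields as well).
#[derive(Debug, Clone, PartialEq)]
pub struct TextItem {
    pub text: String,
    pub confidence: Option<f32>,
}

/// Hypothetical tag describing where a piece of text came from.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TextSource {
    /// `confidence` was `None`: text read from the PDF's native text layer.
    Native,
    /// `confidence` was `Some(score)`: text produced by an OCR engine.
    Ocr { score: f32 },
}

/// Hypothetical helper: derive the source purely from the `Option<f32>`.
pub fn text_source(item: &TextItem) -> TextSource {
    match item.confidence {
        None => TextSource::Native,
        Some(score) => TextSource::Ocr { score },
    }
}
```

Matching on the `Option` keeps the provenance check exhaustive: the compiler rejects code that forgets either the native or the OCR case.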

## Source Code Implementation Details

### Native PDF Extraction

In [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs), the native text extraction pipeline creates `TextItem` instances with `confidence: None` to indicate direct PDF text retrieval. At line 806, the implementation explicitly sets this field to `None` for all native extractions:

```rust
TextItem {
    text: …,
    x: …,
    y: …,
    width: …,
    height: …,
    font_size: None,
    confidence: None,               // ← native PDF text
}
```

### OCR Engine Processing

OCR implementations in [`crates/liteparse/src/ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs) and [`crates/liteparse/src/ocr/http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/http_simple.rs) normalize confidence scores from 0-100 scales to 0.0-1.0 floating-point values. The Tesseract implementation (lines 93-104) divides the raw confidence by 100.0 before constructing the `TextItem`:

```rust
TextItem {
    …,
    confidence: Some(conf),          // ← OCR-derived confidence (0.0-1.0)
}
```
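
The division by 100.0 described above can be sketched as a standalone function. `normalize_confidence` is a hypothetical name, and the clamp to the 0.0-1.0 range is a defensive assumption of this sketch rather than documented LiteParse behavior.

```rust
/// Hypothetical helper: convert a raw 0-100 OCR confidence into the
/// 0.0-1.0 range stored in `TextItem::confidence`.
///
/// The clamp is an assumption of this sketch: it keeps out-of-range
/// engine output (for example a negative sentinel) inside 0.0-1.0.
pub fn normalize_confidence(raw: f32) -> f32 {
    (raw / 100.0).clamp(0.0, 1.0)
}
```

The normalized value would then be wrapped as `Some(normalize_confidence(raw))` when building an OCR-derived item.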

### Merge Logic and Confidence Filtering

The merge module at [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) preserves OCR confidence values while reconciling the native and OCR text layers. At line 131, it discards OCR results whose confidence is less than or equal to 0.1. This low threshold removes only the least reliable recognition output; native items remain unaltered.
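
A minimal sketch of that filter, assuming a single list that mixes native and OCR items: the `TextItem` stand-in, the `MIN_OCR_CONFIDENCE` constant, and the `drop_low_confidence_ocr` function are hypothetical names, and only the 0.1 cutoff comes from the description above.

```rust
/// Simplified stand-in for LiteParse's `TextItem` (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub struct TextItem {
    pub text: String,
    pub confidence: Option<f32>,
}

/// Cutoff described in the article: OCR items at or below this are dropped.
pub const MIN_OCR_CONFIDENCE: f32 = 0.1;

/// Hypothetical helper: keep every native item and only those OCR items
/// whose confidence is strictly greater than the cutoff.
pub fn drop_low_confidence_ocr(items: Vec<TextItem>) -> Vec<TextItem> {
    items
        .into_iter()
        .filter(|item| match item.confidence {
            None => true, // native text is never filtered
            Some(score) => score > MIN_OCR_CONFIDENCE,
        })
        .collect()
}
```

Note the strict comparison: an OCR item scoring exactly 0.1 is removed, which matches the "less than or equal to" wording.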

### JSON Serialization Behavior

When serializing to JSON in [`crates/liteparse/src/output/json.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/output/json.rs) (line 54), the output layer handles the `None` case by defaulting native text confidence to 1.0, creating a consistent numeric interface for downstream consumers:

```rust
confidence: item.confidence.or(Some(1.0)),
```

This transformation ensures that JSON consumers always receive a float value: 1.0 for native PDF text and the original 0.0-1.0 score for OCR text.
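
The effect of `Option::or` on each case can be checked in isolation. The `serialized_confidence` wrapper below is a hypothetical name used only to make the behavior testable; the expression inside it is the one quoted from `json.rs`.

```rust
/// Hypothetical wrapper around the defaulting expression quoted above:
/// `None` (native text) becomes `Some(1.0)`; OCR scores pass through.
pub fn serialized_confidence(confidence: Option<f32>) -> Option<f32> {
    confidence.or(Some(1.0))
}
```

One consequence worth keeping in mind: after this step, an OCR item that scored exactly 1.0 is indistinguishable from native text, so code that needs strict provenance may prefer to inspect the `Option` before serialization.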

## Practical Usage Examples

The confidence field is exposed identically across LiteParse's language bindings, allowing you to implement source-aware processing logic.

**Python**

```python
from liteparse import LiteParse

parser = LiteParse()
result = parser.parse("sample.pdf")
for page in result.pages:
    for item in page.text_items:
        print(f"Text: {item.text}")
        print(f"Confidence: {item.confidence}")   # 1.0 → native, <1.0 → OCR
```

**Node.js / TypeScript**

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse();
  const result = await parser.parse("sample.pdf");
  for (const page of result.pages) {
    for (const ti of page.text_items) {
      console.log(`Text: ${ti.text}`);
      console.log(`Confidence: ${ti.confidence}`); // 1.0 = native, <1.0 = OCR
    }
  }
})();
```

In both bindings, the `confidence` field maps directly to the Rust core's `Option<f32>`. It appears as `float` in Python and `number | null` in TypeScript, and JSON serialization then defaults a missing value to 1.0.

## Summary

- The **confidence field** uses `Option<f32>` where `None` indicates native PDF text and `Some(f32)` indicates OCR origin.
- Native extraction in [`extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) sets `confidence: None`, while OCR engines in [`tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs) and [`http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/http_simple.rs) populate normalized 0.0-1.0 scores.
- The merge logic in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) preserves confidence values and filters out OCR results with confidence at or below 0.1.
- JSON output in [`json.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/output/json.rs) defaults native text confidence to 1.0 while preserving original OCR scores.
- This architecture enables downstream applications to differentiate between digitally embedded text and optically recognized characters while assessing extraction reliability.

## Frequently Asked Questions

### What does a confidence value of None mean in LiteParse?

A `None` value in the Rust core (appearing as `null` in TypeScript or `1.0` after JSON serialization) indicates that the text was extracted directly from the PDF's native text layer without optical character recognition. This occurs when [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) creates `TextItem` instances during native PDF parsing.

### How is OCR confidence normalized in LiteParse?

OCR engines typically return confidence scores on a 0-100 scale. LiteParse normalizes these to 0.0-1.0 floating-point values in [`crates/liteparse/src/ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs) and [`crates/liteparse/src/ocr/http_simple.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/http_simple.rs) by dividing the raw score by 100.0 before assigning it to the `TextItem` struct.

### Why do native PDF text items receive a confidence of 1.0 in the JSON output?

The JSON serialization layer in [`crates/liteparse/src/output/json.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/output/json.rs) explicitly defaults `None` confidence values to `Some(1.0)` using `item.confidence.or(Some(1.0))`. This design choice provides a consistent numeric interface for API consumers and treats native text, which involves no recognition step, as fully reliable.

### How does LiteParse handle low-confidence OCR results?

During the merge phase implemented in [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) at line 131, LiteParse filters out OCR results with confidence scores less than or equal to 0.1. This threshold removes the least reliable recognition output, preventing garbage characters from polluting the extraction results, while every OCR item scoring above 0.1 is kept with its original score.