# How to Use LiteParse TESSDATA_PREFIX for Air-Gapped OCR Environments

> Configure LiteParse's TESSDATA_PREFIX for air-gapped OCR. Point to local traineddata files for fully offline text recognition. Enhance your secure OCR workflows.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**Set the `TESSDATA_PREFIX` environment variable or pass the `tessdata_path` configuration option to point LiteParse at a local directory containing Tesseract `*.traineddata` files, enabling fully offline text recognition without internet access.**

LiteParse, the open-source document parsing library from the `run-llama/liteparse` repository, performs optical character recognition through a pluggable **OcrEngine** trait backed by Tesseract. In secure or air-gapped environments, you must point LiteParse at locally stored language model files, through `TESSDATA_PREFIX` or the `tessdata_path` option, rather than relying on the default system path.

## How LiteParse Resolves TESSDATA_PREFIX

According to the source code in [`crates/liteparse/src/ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs), the Tesseract engine resolves the tessdata directory through a strict precedence hierarchy. The `recognize` method implements this logic at lines 60-67:

```rust
let tessdata_path = self
    .tessdata_path
    .clone()
    .or_else(|| std::env::var("TESSDATA_PREFIX").ok());
```

The snippet covers the first two steps; when both are absent, the engine falls back to a default directory. In order, LiteParse checks three sources:

1. **Explicit configuration** – the `tessdata_path` field set programmatically.
2. **Environment variable** – `TESSDATA_PREFIX`.
3. **Built-in default** – provided by `default_tessdata_dir()` in the same file, which resolves to `~/.tesseract-rs/tessdata` on Linux/macOS or a bundled `"tessdata"` folder on other platforms.
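
The precedence above can be mirrored in a few lines of Python. This is only an illustrative sketch of the ordering described in this section, not the library's Rust implementation: the function names `resolve_tessdata_dir` and `default_tessdata_dir` (as written here) and the platform check are assumptions made for the example.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def default_tessdata_dir(platform: str = sys.platform) -> Path:
    """Approximate the built-in default described above (assumed, not the Rust source)."""
    if platform.startswith("linux") or platform == "darwin":
        return Path.home() / ".tesseract-rs" / "tessdata"
    return Path("tessdata")


def resolve_tessdata_dir(
    explicit: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
    platform: str = sys.platform,
) -> Path:
    """Return the directory selected by the three-step precedence."""
    # 1. Explicit configuration wins whenever it is set.
    if explicit is not None:
        return Path(explicit)
    # 2. Otherwise use the TESSDATA_PREFIX environment variable, if present.
    env_value = environ.get("TESSDATA_PREFIX")
    if env_value is not None:
        return Path(env_value)
    # 3. Finally fall back to the platform default.
    return default_tessdata_dir(platform)
```

Calling `resolve_tessdata_dir("/opt/tessdata")` returns the explicit path even when `TESSDATA_PREFIX` is set, which is the behavior an air-gapped deployment usually wants to rely on.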

The `LiteParseConfig` struct in [`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs) documents this fallback behavior in its field comments at lines 12-14.

## Configuration Methods for Air-Gapped OCR

### Environment Variable (TESSDATA_PREFIX)

The simplest method for air-gapped deployments is exporting `TESSDATA_PREFIX` with the absolute path of your local tessdata directory. LiteParse reads this variable when it resolves the tessdata directory, provided no explicit `tessdata_path` is set.

### Explicit Configuration Option

For programmatic control, supply the language-specific field when constructing the parser:

- **Rust**: `tessdata_path` in `LiteParseConfig`
- **Node.js/TypeScript**: `tessdataPath` in the constructor options
- **Python**: `tessdata_path` parameter in `LiteParse()`

These explicit settings override any environment variable.

### CLI Flag Override

When using the LiteParse CLI, the `--tessdata-path` argument takes highest precedence, as implemented in [`crates/liteparse/src/main.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/main.rs) at lines 61-64. This flag directly sets the configuration field, superseding `TESSDATA_PREFIX`.

## Setting Up Local Tesseract Data for Offline Use

1. On a machine with internet access, download the required `*.traineddata` files for your target languages from the official Tesseract GitHub repository.
2. Transfer these files to your air-gapped environment and place them in a directory (e.g., `/opt/tessdata`).
3. Configure LiteParse to reference this directory using one of the methods documented in the following examples.
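
Because the files cross an air gap, it can help to confirm that they arrived intact before pointing LiteParse at them. The sketch below is an optional extra and not part of LiteParse: the helper names `build_manifest` and `verify_manifest` are hypothetical, and the approach is a plain SHA-256 comparison using the Python standard library.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large models are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tessdata_dir: str) -> Dict[str, str]:
    """Run on the connected machine: map each traineddata file name to its digest."""
    files = sorted(Path(tessdata_dir).glob("*.traineddata"))
    return {path.name: sha256_of(path) for path in files}


def verify_manifest(tessdata_dir: str, manifest: Dict[str, str]) -> List[str]:
    """Run on the air-gapped machine: list files that are missing or altered."""
    problems = []
    for name, expected in manifest.items():
        candidate = Path(tessdata_dir) / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(name)
    return problems
```

A typical flow would be to serialize the manifest (for example as JSON) alongside the files, then call `verify_manifest("/opt/tessdata", manifest)` after the transfer; an empty list means every file matches.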

## Language-Specific Implementation Examples

### Command Line Interface

Set the environment variable for the session:

```bash
export TESSDATA_PREFIX=/opt/tessdata
lit parse scanned.pdf --ocr-language eng
```

Use a one-time override:

```bash
TESSDATA_PREFIX=/opt/tessdata lit parse scanned.pdf
```

Explicit flag usage (highest precedence):

```bash
lit parse scanned.pdf --tessdata-path /opt/tessdata --ocr-language eng
```

### Python

```python
from liteparse import LiteParse

parser = LiteParse(
    ocr_language="eng",
    tessdata_path="/opt/tessdata",  # Explicit path overrides TESSDATA_PREFIX
)

result = parser.parse("scanned.pdf")
print(result.text)
```

### Node.js / TypeScript

```typescript
import LiteParse from "@llamaindex/liteparse";

const parser = new LiteParse({
  ocrLanguage: "eng",
  tessdataPath: "/opt/tessdata",  // Overrides environment variable
});

const result = await parser.parse("scanned.pdf");
console.log(result.text);
```

### Rust

```rust
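use liteparse::{LiteParse, LiteParseConfig};

// `?` needs a function that returns a Result, so the example is wrapped in main.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // An explicit path takes precedence over TESSDATA_PREFIX.
    let config = LiteParseConfig {
        tessdata_path: Some("/opt/tessdata".into()),
        ..Default::default()
    };

    let parser = LiteParse::new(config);
    let res = parser.parse_file("scanned.pdf")?;
    println!("{}", res.text);
    Ok(())
}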
```

## Summary

- **LiteParse TESSDATA_PREFIX** resolution follows strict precedence: explicit configuration first, then the `TESSDATA_PREFIX` environment variable, then the default directory (`~/.tesseract-rs/tessdata` or bundled `tessdata`).
- The resolution logic lives in [`crates/liteparse/src/ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs), while the configuration struct is defined in [`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs).
- Use `--tessdata-path` in CLI mode, `tessdata_path` in Python or Rust, or `tessdataPath` in Node.js to override environment settings.
- For air-gapped deployments, ensure your local directory contains the required `*.traineddata` language files before running OCR operations.

## Frequently Asked Questions

### What file format belongs in the TESSDATA_PREFIX directory?

The directory must contain Tesseract `*.traineddata` files for each language you intend to recognize. These binary files contain trained neural network models for specific languages (e.g., `eng.traineddata` for English, `deu.traineddata` for German).

### Does LiteParse automatically download language data if TESSDATA_PREFIX is unset?

No. According to the implementation in [`crates/liteparse/src/ocr/tesseract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr/tesseract.rs), if neither the configuration option nor `TESSDATA_PREFIX` is provided, LiteParse falls back to the default directory. It does not perform automatic downloads; the files must exist locally or the OCR operation will fail.

### Can I specify multiple languages when using TESSDATA_PREFIX in air-gapped environments?

Yes. Ensure your local tessdata directory contains the corresponding `.traineddata` files for each language (e.g., `eng.traineddata`, `fra.traineddata`, `spa.traineddata`), then pass multiple language codes to LiteParse. The engine loads all specified models from the local path defined by `TESSDATA_PREFIX`.
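
A quick pre-flight check can catch a missing model before OCR starts. The following is a minimal sketch that relies only on the `<code>.traineddata` naming convention described above; the helper name `missing_languages` is hypothetical and not part of LiteParse.

```python
from pathlib import Path
from typing import Iterable, List


def missing_languages(tessdata_dir: str, languages: Iterable[str]) -> List[str]:
    """Return the language codes that have no matching traineddata file."""
    root = Path(tessdata_dir)
    return [
        code
        for code in languages
        if not (root / f"{code}.traineddata").is_file()
    ]
```

For example, `missing_languages("/opt/tessdata", ["eng", "fra", "spa"])` returns an empty list when all three models are present, and otherwise names the codes whose files still need to be transferred.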

### How do I verify LiteParse is using my local tessdata directory?

Run a test document through the parser with `TESSDATA_PREFIX` set and watch for initialization errors. If Tesseract cannot locate the language data, the OCR operation fails. Successful text extraction shows that the engine found `*.traineddata` files, but it is conclusive only if the default directory does not also contain them. For a stricter check, point `TESSDATA_PREFIX` at an empty directory and confirm that OCR fails.