How to Configure LiteParse's `TESSDATA_PREFIX` for Offline OCR Environments

Set the TESSDATA_PREFIX environment variable to a local directory containing Tesseract .traineddata files, or pass the tessdata_path configuration option, to enable OCR in air‑gapped environments without network access.

LiteParse, the document parsing library from run-llama/liteparse, embeds Tesseract for optical character recognition. By default, Tesseract attempts to download language models from the internet, which fails in offline or secure environments. Configuring the TESSDATA_PREFIX environment variable, or the equivalent tessdata_path option, ensures the engine loads language data from a local filesystem path.

How TESSDATA_PREFIX Resolution Works in LiteParse

LiteParse resolves the tessdata directory through a three‑step hierarchy implemented in crates/liteparse/src/ocr/tesseract.rs:

  1. Explicit configuration – If you supply a tessdata_path (via CLI flag, constructor option, or config struct), that path is used immediately.
  2. Environment variable – If no explicit path is set, LiteParse reads std::env::var("TESSDATA_PREFIX").
  3. Fallback default – When neither source is present, the library calls default_tessdata_dir() from the tesseract-rs crate, which resolves to ~/.tesseract-rs/tessdata on Linux or ~/Library/Application Support/tesseract-rs/tessdata on macOS.

The chosen directory is passed directly to TesseractAPI::init(path, language). If the directory lacks the requested language’s .traineddata file, Tesseract returns an error.
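
The three-step lookup above can be modelled in a few lines of Python. This is an illustrative sketch, not the code in tesseract.rs: resolve_tessdata_dir is a hypothetical helper, the fallback paths are the Linux and macOS defaults quoted above, and treating an empty TESSDATA_PREFIX as unset is an assumption of the sketch.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_tessdata_dir(
    explicit_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
    platform: str = sys.platform,
) -> Path:
    """Hypothetical model of the three-step lookup; not LiteParse's actual code."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # 1. An explicit tessdata_path always wins.
    if explicit_path:
        return Path(explicit_path)

    # 2. Otherwise use TESSDATA_PREFIX.
    #    Assumption: an empty value is treated the same as an unset variable.
    from_env = env.get("TESSDATA_PREFIX")
    if from_env:
        return Path(from_env)

    # 3. Finally fall back to the per-user default directory quoted above.
    if platform == "darwin":
        return home / "Library" / "Application Support" / "tesseract-rs" / "tessdata"
    # Assumption: every non-macOS platform is treated like Linux here.
    return home / ".tesseract-rs" / "tessdata"
```

Passing env and home explicitly keeps the function easy to test. In normal use you would call resolve_tessdata_dir() with no arguments to see which directory the described order would select on the current machine.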

Preparing Your Local Tessdata Directory

Before running offline OCR, populate your local directory with the required language packs:

  • Download .traineddata files from the official Tesseract tessdata repository or your organization’s approved mirror.
  • Place files in a directory such as /opt/tessdata or C:\tessdata.
  • Verify the directory contains at minimum eng.traineddata for English, or the specific language codes you intend to use (e.g., fra.traineddata for French).

This setup keeps OCR offline (no network calls are needed to fetch language data) and makes versioning predictable (you control exactly which model versions are active).
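
A small pre-flight check can catch a missing language pack before an OCR job starts. The sketch below uses only the Python standard library and does not call LiteParse. The helper name missing_traineddata is hypothetical, and treating a zero-byte file as missing is an assumption meant to catch interrupted copies.

```python
from pathlib import Path
from typing import Iterable, List


def missing_traineddata(tessdata_dir: str, languages: Iterable[str]) -> List[str]:
    """Return the language codes whose .traineddata file is absent or empty."""
    root = Path(tessdata_dir)
    missing = []
    for lang in languages:
        model = root / f"{lang}.traineddata"
        # Assumption: an empty file means an interrupted copy, so report it as missing.
        if not model.is_file() or model.stat().st_size == 0:
            missing.append(lang)
    return missing
```

For example, missing_traineddata("/opt/tessdata", ["eng", "fra"]) returns an empty list when both eng.traineddata and fra.traineddata are present and non-empty. A deployment script could abort whenever the returned list is not empty.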

Configuration Methods by Interface

Shell Environment and CLI

Set the environment variable in your shell before invoking the lit CLI:

export TESSDATA_PREFIX=/opt/tessdata
lit parse document.pdf --ocr-enabled

To override the environment variable for a single invocation, use the --tessdata-path flag defined in crates/liteparse/src/main.rs:

lit parse document.pdf --tessdata-path /custom/tessdata --ocr-enabled

The CLI flag takes precedence over TESSDATA_PREFIX.
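
When the CLI is driven from a script, such as a batch job, the same two mechanisms can be applied per invocation. The sketch below builds the argument list and child environment for the commands shown above. build_lit_invocation and run_lit are hypothetical helpers, and the only flags used are the ones that appear in this article.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional, Tuple


def build_lit_invocation(
    document: str,
    tessdata_prefix: Optional[str] = None,
    tessdata_path: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for one `lit parse` run (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    if tessdata_prefix is not None:
        # Like `export TESSDATA_PREFIX=...`, but scoped to the child process.
        env["TESSDATA_PREFIX"] = tessdata_prefix

    argv = ["lit", "parse", document]
    if tessdata_path is not None:
        # Per the article, this flag overrides TESSDATA_PREFIX for this run.
        argv += ["--tessdata-path", tessdata_path]
    argv.append("--ocr-enabled")
    return argv, env


def run_lit(document: str, **kwargs) -> int:
    """Run the CLI and return its exit code; requires `lit` on PATH."""
    argv, env = build_lit_invocation(document, **kwargs)
    return subprocess.run(argv, env=env, check=False).returncode
```

Keeping the command construction separate from the subprocess call lets you inspect or log the exact invocation without the lit binary being installed.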

Node.js and TypeScript

When constructing the parser in Node.js or TypeScript, pass tessdataPath in the options object:

import { LiteParse } from "liteparse";

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  tessdataPath: "/opt/tessdata"
});

const result = await parser.parse("document.pdf");
console.log(result.text);

Python

In Python, use the tessdata_path parameter when instantiating LiteParse:

from liteparse import LiteParse

parser = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",
    tessdata_path="/opt/tessdata"
)

result = parser.parse("document.pdf")
print(result.text)

Rust Core

For Rust applications using liteparse directly, set the tessdata_path field on LiteParseConfig from crates/liteparse/src/config.rs:

// Assumes an async context whose return type supports the `?` operator.
use liteparse::{LiteParse, LiteParseConfig};

let config = LiteParseConfig {
    tessdata_path: Some("/opt/tessdata".to_string()),
    ..Default::default()
};

let parser = LiteParse::new(config);
let result = parser.parse("document.pdf").await?;
println!("{}", result.text);

If you omit tessdata_path, the library relies on the TESSDATA_PREFIX environment variable or the built‑in default.

Verifying Offline Operation

To confirm that OCR works without network access:

  1. Disconnect from the network or run in an isolated container.
  2. Ensure your TESSDATA_PREFIX directory contains the target .traineddata files.
  3. Execute OCR on a scanned PDF.

If the operation succeeds without timeout errors or download attempts, LiteParse is correctly using the offline tessdata store.

Summary

  • Resolution order: Explicit tessdata_path > TESSDATA_PREFIX environment variable > default user directory (~/.tesseract-rs/tessdata or macOS equivalent).
  • Cross‑binding consistency: The same environment variable works across Rust, Python, Node.js, and CLI interfaces because all bindings delegate to the core logic in tesseract.rs.
  • Offline requirement: The specified directory must contain valid .traineddata files for your target languages; otherwise TesseractAPI::init returns an initialization error.
  • Override capability: CLI flags and constructor options always take precedence over environment variables.

Frequently Asked Questions

What happens if I don't set TESSDATA_PREFIX?

If neither TESSDATA_PREFIX nor an explicit tessdata_path is provided, LiteParse falls back to the default location provided by the tesseract-rs crate. On Linux this is typically ~/.tesseract-rs/tessdata, and on macOS it is ~/Library/Application Support/tesseract-rs/tessdata. If this directory lacks the required language data, OCR will fail.

Where can I download the .traineddata files for offline use?

Download language models from the official Tesseract tessdata repository hosted on GitHub, or from your organization’s internal artifact store. Place the files in the directory referenced by TESSDATA_PREFIX before running LiteParse.

Does the CLI --tessdata-path flag override the environment variable?

Yes. According to the implementation in crates/liteparse/src/main.rs, the --tessdata-path argument takes precedence over the TESSDATA_PREFIX environment variable. The flag is one way of supplying an explicit tessdata_path, so the hierarchy is: explicit tessdata_path (CLI flag, constructor option, or config struct) > environment variable > default directory.

Is TESSDATA_PREFIX supported on Windows?

Yes. LiteParse reads the environment variable using std::env::var, which works across all platforms. On Windows, run set TESSDATA_PREFIX=C:\path\to\tessdata in Command Prompt, or $env:TESSDATA_PREFIX = "C:\path\to\tessdata" in PowerShell, and then start your LiteParse application from the same session.
