How to Configure LiteParse's `TESSDATA_PREFIX` for Offline OCR Environments
Set the TESSDATA_PREFIX environment variable to a local directory containing Tesseract .traineddata files, or pass the tessdata_path configuration option, to enable OCR in air‑gapped environments without network access.
LiteParse, the document parsing library from run‑llama/liteparse, embeds Tesseract for optical character recognition. By default, Tesseract attempts to download language models from the internet, which fails in offline or secure environments. Configuring the TESSDATA_PREFIX environment variable—or the equivalent tessdata_path option—ensures the engine loads language data from a local filesystem path.
How TESSDATA_PREFIX Resolution Works in LiteParse
LiteParse resolves the tessdata directory through a three‑step hierarchy implemented in crates/liteparse/src/ocr/tesseract.rs:
- Explicit configuration – If you supply a tessdata_path (via CLI flag, constructor option, or config struct), that path is used immediately.
- Environment variable – If no explicit path is set, LiteParse reads std::env::var("TESSDATA_PREFIX").
- Fallback default – When neither source is present, the library calls default_tessdata_dir() from the tesseract‑rs crate, which resolves to ~/.tesseract‑rs/tessdata on Linux or ~/Library/Application Support/tesseract‑rs/tessdata on macOS.
The chosen directory is passed directly to TesseractAPI::init(path, language). If the directory lacks the requested language’s .traineddata file, Tesseract returns an error.
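The lookup order described above can be modeled in a few lines of Python. This is only an illustrative sketch of the precedence rules, not the code in tesseract.rs: the function name resolve_tessdata_dir and its parameters are hypothetical, the default shown is the Linux location quoted above, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tessdata_dir(
    explicit_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    default_dir: Optional[Path] = None,
) -> Path:
    """Model the explicit > TESSDATA_PREFIX > default lookup order."""
    env = os.environ if env is None else env

    # 1. An explicit tessdata_path always wins.
    if explicit_path:
        return Path(explicit_path)

    # 2. Otherwise use the environment variable.
    #    Assumption of this sketch: an empty value counts as unset.
    from_env = env.get("TESSDATA_PREFIX")
    if from_env:
        return Path(from_env)

    # 3. Finally fall back to the default directory
    #    (Linux location as quoted in the article).
    if default_dir is None:
        default_dir = Path.home() / ".tesseract-rs" / "tessdata"
    return default_dir
```

Passing env as a plain dictionary makes the precedence easy to exercise in a unit test without touching the real process environment.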
Preparing Your Local Tessdata Directory
Before running offline OCR, populate your local directory with the required language packs:
- Download .traineddata files from the official Tesseract tessdata repository or your organization’s approved mirror.
- Place files in a directory such as /opt/tessdata or C:\tessdata.
- Verify the directory contains at minimum eng.traineddata for English, or the specific language codes you intend to use (e.g., fra.traineddata for French).
This setup guarantees offline safety (no network calls during OCR) and predictable versioning (you control exactly which model versions are active).
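The verification step can be automated with a small pre-flight check. The sketch below is a hypothetical helper (missing_traineddata is not part of LiteParse) that only inspects the filesystem; it assumes the usual naming scheme of one <language code>.traineddata file per language and does not validate file contents.

```python
from pathlib import Path
from typing import Iterable, List


def missing_traineddata(tessdata_dir: str, languages: Iterable[str]) -> List[str]:
    """Return the language codes whose .traineddata file is absent or empty."""
    root = Path(tessdata_dir)
    missing = []
    for lang in languages:
        model = root / f"{lang}.traineddata"
        # A zero-byte file usually points to an interrupted copy or download.
        if not model.is_file() or model.stat().st_size == 0:
            missing.append(lang)
    return missing
```

For example, missing_traineddata("/opt/tessdata", ["eng", "fra"]) returns an empty list when both models are in place, so a deployment script can fail early instead of waiting for an OCR error.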
Configuration Methods by Interface
Shell Environment and CLI
Set the environment variable in your shell before invoking the lit CLI:
export TESSDATA_PREFIX=/opt/tessdata
lit parse document.pdf --ocr-enabled
To override the environment variable for a single invocation, use the --tessdata-path flag defined in crates/liteparse/src/main.rs:
lit parse document.pdf --tessdata-path /custom/tessdata --ocr-enabled
The CLI flag takes precedence over TESSDATA_PREFIX.
Node.js and TypeScript
When constructing the parser in Node.js or TypeScript, pass tessdataPath in the options object:
import { LiteParse } from "liteparse";

const parser = new LiteParse({
  ocrEnabled: true,
  ocrLanguage: "fra",
  tessdataPath: "/opt/tessdata"
});

const result = await parser.parse("document.pdf");
console.log(result.text);
Python
In Python, use the tessdata_path parameter when instantiating LiteParse:
from liteparse import LiteParse

parser = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",
    tessdata_path="/opt/tessdata"
)

result = parser.parse("document.pdf")
print(result.text)
Rust Core
For Rust applications using liteparse directly, set the tessdata_path field on LiteParseConfig from crates/liteparse/src/config.rs:
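use liteparse::{LiteParse, LiteParseConfig};

// Point OCR at the local tessdata directory; all other fields keep their defaults.
let config = LiteParseConfig {
    tessdata_path: Some("/opt/tessdata".to_string()),
    ..Default::default()
};
let parser = LiteParse::new(config);

// parse() is awaited, so this snippet must run inside an async function that returns a Result.
let result = parser.parse("document.pdf").await?;
println!("{}", result.text);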
If you omit tessdata_path, the library relies on the TESSDATA_PREFIX environment variable or the built‑in default.
Verifying Offline Operation
To confirm that OCR runs without network access:
- Disconnect from the network or run in an isolated container.
- Ensure your TESSDATA_PREFIX directory contains the target .traineddata files.
- Execute OCR on a scanned PDF.
If the operation succeeds without timeout errors or download attempts, LiteParse is correctly using the offline tessdata store.
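The final step can be scripted so that it is repeatable inside the isolated container. The following sketch runs an arbitrary command with TESSDATA_PREFIX set only for that child process; run_with_tessdata is a hypothetical helper, and interpreting a zero exit code from lit as successful OCR is an assumption to confirm against the CLI's actual behavior.

```python
import os
import subprocess
from typing import Sequence


def run_with_tessdata(
    cmd: Sequence[str], tessdata_dir: str, timeout: float = 120.0
) -> subprocess.CompletedProcess:
    """Run cmd with TESSDATA_PREFIX set only for that child process."""
    # Copy the current environment and override a single variable.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    # The timeout guards against a process that hangs on a download attempt;
    # subprocess.run raises TimeoutExpired if it is exceeded.
    return subprocess.run(
        list(cmd), env=env, capture_output=True, text=True, timeout=timeout
    )
```

A call such as run_with_tessdata(["lit", "parse", "document.pdf", "--ocr-enabled"], "/opt/tessdata") then exposes returncode, stdout, and stderr for inspection.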
Summary
- Resolution order: Explicit tessdata_path > TESSDATA_PREFIX environment variable > default user directory (~/.tesseract‑rs/tessdata or macOS equivalent).
- Cross‑binding consistency: The same environment variable works across Rust, Python, Node.js, and CLI interfaces because all bindings delegate to the core logic in tesseract.rs.
- Offline requirement: The specified directory must contain valid .traineddata files for your target languages; otherwise TesseractAPI::init returns an initialization error.
- Override capability: CLI flags and constructor options always take precedence over environment variables.
Frequently Asked Questions
What happens if I don't set TESSDATA_PREFIX?
If neither TESSDATA_PREFIX nor an explicit tessdata_path is provided, LiteParse falls back to the default location provided by the tesseract‑rs crate. On Linux this is typically ~/.tesseract‑rs/tessdata, and on macOS it is ~/Library/Application Support/tesseract‑rs/tessdata. If this directory lacks the required language data, OCR will fail.
Where can I download the .traineddata files for offline use?
Download language models from the official Tesseract tessdata repository hosted on GitHub, or from your organization’s internal artifact store. Place the files in the directory referenced by TESSDATA_PREFIX before running LiteParse.
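On a machine that does have network access, fetching the files can be scripted before they are copied into the air‑gapped environment. The sketch below uses only the Python standard library; the base URL is an assumption about the layout of the public tessdata repository on GitHub and should be checked, or replaced with your internal mirror, before use. Both function names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: files are served from the main branch of the public repository.
# Swap in your internal mirror's base URL if you use one.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang: str, base_url: str = BASE_URL) -> str:
    """Build the download URL for one language code, e.g. 'eng'."""
    return f"{base_url.rstrip('/')}/{lang}.traineddata"


def fetch_traineddata(lang: str, dest_dir: str, base_url: str = BASE_URL) -> Path:
    """Download one model into dest_dir; run this on a connected machine."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    url = traineddata_url(lang, base_url)
    # Stream the response to disk instead of loading the model into memory.
    with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)
    return target
```

Calling fetch_traineddata("fra", "/opt/tessdata") would place fra.traineddata in the directory used throughout this article.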
Does the CLI --tessdata-path flag override the environment variable?
Yes. According to the implementation in crates/liteparse/src/main.rs, the --tessdata-path argument takes precedence over the TESSDATA_PREFIX environment variable. The flag is one form of explicit tessdata_path configuration, so the hierarchy is: explicit path (CLI flag, constructor option, or config struct) > environment variable > default directory.
Is TESSDATA_PREFIX supported on Windows?
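Yes. LiteParse reads the environment variable using std::env::var, which works across all platforms. On Windows, set the variable with set TESSDATA_PREFIX=C:\path\to\tessdata in Command Prompt, or with $env:TESSDATA_PREFIX = "C:\path\to\tessdata" in PowerShell, before running your LiteParse application. The Command Prompt set syntax does not create an environment variable in PowerShell.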