How to Use LiteParse TESSDATA_PREFIX for Air-Gapped OCR Environments
Set the TESSDATA_PREFIX environment variable or pass the tessdata_path configuration option to point LiteParse at a local directory containing Tesseract *.traineddata files, enabling fully offline text recognition without internet access.
LiteParse, the open-source document parsing library from the run-llama/liteparse repository, performs optical character recognition through a pluggable OcrEngine trait backed by Tesseract. In secure or air-gapped environments, configure LiteParse to reference locally stored language model files rather than relying on the default tessdata directory already being populated.
How LiteParse Resolves TESSDATA_PREFIX
According to the source code in crates/liteparse/src/ocr/tesseract.rs, the Tesseract engine resolves the tessdata directory through a strict precedence hierarchy. The recognize method implements this logic at lines 60-67:
let tessdata_path = self
    .tessdata_path
    .clone()
    .or_else(|| std::env::var("TESSDATA_PREFIX").ok());
The snippet covers the first two sources; when both are absent, the built-in default applies. LiteParse therefore checks three sources in order:
- Explicit configuration – the tessdata_path field set programmatically.
- Environment variable – TESSDATA_PREFIX.
- Built-in default – provided by default_tessdata_dir() in the same file, which resolves to ~/.tesseract-rs/tessdata on Linux/macOS or a bundled "tessdata" folder on other platforms.
The LiteParseConfig struct in crates/liteparse/src/config.rs documents this fallback behavior in its field comments at lines 12-14.
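The precedence order described above can be mirrored in a few lines of Python. This is a minimal sketch for reasoning about which directory wins, not LiteParse's own code: the function names are hypothetical, and treating the "posix" platform as a stand-in for Linux/macOS is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_tessdata_dir(platform: str = os.name) -> Path:
    """Mirror of the documented default ('posix' stands in for Linux/macOS)."""
    if platform == "posix":
        return Path.home() / ".tesseract-rs" / "tessdata"
    return Path("tessdata")


def resolve_tessdata_dir(
    explicit: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Hypothetical helper: explicit path, then TESSDATA_PREFIX, then the default."""
    if explicit is not None:
        return Path(explicit)
    from_env = env.get("TESSDATA_PREFIX")
    if from_env is not None:
        return Path(from_env)
    return default_tessdata_dir()
```

Calling resolve_tessdata_dir(None) on an air-gapped host shows which directory would be used when no explicit path is configured.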
Configuration Methods for Air-Gapped OCR
Environment Variable (TESSDATA_PREFIX)
The simplest method for air-gapped deployments is exporting TESSDATA_PREFIX as the absolute path of your local tessdata directory. LiteParse reads this variable when it resolves the tessdata directory, provided no explicit path has been configured.
Explicit Configuration Option
For programmatic control, supply the language-specific field when constructing the parser:
- Rust: tessdata_path in LiteParseConfig
- Node.js/TypeScript: tessdataPath in the constructor options
- Python: tessdata_path parameter in LiteParse()
These explicit settings override any environment variable.
CLI Flag Override
When using the LiteParse CLI, the --tessdata-path argument takes highest precedence, as implemented in crates/liteparse/src/main.rs at lines 61-64. This flag directly sets the configuration field, superseding TESSDATA_PREFIX.
Setting Up Local Tesseract Data for Offline Use
- Download the required *.traineddata files for your target languages from the official Tesseract GitHub repository to a machine with internet access.
- Transfer these files to your air-gapped environment and place them in a directory (e.g., /opt/tessdata).
- Configure LiteParse to reference this directory using one of the methods documented in the following examples.
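The transfer step can be scripted so that only language data lands in the target directory. The following is a rough sketch using the Python standard library; the function name and the example paths are hypothetical, and it assumes the files arrive on some mounted transfer medium.

```python
import shutil
from pathlib import Path
from typing import List


def stage_traineddata(source_dir: str, target_dir: str) -> List[str]:
    """Copy every *.traineddata file from source_dir into target_dir.

    Returns the names of the copied files, sorted alphabetically.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(source_dir).glob("*.traineddata")):
        shutil.copy2(src, target / src.name)  # copy2 also preserves timestamps
        copied.append(src.name)
    return copied
```

For example, stage_traineddata("/media/usb/tessdata", "/opt/tessdata") would populate the directory used throughout this article, with the source path being a placeholder for wherever the transfer medium is mounted.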
Language-Specific Implementation Examples
Command Line Interface
Set the environment variable for the session:
export TESSDATA_PREFIX=/opt/tessdata
lit parse scanned.pdf --ocr-language eng
Use a one-time override:
TESSDATA_PREFIX=/opt/tessdata lit parse scanned.pdf
Explicit flag usage (highest precedence):
lit parse scanned.pdf --tessdata-path /opt/tessdata --ocr-language eng
Python
from liteparse import LiteParse

parser = LiteParse(
    ocr_language="eng",
    tessdata_path="/opt/tessdata",  # Explicit path overrides TESSDATA_PREFIX
)
result = parser.parse("scanned.pdf")
print(result.text)
Node.js / TypeScript
import LiteParse from "@llamaindex/liteparse";

const parser = new LiteParse({
  ocrLanguage: "eng",
  tessdataPath: "/opt/tessdata", // Overrides environment variable
});
const result = await parser.parse("scanned.pdf");
console.log(result.text);
Rust
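use liteparse::{LiteParse, LiteParseConfig};

// Wrapped in main so the `?` operator has a Result to propagate into.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = LiteParseConfig {
        // Explicit path overrides TESSDATA_PREFIX
        tessdata_path: Some("/opt/tessdata".into()),
        ..Default::default()
    };
    let parser = LiteParse::new(config);
    let res = parser.parse_file("scanned.pdf")?;
    println!("{}", res.text);
    Ok(())
}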
Summary
- LiteParse resolves the tessdata directory in strict precedence: explicit configuration first, then the TESSDATA_PREFIX environment variable, then the default directory (~/.tesseract-rs/tessdata or bundled tessdata).
- The resolution logic lives in crates/liteparse/src/ocr/tesseract.rs, while the configuration struct is defined in crates/liteparse/src/config.rs.
- Use --tessdata-path in CLI mode, tessdata_path in Python or Rust, or tessdataPath in Node.js to override environment settings.
- For air-gapped deployments, ensure your local directory contains the required *.traineddata language files before running OCR operations.
Frequently Asked Questions
What file format belongs in the TESSDATA_PREFIX directory?
The directory must contain Tesseract *.traineddata files for each language you intend to recognize. These binary files contain trained neural network models for specific languages (e.g., eng.traineddata for English, deu.traineddata for German).
Does LiteParse automatically download language data if TESSDATA_PREFIX is unset?
No. According to the implementation in crates/liteparse/src/ocr/tesseract.rs, if neither the configuration option nor TESSDATA_PREFIX is provided, LiteParse falls back to the default directory. It does not perform automatic downloads; the files must exist locally or the OCR operation will fail.
Can I specify multiple languages when using TESSDATA_PREFIX in air-gapped environments?
Yes. Ensure your local tessdata directory contains the corresponding .traineddata files for each language (e.g., eng.traineddata, fra.traineddata, spa.traineddata), then pass multiple language codes to LiteParse. The engine loads all specified models from the local path defined by TESSDATA_PREFIX.
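Before running a multi-language job, it can help to confirm that every requested language has a model file on disk. The sketch below assumes languages are written as a "+"-joined string such as eng+fra, which is the convention of the Tesseract command line; whether LiteParse accepts the same format is not confirmed here, and both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def split_language_spec(spec: str) -> List[str]:
    """Split a '+'-joined spec such as 'eng+fra' into individual codes."""
    return [code.strip() for code in spec.split("+") if code.strip()]


def unavailable_languages(tessdata_dir: str, spec: str) -> List[str]:
    """Return requested codes that have no <code>.traineddata file in tessdata_dir."""
    installed = {p.stem for p in Path(tessdata_dir).glob("*.traineddata")}
    return [code for code in split_language_spec(spec) if code not in installed]
```

An empty list from unavailable_languages("/opt/tessdata", "eng+fra+spa") means all three model files are present in the directory.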
How do I verify LiteParse is using my local tessdata directory?
Run a scanned test document through the parser with your TESSDATA_PREFIX set and watch for initialization errors. If Tesseract cannot locate the language data, it fails as soon as OCR is attempted. Successful text extraction confirms the engine found the *.traineddata files; note that if tessdata_path or --tessdata-path is also set, that explicit path is the one in use, since it takes precedence over TESSDATA_PREFIX.
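One way to make this check more convincing is to run the same document twice: once against the real directory and once against an empty one. The sketch below assumes the lit parse invocation shown in the CLI examples, assumes the CLI exits with a nonzero status when OCR fails, and requires a document that actually needs OCR; the helper names are hypothetical.

```python
import os
import subprocess
import tempfile
from typing import Dict, Mapping


def env_with_tessdata(
    tessdata_dir: str,
    base: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return a copy of base with TESSDATA_PREFIX set; base itself is not modified."""
    env = dict(base)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


def ocr_succeeds(document: str, tessdata_dir: str) -> bool:
    """Run `lit parse` with the given tessdata directory (assumes nonzero exit on failure)."""
    result = subprocess.run(
        ["lit", "parse", document],
        env=env_with_tessdata(tessdata_dir),
        capture_output=True,
    )
    return result.returncode == 0


def tessdata_prefix_is_honored(document: str, tessdata_dir: str) -> bool:
    """True if OCR works with the real directory and fails with an empty one."""
    with tempfile.TemporaryDirectory() as empty_dir:
        return ocr_succeeds(document, tessdata_dir) and not ocr_succeeds(document, empty_dir)
```

If the run against the empty directory also succeeds, the engine is probably reading language data from somewhere else, such as an explicit path or the default directory.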