How to Integrate LiteParse with LangChain for Document Processing Pipelines
To integrate LiteParse with LangChain, create a custom DocumentLoader that calls the LiteParse parse method and maps each ParsedPage result to a LangChain Document object, preserving page content and geometric metadata.
LiteParse is a high-performance document parser written in Rust, developed under the run-llama organization. Its core extraction engine handles PDFs and other formats, exposing bindings for both Node.js and Python. Because LangChain expects standardized Document objects containing pageContent and metadata, you must bridge LiteParse’s Rust output—structured as ParsedPage and TextItem objects—to this interface. This guide demonstrates how to build these loaders using the actual source APIs found in crates/liteparse/src/parser.rs and the language-specific wrappers.
Understanding LiteParse’s Core Parsing Flow
Before writing the integration code, it is essential to understand how LiteParse processes documents. The orchestration happens in crates/liteparse/src/parser.rs, which drives the following pipeline:
- Input Handling: The parser accepts either a file path or raw bytes via PdfInput::Path or PdfInput::Bytes【parser.rs†L71-L75】.
- Format Conversion: Non-PDF inputs are automatically converted to PDF using external tools like LibreOffice or ImageMagick before extraction begins【parser.rs†L61-L65】.
- Text Extraction: The extract::extract_pages_from_input function pulls raw text items from the PDFium engine【parser.rs†L26-L34】.
- Selective OCR: If enabled in the configuration, pages are rendered to images (ocr_merge::render_pages_for_ocr) and processed by Tesseract or an HTTP OCR server【parser.rs†L41-L53】【parser.rs†L71-L79】.
- Grid Projection: The projection::project_pages_to_grid function reconstructs the spatial layout of text items to preserve reading order【parser.rs†L88-L90】.
- Result Assembly: The final output is a list of ParsedPage structs, each containing the full page text and a detailed textItems array with bounding-box coordinates【parser.rs†L17-L23】.
The Node.js bindings in packages/node/src/lib.ts expose this functionality through the LiteParse class, which wraps a native binary (native.LiteParse) configured via LiteParseNativeConfig【lib.ts†L67-L84】. The Python wrapper in packages/python/liteparse/parser.py provides an identical parse method for LangChain Python.
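The shape of the result described above can be pictured with a small stand-in model. This sketch is not the LiteParse API: the TextItem and ParsedPage dataclasses are hypothetical mirrors whose field names are borrowed from the Python loader and the FAQ in this article (text, page_num, width, height, text_items, and x/y/width/height per item). The real bindings may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class TextItem:
    """Hypothetical stand-in for one positioned text fragment."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ParsedPage:
    """Hypothetical stand-in for one page of parser output."""
    page_num: int
    width: float
    height: float
    text: str
    text_items: List[TextItem] = field(default_factory=list)


# Build a one-page example by hand; a real page would come from the parser.
page = ParsedPage(
    page_num=1,
    width=612.0,
    height=792.0,
    text="Invoice 42",
    text_items=[
        TextItem("Invoice", x=72.0, y=72.0, width=50.0, height=12.0),
        TextItem("42", x=126.0, y=72.0, width=14.0, height=12.0),
    ],
)

# asdict() recursively converts nested dataclasses into plain dicts and lists.
page_dict = asdict(page)
print(page_dict["text_items"][0])
```

Keeping a plain-data picture like this in mind makes the loaders below easier to read: each page becomes one Document, and everything other than the page text travels in metadata.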
Building a LiteParse Document Loader in TypeScript
To integrate with LangChain.js, extend BaseDocumentLoader and implement the loadRaw method. This method instantiates the LiteParse class, invokes its async parse method, and transforms the results into LangChain Document instances.
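// LiteParseLoader.ts
import { LiteParse, LiteParseConfig } from "@llamaindex/liteparse";
import { Document } from "@langchain/core/documents";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";

export class LiteParseLoader extends BaseDocumentLoader {
  private readonly parser: LiteParse;

  constructor(
    userConfig: Partial<LiteParseConfig> = {},
    // Optional default input, used only by the argument-free load() below
    private readonly source?: string | Buffer,
  ) {
    super();
    // Config defaults are handled in Rust (config.rs L40-L56)
    this.parser = new LiteParse(userConfig);
  }

  // BaseDocumentLoader declares load() as abstract, so a subclass must implement it
  async load(): Promise<Document[]> {
    if (this.source === undefined) {
      throw new Error("No source passed to the constructor; call loadRaw(file) instead.");
    }
    return this.loadRaw(this.source);
  }

  async loadRaw(file: string | Buffer): Promise<Document[]> {
    const result = await this.parser.parse(file);
    // One LangChain Document per parsed page
    return result.pages.map((page) => {
      return new Document({
        pageContent: page.text,
        metadata: {
          pageNumber: page.pageNum,
          width: page.width,
          height: page.height,
          // Preserve raw text items for downstream layout-aware retrieval
          textItems: page.textItems,
        },
      });
    });
  }
}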
Key implementation details:
- The LiteParse constructor accepts a partial LiteParseConfig object; unset values default to the Rust struct definitions in crates/liteparse/src/config.rs【config.rs†L40-L56】.
- Each Document includes the concatenated text as pageContent and attaches textItems, an array of objects containing text snippets and their geometric coordinates, to the metadata object.
Building a LiteParse Document Loader in Python
For LangChain Python, the implementation mirrors the TypeScript version. The liteparse Python package exposes the same core functionality.
# liteparse_loader.py
import pathlib
from typing import List, Union

from langchain_core.documents import Document
from liteparse import LiteParse


class LiteParseLoader:
    def __init__(self, **config):
        # Keyword arguments mirror LiteParseConfig from config.rs
        self.parser = LiteParse(**config)

    def load(self, path_or_bytes: Union[str, bytes, pathlib.Path]) -> List[Document]:
        result = self.parser.parse(path_or_bytes)
        docs = []
        # One LangChain Document per parsed page
        for page in result.pages:
            docs.append(
                Document(
                    page_content=page.text,
                    metadata={
                        "page_number": page.page_num,
                        "width": page.width,
                        "height": page.height,
                        "text_items": page.text_items,
                    },
                )
            )
        return docs
Key implementation details:
- The Python LiteParse class (defined in packages/python/liteparse/parser.py) accepts configuration keywords that map directly to the Rust LiteParseConfig struct.
- The loader returns standard LangChain Document objects compatible with text splitters, embedding models, and vector stores.
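Some vector stores accept only scalar metadata values (strings, numbers, booleans), so the nested text_items objects may need flattening before storage. The sketch below shows one possible approach. It assumes each text item exposes text, x, y, width, and height attributes; SimpleNamespace objects stand in for real LiteParse items, and to_scalar_metadata is a hypothetical helper that belongs to neither library.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable

ITEM_FIELDS = ("text", "x", "y", "width", "height")  # assumed attribute names


def to_scalar_metadata(page_number: int, width: float, height: float,
                       text_items: Iterable[Any]) -> Dict[str, Any]:
    """Return metadata whose values are all str, int, or float."""
    items = [{name: getattr(item, name) for name in ITEM_FIELDS}
             for item in text_items]
    return {
        "page_number": page_number,
        "width": width,
        "height": height,
        # Serialize the nested list to a JSON string so the value is scalar.
        "text_items_json": json.dumps(items),
    }


# Stand-in items; real ones would come from page.text_items.
items = [SimpleNamespace(text="Total", x=72.0, y=700.0, width=30.0, height=12.0)]
metadata = to_scalar_metadata(3, 612.0, 792.0, items)

# The nested structure can be recovered later with json.loads().
restored = json.loads(metadata["text_items_json"])
print(restored[0]["text"])
```

The trade-off is that the geometry is no longer directly queryable inside the store; it has to be decoded after retrieval.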
Adding the Loader to a LangChain Pipeline
Once the loader is defined, you can insert it into any standard LangChain ingestion pipeline. Below are complete examples for both Node.js and Python that parse a PDF, split the pages into chunks, and store them in a vector database.
Node.js Pipeline Example
import { LiteParseLoader } from "./LiteParseLoader";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

async function ingestDocument() {
  // Parse the PDF into one Document per page, with OCR enabled
  const loader = new LiteParseLoader({ ocrEnabled: true });
  const docs = await loader.loadRaw("contract.pdf");

  // Split pages into overlapping chunks
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  const chunks = await splitter.splitDocuments(docs);

  // Embed the chunks and index them in memory (requires OPENAI_API_KEY)
  const vectorStore = await MemoryVectorStore.fromDocuments(
    chunks,
    new OpenAIEmbeddings()
  );
  // vectorStore is now ready for retrieval-augmented generation (RAG)
  return vectorStore;
}
Python Pipeline Example
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from liteparse_loader import LiteParseLoader


def ingest_document():
    # Parse the PDF into one Document per page, with OCR enabled
    loader = LiteParseLoader(ocr_enabled=True)
    documents = loader.load("contract.pdf")

    # Split pages into overlapping chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = splitter.split_documents(documents)

    # Embed the chunks and index them with FAISS (requires OPENAI_API_KEY)
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    return vector_store.as_retriever()
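To see how chunk_size and chunk_overlap interact, and why per-page metadata survives splitting, here is a simplified fixed-window splitter. It is an illustration only: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line boundaries rather than at fixed offsets. The window_split helper and the dict-based documents are hypothetical.

```python
from typing import Any, Dict, List


def window_split(doc: Dict[str, Any], chunk_size: int,
                 chunk_overlap: int) -> List[Dict[str, Any]]:
    """Cut doc["page_content"] into overlapping fixed-size windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    text = doc["page_content"]
    stride = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append({
            "page_content": text[start:start + chunk_size],
            # Copy the page metadata onto every chunk and record the offset.
            "metadata": {**doc["metadata"], "start_index": start},
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page_doc = {"page_content": "x" * 2500, "metadata": {"page_number": 7}}
chunks = window_split(page_doc, chunk_size=1000, chunk_overlap=200)
for chunk in chunks:
    print(chunk["metadata"], len(chunk["page_content"]))
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one. Because every chunk carries a copy of its page's metadata, a retrieved chunk can still be traced back to its page number.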
Summary
Integrating LiteParse with LangChain requires building a thin wrapper that translates LiteParse’s Rust-native output into LangChain’s standard Document format.
- LiteParse Core: The Rust engine in crates/liteparse/src/parser.rs handles extraction, optional OCR, and spatial grid projection.
- Loader Pattern: In both Node.js and Python, extend LangChain’s BaseDocumentLoader (or follow its interface) to call LiteParse.parse().
- Metadata Preservation: Map ParsedPage.text to pageContent and store page numbers, dimensions, and the raw textItems array in metadata for layout-aware applications.
- Pipeline Compatibility: The resulting Document objects work seamlessly with LangChain text splitters, embedding models, and vector stores.
Frequently Asked Questions
What file formats does LiteParse support in a LangChain pipeline?
LiteParse natively handles PDFs via PdfInput::Bytes or PdfInput::Path【parser.rs†L71-L75】. For other formats—such as Word documents or images—the Rust core automatically converts them to PDF using external system tools like LibreOffice or ImageMagick before processing【parser.rs†L61-L65】. Your LangChain loader can therefore accept any format that LiteParse can convert.
How do I enable OCR for scanned documents?
Pass ocrEnabled: true (Node.js) or ocr_enabled=True (Python) when constructing the LiteParse instance. This triggers the selective OCR pipeline in crates/liteparse/src/parser.rs, which renders pages to images and processes them using either a local Tesseract instance or a configured HTTP OCR server【parser.rs†L41-L53】. The extracted text is then merged with the native PDF text before being returned to your LangChain loader.
What is the textItems metadata used for?
The textItems array (or text_items in Python) contains raw TextItem objects from the Rust struct defined in parser.rs【parser.rs†L17-L23】, including bounding box coordinates (x, y, width, height) for every text fragment. LangChain pipelines can use this data for visual question answering, highlighting source citations in the original PDF layout, or filtering chunks by spatial location on the page.
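As a sketch of spatial filtering, the snippet below keeps only the text items whose boxes fall inside a region of the page, such as a header band. It assumes items carry x, y, width, and height as described above, with y increasing downward from the top of the page; that coordinate convention is an assumption to check against real LiteParse output. Box and items_in_region are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a text item with a bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float


def items_in_region(items: Iterable[Box], left: float, top: float,
                    right: float, bottom: float) -> List[Box]:
    """Return items whose bounding boxes lie entirely inside the region."""
    return [
        item for item in items
        if item.x >= left and item.y >= top
        and item.x + item.width <= right
        and item.y + item.height <= bottom
    ]


items = [
    Box("ACME Corp", x=72.0, y=30.0, width=80.0, height=14.0),
    Box("Clause 1", x=72.0, y=120.0, width=60.0, height=12.0),
    Box("Page 1", x=500.0, y=40.0, width=40.0, height=10.0),
]

# Keep only items in the top 72 units of a 612-unit-wide page.
header = items_in_region(items, left=0.0, top=0.0, right=612.0, bottom=72.0)
print([item.text for item in header])
```

The same predicate can be inverted to drop headers and footers before chunking, or reused to map a retrieved chunk back to a highlight rectangle.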
Does the Node.js binding use WASM or native binaries?
The Node.js package (packages/node/src/lib.ts) uses native binaries loaded via a thin wrapper in packages/node/src/native.ts【lib.ts†L67-L84】. This provides significantly faster parsing performance compared to WASM or pure JavaScript implementations, as it runs the Rust core directly on the host machine.