How to Integrate LiteParse with LangChain for Document Processing Pipelines

To integrate LiteParse with LangChain, create a custom DocumentLoader that calls the LiteParse parse method and maps each ParsedPage result to a LangChain Document object, preserving page content and geometric metadata.

LiteParse is a high-performance document parser written in Rust, developed under the run-llama organization. Its core extraction engine handles PDFs and other formats, exposing bindings for both Node.js and Python. Because LangChain expects standardized Document objects containing pageContent and metadata, you must bridge LiteParse’s Rust output—structured as ParsedPage and TextItem objects—to this interface. This guide demonstrates how to build these loaders using the actual source APIs found in crates/liteparse/src/parser.rs and the language-specific wrappers.

Understanding LiteParse’s Core Parsing Flow

Before writing the integration code, it is essential to understand how LiteParse processes documents. The orchestration happens in crates/liteparse/src/parser.rs, which drives the following pipeline:

  1. Input Handling: The parser accepts either a file path or raw bytes via PdfInput::Path or PdfInput::Bytes【parser.rs†L71-L75】.
  2. Format Conversion: Non-PDF inputs are automatically converted to PDF using external tools like LibreOffice or ImageMagick before extraction begins【parser.rs†L61-L65】.
  3. Text Extraction: The extract::extract_pages_from_input function pulls raw text items from the PDFium engine【parser.rs†L26-L34】.
  4. Selective OCR: If enabled in the configuration, pages are rendered to images (ocr_merge::render_pages_for_ocr) and processed by Tesseract or an HTTP OCR server【parser.rs†L41-L53】【parser.rs†L71-L79】.
  5. Grid Projection: The projection::project_pages_to_grid function reconstructs the spatial layout of text items to preserve reading order【parser.rs†L88-L90】.
  6. Result Assembly: The final output is a list of ParsedPage structs, each containing the full page text and a detailed textItems array with bounding-box coordinates【parser.rs†L17-L23】.
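
The result shape from step 6 can be pictured with a pair of small data classes. This is only a sketch for reasoning about the mapping: the field names below are inferred from the attributes used later in this guide (text, page_num, width, height, text_items) and are not copied from the Rust source, and page_to_record is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextItem:
    """One text fragment with its bounding box (assumed field names)."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class ParsedPage:
    """One parsed page: full text plus positioned fragments (assumed layout)."""
    page_num: int
    width: float
    height: float
    text: str
    text_items: List[TextItem] = field(default_factory=list)


def page_to_record(page: ParsedPage) -> dict:
    """Map a page to the two fields a LangChain Document is built from."""
    return {
        "page_content": page.text,
        "metadata": {
            "page_number": page.page_num,
            "width": page.width,
            "height": page.height,
            "text_items": page.text_items,
        },
    }


page = ParsedPage(
    page_num=1,
    width=612.0,
    height=792.0,
    text="Invoice 2024",
    text_items=[TextItem("Invoice", 72.0, 72.0, 40.0, 12.0)],
)
record = page_to_record(page)
print(record["metadata"]["page_number"])
```

The record keys match the keyword arguments of the Python Document constructor, so the loaders later in this guide perform essentially this mapping for every page.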

The Node.js bindings in packages/node/src/lib.ts expose this functionality through the LiteParse class, which wraps a native binary (native.LiteParse) configured via LiteParseNativeConfig【lib.ts†L67-L84】. The Python wrapper in packages/python/liteparse/parser.py provides an equivalent parse method, with snake_case attribute names such as page_num and text_items, for use with LangChain Python.

Building a LiteParse Document Loader in TypeScript

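To integrate with LangChain.js, extend BaseDocumentLoader. The base class declares an abstract load() method that takes no arguments, so the loader below implements load() for an optional default input passed to the constructor and keeps a loadRaw(file) helper for parsing an arbitrary path or buffer. Both paths call the async parse method on the LiteParse class and transform the results into LangChain Document instances.

// LiteParseLoader.ts
import { LiteParse, type LiteParseConfig } from "@llamaindex/liteparse";
import { Document } from "@langchain/core/documents";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";

export class LiteParseLoader extends BaseDocumentLoader {
  private readonly parser: LiteParse;

  constructor(
    userConfig: Partial<LiteParseConfig> = {},
    private readonly defaultInput?: string | Buffer,
  ) {
    super();
    // Config defaults are handled in Rust (config.rs L40-L56)
    this.parser = new LiteParse(userConfig);
  }

  // Required by BaseDocumentLoader: parses the input given to the constructor.
  async load(): Promise<Document[]> {
    if (this.defaultInput === undefined) {
      throw new Error("No input set: pass one to the constructor or call loadRaw(file).");
    }
    return this.loadRaw(this.defaultInput);
  }

  // Parses an arbitrary file path or buffer and returns one Document per page.
  async loadRaw(file: string | Buffer): Promise<Document[]> {
    const result = await this.parser.parse(file);

    return result.pages.map((page) => {
      return new Document({
        pageContent: page.text,
        metadata: {
          pageNumber: page.pageNum,
          width: page.width,
          height: page.height,
          // Preserve raw text items for downstream layout-aware retrieval
          textItems: page.textItems,
        },
      });
    });
  }
}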

Key implementation details:

  • The LiteParse constructor accepts a partial LiteParseConfig object; unset values default to the Rust struct definitions in crates/liteparse/src/config.rs【config.rs†L40-L56】.
  • Each Document includes the concatenated text as pageContent and attaches textItems—an array of objects containing text snippets and their geometric coordinates—to the metadata object.

Building a LiteParse Document Loader in Python

For LangChain Python, the implementation mirrors the TypeScript version. The liteparse Python package exposes the same core functionality.

# liteparse_loader.py
import pathlib
from typing import List, Union

from langchain_core.documents import Document
from liteparse import LiteParse


class LiteParseLoader:
    def __init__(self, **config):
        # Keyword arguments mirror LiteParseConfig from config.rs
        self.parser = LiteParse(**config)

    def load(self, path_or_bytes: Union[str, bytes, pathlib.Path]) -> List[Document]:
        # Normalize Path objects to plain strings before handing them to the parser
        if isinstance(path_or_bytes, pathlib.Path):
            path_or_bytes = str(path_or_bytes)
        result = self.parser.parse(path_or_bytes)

        # One Document per page, with geometry kept in metadata
        return [
            Document(
                page_content=page.text,
                metadata={
                    "page_number": page.page_num,
                    "width": page.width,
                    "height": page.height,
                    "text_items": page.text_items,
                },
            )
            for page in result.pages
        ]

Key implementation details:

  • The Python LiteParse class (defined in packages/python/liteparse/parser.py) accepts configuration keywords that map directly to the Rust LiteParseConfig struct.
  • The loader returns standard LangChain Document objects compatible with text splitters, embedding models, and vector stores.
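
Some vector stores accept only scalar metadata values (strings, numbers, booleans), so a nested text_items list may be rejected at insert time. One option is to JSON-encode non-scalar values before indexing. The following is a minimal sketch: scalar_metadata and to_plain are hypothetical helpers, and the fallback through vars() assumes the items expose a __dict__, which objects coming from a native binding may not.

```python
import json
from dataclasses import asdict, is_dataclass
from typing import Any, Dict

SCALARS = (str, int, float, bool, type(None))


def to_plain(value: Any) -> Any:
    """Recursively convert a value into JSON-serializable primitives."""
    if isinstance(value, SCALARS):
        return value
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if is_dataclass(value) and not isinstance(value, type):
        return to_plain(asdict(value))
    # Assumption: any other object stores its fields in __dict__.
    public = {k: v for k, v in vars(value).items() if not k.startswith("_")}
    return to_plain(public)


def scalar_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy in which every non-scalar value is a JSON string."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, SCALARS):
            flat[key] = value
        else:
            flat[key] = json.dumps(to_plain(value))
    return flat


meta = {
    "page_number": 1,
    "text_items": [
        {"text": "Hi", "x": 1.0, "y": 2.0, "width": 3.0, "height": 4.0},
    ],
}
flat = scalar_metadata(meta)
print(type(flat["text_items"]).__name__)
```

Downstream code can recover the items with json.loads(flat["text_items"]) when it needs the geometry again.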

Adding the Loader to a LangChain Pipeline

Once the loader is defined, you can insert it into any standard LangChain ingestion pipeline. The examples below, for Node.js and Python, parse a PDF, split the pages into chunks, and index them in a vector store (an in-memory store in Node.js, FAISS in Python).

Node.js Pipeline Example

import { LiteParseLoader } from "./LiteParseLoader";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
// MemoryVectorStore ships with the main "langchain" package (0.x import path)
import { MemoryVectorStore } from "langchain/vectorstores/memory";

async function ingestDocument() {
  const loader = new LiteParseLoader({ ocrEnabled: true });
  const docs = await loader.loadRaw("contract.pdf");

  const splitter = new RecursiveCharacterTextSplitter({ 
    chunkSize: 1000, 
    chunkOverlap: 200 
  });
  const chunks = await splitter.splitDocuments(docs);

  const vectorStore = await MemoryVectorStore.fromDocuments(
    chunks, 
    new OpenAIEmbeddings()
  );
  
  // vectorStore is now ready for retrieval-augmented generation (RAG)
  return vectorStore;
}

Python Pipeline Example

from liteparse_loader import LiteParseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def ingest_document():
    loader = LiteParseLoader(ocr_enabled=True)
    documents = loader.load("contract.pdf")

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=200
    )
    chunks = splitter.split_documents(documents)

    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    return vector_store.as_retriever()

Summary

Integrating LiteParse with LangChain requires building a thin wrapper that translates LiteParse’s Rust-native output into LangChain’s standard Document format.

  • LiteParse Core: The Rust engine in crates/liteparse/src/parser.rs handles extraction, optional OCR, and spatial grid projection.
  • Loader Pattern: In Node.js, extend LangChain’s BaseDocumentLoader; in Python, follow the same loader interface. Both call LiteParse.parse().
  • Metadata Preservation: Map ParsedPage.text to pageContent and store page numbers, dimensions, and the raw textItems array in metadata for layout-aware applications.
  • Pipeline Compatibility: The resulting Document objects work seamlessly with LangChain text splitters, embedding models, and vector stores.

Frequently Asked Questions

What file formats does LiteParse support in a LangChain pipeline?

LiteParse natively handles PDFs via PdfInput::Bytes or PdfInput::Path【parser.rs†L71-L75】. For other formats—such as Word documents or images—the Rust core automatically converts them to PDF using external system tools like LibreOffice or ImageMagick before processing【parser.rs†L61-L65】. Your LangChain loader can therefore accept any format that LiteParse can convert.
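
Because conversion depends on external system tools, a pipeline may want to check for them before accepting non-PDF files. Here is a minimal sketch using only the standard library. The executable names (soffice, libreoffice, magick, convert) are assumptions about what is typically installed, not names confirmed from the LiteParse source, and find_converters and needs_conversion are hypothetical helpers.

```python
import shutil
from typing import Dict, Optional, Tuple

# Candidate executable names per tool (assumed, not taken from LiteParse).
# Note: on Windows, "convert" is also an unrelated system utility.
CONVERTERS: Dict[str, Tuple[str, ...]] = {
    "libreoffice": ("soffice", "libreoffice"),
    "imagemagick": ("magick", "convert"),
}


def find_converters(
    candidates: Dict[str, Tuple[str, ...]] = CONVERTERS,
) -> Dict[str, Optional[str]]:
    """Return the first executable found on PATH for each tool, or None."""
    found: Dict[str, Optional[str]] = {}
    for tool, names in candidates.items():
        paths = (shutil.which(name) for name in names)
        found[tool] = next((p for p in paths if p), None)
    return found


def needs_conversion(path: str) -> bool:
    """Treat anything without a .pdf extension as needing conversion."""
    return not path.lower().endswith(".pdf")


print(find_converters())
print(needs_conversion("report.docx"))
```

A loader could call find_converters() once at start-up and raise a clear error when a non-PDF file arrives but no converter is available.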

How do I enable OCR for scanned documents?

Pass ocrEnabled: true (Node.js) or ocr_enabled=True (Python) when constructing the LiteParse instance. This triggers the selective OCR pipeline in crates/liteparse/src/parser.rs, which renders pages to images and processes them using either a local Tesseract instance or a configured HTTP OCR server【parser.rs†L41-L53】. The extracted text is then merged with the native PDF text before being returned to your LangChain loader.

What is the textItems metadata used for?

The textItems array (text_items in Python) holds one TextItem per text fragment, mirroring the Rust struct defined in parser.rs【parser.rs†L17-L23】. Each item carries the fragment's text and its bounding-box coordinates (x, y, width, height). LangChain pipelines can use this data for visual question answering, highlighting source citations in the original PDF layout, or filtering chunks by spatial location on the page.
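
Spatial filtering can be sketched in a few lines. The example assumes each item is a dict with x, y, width, and height keys and that the origin is the top-left corner of the page. Both are assumptions to verify against the actual LiteParse output, and items_in_region is a hypothetical helper.

```python
from typing import Any, Dict, List


def items_in_region(
    text_items: List[Dict[str, Any]],
    left: float,
    top: float,
    right: float,
    bottom: float,
) -> List[Dict[str, Any]]:
    """Keep items whose bounding box lies entirely inside the region."""
    selected = []
    for item in text_items:
        x0, y0 = item["x"], item["y"]
        x1, y1 = x0 + item["width"], y0 + item["height"]
        if x0 >= left and y0 >= top and x1 <= right and y1 <= bottom:
            selected.append(item)
    return selected


items = [
    {"text": "Header", "x": 72, "y": 20, "width": 100, "height": 12},
    {"text": "Body", "x": 72, "y": 300, "width": 400, "height": 12},
]

# Select everything in a 72-unit band at the top of a 612-unit-wide page.
header = items_in_region(items, left=0, top=0, right=612, bottom=72)
print([item["text"] for item in header])
```

The same predicate can be inverted to drop running headers and footers before the page text is embedded.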

Does the Node.js binding use WASM or native binaries?

The Node.js package (packages/node/src/lib.ts) uses native binaries loaded via a thin wrapper in packages/node/src/native.ts【lib.ts†L67-L84】. Because the Rust core runs directly on the host machine, parsing avoids the overhead of a WASM runtime or a pure JavaScript implementation.
