# How to Integrate LiteParse with LangChain for Document Processing Pipelines

> Learn how to integrate LiteParse with LangChain to build document processing pipelines, using a custom document loader that preserves page content and metadata.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-30

---

**To integrate LiteParse with LangChain, create a custom `DocumentLoader` that calls the LiteParse `parse` method and maps each `ParsedPage` result to a LangChain `Document` object, preserving page content and geometric metadata.**

LiteParse is a high-performance document parser written in Rust, developed under the `run-llama` organization. Its core extraction engine handles PDFs and other formats, exposing bindings for both Node.js and Python. Because LangChain expects standardized `Document` objects containing `pageContent` and metadata, you must bridge LiteParse’s Rust output—structured as `ParsedPage` and `TextItem` objects—to this interface. This guide demonstrates how to build these loaders using the actual source APIs found in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) and the language-specific wrappers.

## Understanding LiteParse’s Core Parsing Flow

Before writing the integration code, it is essential to understand how LiteParse processes documents. The orchestration happens in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs), which drives the following pipeline:

1.  **Input Handling**: The parser accepts either a file path or raw bytes via `PdfInput::Path` or `PdfInput::Bytes`【parser.rs†L71-L75】.
2.  **Format Conversion**: Non-PDF inputs are automatically converted to PDF using external tools like LibreOffice or ImageMagick before extraction begins【parser.rs†L61-L65】.
3.  **Text Extraction**: The `extract::extract_pages_from_input` function pulls raw text items from the PDFium engine【parser.rs†L26-L34】.
4.  **Selective OCR**: If enabled in the configuration, pages are rendered to images (`ocr_merge::render_pages_for_ocr`) and processed by Tesseract or an HTTP OCR server【parser.rs†L41-L53】【parser.rs†L71-L79】.
5.  **Grid Projection**: The `projection::project_pages_to_grid` function reconstructs the spatial layout of text items to preserve reading order【parser.rs†L88-L90】.
6.  **Result Assembly**: The final output is a list of `ParsedPage` structs, each containing the full page text and a detailed `textItems` array with bounding-box coordinates【parser.rs†L17-L23】.
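
To make the grid projection step (step 5) more concrete, here is a deliberately simplified Python sketch of the general idea: group text items into rows by vertical position, then order each row left to right. This is not LiteParse's actual algorithm, which lives in the Rust core. The `Item` type, the `row_tolerance` parameter, and the `project_to_lines` helper are hypothetical names used only for illustration, and a top-left coordinate origin is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical stand-in for a text item (top-left origin assumed)."""
    text: str
    x: float
    y: float


def project_to_lines(items: List[Item], row_tolerance: float = 2.0) -> List[str]:
    """Group items into rows by y, then read each row left to right."""
    rows: List[List[Item]] = []
    for item in sorted(items, key=lambda i: (i.y, i.x)):
        # Join the current row when the item is within the tolerance
        # of that row's first item; otherwise start a new row.
        if rows and abs(rows[-1][0].y - item.y) <= row_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [
        " ".join(i.text for i in sorted(row, key=lambda i: i.x))
        for row in rows
    ]


items = [
    Item("World", x=60, y=10.5),
    Item("Hello", x=10, y=10),
    Item("Second", x=10, y=30),
    Item("line", x=70, y=30),
]
lines = project_to_lines(items)
print(lines)
```

Items that were extracted out of order and with slightly different baselines still end up on the same line, which is the property that matters for downstream chunking.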

The Node.js bindings in [`packages/node/src/lib.ts`](https://github.com/run-llama/liteparse/blob/main/packages/node/src/lib.ts) expose this functionality through the `LiteParse` class, which wraps a native binary (`native.LiteParse`) configured via `LiteParseNativeConfig`【lib.ts†L67-L84】. The Python wrapper in [`packages/python/liteparse/parser.py`](https://github.com/run-llama/liteparse/blob/main/packages/python/liteparse/parser.py) provides an equivalent `parse` method for use with LangChain's Python library.

## Building a LiteParse Document Loader in TypeScript

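To integrate with LangChain.js, extend `BaseDocumentLoader`. The base class declares `load()` as abstract, so the loader must implement it; here `load()` parses a source supplied at construction time, while a `loadRaw` helper parses any file path or buffer on demand. Both paths call the async `parse` method on a `LiteParse` instance and transform the results into LangChain `Document` instances.

```typescript
// LiteParseLoader.ts
import { LiteParse, type LiteParseConfig } from "@llamaindex/liteparse";
import { Document } from "@langchain/core/documents";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";

export class LiteParseLoader extends BaseDocumentLoader {
  private readonly parser: LiteParse;

  constructor(
    userConfig: Partial<LiteParseConfig> = {},
    // Optional default source used by load()
    private readonly source?: string | Buffer,
  ) {
    super();
    // Config defaults are handled in Rust (config.rs L40-L56)
    this.parser = new LiteParse(userConfig);
  }

  // Required by BaseDocumentLoader: parse the source given to the constructor.
  async load(): Promise<Document[]> {
    if (this.source === undefined) {
      throw new Error("No source set: pass one to the constructor or call loadRaw(file).");
    }
    return this.loadRaw(this.source);
  }

  // Parse an arbitrary file path or buffer into one Document per page.
  async loadRaw(file: string | Buffer): Promise<Document[]> {
    const result = await this.parser.parse(file);

    return result.pages.map((page) => {
      return new Document({
        pageContent: page.text,
        metadata: {
          pageNumber: page.pageNum,
          width: page.width,
          height: page.height,
          // Preserve raw text items for downstream layout-aware retrieval
          textItems: page.textItems,
        },
      });
    });
  }
}
```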

**Key implementation details**:
*   The `LiteParse` constructor accepts a partial `LiteParseConfig` object; unset values default to the Rust struct definitions in [`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs)【config.rs†L40-L56】.
*   Each `Document` includes the concatenated `text` as `pageContent` and attaches `textItems`—an array of objects containing text snippets and their geometric coordinates—to the metadata object.

## Building a LiteParse Document Loader in Python

For LangChain Python, the implementation mirrors the TypeScript version. The `liteparse` Python package exposes the same core functionality.

```python
# liteparse_loader.py
import pathlib
from typing import List, Union

from langchain_core.documents import Document
from liteparse import LiteParse


class LiteParseLoader:
    def __init__(self, **config):
        # Keyword arguments mirror LiteParseConfig from config.rs
        self.parser = LiteParse(**config)

    def load(self, path_or_bytes: Union[str, bytes, pathlib.Path]) -> List[Document]:
        result = self.parser.parse(path_or_bytes)
        docs = []
        # Emit one Document per parsed page
        for page in result.pages:
            docs.append(
                Document(
                    page_content=page.text,
                    metadata={
                        "page_number": page.page_num,
                        "width": page.width,
                        "height": page.height,
                        "text_items": page.text_items,
                    },
                )
            )
        return docs
```

**Key implementation details**:
*   The Python `LiteParse` class (defined in [`packages/python/liteparse/parser.py`](https://github.com/run-llama/liteparse/blob/main/packages/python/liteparse/parser.py)) accepts configuration keywords that map directly to the Rust `LiteParseConfig` struct.
*   The loader returns standard LangChain `Document` objects compatible with text splitters, embedding models, and vector stores.
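
The page-to-document mapping can be checked in isolation, without installing LiteParse or LangChain. The sketch below uses a hypothetical `FakePage` stand-in that carries the same page fields the loader reads (`page_num`, `text`, `width`, `height`, `text_items`) and a hypothetical `page_to_document_fields` helper that builds the keyword arguments the loader would pass to `Document(...)`. Neither name comes from LiteParse or LangChain.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakePage:
    """Hypothetical stand-in mirroring the page fields the loader reads."""
    page_num: int
    text: str
    width: float
    height: float
    text_items: List[Dict[str, Any]] = field(default_factory=list)


def page_to_document_fields(page: FakePage) -> Dict[str, Any]:
    """Build the keyword arguments that would be passed to Document(...)."""
    return {
        "page_content": page.text,
        "metadata": {
            "page_number": page.page_num,
            "width": page.width,
            "height": page.height,
            "text_items": page.text_items,
        },
    }


pages = [
    FakePage(1, "Hello", 612.0, 792.0),
    FakePage(2, "World", 612.0, 792.0, [{"text": "World", "x": 10, "y": 20}]),
]
fields = [page_to_document_fields(p) for p in pages]
print(fields[1]["metadata"]["page_number"])
```

Keeping the mapping in a small pure function like this makes it straightforward to unit-test metadata keys before wiring the loader into a real pipeline.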

## Adding the Loader to a LangChain Pipeline

Once the loader is defined, you can insert it into any standard LangChain ingestion pipeline. The examples below, one for Node.js and one for Python, parse a PDF, split the pages into chunks, and index them in an in-memory vector store.

### Node.js Pipeline Example

```typescript
import { LiteParseLoader } from "./LiteParseLoader";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
// In LangChain.js v1 and later, this import lives in "@langchain/classic/vectorstores/memory"
import { MemoryVectorStore } from "langchain/vectorstores/memory";

async function ingestDocument() {
  const loader = new LiteParseLoader({ ocrEnabled: true });
  const docs = await loader.loadRaw("contract.pdf");

  const splitter = new RecursiveCharacterTextSplitter({ 
    chunkSize: 1000, 
    chunkOverlap: 200 
  });
  const chunks = await splitter.splitDocuments(docs);

  const vectorStore = await MemoryVectorStore.fromDocuments(
    chunks, 
    new OpenAIEmbeddings()
  );
  
  // vectorStore is now ready for retrieval-augmented generation (RAG)
  return vectorStore;
}

```

### Python Pipeline Example

```python
from liteparse_loader import LiteParseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

def ingest_document():
    loader = LiteParseLoader(ocr_enabled=True)
    documents = loader.load("contract.pdf")

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=200
    )
    chunks = splitter.split_documents(documents)

    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    return vector_store.as_retriever()

```

## Summary

Integrating LiteParse with LangChain requires building a thin wrapper that translates LiteParse’s Rust-native output into LangChain’s standard `Document` format.

*   **LiteParse Core**: The Rust engine in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) handles extraction, optional OCR, and spatial grid projection.
*   **Loader Pattern**: In Node.js, extend LangChain’s `BaseDocumentLoader`; in Python, a plain class with a `load` method is enough for the pipeline shown here. In both cases the loader calls `LiteParse.parse()`.
*   **Metadata Preservation**: Map `ParsedPage.text` to `pageContent` and store page numbers, dimensions, and the raw `textItems` array in metadata for layout-aware applications.
*   **Pipeline Compatibility**: The resulting `Document` objects work seamlessly with LangChain text splitters, embedding models, and vector stores.

## Frequently Asked Questions

### What file formats does LiteParse support in a LangChain pipeline?

LiteParse natively handles PDFs via `PdfInput::Bytes` or `PdfInput::Path`【parser.rs†L71-L75】. For other formats—such as Word documents or images—the Rust core automatically converts them to PDF using external system tools like LibreOffice or ImageMagick before processing【parser.rs†L61-L65】. Your LangChain loader can therefore accept any format that LiteParse can convert.

### How do I enable OCR for scanned documents?

Pass `ocrEnabled: true` (Node.js) or `ocr_enabled=True` (Python) when constructing the `LiteParse` instance. This triggers the selective OCR pipeline in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs), which renders pages to images and processes them using either a local Tesseract instance or a configured HTTP OCR server【parser.rs†L41-L53】. The extracted text is then merged with the native PDF text before being returned to your LangChain loader.

### What is the `textItems` metadata used for?

The `textItems` array (or `text_items` in Python) contains raw `TextItem` objects from the Rust struct defined in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs)【parser.rs†L17-L23】, including bounding box coordinates (`x`, `y`, `width`, `height`) for every text fragment. LangChain pipelines can use this data for visual question answering, highlighting source citations in the original PDF layout, or filtering chunks by spatial location on the page.
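
As an illustration of spatial filtering, the following sketch selects the text items whose bounding boxes fall entirely inside a rectangular region, such as a page header. It assumes each item is a dictionary with `text`, `x`, `y`, `width`, and `height` keys and that coordinates use a top-left origin; the actual objects returned by the Python bindings may expose these values as attributes instead. The `items_in_region` helper and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def items_in_region(
    text_items: List[Dict[str, Any]],
    left: float,
    top: float,
    right: float,
    bottom: float,
) -> List[Dict[str, Any]]:
    """Return the items whose bounding box lies fully inside the region."""
    selected = []
    for item in text_items:
        x0, y0 = item["x"], item["y"]
        x1, y1 = x0 + item["width"], y0 + item["height"]
        if x0 >= left and y0 >= top and x1 <= right and y1 <= bottom:
            selected.append(item)
    return selected


# Hypothetical sample data for a 612 x 792 page
text_items = [
    {"text": "Invoice #42", "x": 50, "y": 40, "width": 120, "height": 14},
    {"text": "Total: $10", "x": 50, "y": 700, "width": 90, "height": 14},
]

# Keep only items in the top 100 units of the page
header_items = items_in_region(text_items, left=0, top=0, right=612, bottom=100)
print([item["text"] for item in header_items])
```

The same check can be applied to `doc.metadata["text_items"]` on each loaded document to keep, drop, or label content by its position on the page.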

### Does the Node.js binding use WASM or native binaries?

The Node.js package ([`packages/node/src/lib.ts`](https://github.com/run-llama/liteparse/blob/main/packages/node/src/lib.ts)) uses native binaries loaded via a thin wrapper in [`packages/node/src/native.ts`](https://github.com/run-llama/liteparse/blob/main/packages/node/src/native.ts)【lib.ts†L67-L84】. Because the Rust core runs directly on the host machine, parsing avoids the overhead of a WASM or pure JavaScript implementation.