How to Parse PDFs from In-Memory Bytes in LiteParse: A Complete Guide

To parse PDFs from memory in LiteParse, pass a Vec<u8> buffer to the parse_input method using the PdfInput::Bytes variant in Rust, or use language-specific helpers like parse() with Buffer/Uint8Array in Node.js and parse_bytes() in Python.

LiteParse, the high-performance PDF parsing library from the run-llama/liteparse repository, abstracts file inputs through a unified PdfInput type. While the high-level API accepts file paths, the library provides a direct path for handling PDF data already loaded in memory. This capability is essential for applications processing uploads from HTTP requests, reading from object storage, or working with encrypted document streams.

Understanding the PdfInput Abstraction

The core of LiteParse's input handling resides in crates/liteparse/src/types.rs, which defines the PdfInput enum. This abstraction allows the parser to treat file paths and memory buffers interchangeably:

pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}

When you call the convenience method LiteParse::parse(&str), the library internally constructs a PdfInput::Path. To process data already in memory, you must bypass this helper and interact with the lower-level parse_input method directly.
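
The relationship between the convenience method and parse_input can be sketched without the library itself. The enum below mirrors the definition quoted above; the Parser struct and both method bodies are hypothetical stand-ins for illustration (synchronous, and returning a description rather than a parse result), not LiteParse's actual implementation.

```rust
/// Mirrors the enum quoted from crates/liteparse/src/types.rs.
pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}

/// Hypothetical stand-in for the parser; not the real LiteParse type.
pub struct Parser;

impl Parser {
    /// Convenience wrapper: wraps the path and delegates to `parse_input`.
    pub fn parse(&self, path: &str) -> String {
        self.parse_input(PdfInput::Path(path.to_string()))
    }

    /// Single entry point; both variants arrive here.
    /// Returns a description instead of parsing, to keep the sketch self-contained.
    pub fn parse_input(&self, input: PdfInput) -> String {
        match input {
            PdfInput::Path(p) => format!("would read and parse file at {}", p),
            PdfInput::Bytes(b) => format!("would parse {} bytes from memory", b.len()),
        }
    }
}
```

Because the wrapper only builds a variant and forwards it, callers holding bytes skip the wrapper and construct PdfInput::Bytes themselves.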

Parsing In-Memory PDFs in Rust

In Rust, parsing PDFs from memory requires calling the async parse_input method with the Bytes variant.

Using parse_input with PdfInput::Bytes

First, ensure your PDF data exists as a Vec<u8>. This can originate from file reads, network requests, or generated content. Then instantiate LiteParse with your configuration and await the parse_input call:

use liteparse::config::LiteParseConfig;
use liteparse::parser::{LiteParse, PdfInput};

#[tokio::main]
async fn main() -> Result<(), liteparse::error::LiteParseError> {
    // Example: Loading PDF into memory from disk
    // (In production, this could be from reqwest, s3, etc.)
    let pdf_bytes = std::fs::read("document.pdf")?;

    let parser = LiteParse::new(LiteParseConfig::default());
    
    // Pass the bytes directly to parse_input
    let result = parser
        .parse_input(PdfInput::Bytes(pdf_bytes))
        .await?;

    println!("Successfully parsed {} pages", result.pages.len());
    Ok(())
}

The parse_input method in crates/liteparse/src/parser.rs validates the input, initializes the PDFium backend, and executes the full processing pipeline—including OCR and layout analysis—without writing temporary files.
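
The source does not spell out what the validation step checks. When the bytes come from an untrusted upload, one cheap pre-check you might run before calling parse_input is to confirm that the buffer begins with the PDF header marker %PDF-. The helper name looks_like_pdf is hypothetical and is not part of LiteParse's API; this is a minimal sketch, not a substitute for the parser's own validation.

```rust
/// Hypothetical helper: returns true when the buffer starts with the PDF header marker.
/// This is a strict check; it rejects data with any leading bytes before `%PDF-`.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}
```

Rejecting obviously wrong payloads (an HTML error page, an empty body) at this stage avoids initializing the parsing backend for input that cannot succeed.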

Language Bindings for In-Memory Parsing

LiteParse's language bindings automatically handle the conversion to PdfInput::Bytes, providing idiomatic interfaces for each runtime.

Node.js and TypeScript

The Node.js binding defined in packages/node/src/lib.ts accepts Buffer or Uint8Array objects through the standard parse method. The TypeScript type definition declares:

type LiteParseInput = string | Buffer | Uint8Array;

When you pass a Buffer, the binding automatically routes to the Bytes variant:

import { LiteParse } from "liteparse";
import { readFile } from "fs/promises";

async function parseBuffer() {
  const pdfBuffer = await readFile("document.pdf");
  
  const parser = new LiteParse();
  const result = await parser.parse(pdfBuffer); // Automatically uses Bytes variant
  
  console.log(`Parsed ${result.pages.length} pages`);
}

Python

The Python binding in packages/python/liteparse/parser.py exposes the dedicated parse_bytes() method for memory-based parsing. This method accepts Python bytes objects:

from liteparse import LiteParse
from pathlib import Path

def parse_from_memory(file_path: Path):
    # Read into bytes (could also be from urllib, boto3, etc.)
    pdf_bytes = file_path.read_bytes()

    parser = LiteParse()  # Uses default configuration
    result = parser.parse_bytes(pdf_bytes)  # Directly calls native parse_bytes

    print(f"Extracted {len(result.pages)} pages from memory buffer")

How the Byte Parsing Pipeline Works

Using PdfInput::Bytes triggers the same unified processing pipeline as file-based inputs. The implementation in crates/liteparse/src/parser.rs ensures:

  • Zero-copy operations: The Vec<u8> buffer passes directly to PDFium without serialization overhead or temporary file creation.
  • Consistent configuration: LiteParseConfig affects parsing identically regardless of input source, ensuring OCR settings, DPI targets, and extraction modes behave predictably.
  • Async execution: The parse_input method is async in Rust, and the Node.js parse() call returns a promise that you await, so PDFium initialization and text layout projection do not block the caller. The Python example above calls parse_bytes() synchronously.

Summary

  • Abstract input type: LiteParse uses the PdfInput enum in crates/liteparse/src/types.rs to handle both Path(String) and Bytes(Vec<u8>) variants.
  • Rust implementation: Call parser.parse_input(PdfInput::Bytes(buffer)).await directly instead of the high-level parse(&str) method.
  • Language convenience: Node.js accepts Buffer and Uint8Array through the standard parse() method, while Python provides parse_bytes() for bytes objects.
  • Performance characteristics: In-memory parsing avoids disk I/O and temporary file creation, with PDFium reading directly from the provided buffer.
  • Source locations: Key files include crates/liteparse/src/types.rs (enum definition), crates/liteparse/src/parser.rs (implementation), and language-specific wrappers in packages/node/src/lib.ts and packages/python/liteparse/parser.py.

Frequently Asked Questions

Does LiteParse support streaming PDF data?

No, LiteParse requires the complete PDF data to be available in memory as a contiguous Vec<u8> buffer before parsing begins. The current architecture in crates/liteparse/src/types.rs defines PdfInput::Bytes(Vec<u8>), which necessitates fully materialized data rather than streams or generators.
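
If your data arrives as a stream, you can drain it into a buffer first and then wrap the result in PdfInput::Bytes. A minimal sketch using only the standard library follows; the helper name read_all is hypothetical, and any type implementing std::io::Read (a file, a socket, a decompressor) can serve as the source.

```rust
use std::io::{self, Read};

/// Hypothetical helper: drains any reader into a contiguous buffer
/// suitable for wrapping in `PdfInput::Bytes`.
pub fn read_all<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // read_to_end keeps reading until the source reports end of input.
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}
```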

Is there a size limit for in-memory PDF parsing?

LiteParse imposes no explicit size limits within the core Rust library; the constraint depends on your system's available RAM and the memory limits of the PDFium rendering engine. The Vec<u8> used in PdfInput::Bytes can never hold more than isize::MAX bytes, but on 64-bit systems that ceiling lies far beyond physical memory, so RAM is the practical limit.
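
Since the library sets no cap of its own, a service accepting untrusted uploads may want to enforce one while buffering. The sketch below assumes a hypothetical helper named read_capped and a caller-chosen max_bytes; it uses only the standard library and is not part of LiteParse.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reads at most `max_bytes` from `reader`,
/// returning an error if the source holds more than that.
pub fn read_capped<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Read one byte past the cap so an oversized source can be detected
    // without buffering the whole thing.
    reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "PDF exceeds size cap",
        ));
    }
    Ok(buf)
}
```

The returned buffer can then be passed to parse_input as PdfInput::Bytes, with memory use bounded by the cap plus one byte during the read.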

Can I parse PDFs directly from HTTP requests without saving to disk?

Yes. In Rust, read the request body into a Vec<u8> using libraries like reqwest or hyper, then pass that buffer to parse_input(PdfInput::Bytes(...)). In Python, pass response.content directly to parser.parse_bytes(). In Node.js, pass the response Buffer directly to parser.parse().

Does parsing from bytes produce different results than parsing from file paths?

No. Both input methods funnel into the identical processing pipeline implemented in crates/liteparse/src/parser.rs. The parse method (for paths) is simply a convenience wrapper that constructs PdfInput::Path and calls parse_input. Text extraction accuracy, OCR behavior, and layout projection remain identical regardless of the input source.
