How to Parse PDFs from In-Memory Bytes in LiteParse: A Complete Guide
To parse PDFs from memory in LiteParse, pass a Vec<u8> buffer to the parse_input method using the PdfInput::Bytes variant in Rust, or use language-specific helpers like parse() with Buffer/Uint8Array in Node.js and parse_bytes() in Python.
LiteParse, the high-performance PDF parsing library from the run-llama/liteparse repository, abstracts file inputs through a unified PdfInput type. While the high-level API accepts file paths, the library provides a direct path for handling PDF data already loaded in memory. This capability is essential for applications processing uploads from HTTP requests, reading from object storage, or working with encrypted document streams.
Understanding the PdfInput Abstraction
The core of LiteParse's input handling resides in crates/liteparse/src/types.rs, which defines the PdfInput enum. This abstraction allows the parser to treat file paths and memory buffers interchangeably:
pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}
When you call the convenience method LiteParse::parse(&str), the library internally constructs a PdfInput::Path. To process data already in memory, you must bypass this helper and interact with the lower-level parse_input method directly.
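To make the two variants concrete, here is a minimal stand-alone sketch. It declares a local mirror of the enum so it compiles without the liteparse crate; `input_from_path` and `describe` are hypothetical helpers invented for illustration, not LiteParse APIs. `input_from_path` only imitates what the article says `parse(&str)` does internally, namely wrapping the string in `PdfInput::Path`.

```rust
// Local mirror of the enum shown above, so this sketch compiles on its own.
// In real code you would import the library's type instead.
#[derive(Debug, Clone, PartialEq)]
pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}

/// Hypothetical helper: wraps a path string in the `Path` variant,
/// imitating what a path-based convenience method would construct.
pub fn input_from_path(path: &str) -> PdfInput {
    PdfInput::Path(path.to_string())
}

/// Hypothetical helper: a short label for logging which variant is in use.
pub fn describe(input: &PdfInput) -> String {
    match input {
        PdfInput::Path(p) => format!("path: {}", p),
        PdfInput::Bytes(b) => format!("{} bytes in memory", b.len()),
    }
}
```

Because both sources collapse into one enum, any code that accepts a `PdfInput` can handle disk and memory inputs with a single `match`.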
Parsing In-Memory PDFs in Rust
In Rust, parsing PDFs from memory requires calling the async parse_input method with the Bytes variant.
Using parse_input with PdfInput::Bytes
First, ensure your PDF data exists as a `Vec<u8>`. This can originate from file reads, network requests, or generated content. Then instantiate LiteParse with your configuration and await the parse_input call:
use liteparse::config::LiteParseConfig;
use liteparse::parser::{LiteParse, PdfInput};

#[tokio::main]
async fn main() -> Result<(), liteparse::error::LiteParseError> {
    // Example: Loading PDF into memory from disk
    // (In production, this could be from reqwest, s3, etc.)
    let pdf_bytes = std::fs::read("document.pdf")?;

    let parser = LiteParse::new(LiteParseConfig::default());

    // Pass the bytes directly to parse_input
    let result = parser
        .parse_input(PdfInput::Bytes(pdf_bytes))
        .await?;

    println!("Successfully parsed {} pages", result.pages.len());
    Ok(())
}
The parse_input method in crates/liteparse/src/parser.rs validates the input, initializes the PDFium backend, and executes the full processing pipeline—including OCR and layout analysis—without writing temporary files.
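When the bytes come from an untrusted source such as an upload, a cheap sanity check before handing them to the parser can reject obvious non-PDF payloads early. The following is a sketch of such a check, not part of LiteParse: `looks_like_pdf` is a hypothetical helper, and searching the first 1024 bytes for the `%PDF-` marker is a leniency assumption chosen here because some files carry a few leading bytes before the header. It does not replace the validation that parse_input performs.

```rust
/// Hypothetical pre-check (not a LiteParse API): returns true when the
/// `%PDF-` marker appears within the first 1024 bytes of the buffer.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    const MAGIC: &[u8] = b"%PDF-";
    // Only inspect the start of the buffer; the header never belongs deep
    // inside the file.
    let window = &bytes[..bytes.len().min(1024)];
    // `windows` yields nothing when the slice is shorter than the marker,
    // so empty or tiny inputs simply return false.
    window.windows(MAGIC.len()).any(|w| w == MAGIC)
}
```

A caller might run this on the buffer first and only build `PdfInput::Bytes` when it returns true.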
Language Bindings for In-Memory Parsing
LiteParse's language bindings automatically handle the conversion to PdfInput::Bytes, providing idiomatic interfaces for each runtime.
Node.js and TypeScript
The Node.js binding defined in packages/node/src/lib.ts accepts Buffer or Uint8Array objects through the standard parse method. The TypeScript type definition declares:
type LiteParseInput = string | Buffer | Uint8Array;
When you pass a Buffer, the binding automatically routes to the Bytes variant:
import { LiteParse } from "liteparse";
import { readFile } from "fs/promises";

async function parseBuffer() {
  const pdfBuffer = await readFile("document.pdf");
  const parser = new LiteParse();
  const result = await parser.parse(pdfBuffer); // Automatically uses Bytes variant
  console.log(`Parsed ${result.pages.length} pages`);
}
Python
The Python binding in packages/python/liteparse/parser.py exposes the dedicated parse_bytes() method for memory-based parsing. This method accepts Python bytes objects:
from liteparse import LiteParse
from pathlib import Path

def parse_from_memory(file_path: Path):
    # Read into bytes (could also be from urllib, boto3, etc.)
    pdf_bytes = file_path.read_bytes()
    parser = LiteParse()  # Uses default configuration
    result = parser.parse_bytes(pdf_bytes)  # Directly calls native parse_bytes
    print(f"Extracted {len(result.pages)} pages from memory buffer")
How the Byte Parsing Pipeline Works
Using PdfInput::Bytes triggers the same unified processing pipeline as file-based inputs. The implementation in crates/liteparse/src/parser.rs ensures:
- Zero-copy operations: The Vec<u8> buffer passes directly to PDFium without serialization overhead or temporary file creation.
- Consistent configuration: LiteParseConfig affects parsing identically regardless of input source, ensuring OCR settings, DPI targets, and extraction modes behave predictably.
- Async execution: The Rust parse_input method is async, and the Node.js parse() call is awaited in the same way, so PDFium initialization and text layout projection do not block the calling thread.
Summary
- Abstract input type: LiteParse uses the PdfInput enum in crates/liteparse/src/types.rs to handle both Path(String) and Bytes(Vec<u8>) variants.
- Rust implementation: Call parser.parse_input(PdfInput::Bytes(buffer)).await directly instead of the high-level parse(&str) method.
- Language convenience: Node.js accepts Buffer and Uint8Array through the standard parse() method, while Python provides parse_bytes() for bytes objects.
- Performance characteristics: In-memory parsing avoids disk I/O and temporary file creation, with PDFium reading directly from the provided buffer.
- Source locations: Key files include crates/liteparse/src/types.rs (enum definition), crates/liteparse/src/parser.rs (implementation), and language-specific wrappers in packages/node/src/lib.ts and packages/python/liteparse/parser.py.
Frequently Asked Questions
Does LiteParse support streaming PDF data?
No, LiteParse requires the complete PDF data to be available in memory as a contiguous Vec<u8> buffer before parsing begins. The current architecture in crates/liteparse/src/types.rs defines PdfInput::Bytes(Vec<u8>), which necessitates fully materialized data rather than streams or generators.
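If your data arrives in pieces (for example, network chunks), the practical consequence is that you assemble them into one buffer first. The sketch below shows one way to do that with only the standard library; `collect_chunks` is a hypothetical helper, not a LiteParse function, and the resulting `Vec<u8>` is what you would then wrap in `PdfInput::Bytes`.

```rust
/// Hypothetical helper: concatenates streamed chunks, in order, into one
/// contiguous buffer suitable for wrapping in `PdfInput::Bytes`.
pub fn collect_chunks<I, C>(chunks: I) -> Vec<u8>
where
    I: IntoIterator<Item = C>,
    C: AsRef<[u8]>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        // Append each chunk's bytes to the end of the growing buffer.
        buf.extend_from_slice(chunk.as_ref());
    }
    buf
}
```

The whole document ends up in memory at once, which is exactly the constraint described above.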
Is there a size limit for in-memory PDF parsing?
LiteParse imposes no explicit size limits within the core Rust library; the constraint depends on your system's available RAM and the memory limits of the PDFium rendering engine. The Vec<u8> used in PdfInput::Bytes has its own ceiling, since a Rust allocation cannot exceed isize::MAX bytes, but on 64-bit systems that is far beyond physical memory, so RAM is the practical limit.
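Since the library itself sets no cap, an application that accepts untrusted input may want to enforce its own before the buffer is fully built. The following is a minimal sketch using only the standard library; `read_capped` and its error message are hypothetical and not part of LiteParse.

```rust
use std::io::{self, Read};

/// Hypothetical guard: reads `reader` to the end, but fails if the source
/// holds more than `max_bytes`.
pub fn read_capped<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the cap so an oversized source can be detected
    // without reading all of it.
    reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "PDF exceeds size cap",
        ));
    }
    Ok(buf)
}
```

Any type implementing `Read` works as the source, such as a `File` or an in-memory `Cursor`; on success the returned buffer can be passed on as `PdfInput::Bytes`.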
Can I parse PDFs directly from HTTP requests without saving to disk?
Yes. In Rust, read the request body into a Vec<u8> using libraries like reqwest or hyper, then pass that buffer to parse_input(PdfInput::Bytes(...)). In Python, pass response.content directly to parser.parse_bytes(). In Node.js, pass the response Buffer directly to parser.parse().
Does parsing from bytes produce different results than parsing from file paths?
No. Both input methods funnel into the identical processing pipeline implemented in crates/liteparse/src/parser.rs. The parse method (for paths) is simply a convenience wrapper that constructs PdfInput::Path and calls parse_input. Text extraction accuracy, OCR behavior, and layout projection remain identical regardless of the input source.