# How to Parse PDFs from In-Memory Bytes in LiteParse: A Complete Guide

> Parse PDFs directly from in-memory bytes in LiteParse. Learn how to use `PdfInput::Bytes` in Rust or language-specific helpers for seamless byte parsing.

- Repository: [LlamaIndex/liteparse](https://github.com/run-llama/liteparse)
- Tags: how-to-guide
- Published: 2026-05-31

---

**To parse PDFs from memory in LiteParse, pass a `Vec<u8>` buffer to the `parse_input` method using the `PdfInput::Bytes` variant in Rust, or use language-specific helpers like `parse()` with `Buffer`/`Uint8Array` in Node.js and `parse_bytes()` in Python.**

LiteParse, the high-performance PDF parsing library from the `run-llama/liteparse` repository, abstracts file inputs through a unified `PdfInput` type. While the high-level API accepts file paths, the library provides a direct path for handling PDF data already loaded in memory. This capability is essential for applications processing uploads from HTTP requests, reading from object storage, or working with encrypted document streams.

## Understanding the PdfInput Abstraction

The core of LiteParse's input handling resides in **[`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs)**, which defines the `PdfInput` enum. This abstraction allows the parser to treat file paths and memory buffers interchangeably:

```rust
pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}
```

When you call the convenience method `LiteParse::parse(&str)`, the library internally constructs a `PdfInput::Path`. To process data already in memory, you must bypass this helper and interact with the lower-level `parse_input` method directly.
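
The relationship between a path helper and a general input method can be sketched as follows. This is a minimal illustration and not LiteParse's implementation: the enum is a local mirror of the definition above so the block compiles on its own, and `resolve_input` and `resolve_path` are hypothetical names that only turn an input into raw bytes without parsing anything.

```rust
use std::fs;
use std::io;

/// Local mirror of the enum shown above, so this sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub enum PdfInput {
    /// Path to a PDF file on disk.
    Path(String),
    /// In-memory PDF data.
    Bytes(Vec<u8>),
}

/// Hypothetical stand-in for a general entry point: it resolves either
/// variant to raw bytes and performs no PDF parsing.
pub fn resolve_input(input: PdfInput) -> io::Result<Vec<u8>> {
    match input {
        // A path has to be read from disk first.
        PdfInput::Path(path) => fs::read(path),
        // Bytes are already in memory and are returned as-is (moved, not copied).
        PdfInput::Bytes(bytes) => Ok(bytes),
    }
}

/// Hypothetical path helper: wraps the path in `PdfInput::Path` and delegates,
/// in the same way the text describes `parse` delegating to `parse_input`.
pub fn resolve_path(path: &str) -> io::Result<Vec<u8>> {
    resolve_input(PdfInput::Path(path.to_string()))
}
```

Because both helpers end in the same `match`, everything after that point sees identical data whichever variant the caller chose.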

## Parsing In-Memory PDFs in Rust

In Rust, parsing PDFs from memory requires calling the async `parse_input` method with the `Bytes` variant.

### Using parse_input with PdfInput::Bytes

First, ensure your PDF data exists as a **`Vec<u8>`**. This can originate from file reads, network requests, or generated content. Then instantiate `LiteParse` with your configuration and await the `parse_input` call:

```rust
use liteparse::config::LiteParseConfig;
use liteparse::parser::{LiteParse, PdfInput};

#[tokio::main]
async fn main() -> Result<(), liteparse::error::LiteParseError> {
    // Example: Loading PDF into memory from disk
    // (In production, this could be from reqwest, s3, etc.)
    let pdf_bytes = std::fs::read("document.pdf")?;

    let parser = LiteParse::new(LiteParseConfig::default());
    
    // Pass the bytes directly to parse_input
    let result = parser
        .parse_input(PdfInput::Bytes(pdf_bytes))
        .await?;

    println!("Successfully parsed {} pages", result.pages.len());
    Ok(())
}
```

The `parse_input` method in **[`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs)** validates the input, initializes the PDFium backend, and executes the full processing pipeline—including OCR and layout analysis—without writing temporary files.
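
If you want to reject obviously wrong uploads before handing them to the parser, a cheap application-side check is possible. The sketch below is an assumption about what such a pre-flight check might look like, not a description of LiteParse's own validation: `PreflightError` and `preflight` are hypothetical names, and the check only tests that the buffer is non-empty and begins with the `%PDF-` header marker.

```rust
/// Hypothetical pre-flight error type; not part of LiteParse.
#[derive(Debug, PartialEq)]
pub enum PreflightError {
    /// The buffer contains no bytes at all.
    Empty,
    /// The buffer does not begin with the `%PDF-` marker.
    MissingHeader,
}

/// Checks that a buffer is non-empty and starts with the `%PDF-` marker.
/// This is a strict check: files with leading junk before the header,
/// which some readers tolerate, are rejected here.
pub fn preflight(bytes: &[u8]) -> Result<(), PreflightError> {
    if bytes.is_empty() {
        return Err(PreflightError::Empty);
    }
    if !bytes.starts_with(b"%PDF-") {
        return Err(PreflightError::MissingHeader);
    }
    Ok(())
}
```

A check like this catches HTML error pages or empty request bodies early, but it says nothing about whether the rest of the file is well-formed; that remains the parser's job.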

## Language Bindings for In-Memory Parsing

LiteParse's language bindings automatically handle the conversion to `PdfInput::Bytes`, providing idiomatic interfaces for each runtime.

### Node.js and TypeScript

The Node.js binding defined in **[`packages/node/src/lib.ts`](https://github.com/run-llama/liteparse/blob/main/packages/node/src/lib.ts)** accepts `Buffer` or `Uint8Array` objects through the standard `parse` method. The TypeScript type definition declares:

```typescript
type LiteParseInput = string | Buffer | Uint8Array;
```

When you pass a Buffer, the binding automatically routes to the `Bytes` variant:

```typescript
import { LiteParse } from "liteparse";
import { readFile } from "fs/promises";

async function parseBuffer() {
  const pdfBuffer = await readFile("document.pdf");
  
  const parser = new LiteParse();
  const result = await parser.parse(pdfBuffer); // Automatically uses Bytes variant
  
  console.log(`Parsed ${result.pages.length} pages`);
}
```

### Python

The Python binding in **[`packages/python/liteparse/parser.py`](https://github.com/run-llama/liteparse/blob/main/packages/python/liteparse/parser.py)** exposes the dedicated `parse_bytes()` method for memory-based parsing. This method accepts Python `bytes` objects:

```python
from liteparse import LiteParse
from pathlib import Path

def parse_from_memory(file_path: Path):
    # Read into bytes (could also be from urllib, boto3, etc.)
    pdf_bytes = file_path.read_bytes()

    parser = LiteParse()  # Uses default configuration
    result = parser.parse_bytes(pdf_bytes)  # Directly calls native parse_bytes

    print(f"Extracted {len(result.pages)} pages from memory buffer")
```

## How the Byte Parsing Pipeline Works

Using `PdfInput::Bytes` triggers the same **unified processing pipeline** as file-based inputs. The implementation in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) ensures:

- **Zero-copy operations**: The `Vec<u8>` buffer passes directly to PDFium without serialization overhead or temporary file creation.
- **Consistent configuration**: `LiteParseConfig` affects parsing identically regardless of input source, ensuring OCR settings, DPI targets, and extraction modes behave predictably.
- **Async execution**: `parse_input` is async in Rust, and the Node.js `parse()` call is awaited in the same way, so PDFium initialization and text layout projection do not block the caller's main thread. The Python example above calls `parse_bytes()` synchronously.
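
On the Rust side, wrapping a buffer in the `Bytes` variant is a move, not a copy. The sketch below demonstrates only that language-level fact, using a local mirror of the enum and two hypothetical helpers (`buffer_address`, `wrap_preserves_allocation`). It makes no claim about what happens to the buffer once it reaches PDFium or crosses a language binding.

```rust
/// Local mirror of the enum from the article, so this sketch compiles on its own.
#[derive(Debug)]
pub enum PdfInput {
    Path(String),
    Bytes(Vec<u8>),
}

/// Returns the address of the heap buffer held by `input`, if it holds one.
pub fn buffer_address(input: &PdfInput) -> Option<*const u8> {
    match input {
        PdfInput::Bytes(bytes) => Some(bytes.as_ptr()),
        PdfInput::Path(_) => None,
    }
}

/// Wraps a buffer in `PdfInput::Bytes` and reports whether the heap
/// allocation is still at the same address afterwards.
pub fn wrap_preserves_allocation(bytes: Vec<u8>) -> bool {
    let before = bytes.as_ptr();
    // Moving the Vec transfers only its pointer, length, and capacity.
    let input = PdfInput::Bytes(bytes);
    buffer_address(&input) == Some(before)
}
```

Passing `PdfInput::Bytes(buffer)` therefore costs the same regardless of document size; call `.clone()` first only if you still need the original buffer afterwards.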

## Summary

- **Abstract input type**: LiteParse uses the `PdfInput` enum in [`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs) to handle both `Path(String)` and `Bytes(Vec<u8>)` variants.
- **Rust implementation**: Call `parser.parse_input(PdfInput::Bytes(buffer)).await` directly instead of the high-level `parse(&str)` method.
- **Language convenience**: Node.js accepts `Buffer` and `Uint8Array` through the standard `parse()` method, while Python provides `parse_bytes()` for `bytes` objects.
- **Performance characteristics**: In-memory parsing avoids disk I/O and temporary file creation, with PDFium reading directly from the provided buffer.
- **Source locations**: Key files include [`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs) (enum definition), [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) (implementation), and language-specific wrappers in [`packages/node/src/lib.ts`](https://github.com/run-llama/liteparse/blob/main/packages/node/src/lib.ts) and [`packages/python/liteparse/parser.py`](https://github.com/run-llama/liteparse/blob/main/packages/python/liteparse/parser.py).

## Frequently Asked Questions

### Does LiteParse support streaming PDF data?

No, LiteParse requires the complete PDF data to be available in memory as a contiguous `Vec<u8>` buffer before parsing begins. The current architecture in [`crates/liteparse/src/types.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/types.rs) defines `PdfInput::Bytes(Vec<u8>)`, which necessitates fully materialized data rather than streams or generators.
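
In practice this means a stream has to be drained into one buffer before parsing. A minimal sketch using only the standard library follows; `buffer_stream` and `buffer_chunks` are hypothetical helper names, and the resulting `Vec<u8>` is what you would then wrap in `PdfInput::Bytes`.

```rust
use std::io::{self, Read};

/// Drains any `Read` source (file, socket, decompressor, ...) into one
/// contiguous buffer.
pub fn buffer_stream<R: Read>(mut source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    source.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Concatenates already-received chunks, in order, into one contiguous buffer.
pub fn buffer_chunks<I, C>(chunks: I) -> Vec<u8>
where
    I: IntoIterator<Item = C>,
    C: AsRef<[u8]>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk.as_ref());
    }
    buf
}
```

Both helpers hold the whole document in memory at once, so peak memory use is at least the size of the PDF.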

### Is there a size limit for in-memory PDF parsing?

LiteParse imposes no explicit size limits within the core Rust library; the practical constraints are your system's available RAM and the memory limits of the PDFium rendering engine. The `Vec<u8>` held by `PdfInput::Bytes` has its own theoretical ceiling, because Rust caps a single allocation at `isize::MAX` bytes, but on 64-bit systems that is far beyond the size of any real PDF.
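
Since the practical limit is memory, services that accept untrusted uploads may want their own cap before buffering. The sketch below is one way to do that with the standard library; `read_capped` is a hypothetical helper, and the cap value is entirely the application's choice rather than anything LiteParse defines.

```rust
use std::io::{self, Read};

/// Reads `source` fully into memory, failing if it holds more than
/// `max_bytes` bytes.
pub fn read_capped<R: Read>(source: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Read at most one byte past the cap so an oversized input is detectable
    // without buffering all of it.
    source
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "PDF exceeds configured size cap",
        ));
    }
    Ok(buf)
}
```

Using `take` bounds the allocation at `max_bytes + 1` bytes, so an oversized or malicious upload is rejected without being read in full.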

### Can I parse PDFs directly from HTTP requests without saving to disk?

Yes. In Rust, read the request body into a `Vec<u8>` using libraries like `reqwest` or `hyper`, then pass that buffer to `parse_input(PdfInput::Bytes(...))`. In Python, pass the raw response body (for example, `response.content` from the `requests` library) directly to `parser.parse_bytes()`. In Node.js, pass the response body as a `Buffer` or `Uint8Array` directly to `parser.parse()`.

### Does parsing from bytes produce different results than parsing from file paths?

No. Both input methods funnel into the identical processing pipeline implemented in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs). The `parse` method (for paths) is simply a convenience wrapper that constructs `PdfInput::Path` and calls `parse_input`. Text extraction accuracy, OCR behavior, and layout projection remain identical regardless of the input source.