# Memory Footprint of Parsing Large PDFs with LiteParse: A Technical Deep Dive

> Explore LiteParse's memory footprint for large PDFs. Understand RAM usage with and without OCR for efficient document processing with this technical deep dive.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: deep-dive
- Published: 2026-05-30

---

**LiteParse typically consumes roughly 600 MB of RAM for a 200 MB, 1,000‑page PDF without OCR, but usage can climb to 2–3 GB when OCR is enabled, because a bitmap buffer is held for each page.**

The [run-llama/liteparse](https://github.com/run-llama/liteparse) library is a Rust-based PDF parser that extracts structured text and layout data. Unlike streaming parsers, it loads the entire document into memory using the PDFium native library, which directly impacts how much RAM is required when processing large documents. Understanding these allocation patterns helps developers configure the parser to stay within their infrastructure limits.

## How LiteParse Allocates Memory During PDF Parsing

LiteParse processes documents through three distinct stages that run sequentially for every page. Each stage holds specific data structures in memory until the final `ParseResult` is assembled in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs).

### PDF Loading and Text Extraction

The process begins when `pdfium::Document::load` maps the entire file into memory. For each page, the `extract_page_text` function in [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) pulls all text items into a `Vec<TextItem>`. The `Parser::parse` method then collects these vectors from every page into a single `ParseResult` structure. This means the raw file size sits in memory via PDFium, plus every extracted text item accumulates in Rust-managed heap space.

### Spatial Layout Reconstruction

After extraction, `Projection::run` (called from `Parser::parse`) builds intermediate geometric structures in [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs). For every page, the engine constructs an **anchor grid** and **column structures** stored as `PageLayout` and `AnchorGrid` objects. These structures remain allocated until the final output serialization completes, adding overhead proportional to page complexity and count.

### Optional OCR Processing

When `ocr.enabled` is set to `true`, memory usage jumps significantly. The `Renderer::render_page` function in [`crates/liteparse/src/render.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/render.rs) converts each page into an image bitmap: approximately 8 MB uncompressed for a typical A4 page at 150 dpi, or 2–3 MB after PNG compression. These buffers are passed to the OCR engine (Tesseract or an HTTP service) via `OcrEngine::process`. Results are merged with native text items in [`crates/liteparse/src/ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), so the bitmap and the OCR text temporarily coexist in memory.
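The uncompressed figure can be checked with simple arithmetic. The sketch below assumes 4 bytes per pixel (RGBA); that pixel format is an assumption for illustration, not something confirmed from `render.rs`, and `bitmapBytes` is a hypothetical helper rather than part of LiteParse.

```typescript
// Back-of-envelope size of one uncompressed page bitmap.
// Assumption: 4 bytes per pixel (RGBA); the renderer's real pixel format may differ.
const MM_PER_INCH = 25.4;

function bitmapBytes(
  widthMm: number,
  heightMm: number,
  dpi: number,
  bytesPerPixel = 4
): number {
  const widthPx = Math.round((widthMm / MM_PER_INCH) * dpi);
  const heightPx = Math.round((heightMm / MM_PER_INCH) * dpi);
  return widthPx * heightPx * bytesPerPixel;
}

// A4 is 210 mm x 297 mm, which is 1240 x 1754 pixels at 150 dpi.
const a4At150 = bitmapBytes(210, 297, 150);
console.log(`${(a4At150 / 1e6).toFixed(1)} MB`); // prints "8.7 MB"
```

Because the size grows with the square of the resolution, rendering at 300 dpi would take about four times as much memory per page.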

## Memory Usage Profiles for Large PDFs

Baseline memory consumption scales roughly **O(number of pages × average text‑items per page)**, but specific configurations create vastly different footprints.

- **Baseline (no OCR)**: A 200 MB PDF with 1,000 pages typically stays under **600 MB** of RAM on modern hardware. This covers the PDFium file mapping plus extracted text vectors and projection structures.

- **With OCR enabled**: The same document can demand **2–3 GB** of RAM. Each rendered page adds an image buffer, and with default settings, multiple pages may be processed in parallel.

- **Peak usage**: RAM usage is highest **after the last page renders but before OCR results merge**, because both bitmap buffers and native text structures coexist. Once [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs) completes its work, bitmap buffers are dropped and memory decreases.
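These figures can be combined into a rough estimator. The sketch below is a back-of-envelope model, not a measurement: `estimatePeakMb` is a hypothetical helper, the 0.4 MB per-page overhead is simply (600 MB - 200 MB) / 1,000 pages from the baseline figure above, and the OCR case assumes one compressed bitmap of about 2 MB stays resident per page.

```typescript
// Rough peak-memory model built from the figures quoted in this article.
// All constants are assumptions derived from those figures, not measurements.
interface EstimateInput {
  fileMb: number;      // size of the PDF on disk (held in memory by PDFium)
  pages: number;       // number of pages parsed
  perPageMb: number;   // text items plus layout structures per page
  ocrBitmapMb: number; // resident bitmap per page; 0 when OCR is disabled
}

function estimatePeakMb(input: EstimateInput): number {
  const perPage = input.perPageMb + input.ocrBitmapMb;
  return Math.round(input.fileMb + input.pages * perPage);
}

const noOcr = estimatePeakMb({ fileMb: 200, pages: 1000, perPageMb: 0.4, ocrBitmapMb: 0 });
const withOcr = estimatePeakMb({ fileMb: 200, pages: 1000, perPageMb: 0.4, ocrBitmapMb: 2 });

console.log(noOcr);   // 600
console.log(withOcr); // 2600
```

The model is linear in page count, so it is only useful for sizing; real documents with dense pages or large embedded images can deviate from it considerably.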

## Configuration Options to Reduce Memory Consumption

LiteParse exposes configuration flags in [`crates/liteparse/src/config.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/config.rs) that directly control allocation behavior:

- **`ocr.enabled = false`**: Eliminates all image rendering and OCR allocations, so memory stays at the no-OCR baseline (the file mapping plus extracted text and layout structures).

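- **`max_pages`** (`maxPages` in the TypeScript examples): Limits parsing to a specified number of pages. Setting `maxPages: 100` on a 500-page document cuts the per-page allocations (text items and layout structures) to roughly 20 percent of the full-document amount; the PDFium file mapping itself is unaffected.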

- **`ocr.concurrency`**: Controls parallel processing in `ocr::http_simple.rs` and `ocr::tesseract.rs`. Setting `concurrency: 2` ensures at most two page bitmaps exist simultaneously, reducing peak RAM during OCR phases.
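One way to choose these settings is to work backwards from a memory budget. The sketch below is illustrative only: `chooseOcrConfig` is a hypothetical helper that is not part of LiteParse, the option names mirror the TypeScript examples in this article, and the per-bitmap cost is an input you would need to measure for your own documents.

```typescript
// Derive OCR settings from a memory budget (hypothetical helper, not part of LiteParse).
interface OcrConfig {
  enabled: boolean;
  concurrency?: number;
}

function chooseOcrConfig(
  budgetMb: number,   // RAM you are willing to spend on the parse
  baselineMb: number, // expected footprint without OCR
  bitmapMb: number,   // cost of one in-flight page bitmap
  maxConcurrency = 8  // arbitrary upper bound chosen for this sketch
): OcrConfig {
  const headroomMb = budgetMb - baselineMb;
  const affordable = Math.floor(headroomMb / bitmapMb);
  if (affordable < 1) {
    // Not enough headroom for even one bitmap: fall back to native text only.
    return { enabled: false };
  }
  return { enabled: true, concurrency: Math.min(affordable, maxConcurrency) };
}

console.log(chooseOcrConfig(620, 600, 8)); // { enabled: true, concurrency: 2 }
console.log(chooseOcrConfig(600, 600, 8)); // { enabled: false }
```

The returned object could then be passed as the `ocr` option, for example `new LiteParse({ ocr: chooseOcrConfig(620, 600, 8) })`.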

## Why LiteParse Cannot Stream PDFs Incrementally

The library relies on PDFium, which **expects the whole document to be loaded** into its native memory arena. This architectural constraint in `pdfium::Document::load` prevents LiteParse from streaming pages incrementally without first loading the entire file. All per-page data (text items, layout grids, and OCR results) is retained until the final JSON or text output is assembled in `Parser::parse`.

## Code Examples for Memory-Conscious Parsing

### Basic Usage with Minimal Memory (OCR Disabled)

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({ ocr: { enabled: false } });
  const result = await parser.parse("large-document.pdf");
  console.log(`Parsed ${result.pages.length} pages`);
})();
```

*Memory impact*: Only native text items are stored; a 500-page PDF typically fits under 300 MB.

### Parsing a Subset to Limit RAM

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({
    maxPages: 100,          // stops after 100 pages
    ocr: { enabled: false }
  });
  const result = await parser.parse("huge-document.pdf");
  console.log(`Processed ${result.pages.length} pages`);
})();
```

*Memory impact*: For a 500-page document, per-page data drops to roughly 20 percent of the full-document amount because only 100 pages are held in memory; the file mapping is unchanged.

### Explicitly Disabling OCR Bitmap Buffers

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({
    ocr: { enabled: false } // skips rendering entirely
  });
  const result = await parser.parse("scanned.pdf");
  console.log(`Extracted ${result.textItems.length} native text items`);
})();
```

*Memory impact*: No large image buffers; usage stays proportional to extracted text volume.

### Limiting OCR Concurrency for Large Documents

```typescript
import { LiteParse } from "liteparse";

(async () => {
  const parser = new LiteParse({
    ocr: {
      enabled: true,
      concurrency: 2    // at most two pages rendered and OCR'd simultaneously
    }
  });
  const result = await parser.parse("scanned-big.pdf");
  console.log(`Completed OCR for ${result.pages.length} pages`);
})();
```

*Memory impact*: Peak RAM is capped because at most two page bitmaps (~4–6 MB compressed) exist at once.

## Summary

- LiteParse requires the entire PDF file in memory due to PDFium constraints, creating a baseline overhead equal to the file size plus extraction buffers.
- Without OCR, expect roughly **600 MB for a 200 MB, 1,000-page document**.
- With OCR enabled, image bitmaps can drive usage to **2–3 GB** for the same document size.
- **Disable OCR** or use **`max_pages`** to cap how many pages are parsed and held in memory.
- Reduce **`ocr.concurrency`** to limit simultaneous bitmap allocations when OCR is necessary.
- Peak memory occurs just before OCR results are merged in [`ocr_merge.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/ocr_merge.rs), when bitmaps and text structures coexist.

## Frequently Asked Questions

### How much RAM do I need to parse a 1,000-page PDF with LiteParse?

For a 200 MB, 1,000-page document without OCR, allocate approximately **600 MB** of RAM. If OCR is enabled, plan for **2–3 GB** due to image bitmap buffers. Machines with less than 4 GB of RAM should disable OCR or limit the number of pages parsed with `max_pages`.

### Can LiteParse process PDFs larger than my available system memory?

No. Because `pdfium::Document::load` in [`crates/liteparse/src/extract.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/extract.rs) maps the entire file into memory, the PDF file size plus extraction overhead must fit within available RAM. For documents exceeding physical memory, split the PDF externally before processing. Setting `max_pages` reduces the per-page allocations but does not avoid loading the file itself.
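A preflight check along these lines can catch oversized inputs before parsing starts. This is a sketch under assumptions: `fitsInMemory` is a hypothetical helper, and the default overhead factor of 3 is just the 600 MB / 200 MB ratio from the no-OCR figure in this article.

```typescript
// Preflight check: will the file plus extraction overhead fit in available RAM?
// The overhead factor is an assumption taken from this article's no-OCR example.
function fitsInMemory(
  fileBytes: number,
  availableBytes: number,
  overheadFactor = 3
): boolean {
  return fileBytes * overheadFactor <= availableBytes;
}

const MB = 1024 * 1024;
console.log(fitsInMemory(200 * MB, 1024 * MB)); // true: ~600 MB needed, 1 GB available
console.log(fitsInMemory(200 * MB, 512 * MB));  // false: ~600 MB needed, 512 MB available
```

In Node.js, the inputs could come from `fs.statSync(path).size` and `os.freemem()`. With OCR enabled, a larger overhead factor would be needed.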

### Why does LiteParse use more memory than command-line tools like pdftotext?

LiteParse constructs rich spatial structures like `AnchorGrid` and `PageLayout` in [`crates/liteparse/src/projection.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/projection.rs) to preserve document layout, whereas simple text extractors discard positional metadata. Additionally, the PDFium dependency keeps the full file mapped in memory, whereas streaming parsers can process page-by-page without loading the entire binary.

### Does LiteParse release memory after processing each page?

No. All page data accumulates in the `ParseResult` structure until `Parser::parse` returns. Text items, layout grids, and OCR results are retained in memory throughout the parsing lifecycle and only deallocated when the result object goes out of scope. Use `max_pages` to force earlier termination and reduce total accumulation.