Memory Footprint of Parsing Large PDFs with LiteParse: A Technical Deep Dive
LiteParse typically consumes 600 MB of RAM for a 200 MB, 1,000‑page PDF without OCR, but can escalate to 2–3 GB when OCR is enabled due to bitmap buffers held for each page.
The run-llama/liteparse library is a Rust-based PDF parser that extracts structured text and layout data. Unlike streaming parsers, it loads the entire document into memory using the PDFium native library, which directly impacts how much RAM is required when processing large documents. Understanding these allocation patterns helps developers configure the parser to stay within their infrastructure limits.
How LiteParse Allocates Memory During PDF Parsing
LiteParse processes documents through three distinct stages that run sequentially for every page. Each stage holds specific data structures in memory until the final ParseResult is assembled in crates/liteparse/src/parser.rs.
PDF Loading and Text Extraction
The process begins when pdfium::Document::load maps the entire file into memory. For each page, the extract_page_text function in crates/liteparse/src/extract.rs pulls all text items into a Vec<TextItem>. The Parser::parse method then collects these vectors from every page into a single ParseResult structure. This means the raw file size sits in memory via PDFium, plus every extracted text item accumulates in Rust-managed heap space.
Spatial Layout Reconstruction
After extraction, Projection::run (called from Parser::parse) builds intermediate geometric structures in crates/liteparse/src/projection.rs. For every page, the engine constructs an anchor grid and column structures stored as PageLayout and AnchorGrid objects. These structures remain allocated until the final output serialization completes, adding overhead proportional to page complexity and count.
Optional OCR Processing
When ocr.enabled is set to true, memory usage jumps significantly. The Renderer::render_page function in crates/liteparse/src/render.rs converts each page into an image bitmap—approximately 8 MB for a typical A4 page at 150 dpi before compression, or 2–3 MB after PNG compression. These buffers are passed to the OCR engine (Tesseract or HTTP service) via OcrEngine::process. Results are stored alongside native text items in crates/liteparse/src/ocr_merge.rs, meaning both the bitmap and OCR text coexist temporarily in memory.
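The 8 MB figure above can be checked with simple arithmetic. The sketch below assumes 4 bytes per pixel (RGBA); the pixel format is an assumption, since the article only states the approximate size. `rawBitmapBytes` is a hypothetical helper and not part of LiteParse.

```typescript
// Rough size of an uncompressed page bitmap.
// Assumption: 4 bytes per pixel (RGBA). The renderer's real pixel format may differ.
const MM_PER_INCH = 25.4;

function rawBitmapBytes(
  widthMm: number,
  heightMm: number,
  dpi: number,
  bytesPerPixel: number = 4
): number {
  // Convert the physical page size to whole pixels at the requested resolution.
  const widthPx = Math.round((widthMm / MM_PER_INCH) * dpi);
  const heightPx = Math.round((heightMm / MM_PER_INCH) * dpi);
  return widthPx * heightPx * bytesPerPixel;
}

// A4 is 210 mm x 297 mm, which is 1240 x 1754 pixels at 150 dpi.
const a4At150 = rawBitmapBytes(210, 297, 150);
console.log(`${(a4At150 / 1e6).toFixed(1)} MB`); // prints "8.7 MB"
```

Because pixel count grows with the square of the resolution, the same page at 300 dpi needs about four times as much memory under these assumptions.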
Memory Usage Profiles for Large PDFs
Baseline memory consumption scales roughly O(number of pages × average text‑items per page), but specific configurations create vastly different footprints.
- Baseline (no OCR): A 200 MB PDF with 1,000 pages typically stays under 600 MB of RAM on modern hardware. This covers the PDFium file mapping plus extracted text vectors and projection structures.
- With OCR enabled: The same document can demand 2–3 GB of RAM. Each rendered page adds an image buffer, and with default settings, multiple pages may be processed in parallel.
- Peak usage: The highest RAM point occurs after the last page renders but before OCR results merge, because both bitmap buffers and native text structures coexist. Once ocr_merge.rs completes its work, bitmap buffers are dropped and memory decreases.
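Since the figures above are typical values rather than guarantees, it can help to extrapolate from your own measurements. The sketch below fits a straight line through two measured runs, assuming memory grows roughly linearly with page count as the scaling estimate suggests. The sample numbers are placeholders, not measurements of LiteParse, and `fitLinear` and `estimateMb` are hypothetical helpers.

```typescript
// One measured run: pages parsed and the peak resident memory observed, in MB.
interface Sample {
  pages: number;
  peakMb: number;
}

interface LinearModel {
  fixedMb: number;   // cost that does not depend on page count (e.g. the loaded file)
  perPageMb: number; // incremental cost of each additional page
}

// Two-point linear fit. Assumes linear scaling, which is only a rough approximation.
function fitLinear(a: Sample, b: Sample): LinearModel {
  if (a.pages === b.pages) {
    throw new RangeError("samples must use different page counts");
  }
  const perPageMb = (b.peakMb - a.peakMb) / (b.pages - a.pages);
  const fixedMb = a.peakMb - perPageMb * a.pages;
  return { fixedMb, perPageMb };
}

function estimateMb(model: LinearModel, pages: number): number {
  return model.fixedMb + model.perPageMb * pages;
}

// Placeholder inputs: replace with peaks measured on your own documents.
const model = fitLinear({ pages: 100, peakMb: 250 }, { pages: 300, peakMb: 350 });
console.log(`Estimated peak for 1,000 pages: ${estimateMb(model, 1000)} MB`);
```

Two short runs at different page limits are enough to feed the fit; documents with unusually dense pages will deviate from a straight line.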
Configuration Options to Reduce Memory Consumption
LiteParse exposes configuration flags in crates/liteparse/src/config.rs that directly control allocation behavior:
- ocr.enabled = false: Eliminates all image rendering and OCR allocations, keeping memory near the size of extracted text only.
- max_pages: Limits parsing to a specified subset. Setting maxPages: 100 cuts memory use to roughly 20 percent of the full-document footprint.
- ocr.concurrency: Controls parallel processing in ocr::http_simple.rs and ocr::tesseract.rs. Setting concurrency: 2 ensures at most two page bitmaps exist simultaneously, reducing peak RAM during OCR phases.
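To illustrate why a concurrency limit caps peak memory, here is a generic bounded-concurrency runner. It is a sketch of the general technique only and makes no claim about how LiteParse schedules OCR internally; `runWithLimit` and `fakeOcrPage` are hypothetical names.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Generic illustration; not LiteParse's implementation.
async function runWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Demo: a stand-in for "render a page and OCR it" that tracks how many are live.
async function demo(): Promise<void> {
  let live = 0;
  let peak = 0;
  const fakeOcrPage = async (page: number): Promise<number> => {
    live++;
    peak = Math.max(peak, live);
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    live--;
    return page;
  };
  await runWithLimit([1, 2, 3, 4, 5, 6], 2, fakeOcrPage);
  console.log(`Peak pages in flight: ${peak}`);
}

demo();
```

With a limit of 2, only two simulated pages are ever live at once, so the bitmap budget is bounded by two buffers regardless of document length.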
Why LiteParse Cannot Stream PDFs Incrementally
The library relies on PDFium, which expects the whole document to be loaded into its native memory arena. This architectural constraint in pdfium::Document::load prevents LiteParse from streaming pages incrementally without first loading the entire file. All per-page data—including text items, layout grids, and optional OCR bitmaps—are retained until the final JSON or text output is assembled in Parser::parse.
Code Examples for Memory-Conscious Parsing
Basic Usage with Minimal Memory (OCR Disabled)
import { LiteParse } from "liteparse";
(async () => {
  const parser = new LiteParse({ ocr: { enabled: false } });
  const result = await parser.parse("large-document.pdf");
  console.log(`Parsed ${result.pages.length} pages`);
})();
Memory impact: Only native text items are stored; a 500-page PDF typically fits under 300 MB.
Parsing a Subset to Limit RAM
import { LiteParse } from "liteparse";
(async () => {
  const parser = new LiteParse({
    maxPages: 100, // stops after 100 pages
    ocr: { enabled: false }
  });
  const result = await parser.parse("huge-document.pdf");
  console.log(`Processed ${result.pages.length} pages`);
})();
Memory impact: Roughly 20 percent of the full-document footprint because only 100 pages are held in memory.
Explicitly Disabling OCR Bitmap Buffers
import { LiteParse } from "liteparse";
(async () => {
  const parser = new LiteParse({
    ocr: { enabled: false } // skips rendering entirely
  });
  const result = await parser.parse("scanned.pdf");
  console.log(`Extracted ${result.textItems.length} native text items`);
})();
Memory impact: No large image buffers; usage stays proportional to extracted text volume.
Limiting OCR Concurrency for Large Documents
import { LiteParse } from "liteparse";
(async () => {
  const parser = new LiteParse({
    ocr: {
      enabled: true,
      concurrency: 2 // only two pages rendered/OCR'd simultaneously
    }
  });
  const result = await parser.parse("scanned-big.pdf");
  console.log(`Completed OCR with reduced peak memory`);
})();
Memory impact: Peak RAM is capped because at most two page bitmaps (~4–6 MB compressed) exist at once.
Summary
- LiteParse requires the entire PDF file in memory due to PDFium constraints, creating a baseline overhead equal to the file size plus extraction buffers.
- Without OCR, expect roughly 600 MB for a 200 MB, 1,000-page document.
- With OCR enabled, image bitmaps can drive usage to 2–3 GB for the same document size.
- Disable OCR or use max_pages to restrict memory to specific page ranges.
- Reduce ocr.concurrency to limit simultaneous bitmap allocations when OCR is necessary.
- Peak memory occurs during the merge phase in ocr_merge.rs, when bitmaps and text structures coexist.
Frequently Asked Questions
How much RAM do I need to parse a 1,000-page PDF with LiteParse?
For a 1,000-page document without OCR, allocate approximately 600 MB of RAM. If OCR is enabled, plan for 2–3 GB due to image bitmap buffers. Machines with less than 4 GB of RAM should disable OCR or process the document in chunks using max_pages.
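The 4 GB rule of thumb above can be turned into a small pre-flight check. This is a minimal sketch: `planOcr` is a hypothetical helper, and the threshold simply mirrors the guidance in this answer rather than any limit enforced by LiteParse.

```typescript
const GIB = 1024 * 1024 * 1024;

interface OcrPlan {
  ocrEnabled: boolean;
  reason: string;
}

// Decides whether OCR is advisable for a machine with the given total RAM.
// The 4 GiB threshold mirrors the article's guidance; tune it for your workload.
function planOcr(totalRamBytes: number): OcrPlan {
  if (totalRamBytes < 4 * GIB) {
    return {
      ocrEnabled: false,
      reason: "under 4 GiB of RAM: disable OCR or process in chunks",
    };
  }
  return {
    ocrEnabled: true,
    reason: "4 GiB or more: OCR should fit for a 1,000-page document",
  };
}

console.log(planOcr(2 * GIB)); // OCR disabled
console.log(planOcr(8 * GIB)); // OCR enabled
```

In Node.js the input can come from `os.totalmem()`, which reports total system memory in bytes; the resulting `ocrEnabled` flag would then be passed into the parser's `ocr.enabled` option.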
Can LiteParse process PDFs larger than my available system memory?
No. Because pdfium::Document::load in crates/liteparse/src/extract.rs maps the entire file into memory, the sum of your PDF file size plus extraction overhead must fit within available RAM. For documents exceeding physical memory, split the PDF externally before processing or use max_pages to handle segments sequentially.
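When splitting a document externally, the first step is deciding the page ranges. The helper below is a hypothetical planning sketch that only computes ranges; the actual splitting would be done with a separate PDF tool, and the article does not state that LiteParse accepts a start page.

```typescript
// A 1-based, inclusive page range.
interface PageRange {
  first: number;
  last: number;
}

// Splits `totalPages` into consecutive ranges of at most `chunkSize` pages.
function planChunks(totalPages: number, chunkSize: number): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: PageRange[] = [];
  for (let first = 1; first <= totalPages; first += chunkSize) {
    // The final chunk may be shorter than chunkSize.
    ranges.push({ first, last: Math.min(first + chunkSize - 1, totalPages) });
  }
  return ranges;
}

for (const range of planChunks(250, 100)) {
  console.log(`pages ${range.first}-${range.last}`);
}
```

Each resulting file can then be parsed on its own, so only one chunk's file mapping and per-page structures are resident at a time.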
Why does LiteParse use more memory than command-line tools like pdftotext?
LiteParse constructs rich spatial structures like AnchorGrid and PageLayout in crates/liteparse/src/projection.rs to preserve document layout, whereas simple text extractors discard positional metadata. Additionally, the PDFium dependency keeps the full file mapped in memory, whereas streaming parsers can process page-by-page without loading the entire binary.
Does LiteParse release memory after processing each page?
No. All page data accumulates in the ParseResult structure until Parser::parse returns. Text items, layout grids, and OCR results are retained in memory throughout the parsing lifecycle and only deallocated when the result object goes out of scope. Use max_pages to force earlier termination and reduce total accumulation.