How to Configure LiteParse to Parse Specific Page Ranges Efficiently

LiteParse restricts document processing to only the pages you specify through the target_pages configuration option, which accepts compact range strings like "1-5,10,15-20" and minimizes resource usage by loading only requested pages from the PDFium document backend.

When working with large PDF documents, parsing every page wastes significant CPU cycles and memory. LiteParse provides a granular page-filtering mechanism that restricts parsing to the page ranges you specify and skips all other content. This feature is implemented consistently across the CLI, Node.js, Python, and WASM bindings through a unified configuration structure.

Understanding the target_pages Configuration

The page-range feature centers on the target_pages field within the LiteParseConfig struct. According to the source code in crates/liteparse/src/config.rs (lines 14‑18), this field stores an optional string that describes which pages to process, alongside a max_pages safety cap that prevents accidental processing of pathologically large ranges.

The configuration accepts a compact string notation where comma-separated values define individual pages or hyphenated ranges. For example, "1-3,7,10-12" expands to pages 1, 2, 3, 7, 10, 11, and 12. The parser validates, sorts, and deduplicates these numbers before any document I/O occurs.
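
The expansion rule described above can be illustrated with a short Python sketch. This is not LiteParse's Rust implementation: the function name expand_page_ranges is hypothetical, and the choices to reject reversed ranges such as "5-3" and page numbers below 1 are assumptions made for the example.

```python
def expand_page_ranges(spec: str) -> list[int]:
    """Expand a string such as "1-3,7,10-12" into a sorted, unique page list.

    Illustrative only: error handling here is an assumption, not LiteParse's.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec {spec!r}")
        if "-" in part:
            # Hyphenated range: both bounds are inclusive.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"reversed range {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    # The set removes duplicates; sorting gives ascending order.
    return sorted(pages)


print(expand_page_ranges("1-3,7,10-12"))  # [1, 2, 3, 7, 10, 11, 12]
```

Overlapping entries collapse naturally: in this sketch "1-3,2,3-4" yields pages 1 through 4 once each.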

The Page Range Parsing Pipeline

The implementation follows a four-stage pipeline that ensures efficiency at every step:

  1. Configuration Ingestion – The CLI argument --target-pages (defined in crates/liteparse/src/main.rs at lines 65‑71) or the corresponding constructor parameter in language bindings populates LiteParseConfig.target_pages verbatim.

  2. Range Expansion – When LiteParse::parse executes (lines 91‑100 in crates/liteparse/src/parser.rs), it calls parse_target_pages (lines 66‑96 in crates/liteparse/src/config.rs). This function splits the string on commas, expands hyphenated ranges into individual u32 values, trims whitespace, validates numeric conversion, then sorts and deduplicates the result into a Vec<u32>.

  3. Selective Extraction – The validated page list passes as Option<&[u32]> to the extraction layer in crates/liteparse/src/extract.rs. This module requests only the specified pages from the PDFium document handle, drastically reducing disk I/O and memory mapping operations.

  4. Conditional OCR – Because the OCR pipeline runs after page selection, optical character recognition processes only the requested pages. This prevents wasted CPU cycles on irrelevant content.
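
Stages 3 and 4 can be pictured with a small Python sketch. Everything in it is hypothetical: process_pages, load_page, and run_ocr are stand-ins invented for illustration rather than LiteParse or PDFium APIs, and silently skipping page numbers beyond the end of the document is an assumption.

```python
from typing import Callable, Optional, Sequence


def process_pages(
    page_count: int,
    load_page: Callable[[int], str],
    run_ocr: Callable[[int, str], str],
    target_pages: Optional[Sequence[int]] = None,
    ocr_enabled: bool = True,
) -> dict[int, str]:
    """Load and optionally OCR only the requested pages (1-based)."""
    if target_pages is None:
        wanted = list(range(1, page_count + 1))  # no filter: every page
    else:
        # Assumption: out-of-range page numbers are skipped, not an error.
        wanted = [p for p in target_pages if 1 <= p <= page_count]

    results: dict[int, str] = {}
    for number in wanted:
        text = load_page(number)          # only requested pages are loaded
        if ocr_enabled:
            text = run_ocr(number, text)  # OCR sees only those pages too
        results[number] = text
    return results


# Track which pages the stand-in callbacks actually touch.
loaded: list[int] = []
ocred: list[int] = []


def load_page(number: int) -> str:
    loaded.append(number)
    return f"text of page {number}"


def run_ocr(number: int, text: str) -> str:
    ocred.append(number)
    return text


result = process_pages(20, load_page, run_ocr, target_pages=[1, 2, 3, 7])
print(loaded)  # [1, 2, 3, 7]
print(ocred)   # [1, 2, 3, 7]
```

Because filtering happens before the loop, the cost of both loading and OCR scales with the number of requested pages rather than the document length.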

Practical Code Examples

CLI Usage

Use the --target-pages flag with standard hyphen and comma notation:

liteparse parse report.pdf \
    --target-pages "1-3,7,10-12" \
    --max-pages 10 \
    --format json \
    --output selected_pages.json

The --max-pages 10 argument provides a hard ceiling that limits total processed pages even if the range string specifies more.

Node.js and TypeScript

The JavaScript bindings serialize the configuration object to the same Rust core:

import { LiteParse } from "liteparse";

const parser = new LiteParse({
  target_pages: "2-4,8",   // identical syntax to CLI
  max_pages: 5,            // optional safety guard
  ocr_enabled: false,      // disable OCR for faster text-only extraction
});

await parser.parse("report.pdf", { output: "out.json", format: "json" });

The constructor parameters map directly to LiteParseConfig fields in config.rs.

Python API

PyO3 marshals the configuration automatically:

from liteparse import LiteParse

parser = LiteParse(
    target_pages="5,9-11",   # string format matches Rust parser
    max_pages=7,
    ocr_enabled=False,
)

result = parser.parse("report.pdf", format="json")
with open("selected.json", "w") as f:
    f.write(result)

Screenshot Generation

The screenshot command reuses the same parse_target_pages logic (lines 42‑48 in main.rs) to render only specific pages:

liteparse screenshot report.pdf --target-pages "1,3,5" --output-dir pages/

Performance Optimization Strategies

To maximize efficiency when processing partial documents:

  • Combine target_pages with max_pages to guarantee an upper bound on computational work regardless of input string complexity.
  • Disable OCR using --no-ocr or ocr_enabled: false when you only need native PDF text, as OCR represents the most expensive operation per page.
  • Prefer contiguous ranges (e.g., "1-100" instead of "1,2,3,...,100") to minimize string parsing overhead, though the deduplication logic in parse_target_pages handles both formats correctly.
  • Validate ranges programmatically before passing them to the constructor to avoid the overhead of parsing invalid strings that will fail at the Rust boundary.
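
For the last point, a pre-flight check might look like the following sketch. The helper is_valid_page_spec is hypothetical and may be stricter or looser than LiteParse's own parser; in particular, rejecting page 0 and reversed ranges is an assumption.

```python
import re

# One entry is "N" or "N-M"; entries are comma-separated; spaces are allowed.
_PAGE_SPEC = re.compile(r"\s*\d+\s*(-\s*\d+\s*)?(,\s*\d+\s*(-\s*\d+\s*)?)*")


def is_valid_page_spec(spec: str) -> bool:
    """Return True if spec looks like a usable target_pages string."""
    if _PAGE_SPEC.fullmatch(spec) is None:
        return False
    for part in spec.split(","):
        bounds = [int(piece) for piece in part.split("-")]
        # Assumption: pages are 1-based and ranges must be ascending.
        if bounds[0] < 1 or bounds[-1] < bounds[0]:
            return False
    return True


print(is_valid_page_spec("1-3,7,10-12"))  # True
print(is_valid_page_spec("3-1"))          # False
print(is_valid_page_spec("1,,2"))         # False
```

Running a check like this on user input lets an application report a precise error message before the string ever reaches the parser.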

Summary

  • The target_pages option in LiteParseConfig accepts comma-separated page numbers and hyphenated ranges.
  • The parse_target_pages function (lines 66‑96 in config.rs) validates, sorts, and deduplicates the input into a Vec<u32>.
  • The extraction layer in extract.rs loads only the specified pages from the PDFium document, minimizing I/O.
  • OCR and rendering operations apply exclusively to the filtered page set, conserving CPU and memory.
  • The max_pages field provides a safety ceiling against accidental oversized range specifications.

Frequently Asked Questions

What string format does target_pages accept?

target_pages accepts a comma-separated list where each element is either a single page number or a hyphenated range (e.g., "1-5,10,15-20"). The parse_target_pages function in crates/liteparse/src/config.rs expands these ranges, trims whitespace, sorts the results in ascending order, and removes duplicates before processing.

Does OCR run on all pages or only the selected range?

OCR runs only on the selected range. The parsing pipeline resolves the target_pages list before invoking the OCR engine (if enabled), ensuring that computationally expensive text recognition occurs exclusively on the pages you requested, not the entire document.

How does max_pages interact with target_pages?

max_pages acts as a hard ceiling on the total number of pages processed, while target_pages specifies which pages to include. If your range string expands to 50 pages but max_pages is set to 10, LiteParse stops after processing 10 pages. This safety guard, defined in LiteParseConfig (lines 14‑15 in config.rs), prevents resource exhaustion from malicious or accidental oversized inputs.
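
A minimal Python sketch of that interaction follows. The function apply_max_pages is hypothetical, and it assumes the cap keeps the lowest-numbered pages of the sorted list; the source only states that processing stops once max_pages pages have been handled.

```python
from typing import Optional


def apply_max_pages(pages: list[int], max_pages: Optional[int]) -> list[int]:
    """Cap an already sorted, deduplicated page list at max_pages entries."""
    if max_pages is None:
        return pages  # no ceiling configured
    return pages[:max_pages]


requested = list(range(1, 51))            # a range string expanding to 50 pages
print(apply_max_pages(requested, 10))     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```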

Can I use page ranges with the screenshot command?

Yes. The screenshot command implements the same target_pages logic found in the parse command. When you pass --target-pages to liteparse screenshot, the tool only renders the specified pages, as implemented in the command definitions at lines 42‑48 of crates/liteparse/src/main.rs.
