How LiteParse Automatically Converts XLSX to PDF

LiteParse automatically converts XLSX files to PDF by detecting the spreadsheet extension, invoking a headless LibreOffice process via the convert_to_pdf function, and returning the generated PDF for parsing.

The run-llama/liteparse library handles spreadsheet documents by transparently converting them to PDF before text extraction. When you provide an XLSX file path to the parser, the Rust core triggers a conversion pipeline that runs a headless LibreOffice process with a temporary user profile. This process requires no manual intervention and is available through both the Node.js and Python bindings.

The XLSX-to-PDF Conversion Pipeline

Extension Detection and Routing

In crates/liteparse/src/conversion.rs, the system maintains the SPREADSHEET_EXTENSIONS constant at lines 16-18, which includes "xlsx" alongside other spreadsheet formats. When a file enters the system, the resolve_pdf_input function (lines 97-112) checks whether the input is already a PDF. For non-PDF files, it delegates immediately to convert_to_pdf, initiating the automatic conversion flow.
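The routing decision above can be modeled in a few lines of Python. This is an illustrative sketch, not the Rust source: the helper names are hypothetical, the extension set lists only the two formats the article confirms, and case-insensitive matching is an assumption.

```python
from pathlib import Path

# Assumption: only "xlsx" and "xls" are confirmed members of this group.
SPREADSHEET_EXTENSIONS = {"xlsx", "xls"}


def extension_of(path: str) -> str:
    """Return the lowercased extension without the leading dot."""
    return Path(path).suffix.lower().lstrip(".")


def needs_conversion(path: str) -> bool:
    """PDF inputs pass through untouched; every other input is converted first."""
    return extension_of(path) != "pdf"


def is_spreadsheet(path: str) -> bool:
    """True when the extension belongs to the spreadsheet group."""
    return extension_of(path) in SPREADSHEET_EXTENSIONS
```

In this model, "report.xlsx" needs conversion and is a spreadsheet, while "report.pdf" skips the conversion step entirely.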

Tool Selection via Extension Groups

Inside convert_to_pdf (lines 68-71), the file extension is matched against three distinct extension groups. Because XLSX belongs to SPREADSHEET_EXTENSIONS, the function selects ConversionTool::LibreOffice as the appropriate converter. This variant routes the document through the office document conversion pipeline specifically designed for spreadsheet formats.
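A rough Python model of this tool selection follows. Only SPREADSHEET_EXTENSIONS and ConversionTool::LibreOffice come from the article; the other two group names, their contents, and the OTHER variant are hypothetical placeholders, as is the choice to route the document group to LibreOffice.

```python
from enum import Enum, auto
from pathlib import Path
from typing import Optional


class ConversionTool(Enum):
    LIBRE_OFFICE = auto()  # mirrors ConversionTool::LibreOffice
    OTHER = auto()         # hypothetical stand-in for any other converter


# Only the spreadsheet group is named in the article.
SPREADSHEET_EXTENSIONS = ("xlsx", "xls")
DOCUMENT_EXTENSIONS = ("docx", "pptx")  # hypothetical group and contents
IMAGE_EXTENSIONS = ("png", "jpg")       # hypothetical group and contents


def select_tool(path: str) -> Optional[ConversionTool]:
    """Match the extension against each group and pick a converter."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in SPREADSHEET_EXTENSIONS or ext in DOCUMENT_EXTENSIONS:
        return ConversionTool.LIBRE_OFFICE
    if ext in IMAGE_EXTENSIONS:
        return ConversionTool.OTHER
    return None  # extension is not in any known group
```

Keeping the groups as separate constants means supporting a new spreadsheet format only requires adding its extension to one tuple.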

LibreOffice Discovery and Execution

Before conversion begins, find_libre_office_command (lines 60-70) searches the system for a LibreOffice executable, checking for libreoffice, soffice, or known platform-specific installation paths. Once located, convert_office_document (lines 31-46) constructs a headless command using a temporary user-profile directory, the --headless flag, --convert-to pdf, and the output directory. This sandboxed execution converts the XLSX to PDF without launching the GUI.
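The discovery and command-building steps can be approximated in Python as shown here. This is a sketch under assumptions, not the Rust implementation: it checks only the PATH (the platform-specific install paths are omitted), build_convert_command is a hypothetical helper name, and the use of LibreOffice's -env:UserInstallation and --outdir options to express the temporary profile and output directory is an assumption about how the real command is spelled.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Sequence


def find_libre_office_command(
    candidates: Sequence[str] = ("libreoffice", "soffice"),
) -> Optional[str]:
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


def build_convert_command(
    binary: str, input_path: str, out_dir: str, profile_dir: str
) -> List[str]:
    """Assemble a headless conversion command as an argument list."""
    # A private profile directory keeps the run isolated from the user's settings.
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        binary,
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        input_path,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True). Building it as a list rather than a shell string avoids quoting problems with file names that contain spaces.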

Result Handling and Cleanup

After LibreOffice writes the output, find_pdf_in_dir (lines 50-57) scans the temporary output folder to locate the generated .pdf file. Because LibreOffice may rename the file during conversion, the system scans the directory instead of assuming a fixed output name. The discovered PDF path is returned to the caller wrapped in a PdfInputGuard, which automatically cleans up the temporary directories when parsing completes.
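The scan-then-guard pattern can be illustrated in Python with a context manager, which is the closest analog to a Rust value that cleans up when dropped. The class and function names mirror the article, but the fields, the optional temp_dir, and the demo at the bottom are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def find_pdf_in_dir(directory: Path) -> Optional[Path]:
    """Return the first .pdf file in the directory, or None if there is none."""
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and entry.suffix.lower() == ".pdf":
            return entry
    return None


class PdfInputGuard:
    """Holds a PDF path and removes its temporary directory on exit."""

    def __init__(self, pdf_path: Path, temp_dir: Optional[Path] = None):
        self.pdf_path = pdf_path
        self.temp_dir = temp_dir  # None models an input that was already a PDF

    def __enter__(self) -> "PdfInputGuard":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self.temp_dir is not None:
            shutil.rmtree(self.temp_dir, ignore_errors=True)
        return False  # never swallow exceptions raised while parsing


# Demo: simulate a converter that wrote one PDF into a temporary directory.
temp_dir = Path(tempfile.mkdtemp())
(temp_dir / "report.pdf").write_bytes(b"%PDF-1.4\n")
with PdfInputGuard(find_pdf_in_dir(temp_dir), temp_dir) as guard:
    found_name = guard.pdf_path.name  # parsing would happen here
cleaned_up = not temp_dir.exists()
```

Tying cleanup to the guard's lifetime means the temporary files are removed even when parsing fails partway through.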

Implementation in Language Bindings

Both the Node.js and Python wrappers expose this functionality through simple APIs that accept XLSX paths directly.

For TypeScript or Node.js applications:

import { LiteParse } from "liteparse";

(async () => {
  // Input can be a path to an XLSX file
  const parser = new LiteParse({ input: "report.xlsx" });
  const result = await parser.parse();
  console.log(result.json()); // JSON output of the extracted text
})();

For Python applications:

from liteparse import LiteParse

parser = LiteParse("report.xlsx")
result = parser.parse()
print(result.to_json())

Under the hood, these bindings invoke the same Rust resolve_pdf_input → convert_to_pdf → LibreOffice flow described above, eliminating the need for manual conversion steps.

Key Source Files

The conversion logic spans several critical locations in the repository:

  • crates/liteparse/src/conversion.rs: Contains the central conversion logic, including extension tables (SPREADSHEET_EXTENSIONS), tool selection, LibreOffice discovery (find_libre_office_command), and the actual conversion functions (convert_office_document, convert_to_pdf).
  • crates/liteparse/src/parser.rs: Orchestrates input resolution through resolve_pdf_input and passes the resulting PDF to the main parser.
  • crates/liteparse/src/config.rs: Holds configuration options that can enable or disable conversion features.

Summary

  • LiteParse detects XLSX files using the SPREADSHEET_EXTENSIONS constant in conversion.rs (lines 16-18).
  • Non-PDF inputs trigger convert_to_pdf, which routes spreadsheets to the LibreOffice conversion tool.
  • The system automatically discovers the LibreOffice binary via find_libre_office_command (lines 60-70).
  • Headless conversion occurs through convert_office_document (lines 31-46) with sandboxed temporary directories.
  • Generated PDFs are located via directory scanning and wrapped in PdfInputGuard for automatic cleanup.

Frequently Asked Questions

What conversion tool does LiteParse use for XLSX files?

LiteParse uses LibreOffice in headless mode for XLSX conversion. When convert_to_pdf detects a spreadsheet extension, it selects ConversionTool::LibreOffice and invokes the binary discovered by find_libre_office_command with the --convert-to pdf flag.

How does LiteParse locate the LibreOffice executable?

The find_libre_office_command function in crates/liteparse/src/conversion.rs (lines 60-70) looks for an executable named libreoffice or soffice and also checks known platform-specific installation paths. It returns the first valid command string found on the system.

Can LiteParse convert other spreadsheet formats besides XLSX?

Yes. The SPREADSHEET_EXTENSIONS constant includes additional formats such as xls and other spreadsheet types. Any extension matching this group receives the same LibreOffice-based PDF conversion treatment as XLSX files.

Is the temporary PDF file cleaned up after parsing?

Yes. LiteParse wraps the conversion result in a PdfInputGuard that manages the temporary directories. When parsing completes and the guard goes out of scope, it automatically removes the temporary files created during the conversion.
