# How LiteParse Converts DOCX, XLSX, and PPTX Files: Format Conversion Pipeline Explained

> Discover how LiteParse converts DOCX, XLSX, and PPTX to PDF using LibreOffice and extracts text with its PDFium-based engine. Learn about the format conversion pipeline.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: internals
- Published: 2026-05-31

---

**LiteParse converts DOCX, XLSX, and PPTX files to PDF using LibreOffice in a headless sandboxed process, then extracts text using its PDFium-based engine.**

The run-llama/liteparse library handles office documents by transparently transforming them into PDFs before parsing. This format conversion pipeline lives in the Rust core and operates automatically across all language bindings. Understanding this flow helps diagnose conversion failures and optimize document processing workflows.

## Extension Detection and Classification

The pipeline begins in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs) by categorizing input files into three distinct groups.

**Word processing formats** (`OFFICE_EXTENSIONS`, lines 10‑13) include `doc`, `docx`, `docm`, `dot`, `dotm`, `dotx`, `odt`, `ott`, `rtf`, and `pages`.

**Presentation formats** (`PRESENTATION_EXTENSIONS`, lines 14‑15) cover `ppt`, `pptx`, `pptm`, `pot`, `potm`, `potx`, `odp`, `otp`, and `key`.

**Spreadsheet formats** (`SPREADSHEET_EXTENSIONS`, lines 16‑18) encompass `xls`, `xlsx`, `xlsm`, `xlsb`, `ods`, `ots`, `csv`, `tsv`, and `numbers`.

These constants enable the system to identify office documents before attempting any conversion.
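
The lookup can be illustrated with a minimal sketch. The three lists are copied from above and the constant names follow the selection snippet in the next section. The `OfficeKind` enum, the `classify` helper, and the lowercase normalization are assumptions made for illustration, not code taken from the repository.

```rust
use std::path::Path;

// Extension lists as given in this article.
const OFFICE_EXTENSIONS: &[&str] = &[
    "doc", "docx", "docm", "dot", "dotm", "dotx", "odt", "ott", "rtf", "pages",
];
const PRESENTATION_EXTENSIONS: &[&str] = &[
    "ppt", "pptx", "pptm", "pot", "potm", "potx", "odp", "otp", "key",
];
const SPREADSHEET_EXTENSIONS: &[&str] = &[
    "xls", "xlsx", "xlsm", "xlsb", "ods", "ots", "csv", "tsv", "numbers",
];

/// Hypothetical category type, used only in this illustration.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OfficeKind {
    WordProcessing,
    Presentation,
    Spreadsheet,
}

/// Lowercases the file extension (an assumption) and looks it up in the
/// three lists. Returns `None` when the path has no extension or the
/// extension is not listed.
pub fn classify(path: &Path) -> Option<OfficeKind> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let ext = ext.as_str();
    if OFFICE_EXTENSIONS.contains(&ext) {
        Some(OfficeKind::WordProcessing)
    } else if PRESENTATION_EXTENSIONS.contains(&ext) {
        Some(OfficeKind::Presentation)
    } else if SPREADSHEET_EXTENSIONS.contains(&ext) {
        Some(OfficeKind::Spreadsheet)
    } else {
        None
    }
}
```

Under these assumptions, `classify(Path::new("report.DOCX"))` yields `Some(OfficeKind::WordProcessing)`, while a `.pdf` path yields `None` and would skip conversion.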

## Conversion Tool Selection

When `resolve_pdf_input` receives a non-PDF path, the `convert_to_pdf` function evaluates the extension against the three office categories.

```rust
let tool = if OFFICE_EXTENSIONS.contains(&ext)
    || PRESENTATION_EXTENSIONS.contains(&ext)
    || SPREADSHEET_EXTENSIONS.contains(&ext) {
    ConversionTool::LibreOffice
} else if IMAGE_EXTENSIONS.contains(&ext) {
    ConversionTool::ImageMagick
} else { … }
```

This decision logic appears around lines 68‑71. If the extension matches any office constant, the pipeline selects **LibreOffice** as the backend converter.

## LibreOffice Execution Strategy

### Command Discovery

Before spawning any process, `find_libre_office_command` (lines 60‑92) locates the executable across platforms. It searches for `libreoffice` or `soffice` on the system `PATH`, then checks OS-specific locations such as `/Applications/LibreOffice.app/...` on macOS and `C:\Program Files\LibreOffice\program\soffice.exe` on Windows.
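
A discovery routine of this shape can be sketched with the standard library alone. The helper names `find_in_dirs` and `find_office_binary`, the search order, and the Windows binary name are assumptions for illustration; the actual `find_libre_office_command` may differ.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Returns the first `dir/name` that exists as a regular file, trying
/// every name in each directory before moving to the next directory.
pub fn find_in_dirs(names: &[&str], dirs: &[PathBuf]) -> Option<PathBuf> {
    for dir in dirs {
        for name in names {
            let candidate = dir.join(name);
            if candidate.is_file() {
                return Some(candidate);
            }
        }
    }
    None
}

/// Hypothetical discovery: search the directories listed in a PATH-style
/// value first, then fall back to caller-supplied absolute install paths.
pub fn find_office_binary(
    path_var: Option<OsString>,
    fallbacks: &[PathBuf],
) -> Option<PathBuf> {
    // Assumed binary names; Windows executables carry an `.exe` suffix.
    let names: &[&str] = if cfg!(windows) {
        &["soffice.exe"]
    } else {
        &["libreoffice", "soffice"]
    };
    let dirs: Vec<PathBuf> = path_var
        .map(|value| env::split_paths(&value).collect())
        .unwrap_or_default();
    find_in_dirs(names, &dirs)
        .or_else(|| fallbacks.iter().find(|p| p.is_file()).cloned())
}
```

A caller would typically pass `std::env::var_os("PATH")` as `path_var` and a short list of platform-specific install paths as `fallbacks`.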

### Sandbox Conversion Process

The `convert_office_document` function (lines 202‑226) executes LibreOffice with strict isolation:

- Creates a temporary user-profile directory to prevent profile lock contention
- Invokes LibreOffice with `--convert-to pdf` and directs output to a fresh temporary folder
- Waits for process completion before proceeding
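
The steps above can be sketched as follows. `--headless`, `--convert-to`, `--outdir`, and `-env:UserInstallation` are standard LibreOffice command-line options, but whether LiteParse passes exactly this set is an assumption, and the function names `conversion_args` and `run_conversion` are hypothetical.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Builds the argument list for a headless, profile-isolated conversion.
/// Assumes Unix-style absolute paths; a Windows profile path would need
/// to be turned into a proper `file:///C:/...` URL first.
pub fn conversion_args(input: &Path, profile_dir: &Path, out_dir: &Path) -> Vec<String> {
    vec![
        // A private profile directory avoids lock contention between
        // concurrent LibreOffice instances.
        format!("-env:UserInstallation=file://{}", profile_dir.display()),
        "--headless".to_string(),
        "--convert-to".to_string(),
        "pdf".to_string(),
        "--outdir".to_string(),
        out_dir.display().to_string(),
        input.display().to_string(),
    ]
}

/// Spawns the converter and blocks until it exits.
/// Returns `Ok(true)` when the process reports success.
pub fn run_conversion(
    binary: &Path,
    input: &Path,
    profile_dir: &Path,
    out_dir: &Path,
) -> io::Result<bool> {
    let status = Command::new(binary)
        .args(conversion_args(input, profile_dir, out_dir))
        .status()?;
    Ok(status.success())
}
```

Keeping argument construction separate from process spawning makes the command line easy to inspect in tests without LibreOffice installed.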

### PDF Retrieval

Because LibreOffice may rename output files during conversion, `find_pdf_in_dir` (lines 250‑274) scans the temporary output directory for the generated PDF instead of assuming a file name. Since that directory is created fresh for each conversion, the scan finds the right file regardless of how the source document was named.
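
A directory scan of this kind might look like the sketch below. The name `first_pdf_in` is hypothetical, and the case-insensitive extension match is an assumption; the real `find_pdf_in_dir` may apply different rules.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the first regular file in `dir` whose extension is `pdf`
/// (compared case-insensitively), or `None` if no such file exists.
pub fn first_pdf_in(dir: &Path) -> io::Result<Option<PathBuf>> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_pdf = path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| e.eq_ignore_ascii_case("pdf"));
        if is_pdf && path.is_file() {
            return Ok(Some(path));
        }
    }
    Ok(None)
}
```

Returning `Ok(None)` rather than an error lets the caller decide how to report a conversion that exited cleanly but produced no PDF.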

## Integration with the Parser Pipeline

The conversion step is invisible to end users. Both the Rust API and language bindings call `conversion::resolve_pdf_input` before text extraction.

In [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) (lines 84‑86), the `LiteParse::parse` method invokes the conversion routine, then immediately passes the resulting PDF path to the PDFium-based extractor. This design ensures DOCX, XLSX, and PPTX files undergo the same text extraction pipeline as native PDFs.

## Usage Examples

The following examples demonstrate automatic conversion across supported languages.

### Rust

```rust
use liteparse::{LiteParse, LiteParseConfig};

#[tokio::main]
async fn main() -> Result<(), liteparse::LiteParseError> {
    // Conversion requires LibreOffice on the host system
    let cfg = LiteParseConfig::default();
    let parser = LiteParse::new(cfg);

    // DOCX is converted to PDF automatically before parsing
    let result = parser.parse("report.docx").await?;
    println!("Extracted {} pages", result.pages.len());
    Ok(())
}
```

### Node.js / TypeScript

```typescript
import { LiteParse } from "liteparse";

async function processSpreadsheet() {
  const parser = new LiteParse({ ocrEnabled: false });
  
  // XLSX conversion happens transparently
  const res = await parser.parse("financials.xlsx");
  console.log(`Pages: ${res.pages.length}`);
  console.log(res.text);
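}

// Invoke the function and report any conversion or parsing failure
processSpreadsheet().catch(console.error);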
```

### Python

```python
from liteparse import LiteParse

parser = LiteParse()

# PPTX is converted to PDF behind the scenes
result = parser.parse("presentation.pptx")
print(f"Pages parsed: {len(result.pages)}")
```

## Summary

- **Extension-based routing**: The system classifies inputs as Word, Presentation, or Spreadsheet types in [`conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs) (lines 10‑18).
- **LibreOffice backend**: Office formats route through `convert_office_document` using headless LibreOffice execution.
- **Sandboxed execution**: Each conversion uses isolated temporary profiles and output directories to avoid conflicts.
- **Transparent integration**: The [`parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) module (lines 84‑86) calls `resolve_pdf_input`, making conversion invisible to API consumers.
- **Cross-platform support**: The pipeline discovers LibreOffice on Linux, macOS, and Windows systems automatically.

## Frequently Asked Questions

### Does LiteParse require LibreOffice to be installed?

Yes. The format conversion pipeline depends on a system-installed LibreOffice (or OpenOffice) binary. The `find_libre_office_command` function searches standard installation paths, but the software must be present for DOCX, XLSX, or PPTX processing to succeed.

### What happens if LibreOffice is not found on the system?

If the command discovery logic (lines 60‑92) fails to locate a suitable binary, the conversion routine returns an error before attempting to process the file. The error propagates through `resolve_pdf_input` and surfaces as a conversion failure in the respective language binding.

### Are temporary files cleaned up after conversion?

The implementation creates fresh temporary directories for both the LibreOffice user profile and the PDF output. This article does not establish when, or whether, those directories are removed, so production deployments should verify that they are cleaned up once `find_pdf_in_dir` completes and the PDF handle is released.
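
One common way to guarantee cleanup in Rust is a guard type that deletes its directory when dropped. The sketch below is illustrative only: `ScratchDir` is a hypothetical type, not something LiteParse is known to define, and real code would more likely rely on the `tempfile` crate's `TempDir`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical guard that owns a scratch directory and removes it,
/// with all contents, when the guard goes out of scope.
pub struct ScratchDir {
    path: PathBuf,
}

impl ScratchDir {
    /// Creates `<system temp>/<label>-<pid>-<nanos>`.
    pub fn new(label: &str) -> io::Result<Self> {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos())
            .unwrap_or(0);
        let name = format!("{}-{}-{}", label, std::process::id(), nanos);
        let path = std::env::temp_dir().join(name);
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    /// Location of the scratch directory on disk.
    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Best effort: ignore errors so cleanup never panics in a destructor.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

With this pattern the profile and output directories disappear as soon as their guards go out of scope, so the extracted PDF must be read or copied elsewhere before that point.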

### Does the conversion support legacy formats like .doc and .xls?

Yes. The extension lists in [`conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs) explicitly include legacy formats: `doc`, `xls`, and `ppt` appear alongside their modern Open XML counterparts. The LibreOffice backend handles both the old binary formats and the newer XML-based standards through the same `--convert-to pdf` command.