# How LiteParse Automatically Converts DOCX to PDF: A Technical Deep Dive

> Discover how LiteParse automatically converts DOCX to PDF using LibreOffice headless mode and a Rust extraction engine. Get a clear technical explanation.

- Repository: [run-llama/liteparse](https://github.com/run-llama/liteparse)
- Tags: deep-dive
- Published: 2026-05-30

---

**LiteParse automatically converts DOCX files to PDF by spawning LibreOffice in headless mode, writing a temporary PDF to an isolated directory, and passing that PDF to its Rust-based extraction engine. The caller never has to manage the intermediate file.**

The run-llama/liteparse library treats any non-PDF input as a candidate for conversion before text extraction occurs. When you supply a DOCX file path to the parser, LiteParse runs an automated pipeline that uses LibreOffice to generate a temporary PDF, so the downstream extraction logic always receives the same format regardless of the original file type.

## The Conversion Pipeline Architecture

LiteParse implements a **conversion-first pipeline** that intercepts non-PDF documents before they reach the extraction engine. Located in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs), this module defines the extension tables, tool detection logic, and LibreOffice wrappers that make automatic DOCX to PDF conversion possible. The `LiteParse::parse` method in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) (lines 84-87) calls `conversion::resolve_pdf_input`, guaranteeing that downstream rendering logic always receives a valid PDF path.

## Step-by-Step DOCX to PDF Conversion

### File Type Detection via `resolve_pdf_input`

When a file path is supplied, the `resolve_pdf_input` function (lines 97-109 in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs)) extracts the extension and checks whether it matches the *office* extensions defined in lines 10-12 of the same file. If the file is not already a PDF, the function immediately forwards the path to `convert_to_pdf` for processing.
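
The pass-through check described above can be sketched in a few lines. This is a Python illustration of the logic, not the Rust source; the helper names `extension_of` and `needs_conversion` are hypothetical and chosen only for this example.

```python
from pathlib import Path


def extension_of(path: str) -> str:
    """Return the lowercase file extension without the leading dot."""
    return Path(path).suffix.lower().lstrip(".")


def needs_conversion(path: str) -> bool:
    """A PDF passes straight through; anything else is routed to conversion."""
    return extension_of(path) != "pdf"
```

Normalizing the extension to lowercase before comparing means `Report.PDF` and `report.pdf` are treated the same way.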

### Tool Selection and LibreOffice Resolution

The `convert_to_pdf` function examines the file extension and selects the appropriate conversion tool. For DOCX and other Office-related extensions, it maps the input to `ConversionTool::LibreOffice` (lines 68-73). Before execution, the helper `find_libre_office_command` (lines 60-71) discovers a usable LibreOffice binary on the host system, searching for `libreoffice`, `soffice`, or known platform-specific installation paths.
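
A rough Python sketch of the two steps above, tool selection followed by binary lookup, might look like this. The extension set is only the subset named in this article, the function names are hypothetical, and the platform-specific fallback paths that the Rust helper reportedly checks are omitted.

```python
import shutil
from typing import Optional, Sequence

# Illustrative subset only; the full extension table lives in conversion.rs.
OFFICE_EXTENSIONS = {"docx", "pptx", "xlsx"}


def select_tool(extension: str) -> Optional[str]:
    """Map an extension to a tool name, or None if no tool handles it."""
    return "libreoffice" if extension.lower() in OFFICE_EXTENSIONS else None


def find_office_binary(
    candidates: Sequence[str] = ("libreoffice", "soffice"),
) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None
```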

### Headless Conversion Execution

The `convert_office_document` function (lines 200-211) invokes the resolved LibreOffice binary in **headless mode** to eliminate GUI dependencies. It creates a temporary user profile to avoid lock contention with existing LibreOffice instances and directs the PDF output to a freshly created temporary directory. This sandboxed approach ensures concurrent conversions do not interfere with each other.
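
To make the headless invocation concrete, here is a minimal Python sketch that builds and runs a comparable command. The flags shown (`--headless`, `--convert-to`, `--outdir`, and `-env:UserInstallation`) are standard LibreOffice options, but whether LiteParse passes exactly these arguments is an assumption, and both function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_headless_command(
    binary: str, source: str, out_dir: str, profile_dir: str
) -> List[str]:
    """Assemble a LibreOffice command line for a one-off PDF conversion."""
    # LibreOffice expects the profile location as a file:// URI.
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        binary,
        f"-env:UserInstallation={profile_uri}",  # isolated user profile
        "--headless",                            # no GUI
        "--convert-to", "pdf",
        "--outdir", out_dir,
        source,
    ]


def run_headless_conversion(binary: str, source: str, timeout: int = 120) -> str:
    """Convert `source` and return the directory that holds the output."""
    # Fresh directories per call, so parallel runs never share state.
    out_dir = tempfile.mkdtemp(prefix="convert-out-")
    profile_dir = tempfile.mkdtemp(prefix="convert-profile-")
    cmd = build_headless_command(binary, source, out_dir, profile_dir)
    subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
    return out_dir
```

Keeping command construction separate from execution makes the argument list easy to inspect or log without launching LibreOffice.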

### PDF Discovery and Parser Integration

Because LibreOffice may alter output filenames during conversion, `find_pdf_in_dir` (lines 50-57) scans the temporary output directory for the first `.pdf` file and returns its full path. This discovered path is then passed back through `resolve_pdf_input` to the `LiteParse::parse` method, which proceeds with its high-performance spatial text extraction on the generated PDF.
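
The directory scan is simple enough to sketch directly. The Python version below is an illustration rather than a port of `find_pdf_in_dir`; the name `first_pdf_in_dir` is hypothetical, and sorting the entries is a choice made here for deterministic results, not something the article attributes to LiteParse.

```python
from pathlib import Path
from typing import Optional


def first_pdf_in_dir(directory: str) -> Optional[str]:
    """Return the path of the first .pdf file in `directory`, or None."""
    # Sorted so the result is stable when several PDFs are present.
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and entry.suffix.lower() == ".pdf":
            return str(entry)
    return None
```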

## Usage Examples

### Node.js Integration

```typescript
// packages/node/src/lib.ts
import { LiteParse } from "liteparse";

(async () => {
  const lp = new LiteParse();               // Default configuration
  const result = await lp.parse("report.docx"); // DOCX auto-converted to PDF
  console.log(result.text);                  // Extracted plain text
})();
```

### Python Integration

```python
# packages/python/liteparse/parser.py
from liteparse import LiteParse

lp = LiteParse()
result = lp.parse("presentation.pptx")  # PPTX converted to PDF under the hood

print(result.text)
```

### Command Line Interface

```bash
# CLI entry point: crates/liteparse/src/main.rs
liteparse myfile.docx --output json > output.json
# Executes LibreOffice headless, then returns JSON representation of parsed PDF
```

## Error Handling and System Requirements

The conversion process requires LibreOffice (or a compatible `soffice` binary) to be installed on the host system. If the binary is not found, LiteParse returns a clear `LiteParseError::Conversion` error containing installation instructions (defined in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs) lines 8-13). This error propagation ensures callers receive actionable feedback rather than opaque process failures.
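
The fail-fast pattern can be illustrated with a short Python sketch. `ConversionError` here is a hypothetical stand-in for `LiteParseError::Conversion`, and the hint text is invented for the example; the actual message LiteParse emits is not reproduced in this article.

```python
import shutil
from typing import Optional


class ConversionError(Exception):
    """Hypothetical stand-in for LiteParseError::Conversion."""


# Example wording only; not the message LiteParse actually emits.
INSTALL_HINT = (
    "LibreOffice was not found. Install it, for example with "
    "'apt-get install libreoffice' on Debian/Ubuntu or "
    "'brew install --cask libreoffice' on macOS."
)


def require_binary(binary: Optional[str]) -> str:
    """Turn a failed lookup into an error that tells the caller what to do."""
    if binary is None:
        raise ConversionError(INSTALL_HINT)
    return binary


if __name__ == "__main__":
    try:
        path = require_binary(shutil.which("soffice"))
        print(f"Using LibreOffice at {path}")
    except ConversionError as exc:
        print(f"Conversion unavailable: {exc}")
```

Raising before any subprocess is spawned keeps the failure message about the missing dependency, instead of surfacing a lower-level "file not found" error from the process launcher.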

## Summary

- `resolve_pdf_input` in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs) (lines 97-109) detects non-PDF inputs and routes them to the conversion pipeline.
- Office extensions map to `ConversionTool::LibreOffice` (lines 68-73), triggering the LibreOffice backend.
- `find_libre_office_command` (lines 60-71) locates the system binary, supporting cross-platform deployment.
- `convert_office_document` (lines 200-211) executes headless conversion with isolated temporary profiles to prevent lock contention.
- `find_pdf_in_dir` (lines 50-57) discovers the generated PDF despite potential filename modifications by LibreOffice.
- The `LiteParse::parse` method in [`crates/liteparse/src/parser.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/parser.rs) (lines 84-87) orchestrates the entire flow, ensuring extraction logic always receives standardized PDF input.

## Frequently Asked Questions

### What dependencies are required for DOCX to PDF conversion?

LiteParse requires LibreOffice or a compatible `soffice` binary on the host system. If the binary is missing, the library returns a `LiteParseError::Conversion` whose message includes installation instructions, as defined in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs).

### How does LiteParse handle LibreOffice output filename variations?

Because LibreOffice may modify output filenames during the conversion process, LiteParse uses the `find_pdf_in_dir` function to scan the temporary output directory and return the first `.pdf` file found. This approach ensures reliable file discovery regardless of naming changes introduced by the office suite.

### Can LiteParse convert other Microsoft Office formats besides DOCX?

Yes. According to the extension table in [`crates/liteparse/src/conversion.rs`](https://github.com/run-llama/liteparse/blob/main/crates/liteparse/src/conversion.rs), LiteParse treats all office-related extensions, including `.pptx` and `.xlsx`, as candidates for LibreOffice conversion. It maps them to `ConversionTool::LibreOffice` and processes them through the same `convert_office_document` pipeline.

### Is the DOCX to PDF conversion process thread-safe?

Yes. Each conversion creates its own temporary user profile and output directory, so concurrent runs do not contend for the same LibreOffice profile lock or overwrite each other's output. Because `convert_office_document` sets up a fresh workspace on every call, conversions can run side by side in multi-threaded and async environments.