# How RAGFlow Parses Multi-Modal Information from PDFs and DOCX Files: A Deep Dive into the DeepDoc Parser

> Discover how RAGFlow's DeepDoc parser extracts text, tables, and figures from PDFs and DOCX files for advanced RAG pipelines. Learn about OCR, layout, and table detection for unified data representation.

- Repository: [InfiniFlow/ragflow](https://github.com/infiniflow/ragflow)
- Tags: deep-dive
- Published: 2026-02-23

---

**RAGFlow extracts text, tables, figures, and layout metadata from PDFs and DOCX files using specialized parsers (`RAGFlowPdfParser` and `RAGFlowDocxParser`) that orchestrate OCR, layout recognition, and table structure detection to produce a unified, position-aware representation for downstream RAG pipelines.**

The `infiniflow/ragflow` repository implements a sophisticated document understanding pipeline called **DeepDoc** that converts multi-modal documents into structured, retrievable chunks. Whether processing scanned PDFs requiring OCR or native DOCX files with embedded tables, RAGFlow parses multi-modal information through a sequence of specialized stages that preserve spatial and semantic relationships.

## Overview of the Multi-Modal Parsing Pipeline

RAGFlow processes PDF and DOCX files through dedicated parser classes that handle the unique characteristics of each format. The `RAGFlowPdfParser` class (defined at line 55 of [`deepdoc/parser/pdf_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py)) handles rasterized documents, scanned images, and complex layouts, while the `RAGFlowDocxParser` class (line 25 of [`deepdoc/parser/docx_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/docx_parser.py)) processes structured Office Open XML documents.

Both parsers output standardized structures: the PDF parser returns **boxes** with spatial coordinates and layout types, while the DOCX parser returns sections of text with style metadata and processed tables.

## Parsing PDF Documents with RAGFlowPdfParser

The PDF parsing pipeline in RAGFlow is a multi-stage process that transforms raw page images into structured, searchable content; the four stages below cover its main steps. The main entry point `parse_into_bboxes` (line 1547 of [`pdf_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py)) orchestrates this workflow.

### Stage 1: File Loading and OCR/Vision Processing

RAGFlow begins by opening the PDF with `pdfplumber.open`, from either a file path or a binary stream. For each page, the parser renders an image and processes it through one of two pathways:

- **Default OCR mode**: The `self.ocr` engine (implemented in [`deepdoc/vision/ocr.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/ocr.py)) performs text recognition on page images.
- **Vision mode**: The `VisionParser` subclass (line 1806) overrides image processing to send pages to a vision-LLM via `picture_vision_llm_chunk`, generating descriptive text for complex diagrams or images.

### Stage 2: Layout Analysis and Region Classification

The parser calls `self._layouts_rec(zoomin)` to analyze page structure. This method utilizes a `LayoutRecognizer` (or `AscendLayoutRecognizer` for Huawei Ascend hardware) to classify regions as **text**, **table**, **figure**, or other layout types. Each detected box receives spatial coordinates and a layout label.
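To make region classification concrete, here is a minimal sketch of how OCR boxes might be tagged with the label of the layout region that covers them most. The helper names (`overlap_area`, `assign_layout_types`), the `type` field, and the `min_ratio` threshold are illustrative assumptions, not RAGFlow's actual implementation.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as dicts with x0, x1, top, bottom."""
    w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    h = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return max(w, 0) * max(h, 0)


def assign_layout_types(text_boxes, regions, min_ratio=0.5):
    """Tag each text box with the type of the region covering most of it.

    Hypothetical helper: boxes covered by less than `min_ratio` of their
    own area fall back to the label "text".
    """
    tagged = []
    for box in text_boxes:
        area = (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])
        best = max(regions, key=lambda r: overlap_area(box, r), default=None)
        label = "text"
        if best is not None and area > 0:
            if overlap_area(box, best) / area >= min_ratio:
                label = best["type"]
        tagged.append({**box, "layout_type": label})
    return tagged


# One detected table region and two OCR boxes (made-up coordinates).
regions = [{"x0": 0, "x1": 100, "top": 0, "bottom": 50, "type": "table"}]
boxes = [
    {"x0": 10, "x1": 40, "top": 10, "bottom": 20, "text": "cell"},
    {"x0": 10, "x1": 40, "top": 80, "bottom": 90, "text": "body"},
]
tagged = assign_layout_types(boxes, regions)
```

The first box lies inside the table region and inherits its label; the second has no overlap and stays plain text.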

### Stage 3: Table Structure Recognition

For table regions, `_table_transformer_job` invokes the `TableStructureRecognizer` to detect cell boundaries and row/column relationships. The environment variable `TABLE_AUTO_ROTATE` (defaulting to `true`) enables automatic orientation correction for tables captured at angles.
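A boolean environment flag like this is typically read once and normalized. The sketch below shows one way to do that; the function name and the set of accepted truthy spellings are assumptions for illustration, and RAGFlow's own parsing of `TABLE_AUTO_ROTATE` may differ.

```python
import os


def table_auto_rotate_enabled(env=None):
    """Return True unless TABLE_AUTO_ROTATE is set to a falsy value.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    The accepted spellings below are an assumption, not RAGFlow's exact rule.
    """
    env = os.environ if env is None else env
    value = env.get("TABLE_AUTO_ROTATE", "true")
    return value.strip().lower() in {"1", "true", "yes", "on"}
```

With this shape, leaving the variable unset keeps rotation correction on, while `TABLE_AUTO_ROTATE=false` turns it off.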

### Stage 4: Text Merging and Position Tagging

Fragmented text boxes undergo consolidation through `_text_merge`, `_concat_downward`, and `_naive_vertical_merge`. These methods combine boxes belonging to the same logical line or paragraph. Each final box receives a position tag via `_line_tag`, encoding page number, coordinates, and layout type. The `crop()` method extracts image regions for figures and tables when needed.
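The merging idea can be illustrated with a simplified sketch: boxes on the same page with the same layout type are joined when the vertical gap between them is small. The function name `merge_vertically` and the `max_gap` threshold are hypothetical; the real `_naive_vertical_merge` applies more heuristics than shown here.

```python
def merge_vertically(boxes, max_gap=5.0):
    """Merge boxes that sit directly below one another on the same page.

    Simplified illustration: two boxes merge when they share a page and a
    layout type and the gap between them is at most `max_gap` units.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: (b["page_number"], b["top"])):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["page_number"] == box["page_number"]
            and prev["layout_type"] == box["layout_type"]
            and box["top"] - prev["bottom"] <= max_gap
        ):
            # Extend the previous box to cover the current one.
            prev["text"] += " " + box["text"]
            prev["x0"] = min(prev["x0"], box["x0"])
            prev["x1"] = max(prev["x1"], box["x1"])
            prev["bottom"] = max(prev["bottom"], box["bottom"])
        else:
            merged.append(dict(box))  # copy so the input list is not mutated
    return merged


fragments = [
    {"page_number": 1, "x0": 0, "x1": 50, "top": 10, "bottom": 20,
     "layout_type": "text", "text": "Hello"},
    {"page_number": 1, "x0": 0, "x1": 60, "top": 22, "bottom": 32,
     "layout_type": "text", "text": "world"},
    {"page_number": 1, "x0": 0, "x1": 40, "top": 100, "bottom": 110,
     "layout_type": "text", "text": "Next"},
]
paragraphs = merge_vertically(fragments)
```

The first two fragments are 2 units apart and collapse into one box; the third is far below and remains separate.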

The final output consists of **boxes** (containing `page_number`, `x0`, `x1`, `top`, `bottom`, `layout_type`, `text`, `image`, and `positions`) and a separate list of extracted **tables**.
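As a rough illustration of how downstream code might consume these boxes, the sketch below groups box text by page and skips figure boxes. It assumes each box is a dict with the `page_number`, `layout_type`, and `text` keys listed above; the helper `text_by_page` is hypothetical and not part of RAGFlow.

```python
from collections import defaultdict


def text_by_page(boxes, skip_types=("figure",)):
    """Join the text of all boxes on each page, ignoring skipped layout types."""
    pages = defaultdict(list)
    for box in boxes:
        if box.get("layout_type") in skip_types:
            continue
        pages[box["page_number"]].append(box["text"])
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


sample_boxes = [
    {"page_number": 1, "layout_type": "title", "text": "Report"},
    {"page_number": 1, "layout_type": "figure", "text": ""},
    {"page_number": 1, "layout_type": "text", "text": "First paragraph."},
    {"page_number": 2, "layout_type": "text", "text": "Second page."},
]
pages = text_by_page(sample_boxes)
```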

## Parsing DOCX Documents with RAGFlowDocxParser

DOCX processing leverages the structured nature of Office Open XML. The `RAGFlowDocxParser` uses `python-docx` to load documents and extract content without requiring OCR.

### Extracting Paragraphs and Styles

The `__call__` method of `RAGFlowDocxParser` (in [`docx_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/docx_parser.py)) iterates through document paragraphs, concatenating the runs within each paragraph using `"".join(runs_within_single_paragraph)`. It returns `secs`, a list of `(paragraph_text, style_name)` tuples that preserve document hierarchy and formatting metadata.
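The run-joining step can be sketched without opening a real file. The helper below works on any objects that expose `.runs` (each with `.text`) and `.style.name`, which is the shape python-docx paragraphs have; `collect_sections`, `fake_paragraph`, and the choice to skip empty paragraphs are assumptions made for illustration.

```python
from types import SimpleNamespace


def collect_sections(paragraphs):
    """Return (text, style_name) tuples, skipping paragraphs with no visible text."""
    secs = []
    for p in paragraphs:
        text = "".join(run.text for run in p.runs)
        if text.strip():
            secs.append((text, p.style.name))
    return secs


def fake_paragraph(style_name, *run_texts):
    """Build a stand-in object shaped like a python-docx paragraph (for the demo only)."""
    return SimpleNamespace(
        runs=[SimpleNamespace(text=t) for t in run_texts],
        style=SimpleNamespace(name=style_name),
    )


demo_paragraphs = [
    fake_paragraph("Heading 1", "Intro", "duction"),
    fake_paragraph("Normal", ""),
    fake_paragraph("Normal", "Body ", "text."),
]
secs = collect_sections(demo_paragraphs)
```

With a real document, the same function could be applied to `docx.Document(path).paragraphs`.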

### Table Detection and Content Composition

Tables are detected via the python-docx table API. The parser calls `__extract_table_content` to retrieve cell data, then processes it through `__compose_table_content` to merge headers and format rows. The output `tbls` contains processed table strings ready for chunking.
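One simple way to turn a table into chunk-friendly strings is to pair every cell with its column header. The sketch below treats the first row as the header; `compose_table_rows` and its output format are illustrative assumptions, and the real `__compose_table_content` handles more cases, such as multi-row headers.

```python
def compose_table_rows(rows, sep="; "):
    """Render each data row as 'header: value' pairs joined by `sep`.

    Assumes `rows` is a list of lists of strings and that the first row
    holds the column headers. Empty cells are dropped.
    """
    if not rows:
        return []
    header, *body = rows
    lines = []
    for row in body:
        pairs = [f"{h}: {c}" for h, c in zip(header, row) if c.strip()]
        if pairs:
            lines.append(sep.join(pairs))
    return lines


table = [
    ["Name", "Qty"],
    ["Bolt", "4"],
    ["Nut", ""],
]
tbls = compose_table_rows(table)
```

Repeating the header in every row keeps each string meaningful on its own, which matters once rows are split across chunks.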

## Vision-Enabled Parsing for Complex PDFs

For documents containing complex diagrams, screenshots, or handwritten content that OCR cannot adequately capture, RAGFlow provides `VisionParser`. This subclass overrides the standard image processing pipeline to utilize multimodal LLMs.

When `VisionParser` processes a PDF, it sends page images to a vision-LLM using `picture_vision_llm_chunk` (defined in [`rag/app/picture.py`](https://github.com/infiniflow/ragflow/blob/main/rag/app/picture.py)) with prompts generated by `vision_llm_describe_prompt` (from [`rag/prompts/generator.py`](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py)). The LLM returns descriptive text that replaces or supplements OCR output, enabling semantic understanding of visual content.
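The control flow described above amounts to a loop over page images with a prompt built per page. The sketch below captures that shape with injected callables so it runs without any model; `describe_pages`, `describe`, and `prompt_for_page` are hypothetical names, not the signatures of `picture_vision_llm_chunk` or `vision_llm_describe_prompt`.

```python
def describe_pages(page_images, describe, prompt_for_page):
    """Collect (page_number, description) pairs for pages that yield text.

    `describe(image, prompt)` stands in for a vision-LLM call and
    `prompt_for_page(page_number)` for a prompt generator; both are
    assumptions made for this sketch.
    """
    docs = []
    for page_no, image in enumerate(page_images, start=1):
        text = describe(image, prompt_for_page(page_no))
        if text and text.strip():
            docs.append((page_no, text.strip()))
    return docs


# Stub callables so the example runs offline.
def stub_describe(image, prompt):
    return "" if image is None else f"{prompt} -> {image}"


def stub_prompt(page_no):
    return f"Describe page {page_no}"


docs = describe_pages(["<img-a>", None, "<img-c>"], stub_describe, stub_prompt)
```

Pages for which the model returns nothing are simply skipped, so the page numbers in the result may have gaps.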

## Code Examples

### Parsing a PDF with Default OCR

```python
from deepdoc.parser.pdf_parser import RAGFlowPdfParser

parser = RAGFlowPdfParser()

def progress_callback(percentage, message):
    print(f"{percentage*100:.1f}% – {message}")

boxes, tables = parser.parse_into_bboxes(
    fnm="path/to/document.pdf",
    callback=progress_callback,
    zoomin=3,  # Page-rendering zoom factor; higher values give OCR more pixels to work with
)

# boxes contains: page_number, coordinates, layout_type, text, image, positions
# tables contains extracted table structures
```

### Parsing a PDF with Vision LLM

```python
from deepdoc.parser.pdf_parser import VisionParser

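# "gpt-4o-mini-vision" is a placeholder. Assumption: depending on the RAGFlow
# version, vision_model may need to be a configured model object rather than a name.
vision_parser = VisionParser(
    vision_model="gpt-4o-mini-vision",
)

docs, _ = vision_parser(
    filename="path/to/diagram.pdf",
    from_page=0,
    to_page=10,
    callback=lambda p, m: print(f"Progress: {p}, Message: {m}"),
)

# docs contains LLM-generated descriptions of visual content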
```

### Parsing a DOCX File

```python
from deepdoc.parser.docx_parser import RAGFlowDocxParser

parser = RAGFlowDocxParser()
paragraphs, tables = parser(
    fnm="path/to/document.docx",
    from_page=0,
    to_page=100
)

# paragraphs: [(text, style_name), ...]
# tables: list of formatted table strings
```

## Key Implementation Files

| File | Role | Direct Link |
|------|------|-------------|
| [`deepdoc/parser/pdf_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py) | Core PDF parsing logic, OCR, layout analysis, table recognition | [pdf_parser.py](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py) |
| [`deepdoc/parser/docx_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/docx_parser.py) | DOCX text and table extraction using python-docx | [docx_parser.py](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/docx_parser.py) |
| [`deepdoc/vision/ocr.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/ocr.py) | OCR engine interface for text extraction from images | [ocr.py](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/ocr.py) |
| [`deepdoc/vision/layout_recognizer.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/layout_recognizer.py) | Layout analysis models (ONNX and Ascend variants) | [layout_recognizer.py](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/layout_recognizer.py) |
| [`deepdoc/vision/table_structure_recognizer.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/table_structure_recognizer.py) | Table cell detection and auto-rotation logic | [table_structure_recognizer.py](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/table_structure_recognizer.py) |
| [`rag/app/picture.py`](https://github.com/infiniflow/ragflow/blob/main/rag/app/picture.py) | Vision-LLM wrapper for image description | [picture.py](https://github.com/infiniflow/ragflow/blob/main/rag/app/picture.py) |
| [`rag/prompts/generator.py`](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py) | Prompt templates for vision model interactions | [generator.py](https://github.com/infiniflow/ragflow/blob/main/rag/prompts/generator.py) |

## Summary

- **RAGFlow** processes multi-modal documents through specialized parsers that extract text, tables, figures, and layout metadata from both rasterized PDFs and structured DOCX files.
- The **`RAGFlowPdfParser`** orchestrates a multi-stage pipeline including OCR (`self.ocr`), layout recognition (`LayoutRecognizer`), table structure detection (`TableStructureRecognizer`), and text merging (`_text_merge`, `_concat_downward`) to produce position-tagged bounding boxes.
- The **`RAGFlowDocxParser`** leverages `python-docx` to extract styled paragraphs and tables without OCR, using `__extract_table_content` and `__compose_table_content` for table normalization.
- For complex visual content, **`VisionParser`** extends the PDF pipeline to utilize multimodal LLMs via `picture_vision_llm_chunk`, generating semantic descriptions of diagrams and images that traditional OCR cannot capture.

## Frequently Asked Questions

### How does RAGFlow handle scanned PDFs that contain only images?

RAGFlow processes scanned PDFs through the `RAGFlowPdfParser` which extracts page images and runs them through an OCR engine (`self.ocr` defined in [`deepdoc/vision/ocr.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/ocr.py)). The parser converts each page to an image, performs text recognition, and then proceeds with layout analysis and table detection on the OCR output. For documents where OCR is insufficient, the `VisionParser` class can send page images to a vision-LLM to generate descriptive text.

### What is the difference between the standard PDF parser and the VisionParser in RAGFlow?

The standard `RAGFlowPdfParser` relies on OCR and model-based layout analysis to extract text, tables, and figures. In contrast, `VisionParser` (defined at line 1806 of [`deepdoc/parser/pdf_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/pdf_parser.py)) overrides the image processing methods to use multimodal LLMs. When `VisionParser` processes a document, it sends page images to a vision model via `picture_vision_llm_chunk` (from [`rag/app/picture.py`](https://github.com/infiniflow/ragflow/blob/main/rag/app/picture.py)) using prompts from `vision_llm_describe_prompt`, returning semantic descriptions rather than raw OCR text.

### How does RAGFlow detect and extract tables from PDF documents?

Table extraction in RAGFlow involves multiple stages within `RAGFlowPdfParser`. First, the layout recognizer (`LayoutRecognizer` or `AscendLayoutRecognizer`) identifies table regions. Then, the `_table_transformer_job` method invokes `TableStructureRecognizer` (from [`deepdoc/vision/table_structure_recognizer.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/vision/table_structure_recognizer.py)) to detect cell boundaries and row/column structures. The environment variable `TABLE_AUTO_ROTATE` (defaulting to `true`) enables automatic orientation correction for tables captured at angles. Finally, the extracted table data is formatted and returned alongside other document elements.

### Can RAGFlow preserve document formatting and hierarchy when parsing DOCX files?

Yes, the `RAGFlowDocxParser` (defined at line 25 of [`deepdoc/parser/docx_parser.py`](https://github.com/infiniflow/ragflow/blob/main/deepdoc/parser/docx_parser.py)) preserves document structure by extracting both text content and style information. When parsing, it returns `secs`, a list of `(paragraph_text, style_name)` tuples, so each paragraph stays labeled as a heading, body text, or another styled element. For tables, the parser uses `__extract_table_content` to retrieve cell data and `__compose_table_content` to merge headers and format rows, ensuring tabular structure is maintained in the output.