How RAGFlow Parses Multi-Modal Information from PDFs and DOCX Files: A Deep Dive into the DeepDoc Parser

RAGFlow extracts text, tables, figures, and layout metadata from PDFs and DOCX files using specialized parsers (RAGFlowPdfParser and RAGFlowDocxParser) that orchestrate OCR, layout recognition, and table structure detection to produce a unified, position-aware representation for downstream RAG pipelines.

The infiniflow/ragflow repository implements a sophisticated document understanding pipeline called DeepDoc that converts multi-modal documents into structured, retrievable chunks. Whether processing scanned PDFs requiring OCR or native DOCX files with embedded tables, RAGFlow parses multi-modal information through a sequence of specialized stages that preserve spatial and semantic relationships.

Overview of the Multi-Modal Parsing Pipeline

RAGFlow processes PDF and DOCX files through dedicated parser classes that handle the unique characteristics of each format. The RAGFlowPdfParser class (defined at line 55 of deepdoc/parser/pdf_parser.py) handles rasterized documents, scanned images, and complex layouts, while the RAGFlowDocxParser class (line 25 of deepdoc/parser/docx_parser.py) processes structured Office Open XML documents.

Both parsers output standardized structures: the PDF parser returns boxes with spatial coordinates and layout types, while the DOCX parser returns sections of text with style metadata and processed tables.

Parsing PDF Documents with RAGFlowPdfParser

The PDF parsing pipeline in RAGFlow is a multi-stage process that transforms raw page images into structured, searchable content; the four stages described here cover its main steps. The main entry point parse_into_bboxes (line 1547 of pdf_parser.py) orchestrates this workflow.
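
The orchestration pattern can be sketched as an ordered list of stages with a progress callback fired after each one. This is a simplified illustration, not the repository's code: run_pipeline, fake_ocr, and fake_layout are hypothetical stand-ins, and only the callback(percentage, message) shape follows the usage shown in the code examples later in this article.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stage type: each stage takes the pipeline state and returns a new one.
Stage = Callable[[Dict], Dict]


def run_pipeline(
    state: Dict,
    stages: List[Tuple[str, Stage]],
    callback: Optional[Callable[[float, str], None]] = None,
) -> Dict:
    """Run the stages in order, reporting fractional progress after each one."""
    total = len(stages)
    for index, (name, stage) in enumerate(stages, start=1):
        state = stage(state)
        if callback is not None:
            callback(index / total, f"{name} finished")
    return state


def fake_ocr(state: Dict) -> Dict:
    # Stand-in for OCR: pretend every page yields exactly one text box.
    boxes = [{"page_number": p, "text": f"page {p}"} for p in state["pages"]]
    return {**state, "boxes": boxes}


def fake_layout(state: Dict) -> Dict:
    # Stand-in for layout recognition: label every box as plain text.
    boxes = [{**box, "layout_type": "text"} for box in state["boxes"]]
    return {**state, "boxes": boxes}


progress: List[Tuple[float, str]] = []
result = run_pipeline(
    {"pages": [1, 2]},
    [("OCR", fake_ocr), ("Layout", fake_layout)],
    callback=lambda pct, msg: progress.append((pct, msg)),
)
```

Keeping each stage a plain function over a shared state makes it easy to swap one step (for example, the OCR stage for a vision-LLM stage) without touching the rest.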

Stage 1: File Loading and OCR/Vision Processing

RAGFlow begins by loading the PDF with pdfplumber.open, from either a file path or a binary stream. For each page, the parser extracts an image and processes it through one of two pathways:

  • Default OCR mode: The self.ocr engine (implemented in deepdoc/vision/ocr.py) performs text recognition on page images.
  • Vision mode: The VisionParser subclass (line 1806) overrides image processing to send pages to a vision-LLM via picture_vision_llm_chunk, generating descriptive text for complex diagrams or images.

Stage 2: Layout Analysis and Region Classification

The parser calls self._layouts_rec(zoomin) to analyze page structure. This method utilizes a LayoutRecognizer (or AscendLayoutRecognizer for Huawei Ascend hardware) to classify regions as text, table, figure, or other layout types. Each detected box receives spatial coordinates and a layout label.
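
One way to picture the labeling step is to give each text box the type of the layout region it overlaps most. The sketch below is an assumption-laden illustration of that idea; overlap_area and assign_layout_types are hypothetical helpers, and the actual matching rules inside _layouts_rec may differ.

```python
from typing import Dict, List


def overlap_area(a: Dict, b: Dict) -> float:
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return max(width, 0.0) * max(height, 0.0)


def assign_layout_types(
    boxes: List[Dict], regions: List[Dict], default: str = "text"
) -> List[Dict]:
    """Label each box with the type of the region it overlaps most."""
    labelled = []
    for box in boxes:
        best_type, best_area = default, 0.0
        for region in regions:
            area = overlap_area(box, region)
            if area > best_area:
                best_type, best_area = region["type"], area
        labelled.append({**box, "layout_type": best_type})
    return labelled


# Hypothetical detector output: one table region and one figure region.
regions = [
    {"type": "table", "x0": 0, "x1": 100, "top": 0, "bottom": 50},
    {"type": "figure", "x0": 0, "x1": 100, "top": 60, "bottom": 120},
]
boxes = [
    {"text": "Q1 revenue", "x0": 10, "x1": 40, "top": 10, "bottom": 20},
    {"text": "Fig. 1", "x0": 10, "x1": 40, "top": 100, "bottom": 110},
    {"text": "Footer", "x0": 10, "x1": 40, "top": 200, "bottom": 210},
]
labelled = assign_layout_types(boxes, regions)
```

Boxes that fall outside every detected region keep the default label, which is a design choice of this sketch rather than documented behavior.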

Stage 3: Table Structure Recognition

For table regions, _table_transformer_job invokes the TableStructureRecognizer to detect cell boundaries and row/column relationships. The environment variable TABLE_AUTO_ROTATE (defaulting to true) enables automatic orientation correction for tables captured at angles.
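
A boolean environment toggle such as TABLE_AUTO_ROTATE is typically read along the following lines. Only the variable name and its default of true come from this article; the env_flag helper and the set of accepted spellings ("1", "true", "yes", "on") are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def env_flag(
    name: str,
    default: bool = True,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag from the environment, falling back to a default."""
    source = os.environ if environ is None else environ
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    # Assumed spellings for "enabled"; anything else counts as disabled.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Reads the real process environment; True when the variable is unset.
auto_rotate = env_flag("TABLE_AUTO_ROTATE")
```

Passing an explicit mapping, as in env_flag("TABLE_AUTO_ROTATE", environ={"TABLE_AUTO_ROTATE": "false"}), makes the behavior easy to check without changing the process environment.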

Stage 4: Text Merging and Position Tagging

Fragmented text boxes undergo consolidation through _text_merge, _concat_downward, and _naive_vertical_merge. These methods combine boxes belonging to the same logical line or paragraph. Each final box receives a position tag via _line_tag, encoding page number, coordinates, and layout type. The crop() method extracts image regions for figures and tables when needed.
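
The merging idea can be illustrated with a minimal same-line merge followed by a position tag. This is a sketch under stated assumptions: merge_line_fragments and position_tag are hypothetical, the tolerance values are arbitrary, the input is assumed to be in reading order, and the tag string format is invented for illustration rather than taken from _line_tag.

```python
from typing import Dict, List


def same_line(a: Dict, b: Dict, tolerance: float = 3.0) -> bool:
    """True when two boxes sit on the same page at nearly the same height."""
    return (
        a["page_number"] == b["page_number"]
        and abs(a["top"] - b["top"]) <= tolerance
        and abs(a["bottom"] - b["bottom"]) <= tolerance
    )


def merge_line_fragments(boxes: List[Dict], max_gap: float = 10.0) -> List[Dict]:
    """Join horizontally adjacent fragments of one line (input in reading order)."""
    merged: List[Dict] = []
    for box in boxes:
        last = merged[-1] if merged else None
        gap = box["x0"] - last["x1"] if last is not None else None
        if last is not None and same_line(last, box) and 0 <= gap <= max_gap:
            # Extend the previous box to cover the new fragment.
            last["text"] = f'{last["text"]} {box["text"]}'
            last["x1"] = box["x1"]
            last["top"] = min(last["top"], box["top"])
            last["bottom"] = max(last["bottom"], box["bottom"])
        else:
            merged.append(dict(box))
    return merged


def position_tag(box: Dict) -> str:
    """Encode page, coordinates, and layout type (illustrative format only)."""
    return "p{}:{:.1f},{:.1f},{:.1f},{:.1f}:{}".format(
        box["page_number"], box["x0"], box["x1"],
        box["top"], box["bottom"], box["layout_type"],
    )


fragments = [
    {"page_number": 1, "x0": 10, "x1": 50, "top": 100, "bottom": 112,
     "text": "Multi-modal", "layout_type": "text"},
    {"page_number": 1, "x0": 54, "x1": 90, "top": 101, "bottom": 112,
     "text": "parsing", "layout_type": "text"},
    {"page_number": 1, "x0": 10, "x1": 80, "top": 130, "bottom": 142,
     "text": "Next line", "layout_type": "text"},
]
lines = merge_line_fragments(fragments)
tags = [position_tag(line) for line in lines]
```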

The final output consists of boxes (containing page_number, x0, x1, top, bottom, layout_type, text, image, and positions) and a separate list of extracted tables.
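
For downstream code it can help to treat each box as a small record. The dataclass below mirrors the field names listed above, but the Box class itself and the text_by_page helper are hypothetical; the parser's actual return type may be a plain dictionary or something else.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Box:
    """Hypothetical record mirroring the box fields named in this article."""
    page_number: int
    x0: float
    x1: float
    top: float
    bottom: float
    layout_type: str
    text: str = ""
    image: Optional[Any] = None  # e.g. a cropped image for figures or tables
    positions: List[tuple] = field(default_factory=list)


def text_by_page(boxes: List[Box]) -> Dict[int, str]:
    """Collect readable text per page, skipping figures, tables, and empty boxes."""
    pages: Dict[int, List[str]] = {}
    for box in sorted(boxes, key=lambda b: (b.page_number, b.top, b.x0)):
        if box.layout_type in {"figure", "table"} or not box.text:
            continue
        pages.setdefault(box.page_number, []).append(box.text)
    return {page: "\n".join(lines) for page, lines in pages.items()}


sample = [
    Box(1, 0, 10, 50, 60, "text", "second"),
    Box(1, 0, 10, 10, 20, "title", "first"),
    Box(1, 0, 10, 30, 40, "figure"),
    Box(2, 0, 10, 5, 15, "text", "next page"),
]
page_text = text_by_page(sample)
```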

Parsing DOCX Documents with RAGFlowDocxParser

DOCX processing leverages the structured nature of Office Open XML. The RAGFlowDocxParser uses python-docx to load documents and extract content without requiring OCR.

Extracting Paragraphs and Styles

The __call__ method of RAGFlowDocxParser iterates through document paragraphs, concatenating the runs within each paragraph using "".join(runs_within_single_paragraph). It returns secs, a list of (paragraph_text, style_name) tuples that preserve document hierarchy and formatting metadata.
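
The paragraph loop can be sketched against any object that exposes python-docx's paragraphs, runs, text, and style.name attributes. The extract_sections function is a hypothetical illustration, and skipping empty paragraphs is an assumption of this sketch; stand-in objects are used so the example runs without python-docx installed.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def extract_sections(document: Any) -> List[Tuple[str, str]]:
    """Return (paragraph_text, style_name) pairs for non-empty paragraphs."""
    sections: List[Tuple[str, str]] = []
    for paragraph in document.paragraphs:
        # Join the runs so a paragraph split by formatting stays one string.
        text = "".join(run.text for run in paragraph.runs)
        if not text.strip():
            continue  # assumption: empty paragraphs carry no useful content
        style_name = paragraph.style.name if paragraph.style is not None else ""
        sections.append((text, style_name))
    return sections


def fake_paragraph(style: str, *runs: str) -> SimpleNamespace:
    """Build a stand-in with the same attribute shape as a python-docx paragraph."""
    return SimpleNamespace(
        runs=[SimpleNamespace(text=r) for r in runs],
        style=SimpleNamespace(name=style),
    )


doc = SimpleNamespace(
    paragraphs=[
        fake_paragraph("Heading 1", "Intro", "duction"),
        fake_paragraph("Normal", ""),
        fake_paragraph("Normal", "Body ", "text."),
    ]
)
secs = extract_sections(doc)
```

With python-docx installed, a document opened via docx.Document("path/to/document.docx") exposes the same attributes and could be passed to extract_sections directly.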

Table Detection and Content Composition

Tables are detected via the python-docx table API. The parser calls __extract_table_content to retrieve cell data, then processes it through __compose_table_content to merge headers and format rows. The output tbls contains processed table strings ready for chunking.
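
A much-simplified version of the composition step pairs each cell with its column header to produce one string per row. The compose_table_rows function below is a hypothetical sketch; the real __compose_table_content logic (for example, how it detects header rows) is not reproduced here.

```python
from typing import List


def compose_table_rows(rows: List[List[str]], has_header: bool = True) -> List[str]:
    """Turn a grid of cell strings into one 'header: value' line per data row."""
    if not rows:
        return []
    if not has_header or len(rows) == 1:
        # No header to merge: emit each row as a plain delimited line.
        return ["; ".join(cell.strip() for cell in row) for row in rows]

    header = [cell.strip() for cell in rows[0]]
    lines: List[str] = []
    for row in rows[1:]:
        pairs: List[str] = []
        for index, cell in enumerate(row):
            value = cell.strip()
            if not value:
                continue  # skip empty cells instead of emitting "header: "
            name = header[index] if index < len(header) else ""
            pairs.append(f"{name}: {value}" if name else value)
        lines.append("; ".join(pairs))
    return lines


grid = [
    ["Model", "Params"],
    ["A", "7B"],
    ["B", ""],
]
table_lines = compose_table_rows(grid)
```

With python-docx, a grid of this shape can be built from a table as [[cell.text for cell in row.cells] for row in table.rows]. Repeating the header in every row keeps each line meaningful on its own once the table is split into chunks.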

Vision-Enabled Parsing for Complex PDFs

For documents containing complex diagrams, screenshots, or handwritten content that OCR cannot adequately capture, RAGFlow provides VisionParser. This subclass overrides the standard image processing pipeline to utilize multimodal LLMs.

When VisionParser processes a PDF, it sends page images to a vision-LLM using picture_vision_llm_chunk (defined in rag/app/picture.py) with prompts generated by vision_llm_describe_prompt (from rag/prompts/generator.py). The LLM returns descriptive text that replaces or supplements OCR output, enabling semantic understanding of visual content.
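
The per-page flow can be sketched with the vision model abstracted as a plain callable. Everything named here is hypothetical: describe_pages is not a RAGFlow function, and the describe argument stands in for whatever call the real picture_vision_llm_chunk makes to the model.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def describe_pages(
    page_images: Sequence[Any],
    describe: Callable[[Any, int], str],
    from_page: int = 0,
    to_page: Optional[int] = None,
    callback: Optional[Callable[[float, str], None]] = None,
) -> List[Tuple[str, int]]:
    """Send each selected page image to `describe` and collect non-empty text."""
    selected = list(enumerate(page_images))[from_page:to_page]
    docs: List[Tuple[str, int]] = []
    for done, (page_index, image) in enumerate(selected, start=1):
        text = describe(image, page_index)
        if text and text.strip():
            docs.append((text.strip(), page_index))
        if callback is not None:
            callback(done / len(selected), f"Described page {page_index + 1}")
    return docs


def fake_describe(image: Any, page_index: int) -> str:
    # Stand-in for a vision-LLM call; the second page yields only whitespace.
    return "   " if page_index == 1 else f"Description of {image}"


progress: List[Tuple[float, str]] = []
docs = describe_pages(
    ["img0", "img1", "img2"],
    fake_describe,
    from_page=0,
    to_page=2,
    callback=lambda pct, msg: progress.append((pct, msg)),
)
```

Injecting the model as a callable keeps the loop testable offline and lets a caller add retries or rate limiting around the real model call.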

Code Examples

Parsing a PDF with Default OCR

from deepdoc.parser.pdf_parser import RAGFlowPdfParser

parser = RAGFlowPdfParser()

def progress_callback(percentage, message):
    print(f"{percentage*100:.1f}% – {message}")

boxes, tables = parser.parse_into_bboxes(
    fnm="path/to/document.pdf",
    callback=progress_callback,
    zoomin=3,  # Page render scale; larger values can help OCR on small text but cost more time and memory
)

# boxes contains: page_number, coordinates, layout_type, text, image, positions

# tables contains extracted table structures

Parsing a PDF with Vision LLM

from deepdoc.parser.pdf_parser import VisionParser

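# vision_model is assumed to be an already-configured vision-LLM client object
# rather than a bare model-name string; constructing it is deployment-specific
# and not shown here.
vision_parser = VisionParser(
    vision_model=vision_model,
)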

docs, _ = vision_parser(
    filename="path/to/diagram.pdf",
    from_page=0,
    to_page=10,
    callback=lambda p, m: print(f"Progress: {p}, Message: {m}")
)

# docs contains LLM-generated descriptions of visual content

Parsing a DOCX File

from deepdoc.parser.docx_parser import RAGFlowDocxParser

parser = RAGFlowDocxParser()
paragraphs, tables = parser(
    fnm="path/to/document.docx",
    from_page=0,
    to_page=100
)

# paragraphs: [(text, style_name), ...]

# tables: list of formatted table strings

Key Implementation Files

File | Role
deepdoc/parser/pdf_parser.py | Core PDF parsing logic, OCR, layout analysis, table recognition
deepdoc/parser/docx_parser.py | DOCX text and table extraction using python-docx
deepdoc/vision/ocr.py | OCR engine interface for text extraction from images
deepdoc/vision/layout_recognizer.py | Layout analysis models (ONNX and Ascend variants)
deepdoc/vision/table_structure_recognizer.py | Table cell detection and auto-rotation logic
rag/app/picture.py | Vision-LLM wrapper for image description
rag/prompts/generator.py | Prompt templates for vision model interactions

Summary

  • RAGFlow processes multi-modal documents through specialized parsers that extract text, tables, figures, and layout metadata from both rasterized PDFs and structured DOCX files.
  • The RAGFlowPdfParser orchestrates a multi-stage pipeline including OCR (self.ocr), layout recognition (LayoutRecognizer), table structure detection (TableStructureRecognizer), and text merging (_text_merge, _concat_downward, _naive_vertical_merge) to produce position-tagged bounding boxes.
  • The RAGFlowDocxParser leverages python-docx to extract styled paragraphs and tables without OCR, using __extract_table_content and __compose_table_content for table normalization.
  • For complex visual content, VisionParser extends the PDF pipeline to utilize multimodal LLMs via picture_vision_llm_chunk, generating semantic descriptions of diagrams and images that traditional OCR cannot capture.

Frequently Asked Questions

How does RAGFlow handle scanned PDFs that contain only images?

RAGFlow processes scanned PDFs through the RAGFlowPdfParser, which renders each page to an image and runs it through an OCR engine (self.ocr, implemented in deepdoc/vision/ocr.py). Layout analysis and table detection then operate on the recognized text boxes. For documents where OCR is insufficient, the VisionParser class can send page images to a vision-LLM to generate descriptive text.

What is the difference between the standard PDF parser and the VisionParser in RAGFlow?

The standard RAGFlowPdfParser relies on traditional OCR plus model-based layout and table recognition (LayoutRecognizer and TableStructureRecognizer) to extract text, tables, and figures. In contrast, VisionParser (defined at line 1806 of deepdoc/parser/pdf_parser.py) overrides the image processing methods to utilize multimodal LLMs. When VisionParser processes a document, it sends page images to a vision model via picture_vision_llm_chunk (from rag/app/picture.py) using prompts from vision_llm_describe_prompt, returning semantic descriptions rather than raw OCR text.

How does RAGFlow detect and extract tables from PDF documents?

Table extraction in RAGFlow involves multiple stages within RAGFlowPdfParser. First, the layout recognizer (LayoutRecognizer or AscendLayoutRecognizer) identifies table regions. Then, the _table_transformer_job method invokes TableStructureRecognizer (from deepdoc/vision/table_structure_recognizer.py) to detect cell boundaries and row/column structures. The environment variable TABLE_AUTO_ROTATE (defaulting to true) enables automatic orientation correction for tables captured at angles. Finally, the extracted table data is formatted and returned alongside other document elements.

Can RAGFlow preserve document formatting and hierarchy when parsing DOCX files?

Yes, the RAGFlowDocxParser (defined at line 25 of deepdoc/parser/docx_parser.py) preserves document structure by extracting both text content and style information. When parsing, it returns secs—a list of tuples containing (paragraph_text, style_name)—which maintains the hierarchical relationship between headings, body text, and other styled elements. For tables, the parser uses __extract_table_content to retrieve cell data and __compose_table_content to merge headers and format rows, ensuring tabular structure is maintained in the output.
