# Supported Document Formats for Knowledge Graph Building in Graph-RAG Agent

> Build knowledge graphs with Graph-RAG agent using .txt, .pdf, .md, .docx, .doc, .csv, .json, .yaml, and .yml formats. Discover supported document types for efficient graph construction.

- Repository: [GLK/graph-rag-agent](https://github.com/1517005260/graph-rag-agent)
- Tags: tutorial
- Published: 2026-02-22

---

**The Graph-RAG agent supports nine document formats—.txt, .pdf, .md, .docx, .doc, .csv, .json, .yaml, and .yml—for automated knowledge graph construction.**

The ingestion pipeline in the `1517005260/graph-rag-agent` repository detects each file's type from its extension and applies the matching reader, so a folder of mixed documents can be processed in a single pass and handed on for graph construction.

## Complete List of Supported Document Formats

The ingestion system recognizes nine distinct file extensions. In [`graphrag_agent/pipelines/ingestion/file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py), a dictionary maps each extension to its specialized reader function:

- **.txt** — Plain text files
- **.pdf** — PDF documents
- **.md** — Markdown files
- **.docx** — Word (OpenXML) documents
- **.doc** — Legacy Word documents
- **.csv** — Comma-separated values files
- **.json** — JSON files
- **.yaml** / **.yml** — YAML configuration files

With this mapping, the `FileReader` class selects the correct reader function for each file without manual configuration.
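The extension-to-reader dispatch described above can be sketched as follows. This is a minimal illustration, not the repository's code: `READERS`, `read_plain_text`, and `pick_reader` are hypothetical names, and only a plain-text reader is implemented.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def read_plain_text(path: str) -> str:
    """Hypothetical reader: return the file's text, assuming UTF-8."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical mapping from lowercase extension to reader function.
# The mapping in file_reader.py is described as covering all nine extensions.
READERS: Dict[str, Callable[[str], str]] = {
    ".txt": read_plain_text,
    ".md": read_plain_text,
}


def pick_reader(path: str) -> Optional[Callable[[str], str]]:
    """Look up a reader by the file's extension, ignoring case."""
    return READERS.get(Path(path).suffix.lower())
```

Returning `None` for an unknown extension lets the caller decide whether to skip the file or raise an error.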

## How Document Ingestion Works

The `FileReader` class in [`file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py) orchestrates extraction. When you initialize the reader with a directory path, it scans for files with a supported extension and routes each file to the matching extraction method.

The extracted content is returned as a list of dictionaries containing the raw text and metadata. This output feeds directly into the `DocumentProcessor` class defined in [`graphrag_agent/pipelines/ingestion/document_processor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/document_processor.py), which handles the transformation into graph nodes.
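The scanning step in this section can be approximated with the standard library. The sketch below is an assumption about the general approach, not the actual `FileReader` logic: `scan_directory` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the helper only lists matching files rather than extracting their text.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional

# Extensions listed in this article, kept here as a plain constant.
SUPPORTED_EXTENSIONS = {
    ".txt", ".pdf", ".md", ".docx", ".doc", ".csv", ".json", ".yaml", ".yml",
}


def scan_directory(
    directory: str, extensions: Optional[Iterable[str]] = None
) -> List[Dict[str, str]]:
    """Hypothetical helper: list files whose extension is in the wanted set."""
    wanted = {ext.lower() for ext in extensions} if extensions else SUPPORTED_EXTENSIONS
    found = []
    # rglob("*") walks subdirectories too; sorting keeps the order stable.
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            found.append({"filename": path.name, "extension": path.suffix.lower()})
    return found
```

Comparing lowercase suffixes means a file named `REPORT.PDF` is matched the same way as `report.pdf`.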

## Practical Code Examples

### Ingesting a Directory of Mixed Documents

Process an entire folder that contains a mix of supported formats:

```python
from graphrag_agent.pipelines.ingestion.file_reader import FileReader

# Path to the folder containing your source documents
data_dir = "/path/to/my/documents"

# Optional: limit to specific extensions (e.g., only PDFs and TXT files)
extensions = [".pdf", ".txt"]   # omit to use all supported formats

reader = FileReader(directory_path=data_dir)
documents = reader.read_files(file_extensions=extensions)

# `documents` is a list of dicts with raw text and metadata, ready for graph building
```

### Processing a Single File

Extract content from an individual file using the internal helper method:

```python
from graphrag_agent.pipelines.ingestion.file_reader import FileReader

reader = FileReader(directory_path=None)   # `directory_path` is unused for single file reads
doc = reader._read_file("/path/to/report.pdf")   # internal helper; normally called via read_files()
print(doc["content"][:200])   # preview of extracted text
```

### Converting Documents to Graph Nodes

Transform the extracted documents into knowledge graph entities:

```python
from graphrag_agent.pipelines.ingestion.document_processor import DocumentProcessor

processor = DocumentProcessor()
graph_nodes = processor.process_documents(documents)   # converts raw docs into KG entities
```

## Key Implementation Files

These source files are the starting points for extending or debugging format support:

- **[`graphrag_agent/pipelines/ingestion/file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py)** — Implements the file-type-specific readers and the extension-to-function mapping that defines which formats are supported.
- **[`graphrag_agent/pipelines/ingestion/document_processor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/document_processor.py)** — Provides human-readable names for each supported format and contains the post-processing logic that converts extracted text into graph-compatible structures.

## Summary

- The Graph-RAG agent supports **nine document formats**: .txt, .pdf, .md, .docx, .doc, .csv, .json, .yaml, and .yml.
- Format detection and routing occur automatically in [`file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py) through an extension-to-reader mapping.
- The `FileReader` class extracts raw content, while `DocumentProcessor` transforms it into knowledge graph nodes.
- You can filter ingestion by specific extensions or process entire directories of mixed document types.

## Frequently Asked Questions

### Does the Graph-RAG agent support scanned PDFs or only text-based PDFs?

The current implementation in [`file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py) extracts text from text-based PDFs using standard extraction libraries. Scanned PDFs that require OCR (Optical Character Recognition) are not explicitly supported in the base implementation, so you would need to run them through an OCR tool before ingestion.
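One way to decide which PDFs need that OCR pass is to check how much text a normal extraction returned. The function below is a hypothetical heuristic, not part of the repository: it takes text already extracted per page (by whatever PDF library you use) and flags documents whose pages are nearly empty. The 20-character threshold is an arbitrary assumption to tune for your data.

```python
from typing import List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is scanned, given the text extracted from each page.

    Hypothetical heuristic: if the average count of non-whitespace
    characters per page is below the threshold, assume the pages are images.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    # "".join(text.split()) removes every whitespace character before counting.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

Documents flagged this way can be routed to an OCR tool first, and the resulting text ingested as a `.txt` file.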

### Can I add support for additional document formats like .epub or .html?

Yes. To add a format, modify the extension-to-reader mapping in [`graphrag_agent/pipelines/ingestion/file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py). Add your new extension (e.g., `.epub`) and implement a corresponding reader function that returns a dictionary with `content` and `metadata` keys, following the existing pattern used for .pdf and .docx files.
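As an illustration of such a reader, here is a minimal sketch for `.html` files built on the standard library's `html.parser`. The names `html_to_text` and `read_html` are hypothetical, the metadata fields are assumptions, and the way the function is registered in the mapping depends on the repository's actual code.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Any, Dict, List


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML string, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


def read_html(path: str) -> Dict[str, Any]:
    """Hypothetical .html reader returning `content` and `metadata` keys."""
    markup = Path(path).read_text(encoding="utf-8")
    return {
        "content": html_to_text(markup),
        "metadata": {"source": str(path), "extension": ".html"},
    }
```

Keeping the parsing in `html_to_text` separate from file access makes the reader easy to test without touching the disk.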

### How does the system handle CSV files during knowledge graph construction?

CSV files are treated as structured text documents. The `FileReader` extracts the raw content (including headers and rows), and the `DocumentProcessor` processes this text to identify entities and relationships. For more sophisticated graph construction from CSVs, such as treating rows as nodes and columns as properties, you may need to customize the processing logic in [`document_processor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/document_processor.py) or use a specialized CSV-to-graph mapper.
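The rows-as-nodes idea can be sketched with the standard `csv` module. This is a hypothetical mapper, not something the repository provides: `csv_rows_to_nodes` and the node dictionary shape (`id`, `label`, `properties`) are assumptions chosen for illustration.

```python
import csv
import io
from typing import Any, Dict, List


def csv_rows_to_nodes(
    csv_text: str, id_column: str, label: str = "Row"
) -> List[Dict[str, Any]]:
    """Turn each CSV row into a node dict with the other columns as properties."""
    reader = csv.DictReader(io.StringIO(csv_text))
    nodes = []
    for row in reader:
        # Every column except the id column becomes a node property.
        properties = {key: value for key, value in row.items() if key != id_column}
        nodes.append({"id": row[id_column], "label": label, "properties": properties})
    return nodes
```

For example, calling it on a two-column file with `id_column="name"` yields one node per row, keyed by the `name` value, with the remaining column stored under `properties`.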

### Is there a file size limit for documents being ingested?

The source does not appear to define an explicit file size limit in [`file_reader.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/file_reader.py) or [`document_processor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/pipelines/ingestion/document_processor.py). In practice, the limit depends on your system's available memory and on the extraction libraries used (e.g., PDF parsers may load entire documents into RAM). For production deployments, consider implementing chunked reading or streaming processing for files larger than available memory.
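For plain-text inputs, chunked reading can be as simple as the generator below. It is a sketch under stated assumptions: `iter_text_chunks` is a hypothetical helper, the default chunk size is arbitrary, and binary formats such as PDF or DOCX would need streaming support from their own parsing libraries instead.

```python
from typing import Iterator, TextIO


def iter_text_chunks(stream: TextIO, chunk_chars: int = 1_000_000) -> Iterator[str]:
    """Yield successive pieces of at most `chunk_chars` characters from a text stream."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:
            break  # an empty read means the stream is exhausted
        yield chunk
```

A caller would open the file with `open(path, encoding="utf-8")` and pass the handle in, so only one chunk is held in memory at a time. Note that fixed-size chunks can split a sentence or entity name across two pieces, which may matter for downstream entity extraction.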