Supported Document Formats for Knowledge Graph Building in Graph-RAG Agent

The Graph-RAG agent supports nine document formats—.txt, .pdf, .md, .docx, .doc, .csv, .json, .yaml, and .yml—for automated knowledge graph construction.

This article covers the ingestion pipeline of the 1517005260/graph-rag-agent repository. The pipeline detects each file's type from its extension and applies the matching reader, so a mixed document collection can be processed in a single pass and handed on to graph construction.

Complete List of Supported Document Formats

The ingestion system recognizes nine distinct file extensions. In graphrag_agent/pipelines/ingestion/file_reader.py, a dictionary maps each extension to its specialized reader function:

  • .txt — Plain text files
  • .pdf — PDF documents
  • .md — Markdown files
  • .docx — Word (OpenXML) documents
  • .doc — Legacy Word documents
  • .csv — Comma-separated values files
  • .json — JSON files
  • .yaml / .yml — YAML configuration files

This mapping lets the FileReader class select the correct reader for each file without manual configuration.
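
The extension-to-reader lookup described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the names `READERS`, `read_text`, `read_json`, and `read_by_extension` are hypothetical, and only three of the nine extensions are covered so the example needs nothing beyond the standard library.

```python
import json
from pathlib import Path


def read_text(path):
    """Return the file's contents decoded as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def read_json(path):
    """Parse the JSON file and re-serialize it as indented text."""
    with open(path, encoding="utf-8") as fh:
        return json.dumps(json.load(fh), ensure_ascii=False, indent=2)


# Hypothetical mapping for illustration; the real dictionary in
# file_reader.py may use different names and cover more extensions.
READERS = {
    ".txt": read_text,
    ".md": read_text,
    ".json": read_json,
}


def read_by_extension(path):
    """Look up the reader for the file's extension and call it."""
    ext = Path(path).suffix.lower()  # normalize so ".TXT" matches ".txt"
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext!r}") from None
    return reader(path)
```

Keeping the dispatch in a dictionary means a new format only needs one new function and one new entry, with no change to the lookup logic.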

How Document Ingestion Works

The FileReader class in file_reader.py orchestrates the extraction process. When you initialize the reader with a directory path, it scans for files with a supported extension, then routes each file to the matching extraction method.

The extracted content is returned as a list of dictionaries containing the raw text and metadata. This output feeds directly into the DocumentProcessor class defined in graphrag_agent/pipelines/ingestion/document_processor.py, which handles the transformation into graph nodes.
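
To make the shape of that output concrete, here is a small sketch of a directory scan that returns a list of dictionaries. The function name `scan_directory`, the recursive traversal, and the fields inside `metadata` are assumptions made for illustration; only the `content` and `metadata` keys come from the description in this article.

```python
from pathlib import Path


def scan_directory(directory, extensions=(".txt", ".md")):
    """Collect matching files under `directory` as content/metadata dicts.

    Hypothetical helper: it only handles plain-text formats and walks
    subdirectories recursively, which the real FileReader may not do.
    """
    wanted = {ext.lower() for ext in extensions}
    documents = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            documents.append({
                "content": path.read_text(encoding="utf-8"),
                # The metadata fields below are illustrative assumptions.
                "metadata": {
                    "filename": path.name,
                    "extension": path.suffix.lower(),
                },
            })
    return documents
```

Files with other extensions are skipped silently here; a production reader would more likely log them.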

Practical Code Examples

Ingesting a Directory of Mixed Documents

Process an entire folder that contains a mix of supported formats:

from graphrag_agent.pipelines.ingestion.file_reader import FileReader

# Path to the folder containing your source documents
data_dir = "/path/to/my/documents"

# Optional: limit to specific extensions (e.g., only PDFs and TXT files)
extensions = [".pdf", ".txt"]   # omit to use all supported formats

reader = FileReader(directory_path=data_dir)
documents = reader.read_files(file_extensions=extensions)

# `documents` is a list of dicts with raw text and metadata, ready for graph building

Processing a Single File

Extract content from an individual file using the internal helper method:

from graphrag_agent.pipelines.ingestion.file_reader import FileReader

reader = FileReader(directory_path=None)   # `directory_path` is unused for single file reads

doc = reader._read_file("/path/to/report.pdf")   # internal helper; normally called via read_files()

print(doc["content"][:200])   # preview of extracted text

Converting Documents to Graph Nodes

Transform the extracted documents into knowledge graph entities:

from graphrag_agent.pipelines.ingestion.document_processor import DocumentProcessor

processor = DocumentProcessor()
graph_nodes = processor.process_documents(documents)   # converts raw docs into KG entities

Key Implementation Files

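These are the source files to read when extending or debugging format support:

  • graphrag_agent/pipelines/ingestion/file_reader.py: the FileReader class and the extension-to-reader mapping
  • graphrag_agent/pipelines/ingestion/document_processor.py: the DocumentProcessor class that turns extracted documents into graph nodes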

Summary

  • The Graph-RAG agent supports nine document formats: .txt, .pdf, .md, .docx, .doc, .csv, .json, .yaml, and .yml.
  • Format detection and routing occur automatically in file_reader.py through an extension-to-reader mapping.
  • The FileReader class extracts raw content, while DocumentProcessor transforms it into knowledge graph nodes.
  • You can filter ingestion by specific extensions or process entire directories of mixed document types.

Frequently Asked Questions

Does the Graph-RAG agent support scanned PDFs or only text-based PDFs?

The current implementation in file_reader.py handles text-based PDFs, where the text layer can be extracted directly. Scanned PDFs that require OCR (Optical Character Recognition) are not explicitly supported in the base implementation; pre-process them with an OCR tool before ingestion.

Can I add support for additional document formats like .epub or .html?

Yes. To add a format, modify the extension-to-reader mapping in graphrag_agent/pipelines/ingestion/file_reader.py. Add the new extension (e.g., .epub) and implement a corresponding reader function that returns a dictionary with content and metadata keys, following the existing pattern used for .pdf and .docx files.
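
As a rough sketch of what such an extension could look like for .html, the reader below uses only Python's standard `html.parser` module. The names `read_html`, `_TextExtractor`, and `EXTRA_READERS` are hypothetical, and the way a new reader is registered in the actual repository may differ; treat the final dictionary as a placeholder for whatever mapping file_reader.py defines.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def read_html(path):
    """Hypothetical reader returning the content/metadata dictionary shape."""
    parser = _TextExtractor()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    parser.close()
    return {
        "content": "\n".join(parser.parts),
        "metadata": {"filename": Path(path).name, "extension": ".html"},
    }


# Hypothetical registration; in the repository this entry would be added
# to the existing extension-to-reader mapping instead.
EXTRA_READERS = {".html": read_html}
```

An .epub reader would follow the same pattern but needs a third-party library or manual ZIP handling, so it is left out of this sketch.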

How does the system handle CSV files during knowledge graph construction?

CSV files are treated as structured text documents. The FileReader extracts the raw content (including headers and rows), and the DocumentProcessor processes this text to identify entities and relationships. For more sophisticated graph construction from CSVs, such as treating rows as nodes and columns as properties, you may need to customize the processing logic in document_processor.py or use a specialized CSV-to-graph mapper.
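
One possible shape for such a mapper, shown here with the standard `csv` module, turns each row into a node whose properties are the remaining columns. The function `csv_rows_to_nodes` and the node dictionary layout (`id`, `label`, `properties`) are assumptions for illustration and are not part of the repository.

```python
import csv
import io


def csv_rows_to_nodes(csv_text, id_column, label="Row"):
    """Map each CSV row to a node dict; other columns become properties.

    Hypothetical helper: the node layout is an illustrative choice,
    not the format DocumentProcessor produces.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise ValueError(f"Column {id_column!r} not found in CSV header")
    nodes = []
    for row in reader:
        properties = {k: v for k, v in row.items() if k != id_column}
        nodes.append({"id": row[id_column], "label": label, "properties": properties})
    return nodes


if __name__ == "__main__":
    sample = "id,name,city\n1,Ada,London\n2,Alan,Wilmslow\n"
    for node in csv_rows_to_nodes(sample, id_column="id", label="Person"):
        print(node)
```

All values stay as strings, as the `csv` module returns them; type conversion and relationship extraction between rows would be separate steps.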

Is there a file size limit for documents being ingested?

Neither file_reader.py nor document_processor.py appears to define an explicit file size limit. In practice, the limit depends on your system's available memory and on the extraction libraries used (e.g., PDF parsers may load entire documents into RAM). For production deployments, consider implementing chunked reading or streaming processing for files larger than available memory.
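
For plain-text inputs, chunked reading can be as simple as the generator below. It is a sketch under the assumption that downstream processing can accept text piece by piece; `iter_text_chunks` is a hypothetical helper, not a function from the repository, and it does not apply to binary formats such as .pdf or .docx.

```python
def iter_text_chunks(path, chunk_chars=1_000_000, encoding="utf-8"):
    """Yield successive pieces of a text file, at most `chunk_chars` characters each."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    with open(path, "r", encoding=encoding) as fh:
        while True:
            chunk = fh.read(chunk_chars)  # in text mode this counts characters
            if not chunk:
                break
            yield chunk
```

Because the file is opened in text mode, multi-byte characters are never split across chunks, though a chunk boundary can still fall in the middle of a word or sentence.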
