How Entity Extraction Works for Knowledge Graph Construction in Graph-RAG Agent

February 22, 2026 · Repository: 1517005260/graph-rag-agent

Entity extraction in the Graph-RAG agent uses an LLM-driven EntityRelationExtractor class to parse raw text into structured entities and relationships, with deterministic caching and parallel processing to build scalable knowledge graphs.

The graph-rag-agent repository implements a complete pipeline for transforming unstructured documents into queryable knowledge graphs. At the heart of this system lies the entity extraction mechanism, which identifies semantic entities and their relationships before persisting them to Neo4j.

The EntityRelationExtractor Architecture

The extraction workflow centers on the EntityRelationExtractor class defined in graphrag_agent/graph/extraction/entity_extractor.py. This orchestrator manages LLM interactions, result caching, and concurrent processing to convert text chunks into structured graph triples.

Initialization and Prompt Engineering

The extractor initializes by composing a LangChain pipeline that binds system instructions, chat history, and human prompts to an LLM instance. According to the source code in EntityRelationExtractor.__init__ (line 22), the constructor accepts:

An LLM instance (any LangChain-compatible model)
System and human prompt templates
Predefined entity_types and relationship_types lists

These parameters form a ChatPromptTemplate and a processing chain:

self.chain = self.chat_prompt | self.llm

The prompt templates guide the LLM to output JSON containing two lists: "entities" and "relationships", ensuring consistent parsing regardless of input text variability.
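As a rough illustration of how type lists can constrain the output format, the sketch below renders a system prompt from entity and relationship types. The helper name build_system_prompt and the exact wording are hypothetical and are not taken from the repository.

```python
def build_system_prompt(entity_types, relationship_types):
    """Render a system prompt that pins the reply to a fixed JSON shape."""
    return (
        "You are an expert knowledge-graph extractor.\n"
        f"Allowed entity types: {', '.join(entity_types)}.\n"
        f"Allowed relationship types: {', '.join(relationship_types)}.\n"
        'Return a JSON object with exactly two lists: "entities" and "relationships".'
    )


# Hypothetical type lists, matching the ones used later in this article.
prompt = build_system_prompt(["PERSON", "ORG", "EVENT"], ["WORKS_FOR", "PART_OF"])
print(prompt)
```

If a prompt embeds a literal JSON example, its braces need to be doubled ({{ and }}) when the string is used as a LangChain template with the default f-string format, since single braces mark template variables.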

Deterministic Caching Strategy

To eliminate redundant LLM calls, the system implements hash-based caching. The _generate_cache_key method (line 77) creates deterministic hashes for each text chunk using generate_hash. Results persist as pickled objects at <cache_dir>/<hash>.pkl via _save_to_cache and _load_from_cache (lines 101-139).

When process_chunks encounters a previously processed chunk, it loads the structured result from disk rather than invoking the LLM. As long as the cache directory is retained, identical text does not trigger a second API call, which cuts cost substantially during iterative graph development.
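The caching idea can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: cache_key stands in for generate_hash (SHA-256 is an assumption), and load_or_compute is a hypothetical helper that folds the load and save steps into one function.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_key(text: str) -> str:
    # Stand-in for generate_hash; SHA-256 is an assumption, not the repo's confirmed choice.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_or_compute(text, cache_dir, compute):
    """Return the cached result for `text`, calling `compute` only on a cache miss."""
    path = Path(cache_dir) / f"{cache_key(text)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute(text)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def fake_extract(text):
    # Placeholder for the LLM call; records each invocation so misses can be counted.
    calls.append(text)
    return {"entities": [], "relationships": []}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute("Alice works for Acme Corp.", tmp, fake_extract)
    second = load_or_compute("Alice works for Acme Corp.", tmp, fake_extract)

print(len(calls))  # the second call is served from disk
```

Because pickle can execute code on load, a cache like this should only ever be read from a directory the pipeline itself controls.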

Parallel Processing Pipeline

The process_chunks method (line 45) orchestrates batch extraction using ThreadPoolExecutor. The workflow:

Generates cache keys for all incoming chunks
Loads cached results and removes those chunks from the work queue
Processes uncached chunks concurrently across configurable max_workers
Implements automatic retry logic (up to three attempts) for failed chunks
Fires progress callbacks for real-time monitoring
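The steps above can be sketched as a small thread pool with per-item retries and a progress callback. The function names (run_with_retry, process_parallel) and the callback signature are illustrative assumptions; only ThreadPoolExecutor, max_workers, and the three-attempt limit come from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retry(fn, item, attempts=3, delay=0.0):
    """Call fn(item), retrying on any exception up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(item)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)


def process_parallel(items, fn, max_workers=4, on_progress=None):
    """Process items concurrently; results are returned in input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_with_retry, fn, item): index
            for index, item in enumerate(items)
        }
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress:
                on_progress(done, len(items))
    return results


seen = set()


def flaky_upper(text):
    # Simulates a transient failure: fails the first time it sees each input.
    if text not in seen:
        seen.add(text)
        raise RuntimeError("transient failure")
    return text.upper()


progress = []
out = process_parallel(
    ["a", "b", "c"],
    flaky_upper,
    max_workers=2,
    on_progress=lambda done, total: progress.append((done, total)),
)
print(out)
```

Threads suit this workload because each task spends most of its time waiting on a network response rather than using the CPU.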

For memory-constrained environments, stream_process_large_files (line 371) yields chunks lazily while maintaining the same caching logic, enabling processing of multi-gigabyte documents without loading entire files into RAM.
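Lazy chunk yielding can be illustrated with a plain generator. This sketch assumes fixed-size character chunks read from a file-like object; the repository's actual chunking strategy is not described here, and iter_chunks is a hypothetical name.

```python
import io


def iter_chunks(stream, chunk_chars=1000):
    """Yield fixed-size text chunks from a file-like object, one at a time."""
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:
            break
        yield chunk


# An in-memory stream stands in for a large file opened with open(path, encoding="utf-8").
chunks = list(iter_chunks(io.StringIO("abcdefghij"), chunk_chars=4))
print(chunks)
```

Only one chunk is held in memory at a time when the generator is consumed in a loop, so peak memory depends on chunk_chars rather than on the size of the file.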

From Raw Text to Structured Triples

Chunk-Level Processing

Individual text segments flow through _process_single_chunk (line 335), which:

Sends the chunk through the LangChain pipeline
Receives raw string output encoding entities and relationships
Parses the response into structured Python dictionaries containing entity IDs, labels, descriptions, and relationship mappings

Each processed chunk returns a tuple with the file metadata and a dictionary separating entities (with id, label, description fields) from relationships (with source, target, type fields).
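The parsing step might look roughly like the following. This is a sketch under assumptions: the reply is expected to contain one JSON object with the field names listed above, and parse_extraction with its fallback behaviour is hypothetical rather than the logic in _process_single_chunk.

```python
import json

EMPTY = {"entities": [], "relationships": []}


def _records(data, key):
    """Return the list of dict items stored under `key`, or an empty list."""
    value = data.get(key)
    return [v for v in value if isinstance(v, dict)] if isinstance(value, list) else []


def parse_extraction(raw: str) -> dict:
    """Parse an LLM reply into {'entities': [...], 'relationships': [...]}."""
    # Models often wrap JSON in extra prose, so keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY)
    entities = [e for e in _records(data, "entities") if isinstance(e.get("id"), str)]
    known = {e["id"] for e in entities}
    # Drop relationships that reference entities the model never declared.
    relationships = [
        r for r in _records(data, "relationships")
        if isinstance(r.get("source"), str) and isinstance(r.get("target"), str)
        and r["source"] in known and r["target"] in known
    ]
    return {"entities": entities, "relationships": relationships}


raw = (
    'Here is the result: {"entities": ['
    '{"id": "alice", "label": "PERSON", "description": "An employee"}, '
    '{"id": "acme", "label": "ORG", "description": "A company"}], '
    '"relationships": ['
    '{"source": "alice", "target": "acme", "type": "WORKS_FOR"}, '
    '{"source": "alice", "target": "bob", "type": "WORKS_FOR"}]} Hope that helps.'
)
parsed = parse_extraction(raw)
print(parsed["relationships"])
```

Returning an empty result on malformed output, instead of raising, is one possible design; a pipeline with retry logic could equally raise here so the chunk is attempted again.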

Integration with DynamicKnowledgeGraphBuilder

Extracted triples feed into the DynamicKnowledgeGraphBuilder class located in graphrag_agent/search/tool/reasoning/kg_builder.py. This builder receives the extractor output and incrementally constructs an in-memory networkx.DiGraph:

kg_builder = DynamicKnowledgeGraphBuilder(
    graph=neo4j_driver,
    entity_relation_extractor=extractor,
)
subgraph = kg_builder.build_query_graph(query, seed_entities)

The builder expands the graph by querying the Neo4j backend for neighboring entities, linking seed entities from user queries with extractor-derived entities via discovered relationships. Finally, the GraphWriter utility (used in incremental_graph_builder.py and build_graph.py) persists the constructed graph to the Neo4j database.
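The expansion around seed entities can be pictured as a bounded breadth-first search. In this sketch a plain dictionary stands in for both the Neo4j neighbor lookup and the networkx.DiGraph container, and expand_subgraph is a hypothetical function, not the builder's API.

```python
from collections import deque


def expand_subgraph(neighbors, seeds, depth=2):
    """Breadth-first expansion from seed entities, up to `depth` hops.

    `neighbors` maps an entity id to a list of (relation_type, target_id) pairs.
    Returns (nodes, edges), where edges are (source, relation_type, target) triples.
    """
    nodes = set(seeds)
    edges = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the hop limit
        for rel, target in neighbors.get(node, []):
            edges.append((node, rel, target))
            if target not in nodes:
                nodes.add(target)
                queue.append((target, dist + 1))
    return nodes, edges


# Toy adjacency data; in the real system this would come from graph queries.
toy_neighbors = {
    "Alice": [("WORKS_FOR", "Acme")],
    "Acme": [("PART_OF", "Globex")],
    "Globex": [("PART_OF", "MegaCorp")],
}
nodes, edges = expand_subgraph(toy_neighbors, ["Alice"], depth=2)
print(sorted(nodes))
```

With depth=2 the search stops at Globex, so MegaCorp, three hops away from the seed, is left out of the subgraph.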

Code Implementation Examples

Instantiating the Extractor

The following implementation demonstrates configuring the extractor with custom entity types and relationship schemas:

from graphrag_agent.graph.extraction import EntityRelationExtractor
from langchain_openai import ChatOpenAI

# Configure LLM and prompt templates
# gpt-4o-mini is a chat model, so use the chat wrapper rather than the completion-style OpenAI class

llm = ChatOpenAI(model="gpt-4o-mini")
system_tpl = """You are an expert KG extractor. Identify entities of type PERSON, ORG, EVENT and the relations WORKS_FOR, PART_OF in the supplied text. Return a JSON object with two lists: "entities" and "relationships"."""
human_tpl = "{text}"

# Initialize extractor with caching and parallelism

extractor = EntityRelationExtractor(
    llm=llm,
    system_template=system_tpl,
    human_template=human_tpl,
    entity_types=["PERSON", "ORG", "EVENT"],
    relationship_types=["WORKS_FOR", "PART_OF"],
    cache_dir="./cache/graph",
    max_workers=4,
    batch_size=5,
)

# Process document chunks

file_contents = [("doc.txt", "utf-8", ["Alice works for Acme Corp."])]
results = extractor.process_chunks(file_contents)

Building and Persisting the Knowledge Graph

After extraction, integrate with the graph builder and Neo4j backend:

from graphrag_agent.search.tool.reasoning.kg_builder import DynamicKnowledgeGraphBuilder
from graphrag_agent.integrations.build.incremental_graph_builder import GraphWriter
from neo4j import GraphDatabase

# Establish Neo4j connection

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Configure builder with extractor instance

kg_builder = DynamicKnowledgeGraphBuilder(
    graph=driver,
    entity_relation_extractor=extractor,
)

# Generate subgraph around seed entities

subgraph = kg_builder.build_query_graph(
    query="What projects is Alice involved in?",
    entities=["Alice"],
    depth=2,
)

# Persist to database

writer = GraphWriter(driver)
writer.write_graph(subgraph)

Performance Optimizations

LLM-Centric Design: By delegating extraction to language models rather than regex patterns, the system handles open-ended text without hard-coded rules. Prompt engineering constrains output format while maintaining flexibility for diverse domains.

Deterministic Caching: The hash-based cache ensures idempotent processing. Re-running the pipeline on identical documents costs only disk I/O, not LLM tokens.

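Concurrent Execution: Because LLM calls are network-bound, ThreadPoolExecutor with configurable max_workers overlaps requests that would otherwise wait on one another. The max_workers and batch_size settings bound how many requests are in flight, which helps the pipeline stay within API rate limits.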

Streaming Architecture: The stream_process_large_files method processes documents chunk by chunk, so memory use is bounded by chunk size rather than file size. This enables ingestion of multi-gigabyte document corpora.

Summary

The EntityRelationExtractor class in entity_extractor.py orchestrates LLM-driven extraction using configurable prompt templates for entities and relationships.
Deterministic hashing eliminates redundant processing through pickle-based caching stored in cache_dir.
ThreadPoolExecutor enables parallel processing of chunks with automatic retry logic for failed extractions.
Extracted triples flow into DynamicKnowledgeGraphBuilder for incremental graph construction before persistence via GraphWriter.
Streaming methods support processing of multi-GB documents without loading entire files into memory.

Frequently Asked Questions

What LLM models work best with EntityRelationExtractor?

The extractor accepts any LangChain-compatible LLM, including OpenAI's GPT-4, Anthropic's Claude, or local models via Ollama. The system relies on the model's ability to follow structured output instructions in the system prompt, returning valid JSON with entities and relationships keys. Higher-capacity models generally produce more accurate relationship extraction, though smaller models like GPT-4o-mini suffice for straightforward entity typing tasks.

How does the caching mechanism handle identical text chunks?

The _generate_cache_key method computes a deterministic hash of each text chunk using the generate_hash utility. When process_chunks encounters a hash matching an existing .pkl file in cache_dir, it loads the pickled result via _load_from_cache rather than invoking the LLM. This guarantees that reprocessing the same document or overlapping chunks costs only disk I/O, not API calls.

Can the extractor handle multi-GB documents?

Yes. The stream_process_large_files method (line 371) implements lazy chunk yielding that processes documents incrementally. By yielding chunks one at a time and maintaining the same hash-based cache checks, the system processes arbitrarily large files without loading them entirely into RAM. This streaming architecture supports enterprise-scale document ingestion pipelines.

How are extracted entities mapped to Neo4j nodes?

The DynamicKnowledgeGraphBuilder receives extraction results and constructs a networkx.DiGraph in memory, where entities become nodes and relationships become directed edges. The GraphWriter utility then maps these graph elements to Neo4j's property graph model, creating nodes with labels corresponding to entity_types and relationships with types matching the extracted relationship_types. The builder also queries existing Neo4j data to link new extractions with established graph structures.
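One way such a mapping could be expressed is a parameterized Cypher MERGE per entity. The sketch below only builds the query text and parameters and does not touch a database; entity_to_cypher and its allow-list check are illustrative assumptions and do not describe how GraphWriter is implemented.

```python
def entity_to_cypher(entity, allowed_labels):
    """Build a parameterized MERGE statement for one extracted entity."""
    label = entity["label"]
    # The label is interpolated into the query text here, so it is checked
    # against an allow-list first; property values travel as parameters.
    if label not in allowed_labels:
        raise ValueError(f"unexpected label: {label!r}")
    query = f"MERGE (n:`{label}` {{id: $id}}) SET n.description = $description"
    params = {"id": entity["id"], "description": entity.get("description", "")}
    return query, params


query, params = entity_to_cypher(
    {"id": "alice", "label": "PERSON", "description": "An employee"},
    allowed_labels={"PERSON", "ORG", "EVENT"},
)
print(query)
print(params)
```

Using MERGE on a stable id keeps repeated runs from creating duplicate nodes, which fits the idempotent behaviour the cache is meant to provide.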
