How Entity Extraction Works for Knowledge Graph Construction in Graph-RAG Agent
Entity extraction in the Graph-RAG agent uses an LLM-driven EntityRelationExtractor class to parse raw text into structured entities and relationships, with deterministic caching and parallel processing to build scalable knowledge graphs.
The graph-rag-agent repository implements a complete pipeline for transforming unstructured documents into queryable knowledge graphs. At the heart of this system lies the entity extraction mechanism, which identifies semantic entities and their relationships before persisting them to Neo4j.
The EntityRelationExtractor Architecture
The extraction workflow centers on the EntityRelationExtractor class defined in graphrag_agent/graph/extraction/entity_extractor.py. This orchestrator manages LLM interactions, result caching, and concurrent processing to convert text chunks into structured graph triples.
Initialization and Prompt Engineering
The extractor initializes by composing a LangChain pipeline that binds system instructions, chat history, and human prompts to an LLM instance. According to the source code in EntityRelationExtractor.__init__ (line 22), the constructor accepts:
- An LLM instance (any LangChain-compatible model)
- System and human prompt templates
- Predefined entity_types and relationship_types lists
These parameters form a ChatPromptTemplate and a processing chain:
self.chain = self.chat_prompt | self.llm
The prompt templates instruct the LLM to output JSON containing two lists, "entities" and "relationships", which gives the parser a consistent shape to expect across varied input text.
Deterministic Caching Strategy
To eliminate redundant LLM calls, the system implements hash-based caching. The _generate_cache_key method (line 77) creates deterministic hashes for each text chunk using generate_hash. Results persist as pickled objects at <cache_dir>/<hash>.pkl via _save_to_cache and _load_from_cache (lines 101-139).
When process_chunks encounters a previously processed chunk, it loads the structured result from disk rather than invoking the LLM. As long as the cache directory is intact, text that has already been processed does not trigger another API call, which significantly reduces cost during iterative graph development.
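The cache-then-compute pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: cache_key, load_or_compute, and fake_extract are hypothetical names, and SHA-256 is an assumption standing in for whatever generate_hash actually uses.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_key(text: str) -> str:
    """Deterministic key: the same text always maps to the same hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_or_compute(text, cache_dir, compute):
    """Return the cached result for `text`, calling `compute` only on a miss."""
    path = Path(cache_dir) / f"{cache_key(text)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute(text)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def fake_extract(text):
    # Hypothetical stand-in for the LLM call; records every invocation.
    calls.append(text)
    return {"entities": [], "relationships": []}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute("Alice works for Acme Corp.", tmp, fake_extract)
    second = load_or_compute("Alice works for Acme Corp.", tmp, fake_extract)
```

The second call is served from disk, so fake_extract runs only once. Because unpickling can execute arbitrary code, a cache directory like this should only ever contain files the pipeline itself wrote.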
Parallel Processing Pipeline
The process_chunks method (line 45) orchestrates batch extraction using ThreadPoolExecutor. The workflow:
- Generates cache keys for all incoming chunks
- Filters out cached results
- Processes uncached chunks concurrently across configurable max_workers
- Implements automatic retry logic (up to three attempts) for failed chunks
- Fires progress callbacks for real-time monitoring
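The workflow above can be sketched as a thread pool wrapped around a retrying call. This is an illustrative sketch under stated assumptions: extract_with_retry, process_parallel, and flaky_extract are hypothetical names, and the real process_chunks may structure retries and callbacks differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_with_retry(chunk, extract, max_attempts=3):
    """Call `extract(chunk)`, retrying on any exception up to `max_attempts` times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return extract(chunk)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
    raise last_error


def process_parallel(chunks, extract, max_workers=4, on_progress=None):
    """Run extraction concurrently and return results in input order."""
    results = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(extract_with_retry, chunk, extract): index
            for index, chunk in enumerate(chunks)
        }
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress:
                on_progress(done, len(chunks))
    return results


attempts = {}
lock = threading.Lock()


def flaky_extract(chunk):
    # Hypothetical extractor that fails on the first attempt for each chunk.
    with lock:
        attempts[chunk] = attempts.get(chunk, 0) + 1
        attempt = attempts[chunk]
    if attempt == 1:
        raise RuntimeError("simulated transient failure")
    return chunk.upper()


progress = []
output = process_parallel(
    ["a", "b", "c"],
    flaky_extract,
    max_workers=2,
    on_progress=lambda done, total: progress.append((done, total)),
)
```

Results are written back by index, so the output order matches the input order even though futures complete in arbitrary order.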
For memory-constrained environments, stream_process_large_files (line 371) yields chunks lazily while maintaining the same caching logic, enabling processing of multi-gigabyte documents without loading entire files into RAM.
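Lazy chunk yielding of this kind is usually built on a generator. The sketch below is an assumption about the general technique, not the body of stream_process_large_files: iter_chunks is a hypothetical helper, and fixed-size character windows stand in for whatever chunking rule the repository applies.

```python
import io


def iter_chunks(stream, chunk_size=1000):
    """Yield successive pieces of at most `chunk_size` characters from a text stream."""
    while True:
        piece = stream.read(chunk_size)
        if not piece:
            break
        yield piece


# An in-memory stream stands in for a large file opened in text mode.
stream = io.StringIO("abcdefghij")
chunks = list(iter_chunks(stream, chunk_size=4))
```

Only one piece is held in memory at a time when the generator is consumed in a loop, so each piece can be checked against the cache and processed before the next one is read.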
From Raw Text to Structured Triples
Chunk-Level Processing
Individual text segments flow through _process_single_chunk (line 335), which:
- Sends the chunk through the LangChain pipeline
- Receives raw string output encoding entities and relationships
- Parses the response into structured Python dictionaries containing entity IDs, labels, descriptions, and relationship mappings
Each processed chunk returns a tuple with the file metadata and a dictionary separating entities (with id, label, description fields) from relationships (with source, target, type fields).
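The parsing step can be illustrated with a tolerant JSON parser. This is a sketch that assumes the model replies with a JSON object using the field names listed above; parse_extraction and _as_list are hypothetical helpers, and the repository's parser may handle a different raw format.

```python
import json

EMPTY = {"entities": [], "relationships": []}


def _as_list(value):
    """Treat anything that is not a list as an empty list."""
    return value if isinstance(value, list) else []


def parse_extraction(raw: str) -> dict:
    """Parse an LLM reply into {'entities': [...], 'relationships': [...]}."""
    # Models sometimes add prose around the JSON, so keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return dict(EMPTY)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY)
    entities = [
        e for e in _as_list(data.get("entities"))
        if isinstance(e, dict) and isinstance(e.get("id"), str)
    ]
    known = {e["id"] for e in entities}
    # Drop relationships that point at entities the model never declared.
    relationships = [
        r for r in _as_list(data.get("relationships"))
        if isinstance(r, dict)
        and isinstance(r.get("source"), str)
        and isinstance(r.get("target"), str)
        and r["source"] in known
        and r["target"] in known
    ]
    return {"entities": entities, "relationships": relationships}


raw_reply = (
    'Here is the result:\n'
    '{"entities": ['
    '{"id": "alice", "label": "PERSON", "description": "An employee"},'
    '{"id": "acme", "label": "ORG", "description": "A company"}],'
    ' "relationships": ['
    '{"source": "alice", "target": "acme", "type": "WORKS_FOR"},'
    '{"source": "alice", "target": "bob", "type": "PART_OF"}]}'
)
parsed = parse_extraction(raw_reply)
```

Returning an empty result on malformed output, instead of raising, lets the caller decide whether to retry the chunk.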
Integration with DynamicKnowledgeGraphBuilder
Extracted triples feed into the DynamicKnowledgeGraphBuilder class located in graphrag_agent/search/tool/reasoning/kg_builder.py. This builder receives the extractor output and incrementally constructs an in-memory networkx.DiGraph:
kg_builder = DynamicKnowledgeGraphBuilder(
    graph=neo4j_driver,
    entity_relation_extractor=extractor,
)
subgraph = kg_builder.build_query_graph(query, seed_entities)
The builder expands the graph by querying the Neo4j backend for neighboring entities, linking seed entities from user queries with extractor-derived entities via discovered relationships. Finally, the GraphWriter utility (used in incremental_graph_builder.py and build_graph.py) persists the constructed graph to the Neo4j database.
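To make the expansion step concrete, here is a minimal sketch of depth-limited neighbor expansion over extracted triples. It is not the builder's implementation: a plain adjacency dictionary stands in for both networkx.DiGraph and the Neo4j neighbor queries, and build_adjacency and expand are hypothetical names.

```python
from collections import defaultdict, deque


def build_adjacency(relationships):
    """Map each source id to a list of (relation type, target id) edges."""
    graph = defaultdict(list)
    for rel in relationships:
        graph[rel["source"]].append((rel["type"], rel["target"]))
    return graph


def expand(graph, seeds, depth):
    """Breadth-first expansion: collect nodes within `depth` hops of the seeds."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, distance = frontier.popleft()
        if distance == depth:
            continue  # do not expand beyond the requested depth
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, distance + 1))
    return seen


relationships = [
    {"source": "alice", "target": "acme", "type": "WORKS_FOR"},
    {"source": "acme", "target": "holding", "type": "PART_OF"},
    {"source": "holding", "target": "group", "type": "PART_OF"},
]
graph = build_adjacency(relationships)
subgraph_nodes = expand(graph, ["alice"], depth=2)
```

With a depth of 2, the node three hops away from the seed is excluded, which is the kind of bound a depth setting places on how far a query subgraph grows.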
Code Implementation Examples
Instantiating the Extractor
The following implementation demonstrates configuring the extractor with custom entity types and relationship schemas:
from graphrag_agent.graph.extraction import EntityRelationExtractor
from langchain_openai import ChatOpenAI
# Configure LLM and prompt templates (gpt-4o-mini is a chat model, so use the chat wrapper)
llm = ChatOpenAI(model="gpt-4o-mini")
system_tpl = """You are an expert KG extractor. Identify entities of type PERSON, ORG, EVENT and the relations WORKS_FOR, PART_OF in the supplied text. Return a JSON object with two lists: "entities" and "relationships"."""
human_tpl = "{text}"
# Initialize extractor with caching and parallelism
extractor = EntityRelationExtractor(
    llm=llm,
    system_template=system_tpl,
    human_template=human_tpl,
    entity_types=["PERSON", "ORG", "EVENT"],
    relationship_types=["WORKS_FOR", "PART_OF"],
    cache_dir="./cache/graph",
    max_workers=4,
    batch_size=5,
)
# Process document chunks
file_contents = [("doc.txt", "utf-8", ["Alice works for Acme Corp."])]
results = extractor.process_chunks(file_contents)
Building and Persisting the Knowledge Graph
After extraction, integrate with the graph builder and Neo4j backend:
from graphrag_agent.search.tool.reasoning.kg_builder import DynamicKnowledgeGraphBuilder
from graphrag_agent.integrations.build.incremental_graph_builder import GraphWriter
from neo4j import GraphDatabase
# Establish Neo4j connection
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# Configure builder with extractor instance
kg_builder = DynamicKnowledgeGraphBuilder(
    graph=driver,
    entity_relation_extractor=extractor,
)
# Generate subgraph around seed entities
subgraph = kg_builder.build_query_graph(
    query="What projects is Alice involved in?",
    entities=["Alice"],
    depth=2,
)
# Persist to database
writer = GraphWriter(driver)
writer.write_graph(subgraph)
Performance Optimizations
LLM-Centric Design: By delegating extraction to language models rather than regex patterns, the system handles open-ended text without hard-coded rules. Prompt engineering constrains output format while maintaining flexibility for diverse domains.
Deterministic Caching: The hash-based cache ensures idempotent processing. Re-running the pipeline on identical documents costs only disk I/O, not LLM tokens.
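Concurrent Execution: ThreadPoolExecutor with configurable max_workers overlaps LLM requests, which are I/O-bound, so throughput depends on how many calls are in flight rather than on CPU core count. Batching controls such as batch_size bound how much work is submitted at once, which helps stay within API rate limits.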
Streaming Architecture: The stream_process_large_files method processes documents chunk-by-chunk, maintaining constant memory footprint regardless of file size. This enables enterprise-scale ingestion of multi-gigabyte document corpora.
Summary
- The EntityRelationExtractor class in entity_extractor.py orchestrates LLM-driven extraction using configurable prompt templates for entities and relationships.
- Deterministic hashing eliminates redundant processing through pickle-based caching stored in cache_dir.
- ThreadPoolExecutor enables parallel processing of chunks with automatic retry logic for failed extractions.
- Extracted triples flow into DynamicKnowledgeGraphBuilder for incremental graph construction before persistence via GraphWriter.
- Streaming methods support processing of multi-GB documents without loading entire files into memory.
Frequently Asked Questions
What LLM models work best with EntityRelationExtractor?
The extractor accepts any LangChain-compatible LLM, including OpenAI's GPT-4, Anthropic's Claude, or local models via Ollama. The system relies on the model's ability to follow structured output instructions in the system prompt, returning valid JSON with entities and relationships keys. Higher-capacity models generally produce more accurate relationship extraction, though smaller models like GPT-4o-mini suffice for straightforward entity typing tasks.
How does the caching mechanism handle identical text chunks?
The _generate_cache_key method computes a deterministic hash of each text chunk using the generate_hash utility. When process_chunks encounters a hash matching an existing .pkl file in cache_dir, it loads the pickled result via _load_from_cache rather than invoking the LLM. This guarantees that reprocessing the same document or overlapping chunks costs only disk I/O, not API calls.
Can the extractor handle multi-GB documents?
Yes. The stream_process_large_files method (line 371) implements lazy chunk yielding that processes documents incrementally. By yielding chunks one at a time and maintaining the same hash-based cache checks, the system processes arbitrarily large files without loading them entirely into RAM. This streaming architecture supports enterprise-scale document ingestion pipelines.
How are extracted entities mapped to Neo4j nodes?
The DynamicKnowledgeGraphBuilder receives extraction results and constructs a networkx.DiGraph in memory, where entities become nodes and relationships become directed edges. The GraphWriter utility then maps these graph elements to Neo4j's property graph model, creating nodes with labels corresponding to entity_types and relationships with types matching the extracted relationship_types. The builder also queries existing Neo4j data to link new extractions with established graph structures.