# How Entity Extraction Works for Knowledge Graph Construction in Graph-RAG Agent

> Learn how entity extraction powers knowledge graph construction in Graph-RAG agents. Discover LLM-driven parsing, deterministic caching, and parallel processing for scalable knowledge graphs.

- Repository: [GLK/graph-rag-agent](https://github.com/1517005260/graph-rag-agent)
- Tags: deep-dive
- Published: 2026-02-22

---

**Entity extraction in the Graph-RAG agent uses an LLM-driven `EntityRelationExtractor` class to parse raw text into structured entities and relationships, with deterministic caching and parallel processing to build scalable knowledge graphs.**

The `graph-rag-agent` repository implements a complete pipeline for transforming unstructured documents into queryable knowledge graphs. At the heart of this system lies the entity extraction mechanism, which identifies semantic entities and their relationships before persisting them to Neo4j.

## The EntityRelationExtractor Architecture

The extraction workflow centers on the `EntityRelationExtractor` class defined in [`graphrag_agent/graph/extraction/entity_extractor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/graph/extraction/entity_extractor.py). This orchestrator manages LLM interactions, result caching, and concurrent processing to convert text chunks into structured graph triples.

### Initialization and Prompt Engineering

The extractor initializes by composing a LangChain pipeline that binds system instructions, chat history, and human prompts to an LLM instance. According to the source code in `EntityRelationExtractor.__init__` (line 22), the constructor accepts:

- An LLM instance (any LangChain-compatible model)
- System and human prompt templates
- Predefined `entity_types` and `relationship_types` lists

These parameters form a `ChatPromptTemplate` and a processing chain:

```python
self.chain = self.chat_prompt | self.llm
```

The prompt templates guide the LLM to output JSON containing two lists: `"entities"` and `"relationships"`, ensuring consistent parsing regardless of input text variability.

### Deterministic Caching Strategy

To eliminate redundant LLM calls, the system implements hash-based caching. The `_generate_cache_key` method (line 77) creates deterministic hashes for each text chunk using `generate_hash`. Results persist as pickled objects at `<cache_dir>/<hash>.pkl` via `_save_to_cache` and `_load_from_cache` (lines 101-139).

When `process_chunks` encounters a previously processed chunk, it loads the structured result from disk rather than invoking the LLM. As long as the cache directory is preserved, identical text does not trigger a second API call, which significantly reduces cost during iterative graph development.
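The cache lookup described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: SHA-256 stands in for `generate_hash` (whose algorithm is not shown here), and `cache_key`, `load_or_compute`, and `fake_extract` are hypothetical names.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_key(text: str) -> str:
    """Deterministic key: the same text always maps to the same hex digest."""
    # Assumption: SHA-256 stands in for the repository's generate_hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_or_compute(text, cache_dir, compute):
    """Return the cached result for `text`, calling `compute` only on a miss."""
    path = Path(cache_dir) / f"{cache_key(text)}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute(text)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def fake_extract(text):
    # Hypothetical stand-in for the LLM call; records each invocation.
    calls.append(text)
    return {"entities": [], "relationships": []}


cache_dir = tempfile.mkdtemp()
first = load_or_compute("Alice works for Acme Corp.", cache_dir, fake_extract)
second = load_or_compute("Alice works for Acme Corp.", cache_dir, fake_extract)
```

The second call is served from the `.pkl` file, so `fake_extract` runs only once. Pickle files should only be loaded from a cache directory you control, since unpickling untrusted data can execute arbitrary code.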

### Parallel Processing Pipeline

The `process_chunks` method (line 45) orchestrates batch extraction using `ThreadPoolExecutor`. The workflow:

1. Generates cache keys for all incoming chunks
2. Filters out cached results
3. Processes uncached chunks concurrently across configurable `max_workers`
4. Implements automatic retry logic (up to three attempts) for failed chunks
5. Fires progress callbacks for real-time monitoring
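
The concurrency, retry, and progress steps above might look roughly like the following. This is a simplified sketch under stated assumptions: `with_retries`, `process_in_parallel`, and `flaky_extract` are hypothetical names, and the repository's actual retry and callback behavior may differ.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(fn, arg, max_attempts=3):
    """Call fn(arg), retrying on any exception up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn(arg)
        except Exception as exc:
            last_error = exc
    raise last_error


def process_in_parallel(chunks, extract, max_workers=4, on_progress=None):
    """Run `extract` over chunks concurrently, preserving input order."""
    results = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_retries, extract, chunk): index
            for index, chunk in enumerate(chunks)
        }
        for done, future in enumerate(as_completed(futures), start=1):
            # result() re-raises if a chunk failed on every attempt.
            results[futures[future]] = future.result()
            if on_progress:
                on_progress(done, len(chunks))
    return results


attempts = {}
lock = threading.Lock()


def flaky_extract(chunk):
    # Hypothetical extractor that fails on the first attempt for each chunk.
    with lock:
        attempts[chunk] = attempts.get(chunk, 0) + 1
        first_try = attempts[chunk] == 1
    if first_try:
        raise RuntimeError("transient failure")
    return chunk.upper()


progress = []
out = process_in_parallel(
    ["a", "b", "c"],
    flaky_extract,
    max_workers=2,
    on_progress=lambda done, total: progress.append((done, total)),
)
```

Threads suit this workload because each task spends most of its time waiting on a network response rather than computing.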

For memory-constrained environments, `stream_process_large_files` (line 371) yields chunks lazily while maintaining the same caching logic, enabling processing of multi-gigabyte documents without loading entire files into RAM.
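
Lazy chunk yielding of this kind can be illustrated with a generator. The sketch below assumes fixed-size character chunks, which may not match how `stream_process_large_files` chooses chunk boundaries; `iter_chunks` and `chunk_chars` are hypothetical names.

```python
import os
import tempfile


def iter_chunks(path, chunk_chars=1000, encoding="utf-8"):
    """Yield text chunks one at a time so only one chunk is held in memory."""
    with open(path, "r", encoding=encoding) as fh:
        while True:
            chunk = fh.read(chunk_chars)
            if not chunk:
                break
            yield chunk


# Demo on a small temporary file; the same loop works for very large files.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("x" * 2500)

sizes = [len(chunk) for chunk in iter_chunks(tmp.name, chunk_chars=1000)]
os.remove(tmp.name)
```

Each yielded chunk could then pass through the same cache check and extraction call as in batch mode.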

## From Raw Text to Structured Triples

### Chunk-Level Processing

Individual text segments flow through `_process_single_chunk` (line 335), which:

- Sends the chunk through the LangChain pipeline
- Receives raw string output encoding entities and relationships
- Parses the response into structured Python dictionaries containing entity IDs, labels, descriptions, and relationship mappings

Each processed chunk returns a tuple with the file metadata and a dictionary separating `entities` (with `id`, `label`, `description` fields) from `relationships` (with `source`, `target`, `type` fields).
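
To make the parsing step concrete, here is a hedged sketch that turns a raw model response into the entity and relationship dictionaries described above. It assumes the model returns a JSON object, possibly surrounded by extra prose; `parse_extraction` is a hypothetical helper, not the repository's parser.

```python
import json

EMPTY = {"entities": [], "relationships": []}


def parse_extraction(raw: str) -> dict:
    """Parse raw LLM output into entity and relationship dictionaries."""
    # Models sometimes wrap JSON in prose, so keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY)
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY)
    entities = [
        {
            "id": e["id"],
            "label": e.get("label", ""),
            "description": e.get("description", ""),
        }
        for e in (data.get("entities") or [])
        if isinstance(e, dict) and "id" in e
    ]
    known = {e["id"] for e in entities}
    # Drop relationships whose endpoints were not extracted as entities.
    relationships = [
        {"source": r["source"], "target": r["target"], "type": r.get("type", "")}
        for r in (data.get("relationships") or [])
        if isinstance(r, dict)
        and r.get("source") in known
        and r.get("target") in known
    ]
    return {"entities": entities, "relationships": relationships}


raw = (
    'Here is the result: {"entities": ['
    '{"id": "alice", "label": "PERSON", "description": "An employee"}, '
    '{"id": "acme", "label": "ORG"}], "relationships": ['
    '{"source": "alice", "target": "acme", "type": "WORKS_FOR"}, '
    '{"source": "alice", "target": "ghost", "type": "PART_OF"}]}'
)
parsed = parse_extraction(raw)
```

Filtering out relationships with unknown endpoints is a design choice made for this sketch; a real pipeline might instead create placeholder entities for them.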

### Integration with DynamicKnowledgeGraphBuilder

Extracted triples feed into the `DynamicKnowledgeGraphBuilder` class located in [`graphrag_agent/search/tool/reasoning/kg_builder.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/search/tool/reasoning/kg_builder.py). This builder receives the extractor output and incrementally constructs an in-memory `networkx.DiGraph`:

```python
kg_builder = DynamicKnowledgeGraphBuilder(
    graph=neo4j_driver,
    entity_relation_extractor=extractor,
)
subgraph = kg_builder.build_query_graph(query, seed_entities)
```

The builder expands the graph by querying the Neo4j backend for neighboring entities, linking seed entities from user queries with extractor-derived entities via discovered relationships. Finally, the `GraphWriter` utility (used in [`incremental_graph_builder.py`](https://github.com/1517005260/graph-rag-agent/blob/main/incremental_graph_builder.py) and [`build_graph.py`](https://github.com/1517005260/graph-rag-agent/blob/main/build_graph.py)) persists the constructed graph to the Neo4j database.
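
The idea of growing a directed graph outward from seed entities can be shown without any dependencies. The sketch below uses plain dictionaries in place of `networkx.DiGraph` and skips the Neo4j neighbor queries entirely; `build_digraph` and `expand_from_seeds` are hypothetical names, and the real builder's expansion logic may differ.

```python
from collections import deque


def build_digraph(entities, relationships):
    """Return (nodes, edges): nodes keyed by ID, edges as adjacency lists."""
    nodes = {e["id"]: e for e in entities}
    edges = {node_id: [] for node_id in nodes}
    for rel in relationships:
        if rel["source"] in nodes and rel["target"] in nodes:
            edges[rel["source"]].append((rel["target"], rel["type"]))
    return nodes, edges


def expand_from_seeds(edges, seeds, depth):
    """Collect node IDs reachable from the seeds within `depth` outgoing hops."""
    seen = {s for s in seeds if s in edges}
    frontier = deque((s, 0) for s in seen)
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for target, _rel_type in edges[node]:
            if target not in seen:
                seen.add(target)
                frontier.append((target, dist + 1))
    return seen


entities = [
    {"id": "alice", "label": "PERSON"},
    {"id": "acme", "label": "ORG"},
    {"id": "holding", "label": "ORG"},
    {"id": "group", "label": "ORG"},
]
relationships = [
    {"source": "alice", "target": "acme", "type": "WORKS_FOR"},
    {"source": "acme", "target": "holding", "type": "PART_OF"},
    {"source": "holding", "target": "group", "type": "PART_OF"},
]
nodes, edges = build_digraph(entities, relationships)
two_hops = expand_from_seeds(edges, ["alice"], depth=2)
one_hop = expand_from_seeds(edges, ["alice"], depth=1)
```

With a depth of 2 the expansion stops at `holding`, so `group` stays outside the subgraph.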

## Code Implementation Examples

### Instantiating the Extractor

The following implementation demonstrates configuring the extractor with custom entity types and relationship schemas:

```python
from graphrag_agent.graph.extraction import EntityRelationExtractor
from langchain_openai import ChatOpenAI

# Configure the LLM and prompt templates.
# gpt-4o-mini is a chat model, so it needs the chat wrapper rather than
# the legacy completion-style `OpenAI` class.
llm = ChatOpenAI(model="gpt-4o-mini")
system_tpl = """You are an expert KG extractor. Identify entities of type PERSON, ORG, EVENT and the relations WORKS_FOR, PART_OF in the supplied text. Return a JSON object with two lists: "entities" and "relationships"."""
human_tpl = "{text}"

# Initialize the extractor with caching and parallelism settings.
extractor = EntityRelationExtractor(
    llm=llm,
    system_template=system_tpl,
    human_template=human_tpl,
    entity_types=["PERSON", "ORG", "EVENT"],
    relationship_types=["WORKS_FOR", "PART_OF"],
    cache_dir="./cache/graph",
    max_workers=4,
    batch_size=5,
)

# Process document chunks: each item is (filename, encoding, list of chunks).
file_contents = [("doc.txt", "utf-8", ["Alice works for Acme Corp."])]
results = extractor.process_chunks(file_contents)
```

### Building and Persisting the Knowledge Graph

After extraction, integrate with the graph builder and Neo4j backend:

```python
# Import path follows the kg_builder.py location given earlier in this article.
from graphrag_agent.search.tool.reasoning.kg_builder import DynamicKnowledgeGraphBuilder
from graphrag_agent.integrations.build.incremental_graph_builder import GraphWriter
from neo4j import GraphDatabase

# Establish the Neo4j connection.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Configure the builder with the extractor instance created above.
kg_builder = DynamicKnowledgeGraphBuilder(
    graph=driver,
    entity_relation_extractor=extractor,
)

# Generate a subgraph around the seed entities.
subgraph = kg_builder.build_query_graph(
    query="What projects is Alice involved in?",
    entities=["Alice"],
    depth=2,
)

# Persist the subgraph to the database, then release the connection.
writer = GraphWriter(driver)
writer.write_graph(subgraph)
driver.close()
```

## Performance Optimizations

**LLM-Centric Design**: By delegating extraction to language models rather than regex patterns, the system handles open-ended text without hard-coded rules. Prompt engineering constrains output format while maintaining flexibility for diverse domains.

**Deterministic Caching**: The hash-based cache ensures idempotent processing. Re-running the pipeline on identical documents costs only disk I/O, not LLM tokens.

**Concurrent Execution**: `ThreadPoolExecutor` with configurable `max_workers` raises throughput by overlapping network-bound LLM requests, while batching controls help keep request volume within API rate limits.

**Streaming Architecture**: The `stream_process_large_files` method processes documents chunk-by-chunk, so memory use depends on chunk size rather than file size. This makes ingestion of multi-gigabyte document corpora practical.

## Summary

- The `EntityRelationExtractor` class in [`entity_extractor.py`](https://github.com/1517005260/graph-rag-agent/blob/main/graphrag_agent/graph/extraction/entity_extractor.py) orchestrates LLM-driven extraction using configurable prompt templates for entities and relationships.
- Deterministic hashing eliminates redundant processing through pickle-based caching stored in `cache_dir`.
- `ThreadPoolExecutor` enables parallel processing of chunks with automatic retry logic for failed extractions.
- Extracted triples flow into `DynamicKnowledgeGraphBuilder` for incremental graph construction before persistence via `GraphWriter`.
- Streaming methods support processing of multi-GB documents without loading entire files into memory.

## Frequently Asked Questions

### What LLM models work best with EntityRelationExtractor?

The extractor accepts any LangChain-compatible LLM, including OpenAI's GPT-4, Anthropic's Claude, or local models via Ollama. The system relies on the model's ability to follow structured output instructions in the system prompt, returning valid JSON with `entities` and `relationships` keys. Higher-capacity models generally produce more accurate relationship extraction, though smaller models like GPT-4o-mini suffice for straightforward entity typing tasks.

### How does the caching mechanism handle identical text chunks?

The `_generate_cache_key` method computes a deterministic hash of each text chunk using the `generate_hash` utility. When `process_chunks` encounters a hash matching an existing `.pkl` file in `cache_dir`, it loads the pickled result via `_load_from_cache` rather than invoking the LLM. Reprocessing the same document, or a chunk whose text is identical to one already seen, therefore costs only disk I/O, not API calls. Chunks that merely overlap but differ in content hash differently and are extracted again.

### Can the extractor handle multi-GB documents?

Yes. The `stream_process_large_files` method (line 371) implements lazy chunk yielding that processes documents incrementally. By yielding chunks one at a time and maintaining the same hash-based cache checks, the system processes arbitrarily large files without loading them entirely into RAM. This streaming architecture supports enterprise-scale document ingestion pipelines.

### How are extracted entities mapped to Neo4j nodes?

The `DynamicKnowledgeGraphBuilder` receives extraction results and constructs a `networkx.DiGraph` in memory, where entities become nodes and relationships become directed edges. The `GraphWriter` utility then maps these graph elements to Neo4j's property graph model, creating nodes with labels corresponding to `entity_types` and relationships with types matching the extracted `relationship_types`. The builder also queries existing Neo4j data to link new extractions with established graph structures.
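
One plausible way to express that mapping is to generate parameterized Cypher `MERGE` statements from the extracted dictionaries. This is only a sketch: how `GraphWriter` actually writes to Neo4j is not shown here, and `entity_to_cypher`, `relationship_to_cypher`, and `_safe_identifier` are hypothetical helpers. Cypher does not allow labels or relationship types to be passed as query parameters, so the sketch validates them before interpolating.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _safe_identifier(name: str) -> str:
    """Allow only plain identifiers, since labels cannot be query parameters."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe Cypher identifier: {name!r}")
    return name


def entity_to_cypher(entity):
    """Build an idempotent node upsert as (query, parameters)."""
    label = _safe_identifier(entity["label"])
    query = f"MERGE (n:{label} {{id: $id}}) SET n.description = $description"
    params = {"id": entity["id"], "description": entity.get("description", "")}
    return query, params


def relationship_to_cypher(rel):
    """Build an idempotent edge upsert between two existing nodes."""
    rel_type = _safe_identifier(rel["type"])
    query = (
        "MATCH (a {id: $source}), (b {id: $target}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"source": rel["source"], "target": rel["target"]}


node_query, node_params = entity_to_cypher(
    {"id": "alice", "label": "PERSON", "description": "An employee"}
)
edge_query, edge_params = relationship_to_cypher(
    {"source": "alice", "target": "acme", "type": "WORKS_FOR"}
)
```

Each `(query, params)` pair could then be passed to a Neo4j session's `run` method. Using `MERGE` rather than `CREATE` keeps repeated writes from duplicating nodes or edges.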