How HugeGraph AI Performs Knowledge Graph Construction from Text Using LLMs
HugeGraph AI constructs knowledge graphs from unstructured text by orchestrating a pipeline that chunks documents, prompts an LLM to extract entities and relationships, and applies regex-based parsing to generate standardized vertex and edge JSON structures.
Apache HugeGraph AI implements end-to-end knowledge graph construction from text through the GraphExtractFlow class, which combines document segmentation, prompt-driven LLM extraction, and schema validation to transform raw text into structured graph data. The implementation lives in the apache/incubator-hugegraph-ai repository and uses a node-based pipeline architecture defined in hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (lines 33-49).
Document Ingestion and Chunking Strategy
The extraction process begins with the GraphExtractFlow.prepare method, which populates a WkFlowInput object with raw texts, target schema definitions, and extraction mode parameters. This initialization step configures whether the pipeline operates in "triples" mode for basic extraction or "property_graph" mode for complex property assignments.
For large documents, the ChunkSplitNode defined in hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py segments input text into manageable paragraphs using the ChunkSplitter utility. This node ensures that downstream LLM calls remain within token limits while preserving semantic coherence across chunks. The chunked content flows into the shared WkFlowState object defined in hugegraph-llm/src/hugegraph_llm/state/ai_state.py, which maintains state across the pipeline execution.
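To make the chunking idea concrete, here is a minimal sketch of paragraph-level splitting. The function name split_into_paragraph_chunks and the max_chars parameter are hypothetical and chosen for illustration; this is not the ChunkSplitter API, and it uses a character budget as a rough stand-in for a token limit.

```python
import re
from typing import List


def split_into_paragraph_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Split on blank lines, then pack paragraphs into chunks of roughly max_chars.

    Hypothetical helper for illustration only. A single paragraph longer than
    max_chars is kept whole rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # Still within budget: keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk with the paragraph that did not fit.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Packing whole paragraphs keeps related sentences together, which is one simple way to preserve context within each chunk.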
Pipeline Architecture and Node Orchestration
The GraphExtractFlow orchestrates extraction through a Scheduler that manages a GPipeline instance from the PyCGraph framework. The scheduler registers specialized nodes including ExtractNode from hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py (lines 26-55), which serves as the bridge between workflow state management and the core extraction logic.
Each node follows a strict lifecycle: init → node_init → run. During the node_init phase, the ExtractNode instantiates the LLM client via get_chat_llm(llm_settings), ensuring that the language model backend is configured once per pipeline execution (lines 34-38). The actual business logic resides in the InfoExtract operator located in hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (lines 51-199).
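The lifecycle described above can be sketched in plain Python. SketchNode and run_pipeline are hypothetical names, and a dictionary stands in for the shared workflow state; the real nodes are managed by PyCGraph and are not reproduced here.

```python
from typing import Any, Dict, List


class SketchNode:
    """Hypothetical node with an init -> node_init -> run lifecycle."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ready = False
        self.state: Dict[str, Any] = {}

    def init(self, state: Dict[str, Any]) -> None:
        # Framework-level entry point: bind the shared state, then delegate.
        self.state = state
        self.node_init()
        self.ready = True

    def node_init(self) -> None:
        # One-time setup (for example, creating an LLM client) would go here.
        self.state.setdefault("log", []).append(f"{self.name}:node_init")

    def run(self) -> None:
        if not self.ready:
            raise RuntimeError("init() must be called before run()")
        self.state["log"].append(f"{self.name}:run")


def run_pipeline(nodes: List[SketchNode]) -> Dict[str, Any]:
    state: Dict[str, Any] = {}
    for node in nodes:  # initialise every node once
        node.init(state)
    for node in nodes:  # then execute in registration order
        node.run()
    return state
```

Separating node_init from run means expensive setup happens once per node, while run can be called repeatedly against fresh input.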
LLM-Powered Triple Extraction
Dynamic Prompt Generation
The InfoExtract.extract_triples_by_llm method generates dynamic prompts based on input context and schema availability. When a schema is provided, the method constructs a "real-result" prompt using generate_extract_triple_prompt(chunk, schema) that includes vertex labels, edge definitions, and property constraints. Without a schema, the system falls back to a generic extraction prompt containing only the text content (lines 82-87).
# From hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py#L82-L87
prompt = generate_extract_triple_prompt(chunk, schema)
if self.example_prompt:
    prompt = self.example_prompt + prompt
return self.llm.generate(prompt=prompt)
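For readers who want to experiment with the schema and no-schema branches outside the repository, the following is a minimal sketch. The function build_extract_prompt and its prompt wording are assumptions made for illustration; they do not reproduce generate_extract_triple_prompt.

```python
import json
from typing import Any, Dict, List, Optional


def build_extract_prompt(chunk: str,
                         schema: Optional[Dict[str, Any]] = None,
                         example_prompt: Optional[str] = None) -> str:
    """Hypothetical prompt builder; the wording is illustrative only."""
    parts: List[str] = []
    if example_prompt:
        # Few-shot examples, when supplied, are placed before the instructions.
        parts.append(example_prompt)
    parts.append("Extract (Subject, Predicate, Object) triples from the text below.")
    if schema:
        # With a schema, constrain the model to known labels and properties.
        parts.append("Only use labels and properties from this schema:")
        parts.append(json.dumps(schema, ensure_ascii=False))
    parts.append(f"Text:\n{chunk}")
    return "\n".join(parts)
```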
LLM Invocation and Response Handling
The BaseLLM.generate method executes the prompt against the configured backend, returning raw text containing extracted relationships in the format (Subject, Predicate, Object) - Label. This output requires no predefined ontology when operating in schema-less mode, allowing flexible extraction across diverse domains. The LLM instance is created during node initialization via the factory in hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py.
Regex-Based Post-Processing and Schema Validation
Schema-Less Triple Parsing
Following the LLM call, InfoExtract applies specialized regex parsers to structure the unstructured text output. For schema-less extraction, extract_triples_by_regex identifies patterns matching \((.*?), (.*?), (.*?)\) and appends raw triples to the workflow state (lines 88-92).
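The pattern quoted above can be tried directly with Python's re module. This is a small sketch, not the repository's function: parse_triples is a hypothetical name, and the sample LLM output is invented for illustration.

```python
import re
from typing import List, Tuple

# Same pattern as quoted in the text: three lazily matched, comma-separated fields.
TRIPLE_PATTERN = re.compile(r"\((.*?), (.*?), (.*?)\)")


def parse_triples(llm_output: str) -> List[Tuple[str, ...]]:
    """Return (subject, predicate, object) tuples found in raw LLM text."""
    return [tuple(part.strip() for part in match)
            for match in TRIPLE_PATTERN.findall(llm_output)]


sample = "(Alice, knows, Bob)\n(Bob, works_at, Acme)"
triples = parse_triples(sample)
```

Because the groups are lazy and the separator is a comma followed by a space, output that deviates from this exact spacing will simply not match, so malformed lines are skipped rather than raising an error.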
Schema-Aware Validation
When schema validation is enabled, extract_triples_by_regex_with_schema processes the extended format \((.*?), (.*?), (.*?)\) - ([^ ]*), in which each triple carries a trailing label. The method validates property-label pairs against the supplied schema, builds standardized vertex dictionaries with unique identifiers, and constructs edge records while merging duplicate entities (lines 94-149). This step ensures that extracted entities conform to the graph schema defined in WkFlowInput and filters out hallucinated labels or invalid property combinations before final assembly.
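A simplified sketch of schema-aware parsing follows. It assumes that a triple labeled with a vertex label assigns a property, that a triple labeled with an edge label creates an edge, and that vertex ids take the form label-name as in the example output later in this article. The name parse_with_schema is hypothetical, and the repository's method may differ in its details.

```python
import re
from typing import Any, Dict, List

LABELED_TRIPLE = re.compile(r"\((.*?), (.*?), (.*?)\) - ([^ ]*)")


def parse_with_schema(llm_output: str, schema: Dict[str, Any]) -> Dict[str, List[dict]]:
    """Hypothetical sketch: keep only triples whose label fits the schema."""
    vertex_props = {v["vertex_label"]: {p.lower() for p in v["properties"]}
                    for v in schema["vertices"]}
    edge_defs = {e["edge_label"]: e for e in schema["edges"]}
    vertices: Dict[str, dict] = {}
    edges: List[dict] = []

    def upsert(label: str, name: str) -> dict:
        # Duplicate entities merge because they share the same id.
        vid = f"{label}-{name}"
        return vertices.setdefault(
            vid, {"id": vid, "name": name, "label": label, "properties": {}})

    # [^ ]* also matches newlines, so flatten line breaks before matching.
    text = llm_output.replace("\n", " ")
    for match in LABELED_TRIPLE.findall(text):
        s, p, o, label = (item.strip() for item in match)
        if label in vertex_props and p.lower() in vertex_props[label]:
            upsert(label, s)["properties"][p] = o
        elif label in edge_defs:
            edge = edge_defs[label]
            src = upsert(edge["source_vertex_label"], s)
            dst = upsert(edge["target_vertex_label"], o)
            edges.append({"start": src["id"], "end": dst["id"],
                          "type": label, "properties": {}})
        # Triples with an unknown label or property are dropped.
    return {"vertices": list(vertices.values()), "edges": edges}
```

Keying vertices by id in a dictionary is what makes the merge step cheap: a second mention of the same entity updates the existing record instead of appending a duplicate.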
Graph Assembly and JSON Serialization
The GraphExtractFlow.post_deal method retrieves the populated WkFlowState from the pipeline and serializes the graph components into JSON format (lines 69-88). The method extracts vertices and edges collections from the shared state, producing a standardized output structure suitable for ingestion into HugeGraph.
# From hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py#L69-L88
res = pipeline.getGParamWithNoEmpty("wkflow_state").to_json()
return json.dumps(
    {"vertices": res.get("vertices", []),
     "edges": res.get("edges", [])},
    ensure_ascii=False, indent=2)
If both collections remain empty after extraction—typically indicating a schema mismatch or prompt failure—the system logs a warning and returns a payload containing a warning field to aid debugging.
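The empty-result behavior can be approximated as follows. This is a sketch under assumptions: serialize_graph is a hypothetical function, the state is modeled as a plain dictionary, and the warning text is invented; only the presence of a warning field comes from the description above.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def serialize_graph(state: Dict[str, Any]) -> str:
    """Hypothetical serializer that flags an empty extraction result."""
    vertices = state.get("vertices", [])
    edges = state.get("edges", [])
    payload: Dict[str, Any] = {"vertices": vertices, "edges": edges}
    if not vertices and not edges:
        # Assumed wording; the real message is not quoted in this article.
        message = "No vertices or edges were extracted; check the schema and prompt."
        logger.warning(message)
        payload["warning"] = message
    # ensure_ascii=False keeps non-ASCII names (such as Chinese) readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```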
Production Optimizations
The architecture implements two critical performance patterns for production deployments. First, the Scheduler maintains a GPipelineManager that caches compiled pipeline graphs, allowing subsequent requests to reuse instantiated nodes rather than rebuilding the execution graph for each extraction task. Second, the SchedulerSingleton ensures thread-safe access to the scheduler instance across concurrent requests, preventing race conditions during high-throughput text processing.
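Both patterns are easy to sketch with the standard library. PipelineCache and SchedulerHolder are hypothetical stand-ins for the caching and singleton behavior described above, not the repository's GPipelineManager or SchedulerSingleton, and the sketch omits the checkout and release logic that a real pool would need when one pipeline serves concurrent requests.

```python
import threading
from typing import Callable, Dict, Optional


class PipelineCache:
    """Hypothetical cache keyed by flow name; builds each pipeline at most once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pipelines: Dict[str, object] = {}

    def get_or_build(self, key: str, factory: Callable[[], object]) -> object:
        with self._lock:
            if key not in self._pipelines:
                # Only the first request for a key pays the construction cost.
                self._pipelines[key] = factory()
            return self._pipelines[key]


class SchedulerHolder:
    """Hypothetical singleton using double-checked locking."""

    _instance: Optional[PipelineCache] = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls) -> PipelineCache:
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock so only one thread creates the instance.
                if cls._instance is None:
                    cls._instance = PipelineCache()
        return cls._instance
```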
Complete Implementation Example
The following Python example walks through the full workflow for constructing a knowledge graph from text with an LLM, processing Chinese text with a predefined schema:
from hugegraph_llm.flows.graph_extract import GraphExtractFlow
from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState
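# Input text and schema definition
# English gloss of raw_text: "Zhang San is a software engineer who works at Huawei in Beijing.
# Li Si is his colleague and is responsible for backend development."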
raw_text = """
张三是一名软件工程师,工作于北京的华为公司。李四是他的同事,负责后端研发。
"""
schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
    ],
    "edges": [
        {"edge_label": "colleague", "source_vertex_label": "person",
         "target_vertex_label": "person", "properties": []},
    ],
}
# Initialize and execute extraction flow
flow = GraphExtractFlow()
pipeline = flow.build_flow(
    schema=schema,
    texts=[raw_text],
    example_prompt=None,
    extract_type="triples",
)
pipeline.init()
pipeline.run()
# Retrieve structured graph output
graph_json = flow.post_deal(pipeline)
print(graph_json)
Example Output:
{
  "vertices": [
    {"id": "person-张三", "name": "张三", "label": "person", "properties": {"occupation": "软件工程师"}},
    {"id": "person-李四", "name": "李四", "label": "person", "properties": {}}
  ],
  "edges": [
    {"start": "person-张三", "end": "person-李四", "type": "colleague", "properties": {}}
  ]
}
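Before loading output like this into a graph database, a consumer might want a quick consistency check. The sketch below assumes the vertices and edges layout shown in the example output; dangling_edges is a hypothetical helper and is not part of the library.

```python
import json
from typing import List


def dangling_edges(graph_json: str) -> List[dict]:
    """Return edges whose start or end id has no matching vertex."""
    graph = json.loads(graph_json)
    ids = {v["id"] for v in graph.get("vertices", [])}
    return [e for e in graph.get("edges", [])
            if e["start"] not in ids or e["end"] not in ids]
```

An empty list means every edge endpoint refers to a vertex in the same payload.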
Summary
- HugeGraph AI implements knowledge graph extraction through the GraphExtractFlow pipeline, combining document chunking, LLM prompting, and regex parsing.
- The InfoExtract operator in hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py handles prompt generation and schema-aware validation.
- Schema validation occurs during the regex post-processing phase, ensuring extracted entities conform to predefined vertex and edge definitions.
- The system supports both schema-less (triples) and schema-constrained extraction modes via the extract_type parameter.
- Production deployments benefit from pipeline pooling and thread-safe singleton patterns to minimize initialization overhead.
Frequently Asked Questions
What is the difference between "triples" and "property_graph" extraction modes in HugeGraph AI?
The extract_type parameter in GraphExtractFlow.build_flow determines the output structure. Triples mode extracts basic subject-predicate-object relationships without strict schema enforcement, while property_graph mode maps extracted entities to specific vertex labels and properties defined in the input schema, enabling complex property assignments and type validation.
How does HugeGraph AI validate extracted entities against a predefined schema?
During the post-processing phase, the extract_triples_by_regex_with_schema method validates each extracted triple against the schema provided in WkFlowInput. It checks that vertex labels exist in the schema definition and that properties match allowed fields, filtering out invalid extractions before adding them to the final vertices and edges collections in WkFlowState.
What LLM backends are compatible with the HugeGraph AI extraction pipeline?
The pipeline supports any LLM backend implementing the BaseLLM interface, configured through get_chat_llm(llm_settings) in hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py. The ExtractNode initializes the LLM client during node_init, allowing integration with OpenAI, local models, or custom API endpoints without modifying the core extraction logic.
How does the pipeline handle large documents that exceed LLM token limits?
The ChunkSplitNode in hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py automatically segments large documents into paragraph-level chunks using ChunkSplitter. Each chunk processes independently through the LLM extraction phase, with results aggregated into the shared WkFlowState, ensuring scalable processing of documents regardless of length.