How to Perform Knowledge Graph Construction from Text Using the HugeGraph AI Scheduler

You can construct a knowledge graph from raw text by invoking the GRAPH_EXTRACT flow through the SchedulerSingleton, which orchestrates a pipeline of schema validation, text chunking, and LLM-based entity extraction.

The apache/incubator-hugegraph-ai repository provides a production-ready framework for transforming unstructured documents into structured graph data. Its HugeGraph AI scheduler acts as a lightweight orchestrator that manages flows—self-contained pipelines built with the pycgraph library—to automate knowledge graph construction from text. This implementation handles pipeline reuse, concurrency safety, and result aggregation through a singleton-based scheduler architecture.

Define Your Graph Schema

Before processing text, you must define a JSON schema that describes the vertex and edge types to extract. The schema declares labels, properties, and relationship constraints that guide the LLM's extraction logic.

In hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py, the GraphExtractFlow validates this schema through the SchemaNode before processing begins. A valid schema contains two top-level keys: vertices (defining entity types and their properties) and edges (defining relationship types between vertices).

schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}
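Before handing a schema to the flow, it can help to sanity-check it locally. The following is a minimal sketch, not the SchemaNode implementation: check_schema is a hypothetical helper, and it assumes only the two top-level keys and the label fields shown in the example above.

```python
from typing import Any, Dict, List


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a schema dict (empty if none).

    Hypothetical helper for illustration; the real SchemaNode may
    apply different or stricter rules.
    """
    problems: List[str] = []
    for key in ("vertices", "edges"):
        if not isinstance(schema.get(key), list):
            problems.append(f"missing or non-list top-level key: {key!r}")
    if problems:
        return problems

    # Every edge endpoint should name a declared vertex label.
    vertex_labels = {v.get("vertex_label") for v in schema["vertices"]}
    for edge in schema["edges"]:
        for end in ("source_label", "target_label"):
            if edge.get(end) not in vertex_labels:
                problems.append(
                    f"edge {edge.get('edge_label')!r}: {end} "
                    f"{edge.get(end)!r} is not a declared vertex label"
                )
    return problems


schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}

print(check_schema(schema))  # prints []
```

An empty list means the structural checks passed; it says nothing about whether the LLM will extract useful data with this schema.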

Prepare Input Data for Extraction

The extraction flow requires three critical inputs: the source documents, an example prompt that constrains the LLM output format, and the extraction mode specification.

Input Texts and Example Prompts

The texts parameter accepts a list of raw strings representing the documents to process. The example_prompt parameter provides explicit instructions to the LLM regarding the expected JSON output structure, ensuring the extracted entities match your schema format.

texts = [
    "张三 is a software engineer working at ABC Company.",
    "李四 is 张三's colleague and works as a data scientist.",
    "ABC Company is a tech company headquartered in Beijing."
]

example_prompt = """
Extract entities and relationships that match the given schema.
Return a JSON object with two fields: "vertices" (list of vertex objects) and "edges" (list of edge objects).
"""

Select Extraction Mode

Set the extract_type parameter to determine the extraction operator:

  • "property_graph" – Uses PropertyGraphExtract to generate complex vertex-edge structures with properties
  • "triples" – Uses InfoExtract to generate simple subject-predicate-object triples

The ExtractNode in hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py selects the appropriate operator based on this parameter during its node_init phase.
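The selection described above amounts to a lookup from extract_type to an operator. The sketch below illustrates that idea with stand-in classes; PropertyGraphExtractStub, InfoExtractStub, and select_operator are hypothetical names, and the real node_init logic may be structured differently.

```python
class PropertyGraphExtractStub:
    """Stand-in for the property-graph operator (illustration only)."""
    mode = "property_graph"


class InfoExtractStub:
    """Stand-in for the triples operator (illustration only)."""
    mode = "triples"


# Map each supported extract_type value to its operator class.
OPERATORS = {
    "property_graph": PropertyGraphExtractStub,
    "triples": InfoExtractStub,
}


def select_operator(extract_type: str):
    """Instantiate the operator for extract_type, rejecting unknown values."""
    try:
        return OPERATORS[extract_type]()
    except KeyError:
        raise ValueError(f"unsupported extract_type: {extract_type!r}") from None


operator = select_operator("property_graph")
print(type(operator).__name__)  # prints PropertyGraphExtractStub
```

Failing fast on an unknown value is a design choice of this sketch; it surfaces a typo in extract_type before any LLM call is made.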

Execute the Graph Extraction Flow

Scheduler Initialization and Flow Lookup

The SchedulerSingleton class defined in hugegraph-llm/src/hugegraph_llm/flows/scheduler.py manages pipeline instances through an internal pipeline_pool dictionary. When you invoke schedule_flow(), the scheduler first checks for an existing pipeline instance to enable reuse and concurrency safety.

If no pipeline exists for FlowName.GRAPH_EXTRACT, the scheduler calls GraphExtractFlow.build_flow to construct a new pipeline. The FlowName enum is defined in hugegraph-llm/src/hugegraph_llm/flows/__init__.py and provides the registry key for available flows.
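The lookup-then-build behavior can be pictured as a small cache keyed by flow name. This is a simplified sketch rather than the scheduler's actual code: PipelinePool and get_or_build are hypothetical names, and guarding the dictionary with a threading.Lock is an assumption about how one might make the lookup safe under concurrent calls.

```python
import threading
from typing import Any, Callable, Dict


class PipelinePool:
    """Cache one pipeline object per flow name (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pipelines: Dict[str, Any] = {}

    def get_or_build(self, flow_name: str, build: Callable[[], Any]) -> Any:
        # Hold the lock so two threads cannot both build the same flow.
        with self._lock:
            if flow_name not in self._pipelines:
                self._pipelines[flow_name] = build()
            return self._pipelines[flow_name]


build_calls = []


def build_graph_extract() -> object:
    """Pretend pipeline builder that records each invocation."""
    build_calls.append("graph_extract")
    return object()


pool = PipelinePool()
first = pool.get_or_build("graph_extract", build_graph_extract)
second = pool.get_or_build("graph_extract", build_graph_extract)

print(first is second)   # prints True
print(len(build_calls))  # prints 1
```

The second call returns the cached object without invoking the builder again, which is the reuse the text describes.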

Pipeline Execution Nodes

The constructed flow wires three nodes sequentially:

  1. SchemaNode – Validates the input schema JSON structure
  2. ChunkSplitNode – Segments large texts into processable chunks
  3. ExtractNode – Loads the LLM and executes the extraction operator

The ExtractNode initializes the LLM client and invokes operator_schedule to run PropertyGraphExtract (when extract_type="property_graph") or InfoExtract (when extract_type="triples"). The selected operator processes each text chunk and populates a WkFlowState object with the resulting vertices and edges.
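To make the validate, chunk, extract sequence concrete, here is a plain-Python sketch that passes a shared state dictionary through three stages. It does not use pycgraph or call an LLM; every function name, the fixed-width chunking, and the placeholder extraction output are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def schema_stage(state: State) -> State:
    """Reject a schema that lacks the two top-level keys."""
    schema = state["schema"]
    if "vertices" not in schema or "edges" not in schema:
        raise ValueError("schema needs 'vertices' and 'edges'")
    return state


def chunk_stage(state: State, max_chars: int = 40) -> State:
    """Split each text into fixed-width chunks (a deliberately naive splitter)."""
    chunks: List[str] = []
    for text in state["texts"]:
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    state["chunks"] = chunks
    return state


def extract_stage(state: State) -> State:
    """Placeholder for the LLM step: emits one dummy vertex per chunk."""
    state["vertices"] = [
        {"label": "chunk", "properties": {"text": c}} for c in state["chunks"]
    ]
    state["edges"] = []
    return state


def run_pipeline(state: State, stages: List[Callable[[State], State]]) -> State:
    """Run the stages in order, threading the state through each one."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    {"schema": {"vertices": [], "edges": []}, "texts": ["a" * 50, "short"]},
    [schema_stage, chunk_stage, extract_stage],
)
print(len(result["chunks"]))  # prints 3
```

The 50-character text splits into two chunks and the short text into one, so the placeholder extraction produces three dummy vertices.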

Complete Implementation Example

The following implementation demonstrates the complete workflow using the scheduler API:

from hugegraph_llm.flows.scheduler import SchedulerSingleton
from hugegraph_llm.flows import FlowName

# 1. Define the graph schema

schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}

# 2. Prepare source documents

texts = [
    "张三 is a software engineer working at ABC Company.",
    "李四 is 张三's colleague and works as a data scientist.",
    "ABC Company is a tech company headquartered in Beijing."
]

# 3. Configure LLM guidance

example_prompt = """
Extract entities and relationships that match the given schema.
Return a JSON object with two fields: "vertices" (list of vertex objects) and "edges" (list of edge objects).
"""

# 4. Execute extraction via scheduler

scheduler = SchedulerSingleton.get_instance()
kg_json = scheduler.schedule_flow(
    FlowName.GRAPH_EXTRACT,
    schema,
    texts,
    example_prompt,
    extract_type="property_graph",
    language="zh"
)

print("Knowledge-graph JSON:")
print(kg_json)
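Once the flow returns, you will usually want to inspect the result rather than just print it. The sketch below assumes the return value is either a JSON string or an already-parsed dict with "vertices" and "edges" keys; summarize_kg is a hypothetical helper, and the sample payload is hand-written for illustration, not real flow output.

```python
import json
from typing import Any, Dict, Union


def summarize_kg(kg: Union[str, Dict[str, Any]]) -> Dict[str, int]:
    """Count vertices and edges in a result given as a JSON string or dict."""
    data = json.loads(kg) if isinstance(kg, str) else kg
    return {
        "vertices": len(data.get("vertices", [])),
        "edges": len(data.get("edges", [])),
    }


# Hand-written sample payload (an assumption about the output shape).
sample = '{"vertices": [{"label": "person", "properties": {"name": "张三"}}], "edges": []}'
print(summarize_kg(sample))  # prints {'vertices': 1, 'edges': 0}
```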

Key Source Files and Architecture

| File Path | Component Role |
| --- | --- |
| hugegraph-llm/src/hugegraph_llm/flows/scheduler.py | Implements Scheduler and SchedulerSingleton for pipeline lifecycle management and caching |
| hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py | Defines GraphExtractFlow.build_flow, which constructs the SchemaNode → ChunkSplitNode → ExtractNode pipeline |
| hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py | Implements ExtractNode, which initializes LLM clients and selects extraction operators (PropertyGraphExtract vs InfoExtract) |
| hugegraph-llm/src/hugegraph_llm/flows/__init__.py | Defines the FlowName enum used to address registered flows in the scheduler |
| hugegraph-llm/src/tests/integration/test_kg_construction.py | Integration test demonstrating mock-based verification of the complete knowledge graph construction pipeline |

Summary

  • The HugeGraph AI scheduler provides a singleton-based orchestration layer for running extraction flows through SchedulerSingleton.get_instance().
  • Knowledge graph construction from text requires three components: a JSON schema defining vertex/edge types, a list of source documents, and an example prompt guiding LLM output formatting.
  • The GRAPH_EXTRACT flow wires three processing nodes: SchemaNode for validation, ChunkSplitNode for text segmentation, and ExtractNode for LLM-based entity extraction.
  • Set extract_type="property_graph" in schedule_flow() to use the PropertyGraphExtract operator for complex graph structures.
  • Results are aggregated in WkFlowState and returned as JSON containing vertices and edges arrays.

Frequently Asked Questions

What is the difference between "property_graph" and "triples" extraction modes?

The "property_graph" mode uses the PropertyGraphExtract operator to generate vertices and edges with rich property sets, matching complex schema definitions. The "triples" mode uses the InfoExtract operator to generate simple subject-predicate-object statements without additional properties. Choose "property_graph" for production knowledge graphs requiring attributed relationships, and "triples" for basic relationship extraction.

How does the scheduler handle concurrent extraction requests?

The SchedulerSingleton maintains an internal pipeline_pool dictionary that caches instantiated flows. When multiple requests target the same FlowName, the scheduler reuses the existing pipeline instance, ensuring thread-safe operation and reducing initialization overhead. The singleton pattern guarantees that only one scheduler instance manages the pool across the application lifecycle.

Can I use custom LLM models with the ExtractNode?

Yes. The ExtractNode in hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py initializes the LLM client during its node_init phase. You can configure the underlying LLM implementation by modifying the operator configuration or extending the PropertyGraphExtract class in hugegraph_llm/operators/llm_op/property_graph_extract.py to support custom endpoints or local models.

Where can I find integration tests for knowledge graph construction?

The integration test suite in hugegraph-llm/src/tests/integration/test_kg_construction.py demonstrates the complete pipeline execution using mocked LLM responses. This test verifies that the scheduler correctly orchestrates the flow and that the resulting JSON output contains properly structured vertices and edges matching the input schema.
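The general mock-based approach can be sketched with the standard library alone. This example is not taken from the repository's test file: extract_graph and the generate method on the fake client are hypothetical, chosen only to show how a canned LLM reply lets you assert on output structure without network access.

```python
import json
from unittest.mock import MagicMock


def extract_graph(llm, text: str) -> dict:
    """Hypothetical extraction step: ask the LLM, then parse its JSON reply."""
    reply = llm.generate(prompt=f"Extract a graph from: {text}")
    return json.loads(reply)


# Fake client whose generate() returns a canned, hand-written response.
fake_llm = MagicMock()
fake_llm.generate.return_value = json.dumps(
    {
        "vertices": [{"label": "person", "properties": {"name": "张三"}}],
        "edges": [],
    }
)

graph = extract_graph(fake_llm, "张三 is a software engineer.")

# The mock records the call, so the test can verify both usage and output.
fake_llm.generate.assert_called_once()
print(sorted(graph))  # prints ['edges', 'vertices']
```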
