# How to Perform Knowledge Graph Construction from Text Using the HugeGraph AI Scheduler

> Learn to build knowledge graphs from text using the HugeGraph AI scheduler. This guide walks through the `GRAPH_EXTRACT` flow, covering schema validation, text chunking, and LLM-based entity extraction.

- Repository: [The Apache Software Foundation/incubator-hugegraph-ai](https://github.com/apache/incubator-hugegraph-ai)
- Tags: how-to-guide
- Published: 2026-02-24

---

**You can construct a knowledge graph from raw text by invoking the `GRAPH_EXTRACT` flow through the `SchedulerSingleton`, which orchestrates a pipeline of schema validation, text chunking, and LLM-based entity extraction.**

The `apache/incubator-hugegraph-ai` repository provides a framework for transforming unstructured documents into structured graph data. Its **HugeGraph AI scheduler** is a lightweight orchestrator that manages **flows**, which are self-contained pipelines built with the `pycgraph` library, to automate knowledge graph construction from text. The scheduler handles pipeline reuse, concurrency safety, and result aggregation through a singleton-based architecture.

## Define Your Graph Schema

Before processing text, you must define a JSON schema that describes the vertex and edge types to extract. The schema declares labels, properties, and relationship constraints that guide the LLM's extraction logic.

In [`hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py), the `GraphExtractFlow` validates this schema through the **SchemaNode** before processing begins. A valid schema contains two top-level keys: `vertices` (entity types and their properties) and `edges` (relationship types between vertices).

```python
schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}
```
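Before handing a schema to the flow, it can help to catch obvious mistakes locally. The following is a minimal sketch of such a pre-check, not the validation that `SchemaNode` performs: the `check_schema` helper is hypothetical and only tests the two top-level keys and whether each edge points at a declared vertex label.

```python
def check_schema(schema):
    """Return a list of problems found in a schema dict (empty if none).

    Hypothetical helper for illustration; it does not mirror SchemaNode.
    """
    problems = []
    for key in ("vertices", "edges"):
        if not isinstance(schema.get(key), list):
            problems.append(f"missing or non-list top-level key: {key!r}")
    if problems:
        return problems

    # Every edge endpoint should refer to a vertex label declared above.
    labels = {v.get("vertex_label") for v in schema["vertices"]}
    for edge in schema["edges"]:
        for end in ("source_label", "target_label"):
            if edge.get(end) not in labels:
                problems.append(
                    f"edge {edge.get('edge_label')!r}: {end} "
                    f"{edge.get(end)!r} is not a declared vertex label"
                )
    return problems


demo_schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}

print(check_schema(demo_schema))
```

An empty list means the sketch found nothing to complain about; the flow's own validation still has the final say.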

## Prepare Input Data for Extraction

The extraction flow requires three critical inputs: the source documents, an example prompt that constrains the LLM output format, and the extraction mode specification.

### Input Texts and Example Prompts

The `texts` parameter accepts a list of raw strings representing the documents to process. The `example_prompt` parameter provides explicit instructions to the LLM regarding the expected JSON output structure, ensuring the extracted entities match your schema format.

```python
texts = [
    "张三 is a software engineer working at ABC Company.",
    "李四 is 张三's colleague and works as a data scientist.",
    "ABC Company is a tech company headquartered in Beijing."
]

example_prompt = """
Extract entities and relationships that match the given schema.
Return a JSON object with two fields: "vertices" (list of vertex objects) and "edges" (list of edge objects).
"""
```

### Select Extraction Mode

Set the `extract_type` parameter to determine the extraction operator:
- **`"property_graph"`** – Uses `PropertyGraphExtract` to generate complex vertex-edge structures with properties
- **`"triples"`** – Uses `InfoExtract` to generate simple subject-predicate-object triples

The **ExtractNode** in [`hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py) selects the appropriate operator based on this parameter during its `node_init` phase.
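The selection step can be pictured as a lookup from the `extract_type` string to an operator class. The sketch below illustrates that idea only; the two stub classes and the `select_operator` function are hypothetical stand-ins, not the code in `extract_info.py`.

```python
class PropertyGraphExtractStub:
    """Hypothetical stand-in for the PropertyGraphExtract operator."""
    output_shape = "vertices and edges with properties"


class InfoExtractStub:
    """Hypothetical stand-in for the InfoExtract operator."""
    output_shape = "subject-predicate-object triples"


# Map each supported extract_type value to the operator it selects.
OPERATORS = {
    "property_graph": PropertyGraphExtractStub,
    "triples": InfoExtractStub,
}


def select_operator(extract_type):
    """Instantiate the operator for extract_type, rejecting unknown values."""
    try:
        return OPERATORS[extract_type]()
    except KeyError:
        raise ValueError(f"unsupported extract_type: {extract_type!r}") from None


print(select_operator("triples").output_shape)
```

Failing fast on an unknown mode keeps a typo in `extract_type` from silently falling back to the wrong operator.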

## Execute the Graph Extraction Flow

### Scheduler Initialization and Flow Lookup

The **SchedulerSingleton** class, defined in [`hugegraph-llm/src/hugegraph_llm/flows/scheduler.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/scheduler.py), manages pipeline instances through an internal `pipeline_pool` dictionary. When you invoke `schedule_flow()`, the scheduler first checks the pool for an existing pipeline instance so that it can be reused safely across calls.

If no pipeline exists for `FlowName.GRAPH_EXTRACT`, the scheduler calls `GraphExtractFlow.build_flow` to construct a new pipeline. The `FlowName` enum is defined in [`hugegraph-llm/src/hugegraph_llm/flows/__init__.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/__init__.py) and provides the registry key for available flows.
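The lookup-then-build behavior described above can be sketched with a small singleton that caches one pipeline per flow name. This is an illustration under stated assumptions, not the repository's scheduler: `TinyScheduler`, its locks, and the `build_flow` callable argument are all hypothetical, and a pipeline is modeled as a plain callable.

```python
import threading


class TinyScheduler:
    """Hypothetical singleton that caches one pipeline per flow name."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self.pipeline_pool = {}
        self._pool_lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        # The lock ensures only one instance is ever created.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def schedule_flow(self, flow_name, build_flow, *args):
        # Build the pipeline on first use, then reuse it on later calls.
        with self._pool_lock:
            pipeline = self.pipeline_pool.get(flow_name)
            if pipeline is None:
                pipeline = build_flow()
                self.pipeline_pool[flow_name] = pipeline
        return pipeline(*args)


build_calls = []


def build_upper_pipeline():
    build_calls.append("built")
    return lambda text: text.upper()


scheduler = TinyScheduler.get_instance()
first = scheduler.schedule_flow("demo", build_upper_pipeline, "abc")
second = scheduler.schedule_flow("demo", build_upper_pipeline, "def")
print(first, second, len(build_calls))
```

The second call finds the cached pipeline, so `build_upper_pipeline` runs only once.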

### Pipeline Execution Nodes

The constructed flow wires three nodes sequentially:
1. **SchemaNode** – Validates the input schema JSON structure
2. **ChunkSplitNode** – Segments large texts into processable chunks
3. **ExtractNode** – Loads the LLM and executes the extraction operator

The **ExtractNode** initializes the LLM client and invokes `operator_schedule` to run the selected operator: `PropertyGraphExtract` when `extract_type="property_graph"`, or `InfoExtract` when `extract_type="triples"`. The operator processes each text chunk and populates a **WkFlowState** object with the resulting vertices and edges.
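To make the data flow between the three stages concrete, here is a plain-Python sketch that passes a shared state dictionary through validate, chunk, and extract steps. It does not use `pycgraph` or the real node classes; the function names, the fixed-width chunking, and the `fake_extract` stub that replaces the LLM call are all assumptions made for illustration.

```python
def schema_node(state):
    """Reject schemas that lack the two top-level keys."""
    if not {"vertices", "edges"} <= state["schema"].keys():
        raise ValueError("schema needs 'vertices' and 'edges'")
    return state


def chunk_split_node(state, max_chars=40):
    """Split each text into fixed-width chunks (a simplification)."""
    chunks = []
    for text in state["texts"]:
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    state["chunks"] = chunks
    return state


def extract_node(state, extract_fn):
    """Run the extractor on every chunk and aggregate its results."""
    state["vertices"], state["edges"] = [], []
    for chunk in state["chunks"]:
        result = extract_fn(chunk, state["schema"])
        state["vertices"].extend(result["vertices"])
        state["edges"].extend(result["edges"])
    return state


def fake_extract(chunk, schema):
    """Stub standing in for the LLM call: one vertex per chunk."""
    return {
        "vertices": [{"label": "chunk", "properties": {"length": len(chunk)}}],
        "edges": [],
    }


state = {"schema": {"vertices": [], "edges": []}, "texts": ["a" * 100]}
state = extract_node(chunk_split_node(schema_node(state)), fake_extract)
print(len(state["chunks"]), len(state["vertices"]))  # 3 3
```

A 100-character text with a 40-character limit yields three chunks, and the stub returns one vertex for each.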

## Complete Implementation Example

The following implementation demonstrates the complete workflow using the scheduler API:

```python
from hugegraph_llm.flows.scheduler import SchedulerSingleton
from hugegraph_llm.flows import FlowName

# 1. Define the graph schema
schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
        {"vertex_label": "company", "properties": ["name", "industry", "location"]},
    ],
    "edges": [
        {"edge_label": "works_at", "source_label": "person", "target_label": "company"},
    ],
}

# 2. Prepare source documents
texts = [
    "张三 is a software engineer working at ABC Company.",
    "李四 is 张三's colleague and works as a data scientist.",
    "ABC Company is a tech company headquartered in Beijing.",
]

# 3. Configure LLM guidance
example_prompt = """
Extract entities and relationships that match the given schema.
Return a JSON object with two fields: "vertices" (list of vertex objects) and "edges" (list of edge objects).
"""

# 4. Execute extraction via scheduler
scheduler = SchedulerSingleton.get_instance()
kg_json = scheduler.schedule_flow(
    FlowName.GRAPH_EXTRACT,
    schema,
    texts,
    example_prompt,
    extract_type="property_graph",
    language="zh",
)

print("Knowledge-graph JSON:")
print(kg_json)
```
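Once the flow returns, you will usually want to inspect the result rather than just print it. The sketch below assumes the result is either a JSON string or an already-parsed dictionary with `vertices` and `edges` arrays; the `summarize_kg` helper and the `label` field on each vertex are assumptions for illustration and may not match the actual output format.

```python
import json
from collections import Counter


def summarize_kg(kg_json):
    """Count vertices and edges in an extraction result.

    Hypothetical helper: accepts a JSON string or a dict, and assumes
    each vertex carries a "label" field.
    """
    data = json.loads(kg_json) if isinstance(kg_json, str) else kg_json
    vertices = data.get("vertices", [])
    edges = data.get("edges", [])
    return {
        "vertex_count": len(vertices),
        "edge_count": len(edges),
        "vertex_labels": dict(Counter(v.get("label", "?") for v in vertices)),
    }


sample = json.dumps({
    "vertices": [
        {"label": "person", "properties": {"name": "张三"}},
        {"label": "company", "properties": {"name": "ABC Company"}},
    ],
    "edges": [{"label": "works_at", "outV": "张三", "inV": "ABC Company"}],
})

print(summarize_kg(sample))
```

The `sample` payload is invented for the demonstration; substitute the value returned by `schedule_flow()` to summarize a real run.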

## Key Source Files and Architecture

| File Path | Component Role |
|-----------|----------------|
| [`hugegraph-llm/src/hugegraph_llm/flows/scheduler.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/scheduler.py) | Implements `Scheduler` and `SchedulerSingleton` for pipeline lifecycle management and caching |
| [`hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py) | Defines `GraphExtractFlow.build_flow` which constructs the SchemaNode → ChunkSplitNode → ExtractNode pipeline |
| [`hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py) | Implements `ExtractNode` which initializes LLM clients and selects extraction operators (`PropertyGraphExtract` vs `InfoExtract`) |
| [`hugegraph-llm/src/hugegraph_llm/flows/__init__.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/__init__.py) | Defines the `FlowName` enum used to address registered flows in the scheduler |
| [`hugegraph-llm/src/tests/integration/test_kg_construction.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/tests/integration/test_kg_construction.py) | Integration test demonstrating mock-based verification of the complete knowledge graph construction pipeline |

## Summary

- The **HugeGraph AI scheduler** provides a singleton-based orchestration layer for running extraction flows through `SchedulerSingleton.get_instance()`.
- **Knowledge graph construction from text** requires three components: a JSON schema defining vertex/edge types, a list of source documents, and an example prompt guiding LLM output formatting.
- The **GRAPH_EXTRACT** flow wires three processing nodes: `SchemaNode` for validation, `ChunkSplitNode` for text segmentation, and `ExtractNode` for LLM-based entity extraction.
- Set `extract_type="property_graph"` in `schedule_flow()` to use the `PropertyGraphExtract` operator for complex graph structures.
- Results are aggregated in `WkFlowState` and returned as JSON containing `vertices` and `edges` arrays.

## Frequently Asked Questions

### What is the difference between "property_graph" and "triples" extraction modes?

The `"property_graph"` mode uses the `PropertyGraphExtract` operator to generate vertices and edges with rich property sets, matching complex schema definitions. The `"triples"` mode uses the `InfoExtract` operator to generate simple subject-predicate-object statements without additional properties. Choose `"property_graph"` for production knowledge graphs requiring attributed relationships, and `"triples"` for basic relationship extraction.

### How does the scheduler handle concurrent extraction requests?

The `SchedulerSingleton` maintains an internal `pipeline_pool` dictionary that caches instantiated flows. When multiple requests target the same `FlowName`, the scheduler reuses the existing pipeline instance instead of rebuilding it, which avoids repeated initialization and supports safe concurrent use. The singleton pattern ensures that a single scheduler instance manages the pool across the application lifecycle.

### Can I use custom LLM models with the ExtractNode?

Yes. The `ExtractNode` in [`hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py) initializes the LLM client during its `node_init` phase. You can configure the underlying LLM implementation by modifying the operator configuration or by extending the `PropertyGraphExtract` class in [`hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py) to support custom endpoints or local models.

### Where can I find integration tests for knowledge graph construction?

The integration test suite in [`hugegraph-llm/src/tests/integration/test_kg_construction.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/tests/integration/test_kg_construction.py) demonstrates the complete pipeline execution using mocked LLM responses. The test verifies that the scheduler correctly orchestrates the flow and that the resulting JSON output contains properly structured vertices and edges matching the input schema.
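If you want a similar check in your own project, mock-based verification can look roughly like the following. This sketch is not taken from the repository's test file: the `build_kg` wrapper, the `"graph_extract"` key, and the canned response are hypothetical, and the scheduler is replaced entirely by a `MagicMock` so no LLM is called.

```python
import json
from unittest.mock import MagicMock


def build_kg(scheduler, flow_name, schema, texts, example_prompt):
    """Hypothetical wrapper: run the flow and parse its JSON result."""
    raw = scheduler.schedule_flow(
        flow_name, schema, texts, example_prompt, extract_type="property_graph"
    )
    return json.loads(raw)


# Replace the scheduler with a mock that returns a canned extraction result.
fake_scheduler = MagicMock()
fake_scheduler.schedule_flow.return_value = json.dumps({
    "vertices": [{"label": "person", "properties": {"name": "张三"}}],
    "edges": [],
})

kg = build_kg(
    fake_scheduler,
    "graph_extract",  # placeholder for FlowName.GRAPH_EXTRACT
    {"vertices": [], "edges": []},
    ["张三 is a software engineer working at ABC Company."],
    "Return vertices and edges as JSON.",
)

# Verify both the call and the shape of the parsed output.
fake_scheduler.schedule_flow.assert_called_once()
assert kg["vertices"][0]["label"] == "person"
print("mock-based check passed")
```

Mocking at the scheduler boundary keeps the test fast and deterministic, at the cost of not exercising the real nodes.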