# How HugeGraph AI Performs Knowledge Graph Construction from Text Using LLMs

> Discover how HugeGraph AI builds knowledge graphs from text using LLMs. Learn about its pipeline for entity and relationship extraction, creating structured JSON outputs.

- Repository: [The Apache Software Foundation/incubator-hugegraph-ai](https://github.com/apache/incubator-hugegraph-ai)
- Tags: how-to-guide
- Published: 2026-02-24

---

**HugeGraph AI constructs knowledge graphs from unstructured text by orchestrating a pipeline that chunks documents, prompts an LLM to extract entities and relationships, and applies regex-based parsing to generate standardized vertex and edge JSON structures.**

Apache HugeGraph AI implements an end-to-end **knowledge graph construction from text using LLMs** through the `GraphExtractFlow` class, which combines document segmentation, prompt-driven extraction, and schema validation to transform raw text into structured graph data. The implementation resides in the `apache/incubator-hugegraph-ai` repository and leverages a node-based pipeline architecture defined in [`hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py) (lines 33-49).

## Document Ingestion and Chunking Strategy

The extraction process begins with the `GraphExtractFlow.prepare` method, which populates a `WkFlowInput` object with raw texts, target schema definitions, and extraction mode parameters. This initialization step configures whether the pipeline operates in `"triples"` mode for basic extraction or `"property_graph"` mode for complex property assignments.

For large documents, the `ChunkSplitNode` defined in [`hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py) segments input text into manageable paragraphs using the `ChunkSplitter` utility. This node ensures that downstream LLM calls remain within token limits while preserving semantic coherence across chunks. The chunked content flows into the shared `WkFlowState` object defined in [`hugegraph-llm/src/hugegraph_llm/state/ai_state.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/state/ai_state.py), which maintains state across the pipeline execution.
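The paragraph-level chunking idea can be illustrated with a minimal sketch. This is not the `ChunkSplitter` implementation; the function name `split_into_paragraph_chunks` and the `max_chars` budget are assumptions made for illustration, and a real splitter would more likely budget by tokens than by characters.

```python
import re


def split_into_paragraph_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Hypothetical helper: split on blank lines, then pack paragraphs into chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        # An oversized single paragraph is kept whole rather than cut mid-sentence.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Packing whole paragraphs keeps each chunk semantically coherent, at the cost of slightly uneven chunk sizes.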

## Pipeline Architecture and Node Orchestration

The `GraphExtractFlow` orchestrates extraction through a `Scheduler` that manages a `GPipeline` instance from the PyCGraph framework. The scheduler registers specialized nodes including `ExtractNode` from [`hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py) (lines 26-55), which serves as the bridge between workflow state management and the core extraction logic.

Each node follows a strict lifecycle: `init` → `node_init` → `run`. During the `node_init` phase, the `ExtractNode` instantiates the LLM client via `get_chat_llm(llm_settings)`, ensuring that the language model backend is configured once per pipeline execution (lines 34-38). The actual business logic resides in the `InfoExtract` operator located in [`hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py) (lines 51-199).
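The lifecycle can be pictured with a toy node. Everything in this sketch is hypothetical (`FakeLLM`, `SketchExtractNode`, and the return values); it does not reproduce the PyCGraph base class or the real `ExtractNode`, and only shows why creating the client in `node_init` means `run` can be called repeatedly without rebuilding it.

```python
class FakeLLM:
    """Stand-in for a chat client; counts how many instances were created."""

    created = 0

    def __init__(self) -> None:
        FakeLLM.created += 1

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class SketchExtractNode:
    """Toy node following the init -> node_init -> run order described above."""

    def init(self) -> bool:
        # Framework-level setup, which then delegates to node-specific setup.
        self.llm = None
        return self.node_init()

    def node_init(self) -> bool:
        # The LLM client is created here, once, not inside run().
        self.llm = FakeLLM()
        return True

    def run(self, chunk: str) -> str:
        return self.llm.generate(prompt=chunk)
```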

## LLM-Powered Triple Extraction

### Dynamic Prompt Generation

The `InfoExtract.extract_triples_by_llm` method generates dynamic prompts based on input context and schema availability. When a schema is provided, the method constructs a "real-result" prompt using `generate_extract_triple_prompt(chunk, schema)` that includes vertex labels, edge definitions, and property constraints. Without a schema, the system falls back to a generic extraction prompt containing only the text content (lines 82-87).

```python
# From hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py#L82-L87
prompt = generate_extract_triple_prompt(chunk, schema)
if self.example_prompt:
    prompt = self.example_prompt + prompt
return self.llm.generate(prompt=prompt)
```
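To make the schema-dependent branching concrete, here is a minimal sketch of a prompt builder. The function `build_triple_prompt` and the prompt wording are invented for illustration; the actual text produced by `generate_extract_triple_prompt` is not shown in this article and will differ.

```python
import json
from typing import Optional


def build_triple_prompt(
    chunk: str,
    schema: Optional[dict] = None,
    example_prompt: Optional[str] = None,
) -> str:
    """Hypothetical prompt builder; the wording is an assumption, not the library's."""
    if schema:
        # Schema-aware prompt: ask for a label after each triple.
        body = (
            "Extract triples from the text in the form "
            "(Subject, Predicate, Object) - Label.\n"
            "Only use labels from this schema:\n"
            f"{json.dumps(schema, ensure_ascii=False)}\n"
            f"Text:\n{chunk}"
        )
    else:
        # Schema-less prompt: the text alone, with a plain triple format.
        body = (
            "Extract triples from the text in the form "
            "(Subject, Predicate, Object).\n"
            f"Text:\n{chunk}"
        )
    # Few-shot examples, when present, are prepended as in the excerpt above.
    return example_prompt + body if example_prompt else body
```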

### LLM Invocation and Response Handling

The `BaseLLM.generate` method executes the prompt against the configured backend, returning raw text containing extracted relationships in the format `(Subject, Predicate, Object) - Label`. This output requires no predefined ontology when operating in schema-less mode, allowing flexible extraction across diverse domains. The LLM instance is created during node initialization via the factory in [`hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py).

## Regex-Based Post-Processing and Schema Validation

### Schema-Less Triple Parsing

Following the LLM call, `InfoExtract` applies specialized regex parsers to structure the unstructured text output. For schema-less extraction, `extract_triples_by_regex` identifies patterns matching `\((.*?), (.*?), (.*?)\)` and appends raw triples to the workflow state (lines 88-92).
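The pattern's behavior is easy to check in isolation. The sketch below applies the same regular expression to a sample LLM response; the helper name `parse_triples` and the sample text are assumptions, and the real method also writes its results into the workflow state, which is omitted here.

```python
import re

# Same pattern as quoted above: three lazily matched, comma-separated groups.
TRIPLE_PATTERN = re.compile(r"\((.*?), (.*?), (.*?)\)")


def parse_triples(llm_output: str) -> list[tuple[str, ...]]:
    """Hypothetical helper: return (subject, predicate, object) tuples."""
    return [
        tuple(part.strip() for part in match)
        for match in TRIPLE_PATTERN.findall(llm_output)
    ]


sample = "(Alice, knows, Bob)\n(Bob, works_at, Acme)"
triples = parse_triples(sample)
```

Because the groups are separated by a literal comma and space, a response that omits the space after a comma will not match, which is one reason the prompt has to pin down the output format.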

### Schema-Aware Validation

When schema validation is enabled, `extract_triples_by_regex_with_schema` processes the enhanced format `\((.*?), (.*?), (.*?)\) - ([^ ]*)` to validate property-label pairs against the supplied schema constraints, build standardized vertex dictionaries with unique identifiers, and construct edge records while merging duplicate entities (lines 94-149). This validation step ensures that extracted entities conform to the graph schema defined in `WkFlowInput`, filtering out hallucinated labels or invalid property combinations before final assembly.
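A simplified sketch of this step follows. It is not the library's code: the function `build_graph` is hypothetical, and it assumes that a triple labeled with a vertex label means (entity, property, value) while a triple labeled with an edge label means (source, relation, target). The vertex and edge dictionary shapes mirror the example output shown later in this article. Newlines are replaced with spaces first because `[^ ]*` would otherwise run across a line break.

```python
import re

LABELED_TRIPLE = re.compile(r"\((.*?), (.*?), (.*?)\) - ([^ ]*)")


def build_graph(llm_output: str, schema: dict) -> dict:
    """Hypothetical schema-aware parser; drops triples whose label is not in the schema."""
    vertex_props = {v["vertex_label"]: set(v["properties"]) for v in schema["vertices"]}
    edge_defs = {e["edge_label"]: e for e in schema["edges"]}
    vertices: dict[str, dict] = {}
    edges: list[dict] = []

    def ensure_vertex(label: str, name: str) -> str:
        # Repeated mentions of one entity merge into a single vertex via a shared id.
        vid = f"{label}-{name}"
        vertices.setdefault(
            vid, {"id": vid, "name": name, "label": label, "properties": {}}
        )
        return vid

    text = llm_output.replace("\n", " ")
    for s, p, o, label in LABELED_TRIPLE.findall(text):
        s, p, o = s.strip(), p.strip(), o.strip()
        if label in vertex_props and p in vertex_props[label]:
            vertices[ensure_vertex(label, s)]["properties"][p] = o
        elif label in edge_defs:
            edge = edge_defs[label]
            start = ensure_vertex(edge["source_vertex_label"], s)
            end = ensure_vertex(edge["target_vertex_label"], o)
            edges.append({"start": start, "end": end, "type": label, "properties": {}})
        # Any other label or property is treated as hallucinated and skipped.
    return {"vertices": list(vertices.values()), "edges": edges}
```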

## Graph Assembly and JSON Serialization

The `GraphExtractFlow.post_deal` method retrieves the populated `WkFlowState` from the pipeline and serializes the graph components into JSON format (lines 69-88). The method extracts `vertices` and `edges` collections from the shared state, producing a standardized output structure suitable for ingestion into HugeGraph.

```python
# From hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py#L69-L88
res = pipeline.getGParamWithNoEmpty("wkflow_state").to_json()
return json.dumps(
    {"vertices": res.get("vertices", []), "edges": res.get("edges", [])},
    ensure_ascii=False,
    indent=2,
)
```

If both collections remain empty after extraction, which typically indicates a schema mismatch or prompt failure, the system logs a warning and returns a payload containing a `warning` field to aid debugging.
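The empty-result handling might look roughly like the sketch below. The function name `serialize_graph`, the logger name, and the warning message are assumptions; the article only states that a `warning` field is returned, not its exact text or the rest of the payload shape.

```python
import json
import logging

logger = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def serialize_graph(state: dict) -> str:
    """Hypothetical serializer: adds a `warning` field when nothing was extracted."""
    vertices = state.get("vertices", [])
    edges = state.get("edges", [])
    payload = {"vertices": vertices, "edges": edges}
    if not vertices and not edges:
        message = "No vertices or edges extracted; check the schema and prompt."
        logger.warning(message)
        payload["warning"] = message
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Returning the warning inside the payload, rather than raising, lets a caller distinguish an empty extraction from a failed request.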

## Production Optimizations

The architecture implements two critical performance patterns for production deployments. First, the `Scheduler` maintains a `GPipelineManager` that caches compiled pipeline graphs, allowing subsequent requests to reuse instantiated nodes rather than rebuilding the execution graph for each extraction task. Second, the `SchedulerSingleton` ensures thread-safe access to the scheduler instance across concurrent requests, preventing race conditions during high-throughput text processing.
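Both patterns can be sketched with the standard library alone. The classes below are illustrative stand-ins (`SketchScheduler`, `SketchSchedulerSingleton`, and the `builder` callback are hypothetical names) and do not reproduce the real `Scheduler`, `GPipelineManager`, or `SchedulerSingleton`.

```python
import threading
from typing import Any, Callable, Optional


class SketchScheduler:
    """Caches one built pipeline per flow name so later requests reuse it."""

    def __init__(self) -> None:
        self._pipelines: dict[str, Any] = {}
        self._lock = threading.Lock()
        self.builds = 0  # counts how often a pipeline was actually built

    def get_pipeline(self, flow_name: str, builder: Callable[[], Any]) -> Any:
        with self._lock:
            if flow_name not in self._pipelines:
                self._pipelines[flow_name] = builder()
                self.builds += 1
            return self._pipelines[flow_name]


class SketchSchedulerSingleton:
    """Double-checked locking so concurrent callers share one scheduler."""

    _instance: Optional[SketchScheduler] = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls) -> SketchScheduler:
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = SketchScheduler()
        return cls._instance
```

The lock inside `get_pipeline` guards the cache itself; the class-level lock only guards creation of the shared scheduler.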

## Complete Implementation Example

The following Python implementation demonstrates the full **knowledge graph construction from text using LLMs** workflow, processing Chinese text with a predefined schema:

```python
from hugegraph_llm.flows.graph_extract import GraphExtractFlow

# Input text (Chinese). English gloss: "Zhang San is a software engineer who works
# at Huawei in Beijing. Li Si is his colleague and handles backend development."
raw_text = """
张三是一名软件工程师，工作于北京的华为公司。李四是他的同事，负责后端研发。
"""

# Target schema: one vertex label and one edge label between persons
schema = {
    "vertices": [
        {"vertex_label": "person", "properties": ["name", "occupation"]},
    ],
    "edges": [
        {"edge_label": "colleague", "source_vertex_label": "person",
         "target_vertex_label": "person", "properties": []},
    ],
}

# Initialize and execute the extraction flow
flow = GraphExtractFlow()
pipeline = flow.build_flow(
    schema=schema,
    texts=[raw_text],
    example_prompt=None,
    extract_type="triples",
)

pipeline.init()
pipeline.run()

# Retrieve the structured graph output as a JSON string
graph_json = flow.post_deal(pipeline)
print(graph_json)
```

**Example Output:**

```json
{
  "vertices": [
    {"id": "person-张三", "name": "张三", "label": "person", "properties": {"occupation": "软件工程师"}},
    {"id": "person-李四", "name": "李四", "label": "person", "properties": {}}
  ],
  "edges": [
    {"start": "person-张三", "end": "person-李四", "type": "colleague", "properties": {}}
  ]
}
```
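Before loading output of this shape into HugeGraph, it can be useful to confirm that every edge points at a vertex that was actually emitted. The helper below is a hypothetical sanity check, not part of `hugegraph-llm`; it assumes only the `id`, `start`, and `end` fields shown in the example output.

```python
import json


def find_dangling_edges(graph_json: str) -> list[dict]:
    """Hypothetical check: return edges whose endpoints are missing from `vertices`."""
    graph = json.loads(graph_json)
    vertex_ids = {v["id"] for v in graph.get("vertices", [])}
    return [
        edge
        for edge in graph.get("edges", [])
        if edge["start"] not in vertex_ids or edge["end"] not in vertex_ids
    ]
```

An empty list means the edge set is internally consistent with the vertex set; it says nothing about whether the extraction itself is factually correct.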

## Summary

- **HugeGraph AI** implements knowledge graph extraction through the `GraphExtractFlow` pipeline, combining document chunking, LLM prompting, and regex parsing.
- The `InfoExtract` operator in [`hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py) handles prompt generation and schema-aware validation.
- **Schema validation** occurs during the regex post-processing phase, ensuring extracted entities conform to predefined vertex and edge definitions.
- The `extract_type` parameter selects between `triples` and `property_graph` extraction; within `triples` mode, supplying a schema switches parsing from **schema-less** to **schema-aware**.
- Production deployments benefit from **pipeline caching** and a thread-safe singleton scheduler, which together minimize initialization overhead.

## Frequently Asked Questions

### What is the difference between "triples" and "property_graph" extraction modes in HugeGraph AI?

The `extract_type` parameter in `GraphExtractFlow.build_flow` determines the output structure. **Triples** mode extracts basic subject-predicate-object relationships without strict schema enforcement, while **property_graph** mode maps extracted entities to specific vertex labels and properties defined in the input schema, enabling complex property assignments and type validation.

### How does HugeGraph AI validate extracted entities against a predefined schema?

During the post-processing phase, the `extract_triples_by_regex_with_schema` method validates each extracted triple against the schema provided in `WkFlowInput`. It checks that vertex labels exist in the schema definition and that properties match allowed fields, filtering out invalid extractions before adding them to the final `vertices` and `edges` collections in `WkFlowState`.

### What LLM backends are compatible with the HugeGraph AI extraction pipeline?

The pipeline supports any LLM backend implementing the `BaseLLM` interface, configured through `get_chat_llm(llm_settings)` in [`hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py). The `ExtractNode` initializes the LLM client during `node_init`, allowing integration with OpenAI, local models, or custom API endpoints without modifying the core extraction logic.

### How does the pipeline handle large documents that exceed LLM token limits?

The `ChunkSplitNode` in [`hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py) automatically segments large documents into paragraph-level chunks using `ChunkSplitter`. Each chunk is processed independently during the LLM extraction phase, and the results are aggregated into the shared `WkFlowState`, so each LLM call stays within the model's token limit however long the source document is.