Main Modules in Apache HugeGraph AI: Python SDK, ML, LLM, and Distributed Computing

February 24, 2026 · apache/incubator-hugegraph-ai

Apache HugeGraph AI is organized as a workspace-style monorepo containing four main modules: hugegraph-python-client for database SDK operations, hugegraph-ml for deep learning on graphs, hugegraph-llm for LLM integration and Graph-RAG, and vermeer-python-client for distributed graph computing.

Apache HugeGraph AI provides a Python toolkit for building AI-powered graph applications. The repository splits its functionality across four packages, each installable separately via pip but designed to work together in a single workflow. Knowing what each module does helps when architecting solutions that range from basic CRUD operations to natural language querying.

The Four Main Modules

The project divides its capabilities into four distinct Python packages, each with its own pyproject.toml and specific focus area.

hugegraph-python-client: Core Database SDK

The hugegraph-python-client module provides the low-level Python SDK for interacting with the HugeGraph database server. It handles schema management, CRUD operations, and Gremlin query execution through a RESTful API wrapper.

The primary entry point is PyHugeClient located in hugegraph-python-client/src/pyhugegraph/client.py. This class manages connections to the HugeGraph server and provides access to schema builders and graph traversals.

Key capabilities include:

Schema definition and property key management
Vertex and edge CRUD operations
Gremlin query execution
GraphSpace support for multi-tenant environments

hugegraph-llm: Large Language Model Integration

The hugegraph-llm module enables AI-driven graph applications through Large Language Model (LLM) integration. It powers Graph-RAG (Retrieval-Augmented Generation), knowledge graph construction from unstructured text, and natural language to Gremlin translation.

The central orchestration happens in hugegraph-llm/src/hugegraph_llm/flows/scheduler.py via the SchedulerSingleton class. This scheduler manages reusable pipelines including RAGGraphOnlyFlow, BuildVectorIndexFlow, and Text2GremlinFlow.

Key features include:

Flow-based orchestration using GPipeline from pycgraph
Streaming response support for interactive applications
Vector index construction for semantic search
AI agent capabilities for autonomous graph operations
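
The idea of one shared scheduler dispatching named, reusable flows can be illustrated with a small stand-alone sketch. This is not the code in scheduler.py: MiniScheduler, register, and the toy echo_flow are hypothetical names invented here, and the real flows are built on GPipeline rather than plain callables.

```python
from typing import Any, Callable, Dict, Optional


class MiniScheduler:
    """Hypothetical stand-in for a singleton flow scheduler (not the real SchedulerSingleton)."""

    _instance: Optional["MiniScheduler"] = None

    def __init__(self) -> None:
        # Maps a flow name to the callable that runs it.
        self._flows: Dict[str, Callable[..., Dict[str, Any]]] = {}

    @classmethod
    def get_instance(cls) -> "MiniScheduler":
        # Create the shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register(self, name: str, flow: Callable[..., Dict[str, Any]]) -> None:
        self._flows[name] = flow

    def schedule_flow(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        if name not in self._flows:
            raise ValueError(f"Unsupported flow: {name}")
        return self._flows[name](**kwargs)


def echo_flow(query: str) -> Dict[str, Any]:
    # Toy flow: a real one would retrieve graph context and call an LLM.
    return {"answer": f"echo: {query}"}


scheduler = MiniScheduler.get_instance()
scheduler.register("echo", echo_flow)
result = scheduler.schedule_flow("echo", query="hello")
print(result["answer"])
```

Keeping flows behind string names lets callers such as a web handler or a CLI request a pipeline without importing its implementation.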

hugegraph-ml: Graph Machine Learning

The hugegraph-ml module offers a deep learning toolkit for graph analytics, built on DGL (Deep Graph Library) and PyTorch. It supports node classification, graph classification, link prediction, and graph embedding tasks.

Example implementations reside in hugegraph-ml/src/hugegraph_ml/examples/, such as dgi_example.py which demonstrates Deep Graph Infomax training. The module uses HugeGraph2DGL converters to transform HugeGraph data into DGL-compatible formats.

Key components include:

Pre-built models: DGI, GRAND, and other GNN architectures
Tasks: NodeEmbed, NodeClassify, and link prediction
Direct data conversion from HugeGraph to DGL tensors
PyTorch-based training pipelines
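
Converting a property graph into tensor-friendly form generally means mapping arbitrary vertex IDs to contiguous integer indices and expressing edges as parallel source and destination index lists. The sketch below shows that general idea with plain Python lists. It is not the HugeGraph2DGL implementation, and to_index_form and the sample IDs are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def to_index_form(
    vertex_ids: Sequence[Hashable],
    edges: Sequence[Tuple[Hashable, Hashable]],
) -> Tuple[Dict[Hashable, int], List[int], List[int]]:
    """Map vertex IDs to 0..n-1 and return edges as (src, dst) index lists."""
    id_to_idx = {vid: i for i, vid in enumerate(vertex_ids)}
    src = [id_to_idx[s] for s, _ in edges]
    dst = [id_to_idx[d] for _, d in edges]
    return id_to_idx, src, dst


# Hypothetical sample data standing in for vertices and edges read from a graph.
vertex_ids = ["1:Al Pacino", "2:The Godfather", "1:Robert De Niro"]
edges = [
    ("1:Al Pacino", "2:The Godfather"),
    ("1:Robert De Niro", "2:The Godfather"),
]
id_to_idx, src, dst = to_index_form(vertex_ids, edges)
print(src, dst)  # prints [0, 2] [1, 1]
```

Index lists of this shape are what tensor-based graph libraries typically take as input when constructing a graph object.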

vermeer-python-client: Distributed Computing

The vermeer-python-client module provides connectivity to the Vermeer distributed graph computing engine. Unlike the standard client that handles online transactions, Vermeer executes heavyweight offline graph algorithms across clusters.

The entry point is PyVermeerClient in vermeer-python-client/src/pyvermeer/client/client.py. It wraps the Vermeer REST API to submit, monitor, and retrieve results from batch processing jobs.

Use cases include:

Distributed PageRank and community detection
Custom user-defined algorithm jobs
Large-scale graph ETL and preprocessing
Offline analytics on massive graph datasets

Architectural Integration: How the Modules Work Together

While each module functions independently, the four are designed to combine into a single workflow. A typical end-to-end pipeline might run as follows:

1. Data Ingestion: Use hugegraph-python-client to define schemas and load vertices/edges via PyHugeClient.
2. Preprocessing: Submit batch jobs through vermeer-python-client for large-scale data cleaning or algorithmic preprocessing.
3. Machine Learning: Extract data using HugeGraph2DGL in hugegraph-ml to train node embeddings or classification models.
4. AI Serving: Deploy hugegraph-llm to build vector indices and serve Graph-RAG queries through the SchedulerSingleton.

This modular architecture allows developers to import only required packages while maintaining the ability to scale from single-node experiments to distributed production deployments.

Practical Code Examples

Connecting to HugeGraph and Creating Schema

from pyhugegraph.client import PyHugeClient

client = PyHugeClient(
    host="127.0.0.1",
    port="8080",
    user="admin",
    pwd="admin",
    graph="hugegraph",
    graphspace="DEFAULT",
)

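# Define schema: both vertex labels must exist before the edge label references them
schema = client.schema()
schema.propertyKey("name").asText().ifNotExist().create()
schema.vertexLabel("Person").properties("name").usePrimaryKeyId().primaryKeys("name").ifNotExist().create()
schema.vertexLabel("Movie").properties("name").usePrimaryKeyId().primaryKeys("name").ifNotExist().create()
schema.edgeLabel("ActedIn").sourceLabel("Person").targetLabel("Movie").ifNotExist().create()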

# Insert data

g = client.graph()
al = g.addVertex("Person", {"name": "Al Pacino"})
godfather = g.addVertex("Movie", {"name": "The Godfather"})
g.addEdge("ActedIn", al.id, godfather.id, {})
g.close()

This example uses PyHugeClient, defined in hugegraph-python-client/src/pyhugegraph/client.py, to create a schema and insert two vertices and the edge between them.

Training Node Embeddings with DGI

from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.dgi import DGI
from hugegraph_ml.tasks.node_embed import NodeEmbed

hg2d = HugeGraph2DGL()
graph = hg2d.convert_graph(vertex_label="CORA_vertex", edge_label="CORA_edge")
model = DGI(n_in_feats=graph.ndata["feat"].shape[1])

embed_task = NodeEmbed(graph=graph, model=model)
embedded = embed_task.train_and_embed(add_self_loop=True, n_epochs=300, patience=30)

print("Embedding shape:", embedded.ndata["feat"].shape)

This snippet, adapted from hugegraph-ml/src/hugegraph_ml/examples/dgi_example.py, converts HugeGraph data to DGL format and trains Deep Graph Infomax embeddings.

Executing Graph-RAG Queries

from hugegraph_llm.flows.scheduler import SchedulerSingleton

scheduler = SchedulerSingleton.get_instance()
result = scheduler.schedule_flow(
    "rag_graph_only",
    query="Tell me about the movie The Godfather.",
    graph_only_answer=True,
    vector_only_answer=False,
    raw_answer=False,
)
print(result.get("graph_only_answer"))

The SchedulerSingleton in hugegraph-llm/src/hugegraph_llm/flows/scheduler.py orchestrates LLM flows for retrieval-augmented generation against graph data.

Submitting Distributed Jobs to Vermeer

from pyvermeer.client.client import PyVermeerClient
from pyvermeer.structure.task_data import TaskCreateRequest

client = PyVermeerClient(ip="127.0.0.1", port=8688, token="", log_level="DEBUG")
task = TaskCreateRequest(
    task_type="load",
    graph_name="DEFAULT-example",
    params={
        "load.hg_pd_peers": '["127.0.0.1:8686"]',
        "load.hugegraph_name": "DEFAULT/example/g",
        "load.hugegraph_username": "admin",
        "load.hugegraph_password": "admin",
        "load.type": "hugegraph",
        "load.parallel": "10",
    },
)
resp = client.tasks.create_task(create_task=task)
print(resp.to_dict())

This example uses PyVermeerClient from vermeer-python-client/src/pyvermeer/client/client.py to submit a batch loading task to the Vermeer distributed engine.
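
After submission, a batch job is typically monitored until it reaches a terminal state. The sketch below shows a generic polling loop. The function wait_for_task, the fetch_status callable, and the status strings are hypothetical placeholders rather than the Vermeer client's API, so check the client for the actual call and state names.

```python
import time
from typing import Callable, Iterable


def wait_for_task(
    fetch_status: Callable[[], str],
    done_states: Iterable[str] = ("complete", "error"),  # hypothetical state names
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> str:
    """Poll fetch_status() until it returns a terminal state or the timeout expires."""
    terminal = set(done_states)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task still in state {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Exercise the loop with a fake status source instead of a live cluster.
_states = iter(["created", "loading", "complete"])
final_state = wait_for_task(lambda: next(_states), interval_s=0.0)
print(final_state)
```

Passing the status lookup in as a callable keeps the loop independent of any particular client, and the timeout prevents a stalled job from blocking the caller indefinitely.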

Summary

hugegraph-python-client provides the foundational SDK for schema management, CRUD operations, and Gremlin queries against HugeGraph servers.
hugegraph-ml delivers deep learning capabilities via DGL and PyTorch for node embeddings, classification, and link prediction tasks.
hugegraph-llm enables AI-driven applications through LLM integration, supporting Graph-RAG, natural language querying, and automated knowledge graph construction via flow-based scheduling.
vermeer-python-client connects to distributed Vermeer clusters for executing large-scale offline graph algorithms and batch processing jobs.

Together, these main modules in Apache HugeGraph AI create a complete pipeline from data ingestion to intelligent query serving.

Frequently Asked Questions

What is the difference between hugegraph-python-client and vermeer-python-client?

hugegraph-python-client is designed for online transactional workloads against a HugeGraph server, handling real-time CRUD operations, schema changes, and interactive Gremlin queries. vermeer-python-client connects to the Vermeer distributed computing engine for offline analytical workloads, such as running PageRank across billions of edges or executing custom batch algorithms on clusters. Use the former for application database access and the latter for heavy-duty graph analytics.

How does hugegraph-llm handle LLM integration and Graph-RAG?

The module uses a scheduler-based architecture centered on SchedulerSingleton in hugegraph-llm/src/hugegraph_llm/flows/scheduler.py. The scheduler manages reusable flows such as RAGGraphOnlyFlow for graph-based retrieval and BuildVectorIndexFlow for building the vector index used in semantic search. It supports both synchronous and streaming responses, so an application can draw context from vector similarity search and from structured graph relationships when generating an answer.

Can I use hugegraph-ml without installing the other modules?

Yes. Each module is a standalone package with its own pyproject.toml, so you can install only hugegraph-ml via pip if your use case requires only graph machine learning. The packages are not fully isolated, however: HugeGraph2DGL reads its training data from a HugeGraph server, so a reachable server and the client connection it relies on are still needed. Check hugegraph-ml's pyproject.toml for the exact dependencies it declares.

Which module should I start with for building a knowledge graph from text?

Start with hugegraph-llm, which supports knowledge graph construction from unstructured text. Its pipelines, run through the scheduler, use LLMs to extract entities and relationships and write them into HugeGraph via the underlying client connections. Text2GremlinFlow addresses a different task: translating natural language questions into Gremlin queries once the graph exists. For the storage layer, you will also need hugegraph-python-client to initialize the graph schema before ingestion.
