
How hugegraph-ml Converts HugeGraph Data for DGL: From Gremlin Queries to GNN Tensors

February 24, 2026 apache/incubator-hugegraph-ai

The hugegraph-ml library bridges Apache HugeGraph and the Deep Graph Library (DGL) by executing Gremlin traversals, remapping vertex IDs to contiguous tensor indices, and constructing native DGL graph objects complete with node features, labels, and train/validation/test masks.

The apache/incubator-hugegraph-ai repository provides the hugegraph-ml module to streamline graph machine learning workflows on top of HugeGraph. This module extracts graph topology and properties via the TinkerPop Gremlin API and transforms them into DGL-compatible data structures. The entire conversion workflow is encapsulated in the HugeGraph2DGL class located in hugegraph_ml/data/hugegraph2dgl.py.

Core Conversion Architecture

Establishing the Gremlin Connection

The conversion process begins by initializing a connection to the HugeGraph server. The HugeGraph2DGL class constructor creates a PyHugeClient instance and obtains a GremlinManager to execute remote traversals.

self._client = PyHugeClient(url=url, graph=graph, user=user, pwd=pwd, graphspace=graphspace)
self._graph_germlin = self._client.gremlin()

Lines 31-42 of hugegraph2dgl.py establish this connection, enabling the subsequent extraction of vertex and edge data through Gremlin queries.

Querying Vertices and Edges

For homogeneous graphs, the convert_graph method executes two fundamental Gremlin traversals to retrieve raw graph data:

vertices = self._graph_germlin.exec(f"g.V().hasLabel('{vertex_label}')")["data"]
edges    = self._graph_germlin.exec(f"g.E().hasLabel('{edge_label}')")["data"]

The queries at lines 54-55 return JSON payloads in which each vertex carries an id and a properties dictionary, and each edge carries outV (the source vertex ID), inV (the target vertex ID), and a label.
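To make the payload shape concrete, the sketch below builds the two query strings and walks a hand-written response of the kind described above. The build_label_query helper, its label check, and the sample records are illustrative assumptions rather than part of hugegraph-ml, and a real response carries more fields than are shown here.

```python
import re

# Hypothetical helper (not part of hugegraph-ml). The traversal is assembled
# with string formatting, so the label is checked to be a plain identifier
# before it is interpolated.
_LABEL_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_label_query(element: str, label: str) -> str:
    if element not in ("V", "E"):
        raise ValueError("element must be 'V' or 'E'")
    if not _LABEL_RE.match(label):
        raise ValueError(f"unexpected label: {label!r}")
    return f"g.{element}().hasLabel('{label}')"


vertex_query = build_label_query("V", "CORA_vertex")
edge_query = build_label_query("E", "CORA_edge")

# Hand-written stand-ins for the "data" field of the two responses.
sample_vertices = [
    {"id": "1:a", "label": "CORA_vertex", "properties": {"feat": [0.1, 0.2], "label": 0}},
    {"id": "1:b", "label": "CORA_vertex", "properties": {"feat": [0.3, 0.4], "label": 1}},
]
sample_edges = [
    {"label": "CORA_edge", "outV": "1:a", "inV": "1:b"},
]

# The two pieces of information the converter needs from the payloads.
vertex_ids = [v["id"] for v in sample_vertices]
edge_pairs = [(e["outV"], e["inV"]) for e in sample_edges]

print(vertex_query)
print(edge_pairs)
```

The label check is optional in a trusted setting, but it keeps a malformed label from changing the meaning of the traversal.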

Building DGL Graph Objects

Homogeneous Graph Conversion

The static helper _convert_graph_from_v_e performs the heavy lifting of transforming raw Gremlin results into a dgl.graph object. This method executes a five-step pipeline:

ID-to-Index Mapping: Creates a contiguous zero-based mapping from HugeGraph vertex IDs to DGL tensor indices (vertex_id_to_idx = {vertex_id: idx ...}), as seen in lines 64-66.
Edge Index Construction: Translates source and target vertex IDs to their corresponding indices using the mapping dictionary (src_idx = [vertex_id_to_idx[e["outV"]] ...], dst_idx = [vertex_id_to_idx[e["inV"]] ...]), shown in lines 65-66.
Graph Instantiation: Creates the DGL graph structure via graph_dgl = dgl.graph((src_idx, dst_idx)) (line 67).
Feature Attachment: Optionally populates graph_dgl.ndata["feat"] with torch.float32 tensors and graph_dgl.ndata["label"] with torch.long tensors (lines 69-74).
Mask Generation: Creates boolean masks for train/validation/test splits stored in graph_dgl.ndata["train_mask"], val_mask, and test_mask (lines 75-80).

The resulting object is a homogeneous DGL graph ready for direct consumption by GNN models such as GraphSAGE or GAT.
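The steps above can be sketched without DGL or PyTorch by using plain Python lists. This is a minimal illustration, not the library's code: the function names are hypothetical, and it assumes that features, labels, and masks are stored as per-vertex properties named feat, label, train_mask, val_mask, and test_mask.

```python
from typing import Any, Dict, List, Tuple


def remap_vertices_and_edges(
    vertices: List[Dict[str, Any]], edges: List[Dict[str, Any]]
) -> Tuple[Dict[Any, int], List[int], List[int]]:
    """Map arbitrary vertex IDs to 0..N-1 and translate edge endpoints."""
    vertex_id_to_idx = {v["id"]: idx for idx, v in enumerate(vertices)}
    src_idx = [vertex_id_to_idx[e["outV"]] for e in edges]
    dst_idx = [vertex_id_to_idx[e["inV"]] for e in edges]
    return vertex_id_to_idx, src_idx, dst_idx


def collect_node_data(vertices: List[Dict[str, Any]]) -> Dict[str, list]:
    """Gather per-node columns in index order (property names are assumed)."""
    props = [v["properties"] for v in vertices]
    columns: Dict[str, list] = {}
    for key in ("feat", "label", "train_mask", "val_mask", "test_mask"):
        if props and key in props[0]:
            columns[key] = [p[key] for p in props]
    return columns


vertices = [
    {"id": "p:10", "properties": {"feat": [1.0, 0.0], "label": 0,
                                  "train_mask": True, "val_mask": False, "test_mask": False}},
    {"id": "p:42", "properties": {"feat": [0.0, 1.0], "label": 1,
                                  "train_mask": False, "val_mask": True, "test_mask": False}},
    {"id": "p:7", "properties": {"feat": [0.5, 0.5], "label": 1,
                                 "train_mask": False, "val_mask": False, "test_mask": True}},
]
edges = [
    {"outV": "p:10", "inV": "p:42"},
    {"outV": "p:7", "inV": "p:10"},
    {"outV": "p:42", "inV": "p:7"},
]

vertex_id_to_idx, src_idx, dst_idx = remap_vertices_and_edges(vertices, edges)
node_data = collect_node_data(vertices)

# With DGL installed, (src_idx, dst_idx) is what dgl.graph(...) receives, and
# each column would be wrapped in a tensor before being stored in ndata.
print(vertex_id_to_idx)
print(src_idx, dst_idx)
```

Because the mapping is built by enumerating the vertex list, row i of every node column describes the same vertex that index i refers to in the edge lists.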

Heterogeneous Graph Support

For multi-relational graphs, the convert_hetero_graph method extends this pattern to handle multiple vertex and edge types separately. It maintains distinct ID-mapping dictionaries per vertex type and constructs a dgl.heterograph object:

hetero_graph = dgl.heterograph(edge_data_dict)
hetero_graph.nodes[vertex_label].data[prop] = vertex_label_data[vertex_label][prop]

Lines 70-112 demonstrate the per-type bookkeeping required to ensure that indices remain contiguous within each node type while preserving the semantic relationships between different entity types.
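The per-type bookkeeping can be illustrated with plain dictionaries. The sketch below is not the library's implementation: the function name is hypothetical, and it assumes each edge record also carries outVLabel and inVLabel fields naming the vertex types of its endpoints. The result is keyed by (source type, edge type, destination type) triples, the shape that dgl.heterograph accepts.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def build_hetero_edge_lists(
    vertices_by_label: Dict[str, List[Dict[str, Any]]],
    edges: List[Dict[str, Any]],
) -> Tuple[Dict[str, Dict[Any, int]], Dict[Tuple[str, str, str], Tuple[List[int], List[int]]]]:
    # One ID map per vertex type, so indices restart at 0 within each type.
    id_maps = {
        label: {v["id"]: idx for idx, v in enumerate(vs)}
        for label, vs in vertices_by_label.items()
    }
    edge_lists = defaultdict(lambda: ([], []))
    for e in edges:
        src_type, dst_type = e["outVLabel"], e["inVLabel"]  # assumed fields
        src, dst = edge_lists[(src_type, e["label"], dst_type)]
        src.append(id_maps[src_type][e["outV"]])
        dst.append(id_maps[dst_type][e["inV"]])
    return id_maps, dict(edge_lists)


vertices_by_label = {
    "Author_v": [{"id": "a1"}, {"id": "a2"}],
    "Paper_v": [{"id": "p1"}, {"id": "p2"}],
}
edges = [
    {"label": "writes_e", "outV": "a1", "outVLabel": "Author_v", "inV": "p1", "inVLabel": "Paper_v"},
    {"label": "writes_e", "outV": "a2", "outVLabel": "Author_v", "inV": "p1", "inVLabel": "Paper_v"},
    {"label": "writes_e", "outV": "a2", "outVLabel": "Author_v", "inV": "p2", "inVLabel": "Paper_v"},
    {"label": "cites_e", "outV": "p2", "outVLabel": "Paper_v", "inV": "p1", "inVLabel": "Paper_v"},
]

id_maps, edge_data_dict = build_hetero_edge_lists(vertices_by_label, edges)
for relation, (src, dst) in edge_data_dict.items():
    print(relation, src, dst)
```

Note that index 0 means a different vertex in Author_v than in Paper_v, which is why each endpoint is looked up in the map for its own type.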

Dataset-Level Batch Conversion

Graph-level tasks (such as molecular property prediction or social network classification) require converting many small graphs simultaneously. The convert_graph_dataset method iterates over graph-level vertices and builds a HugeGraphDataset—a thin wrapper around torch.utils.data.Dataset defined in hugegraph_ml/data/hugegraph_dataset.py (lines 21-31).

Each subgraph is processed by _convert_graph_from_v_e, and the resulting dataset holds the list of DGL graphs together with metadata including the number of graphs, maximum node count, feature dimensions, and class counts (lines 114-146).
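A thin dataset wrapper of this kind needs little more than __len__ and __getitem__. The sketch below is a stand-in written without PyTorch, so it does not subclass torch.utils.data.Dataset; the class name, the dictionary representation of a graph, and the metadata keys are all hypothetical choices made for illustration.

```python
from typing import Any, Dict, List, Tuple


class ListGraphDataset:
    """Map-style dataset stand-in: sample i is the pair (graph i, label i)."""

    def __init__(self, graphs: List[Dict[str, Any]], labels: List[int], info: Dict[str, int]):
        self.graphs = graphs
        self.labels = labels
        self.info = info

    def __len__(self) -> int:
        return len(self.graphs)

    def __getitem__(self, idx: int) -> Tuple[Dict[str, Any], int]:
        return self.graphs[idx], self.labels[idx]


def summarize(graphs: List[Dict[str, Any]], labels: List[int]) -> Dict[str, int]:
    """Compute dataset-level metadata (key names are illustrative)."""
    return {
        "n_graphs": len(graphs),
        "max_n_nodes": max(g["num_nodes"] for g in graphs),
        "n_feat_dim": len(graphs[0]["feat"][0]),
        "n_classes": len(set(labels)),
    }


# Two tiny graphs, each described by a node count, edge lists, and features.
graphs = [
    {"num_nodes": 2, "src": [0], "dst": [1], "feat": [[1.0, 0.0], [0.0, 1.0]]},
    {"num_nodes": 3, "src": [0, 1], "dst": [1, 2], "feat": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]},
]
labels = [0, 1]

dataset = ListGraphDataset(graphs, labels, summarize(graphs, labels))
print(len(dataset), dataset.info)
```

Metadata such as the maximum node count is useful to models that pad or pool to a fixed size, which is a reason to compute it once at conversion time.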

Advanced Conversion Features

Beyond basic topology conversion, hugegraph-ml provides specialized methods for specific GNN architectures:

Edge Feature Support: convert_graph_with_edge_feat attaches edge attributes to graph_dgl.edata["feat"] for models like SEAL or EGNN.
OGB-Style Splits: convert_graph_ogb exposes standardized train_mask, valid_mask, and test_mask tensors compatible with the Open Graph Benchmark evaluation protocol.
BGNN Compatibility: convert_hetero_graph_bgnn processes categorical vertex attributes for BGNN, an architecture that pairs gradient-boosted decision trees with a graph neural network.

All extensions reuse the core ID-mapping and tensor construction utilities to ensure consistency across the different models that consume the output.
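For edge features, the detail that matters is ordering: row i of the edge feature matrix must describe the i-th (source, destination) pair, because DGL numbers edges in the order the pairs are supplied. The sketch below shows one way to keep the two aligned by building them in a single pass. The function name and the assumption that edge attributes live under a feat property are illustrative, not taken from the library.

```python
from typing import Any, Dict, List, Tuple


def align_edge_features(
    edges: List[Dict[str, Any]],
    vertex_id_to_idx: Dict[Any, int],
    feat_key: str = "feat",  # assumed property name
) -> Tuple[List[int], List[int], List[List[float]]]:
    """Build edge index lists and edge features in one pass so they stay aligned."""
    src_idx: List[int] = []
    dst_idx: List[int] = []
    edge_feat: List[List[float]] = []
    for e in edges:
        src_idx.append(vertex_id_to_idx[e["outV"]])
        dst_idx.append(vertex_id_to_idx[e["inV"]])
        edge_feat.append(e["properties"][feat_key])
    return src_idx, dst_idx, edge_feat


vertex_id_to_idx = {"u": 0, "v": 1, "w": 2}
edges = [
    {"outV": "u", "inV": "v", "properties": {"feat": [1.0]}},
    {"outV": "w", "inV": "u", "properties": {"feat": [0.5]}},
]

src_idx, dst_idx, edge_feat = align_edge_features(edges, vertex_id_to_idx)
# One feature row per edge, in the same order as the index lists.
print(list(zip(src_idx, dst_idx, edge_feat)))
```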

Practical Implementation Examples

Converting a Single Homogeneous Graph

from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dgl_graph = hg.convert_graph(vertex_label="CORA_vertex", edge_label="CORA_edge")

print(dgl_graph)                       # DGLGraph with N nodes, E edges
print(dgl_graph.ndata["feat"].shape)   # (N, feature_dim)
print(dgl_graph.ndata["label"])        # node labels

Converting a Heterogeneous Citation Network

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
hetero = hg.convert_hetero_graph(
    vertex_labels=["Paper_v", "Author_v", "Venue_v"],
    edge_labels=["writes_e", "cites_e", "published_in_e"]
)

print(hetero.ntypes)                # node type names; DGL lists them sorted, e.g. ['Author_v', 'Paper_v', 'Venue_v']
print(hetero.etypes)                # edge type names: 'writes_e', 'cites_e', 'published_in_e'
print(hetero.nodes['Paper_v'].data["feat"].shape)

Building a Dataset for Graph Classification

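from dgl.dataloading import GraphDataLoader
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dataset = hg.convert_graph_dataset(
    graph_vertex_label="MUTAG_graph_vertex",
    vertex_label="MUTAG_vertex",
    edge_label="MUTAG_edge"
)

# GraphDataLoader subclasses torch.utils.data.DataLoader and collates
# DGLGraph samples with dgl.batch; the default PyTorch collate function
# cannot merge DGLGraph objects.
loader = GraphDataLoader(dataset, batch_size=4, shuffle=True)
for batched_graph, batch_labels in loader:
    # batched_graph is a single DGLGraph holding up to 4 sample graphs
    print(batched_graph)
    break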
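Batching graph-level samples amounts to forming one disjoint union: the node indices of each graph are shifted by the number of nodes that came before it, so the small graphs sit side by side without sharing edges. The sketch below illustrates that idea with plain lists. The function name and the dictionary representation of a graph are hypothetical, and this is a conceptual picture rather than DGL's implementation.

```python
from typing import Any, Dict, List


def batch_edge_lists(graphs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge small graphs into one disjoint union by offsetting node indices."""
    src: List[int] = []
    dst: List[int] = []
    graph_of_node: List[int] = []  # which sample each merged node came from
    offset = 0
    for gid, g in enumerate(graphs):
        src.extend(s + offset for s in g["src"])
        dst.extend(d + offset for d in g["dst"])
        graph_of_node.extend([gid] * g["num_nodes"])
        offset += g["num_nodes"]
    return {"num_nodes": offset, "src": src, "dst": dst, "graph_of_node": graph_of_node}


small_graphs = [
    {"num_nodes": 2, "src": [0], "dst": [1]},
    {"num_nodes": 3, "src": [0, 1], "dst": [1, 2]},
]

batched = batch_edge_lists(small_graphs)
print(batched)
```

The graph_of_node list is what a readout step would use to pool node embeddings back into one vector per original graph.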

Summary

Centralized Conversion Class: The HugeGraph2DGL class in hugegraph_ml/data/hugegraph2dgl.py serves as the primary entry point for all HugeGraph-to-DGL transformations.
Gremlin-Based Extraction: All data retrieval occurs through the PyHugeClient Gremlin interface, so the converter works with any HugeGraph deployment that exposes the Gremlin API.
ID Remapping Strategy: The library explicitly maps arbitrary HugeGraph vertex IDs to contiguous zero-based indices required by DGL's sparse tensor representations.
Multiple Graph Types: Native support for homogeneous graphs, heterogeneous graphs, and graph-level datasets enables diverse GNN applications from node classification to molecular property prediction.
PyTorch Integration: Output objects are standard DGL graphs that work with existing GNN model implementations; graph-level datasets can be batched with dgl.dataloading.GraphDataLoader, a subclass of torch.utils.data.DataLoader.

Frequently Asked Questions

What is the primary class used to convert HugeGraph data to DGL format?

The HugeGraph2DGL class located in hugegraph_ml/data/hugegraph2dgl.py serves as the main conversion utility. It provides methods like convert_graph for homogeneous graphs, convert_hetero_graph for multi-relational graphs, and convert_graph_dataset for graph-level classification tasks.

How does hugegraph-ml handle vertex ID mapping between HugeGraph and DGL?

The library creates a Python dictionary (vertex_id_to_idx) that maps arbitrary HugeGraph vertex IDs (which may be strings or non-contiguous integers) to contiguous zero-based indices. This mapping is applied to both source and target vertices when constructing edge index tensors for DGL, as implemented in the _convert_graph_from_v_e helper method.

Can hugegraph-ml convert heterogeneous graphs with multiple node and edge types?

Yes. The convert_hetero_graph method handles heterogeneous graphs by maintaining separate ID-mapping dictionaries for each vertex type and constructing a dgl.heterograph object. It accepts lists of vertex labels and edge labels, ensuring that nodes and edges of different types are properly segregated while preserving cross-type relationships.

Does the library support graph-level datasets for tasks like graph classification?

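Yes. The convert_graph_dataset method produces a HugeGraphDataset object (a PyTorch Dataset) containing multiple DGL graph objects. Because the default collate function of torch.utils.data.DataLoader cannot merge DGLGraph samples, batching is done with a graph-aware loader such as dgl.dataloading.GraphDataLoader, or with a DataLoader that is given a custom collate_fn. This supports graph classification or regression tasks where each sample is an independent graph structure, such as molecular datasets or social network communities.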
