How hugegraph-ml Converts HugeGraph Data for DGL: From Gremlin Queries to GNN Tensors
The hugegraph-ml library bridges Apache HugeGraph and the Deep Graph Library (DGL) by executing Gremlin traversals, remapping vertex IDs to contiguous tensor indices, and constructing native DGL graph objects complete with node features, labels, and train/validation/test masks.
The apache/incubator-hugegraph-ai repository provides the hugegraph-ml module to streamline graph machine learning workflows on top of HugeGraph. This module extracts graph topology and properties via the TinkerPop Gremlin API and transforms them into DGL-compatible data structures. The entire conversion workflow is encapsulated in the HugeGraph2DGL class located in hugegraph_ml/data/hugegraph2dgl.py.
Core Conversion Architecture
Establishing the Gremlin Connection
The conversion process begins by initializing a connection to the HugeGraph server. The HugeGraph2DGL class constructor creates a PyHugeClient instance and obtains a GremlinManager to execute remote traversals.
self._client = PyHugeClient(url=url, graph=graph, user=user, pwd=pwd, graphspace=graphspace)
self._graph_germlin = self._client.gremlin()
Lines 31-42 of hugegraph2dgl.py establish this connection, enabling the subsequent extraction of vertex and edge data through Gremlin queries.
Querying Vertices and Edges
For homogeneous graphs, the convert_graph method executes two fundamental Gremlin traversals to retrieve raw graph data:
vertices = self._graph_germlin.exec(f"g.V().hasLabel('{vertex_label}')")["data"]
edges = self._graph_germlin.exec(f"g.E().hasLabel('{edge_label}')")["data"]
The calls on lines 54-55 return JSON payloads in which each vertex carries an id and a properties dictionary, and each edge carries outV (source vertex ID), inV (target vertex ID), and label fields.
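To make the payload shape concrete, the snippet below parses a hand-written stand-in for the two responses. The IDs, the property names feat and label, and their values are invented for illustration; only the id, properties, outV, inV, and label keys come from the description above.

```python
import json

# Hypothetical stand-ins for the "data" lists returned by the two traversals.
# All IDs and property values here are made up for illustration.
vertices_json = """
[
  {"id": "1:p10", "label": "CORA_vertex",
   "properties": {"feat": [0.1, 0.9], "label": 2}},
  {"id": "1:p42", "label": "CORA_vertex",
   "properties": {"feat": [0.7, 0.3], "label": 0}}
]
"""
edges_json = """
[
  {"outV": "1:p10", "inV": "1:p42", "label": "CORA_edge"}
]
"""

vertices = json.loads(vertices_json)
edges = json.loads(edges_json)

# Vertex IDs are arbitrary strings here, not tensor indices.
vertex_ids = [v["id"] for v in vertices]
# Each edge names its endpoints by vertex ID.
endpoints = [(e["outV"], e["inV"]) for e in edges]

print(vertex_ids)
print(endpoints)
```

Because the edges refer to vertices by ID rather than by position, a remapping step is needed before the data can be handed to DGL.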
Building DGL Graph Objects
Homogeneous Graph Conversion
The static helper _convert_graph_from_v_e performs the heavy lifting of transforming raw Gremlin results into a dgl.graph object. This method executes a five-step pipeline:
- ID-to-Index Mapping: Creates a contiguous zero-based mapping from HugeGraph vertex IDs to DGL tensor indices (vertex_id_to_idx = {vertex_id: idx ...}), as seen in lines 64-66.
- Edge Index Construction: Translates source and target vertex IDs to their corresponding indices using the mapping dictionary (src_idx = [vertex_id_to_idx[e["outV"]] ...], dst_idx = [vertex_id_to_idx[e["inV"]] ...]), shown in lines 65-66.
- Graph Instantiation: Creates the DGL graph structure via graph_dgl = dgl.graph((src_idx, dst_idx)) (line 67).
- Feature Attachment: Optionally populates graph_dgl.ndata["feat"] with torch.float32 tensors and graph_dgl.ndata["label"] with torch.long tensors (lines 69-74).
- Mask Generation: Creates boolean masks for train/validation/test splits stored in graph_dgl.ndata["train_mask"], val_mask, and test_mask (lines 75-80).
The resulting object is a homogeneous DGL graph ready for direct consumption by GNN models such as GraphSAGE or GAT.
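The pipeline above can be sketched in plain Python without a running server. This is a minimal illustration, not the library's code: the function name build_graph_arrays, the mock data, and the assumption that masks are stored as vertex properties named train_mask, val_mask, and test_mask are all hypothetical. The sketch stops at index lists and per-vertex columns, so step 3 (the dgl.graph call) appears only as a comment.

```python
def build_graph_arrays(vertices, edges):
    """Hypothetical helper mirroring steps 1, 2, 4 and 5 with plain lists."""
    # Step 1: map each HugeGraph vertex ID to a contiguous zero-based index.
    vertex_id_to_idx = {v["id"]: idx for idx, v in enumerate(vertices)}

    # Step 2: translate edge endpoints from vertex IDs to indices.
    src_idx = [vertex_id_to_idx[e["outV"]] for e in edges]
    dst_idx = [vertex_id_to_idx[e["inV"]] for e in edges]

    # Step 3 would be dgl.graph((src_idx, dst_idx)); omitted to stay dependency-free.

    # Steps 4 and 5: gather per-vertex columns in index order, keeping only
    # the properties that every vertex actually has.
    columns = {}
    for key in ("feat", "label", "train_mask", "val_mask", "test_mask"):
        if all(key in v["properties"] for v in vertices):
            columns[key] = [v["properties"][key] for v in vertices]
    return vertex_id_to_idx, src_idx, dst_idx, columns


# Mock Gremlin results; the IDs are deliberately non-numeric strings.
vertices = [
    {"id": "1:a", "properties": {"feat": [1.0, 0.0], "label": 0,
                                 "train_mask": True, "val_mask": False, "test_mask": False}},
    {"id": "1:b", "properties": {"feat": [0.0, 1.0], "label": 1,
                                 "train_mask": False, "val_mask": True, "test_mask": False}},
    {"id": "1:c", "properties": {"feat": [0.5, 0.5], "label": 0,
                                 "train_mask": False, "val_mask": False, "test_mask": True}},
]
edges = [
    {"outV": "1:a", "inV": "1:b"},
    {"outV": "1:c", "inV": "1:a"},
]

vertex_id_to_idx, src_idx, dst_idx, columns = build_graph_arrays(vertices, edges)
print(vertex_id_to_idx)
print(src_idx, dst_idx)
print(columns["train_mask"])
```

With DGL and PyTorch installed, src_idx and dst_idx would feed dgl.graph, and each column would be wrapped in a tensor of the dtype listed above. One design point worth noting: dgl.graph infers the node count from the largest index it sees, so passing num_nodes=len(vertices) is a reasonable guard when the last vertices have no edges.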
Heterogeneous Graph Support
For multi-relational graphs, the convert_hetero_graph method extends this pattern to handle multiple vertex and edge types separately. It maintains distinct ID-mapping dictionaries per vertex type and constructs a dgl.heterograph object:
hetero_graph = dgl.heterograph(edge_data_dict)
hetero_graph.nodes[vertex_label].data[prop] = vertex_label_data[vertex_label][prop]
Lines 70-112 demonstrate the per-type bookkeeping required to ensure that indices remain contiguous within each node type while preserving the semantic relationships between different entity types.
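The per-type bookkeeping can be illustrated with a small dependency-free sketch. It assumes each edge record also reports the labels of its endpoints (the outVLabel and inVLabel keys below are an assumption, as are the mock IDs), and it builds a dictionary keyed by (source type, edge type, destination type) triples, which is the shape dgl.heterograph expects for its input. It is not the library's implementation.

```python
from collections import defaultdict

# Mock query results grouped by label; all IDs are invented.
vertices_by_label = {
    "Paper_v": [{"id": "p:7"}, {"id": "p:9"}],
    "Author_v": [{"id": "a:3"}],
}
edges_by_label = {
    "writes_e": [
        {"outV": "a:3", "inV": "p:7", "outVLabel": "Author_v", "inVLabel": "Paper_v"},
        {"outV": "a:3", "inV": "p:9", "outVLabel": "Author_v", "inVLabel": "Paper_v"},
    ],
    "cites_e": [
        {"outV": "p:9", "inV": "p:7", "outVLabel": "Paper_v", "inVLabel": "Paper_v"},
    ],
}

# One ID map per vertex type, so indices restart at zero within each type.
id_maps = {
    label: {v["id"]: idx for idx, v in enumerate(vs)}
    for label, vs in vertices_by_label.items()
}

# Group edges by canonical (src_type, edge_type, dst_type) triple.
edge_data_dict = {}
for edge_label, es in edges_by_label.items():
    grouped = defaultdict(lambda: ([], []))
    for e in es:
        key = (e["outVLabel"], edge_label, e["inVLabel"])
        src, dst = grouped[key]
        # Look up each endpoint in the map of its own vertex type.
        src.append(id_maps[e["outVLabel"]][e["outV"]])
        dst.append(id_maps[e["inVLabel"]][e["inV"]])
    edge_data_dict.update(grouped)

print(id_maps)
print(edge_data_dict)
```

Note that index 0 means a different vertex in Paper_v than in Author_v; the edge-type triple is what tells DGL which map each endpoint belongs to.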
Dataset-Level Batch Conversion
Graph-level tasks (such as molecular property prediction or social network classification) require converting many small graphs in one pass. The convert_graph_dataset method iterates over graph-level vertices and builds a HugeGraphDataset, a thin wrapper around torch.utils.data.Dataset defined in hugegraph_ml/data/hugegraph_dataset.py (lines 21-31).
Each subgraph is processed by _convert_graph_from_v_e, and the method returns a list of DGL graphs accompanied by metadata including the number of graphs, maximum node count, feature dimensions, and class counts (lines 114-146).
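As a rough illustration of the metadata described above, the sketch below derives the four summary values from a list of mock graphs. The function name summarize, the dictionary keys, and the plain-dict graph representation are hypothetical stand-ins; the library works on DGL graphs and may name these fields differently.

```python
def summarize(graphs, labels):
    """Hypothetical summary of a graph-level dataset (key names are assumptions)."""
    return {
        "n_graphs": len(graphs),
        # Largest node count across all graphs in the dataset.
        "max_n_nodes": max(g["num_nodes"] for g in graphs),
        # Feature width, read from the first node of the first graph.
        "n_feat_dim": len(graphs[0]["feat"][0]),
        # Number of distinct graph-level labels.
        "n_classes": len(set(labels)),
    }


# Two mock graphs with 2-dimensional node features.
graphs = [
    {"num_nodes": 3, "feat": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]},
    {"num_nodes": 2, "feat": [[1.0, 1.0], [0.0, 0.0]]},
]
labels = [1, 0]

info = summarize(graphs, labels)
print(info)
```

Values such as the feature dimension and class count are typically what a model constructor needs to size its input and output layers.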
Advanced Conversion Features
Beyond basic topology conversion, hugegraph-ml provides specialized methods for specific GNN architectures:
- Edge Feature Support: convert_graph_with_edge_feat attaches edge attributes to graph_dgl.edata["feat"] for models like SEAL or EGNN.
- OGB-Style Splits: convert_graph_ogb exposes standardized train_mask, valid_mask, and test_mask tensors compatible with the Open Graph Benchmark evaluation protocol.
- BGNN Compatibility: convert_hetero_graph_bgnn processes categorical vertex attributes specifically for Bipartite Graph Neural Network architectures.
All extensions reuse the core ID-mapping and tensor construction utilities to ensure consistency across different downstream frameworks.
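For edge features, the key detail is that feature rows must line up with the order in which edges are handed to DGL, since DGL assigns edge IDs in input order. The sketch below shows one way to keep them aligned by building all three lists in a single loop; the property name feat on edges and the mock data are assumptions, not the library's code.

```python
# Mock edges carrying a hypothetical "feat" property.
edges = [
    {"outV": "u1", "inV": "u2", "properties": {"feat": [0.5, 1.0]}},
    {"outV": "u2", "inV": "u3", "properties": {"feat": [0.2, 0.4]}},
]
vertex_id_to_idx = {"u1": 0, "u2": 1, "u3": 2}

src_idx, dst_idx, edge_feat = [], [], []
for e in edges:
    # Appending in the same iteration keeps row i of edge_feat tied to edge i.
    src_idx.append(vertex_id_to_idx[e["outV"]])
    dst_idx.append(vertex_id_to_idx[e["inV"]])
    edge_feat.append(e["properties"]["feat"])

print(list(zip(src_idx, dst_idx, edge_feat)))
```

The edge_feat list would then be converted to a float tensor and stored under graph_dgl.edata["feat"].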
Practical Implementation Examples
Converting a Single Homogeneous Graph
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dgl_graph = hg.convert_graph(vertex_label="CORA_vertex", edge_label="CORA_edge")
print(dgl_graph) # DGLGraph with N nodes, E edges
print(dgl_graph.ndata["feat"].shape) # (N, feature_dim)
print(dgl_graph.ndata["label"]) # node labels
Converting a Heterogeneous Citation Network
hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
hetero = hg.convert_hetero_graph(
    vertex_labels=["Paper_v", "Author_v", "Venue_v"],
    edge_labels=["writes_e", "cites_e", "published_in_e"]
)
print(hetero.ntypes) # ['Paper_v', 'Author_v', 'Venue_v']
print(hetero.etypes) # ['writes_e', 'cites_e', 'published_in_e']
print(hetero.nodes['Paper_v'].data["feat"].shape)
Building a Dataset for Graph Classification
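from dgl.dataloading import GraphDataLoader
hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dataset = hg.convert_graph_dataset(
    graph_vertex_label="MUTAG_graph_vertex",
    vertex_label="MUTAG_vertex",
    edge_label="MUTAG_edge"
)
# GraphDataLoader (a torch DataLoader subclass) merges the sampled graphs into
# one batched DGLGraph; torch's default collate function cannot collate DGLGraph objects.
loader = GraphDataLoader(dataset, batch_size=4, shuffle=True)
for batch_graphs, batch_labels in loader:
    # batch_graphs is a single batched DGLGraph holding up to 4 graphs
    print(batch_graphs)
    break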
Summary
- Centralized Conversion Class: The HugeGraph2DGL class in hugegraph_ml/data/hugegraph2dgl.py serves as the primary entry point for all HugeGraph-to-DGL transformations.
- Gremlin-Based Extraction: All data retrieval occurs through the PyHugeClient Gremlin interface, ensuring compatibility with any HugeGraph deployment.
- ID Remapping Strategy: The library explicitly maps arbitrary HugeGraph vertex IDs to contiguous zero-based indices required by DGL's sparse tensor representations.
- Multi-Modal Support: Native support for homogeneous graphs, heterogeneous graphs, and graph-level datasets enables diverse GNN applications from node classification to molecular property prediction.
- PyTorch Integration: Output objects are standard DGL graphs compatible with torch.utils.data.DataLoader and existing GNN model implementations.
Frequently Asked Questions
What is the primary class used to convert HugeGraph data to DGL format?
The HugeGraph2DGL class located in hugegraph_ml/data/hugegraph2dgl.py serves as the main conversion utility. It provides methods like convert_graph for homogeneous graphs, convert_hetero_graph for multi-relational graphs, and convert_graph_dataset for graph-level classification tasks.
How does hugegraph-ml handle vertex ID mapping between HugeGraph and DGL?
The library creates a Python dictionary (vertex_id_to_idx) that maps arbitrary HugeGraph vertex IDs (which may be strings or non-contiguous integers) to contiguous zero-based indices. This mapping is applied to both source and target vertices when constructing edge index tensors for DGL, as implemented in the _convert_graph_from_v_e helper method.
Can hugegraph-ml convert heterogeneous graphs with multiple node and edge types?
Yes. The convert_hetero_graph method handles heterogeneous graphs by maintaining separate ID-mapping dictionaries for each vertex type and constructing a dgl.heterograph object. It accepts lists of vertex labels and edge labels, ensuring that nodes and edges of different types are properly segregated while preserving cross-type relationships.
Does the library support graph-level datasets for tasks like graph classification?
Yes. The convert_graph_dataset method produces a HugeGraphDataset object (a PyTorch Dataset) containing multiple DGL graph objects. This supports batching via torch.utils.data.DataLoader for graph classification or regression tasks where each sample is an independent graph structure, such as molecular datasets or social network communities.