# How hugegraph-ml Converts HugeGraph Data for DGL: From Gremlin Queries to GNN Tensors

> Learn how hugegraph-ml converts HugeGraph data for DGL by executing Gremlin queries and constructing native DGL graph objects with node features and masks.

- Repository: [The Apache Software Foundation/incubator-hugegraph-ai](https://github.com/apache/incubator-hugegraph-ai)
- Tags: how-to-guide
- Published: 2026-02-24

---

**The hugegraph-ml library bridges Apache HugeGraph and the Deep Graph Library (DGL) by executing Gremlin traversals, remapping vertex IDs to contiguous tensor indices, and constructing native DGL graph objects complete with node features, labels, and train/validation/test masks.**

The `apache/incubator-hugegraph-ai` repository provides the `hugegraph-ml` module to streamline graph machine learning workflows on top of HugeGraph. This module extracts graph topology and properties via the TinkerPop Gremlin API and transforms them into DGL-compatible data structures. The entire conversion workflow is encapsulated in the `HugeGraph2DGL` class located in [`hugegraph_ml/data/hugegraph2dgl.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph_ml/data/hugegraph2dgl.py).

## Core Conversion Architecture

### Establishing the Gremlin Connection

The conversion process begins by initializing a connection to the HugeGraph server. The `HugeGraph2DGL` class constructor creates a `PyHugeClient` instance and obtains a `GremlinManager` to execute remote traversals.

```python
self._client = PyHugeClient(url=url, graph=graph, user=user, pwd=pwd, graphspace=graphspace)
self._graph_germlin = self._client.gremlin()
```

*Lines 31-42* of [`hugegraph2dgl.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph_ml/data/hugegraph2dgl.py) establish this connection, which the later Gremlin queries use to extract vertex and edge data.

### Querying Vertices and Edges

For homogeneous graphs, the `convert_graph` method executes two fundamental Gremlin traversals to retrieve raw graph data:

```python
vertices = self._graph_germlin.exec(f"g.V().hasLabel('{vertex_label}')")["data"]
edges    = self._graph_germlin.exec(f"g.E().hasLabel('{edge_label}')")["data"]
```

The queries on *lines 54-55* return JSON payloads in which each vertex carries an `id` and a `properties` dictionary, and each edge carries `outV` (source vertex ID), `inV` (target vertex ID), and its label.
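To make the payload shape concrete, the sketch below walks over a hand-written sample that follows the description above. The sample values and the `summarize_payload` helper are hypothetical, and real responses may include more fields than shown here.

```python
# Hand-written sample shaped like the payloads described above.
# Assumption: real responses may carry additional keys.
sample_vertices = [
    {"id": "1:p1", "label": "CORA_vertex", "properties": {"feat": [0.0, 1.0], "label": 2}},
    {"id": "1:p2", "label": "CORA_vertex", "properties": {"feat": [1.0, 0.0], "label": 0}},
]
sample_edges = [
    {"label": "CORA_edge", "outV": "1:p1", "inV": "1:p2"},
]


def summarize_payload(vertices, edges):
    """Hypothetical helper: collect vertex IDs and (source, target) ID pairs."""
    vertex_ids = [v["id"] for v in vertices]
    endpoints = [(e["outV"], e["inV"]) for e in edges]
    return vertex_ids, endpoints


vertex_ids, endpoints = summarize_payload(sample_vertices, sample_edges)
print(vertex_ids)
print(endpoints)
```

Note that the edge endpoints are still HugeGraph vertex IDs at this stage, which is why a separate ID-to-index mapping is needed before DGL can use them.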

## Building DGL Graph Objects

### Homogeneous Graph Conversion

The static helper `_convert_graph_from_v_e` performs the heavy lifting of transforming raw Gremlin results into a `dgl.graph` object. This method executes a five-step pipeline:

1.  **ID-to-Index Mapping**: Creates a contiguous zero-based mapping from HugeGraph vertex IDs to DGL tensor indices (`vertex_id_to_idx = {vertex_id: idx ...}`), as seen in *lines 64-66*.
2.  **Edge Index Construction**: Translates source and target vertex IDs to their corresponding indices using the mapping dictionary (`src_idx = [vertex_id_to_idx[e["outV"]] ...]`, `dst_idx = [vertex_id_to_idx[e["inV"]] ...]`), shown in *lines 65-66*.
3.  **Graph Instantiation**: Creates the DGL graph structure via `graph_dgl = dgl.graph((src_idx, dst_idx))` (*line 67*).
4.  **Feature Attachment**: Optionally populates `graph_dgl.ndata["feat"]` with `torch.float32` tensors and `graph_dgl.ndata["label"]` with `torch.long` tensors (*lines 69-74*).
5.  **Mask Generation**: Creates boolean masks for train/validation/test splits stored in `graph_dgl.ndata["train_mask"]`, `val_mask`, and `test_mask` (*lines 75-80*).

The resulting object is a homogeneous DGL graph ready for direct consumption by GNN models such as GraphSAGE or GAT.
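The pipeline can be sketched with plain Python lists, without DGL or PyTorch. This is a minimal illustration, not the library's code: the `to_index_arrays` helper is hypothetical, and it assumes that features, labels, and masks are stored as vertex properties named `feat`, `label`, `train_mask`, `val_mask`, and `test_mask`.

```python
def to_index_arrays(vertices, edges):
    """Hypothetical helper mirroring the pipeline with plain Python lists."""
    # Step 1: assign each HugeGraph vertex ID a contiguous zero-based index.
    vertex_id_to_idx = {v["id"]: idx for idx, v in enumerate(vertices)}

    # Step 2: translate edge endpoints into those indices.
    src_idx = [vertex_id_to_idx[e["outV"]] for e in edges]
    dst_idx = [vertex_id_to_idx[e["inV"]] for e in edges]

    # Steps 4-5: gather per-node columns in index order, skipping absent ones.
    # Assumption: these values live in each vertex's properties dictionary.
    node_data = {}
    first_props = vertices[0]["properties"] if vertices else {}
    for key in ("feat", "label", "train_mask", "val_mask", "test_mask"):
        if key in first_props:
            node_data[key] = [v["properties"][key] for v in vertices]
    return src_idx, dst_idx, node_data


vertices = [
    {"id": "1:c", "properties": {"feat": [0.5, 0.5], "label": 1, "train_mask": True}},
    {"id": "1:a", "properties": {"feat": [1.0, 0.0], "label": 0, "train_mask": False}},
    {"id": "1:b", "properties": {"feat": [0.0, 1.0], "label": 1, "train_mask": True}},
]
edges = [
    {"outV": "1:a", "inV": "1:b"},
    {"outV": "1:c", "inV": "1:a"},
]

src_idx, dst_idx, node_data = to_index_arrays(vertices, edges)
print(src_idx, dst_idx)  # [1, 0] [2, 1]
```

With DGL and PyTorch installed, step 3 would pass these lists to `dgl.graph((src_idx, dst_idx))`, and the columns in `node_data` would become `torch.float32`, `torch.long`, and boolean tensors as described in the list above.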

### Heterogeneous Graph Support

For multi-relational graphs, the `convert_hetero_graph` method extends this pattern to handle multiple vertex and edge types separately. It maintains distinct ID-mapping dictionaries per vertex type and constructs a `dgl.heterograph` object:

```python
hetero_graph = dgl.heterograph(edge_data_dict)
hetero_graph.nodes[vertex_label].data[prop] = vertex_label_data[vertex_label][prop]
```

*Lines 70-112* demonstrate the per-type bookkeeping required to ensure that indices remain contiguous within each node type while preserving the semantic relationships between different entity types.
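The per-type bookkeeping can be illustrated with a small pure-Python sketch. The `build_hetero_edge_dict` helper is hypothetical, and it assumes each edge payload exposes its endpoint types as `outVLabel` and `inVLabel`. The dictionary it builds uses the `(source type, edge type, destination type)` key format that `dgl.heterograph` accepts.

```python
def build_hetero_edge_dict(vertices_by_label, edges_by_label):
    """Hypothetical helper: one index space per vertex type, one edge list per relation."""
    # Each vertex type gets its own contiguous zero-based index space.
    id_maps = {
        label: {v["id"]: idx for idx, v in enumerate(vertices)}
        for label, vertices in vertices_by_label.items()
    }

    edge_data_dict = {}
    for edge_label, edges in edges_by_label.items():
        for e in edges:
            # Assumption: endpoint types are available on the edge payload.
            src_type, dst_type = e["outVLabel"], e["inVLabel"]
            key = (src_type, edge_label, dst_type)
            src_list, dst_list = edge_data_dict.setdefault(key, ([], []))
            src_list.append(id_maps[src_type][e["outV"]])
            dst_list.append(id_maps[dst_type][e["inV"]])
    return id_maps, edge_data_dict


vertices_by_label = {
    "Author_v": [{"id": "1:a1"}, {"id": "1:a2"}],
    "Paper_v": [{"id": "2:p1"}],
}
edges_by_label = {
    "writes_e": [
        {"outV": "1:a1", "outVLabel": "Author_v", "inV": "2:p1", "inVLabel": "Paper_v"},
        {"outV": "1:a2", "outVLabel": "Author_v", "inV": "2:p1", "inVLabel": "Paper_v"},
    ],
}

id_maps, edge_data_dict = build_hetero_edge_dict(vertices_by_label, edges_by_label)
print(edge_data_dict)
```

Because indices are assigned per type, the first author and the first paper both receive index 0. The node type in the relation key is what keeps them apart.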

### Dataset-Level Batch Conversion

Graph-level tasks (such as molecular property prediction or social network classification) require converting many small graphs at once. The `convert_graph_dataset` method iterates over graph-level vertices and builds a `HugeGraphDataset`, a thin wrapper around `torch.utils.data.Dataset` defined in [`hugegraph_ml/data/hugegraph_dataset.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph_ml/data/hugegraph_dataset.py) (*lines 21-31*).

Each subgraph is processed by `_convert_graph_from_v_e`, and the resulting list of DGL graphs is wrapped in the dataset together with metadata: the number of graphs, the maximum node count, the feature dimension, and the number of classes (*lines 114-146*).
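As a rough illustration of that metadata, the sketch below computes the same four quantities from plain Python data. The `summarize_dataset` helper, its dictionary keys, and the `(node_features, graph_label)` sample layout are all assumptions made for this example.

```python
def summarize_dataset(samples):
    """Hypothetical helper.

    samples: list of (node_features, graph_label) pairs, where node_features
    holds one feature vector per node. Assumes every graph has at least one node.
    """
    if not samples:
        return {"n_graphs": 0, "max_n_nodes": 0, "n_feat_dim": 0, "n_classes": 0}
    return {
        "n_graphs": len(samples),
        "max_n_nodes": max(len(feats) for feats, _ in samples),
        "n_feat_dim": len(samples[0][0][0]),
        "n_classes": len({label for _, label in samples}),
    }


samples = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0),  # 3 nodes, class 0
    ([[0.0, 0.0], [1.0, 0.0]], 1),              # 2 nodes, class 1
    ([[0.5, 0.5]], 0),                          # 1 node, class 0
]
info = summarize_dataset(samples)
print(info)
```

Values like these are typically what a graph classifier needs up front, for example to size its input layer and its output layer.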

## Advanced Conversion Features

Beyond basic topology conversion, `hugegraph-ml` provides specialized methods for specific GNN architectures:

-   **Edge Feature Support**: `convert_graph_with_edge_feat` attaches edge attributes to `graph_dgl.edata["feat"]` for models like SEAL or EGNN.
-   **OGB-Style Splits**: `convert_graph_ogb` exposes standardized `train_mask`, `valid_mask`, and `test_mask` tensors compatible with the Open Graph Benchmark evaluation protocol.
-   **BGNN Compatibility**: `convert_hetero_graph_bgnn` processes categorical vertex attributes for BGNN, an architecture that combines gradient-boosted decision trees with a graph neural network.

All extensions reuse the core ID-mapping and tensor construction utilities to ensure consistency across different downstream frameworks.
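For edge features, the main constraint is ordering: row `i` of `edata["feat"]` must describe the `i`-th edge passed to DGL. The sketch below shows one way to keep the two aligned by building them in a single pass. The `collect_edge_arrays` helper and the `feat` edge property name are assumptions for illustration.

```python
def collect_edge_arrays(vertices, edges, feat_key="feat"):
    """Hypothetical helper: build endpoint indices and edge features in lockstep."""
    vertex_id_to_idx = {v["id"]: idx for idx, v in enumerate(vertices)}
    src_idx, dst_idx, edge_feats = [], [], []
    for e in edges:
        # Appending all three lists in the same iteration keeps row i of
        # edge_feats aligned with edge i of (src_idx, dst_idx).
        src_idx.append(vertex_id_to_idx[e["outV"]])
        dst_idx.append(vertex_id_to_idx[e["inV"]])
        edge_feats.append(e["properties"][feat_key])
    return src_idx, dst_idx, edge_feats


vertices = [{"id": "1:a"}, {"id": "1:b"}, {"id": "1:c"}]
edges = [
    {"outV": "1:a", "inV": "1:b", "properties": {"feat": [0.1, 0.9]}},
    {"outV": "1:c", "inV": "1:a", "properties": {"feat": [0.7, 0.3]}},
]

src_idx, dst_idx, edge_feats = collect_edge_arrays(vertices, edges)
print(src_idx, dst_idx, edge_feats)
```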

## Practical Implementation Examples

### Converting a Single Homogeneous Graph

```python
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dgl_graph = hg.convert_graph(vertex_label="CORA_vertex", edge_label="CORA_edge")

print(dgl_graph)                       # DGLGraph with N nodes, E edges
print(dgl_graph.ndata["feat"].shape)   # (N, feature_dim)
print(dgl_graph.ndata["label"])        # node labels
```

### Converting a Heterogeneous Citation Network

```python
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
hetero = hg.convert_hetero_graph(
    vertex_labels=["Paper_v", "Author_v", "Venue_v"],
    edge_labels=["writes_e", "cites_e", "published_in_e"]
)

print(hetero.ntypes)                # node type names: Paper_v, Author_v, Venue_v (order may differ)
print(hetero.etypes)                # edge type names: writes_e, cites_e, published_in_e (order may differ)
print(hetero.nodes['Paper_v'].data["feat"].shape)
```

### Building a Dataset for Graph Classification

```python
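from dgl.dataloading import GraphDataLoader

from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL

hg = HugeGraph2DGL(url="http://127.0.0.1:8080", graph="hugegraph")
dataset = hg.convert_graph_dataset(
    graph_vertex_label="MUTAG_graph_vertex",
    vertex_label="MUTAG_vertex",
    edge_label="MUTAG_edge"
)

# GraphDataLoader merges each mini-batch of graphs into one batched DGLGraph;
# the default collate function of torch.utils.data.DataLoader cannot combine
# DGLGraph objects.
loader = GraphDataLoader(dataset, batch_size=4, shuffle=True)
for batched_graph, batch_labels in loader:
    print(batched_graph)   # one DGLGraph holding up to 4 small graphs
    break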
```

## Summary

-   **Centralized Conversion Class**: The `HugeGraph2DGL` class in [`hugegraph_ml/data/hugegraph2dgl.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph_ml/data/hugegraph2dgl.py) serves as the primary entry point for all HugeGraph-to-DGL transformations.
-   **Gremlin-Based Extraction**: All data retrieval goes through the `PyHugeClient` Gremlin interface, so the converter works against any HugeGraph server that the client can reach.
-   **ID Remapping Strategy**: The library explicitly maps arbitrary HugeGraph vertex IDs to contiguous zero-based indices required by DGL's sparse tensor representations.
-   **Multiple Graph Types**: Native support for homogeneous graphs, heterogeneous graphs, and graph-level datasets covers GNN applications from node classification to molecular property prediction.
-   **PyTorch Integration**: Output objects are standard DGL graphs, and graph-level datasets implement the PyTorch `Dataset` interface, so they work with DGL's `GraphDataLoader` and existing GNN model implementations.

## Frequently Asked Questions

### What is the primary class used to convert HugeGraph data to DGL format?

The `HugeGraph2DGL` class located in [`hugegraph_ml/data/hugegraph2dgl.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph_ml/data/hugegraph2dgl.py) serves as the main conversion utility. It provides methods like `convert_graph` for homogeneous graphs, `convert_hetero_graph` for multi-relational graphs, and `convert_graph_dataset` for graph-level classification tasks.

### How does hugegraph-ml handle vertex ID mapping between HugeGraph and DGL?

The library creates a Python dictionary (`vertex_id_to_idx`) that maps arbitrary HugeGraph vertex IDs (which may be strings or non-contiguous integers) to contiguous zero-based indices. This mapping is applied to both source and target vertices when constructing edge index tensors for DGL, as implemented in the `_convert_graph_from_v_e` helper method.

### Can hugegraph-ml convert heterogeneous graphs with multiple node and edge types?

Yes. The `convert_hetero_graph` method handles heterogeneous graphs by maintaining separate ID-mapping dictionaries for each vertex type and constructing a `dgl.heterograph` object. It accepts lists of vertex labels and edge labels, ensuring that nodes and edges of different types are properly segregated while preserving cross-type relationships.

### Does the library support graph-level datasets for tasks like graph classification?

Yes. The `convert_graph_dataset` method produces a `HugeGraphDataset` object (a PyTorch `Dataset`) containing multiple DGL graph objects. Batching works through a data loader that knows how to merge graphs, such as DGL's `GraphDataLoader` or a `torch.utils.data.DataLoader` with a custom `collate_fn`. This suits graph classification or regression tasks where each sample is an independent graph, such as molecular datasets or social network communities.