Graph Machine Learning Tasks in HugeGraph-ML: A Complete Guide to Node, Link, and Graph-Level Learning

HugeGraph-ML supports nine production-ready graph machine learning tasks ranging from node classification and link prediction to graph-level classification and fraud detection, all built on Deep Graph Library (DGL) with consistent APIs for training, evaluation, and device management.

Apache HugeGraph-ML is a modular Python library in the apache/incubator-hugegraph-ai repository that simplifies complex graph analytics by providing high-level task wrappers around DGL models. Whether you need to classify nodes in billion-edge networks or detect fraudulent transactions in heterogeneous graphs, the library abstracts away boilerplate code for device placement, early stopping, and metric logging. All graph machine learning tasks follow a unified three-stage pattern: data conversion via hugegraph_ml.data, model definition in hugegraph_ml.models, and task orchestration through specialized classes in hugegraph_ml.tasks.

Node-Level Graph Machine Learning Tasks

Node-level tasks focus on learning representations or predicting labels for individual vertices. HugeGraph-ML provides four distinct approaches depending on your supervision requirements and graph scale.

Unsupervised Node Embedding

The Node Embedding task learns low-dimensional vectors for every node without requiring labeled data. Implemented in hugegraph_ml/tasks/node_embed.py, the NodeEmbed class trains models like GATNE (General Attributed Multiplex HeTerogeneous Network Embedding) to produce dense node representations suitable for downstream clustering or visualization.

Node Classification (Full-Graph)

For supervised learning on graphs that fit in memory, the NodeClassify task in hugegraph_ml/tasks/node_classify.py predicts categorical labels using node features and training masks. The class handles the full training loop, automatically moving data to GPU via .to(self._device) and monitoring validation metrics with the EarlyStopping utility from hugegraph_ml/utils/early_stopping.py.
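To make the role of masks concrete, here is a minimal pure-Python sketch of computing accuracy over only the nodes selected by a validation mask. The function name masked_accuracy and the toy data are assumptions for illustration; this is not the code inside NodeClassify, which operates on tensors.

```python
from typing import Sequence


def masked_accuracy(
    predictions: Sequence[int],
    labels: Sequence[int],
    mask: Sequence[bool],
) -> float:
    """Return accuracy computed only over nodes where mask is True."""
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    if not selected:
        raise ValueError("mask selects no nodes")
    correct = sum(1 for p, y in selected if p == y)
    return correct / len(selected)


# Toy example: five nodes, three of which belong to the validation split.
predictions = [0, 1, 1, 2, 0]
labels = [0, 1, 0, 2, 1]
val_mask = [False, True, True, True, False]

val_acc = masked_accuracy(predictions, labels, val_mask)
print(f"validation accuracy: {val_acc:.3f}")
```

Nodes outside the mask never contribute to the metric, which is what lets a single in-memory graph serve as training, validation, and test data at once.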

Scalable Node Classification with Sampling

When working with massive graphs, NodeClassifyWithSample in hugegraph_ml/tasks/node_classify_with_sample.py scales training using DGL's ClusterGCNSampler. Each training step operates on a sampled cluster of nodes and the edges among them rather than on the full graph, so a batch can fit in memory even when the whole graph does not.
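The idea behind cluster-based mini-batching can be sketched in plain Python. This is a simplified illustration, not DGL's ClusterGCNSampler: the real sampler partitions the graph with METIS, whereas the hypothetical partition_nodes helper below just splits node IDs into contiguous chunks.

```python
from typing import Iterator, List, Tuple

Edge = Tuple[int, int]


def partition_nodes(num_nodes: int, num_parts: int) -> List[List[int]]:
    """Split node IDs into contiguous chunks (a naive stand-in for METIS)."""
    if num_nodes <= 0 or num_parts <= 0:
        raise ValueError("num_nodes and num_parts must be positive")
    size = -(-num_nodes // num_parts)  # ceiling division
    return [
        list(range(start, min(start + size, num_nodes)))
        for start in range(0, num_nodes, size)
    ]


def cluster_batches(
    edges: List[Edge], clusters: List[List[int]]
) -> Iterator[Tuple[List[int], List[Edge]]]:
    """Yield (nodes, induced_edges) for each cluster.

    Only edges with both endpoints inside the cluster are kept, so each
    batch is a self-contained subgraph.
    """
    for nodes in clusters:
        members = set(nodes)
        induced = [(u, v) for u, v in edges if u in members and v in members]
        yield nodes, induced


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
clusters = partition_nodes(num_nodes=6, num_parts=2)
batches = list(cluster_batches(edges, clusters))
for nodes, induced in batches:
    print(nodes, induced)
```

Edges that cross cluster boundaries, such as (2, 3) here, are dropped from every batch. That is the trade-off of this scheme, and it is why a good partitioner tries to minimize the number of cut edges.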

Edge-Aware Node Classification

The NodeClassifyWithEdge task, found in hugegraph_ml/tasks/node_classify_with_edge.py, extends standard classification by incorporating edge features into the learning process. This is critical for relational domains where the connection type or strength between nodes provides predictive signal beyond node attributes.

Graph-Level Prediction Tasks

Graph Classification

For problems requiring whole-graph predictions (such as molecular property prediction), GraphClassify in hugegraph_ml/tasks/graph_classify.py manages batched graph learning. The task uses DGL's GraphDataLoader to handle multiple graphs per batch, aggregating node representations into graph-level embeddings before feeding them to classification heads.
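Aggregating node representations into a graph-level embedding is often done with a readout such as mean pooling. The sketch below assumes a batch stored as one flat list of node embeddings plus a list of per-graph node counts, which mirrors how batched graphs are commonly laid out; the function names are illustrative and the readout actually used by GraphClassify may differ.

```python
from typing import List


def mean_readout(node_embeddings: List[List[float]]) -> List[float]:
    """Average node embeddings into a single graph embedding."""
    if not node_embeddings:
        raise ValueError("graph has no nodes")
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(row[d] for row in node_embeddings) / n for d in range(dim)]


def batch_readout(
    flat_embeddings: List[List[float]], graph_sizes: List[int]
) -> List[List[float]]:
    """Apply mean_readout to each graph in a flattened batch."""
    out, start = [], 0
    for size in graph_sizes:
        out.append(mean_readout(flat_embeddings[start:start + size]))
        start += size
    return out


# Two graphs in one batch: the first has 2 nodes, the second has 3.
flat = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
graph_embeddings = batch_readout(flat, graph_sizes=[2, 3])
print(graph_embeddings)
```

Each graph ends up with one fixed-size vector regardless of how many nodes it has, which is what allows a standard classification head to consume it.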

Link Prediction Tasks

Link prediction tasks identify missing or future connections in the graph structure. HugeGraph-ML implements two distinct algorithmic approaches.

SEAL-Based Link Prediction

The LinkPredictionSeal class in hugegraph_ml/tasks/link_prediction_seal.py implements the SEAL (Subgraphs, Embeddings, and Attributes for Link prediction) framework. It extracts enclosing subgraphs around target edges, encodes them using a DGCNN (Deep Graph Convolutional Neural Network) defined in hugegraph_ml/models/seal.py, and trains with binary cross-entropy loss. The task automatically computes Hits@K metrics during evaluation.
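For readers unfamiliar with Hits@K, the following sketch follows the definition used by the Open Graph Benchmark link prediction evaluator: a positive edge counts as a hit when its score is strictly higher than the K-th highest negative score. The function name and sample scores are illustrative, and the task's internal implementation may differ in detail.

```python
from typing import Sequence


def hits_at_k(
    pos_scores: Sequence[float], neg_scores: Sequence[float], k: int
) -> float:
    """Fraction of positive edges ranked above the k-th best negative edge."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not pos_scores:
        raise ValueError("need at least one positive score")
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive is trivially in the top k.
        return 1.0
    threshold = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for s in pos_scores if s > threshold)
    return hits / len(pos_scores)


pos_scores = [0.9, 0.6, 0.3, 0.8]
neg_scores = [0.7, 0.5, 0.4, 0.2, 0.1]

# The 2nd highest negative score is 0.5; three positives exceed it.
score = hits_at_k(pos_scores, neg_scores, k=2)
print(score)
```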

Position-Aware Link Prediction with P-GNN

LinkPredictionPGNN in hugegraph_ml/tasks/link_prediction_pgnn.py employs Position-aware Graph Neural Networks (P-GNN), which pre-select anchor nodes and use shortest-path distances to those anchors to encode each node's position in the graph. This approach suits scenarios that require distance-aware relational reasoning.
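The anchor-and-distance idea can be illustrated with a small breadth-first search. This sketch assumes single-node anchors and an unweighted graph, and uses the 1 / (d + 1) distance transform from the P-GNN paper; the published method works with anchor sets and learned message passing, so treat this only as the feature-construction intuition.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Unweighted shortest-path distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def anchor_features(
    adj: Dict[int, List[int]], anchors: List[int]
) -> Dict[int, List[float]]:
    """Map each node to [1 / (d + 1) for each anchor]; unreachable gives 0.0."""
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    return {
        node: [1.0 / (d[node] + 1) if node in d else 0.0 for d in per_anchor]
        for node in adj
    }


# Path graph 0-1-2-3 plus an isolated node 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
features = anchor_features(adj, anchors=[0, 3])
for node, feats in features.items():
    print(node, feats)
```

Two nodes with identical local neighborhoods but different distances to the anchors receive different feature vectors, which is the positional signal a purely structure-based GNN cannot capture.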

Specialized Heterogeneous and Fraud Detection Tasks

Heterogeneous Graph Embedding with GATNE

Multiplex heterogeneous networks require specialized handling. The HeteroSampleEmbedGATNE task in hugegraph_ml/tasks/hetero_sample_embed_gatne.py implements the GATNE algorithm, learning type-specific embeddings for different edge types and aggregating them with attention mechanisms. This handles graphs with multiple relation types (e.g., social networks with "friend," "colleague," and "family" edges).
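As a rough intuition for attention-based aggregation across edge types, the sketch below combines per-edge-type embeddings of one node using softmax weights. In GATNE the attention scores come from learned parameters; here they are passed in directly as an assumption, and all names are hypothetical.

```python
import math
from typing import Dict, List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(
    type_embeddings: Dict[str, List[float]], scores: Dict[str, float]
) -> List[float]:
    """Weighted sum of per-edge-type embeddings using softmax(scores)."""
    names = list(type_embeddings)
    weights = softmax([scores[name] for name in names])
    dim = len(type_embeddings[names[0]])
    return [
        sum(w * type_embeddings[name][d] for w, name in zip(weights, names))
        for d in range(dim)
    ]


# One node's embedding under each relation type.
type_embeddings = {
    "friend": [1.0, 0.0],
    "colleague": [0.0, 1.0],
    "family": [1.0, 1.0],
}
# Equal scores give uniform attention, i.e. a plain average.
uniform = attention_aggregate(
    type_embeddings, {"friend": 0.0, "colleague": 0.0, "family": 0.0}
)
print(uniform)
```

Raising the score for one edge type shifts the aggregated vector toward that type's embedding, which is how the model can emphasize the relations that matter most for a given node.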

Fraud Detection using CARE-GNN

For financial security applications, DetectorCaregnn in hugegraph_ml/tasks/fraud_detector_caregnn.py provides binary classification optimized for transaction graphs using the CARE-GNN architecture. This specialized task includes sampling strategies designed to handle highly imbalanced fraud detection datasets.
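One common way to cope with heavy class imbalance is to undersample the majority class when building each training set. The sketch below shows that general technique under stated assumptions (binary labels where 1 means fraud, a hypothetical undersample helper); it is not a description of the sampling code inside DetectorCaregnn.

```python
import random
from typing import List, Sequence


def undersample(
    labels: Sequence[int], ratio: float = 1.0, seed: int = 0
) -> List[int]:
    """Return node indices with all positives and a sampled subset of negatives.

    ratio is the number of negatives to keep per positive.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    n_neg = min(len(negatives), int(len(positives) * ratio))
    batch = positives + rng.sample(negatives, n_neg)
    rng.shuffle(batch)
    return batch


# 5% fraud rate: 95 benign nodes, 5 fraudulent ones.
labels = [0] * 95 + [1] * 5
train_idx = undersample(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))
```

Resampling with a different seed each epoch lets the model eventually see most of the benign nodes while keeping every individual batch balanced.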

Common Architecture and Task Implementation Pattern

All graph machine learning tasks in HugeGraph-ML share consistent architectural components that accelerate development:

  • Unified Model Interface: Every model exposes forward, inference, and loss methods, enabling plug-and-play substitution of architectures like MLPClassifier, DGCNN, or DGLGATNE.
  • Automatic Device Management: Each task class initializes self._device based on a user-provided gpu index, automatically transferring both the DGL graph and model parameters to CPU or CUDA devices.
  • Early Stopping Integration: The EarlyStopping utility monitors specified metrics (loss or accuracy) and restores the best model checkpoint automatically, preventing overfitting without manual intervention.
  • Data Loading Abstraction: Tasks abstract DGL's GraphDataLoader, ClusterGCNSampler, and NeighborSampler, providing mini-batching for both homogeneous and heterogeneous graphs without requiring users to implement collate functions.
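The early stopping behavior described in the list above can be sketched in a few lines. This is a minimal illustration with an assumed class name and interface (SimpleEarlyStopping, step), not the EarlyStopping utility shipped in hugegraph_ml/utils/early_stopping.py, whose signature may differ.

```python
import copy
from typing import Any, Optional


class SimpleEarlyStopping:
    """Track the best metric and signal a stop after `patience` bad epochs."""

    def __init__(self, patience: int = 3, higher_is_better: bool = False):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best_score: Optional[float] = None
        self.best_state: Any = None
        self.counter = 0

    def step(self, score: float, state: Any) -> bool:
        """Record one epoch; return True when training should stop."""
        if self.best_score is None:
            improved = True
        elif self.higher_is_better:
            improved = score > self.best_score
        else:
            improved = score < self.best_score

        if improved:
            self.best_score = score
            # Snapshot the state so it can be restored after stopping.
            self.best_state = copy.deepcopy(state)
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stopper = SimpleEarlyStopping(patience=3)
stopped_epoch = None
for epoch, loss in enumerate(val_losses):
    state = {"epoch": epoch}  # stands in for a model checkpoint
    if stopper.step(loss, state):
        stopped_epoch = epoch
        break

print(stopped_epoch, stopper.best_score, stopper.best_state)
```

Note that training halts before the final 0.5 loss is ever seen. A larger patience would have reached it, which is the usual trade-off between compute budget and the risk of stopping in a temporary plateau.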

Practical Code Examples

Node Classification with MLP

import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.mlp import MLPClassifier
from hugegraph_ml.tasks.node_classify import NodeClassify

# Convert HugeGraph data to DGL format
hg2d = HugeGraph2DGL()
graph = hg2d.convert_graph(
    vertex_label="my_vertex",
    edge_label="my_edge",
    vertex_feature="feat",
)

# Initialize model with input/output dimensions matching your data
model = MLPClassifier(
    n_in_feat=graph.ndata["feat"].shape[1],
    n_out_feat=5,
)

# Execute training with automatic GPU placement and early stopping
task = NodeClassify(graph, model)
task.train(lr=1e-3, n_epochs=100, gpu=0)
metrics = task.evaluate()
print(metrics)  # illustrative output, values depend on your data: {'accuracy': 0.82, 'loss': 0.34}

Unsupervised Node Embedding with GATNE

import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.gatne import DGLGATNE
from hugegraph_ml.tasks.hetero_sample_embed_gatne import HeteroSampleEmbedGATNE

hg2d = HugeGraph2DGL()
graph = hg2d.convert_graph(vertex_label="person", edge_label="relationship")

# Configure GATNE with embedding dimensions for heterogeneous types
gatne = DGLGATNE(
    num_nodes=graph.num_nodes(),
    embedding_size=128,
    embedding_u_size=64,
    edge_types=graph.etypes,
    edge_type_count=len(graph.etypes),
    dim_a=16,
)

task = HeteroSampleEmbedGATNE(graph, gatne)
task.train_and_embed(lr=5e-4, n_epochs=50, gpu=0)

# Extract learned embeddings
embeddings = graph.ndata["feat"]  # shape: (num_nodes, 128)

Link Prediction with SEAL

import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.seal import DGCNN, data_prepare
from hugegraph_ml.tasks.link_prediction_seal import LinkPredictionSeal

hg2d = HugeGraph2DGL()
graph, split_edge = hg2d.convert_graph_ogb(
    vertex_label="ogbl-collab_vertex",
    edge_label="ogbl-collab_edge",
    split_label="ogbl-collab_split_edge",
)

node_attr, edge_weight = data_prepare(graph, split_edge)

model = DGCNN(
    num_layers=3,
    hidden_units=32,
    k=30,
    gcn_type="gcn",
    node_attributes=node_attr,
    edge_weights=edge_weight,
    use_embedding=True,
    num_nodes=graph.num_nodes(),
    dropout=0.5,
)

task = LinkPredictionSeal(graph, split_edge, model)
task.train(lr=5e-3, n_epochs=200, gpu=0)

# Hits@50 and other metrics computed automatically during training

Graph-Level Classification

from hugegraph_ml.data.hugegraph_dataset import HugeGraphDataset
from hugegraph_ml.models.gnn import GCN
from hugegraph_ml.tasks.graph_classify import GraphClassify

dataset = HugeGraphDataset(root="data/ogbg_molhiv")
model = GCN(
    in_dim=dataset.num_node_features, 
    hidden_dim=128, 
    out_dim=dataset.num_classes
)

task = GraphClassify(dataset, model)
task.train(batch_size=32, n_epochs=150, gpu=0)

test_metrics = task.evaluate()
print(test_metrics)  # illustrative output, values depend on your data: {'accuracy': 0.78, 'loss': 0.31}

Summary

  • Nine Core Tasks: HugeGraph-ML implements NodeEmbed, NodeClassify, NodeClassifyWithSample, NodeClassifyWithEdge, GraphClassify, LinkPredictionSeal, LinkPredictionPGNN, HeteroSampleEmbedGATNE, and DetectorCaregnn to cover node, link, and graph-level graph machine learning.
  • Consistent API: All tasks in hugegraph_ml/tasks/ expose uniform training methods and handle device placement, early stopping, and logging internally.
  • DGL Foundation: Built on Deep Graph Library, the library supports both full-graph and sampled training modes for scalability.
  • Heterogeneous Support: Specialized tasks like GATNE embedding and CARE-GNN fraud detection handle complex multi-relational graphs and imbalanced classification scenarios.

Frequently Asked Questions

What is the difference between NodeClassify and NodeClassifyWithSample?

NodeClassify performs full-graph training, loading the entire graph into GPU memory for each epoch, which is efficient for smaller datasets. NodeClassifyWithSample, implemented in hugegraph_ml/tasks/node_classify_with_sample.py, uses DGL's ClusterGCNSampler to train on sampled node clusters, enabling node classification on billion-edge graphs that exceed single-GPU memory constraints.

How does HugeGraph-ML handle GPU acceleration?

Each task class automatically configures self._device based on the gpu parameter passed to train() methods. The implementation calls .to(self._device) on both the model and the DGL graph object, ensuring tensors reside on the specified CUDA device without manual tensor relocation code.

Can I use custom models with the task classes?

Yes, provided your model implements the three required methods: forward() for computing node/graph representations, inference() for prediction without gradient computation, and loss() for calculating the optimization objective. The MLPClassifier in hugegraph_ml/models/mlp.py demonstrates this interface for reference.
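The three-method contract can be expressed as a typing.Protocol. The sketch below is a framework-free illustration: the argument and return types are assumptions chosen so the example runs without PyTorch, and a real model would subclass torch.nn.Module and work with tensors. IdentityClassifier is a hypothetical toy model, not part of the library.

```python
import math
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TaskModel(Protocol):
    """Assumed shape of the interface: forward, inference, and loss."""

    def forward(self, features: Sequence[Sequence[float]]) -> List[List[float]]: ...

    def inference(self, features: Sequence[Sequence[float]]) -> List[int]: ...

    def loss(self, logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float: ...


class IdentityClassifier:
    """Toy model that treats each feature vector directly as class logits."""

    def forward(self, features):
        return [list(row) for row in features]

    def inference(self, features):
        # Predicted class is the index of the largest logit.
        return [
            max(range(len(row)), key=row.__getitem__)
            for row in self.forward(features)
        ]

    def loss(self, logits, labels):
        # Mean cross-entropy, computed with a stable log-sum-exp.
        total = 0.0
        for row, y in zip(logits, labels):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[y]
        return total / len(labels)


model = IdentityClassifier()
features = [[2.0, 0.0], [0.0, 3.0]]
predictions = model.inference(features)
train_loss = model.loss(model.forward(features), labels=[0, 1])
print(isinstance(model, TaskModel), predictions, round(train_loss, 4))
```

Because the protocol is structural, any class with these three methods satisfies it without inheriting from a shared base class, which is what makes swapping architectures straightforward.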

What data formats does HugeGraph-ML support?

The library primarily converts HugeGraph database instances to DGL graphs via hugegraph_ml/data/hugegraph2dgl.py, supporting both property graphs and OGB (Open Graph Benchmark) formatted datasets with train/validation/test splits. For graph-level tasks, hugegraph_ml/data/hugegraph_dataset.py provides PyTorch Dataset-compatible wrappers.
