Graph Machine Learning Tasks in HugeGraph-ML: A Complete Guide to Node, Link, and Graph-Level Learning
HugeGraph-ML supports nine production-ready graph machine learning tasks ranging from node classification and link prediction to graph-level classification and fraud detection, all built on Deep Graph Library (DGL) with consistent APIs for training, evaluation, and device management.
Apache HugeGraph-ML is a modular Python library in the apache/incubator-hugegraph-ai repository that simplifies complex graph analytics by providing high-level task wrappers around DGL models. Whether you need to classify nodes in billion-edge networks or detect fraudulent transactions in heterogeneous graphs, the library abstracts away boilerplate code for device placement, early stopping, and metric logging. All graph machine learning tasks follow a unified three-stage pattern: data conversion via hugegraph_ml.data, model definition in hugegraph_ml.models, and task orchestration through specialized classes in hugegraph_ml.tasks.
Node-Level Graph Machine Learning Tasks
Node-level tasks focus on learning representations or predicting labels for individual vertices. HugeGraph-ML provides four distinct approaches depending on your supervision requirements and graph scale.
Unsupervised Node Embedding
The Node Embedding task learns low-dimensional vectors for every node without requiring labeled data. Implemented in hugegraph_ml/tasks/node_embed.py, the NodeEmbed class trains models like GATNE (General Attributed Multiplex HeTerogeneous Network Embedding) to produce dense node representations suitable for downstream clustering or visualization.
Node Classification (Full-Graph)
For supervised learning on graphs that fit in memory, the NodeClassify task in hugegraph_ml/tasks/node_classify.py predicts categorical labels using node features and training masks. The class handles the full training loop, automatically moving data to GPU via .to(self._device) and monitoring validation metrics with the EarlyStopping utility from hugegraph_ml/utils/early_stopping.py.
Scalable Node Classification with Sampling
When a graph is too large for full-graph training, NodeClassifyWithSample in hugegraph_ml/tasks/node_classify_with_sample.py scales training using DGL's ClusterGCNSampler. The sampler partitions the graph into clusters and trains on the subgraph induced by each batch of clusters instead of the whole graph, so only one batch has to sit in device memory at a time. This is what makes training on billion-edge datasets feasible.
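The cluster-based batching idea can be sketched in plain Python. This is only an illustration under stated assumptions: partition_nodes is a hypothetical stand-in that cuts node IDs into contiguous chunks, whereas a real cluster sampler would use a proper graph partitioner, and none of these helper names come from HugeGraph-ML or DGL.

```python
import random
from typing import Iterator, List, Tuple

Edge = Tuple[int, int]


def partition_nodes(num_nodes: int, num_parts: int) -> List[List[int]]:
    """Hypothetical partitioner: contiguous chunks of node IDs (at most num_parts)."""
    size = -(-num_nodes // num_parts)  # ceiling division
    return [list(range(s, min(s + size, num_nodes))) for s in range(0, num_nodes, size)]


def induced_edges(edges: List[Edge], nodes: List[int]) -> List[Edge]:
    """Keep only the edges whose two endpoints are both inside the batch."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]


def cluster_batches(
    edges: List[Edge], num_nodes: int, num_parts: int, parts_per_batch: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[Edge]]]:
    """Yield (nodes, induced edges) for each batch of randomly grouped clusters."""
    parts = partition_nodes(num_nodes, num_parts)
    random.Random(seed).shuffle(parts)
    for i in range(0, len(parts), parts_per_batch):
        nodes = sorted(n for part in parts[i : i + parts_per_batch] for n in part)
        yield nodes, induced_edges(edges, nodes)


# An 8-node ring split into 4 clusters, trained 2 clusters at a time.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0)]
batches = list(cluster_batches(ring, num_nodes=8, num_parts=4, parts_per_batch=2))
for nodes, sub_edges in batches:
    print(nodes, sub_edges)
```

Each batch only ever sees the edges inside its own clusters, which is the memory saving; edges that cross batch boundaries are dropped for that step.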
Edge-Aware Node Classification
The NodeClassifyWithEdge task, found in hugegraph_ml/tasks/node_classify_with_edge.py, extends standard classification by incorporating edge features into the learning process. This is critical for relational domains where the connection type or strength between nodes provides predictive signal beyond node attributes.
Graph-Level Prediction Tasks
Graph Classification
For problems requiring whole-graph predictions (such as molecular property prediction), GraphClassify in hugegraph_ml/tasks/graph_classify.py manages batched graph learning. The task uses DGL's GraphDataLoader to handle multiple graphs per batch, aggregating node representations into graph-level embeddings before feeding them to classification heads.
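To make the aggregation step concrete, here is a minimal sketch of a mean readout over a batch of graphs. It assumes the batch stores node embeddings of all graphs back to back, together with the node count of each graph; mean pooling is just one possible readout, and mean_readout is a hypothetical helper rather than anything GraphClassify is known to call.

```python
from typing import List, Sequence


def mean_readout(
    node_embeddings: Sequence[Sequence[float]], graph_sizes: Sequence[int]
) -> List[List[float]]:
    """Average the node embeddings of each graph in a concatenated batch."""
    if sum(graph_sizes) != len(node_embeddings):
        raise ValueError("graph_sizes must add up to the number of node embeddings")
    graph_embeddings = []
    start = 0
    for size in graph_sizes:
        if size <= 0:
            raise ValueError("every graph needs at least one node")
        rows = node_embeddings[start : start + size]
        dim = len(rows[0])
        graph_embeddings.append([sum(r[d] for r in rows) / size for d in range(dim)])
        start += size
    return graph_embeddings


# Two graphs in one batch: the first has 2 nodes, the second has 3.
batch = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
graph_embeddings = mean_readout(batch, graph_sizes=[2, 3])
print(graph_embeddings)  # [[2.0, 3.0], [20.0, 0.0]]
```

The output has one row per graph regardless of how many nodes each graph contains, which is what lets a fixed-size classification head sit on top.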
Link Prediction and Edge Analysis
Link prediction tasks identify missing or future connections in the graph structure. HugeGraph-ML implements two distinct algorithmic approaches.
SEAL-Based Link Prediction
The LinkPredictionSeal class in hugegraph_ml/tasks/link_prediction_seal.py implements the SEAL (Subgraphs, Embeddings, and Attributes for Link prediction) framework. It extracts enclosing subgraphs around target edges, encodes them using a DGCNN (Deep Graph Convolutional Neural Network) defined in hugegraph_ml/models/seal.py, and trains with binary cross-entropy loss. The task automatically computes Hits@K metrics during evaluation.
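For readers unfamiliar with Hits@K, the sketch below follows the definition used by the OGB link prediction benchmarks: a positive edge counts as a hit if its score is strictly higher than the K-th highest negative score. Whether LinkPredictionSeal computes the metric exactly this way is an assumption, and hits_at_k is a hypothetical helper written for illustration.

```python
from typing import Sequence


def hits_at_k(pos_scores: Sequence[float], neg_scores: Sequence[float], k: int) -> float:
    """Fraction of positive edges scored above the k-th best negative edge."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not pos_scores:
        raise ValueError("need at least one positive score")
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive trivially ranks in the top k.
        return 1.0
    kth_best_negative = sorted(neg_scores, reverse=True)[k - 1]
    return sum(score > kth_best_negative for score in pos_scores) / len(pos_scores)


pos = [0.9, 0.8, 0.4, 0.3]    # scores the model gave to true edges
neg = [0.85, 0.5, 0.2, 0.1]   # scores the model gave to sampled non-edges
print(hits_at_k(pos, neg, k=1))  # 0.25
print(hits_at_k(pos, neg, k=2))  # 0.5
print(hits_at_k(pos, neg, k=3))  # 1.0
```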
PGNN Link Prediction
LinkPredictionPGNN in hugegraph_ml/tasks/link_prediction_pgnn.py employs Position-aware Graph Neural Networks (P-GNN), pre-selecting anchor nodes and using shortest-path distances to those anchors to encode each node's position in the graph before scoring links. This approach suits scenarios requiring distance-aware relational reasoning.
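The role of anchors and shortest-path distances can be sketched without any GNN machinery. The example below assumes single-node anchors and the 1/(d + 1) distance transform described in the P-GNN paper; the published method uses randomly sampled anchor sets and learned message passing on top, and both function names here are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence


def bfs_distances(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Unweighted shortest-path distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def anchor_position_features(
    adj: Dict[int, List[int]], anchors: Sequence[int]
) -> Dict[int, List[float]]:
    """One feature per anchor: 1 / (distance + 1), or 0.0 if the anchor is unreachable."""
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    return {
        node: [1.0 / (d[node] + 1) if node in d else 0.0 for d in per_anchor]
        for node in adj
    }


# Path graph 0-1-2-3 plus an isolated node 4, with anchors at both ends of the path.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
features = anchor_position_features(adj, anchors=[0, 3])
for node, feat in features.items():
    print(node, feat)
```

Nodes 0 and 3 have identical local structure in this graph, yet their feature vectors differ, which is the positional signal that purely neighborhood-based models lack.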
Specialized Heterogeneous and Fraud Detection Tasks
Heterogeneous Graph Embedding with GATNE
Multiplex heterogeneous networks require specialized handling. The HeteroSampleEmbedGATNE task in hugegraph_ml/tasks/hetero_sample_embed_gatne.py implements the GATNE algorithm, learning type-specific embeddings for different edge types and aggregating them with attention mechanisms. This handles graphs with multiple relation types (e.g., social networks with "friend," "colleague," and "family" edges).
Fraud Detection using CARE-GNN
For financial security applications, DetectorCaregnn in hugegraph_ml/tasks/fraud_detector_caregnn.py provides binary classification optimized for transaction graphs using the CARE-GNN architecture. This specialized task includes sampling strategies designed to handle highly imbalanced fraud detection datasets.
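As a rough illustration of why sampling matters for imbalanced labels, the sketch below applies generic label-balanced undersampling: keep every minority-class node and draw an equal number of majority-class nodes. This is a textbook technique shown under the assumption of binary 0/1 labels; it is not taken from DetectorCaregnn, and balanced_undersample is a hypothetical name.

```python
import random
from typing import List, Sequence


def balanced_undersample(labels: Sequence[int], seed: int = 0) -> List[int]:
    """Return node indices with equal numbers of class 0 and class 1."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    # Draw without replacement so no majority node is repeated.
    sampled_majority = rng.sample(majority, len(minority))
    return sorted(minority + sampled_majority)


# 5% fraud rate: 95 legitimate nodes (0) and 5 fraudulent nodes (1).
labels = [0] * 95 + [1] * 5
train_idx = balanced_undersample(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))  # 10 5
```

Training on such a subset stops the loss from being dominated by the majority class, at the cost of discarding most legitimate examples in each round.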
Common Architecture and Task Implementation Pattern
All graph machine learning tasks in HugeGraph-ML share consistent architectural components that accelerate development:
- Unified Model Interface: Every model exposes forward, inference, and loss methods, enabling plug-and-play substitution of architectures like MLPClassifier, DGCNN, or DGLGATNE.
- Automatic Device Management: Each task class initializes self._device based on a user-provided gpu index, automatically transferring both the DGL graph and model parameters to CPU or CUDA devices.
- Early Stopping Integration: The EarlyStopping utility monitors specified metrics (loss or accuracy) and restores the best model checkpoint automatically, preventing overfitting without manual intervention.
- Data Loading Abstraction: Tasks abstract DGL's GraphDataLoader, ClusterGCNSampler, and NeighborSampler, providing mini-batching for both homogeneous and heterogeneous graphs without requiring users to implement collate functions.
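The early-stopping behavior described in the list can be approximated in a few lines. The class below is a hypothetical stand-in written for this article, not the library's EarlyStopping utility: the name EarlyStopper, its patience and mode parameters, and the dictionary used as a model state are all assumptions.

```python
import copy
from typing import Any, Optional


class EarlyStopper:
    """Track the best validation score and signal when to stop training."""

    def __init__(self, patience: int = 3, mode: str = "min") -> None:
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' (loss) or 'max' (accuracy)")
        self.patience = patience
        self.mode = mode
        self.best_score: Optional[float] = None
        self.best_state: Any = None
        self.bad_epochs = 0

    def step(self, score: float, state: Any) -> bool:
        """Record one epoch; return True once patience is exhausted."""
        if self.best_score is None:
            improved = True
        elif self.mode == "min":
            improved = score < self.best_score
        else:
            improved = score > self.best_score
        if improved:
            self.best_score = score
            self.best_state = copy.deepcopy(state)  # snapshot to restore later
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation losses; a real loop would pass the model's parameters as state.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.7, 0.5]
stopper = EarlyStopper(patience=3, mode="min")
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, state={"epoch": epoch}):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_score, stopper.best_state)  # 5 0.6 {'epoch': 2}
```

Training halts after three epochs without improvement, and the snapshot from epoch 2 is what would be restored; the later 0.5 loss is never reached, which is the usual trade-off of a small patience value.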
Practical Code Examples
Node Classification with MLP
import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.mlp import MLPClassifier
from hugegraph_ml.tasks.node_classify import NodeClassify

# Convert HugeGraph data to DGL format
hg2d = HugeGraph2DGL()
graph = hg2d.convert_graph(
    vertex_label="my_vertex",
    edge_label="my_edge",
    vertex_feature="feat",
)

# Initialize model with input/output dimensions matching your data
model = MLPClassifier(
    n_in_feat=graph.ndata["feat"].shape[1],
    n_out_feat=5,  # number of target classes in your dataset
)

# Execute training with automatic GPU placement and early stopping
task = NodeClassify(graph, model)
task.train(lr=1e-3, n_epochs=100, gpu=0)
metrics = task.evaluate()
print(metrics)  # illustrative output only, e.g. {'accuracy': 0.82, 'loss': 0.34}
Unsupervised Node Embedding with GATNE
import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.gatne import DGLGATNE
from hugegraph_ml.tasks.hetero_sample_embed_gatne import HeteroSampleEmbedGATNE

hg2d = HugeGraph2DGL()
graph = hg2d.convert_graph(vertex_label="person", edge_label="relationship")

# Configure GATNE with embedding dimensions for heterogeneous types
gatne = DGLGATNE(
    num_nodes=graph.num_nodes(),
    embedding_size=128,
    embedding_u_size=64,
    edge_types=graph.etypes,
    edge_type_count=len(graph.etypes),
    dim_a=16,
)

task = HeteroSampleEmbedGATNE(graph, gatne)
task.train_and_embed(lr=5e-4, n_epochs=50, gpu=0)

# Extract learned embeddings
embeddings = graph.ndata["feat"]  # shape: (num_nodes, 128)
Link Prediction with SEAL
import torch
from hugegraph_ml.data.hugegraph2dgl import HugeGraph2DGL
from hugegraph_ml.models.seal import DGCNN, data_prepare
from hugegraph_ml.tasks.link_prediction_seal import LinkPredictionSeal

hg2d = HugeGraph2DGL()
graph, split_edge = hg2d.convert_graph_ogb(
    vertex_label="ogbl-collab_vertex",
    edge_label="ogbl-collab_edge",
    split_label="ogbl-collab_split_edge",
)

node_attr, edge_weight = data_prepare(graph, split_edge)

model = DGCNN(
    num_layers=3,
    hidden_units=32,
    k=30,
    gcn_type="gcn",
    node_attributes=node_attr,
    edge_weights=edge_weight,
    use_embedding=True,
    num_nodes=graph.num_nodes(),
    dropout=0.5,
)

task = LinkPredictionSeal(graph, split_edge, model)
task.train(lr=5e-3, n_epochs=200, gpu=0)
# Hits@50 and other metrics computed automatically during training
Graph-Level Classification
from hugegraph_ml.data.hugegraph_dataset import HugeGraphDataset
from hugegraph_ml.models.gnn import GCN
from hugegraph_ml.tasks.graph_classify import GraphClassify

dataset = HugeGraphDataset(root="data/ogbg_molhiv")

model = GCN(
    in_dim=dataset.num_node_features,
    hidden_dim=128,
    out_dim=dataset.num_classes,
)

task = GraphClassify(dataset, model)
task.train(batch_size=32, n_epochs=150, gpu=0)
test_metrics = task.evaluate()
print(test_metrics)  # illustrative output only, e.g. {'accuracy': 0.78, 'loss': 0.31}
Summary
- Nine Core Tasks: HugeGraph-ML implements NodeEmbed, NodeClassify, NodeClassifyWithSample, NodeClassifyWithEdge, GraphClassify, LinkPredictionSeal, LinkPredictionPGNN, HeteroSampleEmbedGATNE, and DetectorCaregnn to cover node, link, and graph-level graph machine learning.
- Consistent API: All tasks in hugegraph_ml/tasks/ expose uniform training methods and handle device placement, early stopping, and logging internally.
- DGL Foundation: Built on Deep Graph Library, the library supports both full-graph and sampled training modes for scalability.
- Heterogeneous Support: Specialized tasks like GATNE embedding and CARE-GNN fraud detection handle complex multi-relational graphs and imbalanced classification scenarios.
Frequently Asked Questions
What is the difference between NodeClassify and NodeClassifyWithSample?
NodeClassify performs full-graph training, loading the entire graph into GPU memory for each epoch, which is efficient for smaller datasets. NodeClassifyWithSample, implemented in hugegraph_ml/tasks/node_classify_with_sample.py, uses DGL's ClusterGCNSampler to train on sampled node clusters, enabling node classification on billion-edge graphs that exceed single-GPU memory constraints.
How does HugeGraph-ML handle GPU acceleration?
Each task class automatically configures self._device based on the gpu parameter passed to train() methods. The implementation calls .to(self._device) on both the model and the DGL graph object, ensuring tensors reside on the specified CUDA device without manual tensor relocation code.
Can I use custom models with the task classes?
Yes, provided your model implements the three required methods: forward() for computing node/graph representations, inference() for prediction without gradient computation, and loss() for calculating the optimization objective. The MLPClassifier in hugegraph_ml/models/mlp.py demonstrates this interface for reference.
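The three-method contract can be written down as a structural type and checked at runtime. The sketch below is an assumption-heavy illustration: TaskModel, ConstantClassifier, and the argument lists are hypothetical, the toy model does no learning, and a real model would normally be a PyTorch module whose signatures match what the chosen task class passes in.

```python
from typing import Any, List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TaskModel(Protocol):
    """Hypothetical description of the interface: forward, inference, loss."""

    def forward(self, *args: Any, **kwargs: Any) -> Any: ...
    def inference(self, *args: Any, **kwargs: Any) -> Any: ...
    def loss(self, *args: Any, **kwargs: Any) -> Any: ...


class ConstantClassifier:
    """Toy model that always favors class 0; it only demonstrates the method names."""

    def __init__(self, n_classes: int) -> None:
        self.n_classes = n_classes

    def forward(self, graph: Any, feats: Sequence[Any]) -> List[List[float]]:
        # One row of scores per node, with all weight on class 0.
        return [[1.0] + [0.0] * (self.n_classes - 1) for _ in feats]

    def inference(self, graph: Any, feats: Sequence[Any]) -> List[int]:
        # Predicted class = index of the highest score.
        return [row.index(max(row)) for row in self.forward(graph, feats)]

    def loss(self, logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
        # Error rate used as a stand-in for a differentiable objective.
        preds = [list(row).index(max(row)) for row in logits]
        return sum(p != y for p, y in zip(preds, labels)) / len(labels)


class Incomplete:
    """Missing inference() and loss(), so it does not satisfy the contract."""

    def forward(self, graph: Any, feats: Sequence[Any]) -> Sequence[Any]:
        return feats


model = ConstantClassifier(n_classes=3)
print(isinstance(model, TaskModel))         # True
print(isinstance(Incomplete(), TaskModel))  # False
```

A runtime_checkable protocol only verifies that the methods exist, not that their arguments or return types are compatible, so treat the check as a quick sanity test before handing a model to a task.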
What data formats does HugeGraph-ML support?
The library primarily converts HugeGraph database instances to DGL graphs via hugegraph_ml/data/hugegraph2dgl.py, supporting both property graphs and OGB (Open Graph Benchmark) formatted datasets with train/validation/test splits. For graph-level tasks, hugegraph_ml/data/hugegraph_dataset.py provides PyTorch Dataset-compatible wrappers.