Using TuGraph Graph Learning Module for Node Classification

1. Introduction

Graph neural networks (GNNs) are a powerful tool for many machine learning tasks on graphs. In this introductory tutorial, you will learn the basic workflow of using a GNN for node classification, i.e., predicting the category of a node in a graph.

This document shows how to build a GNN for semi-supervised node classification on the Cora dataset, a citation network with papers as nodes and citations as edges in which only a few nodes are labeled. The task is to predict the category of a given paper. By completing this tutorial, you will be able to:

Load the Cora dataset.

Sample the graph and build a GNN model using the sampling operator provided by TuGraph.

Train the GNN model for node classification on CPU or GPU.

This document assumes some experience with graph neural networks and DGL.

2. Prerequisites

The TuGraph graph learning module requires TuGraph-db version 3.5.1 or above.

It is recommended to use Docker image tugraph-compile 1.2.4 or above for TuGraph deployment:

tugraph/tugraph-compile-ubuntu18.04:latest

tugraph/tugraph-compile-centos7:latest

tugraph/tugraph-compile-centos8:latest

The above images can be obtained on DockerHub. Please refer to Quick Start for specific operations.

3. Import Cora Dataset to TuGraph Database

3.1. Introduction to Cora Dataset

The Cora dataset consists of 2708 papers, classified into 7 categories. Each paper is represented by a 1433-dimensional bag-of-words feature, indicating whether a certain word appears in the paper. These bag-of-words features have been preprocessed to be normalized to a range from 0 to 1. The edges represent citation relationships between papers.

TuGraph provides an import tool for the Cora dataset, which users can use directly.

3.2. Data Import

The Cora dataset is located in the test/integration/data/algo directory and contains the vertex set cora_vertices and the edge set cora_edge.

First, the Cora dataset needs to be imported into the TuGraph database. Please refer to Data Import.

In the build/output directory, copy the dataset and the example scripts used in this document into place:

cp -r ../../test/integration/data/ ./ && cp -r ../../learn/examples/* ./

Execute the following command in the build/output directory:

./lgraph_import -c ./data/algo/cora.conf --dir ./coradb --overwrite 1

Here, cora.conf is the graph schema file describing the format of the graph data; it can be found at test/integration/data/algo/cora.conf. coradb is the directory where the imported graph data is stored.

4. Feature Conversion

The features in the Cora dataset are float arrays of length 1433, which TuGraph cannot load directly. They are therefore imported as strings and converted to char* for easier access later (see the sketch after the steps below for the idea behind the conversion). The implementation can be found in the feature_float.cpp file. The specific execution process is as follows:

Compile the import plugin in the build directory (can be skipped if TuGraph has already been compiled): make feature_float_embed

Execute the following command in the build/output directory to perform the conversion: ./algo/feature_float_embed ./coradb
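
For intuition, the conversion amounts to parsing each vertex's feature string back into a fixed-length float array. The plugin itself is implemented in C++ (feature_float.cpp); the following is a minimal Python sketch of the same idea, assuming the 1433 values are stored as a comma-separated string (the delimiter is an illustrative assumption, not taken from the dataset files):

import numpy as np

def parse_feature_string(feature_str, feature_len=1433):
    # Split the stored string and convert each token to a float32 value.
    values = np.array(feature_str.split(','), dtype=np.float32)
    assert len(values) == feature_len, "unexpected feature length"
    return values

# Example: a 1433-dimensional bag-of-words feature stored as a string.
vec = parse_feature_string(','.join(['0'] * 1432 + ['1']))
print(vec.shape)  # (1433,)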

5. Compile Sampling Operator

The sampling operator is used to obtain graph data from the database and convert it into the required data structure. This step can be skipped if TuGraph has already been compiled. Execute the following command in the tugraph-db/build directory: make -j2

or execute the following command in the tugraph-db/learn/procedures directory: python3 setup.py build_ext -i

import importlib
from lgraph_db_python import *

# Import the compiled sampling operator and run it: the sampled
# graph data is written into NodeInfo and EdgeInfo.
getdb = importlib.import_module("getdb")
getdb.Process(db, olapondb, feature_len, NodeInfo, EdgeInfo)

As shown in the code, once the compiled .so file of the operator is available, it can be imported and used directly.
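
Note that importlib.import_module("getdb") can only locate the operator if the directory containing the compiled .so file is on the Python path. A minimal sketch, assuming the operator was built in learn/procedures and the script runs from build/output (adjust the path to your layout):

import sys
# Hypothetical location of the compiled getdb .so; make it importable.
sys.path.append("../../learn/procedures")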

6. Model Training and Saving

TuGraph calls the Cython-layer operators from the Python layer to train the graph learning model. To use the TuGraph graph learning module, execute the following command in the build/output directory to start training:

python3 train_full_cora.py --model_save_path ./cora_model

If the final printed loss value is less than 0.9, the training is successful. At this point the graph model training is complete, and the model is saved in the cora_model file.

The detailed training process is as follows:

6.1. Data Loading

galaxy = PyGalaxy(args.db_path)                      # open the database at the given path
galaxy.SetCurrentUser(args.username, args.password)  # authenticate with username and password
db = galaxy.OpenGraph(args.graph_name, False)        # open the subgraph to train on

As shown in the code, the graph data is loaded into memory based on the path of the graph data, the username, the password, and the subgraph name. TuGraph can load multiple subgraphs for graph training, but only one subgraph is loaded here. The args object carries the connection parameters, as sketched below.
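
A minimal sketch of how such arguments might be declared with argparse (the default values are illustrative assumptions, not necessarily those of train_full_cora.py):

import argparse

parser = argparse.ArgumentParser(description='Train a GCN on Cora stored in TuGraph')
parser.add_argument('--db_path', type=str, default='./coradb',
                    help='path of the imported graph database')
parser.add_argument('--username', type=str, default='admin',
                    help='database user name')
parser.add_argument('--password', type=str, default='73@TuGraph',
                    help='database password')
parser.add_argument('--graph_name', type=str, default='default',
                    help='name of the subgraph to train on')
parser.add_argument('--model_save_path', type=str, default='',
                    help='where to save the trained model; empty disables saving')
args = parser.parse_args()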

6.2. Build Sampler

During the training process, the GetDB operator is first used to obtain the graph data from the database and convert it into the required data structure. The specific code is as follows:

    GetDB.Process(db_: lgraph_db_python.PyGraphDB, olapondb: lgraph_db_python.PyOlapOnDB, feature_num: size_t, NodeInfo: list, EdgeInfo: list)

As shown in the code, the results are stored in NodeInfo and EdgeInfo, which are Python lists. The information they hold is laid out as follows:

Graph Data        Storage Position
Edge source       EdgeInfo[0]
Edge destination  EdgeInfo[1]
Vertex ID         NodeInfo[0]
Vertex features   NodeInfo[1]
Vertex label      NodeInfo[2]

Then a sampler is constructed:

batch_size = 5
count = 2708  # total number of vertices in the Cora dataset
sampler = TugraphSample(args)
# fake_g is a placeholder graph required by the DataLoader API;
# the actual data is fetched from TuGraph by the sampler.
dataloader = dgl.dataloading.DataLoader(fake_g,
    torch.arange(count),  # IDs of the seed nodes to sample
    sampler,
    batch_size=batch_size,
    num_workers=0,
    )
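
The resulting dataloader is iterable; each iteration invokes the sampler and yields one mini-batch. A minimal usage sketch (the exact batch structure depends on what TugraphSample returns; the common DGL convention of (input_nodes, output_nodes, blocks) is assumed here for illustration):

# Inspect the first mini-batch produced by the sampler.
for input_nodes, output_nodes, blocks in dataloader:
    print(len(input_nodes), len(output_nodes))
    break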

6.3. Convert the Results to the Required Format

src = EdgeInfo[0].astype('int64')         # edge source vertex IDs
dst = EdgeInfo[1].astype('int64')         # edge destination vertex IDs
nodes_idx = NodeInfo[0].astype('int64')   # original vertex IDs
remap(src, dst, nodes_idx)                # remap vertex IDs to consecutive local indices
features = NodeInfo[1].astype('float32')  # vertex features
labels = NodeInfo[2].astype('int64')      # vertex labels
g = dgl.graph((src, dst))                 # build a DGL graph from the edge lists
g.ndata['feat'] = torch.tensor(features)
g.ndata['label'] = torch.tensor(labels)
return g

This converts the sampled results into a DGL graph in the format required for training.
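
Here remap rewrites the vertex IDs in src and dst in place so that they become consecutive indices starting from 0, which is the form dgl.graph expects. A minimal sketch of the idea, assuming nodes_idx holds the original vertex IDs (the actual implementation used by the example may differ):

def remap(src, dst, nodes_idx):
    # Map each original vertex ID to its position in nodes_idx (0..n-1),
    # then rewrite both edge endpoint arrays in place.
    id_map = {int(v): i for i, v in enumerate(nodes_idx)}
    for i in range(len(src)):
        src[i] = id_map[int(src[i])]
        dst[i] = id_map[int(dst[i])]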

6.4. Build the GCN Model

class GCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN
        self.layers.append(dgl.nn.GraphConv(in_size, hid_size, activation=F.relu))
        self.layers.append(dgl.nn.GraphConv(hid_size, out_size))
        self.dropout = nn.Dropout(0.5)

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)  # apply dropout between layers
            h = layer(g, h)
        return h

def build_model():
    in_size = feature_len  # feature_len is the length of features, which is 1433 here
    out_size = classes  # classes is the number of classes, which is 7 here
    model = GCN(in_size, 16, out_size)  # 16 is the size of the hidden layer
    return model

In this tutorial, a two-layer Graph Convolutional Network (GCN) will be built. Each layer aggregates neighbor information to compute new node representations.
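
Once the graph from section 6.3 is available, the model can be instantiated and applied directly. A minimal sketch, assuming feature_len = 1433 and classes = 7 as above:

model = build_model()               # GCN(1433, 16, 7)
g = dgl.add_self_loop(g)            # GraphConv expects every node to have an in-edge
logits = model(g, g.ndata['feat'])  # shape (2708, 7): one score vector per paper
pred = logits.argmax(dim=1)         # predicted category for each node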

6.5. Train the GCN Model

loss_fcn = nn.CrossEntropyLoss()

def train(graph, model, model_save_path):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=5e-4)
    model.train()
    graph = dgl.add_self_loop(graph)  # GraphConv requires self-loops
    logits = model(graph, graph.ndata['feat'])
    loss = loss_fcn(logits, graph.ndata['label'])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    current_loss = float(loss)
    if model_save_path != "":  # if a save path is given, save the model
        if 'min_loss' not in train.__dict__:
            train.min_loss = current_loss
        elif current_loss < train.min_loss:
            train.min_loss = current_loss
            model_save_path = 'best_model.pth'  # also keep the best model so far
        torch.save(model.state_dict(), model_save_path)
    return current_loss

for epoch in range(50):
    loss = train(g, model, model_save_path)  # one full-graph training step
    if epoch % 5 == 0:
        print('In epoch', epoch, ', loss', loss)
    sys.stdout.flush()

As shown in the code, the model is trained iteratively for 50 epochs with the defined model and optimizer, and the trained model is saved to the path model_save_path.

The output is as follows:

In epoch 0 , loss 1.9586775302886963
In epoch 5 , loss 1.543689250946045
In epoch 10 , loss 1.160698413848877
In epoch 15 , loss 0.8862786889076233
In epoch 20 , loss 0.6973256468772888
In epoch 25 , loss 0.5770673751831055
In epoch 30 , loss 0.5271289348602295
In epoch 35 , loss 0.45514997839927673
In epoch 40 , loss 0.43748989701271057
In epoch 45 , loss 0.3906335234642029
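
The script reports only the training loss. To check prediction quality, the trained model can be evaluated by comparing the argmax of the logits against the labels; a minimal sketch (this accuracy computation is an illustrative addition, not part of the original script):

model.eval()
with torch.no_grad():
    logits = model(dgl.add_self_loop(g), g.ndata['feat'])
    pred = logits.argmax(dim=1)                      # predicted class per node
    acc = (pred == g.ndata['label']).float().mean()  # fraction of correct predictions
    print('Training accuracy:', float(acc))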

The graph learning module can be accelerated by using a GPU. If users want to run it on the GPU, they need to install the corresponding GPU driver and environment. For details, please refer to learn/README.md.
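
With a working CUDA setup, training can be moved to the GPU by placing both the model and the graph on the device. A minimal sketch using standard PyTorch/DGL calls (not TuGraph-specific):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)  # move the model parameters to the GPU
g = g.to(device)          # move the graph structure and node data to the GPU
loss = train(g, model, model_save_path)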

The complete code can be found in learn/examples/train_full_cora.py.