Heterogeneous Graph

This document describes how to use heterogeneous graphs for training.

1. Introduction to Heterogeneous Graph

A heterogeneous graph is a graph structure composed of nodes and edges of different types. In a heterogeneous graph, nodes and edges can have diverse attributes and relationships, representing different entities and their complex associations.

In a heterogeneous graph, node types can represent different entities such as users, products, topics, etc., while edge types represent the relationships between different entities, such as the follow relationship between users, or the purchase relationship between users and products. Nodes and edges can have different attributes.

Heterogeneous graphs can express and analyze real-world systems with multiple entity types and complex relationships more faithfully than homogeneous graphs, which makes them useful across many areas of data analysis.
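The user and product example above can be written down in a few lines of plain Python. This is only an illustrative sketch: the node IDs, attribute names, and the `relation_counts` helper are hypothetical and are not part of TuGraph. It shows how every edge in a heterogeneous graph belongs to one (source node type, edge type, target node type) relationship.

```python
from collections import defaultdict

# Hypothetical toy data: every node has a type and type-specific attributes.
nodes = {
    "u1": {"type": "user", "name": "Alice"},
    "u2": {"type": "user", "name": "Bob"},
    "p1": {"type": "product", "price": 9.5},
}

# Each edge is (source node ID, edge type, target node ID).
edges = [
    ("u1", "follows", "u2"),
    ("u1", "purchases", "p1"),
    ("u2", "purchases", "p1"),
]


def relation_counts(nodes, edges):
    """Count edges per (source type, edge type, target type) triplet."""
    counts = defaultdict(int)
    for src, etype, dst in edges:
        counts[(nodes[src]["type"], etype, nodes[dst]["type"])] += 1
    return dict(counts)


print(relation_counts(nodes, edges))
```

Here the graph contains two relationships: one "follows" edge between users and two "purchases" edges from users to products.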

2. Creating a Heterogeneous Graph

In TuGraph, a heterogeneous graph is composed of a series of edge relationships. Each relationship is defined by a triplet of strings: (source node type, edge type, target node type). Creating a heterogeneous graph is similar to creating a homogeneous graph, except that the triplet definitions must be specified when the graph is created, as in the following example:

    olapondb = PyOlapOnDB('Empty', db, txn, [("node", "edge", "node")])

In the code snippet above, the fourth parameter is the edge relationship definition of the heterogeneous graph. By specifying this parameter, you can select specific node and edge types for graph construction and training. If this parameter is not specified, all node and edge types will be used for graph construction by default.

3. Heterogeneous Graph Query Interface

For convenience, TuGraph provides interfaces for querying node and edge types when the fourth parameter is given. The following examples demonstrate these interfaces:

3.1 Querying Node Types

    olapondb.ntypes()

The return value is a list of node types, such as ['node1', 'node2', 'node3'].

3.2 Querying Edge Types

    olapondb.etypes()

The return value is a list of edge types, such as ['edge1', 'edge2', 'edge3'].

3.3 Querying Node and Edge Types

    olapondb.metagraph()

The return value is a list of string triplets (source node type, edge type, target node type), such as [('node1', 'edge1', 'node2'), ('node2', 'edge2', 'node3')].
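The three query results are related: the node and edge types can be read off the triplets. The sketch below illustrates that relationship in plain Python using the sample triplets above. `summarize_metagraph` is a hypothetical helper, not a TuGraph API, and the ordering it produces is an assumption; the order returned by the real interfaces is not stated here.

```python
def summarize_metagraph(triplets):
    """Return (node types, edge types) in first-seen order, without duplicates."""
    ntypes, etypes = [], []
    for src, etype, dst in triplets:
        for ntype in (src, dst):
            if ntype not in ntypes:
                ntypes.append(ntype)
        if etype not in etypes:
            etypes.append(etype)
    return ntypes, etypes


# Sample value in the same shape as the metagraph() result described above.
metagraph = [("node1", "edge1", "node2"), ("node2", "edge2", "node3")]
ntypes, etypes = summarize_metagraph(metagraph)
print(ntypes, etypes)
```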

4. Heterogeneous Graph Output Format

As with homogeneous graphs, the sampling results of a heterogeneous graph are stored in NodeInfo and EdgeInfo. The output data can be obtained as follows:

    NodeInfo = []
    EdgeInfo = []
    getdb.Process(db, olapondb, feature_len, NodeInfo, EdgeInfo)

The data is stored at the following locations:

Graph Data	Storage Location
Edge Source	EdgeInfo[0]
Edge Destination	EdgeInfo[1]
Edge Type	EdgeInfo[2]
Vertex ID	NodeInfo[0]
Vertex Feature	NodeInfo[1]
Vertex Label	NodeInfo[2]
Vertex Type	NodeInfo[3]
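As an illustration of how this layout might be consumed, the sketch below regroups the edges by (source node type, edge type, target node type). It rests on several assumptions that are not confirmed here: that each entry of NodeInfo and EdgeInfo is a sequence aligned by position, that types are strings, and that the sample values (which are made up) resemble real output. `group_edges_by_relation` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical sample values laid out as in the table above.
NodeInfo = [
    [0, 1, 2, 3],                                      # NodeInfo[0]: vertex IDs
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],  # NodeInfo[1]: features
    [0, 1, 0, 1],                                      # NodeInfo[2]: labels
    ["author", "author", "paper", "paper"],            # NodeInfo[3]: vertex types
]
EdgeInfo = [
    [0, 1, 1, 2],                            # EdgeInfo[0]: edge sources
    [2, 2, 3, 3],                            # EdgeInfo[1]: edge destinations
    ["writes", "writes", "writes", "cites"], # EdgeInfo[2]: edge types
]


def group_edges_by_relation(node_info, edge_info):
    """Map each (src type, edge type, dst type) to its (sources, destinations)."""
    vertex_type = dict(zip(node_info[0], node_info[3]))
    grouped = defaultdict(lambda: ([], []))
    for src, dst, etype in zip(edge_info[0], edge_info[1], edge_info[2]):
        key = (vertex_type[src], etype, vertex_type[dst])
        grouped[key][0].append(src)
        grouped[key][1].append(dst)
    return dict(grouped)


relations = group_edges_by_relation(NodeInfo, EdgeInfo)
for triplet, (sources, destinations) in relations.items():
    print(triplet, sources, destinations)
```

The vertex IDs here stay global. A framework that expects IDs numbered separately per node type would need an additional renumbering step, which this sketch does not perform.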

5. Training Heterogeneous Graphs

The goal of training a heterogeneous graph is to learn the representation of nodes and edges in the graph for subsequent tasks such as node classification, link prediction, and graph clustering. To achieve this goal, researchers have proposed a variety of models based on graph neural networks (GNNs). These models update the representation of nodes by aggregating the information of neighboring nodes, thereby capturing the complex relationships in the graph structure.

Since heterogeneous graphs contain nodes and edges of multiple types, it is necessary to consider how to process these different types of information when designing GNN models. One common approach is to design different aggregation functions to process different types of neighboring nodes. In addition, it is also necessary to consider how to integrate these different types of information together so that the model can effectively learn the representation of nodes and edges.
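One simple way to picture type-specific aggregation is to aggregate neighbors separately for each edge type and then combine the per-type results. The sketch below does this with a mean per edge type followed by a sum across edge types. It is a deliberately simplified illustration in plain Python, not the model used by TuGraph: real GNN layers also apply learned weights per relation, and all names and values here are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def aggregate(features, relations, target):
    """Mean over incoming neighbors per edge type, summed across edge types."""
    dim = len(next(iter(features.values())))
    out = [0.0] * dim
    for etype, edge_list in relations.items():
        neighbors = [features[src] for src, dst in edge_list if dst == target]
        if neighbors:  # skip edge types with no incoming edge to the target
            out = [o + m for o, m in zip(out, mean(neighbors))]
    return out


# Hypothetical node features and edges, keyed by edge type.
features = {
    "a1": [1.0, 0.0],
    "a2": [3.0, 2.0],
    "p1": [0.0, 0.0],
    "p2": [4.0, 4.0],
}
relations = {
    "writes": [("a1", "p1"), ("a2", "p1")],
    "cites": [("p2", "p1")],
}

print(aggregate(features, relations, "p1"))  # [6.0, 5.0]
```

For node "p1", the "writes" neighbors average to [2.0, 1.0] and the single "cites" neighbor contributes [4.0, 4.0], so the combined result is [6.0, 5.0]. Keeping the aggregation per edge type prevents a relation with many edges from drowning out one with few.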

TuGraph provides an example that trains a heterogeneous graph model on the ogbn-mag dataset, which can be used as a reference.

The official Docker image provided by TuGraph does not yet include an environment for heterogeneous graph training, so users need to install the required dependencies themselves. Before training, install the ogb and pandas packages as follows:

    pip3 install ogb
    pip3 install pandas==0.24.2

The training code looks like this:

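    import time

    import dgl
    import torch

    # Assumption: loss_fcn is defined elsewhere in the full script; cross-entropy
    # is shown here only as a typical choice for node classification.
    loss_fcn = torch.nn.CrossEntropyLoss()

    def train(graph, model, model_save_path):
        """Run one full-graph training step and return the loss as a float."""
        # Note: the optimizer is created anew on every call.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=5e-4)
        model.train()
        s = time.time()
        load_time = time.time()
        graph = dgl.add_self_loop(graph)
        # Forward pass over the whole graph, then a single optimizer update.
        logits = model(graph, graph.ndata['feat'])
        loss = loss_fcn(logits, graph.ndata['label'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_time = time.time()
        # print('load time', load_time - s, 'train_time', train_time - load_time)
        current_loss = float(loss)
        if model_save_path != "":
            # train.min_loss is a function attribute, so it persists across calls.
            if 'min_loss' not in train.__dict__:
                train.min_loss = current_loss
            elif current_loss < train.min_loss:
                # New best loss: save to 'best_model.pth' instead of the given path.
                train.min_loss = current_loss
                model_save_path = 'best_model.pth'
            # The weights are written on every call when a save path is given.
            torch.save(model.state_dict(), model_save_path)
        return current_loss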

The complete training code can be found in tugraph/learn/examples/train_full_mag.py.