High Availability mode

This document describes the principles, preparations, and server operations of high availability mode.

1.Theory

TuGraph provides high availability (HA) mode through multi-machine hot backup. In high availability mode, write operations to the database are synchronized to all non-witness servers, so that even if some servers go down, the availability of the service is not affected.

When high availability mode is enabled, multiple TuGraph servers form a backup group, which is a high availability cluster. Each backup group consists of three or more TuGraph servers, one of which serves as the leader and the others as followers. Write requests are served by the leader, which replicates and synchronizes each request to the followers and responds to the client only after the request has been synchronized to the other servers. This way, if any server fails, the remaining servers still hold all the data written so far. If the leader server fails, the other servers automatically elect a new leader.

TuGraph's high availability mode provides two types of nodes: replica nodes and witness nodes. A replica node is an ordinary node that keeps both logs and data and can serve external requests. A witness node only receives heartbeats and logs and does not store data. Depending on deployment requirements, leader and follower nodes can be flexibly deployed as replica nodes or witness nodes. Accordingly, there are two deployment methods for TuGraph's high availability mode: the ordinary deployment mode, and the simple deployment mode with witness nodes.

In the normal deployment mode, the leader and all followers are replica nodes. Write requests are served by the leader, which copies each request to the followers and cannot respond to the client until the request has been synchronized to more than half of the servers. This way, if fewer than half of the servers fail, the remaining servers still hold all the data written so far; for example, a backup group of three replicas tolerates one failure, and a group of five tolerates two. If the leader server fails, the other servers automatically elect a new leader to ensure data consistency and service availability.

However, when server resources are insufficient or a network partition occurs, a normal HA cluster cannot be formed. Because a witness node holds no data and consumes few resources, a witness node and a replica node can be deployed on the same machine. For example, with only two machines, you can deploy a replica node on one machine, and a replica node plus a witness node on the other (a sketch of this layout is given below). This not only saves resources; since the witness node does not apply logs to the state machine and does not need to generate or install snapshots, it responds to requests very quickly and can help elect a new leader quickly when the cluster crashes or the network is partitioned. This is the simple deployment mode of a TuGraph HA cluster. Although witness nodes bring these benefits, they hold no data, so the cluster effectively gains a node that cannot become leader, which slightly reduces availability. To improve cluster availability, you can allow a witness node to become leader temporarily by setting the ha_enable_witness_to_leader parameter to true. After the witness node synchronizes the new logs to the other nodes, it actively hands the leader role over to the node with the most up-to-date log.

This feature is supported in version 3.6 and above.
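
For illustration, a two-machine simple deployment might be laid out as follows. The IP addresses, the witness's RPC port 9091, and the per-instance configuration file names are placeholders; the two instances sharing one machine must each use their own configuration file and data directory.

# Machine A (172.22.224.15): one replica node
$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.16:9091

# Machine B (172.22.224.16): one replica node plus one witness node
$ ./lgraph_server -c lgraph_replica.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.16:9091
$ ./lgraph_server -c lgraph_witness.json --rpc_port 9091 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.16:9091 --ha_is_witness 1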

2.Preparation

To enable high availability mode, users need to:

  • Prepare three or more TuGraph server instances.

  • Enable high availability mode when starting lgraph_server by setting the ‘enable_ha’ option to ‘true’ through a configuration file or the command line.

  • Set the correct rpc_port through the configuration file or the command line (a minimal configuration sketch follows this list).
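
For reference, a minimal lgraph.json sketch is shown below. It assumes that the configuration keys mirror the command-line options used throughout this document; the addresses are placeholders to be replaced with your own machines.

{
  "host": "172.22.224.15",
  "rpc_port": 9090,
  "enable_ha": true,
  "ha_conf": "172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090"
}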

3.Start the initial backup group

After installing TuGraph, you can use the lgraph_server command to start a high availability cluster on different machines. This section mainly explains how to start a high availability cluster; for managing the cluster status after startup, see the lgraph_peer tool.

3.1.Consistent initial data

When the data in all servers is identical, or there is no data at startup, the user can specify --ha_conf host1:port1,host2:port2 to start each server. In this way, all prepared TuGraph instances can be added to the initial backup group at once. All servers in the backup group elect a leader according to the RAFT protocol, and the other servers join the backup group as followers.

An example command to start an initial backup group is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090

After the first server starts, it elects itself as ‘leader’ and forms a backup group containing only itself.

3.2.Inconsistent initial data

If the first server already contains data (imported by the lgraph_import tool or transferred from a server running in non-high-availability mode), and it has not previously been used in high availability mode, the user should start it with the bootstrap method. Start the server that has data in bootstrap mode by setting the ha_bootstrap_role parameter to 1, and designate that machine as the leader through the ha_conf parameter. In bootstrap mode, the server copies its own data to a newly joined server before adding it to the backup group, so that the data in every server is consistent.

An example command to start the server that already has data is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090 --ha_bootstrap_role 1

Other servers without data need to set the ha_bootstrap_role parameter to 2 and specify the leader through the ha_conf parameter. A command example is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090 --ha_bootstrap_role 2

You need to pay attention to two points when using bootstrap to start an HA cluster:

  1. Wait for the leader node to generate a snapshot and finish starting before adding follower nodes; otherwise the follower nodes may fail to join. When starting a follower node, you can set the ha_node_join_group_s parameter to a slightly larger value to allow multiple waits and timeout retries while joining the HA cluster (see the example after this list).

  2. The HA cluster can use bootstrap mode only the first time it is started; later starts must use the normal mode (see Section 3.1). In particular, multiple nodes of the same cluster must not be started in bootstrap mode, otherwise it may cause data inconsistency.
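
For example, a follower might be started with a longer join timeout as shown below; the value 600 is only an illustrative choice and should be tuned to your environment.

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090 --ha_bootstrap_role 2 --ha_node_join_group_s 600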

4.Start witness node

4.1.Witness nodes are not allowed to become leader

A witness node is started in the same way as an ordinary node; you only need to set the ha_is_witness parameter to true. Note that the number of witness nodes should be less than half of the total number of cluster nodes (for example, at most one witness in a three-node cluster).

An example command to start the witness node server is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090 --ha_is_witness 1

Note: By default, the witness node is not allowed to become the leader node, which can improve the performance of the cluster, but will reduce the availability of the cluster when the leader node crashes.

4.2.Allow witness nodes to become leader

You can set the ha_enable_witness_to_leader parameter to true so that the witness node can temporarily become the leader; after the new logs have been synchronized, it actively transfers leadership to another node.

An example command to start a witness node that is allowed to become leader is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090 --ha_is_witness 1 --ha_enable_witness_to_leader 1

Note: Although allowing witness nodes to become leader nodes can improve the availability of the cluster, it may affect data consistency in extreme cases. Therefore, it should generally be ensured that the number of witness nodes + 1 is less than half of the total number of cluster nodes.

5.Scale out other servers

After starting the initial backup group, if you want to scale out the backup group by adding new servers, use the --ha_conf HOST:PORT option, where HOST is the IP address of any server already in the backup group and PORT is its RPC port. For example:

./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090

This command will start a TuGraph server in high availability mode and try to add it to the backup group containing the server 172.22.224.15:9090. Note that joining a backup group requires a server to synchronize its data with the backup group’s leader server, and this process may take a considerable amount of time, depending on the size of the data.

6.Stopping the Server

When a server is taken offline via ‘CTRL-C’, it notifies the current ‘leader’ server to remove it from the backup group. If the leader server itself goes offline, it transfers leadership to another server before shutting down.

If a server is terminated or disconnected from other servers in the backup group, the server is considered a failed node and the leader server will remove it from the backup group after a specified time limit.

If any server leaves the backup group and wishes to rejoin, it must start with the ‘--ha_conf {HOST:PORT}’ option, where ‘HOST’ is the IP address of a server in the current backup group.
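
For example, assuming 172.22.224.15:9090 is still a member of the current backup group, a server can rejoin with:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090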

7.Restarting the Server

Restarting the entire backup group is not recommended, as it disrupts service. If desired, all servers can be shut down. On reboot, it must be ensured that at least N/2+1 of the servers that were in the backup group at shutdown can start normally; otherwise the startup will fail. Regardless of whether the replication group was initially started in bootstrap mode, restarting only requires specifying the --ha_conf host1:port1,host2:port2 parameter to restart all servers at once. The command example is as follows:

$ ./lgraph_server -c lgraph.json --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090

8.Deploy a highly available cluster with Docker

In real business scenarios, you may need to deploy high-availability clusters across multiple operating systems or architectures. Heterogeneous environments may lack some of the dependencies needed to compile TuGraph, so compiling inside Docker and deploying the high-availability cluster from it is very useful. Taking the CentOS 7 Docker image as an example, the steps to deploy a highly available cluster are as follows.

8.1.Pull the image

Use the following command to download TuGraph's compilation Docker image:

docker pull tugraph/tugraph-compile-centos7

Then pull the TuGraph source code and compile and install it, as sketched below.
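
A minimal sketch of this step is shown below, run inside a container created from the compile image (see the next step). The repository URL is the official TuGraph repository; the dependency script and build targets may differ between versions, so follow the compilation document in the repository if they do.

# Fetch the source code together with its submodules
git clone --recursive https://github.com/TuGraph-family/tugraph-db.git
cd tugraph-db
# Build third-party dependencies (script name may vary by version)
deps/build_deps.sh
# Compile lgraph_server and the accompanying tools
mkdir build && cd build
cmake .. && make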

8.2.Create container

Use the following command to create a container. The --net=host option runs the container in host mode, in which Docker and the host machine share the network namespace, that is, the same IP address.

docker run --net=host -itd -v {src_dir}:{dst_dir} --name tugraph_ha tugraph/tugraph-compile-centos7 /bin/bash

8.3.Start service

Use the following command to start the service on each server. Because Docker and the host share the same IP, you can directly bind the service to the host IP:

$ lgraph_server -c lgraph.json --host 172.22.224.15 --rpc_port 9090 --enable_ha true --ha_conf 172.22.224.15:9090,172.22.224.16:9090,172.22.224.17:9090

9.Server Status

The current status of the backup group can be obtained from the TuGraph visualization tool, REST API, and Cypher query.

In the TuGraph visualization tool, you can find the list of servers and their roles in the backup group in the DBInfo section.

With the REST API, you can use GET /info/peers to request information.

In Cypher, the CALL dbms.listServers() statement is used to query the status information of the current backup group.
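
For example, the backup group status can be fetched from any node with a plain HTTP request. The REST port used here (7070) is an assumption and depends on your configuration; if authentication is enabled, the usual authorization token must be attached.

$ curl -XGET http://172.22.224.15:7070/info/peers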

10.Data synchronization in high availability mode

In high availability mode, different servers in the same backup group may not always be in the same state. For performance reasons, the leader server considers a request committed once it has been synchronized to more than half of the servers. Although the remaining servers will eventually receive the new request, this inconsistency between servers persists for some time. A client may also send a request to a server that has just restarted and therefore has an older state while waiting to join a backup group.

To ensure that the client sees consistently continuous data, and in particular to get rid of the ‘reverse time travel’ problem, where the client reads a state older than it has seen before, each TuGraph server keeps a monotonically increasing data version number. The mapping of the data version number to the database state in the backup group is globally consistent, meaning that if two servers have the same data version number, they must have the same data. When responding to a request, the server includes its data version number in the response. Thus, the client can tell which version it has seen. The client can choose to send this data version number along with the request. Upon receiving a request with a data version number, the server compares the data version number to its current version and rejects the request if its own version is lower than the requested version. This mechanism ensures that the client never reads a state that is older than before.