Monitoring

This document mainly introduces the visualization operation and maintenance monitoring of TuGraph.

1.Design Concept

Visualization monitoring is not an indispensable part of TuGraph itself. Therefore, in the design, visualization monitoring is regarded as an application in the peripheral ecology of TuGraph to reduce the coupling with TuGraph database and its own impact. TuGraph visualization monitoring adopts the hottest open source solution - TuGraph Monitor + Prometheus + Grafana. Wherein, TuGraph Monitor is the client of TuGraph service, which initiates Procedure requests to TuGraph service through TCP connection. After receiving the request, TuGraph service collects statistics of CPU, memory, disk, IO, and number of requests on its own machine and responds. TuGraph Monitor packages the index data received from TuGraph into the format required by Prometheus and saves it in memory, waiting for Prometheus to obtain it through HTTP request. Prometheus regularly obtains packaged request data from TuGraph Monitor through HTTP requests and saves it in its own time-series database according to the time obtained. Grafana can obtain the statistical data within a certain time period from Prometheus according to user configuration and draw easy-to-understand graphics on the web interface to display the final result. In the entire request chain, the active acquisition, i.e. pull model, is used, which has the advantages of minimizing the coupling between data producers and data consumers, making development simpler, and not requiring data producers to consider the data processing capability of data consumers. Even if the data processing ability of a consumer is weak, it will not be crushed by the data producer producing data too fast. One of the shortcomings of the active pulling model is that the real-time performance of the data is insufficient. However, in this scenario, data does not have high real-time requirements.

1.1.TuGraph

TuGraph database provides the ability to collect multiple data information, such as disk, memory, network IO, and query requests, in the service machine, and provides queries through standard Procedure methods. This action of collecting data only occurs when a user queries through the interface, avoiding the impact of TuGraph monitoring service on user business query requests when users do not need the indicators on TuGraph monitoring service machine.

1.2.TuGraph Monitor

TuGraph Monitor is a tool in the peripheral ecology of TuGraph. It communicates with TuGraph through C++ RPC Client as one of the many users of TuGraph and queries the performance indicators of the machine where TuGraph service is located through the Procedure query interface. It packages the results returned by TuGraph into the data model required by Prometheus and waits for Prometheus to obtain them. Users can set the query time interval to minimize the impact of obtaining monitoring indicators on business queries.

1.3.Prometheus

Prometheus is an open-source monitoring platform with a dedicated time-series database. It regularly obtains statistical indicators from the TuGraph Monitor service through HTTP requests and saves them in its own time-series database. For details, please refer to the official website:https://prometheus.io/docs/introduction/first_steps

1.4.Grafana

Grafana is an open-source visualization and analysis software. It can obtain data from multiple data sources, including Prometheus, and can convert data in the time-series database into tools for beautiful graphics and visual effects. For specific information, please refer to the official website:https://grafana.com/docs/grafana/v7.5/getting-started/

2.Deployment Solution

2.1.Step 1

Start the TuGraph service. For details, please refer to the documentation:https://github.com/TuGraph-db/tugraph-db/blob/master/doc/zh-CN/1.guide/3.quick-start.md

2.2.Step 2

Start the TuGraph Monitor tool. The startup command is as follows:

./lgraph_monitor --server_host 127.0.0.1:9091 -u admin -p your_password \
			--monitor_host 127.0.0.1:9999  --sampling_interval_ms 1000

The meanings of the parameters are as follows:

Available command line options:
    --server_host       Host on which the tugraph rpc server runs.
                        Default=127.0.0.1:9091.
    -u, --user          DB username.
    -p, --password      DB password.
    --monitor_host      Host on which the monitor restful server runs.
                        Default=127.0.0.1:9999.
    --sampling_interval_ms
                        sampling interval in millisecond. Default=1.5e2.
    -h, --help          Print this help message. Default=0.

2.3.Step Three

  • Download the Prometheus tarball that matches your machine architecture and system version. Download link:https://prometheus.io/download/

  • Unzip the tarball with the following command:

tar -zxvf prometheus-2.37.5.linux-amd64.tar.gz
  • Modify the configuration file prometheus.yml and add the following configuration to enable Prometheus to scrape performance data packaged by TuGraph Monitor.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "tugraph"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9111"]
  • Start Prometheus. You can obtain specific startup parameters with the following command:

./prometheus -h
  • Verify if the Prometheus service is running properly by logging into the Prometheus service through the web portal, and querying whether the monitoring index resources_report has been obtained. If the data can be successfully queried, the operation is successful.

2.4.Step Four

  • Download the Grafana installation package that matches your machine architecture and system version. Download link:https://grafana.com/grafana/download

  • Install Grafana. For details, refer to: https://grafana.com/docs/grafana/v7.5/installation/

  • Start Grafana. For details, refer to: https://grafana.com/docs/grafana/v7.5/installation/

  • Configure Grafana. First, configure the IP address of Prometheus in the data source settings. After the configuration is completed, verify if the data source is connected successfully by testing the connection function. Then, import the following template and modify the correct interface IP and port in the page according to the actual situation. Finally, set the refresh time and monitoring time range according to the actual situation.

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 2,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            }
          },
          "mappings": [],
          "unit": "kbytes"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "D {instance=\"localhost:7010\", job=\"TuGraph\", resouces_type=\"memory\", type=\"available\"}"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "others"
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "D {__name__=\"resources_report\", instance=\"localhost:7010\", job=\"TuGraph\", resouces_type=\"memory\", type=\"available\"}"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-green",
                  "mode": "fixed"
                }
              },
              {
                "id": "displayName",
                "value": "others"
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "others"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-blue",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "graph_used"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-orange",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 16,
        "w": 6,
        "x": 0,
        "y": 0
      },
      "id": 14,
      "options": {
        "displayLabels": [
          "name",
          "value"
        ],
        "legend": {
          "displayMode": "table",
          "placement": "bottom",
          "values": [
            "percent",
            "value"
          ]
        },
        "pieType": "pie",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"memory\",type=\"self\"}",
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"memory\",type=\"available\"}",
          "hide": false,
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"memory\",type=\"total\"}",
          "hide": true,
          "legendFormat": "{{label_name}}",
          "range": true,
          "refId": "C"
        },
        {
          "datasource": {
            "type": "__expr__",
          },
          "expression": "$C -$A - $B",
          "hide": false,
          "refId": "D",
          "type": "math"
        }
      ],
      "title": "memory",
      "type": "piechart"
    },
    {
      "alert": {
        "alertRuleTags": {},
        "conditions": [
          {
            "evaluator": {
              "params": [
                1000
              ],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "A",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "5m",
        "frequency": "1m",
        "handler": 1,
        "message": "[Production Graph Database Grafana] \n  QPS exceeds 1000",
        "name": "Request statistics alert",
        "noDataState": "no_data",
        "notifications": []
      },
      "datasource": {
        "type": "prometheus",
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 7,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "smooth",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": " "
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "write"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-blue",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 16,
        "w": 12,
        "x": 6,
        "y": 0
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max",
            "mean",
            "last"
          ],
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"request\",type=~\"total|write\"}",
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "op": "gt",
          "value": 1000,
          "visible": true
        }
      ],
      "title": "Request statistics",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            }
          },
          "mappings": [],
          "unit": "decbits"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "graph_used"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-red",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "available"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "D"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "other"
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "other"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-purple",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 16,
        "w": 6,
        "x": 18,
        "y": 0
      },
      "id": 12,
      "options": {
        "displayLabels": [
          "name",
          "value"
        ],
        "legend": {
          "displayMode": "table",
          "placement": "bottom",
          "sortBy": "Value",
          "sortDesc": true,
          "values": [
            "value",
            "percent"
          ]
        },
        "pieType": "pie",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"disk\",type=\"available\"}",
          "format": "time_series",
          "instant": false,
          "interval": "",
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"disk\",type=\"self\"}",
          "hide": false,
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "B"
        },
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"disk\",type=\"total\"}",
          "hide": true,
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "C"
        },
        {
          "datasource": {
            "type": "__expr__",
          },
          "expression": "$C - $A - $B",
          "hide": false,
          "refId": "D",
          "type": "math"
        }
      ],
      "title": "disk",
      "transformations": [
        {
          "id": "configFromData",
          "options": {
            "applyTo": {
              "id": "byFrameRefID"
            },
            "configRefId": "config",
            "mappings": []
          }
        }
      ],
      "type": "piechart"
    },
    {
      "alert": {
        "alertRuleTags": {},
        "conditions": [
          {
            "evaluator": {
              "params": [
                90
              ],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "A",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "5m",
        "frequency": "1m",
        "handler": 1,
        "message": "[Production Graph Database Grafana] \nCPU usage rate exceeds 90%",
        "name": "CPU usage rate alert",
        "noDataState": "no_data",
        "notifications": [
          {
          }
        ]
      },
      "datasource": {
        "type": "prometheus",
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 4,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percent"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "graph_used"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-orange",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "total_used"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-purple",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "self"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "total"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "light-purple",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 14,
        "w": 12,
        "x": 0,
        "y": 16
      },
      "id": 6,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max",
            "mean",
            "last"
          ],
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "code",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"cpu\",type=~\"total|self\"}",
          "hide": false,
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "op": "gt",
          "value": 90,
          "visible": true
        }
      ],
      "title": "CPU usage rate",
      "type": "timeseries"
    },
    {
      "alert": {
        "alertRuleTags": {},
        "conditions": [
          {
            "evaluator": {
              "params": [
                10000
              ],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "A",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "5m",
        "frequency": "1m",
        "handler": 1,
        "message": "[Production Graph Database Grafana] Disk IO exceeds 10MB/S",
        "name": "Disk IO alert",
        "noDataState": "no_data",
        "notifications": []
      },
      "datasource": {
        "type": "prometheus",
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 7,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "smooth",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "bps"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "read"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "super-light-green",
                  "mode": "fixed"
                }
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "write"
            },
            "properties": [
              {
                "id": "color",
                "value": {
                  "fixedColor": "super-light-red",
                  "mode": "fixed"
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 14,
        "w": 12,
        "x": 12,
        "y": 16
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": [
            "min",
            "max",
            "mean",
            "last"
          ],
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
          },
          "editorMode": "builder",
          "expr": "resources_report{instance=\"localhost:7010\",job=\"TuGraph\",resouces_type=\"disk_rate\",type=~\"read|write\"}",
          "hide": false,
          "legendFormat": "{{type}}",
          "range": true,
          "refId": "A"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "op": "gt",
          "value": 10000,
          "visible": true
        }
      ],
      "title": "磁盘IO",
      "type": "timeseries"
    }
  ],
  "refresh": "",
  "schemaVersion": 36,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "timepicker": {
    "hidden": false,
    "refresh_intervals": [
      "10s"
    ]
  },
  "timezone": "",
  "title": "TuGraph Monitoring Web",
  "version": 20,
  "weekStart": ""
}

Verify the effect by refreshing the browser page. If the pie chart and line chart are displayed correctly, the configuration is completed.

3.Future Plans

Currently, visual monitoring only supports single-machine monitoring and can monitor performance indicators such as CPU, disk, network IO, and request QPS of the machine on which the service is located. In the future, monitoring for HA clusters will be implemented, and more meaningful indicators will be included in the monitoring scope.