Data Importing

This document describes the data import features of TuGraph, including the supported CSV and jsonline formats, CSV delimiters, and the offline and online import modes.

1. Introduction

After the installation is successful, you can use the lgraph_import batch import tool to import existing data into TuGraph. lgraph_import supports importing data from CSV files and JSON data sources.

In jsonline format, each line of the file is a JSON array string. Both formats are illustrated below.

CSV format

[movies.csv]
id, name, year, rating
tt0188766,King of Comedy,1999,7.3
tt0286112,Shaolin Soccer,2001,7.3
tt4701660,The Mermaid,2016,6.3

jsonline format

["tt0188766","King of Comedy",1999,7.3]
["tt0286112","Shaolin Soccer",2001,7.3]
["tt4701660","The Mermaid",2016,6.3]
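The relationship between the two formats can be sketched in Python: each CSV row becomes one JSON array on its own line. The helper name csv_to_jsonline and the per-column converters are illustrative assumptions and are not part of lgraph_import.

```python
import csv
import io
import json

def csv_to_jsonline(csv_text, converters):
    """Turn CSV rows into jsonline text: one JSON array per line.

    converters holds one callable per column (for example str, int, float).
    This is a hypothetical helper, not a TuGraph API.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        values = [convert(cell) for convert, cell in zip(converters, row)]
        # Compact separators give the same layout as the sample above.
        lines.append(json.dumps(values, separators=(",", ":")))
    return "\n".join(lines)

rows = "tt0188766,King of Comedy,1999,7.3\ntt0286112,Shaolin Soccer,2001,7.3\n"
print(csv_to_jsonline(rows, [str, str, int, float]))
```

Header lines are assumed to have been removed before conversion.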

TuGraph supports two data importing modes:

  • offline: Reads the data and imports it into the data files of the specified server. This must only be done while the server is offline.

  • online: Reads the data and sends it to the running server, which then imports the data into its database.

2. CSV file format delimiter

A CSV delimiter can be a single character or a multi-character string, and it cannot contain '\r' or '\n'. Note that different shells handle input strings differently, so the delimiter argument may need different escaping depending on the shell.

In addition, lgraph_import supports the following escape sequences for entering special characters:

Escape character    Description
\\                  The backslash character (\)
\a                  Audible bell, ASCII code 0x07
\f                  Form feed, ASCII code 0x0c
\t                  Horizontal tab, ASCII code 0x09
\v                  Vertical tab, ASCII code 0x0b
\xnn                Two hexadecimal digits representing one byte, for example \x9A
\nnn                Three octal digits representing one byte, for example \001; the value cannot exceed 255 (\377)

Demo:

$ ./lgraph_import -c ./import.config --delimiter "\001\002"
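To make the escape rules concrete, the following Python sketch decodes a delimiter string according to the table above and applies the '\r'/'\n' restriction. It only mirrors the documented rules; the function name decode_delimiter is hypothetical, and the real parser inside lgraph_import may behave differently (for example, with unrecognized escapes, which this sketch leaves unchanged).

```python
import re

# Single-character escapes from the table, mapped to their byte values.
_SIMPLE = {"\\": 0x5C, "a": 0x07, "f": 0x0C, "t": 0x09, "v": 0x0B}
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-7]{3}|[\\aftv])")

def decode_delimiter(text):
    """Decode the documented escape sequences in a delimiter string to bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the literal text that precedes this escape sequence.
        out += text[pos:match.start()].encode("utf-8")
        body = match.group(1)
        if body[0] == "x":
            out.append(int(body[1:], 16))      # \xnn: two hexadecimal digits
        elif body[0] in "01234567":
            value = int(body, 8)               # \nnn: three octal digits
            if value > 255:
                raise ValueError("octal escape \\%s exceeds 255" % body)
            out.append(value)
        else:
            out.append(_SIMPLE[body])
        pos = match.end()
    out += text[pos:].encode("utf-8")
    if b"\r" in out or b"\n" in out:
        raise ValueError("delimiter cannot contain \\r or \\n")
    return bytes(out)

print(decode_delimiter(r"\001\002"))
```

With this sketch, the delimiter from the example command decodes to the two bytes 0x01 and 0x02.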

3. The configuration file

The `lgraph_import` tool reads its settings from the specified configuration file. The configuration file describes the paths of the input files, the vertices or edges they represent, and the vertex/edge format.

3.1. Configuration File Format

The configuration file consists of two parts: schema and files. The 'schema' part defines the labels, and the 'files' part describes the data files to be imported.

3.1.1. Keywords

  • schema (Array)

    • label (required, string)

    • type (required, VERTEX or EDGE)

    • properties (array; required for vertices; may be omitted for an edge that has no properties)

      • name (required, string)

      • type (required; one of BOOL, INT8, INT16, INT32, INT64, DATE, DATETIME, FLOAT, DOUBLE, STRING, BLOB)

      • optional (optional; if true, the property may be left without a value)

      • index (optional; whether to build an index on the field)

      • unique (optional; whether the field is indexed with a unique index)

      • pair_unique (optional; whether the field is indexed with a pair_unique index). Only one of unique and pair_unique may be set; setting both causes the import to terminate with an input error.

    • primary (vertex only; names the property used as the primary key, which uniquely identifies a vertex)

    • temporal (edge only, optional; names the timestamp field used for sorting in the storage layer)

    • temporal_field_order (edge only, optional; ASC for ascending order, which is the default, or DESC for descending order)

    • constraints (edge only, optional; an array of start and end label pairs; unset or empty means unrestricted)

    • detach_property (optional, for both vertices and edges; default is false. When true, property data is stored in detached mode, which can reduce I/O read amplification when memory is limited and property data is large.)

  • files (Array)

    • path (required, string; a file path or a directory path. If it is a directory, all files in the directory are imported, so they must share the same schema)

    • header (optional, number; the number of header lines at the start of the file, or 0 if there are none)

    • format (required, JSON or CSV)

    • label (required, string)

    • columns (array)

      • SRC_ID (special string, edges only; marks this column as the source vertex data)

      • DST_ID (special string, edges only; marks this column as the destination vertex data)

      • SKIP (special string; this column is skipped)

      • [property] (the name of the property that this column holds)

    • SRC_ID (edge only; the label of the source vertex)

    • DST_ID (edge only; the label of the destination vertex)

3.1.2. Index length

Because TuGraph limits the length of keys, a unique index cannot be created on values that exceed the length limit. A non-unique index truncates properties that exceed the limit, and when a non-unique index is traversed through an iterator, the key obtained is also truncated and may not match expectations. The truncation length differs between the types of non-unique index.

3.1.2.1. Unique index

A unique index is globally unique, and the maximum length of the index key is 480 bytes. The primary key is a special unique index, so its maximum key length is also 480 bytes; no index can be created on a key longer than 480 bytes.

3.1.2.2. pair_unique index

A pair_unique index is an index that is unique between two vertices. This type of index can only be created in an edge schema. It appends the vids of the source and destination vertices to the user-specified key, and each vid is 5 bytes long. Therefore, the maximum key length is 470 bytes, and no index can be created on a key longer than 470 bytes.

3.1.2.3. Non-unique index

A non-unique index is an index for which neither unique nor pair_unique is set. In TuGraph's implementation, one key of such an index may map to multiple values. To speed up lookups and writes, the user-specified key is followed by the maximum vid or euid in its set of values. For a non-unique index on vertices, the key is followed by a vid, and each vid is 5 bytes long, so the maximum key length is 475 bytes. For a non-unique index on edges, the key is followed by an euid, and each euid is 24 bytes long, so the maximum key length is 456 bytes. An index key that exceeds the corresponding length is truncated automatically.
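The limits above all follow from the 480-byte key limit minus the identifiers appended to the key. A small Python sketch of that arithmetic (the function and constant names are illustrative only, not TuGraph APIs):

```python
MAX_KEY_BYTES = 480   # overall key length limit stated above
VID_BYTES = 5         # size of one vid
EUID_BYTES = 24       # size of one euid

def max_user_key_bytes(index_kind, on_edge=False):
    """Return the longest user-specified key for the given index kind."""
    if index_kind == "unique":
        return MAX_KEY_BYTES
    if index_kind == "pair_unique":
        # Source and destination vids are appended to the key.
        return MAX_KEY_BYTES - 2 * VID_BYTES
    if index_kind == "non_unique":
        # An euid is appended for edges, a vid for vertices.
        return MAX_KEY_BYTES - (EUID_BYTES if on_edge else VID_BYTES)
    raise ValueError("unknown index kind: %r" % index_kind)

for kind, on_edge in [("unique", False), ("pair_unique", True),
                      ("non_unique", False), ("non_unique", True)]:
    print(kind, on_edge, max_user_key_bytes(kind, on_edge))
```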

3.2. Example Configuration File

{
  "schema": [
    {
      "label": "actor",
      "type": "VERTEX",
      "properties": [
        { "name": "aid", "type": "STRING" },
        { "name": "name", "type": "STRING" }
      ],
      "primary": "aid"
    },
    {
      "label": "movie",
      "type": "VERTEX",
      "properties": [
        { "name": "mid", "type": "STRING" },
        { "name": "name", "type": "STRING" },
        { "name": "year", "type": "INT16" },
        { "name": "rate", "type": "FLOAT", "optional": true }
      ],
      "primary": "mid",
      "detach_property": false
    },
    {
      "label": "play_in",
      "type": "EDGE",
      "properties": [{ "name": "role", "type": "STRING", "optional": true }],
      "constraints": [["actor", "movie"]]
    }
  ],
  "files": [
    {
      "path": "actors.csv",
      "header": 2,
      "format": "CSV",
      "label": "actor",
      "columns": ["aid", "name"]
    },
    {
      "path": "movies.csv",
      "header": 2,
      "format": "CSV",
      "label": "movie",
      "columns": ["mid", "name", "year", "rate"]
    },
    {
      "path": "roles.csv",
      "header": 2,
      "format": "CSV",
      "label": "play_in",
      "SRC_ID": "actor",
      "DST_ID": "movie",
      "columns": ["SRC_ID", "role", "DST_ID"]
    }
  ]
}

The example configuration file above defines three labels: two vertex types, 'actor' and 'movie', and one edge type, 'play_in'. Each label entry gives the name of the label, its type (vertex or edge), its property fields, and the type of each field. For vertices, the primary field is also defined; for edges, the constraints field is also defined, which limits the start and end vertices of the edges.

It also describes three data files: two vertex data files, 'actors.csv' and 'movies.csv', and one edge data file, 'roles.csv'. Each entry gives the path of the file, the data format, the number of header lines, the label of the data, and the field that each column of a row corresponds to.

With this configuration file, the import tool first creates the labels 'actor', 'movie', and 'play_in' in TuGraph, and then imports the data from the three files.
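Before running an import, it can help to cross-check the 'files' part against the 'schema' part. The Python sketch below applies a few consistency rules derived from the keyword list in section 3.1.1. The function check_import_config is a hypothetical helper, and the validation that lgraph_import itself performs may differ.

```python
def check_import_config(config):
    """Return a list of problems found in an import configuration dict."""
    problems = []
    labels = {entry["label"]: entry for entry in config.get("schema", [])}
    for entry in config.get("files", []):
        path = entry.get("path", "<no path>")
        schema = labels.get(entry.get("label"))
        if schema is None:
            problems.append(f"{path}: label {entry.get('label')!r} is not in the schema")
            continue
        is_edge = schema["type"] == "EDGE"
        known = {prop["name"] for prop in schema.get("properties", [])}
        for column in entry.get("columns", []):
            if column == "SKIP" or column in known:
                continue
            if column in ("SRC_ID", "DST_ID") and is_edge:
                continue  # these special columns are only valid for edges
            problems.append(f"{path}: unexpected column {column!r}")
        if is_edge:
            for key in ("SRC_ID", "DST_ID"):
                target = labels.get(entry.get(key))
                if target is None or target["type"] != "VERTEX":
                    problems.append(f"{path}: {key} must name a vertex label")
    return problems

config = {
    "schema": [
        {"label": "actor", "type": "VERTEX", "primary": "aid",
         "properties": [{"name": "aid", "type": "STRING"}]},
        {"label": "movie", "type": "VERTEX", "primary": "mid",
         "properties": [{"name": "mid", "type": "STRING"}]},
        {"label": "play_in", "type": "EDGE",
         "properties": [{"name": "role", "type": "STRING", "optional": True}]},
    ],
    "files": [
        {"path": "roles.csv", "header": 2, "format": "CSV", "label": "play_in",
         "SRC_ID": "actor", "DST_ID": "movie",
         "columns": ["SRC_ID", "role", "DST_ID"]},
    ],
}
print(check_import_config(config))
```

A file entry whose label does not appear in the schema (for example 'role' instead of 'play_in') would be reported as a problem by this check.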

4. Offline full import

The offline mode can be used only while the server is offline. An offline import creates a new graph, so it is best suited to the first data import on a newly installed TuGraph server.

To use the lgraph_import tool in offline mode, specify the '--online false' option. To see the available command-line options, use 'lgraph_import --online false --help':

$ ./lgraph_import --online false --help
Available command line options:
    --log               Log file to use, empty means stderr. Default="".
    -v, --verbose       Verbose level to use, higher means more verbose.
                        Default=1.
    ...
    -h, --help          Print this help message. Default=0.

Command line arguments:

  • -c, --config_file config_file: Name of the import configuration file. Its format is described in section 3.

  • --log log_dir: Log directory. The default is an empty string, in which case log output goes to the console.

  • --verbose 0/1/2: Log level. The higher the level, the more detailed the output. The default is 1.

  • -i, --continue_on_error true/false: Whether to skip errors and continue. The default is false, in which case the import exits when an error occurs.

  • -d, --dir {directory}: The database directory to which the import tool writes data. The default is './db'.

  • --delimiter {delimiter}: Data file delimiter. Used only when the data source is in CSV format. The default is ','.

  • -u, --username {user}: Username of the database. The user must be an administrator to perform an offline import.

  • -p, --password {password}: Password of the database user.

  • --overwrite true/false: Whether to overwrite data. When set to true, the data directory is overwritten if it already exists. The default is false.

  • -g, --graph {graph_name}: Name of the graph to import into.

  • -h, --help: Print the help information.

4.1. Offline Import Example

In this example, we use the movie-actor data described above to demonstrate the use of the import tool. The data to be imported is divided into three files: 'movies.csv', 'actors.csv', and 'roles.csv'.

‘movies.csv’ contains information about movies, where each movie has an id (as a primary key for retrieval), and each movie also has attributes such as title, year, and rating. (Data from IMDb).

  [movies.csv]
  id, name, year, rating
  tt0188766,King of Comedy,1999,7.3
  tt0286112,Shaolin Soccer,2001,7.3
  tt4701660,The Mermaid,2016,6.3

The corresponding jsonline format is shown below. Fields may also be written as strings (as in the last three lines); they are converted to the corresponding type on import.

["tt0188766","King of Comedy",1999,7.3]
["tt0286112","Shaolin Soccer",2001,7.3]
["tt4701660","The Mermaid",2016,6.3]
["tt0188766","King of Comedy","1999","7.3"]
["tt0286112","Shaolin Soccer","2001","7.3"]
["tt4701660","The Mermaid","2016","6.3"]

'actors.csv' contains information about the actors. Each actor has an id and properties such as name.

  [actors.csv]
  id, name
  nm015950,Stephen Chow
  nm0628806,Man-Tat Ng
  nm0156444,Cecilia Cheung
  nm2514879,Yuqi Zhang

The corresponding jsonline format is as follows:

["nm015950","Stephen Chow"]
["nm0628806","Man-Tat Ng"]
["nm0156444","Cecilia Cheung"]
["nm2514879","Yuqi Zhang"]

'roles.csv' records which role an actor played in which movie. Each row records a character played by a given actor in a given movie and corresponds to an edge in the database. SRC_ID and DST_ID are the source and destination vertices of the edge; their values are the 'primary' properties (the ids) found in 'actors.csv' and 'movies.csv', respectively.

  [roles.csv]
  actor, role, movie
  nm015950,Tianchou Yin,tt0188766
  nm015950,Steel Leg,tt0286112
  nm0628806,,tt0188766
  nm0628806,coach,tt0286112
  nm0156444,PiaoPiao Liu,tt0188766
  nm2514879,Ruolan Li,tt4701660

The corresponding jsonline format is as follows:

["nm015950","Tianchou Yin","tt0188766"]
["nm015950","Steel Leg","tt0286112"]
["nm0628806",null,"tt0188766"]
["nm0628806","coach","tt0286112"]
["nm0156444","PiaoPiao Liu","tt0188766"]
["nm2514879","Ruolan Li","tt4701660"]

The configuration file import.conf is shown below. Note that each data file starts with two header lines, so "header": 2 is specified for every file.

{
  "schema": [
    {
      "label": "actor",
      "type": "VERTEX",
      "properties": [
        { "name": "aid", "type": "STRING" },
        { "name": "name", "type": "STRING" }
      ],
      "primary": "aid"
    },
    {
      "label": "movie",
      "type": "VERTEX",
      "properties": [
        { "name": "mid", "type": "STRING" },
        { "name": "name", "type": "STRING" },
        { "name": "year", "type": "INT16" },
        { "name": "rate", "type": "FLOAT", "optional": true }
      ],
      "primary": "mid"
    },
    {
      "label": "play_in",
      "type": "EDGE",
      "properties": [{ "name": "role", "type": "STRING", "optional": true }],
      "constraints": [["actor", "movie"]]
    }
  ],
  "files": [
    {
      "path": "actors.csv",
      "header": 2,
      "format": "CSV",
      "label": "actor",
      "columns": ["aid", "name"]
    },
    {
      "path": "movies.csv",
      "header": 2,
      "format": "CSV",
      "label": "movie",
      "columns": ["mid", "name", "year", "rate"]
    },
    {
      "path": "roles.csv",
      "header": 2,
      "format": "CSV",
      "label": "play_in",
      "SRC_ID": "actor",
      "DST_ID": "movie",
      "columns": ["SRC_ID", "role", "DST_ID"]
    }
  ]
}

Using the import configuration file, we can now import data using the following command:

# -c import.conf          read configuration information from import.conf
# --dir /data/lgraph_db   store the data in /data/lgraph_db
# --graph mygraph         import into the graph named mygraph
$ ./lgraph_import -c import.conf --dir /data/lgraph_db --graph mygraph

Notes

  • If a graph named 'mygraph' already exists, the import tool prints an error message and exits. To force the graph to be overwritten, use the '--overwrite true' option.

  • Configuration and data files must be stored in UTF-8 encoding (or plain ASCII, which is a subset of UTF-8). If any file uses another encoding (for example, UTF-8 with BOM or GBK), the import fails with a parsing error.
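The encoding requirement can be checked before an import. The following Python sketch is one possible pre-flight test; the function name is_plain_utf8 is an assumption for illustration and is not part of TuGraph.

```python
import codecs

def is_plain_utf8(data):
    """Return True if the bytes are valid UTF-8 and do not start with a BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_plain_utf8("nm015950,Stephen Chow\n".encode("utf-8")))
print(is_plain_utf8(codecs.BOM_UTF8 + b"id,name\n"))
```

To check a file, read it in binary mode and pass its contents to the function.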

5. Online incremental import

The online import mode can be used to import a batch of files into an already running TuGraph instance. This is handy for processing incremental batch updates that typically occur at fixed intervals. The '--online true' option makes the import tool work in online mode. Like offline mode, online mode has its own set of command-line options, which can be printed with the '-h' or '--help' option:

$ lgraph_import --online true -h
Available command line options:
    --online            Whether to import online.
    -h, --help          Print this help message. Default=0.

Available command line options:
    --log               Log file to use, empty means stderr. Default="".
    -v, --verbose       Verbose level to use, higher means more verbose.
                        Default=1.
    -c, --config_file   Config file path.
    -r, --url           DB REST API address.
    -u, --username      DB username.
    -p, --password      DB password.
    -i, --continue_on_error
                        When we hit a duplicate uid or missing uid, should we
                        continue or abort. Default=0.
    -g, --graph         The name of the graph to import into. Default=default.
    --skip_packages     How many packages should we skip. Default=0.
    --delimiter         Delimiter used in the CSV files
    --breakpoint_continue
                        When the transmission process is interrupted,whether
                        to re-transmit from zero package next time. Default=false
    -h, --help          Print this help message. Default=0.

The data files are described in the configuration file, whose format is exactly the same as in offline mode. However, instead of importing the data into a local database, the tool now sends the data to a running TuGraph instance, which typically runs on a different machine from the client on which the import tool runs. Therefore, we need to specify the HTTP URL of the remote machine, the DB user, and the password.

If the user and password are valid and the specified graph exists, the import tool sends the data to the server, which then parses the data and writes it to the specified graph. The data is sent in packages of about 16 MB, each cut at the nearest newline. Each package is imported atomically: if the package is imported successfully, all of its data is imported; otherwise, none of its data enters the database. If '--continue_on_error true' is specified, data integrity errors are ignored and the offending lines are skipped. Otherwise, the import stops at the first package with an error and prints the number of packages that have been imported. In this case, the user can fix the data and then rerun the import with '--skip_packages N' to skip the packages that were already imported.
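The package behaviour described above can be illustrated with a short Python sketch. It assumes that "the nearest newline" means the first newline at or after the target size; the source does not state the direction, and the function names are hypothetical rather than taken from lgraph_import.

```python
def split_into_packages(data, target_size=16 * 1024 * 1024):
    """Cut data into packages of about target_size bytes that end on a newline."""
    if target_size < 1:
        raise ValueError("target_size must be positive")
    packages = []
    start = 0
    while start < len(data):
        end = start + target_size
        if end >= len(data):
            end = len(data)
        else:
            # Assumption: extend the package to the next newline so that
            # no line is split across two packages.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        packages.append(data[start:end])
        start = end
    return packages

def remaining_after_skip(packages, skip_packages):
    """Packages still to be sent when resuming with --skip_packages N."""
    return packages[skip_packages:]

data = b"nm015950,Stephen Chow\nnm0628806,Man-Tat Ng\nnm0156444,Cecilia Cheung\n"
packages = split_into_packages(data, target_size=30)
print(len(packages), len(remaining_after_skip(packages, 1)))
```

Because every package ends on a line boundary, an import that stopped after N successful packages can resume from package N without re-sending or splitting any line.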

6. Online full import

Online full import can be used to import a batch of files into a running TuGraph instance. TuGraph supports importing two different types of data into the instance online:

  1. Original data files of the same kind as those used for offline import (CSV, etc.)

  2. TuGraph's underlying storage file, data.mdb. This file can be generated by an offline import, or it can come from another TuGraph db.

The two methods suit different scenarios. The first imports data directly into TuGraph. Its advantage is that the import is done automatically in one step and the procedure is simple. However, it starts an offline import thread on the server side, so it is only suitable for small-scale data import on a single machine. The second imports a prepared underlying storage file into TuGraph. Although the mdb file must be prepared in advance, this method places low demands on system resources. It also supports both remotely downloaded and local files, which makes it convenient and suitable for online import in high-availability mode or of large-scale data.

6.1 Import from original data

Importing from the original data works by sending an import request to the running TuGraph instance. After receiving the request, the instance first uses offline import (V3) to import the data into a temporary db, then creates a new subgraph in the instance and migrates the data files of the temporary db to the new subgraph, and finally refreshes the metadata of the instance. Compared with online incremental import, online full import has higher performance. The 'lgraph_import --online true --online_type 1' option makes the import tool perform an online full import. Like offline mode, this mode has its own set of command-line options, which can be printed with the '-h' or '--help' option:

$ lgraph_import --online true --online_type 1 -h
Available command line options:
    --online_type       The type of import online, 0 for increment, 1 for full
                        import data,2 for full import file. Default=0.
    --v3                Whether to use lgraph import V3. Default=1.
    -h, --help          Print this help message. Default=0.

Available command line options:
    --online_type       The type of import online, 0 for increment, 1 for full
                        import data,2 for full import file. Default=0.
    -h, --help          Print this help message. Default=0.

Available command line options:
    -c, --config_file   Config file path.
    -r, --url           DB REST API address.
    -u, --user          DB username.
    -p, --password      DB password.
    -i, --continue_on_error
                        When we hit a duplicate uid or missing uid, should we
                        continue or abort. Default=0.
    -g, --graph         The name of the graph to import into. Default=default.
    --delimiter         Delimiter used in the CSV files. Default=,.
    --log               Log dir to use, empty means stderr. Default="".
    -v, --verbose       Verbose level to use, higher means more verbose.
                        Default=1.
    --overwrite         Whether to overwrite the existing DB if it already
                        exists. Default=0.
    --parse_block_size  Block size per parse. Default=8388608.
    --parse_block_threads
                        How many threads to parse the data block. Default=5.
    --parse_file_threadsHow many threads to parse the files. Default=5.
    --generate_sst_threads
                        How many threads to generate sst files. Default=15.
    --read_rocksdb_threads
                        How many threads to read rocksdb in the final stage.
                        Default=15.
    --vid_num_per_reading
                        How many vertex data to read each time. Default=10000.
    --max_size_per_reading
                        Maximum size of kvs per reading. Default=33554432.
    --compact           Whether to compact. Default=0.
    --keep_vid_in_memoryWhether to keep vids in memory. Default=1.
    --enable_fulltext_index
                        Whether to enable fulltext index. Default=0.
    --fulltext_index_analyzer
                        fulltext index analyzer. Default=StandardAnalyzer.
                        Possible values: {<2>: SmartChineseAnalyzer,
                        StandardAnalyzer}
    -h, --help          Print this help message. Default=0.

The data files are described in the configuration file, whose format is exactly the same as in offline mode. However, instead of importing the data into a local database, we now import the data into a running TuGraph instance, which typically runs on a different machine from the client running the import tool. Therefore, we need to specify the HTTP URL of the remote machine, the DB user, and the password. Moreover, the configuration file (the config_file parameter) must be a path on the machine running the TuGraph instance, and the file paths inside it must be absolute paths to resources on that machine.

If the user and password are valid, the import tool performs an online full import on the server side. If the graph you want to import already exists, you can use the '--overwrite true' option to force the subgraph to be overwritten.

6.2 Import from database file

Although the online full import of original data is simple to operate and has high performance, it requires substantial server resources and takes a long time. A more general approach is to first use offline import to import the subgraph into an empty db, obtain the data.mdb file, and then import that file into the running TuGraph service. The available options are as follows:

$ ./lgraph_import --online true --online_type 2 -h
Available command line options:
    --online            Whether to import online. Default=0.
    --v3                Whether to use lgraph import V3. Default=1.
    -h, --help          Print this help message. Default=0.

Available command line options:
    --online_type       The type of import online, 0 for increment, 1 for full
                        import data,2 for full import file. Default=0.
    -h, --help          Print this help message. Default=0.

Available command line options:
    -r, --url           DB REST API address.
    -u, --user          DB username.
    -p, --password      DB password.
    -g, --graph         The name of the graph to import into. Default=default.
    --path              The path of data file.
    --remote            Whether to download file from remote server. Default=0.
    -h, --help          Print this help message. Default=0.

In addition to the url, user, and password parameters used in ordinary online import, online full import from a database file uses the graph parameter to specify the name of the imported subgraph, the path parameter to specify the file path, and the remote parameter to specify whether the file is remote or local. If it is a local file, every node in the HA cluster must have the file at that path. If it is a remote file, it is downloaded first and then imported. Note that because there is only one copy of data.mdb, the environment of every HA node must be fully consistent with that of the machine on which data.mdb was generated offline, so that no environment-related problems occur.
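The two-step workflow (offline import into an empty db, then online import of the resulting data.mdb) can be sketched as a pair of command lines built in Python. Only options listed in this document are used. The helper name, the URL, the credentials, and the paths are placeholders, and passing "true" as the value of --remote is an assumption made by analogy with '--online true'. The commands are only built here, not executed.

```python
import shlex

def full_import_from_file_commands(config_file, tmp_db_dir, mdb_path,
                                   url, user, password, graph, remote=False):
    """Build the two lgraph_import command lines for import from a database file."""
    # Step 1: offline import into an empty directory to produce data.mdb.
    offline = ["./lgraph_import", "--online", "false",
               "-c", config_file, "--dir", tmp_db_dir, "--graph", graph]
    # Step 2: send the prepared file to the running TuGraph service.
    online = ["./lgraph_import", "--online", "true", "--online_type", "2",
              "-r", url, "-u", user, "-p", password,
              "-g", graph, "--path", mdb_path]
    if remote:
        online += ["--remote", "true"]  # assumed value format
    return [offline, online]

commands = full_import_from_file_commands(
    config_file="import.conf",
    tmp_db_dir="/data/tmp_db",            # placeholder path
    mdb_path="/data/tmp_db/data.mdb",     # placeholder; the actual location may differ
    url="http://127.0.0.1:7070",          # placeholder address
    user="admin", password="secret",      # placeholder credentials
    graph="mygraph",
)
for argv in commands:
    print(shlex.join(argv))
```

Each list can be passed to subprocess.run once the placeholder values have been replaced with real ones.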