Backup and Restore Alicloud Tablestore Data Using DataX

Rohit Tiwari
SCMP — Inside the Wonton
May 28, 2020

Alicloud Tablestore (OTS) is a cost-effective and reliable NoSQL database service that can store and access large volumes of structured data in real time. The only drawback is that there is no official backup and restore solution offered with the service.

There is a case study that showcases using the Tablestore tunnel SDK for different kinds of backups, such as full, incremental, and hot backups. The case study looks promising if one has structured data in simple formats such as CSV, but complex data sets would require writing a custom data parser, which is not a feasible option.

Requirement

At South China Morning Post, our engineering team uses the Alicloud Tablestore service to store and access real-time data for its product releases. Therefore, we had to look into data backup and restore options and come up with a custom solution.

What is Datax?

DataX is an offline data synchronization tool/platform used within the Alibaba Group that enables efficient data synchronization between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.

As an intermediate transmission carrier, DataX is responsible for connecting the various data sources. When you need to support a new data source, you only need to connect it to DataX to achieve seamless data synchronization with all existing sources.

As an offline data synchronization framework, DataX uses a Framework + Plugin architecture. It abstracts data-source reads and writes into Reader/Writer plugins, which are incorporated into the overall synchronization framework.

  • Reader: the data collection module, responsible for collecting data from the data source and sending it to the Framework.
  • Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.
  • Framework: connects the Reader and Writer as the data transmission channel between the two, and handles core technical issues such as buffering, flow control, concurrency, and data conversion.

Each synchronization operation between data sources is run as a job.
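Conceptually, a job is a single JSON file that names a Reader, a Writer, and job-level settings. A minimal skeleton (plugin names and parameters here are placeholders, not a runnable config) looks like this:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 1 }
    },
    "content": [
      {
        "reader": { "name": "<readerPlugin>", "parameter": {} },
        "writer": { "name": "<writerPlugin>", "parameter": {} }
      }
    ]
  }
}
```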

Using DataX as a Backup/Restore Solution

We chose DataX for data synchronization and MongoDB as the data source for storing the OTS backups.

We chose MongoDB over other data sources based on cost, manageability, and data compatibility.

The challenging part was managing the everyday backups in MongoDB and selecting the latest backup at restore time. For this, we wrote a Node.js program that manages the MongoDB databases storing the daily backups, honoring the required retention period and deleting the older databases.

The following tasks are performed before the backup job is initialized:

  1. Delete the backup older than the retention period.
  2. Rename the last backup to the previous date, to keep a single point of restore.

Here is the code:
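The original embedded snippet is not reproduced here; below is a minimal sketch of the housekeeping logic in Node.js. The database naming scheme (`ots_backup_YYYY-MM-DD`), the retention handling, and the function names are assumptions for illustration, not the exact SCMP program. Applying the results (dropping a database via the driver's `dropDatabase()`, or "renaming" by copying collections, since MongoDB has no rename-database command) is left to the MongoDB driver.

```javascript
// Hypothetical sketch of the pre-backup housekeeping described above.
// Naming scheme and retention logic are assumptions, not the SCMP code.

const DB_PREFIX = 'ots_backup_';

// Parse the date out of a backup database name; null if it doesn't match.
function backupDate(dbName) {
  if (!dbName.startsWith(DB_PREFIX)) return null;
  const d = new Date(dbName.slice(DB_PREFIX.length));
  return Number.isNaN(d.getTime()) ? null : d;
}

// Task 1: given all database names, return the backups that are older
// than the retention period and should be deleted.
function expiredBackups(dbNames, retentionDays, now = new Date()) {
  const cutoff = new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
  return dbNames.filter((name) => {
    const d = backupDate(name);
    return d !== null && d < cutoff;
  });
}

// Task 2: the name the latest backup is renamed to before a new run,
// keeping a single dated point of restore per day.
function previousBackupName(now = new Date()) {
  const prev = new Date(now.getTime() - 24 * 60 * 60 * 1000);
  return DB_PREFIX + prev.toISOString().slice(0, 10);
}
```

In a real run, the database names would come from the driver's `listDatabases` admin command, and the expired names would be dropped one by one.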

To begin with, we decided to run the daily backup as a Kubernetes job, and for that, we created a Docker image containing the DataX binary and the Node.js program to make a single embedded solution.
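A Dockerfile for such an image might look like the following sketch; the base image, paths, and entrypoint script are assumptions, not the actual scmp/alicloud-datax build. DataX itself is a Java tool launched via a Python wrapper, hence the extra runtimes:

```dockerfile
# Hypothetical image bundling DataX and the Node.js housekeeping program.
FROM openjdk:8-jre-slim

# Node.js runs the MongoDB housekeeping; Python launches DataX (bin/datax.py).
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs npm python \
    && rm -rf /var/lib/apt/lists/*

# Unpacked DataX release
COPY datax/ /opt/datax/
# Node.js housekeeping program that rotates/deletes backup databases
COPY housekeeping/ /opt/housekeeping/
RUN cd /opt/housekeeping && npm install --production

# Wrapper that runs housekeeping first, then the DataX job
COPY run-backup.sh /usr/local/bin/run-backup.sh
ENTRYPOINT ["/usr/local/bin/run-backup.sh"]
```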

Next, we had to create the data-source reader and writer configuration files.

The plugins used are:

  • otsreader and mongodbwriter for the backup job
  • mongodbreader and otswriter for the restore job

With the help of the documentation provided under the GitHub alibaba/DataX project, the plugin configuration was quite simple.

Here are the sample plugin configurations:
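As an illustration, a backup job pairing `otsreader` with `mongodbwriter` could look roughly like the following; the parameter names follow the plugin READMEs in the DataX repository, while every value is a placeholder:

```json
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [
      {
        "reader": {
          "name": "otsreader",
          "parameter": {
            "endpoint": "https://<instance>.<region>.ots.aliyuncs.com",
            "accessId": "<ACCESS_ID>",
            "accessKey": "<ACCESS_KEY>",
            "instanceName": "<INSTANCE>",
            "table": "<TABLE>",
            "column": [{ "name": "pk" }, { "name": "attr1" }],
            "range": {
              "begin": [{ "type": "INF_MIN" }],
              "end": [{ "type": "INF_MAX" }]
            }
          }
        },
        "writer": {
          "name": "mongodbwriter",
          "parameter": {
            "address": ["<mongo-host>:27017"],
            "userName": "<USER>",
            "userPassword": "<PASSWORD>",
            "dbName": "ots_backup_latest",
            "collectionName": "<TABLE>",
            "column": [
              { "name": "pk", "type": "string" },
              { "name": "attr1", "type": "string" }
            ]
          }
        }
      }
    ]
  }
}
```

The restore job is the mirror image: `mongodbreader` on the reader side and `otswriter` on the writer side.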

Note: The configuration part is simple but important and must be handled carefully, as a misconfigured job can modify or update data at the source.

Once the plugin configuration is done, we just need to create Kubernetes manifest files for running backup and restore on the Kubernetes cluster as K8s Jobs.
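As a sketch, the daily backup could be scheduled as a Kubernetes CronJob along these lines; the image, schedule, secret, and ConfigMap names are placeholders, and the actual manifests in the repository may differ:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: ots-backup
spec:
  schedule: "0 2 * * *"              # run daily at 02:00
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: datax-backup
              image: registry.example.com/alicloud-datax:latest
              envFrom:
                - secretRef:
                    name: ots-backup-credentials   # OTS and MongoDB credentials
              volumeMounts:
                - name: job-config
                  mountPath: /opt/datax/job        # DataX reader/writer JSON
          volumes:
            - name: job-config
              configMap:
                name: datax-job-config
```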

The good thing is that we have already created the K8s manifests under the GitHub project scmp/alicloud-datax.

The repository contains everything required to set up the Backup and Restore job.

You can refer to the README.md for more details.

Finally, set up CI to handle the deployment of the backup and restore jobs on K8s. (Not included in this demo.)

Go ahead, give it a try…

In simple steps:

  • Clone the repository
  • Build your own Docker image
  • Create the data-source configuration files
  • Update the K8s manifests
  • Set up CI for deployment

I hope you find this solution useful.

Have a Good day!!!
