Towards a robust cloud-based database backup and recovery platform

About the system that prevents data disasters from affecting Coupang’s critical online businesses

Coupang Engineering
Coupang Engineering Blog
9 min read · Jun 3, 2022


By Junzhao Zhang

This post is also available in Korean.

Data drives all important business decisions at Coupang, making it an irreplaceable resource for all our products and services. We use data at every point of the customer journey, from A/B testing to purchase funneling. Although database administration (DBA) engineers strive to provide foolproof database protection, unpredictable events may damage our data at any given moment. For instance, a minor physical disk failure may corrupt an entire dataset, or an erroneous operation may delete data permanently.

To handle such potential data disasters quickly and efficiently, the Database Engineering team developed a robust backup and recovery platform for our fellow engineers at Coupang that streamlines the entire backup process and automates steps that were previously manual.

Our platform is designed to securely prevent the loss of mission-critical data and metadata in a stress-free manner.

In this post, we walk you through why we needed a new database backup and recovery platform, how we developed it, and how it is currently used in our day-to-day operations.

Limitations and background

Before diving in, we needed to identify the weaknesses in our existing backup methods. Previously, we used a traditional approach using cron jobs with a management tool for backup operations.

When we were a small start-up, this method adequately addressed our needs. However, as the amount of data we collected grew exponentially, cron job scripts became inefficient and difficult to maintain. Various teams had permission to modify the cron jobs, all without version control or centralized management of team access.

With the explosive amount of data gathered from our user growth, we found the following issues with the existing approach:

  • Low adaptability. A backup and recovery system must be adaptable to the varying data types and sizes of different businesses. For instance, backups using object storage such as Amazon Simple Storage Service (S3) are too slow to meet the recovery requirements of time-sensitive, critical services. However, adopting other local storage or block storage options was difficult and manually conducted due to the absence of a centralized management system.
  • Low efficiency. Also due to the lack of centralization, backup maintenance was difficult and often manual. For instance, DBA engineers had to manually ensure that each database was scheduled for a daily or hourly backup, which was time-consuming. Furthermore, such dependency on manual operations left the system vulnerable to human error, often resulting in failures during recovery.
  • Poor security. Backup nodes must be recovered automatically in case of exceptional failures, such as a hardware error. The traditional operation relied on backup scripts and manual SSH logins to the database server, compromising both security and operational efficiency.

Overall, the original backup method lacked a centralized system for organizing and overseeing the cron job scripts and thus required manual interference at many stages, exposing it to human mistakes and inconsistencies.

Requirements and methods

We needed a backup platform that would not only ensure safe recovery without disrupting our online services, but also excel at serving our specific business needs. Upon review, we defined the following requirements:

  1. Database services and related online services must not be affected during backup operations.
  2. Both full backups and incremental backups are required due to the need to recover critical customer data and transaction data with proper recovery point objective (RPO) and recovery time objective (RTO).
  3. Backups may be used to build new replicas for scale-out or for adding secondaries, so the ability to quickly build new instances from backups is important.

Backup methods

Considering the above requirements, we selected the following backup methods.

  • Hot backup
    Because consistency and around-the-clock availability are critical to our business, we chose hot backup to meet the first requirement, instead of warm or cold backup. Hot backup keeps the database running during backup and is often referred to as online backup as it keeps online disruptions to a minimum.
  • Full backup with incremental backup
    Full backup, or a complete backup of the database, is the most common backup method that can be utilized when recovering databases or building replicas. Although full backup offers the fastest recovery speed, each backup costs large amounts of disk space and time. To mitigate such issues, we supplemented full backup with an incremental backup, or a partial backup of recent changes to the database.
  • Remote backup
    To ensure full use of our backup data upon recovery, we examined the methodologies related to storage locations and backup file formats. We opted for remote backup using Amazon Elastic Block Store (EBS) instead of local backup, because EBS snapshots guarantee recovery in a matter of minutes. More details about EBS follow in the next section.
  • Physical backup
    For the format of our backup data, we settled on physical backup, which is the replication of logs and data files. Unlike physical backup, logical backup allows immediate inspection and verification because it backs up data as executable and readable SQL text files. However, it has a longer recovery time and is only suitable for small datasets. Because speedy backup and recovery are imperative to our business in case of an incident, we chose physical backup even though it meant we could only verify backup data after recovery.
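The interplay between full and incremental backups can be sketched with a toy model. This is purely illustrative (key-value state rather than MySQL data files, and row deletions are omitted for brevity), not the platform's actual mechanism:

```python
def take_incremental(previous, current):
    """Record only the rows that were added or changed since `previous`,
    so an incremental backup is far smaller than a full one."""
    return {k: v for k, v in current.items() if previous.get(k) != v}

def restore(full_backup, incrementals):
    """Rebuild the latest state by replaying ordered incrementals
    on top of the last full backup."""
    state = dict(full_backup)
    for delta in incrementals:
        state.update(delta)
    return state
```

The trade-off shown here is the one described above: each incremental stores only recent changes, but restoring requires replaying every incremental since the last full backup, which is why both methods are kept.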

Backup and recovery platform development

The goals of our database platform are to integrate existing backup methods and to standardize repetitive operations efficiently and safely. We accomplished these goals by automating manual processes, providing verification mechanisms, and integrating an operations log system. To support them, we created an automated platform with a user-friendly interface.

Platform design

In this section, we share parts of the platform’s underlying architecture and discuss some of the design choices we made to increase the efficiency, security, and adaptability of our backup operations.

Coupang’s database backup platform architecture
Figure 1. Coupang’s backup platform architecture

Technical architecture

For high modularity, speed, and security, the backend of our backup platform is designed to operate on system-level programs with a small binary distribution and a low memory footprint. In addition, MetaDB stores operation configurations, schedules, and audit logs. The backend controller is integrated into the platform architecture and drives the underlying backup tool to manage backup processes. It also provides a standard RESTful API for both the web UI and programmatic access, which can be utilized for further automation.

Backup operation workflow

Our platform provides a wide range of backup options for all our business needs. For instance, when DBA engineers create scheduled tasks for physical backup, they can choose between a full and an incremental backup. They can also choose among different database machine types, each of which strategically has different speed limits and adjustable parameters to avoid online business incidents caused by a heavy workload.

After a task is created, the backup is executed as scheduled, most often during low-traffic times to minimize the impact on service. However, the high IO load may still sometimes cause stability and replication lag problems.

To ensure stability and speed, we implemented automatic optimization features. For example, based on the MySQL host volume throughput, the backup tool throttles IO utilization to 80% of the throughput observed in the production environment, ensuring backups do not saturate resources or impact online services. Other variables such as the number of threads and S3 chunk sizes are automatically adjusted to MySQL's data size. The optimization strategies can be tuned flexibly to fit various situations.
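The adjustment logic can be sketched as a small pure function. The 80% cap mirrors the rule described above; the thread and chunk-size heuristics are illustrative assumptions, not the platform's exact values:

```python
def backup_tuning(volume_throughput_mb, data_size_gb):
    """Derive throttled backup parameters from observed host statistics."""
    # Cap backup IO at 80% of the production volume's observed throughput
    # so backups never saturate the disk and impact online services.
    io_limit_mb = volume_throughput_mb * 0.8
    # More parallel threads for larger datasets, bounded for predictability.
    threads = min(16, max(2, data_size_gb // 100))
    # Larger S3 chunks for bigger datasets to reduce per-request overhead.
    chunk_mb = 64 if data_size_gb < 500 else 256
    return {"io_limit_mb": io_limit_mb, "threads": threads, "chunk_mb": chunk_mb}
```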

Storage methods

From a storage perspective, there are two methods for backup storage: EBS snapshot and S3.

For the EBS snapshot storage backup method, a backup task first adds a fresh replica to the cluster. The latest data in the cluster is then copied and synchronized to the replica. Once synchronization completes, the EBS volume is detached and preserved. The advantage of this method is that adding a replica to a cluster and achieving fast data recovery takes only a few minutes.

For the S3 storage backup method, files are encrypted and uploaded to S3 on the fly through the Unix pipeline without temp files to avoid the risk of insufficient disk space. To maximize security, the S3 bucket address and encryption keys are distinguished depending on different business VPCs or applications. Each backup node is equipped with an internal backup tool that encrypts the data using one of the many supported encryption algorithms.
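The "on the fly" streaming can be illustrated with a generic pipeline runner: each stage's stdout is fed directly into the next stage's stdin, so no temporary file ever touches the disk. The helper below is a sketch, not our actual tool; in practice the stages would be a backup streamer, an encryption command, and an S3 uploader:

```python
import subprocess

def run_pipeline(stages):
    """Chain commands so each stage streams into the next, like a Unix
    pipeline (backup | encrypt | upload), avoiding temp files entirely."""
    procs = []
    prev_stdout = None
    for cmd in stages:
        p = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # let SIGPIPE propagate between stages
        procs.append(p)
        prev_stdout = p.stdout
    out, _ = procs[-1].communicate()
    for p in procs:
        p.wait()
    return out
```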

By leveraging the flexibility of block storage available in the cloud environment, we were able to offer internal users a new backup option that not only provides effective data redundancy but also achieves a much faster recovery time than traditional backups. Internal users can make their own trade-off between their mean time to repair (MTTR) requirements and marginal costs.

Data recovery operations

The EBS snapshot backup method retains the most recent data snapshot in a detached EBS volume. During restoration, only the data written since the last backup needs to be synchronized, and it is caught up through the binlog. As shown in Figure 2, scale-out time with S3 increases linearly as data volume grows, whereas with EBS snapshot backup, scale-out time stays roughly constant regardless of data size.

Quick restoration of a replica server is critical to handling failover of online resources. Recovering data from an S3 backup is limited by data transfer bandwidth, which does not exceed 250 MB/s in the production environment. To fulfill our timeliness requirement, we chose to use EBS snapshots on our platform for critical services. Although EBS is more expensive, a replica server can be restored within minutes, a trade-off between cost and demand we are willing to make for a swift recovery of our most critical services.
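The trade-off in Figure 2 can be approximated with a simple back-of-the-envelope model. All constants here are illustrative assumptions (the EBS attach time in particular is hypothetical), not measured Coupang figures:

```python
def restore_time_minutes(data_gb, method, s3_bandwidth_mb_s=250):
    """Rough restore-time model: S3 restore scales linearly with data size,
    bounded by transfer bandwidth, while an EBS snapshot restore is
    dominated by a roughly constant attach-and-catch-up cost."""
    if method == "ebs":
        return 5.0  # attach volume + binlog catch-up, roughly constant
    # S3: transfer time dominates (1 GB = 1024 MB).
    return data_gb * 1024 / s3_bandwidth_mb_s / 60
```

The model reproduces the shape of Figure 2: doubling the data size doubles the S3 restore time but leaves the EBS restore time unchanged.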

A comparison of data restoration times between AWS S3 and EBS snapshot at Coupang
Figure 2. A comparison of restoration times between S3 and EBS snapshot

Management interface

Backup management encompasses multiple processes such as backup creation, scheduling, execution, and recovery. In addition, Coupang engineers run multiple full and incremental backups daily on hundreds of clusters. To easily manage this complex process, our backup platform has a simple and easy-to-use interface that automates repetitive steps through a few simple clicks.

Task management

Coupang backup scheduler (left), backup task detail page (right)
Figure 3. Backup scheduler (left), backup task detail page (right)

Management of backup operations becomes easy with the backup scheduler, where our users can view backup task logs, execution times, current cluster machines, and modifications made. For more details on a certain task, users can go to the backup detail page which displays the backup type, file size, and more.

Task registration

In addition to making management easier, our platform makes executing the backup and recovery process painless. Users can register a new backup task by simply designating a cluster name, backup machine, and execution time. Execution times are given using cron expressions, the same as those used in crontabs, making the transition from cron jobs to our platform easy.
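For illustration, a minimal validator for such five-field cron expressions (minute, hour, day-of-month, month, day-of-week) might look like this. This is a shallow sketch under simplifying assumptions; a production scheduler would use a full cron parser:

```python
def is_valid_cron(expr):
    """Shallow check that a schedule looks like a standard five-field
    cron expression; supports *, steps (*/15), lists (1,15), ranges (1-5)."""
    ranges = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]
    fields = expr.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, ranges):
        for part in field.split(","):
            part = part.split("/")[0]  # strip a step suffix such as */15
            if part == "*":
                continue
            bounds = part.split("-")   # a range such as 1-5 has two bounds
            if not all(b.isdigit() and lo <= int(b) <= hi for b in bounds):
                return False
    return True
```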

Similarly, users can register a new recovery task easily by creating a replica cluster with the latest backup data. Users also have the option to choose between recovery from EBS or S3.

Registering a backup task (left), creating a recovery job (right) on the Coupang data backup and recovery tool
Figure 4. Registering a backup task (left), creating a recovery job (right)

Backup data integrity validation

To verify the accuracy and effectiveness of backup sets, we periodically conduct data recovery and validation tests. Such tests add extra assurance to data reliability, but they must be conducted without impacting the production environment.

Our backup platform performs verification in the following steps:

  1. A statement-based SQL query on the source divides tables into chunks to minimize the execution time of each query, so it can run on large tables without impacting the service.
  2. A lock is taken on each chunk to ensure no updates occur while the checksum is running.
  3. MySQL functions are used on the source instance to generate a checksum per chunk.
  4. The same checksum query is sent to the replica instances to generate the checksum of each chunk on each replica.
  5. The generated checksum values of the source and replicas are compared to determine their consistency.
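The steps above can be sketched in miniature. Here `hashlib` stands in for MySQL's checksum functions, ordered (primary key, value) tuples stand in for table rows, and the per-chunk locking is omitted:

```python
import hashlib

def chunk_checksums(rows, chunk_size):
    """Split ordered (pk, value) rows into chunks and checksum each chunk,
    mirroring steps 1 and 3 above."""
    sums = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        sums.append(hashlib.md5(repr(chunk).encode()).hexdigest())
    return sums

def find_inconsistent_chunks(source_rows, replica_rows, chunk_size=2):
    """Compare per-chunk checksums (steps 4 and 5) and return the indexes
    of chunks whose source and replica checksums differ."""
    src = chunk_checksums(source_rows, chunk_size)
    rep = chunk_checksums(replica_rows, chunk_size)
    return [i for i, (a, b) in enumerate(zip(src, rep)) if a != b]
```

Chunking keeps each comparison cheap, and when a mismatch is found, only the offending chunk needs to be re-examined rather than the whole table.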

Conclusion

In this post, we outlined the shortcomings of the previous backup method, analyzed the needs of Coupang database environments, and discussed how the new platform enhanced backup and recovery efficiency and reliability.

Through the new platform, DBA engineers are liberated from manual operations such as maintaining cron jobs and bash scripts. Monitoring, alerting, and failsafes for backup processes are all handled automatically. In addition, security risks and operational incidents caused by such manual work are effectively prevented. Overall, the platform provides internal users of Coupang databases with a higher level of convenience and stability.

I would like to thank the DBA team, as the development of the backup platform would not have been possible without their continuous feedback on UI, robustness, and convenience.

If you are interested in researching and developing tools and platforms to improve our large-scale database services, check out our open positions.
