How we Improved Database Development and CI/CD with Storage Snapshot

Published in

Agoda Engineering & Design

8 min readMar 31, 2024

Database development is an integral part of software development. However, it demands considerable effort and can present challenges throughout the development and testing stages. At Agoda, we’ve adopted specific methodologies and processes to accelerate and simplify the development workflow. In this blog post, we’ll outline the strategies, insights, and implementation techniques of this process.

Navigating Database Development Challenges

Typically, developers looking to develop or test their work utilize a shared testing database. They create branches for their work, where they make the necessary changes or run tests on their branch. Once satisfied with the results, they merge these changes into the main branch. This process can be manual or automated in CI/CD flow.

On the other hand, testers run different functionality tests on their branch. However, they need to still test against a (standard) shared table or database. Each test needs a specific data set. This can be hard to run simultaneously. In such situations, testers face various challenges:

Testers find it difficult to ensure that they are testing over the expected data.
Testing may fail due to unexpected changes from others.
With Git, you can isolate your work and test for applications by working on a branch. However, it cannot be done easily for the database.
Reverting data or schema changes on the database is substantial and time-consuming.

You can create your testing area by setting up either a new database or using a docker container to deploy your package into your testing area. This method works well for smaller databases, where the effort required for updates and maintenance is minimal. However, suppose you want to test with an extensive data set. In that case, it is unlikely to create hundreds of gigabytes or even terabytes to include such an extensive data set for testing as the effort it takes to maintain, load, reload, or roll back is enormous. This can slow down development velocity.

How do we deal with this problem?

We have identified the core of the problem; the database is shared and not isolated. Our solution is to allow developers to create isolated copies of the database on-demand.

We have developed a platform called “DB Snapshot”, which provides multi-version snapshots of the source databases. The source can either be on the test database or the production database. Snapshot has a static view of schema/data when taken. We can create multiple copies from the snapshot. Then, the developers and QA engineers can enjoy their work on isolated copies. We can also test the complete data set from the production snapshot, eliminating the need for manual preparation (load, reload, and clean).

The process is straightforward: create a new copy, perform the necessary work, and discard it once the tasks are completed or merged.

Agoda DB Snapshot

Let’s demonstrate our implementation. It consists of three main parts.

Source databases which are SQL Server in this case.
A storage cluster (Ceph) stores snapshots and copies of databases.
Dedicate/isolate user database (inside the container), which are mapped with copies of databases from the storage cluster.

Ship data from the source into a storage cluster

We generate a block device image (RBD) to function as remote disks. Ceph provides copy-on-write snapshot, layering, and image cloning features to generate a multi-version snapshot and copy (clone). Snapshot is just a static view that isn’t writeable. A clone is created from the snapshot and provides the same data as the snapshot from which it is cloned. It is completely isolated from others and is writeable.

We choose shipping data/logs by restoring data from those backups. The process is;

a) Simple. It does not need advanced features and has no direct interaction with production instances.

b) It does not depend on the source base’s platform, whether SQL Server on Linux or Windows.

c) All production databases do an incremental backup and are securely stored in the file server.

We create our log shipping tool to read backup files and restore them. The tool checks the filename’s hierarchy of directories and timestamps to understand the structure of instances/databases and the sequence of files to restore. It also checks how backup files or directories are stored and how to read and restore them accurately.

We incrementally restore databases into SQL Server containers mounted with mapped RBD disks and create dedicated disk images for each database instance.

Snapshot build workflow

We create a multi-version snapshot from databases already in the storage using a combination of the snapshot, layering, and cloning. We use Apache Airflow as a scheduler and workflow controller. It helps us define workflow and configuration as a code that allows us to build a complex workflow.

The following are the workflow steps on each disk image:

Stop restoration process — is to check and stop restoration (from the log shipping tool) when we need to take a snapshot. It is necessary to check if logs restoration happens over an expected time frame and stop activities during the snapshot process because it may cause a corrupt disk block.
Create lv.1 snapshot — We take lv.1 snapshot to create a static view of data at a specific time. We create multiple snapshots on one disk image to have a multi-version of the snapshots.
Create lv.1 clone — In this step, we create a clone image from lv.1 snapshot. The cloned image has the same data as its corresponding snapshot but is readable and writeable. The clone gets plugged into a temporary SQL Server container. From here, it can run several processes against lv.1 clone. For example, database recovery (databases when a snapshot is in a cold restoring state), bringing them online, running some validation tests, resetting login passwords, etc.
Create lv.2 snapshot — We take lv.2 snapshot on the lv.1 clone prepared from the previous step. At this stage, the user is ready to copy it.
Create image for catalog — in the final step, we create a small Alpine docker image that stores only the snapshot name created in the previous step and pushes it to our internal repository. Users can explore provided snapshots by listing available images and tags instead of directly listing snapshots from the storage cluster. For example, db_instance_a:2021.01.01 is a snapshot of db_instance_a at 2021.01.01. This image is used during deployment when the user requests a copy.

Request database copy

The snapshot build workflow produces snapshots and puts them on the catalog (we expose them as images and tags), and users can choose the provided snapshots to create a copy.

Copying the database (inside the container) is dynamically spawned on Kubernetes upon the user’s request. Leveraging Helm, we create a deployment package that is easy to deploy and dynamically configurable using templating and Values files.

When deployment starts, it deploys the init container using the specified image and tags from the catalog to let deployment know which snapshot the user wants to clone first. Then, it creates a new clone image from the snapshot (Lv.2) on the fly. And it maps that new clone image into a pod that is being deployed. Lastly, it starts an app container (database container) that uses database files mapped from newly cloned images.

Users can deploy charts with overridden configuration in the Values file to spawn a new pod and SQL server container mounted with data from the chosen clone image. They can also add this as a dependency of their application chart to build end-to-end deployment for automating CI/CD or testing pipelines.

At Agoda, we have a self-service platform, “KOrca”, to help users manually request temporary namespace or deployment for development and testing in a few clicks by choosing a pre-defined package. We also provide DB snapshots via this tool to easily request database copies. For further information about KOrca and its capabilities, refer to our blog post Developing And Testing Against a Bunch of Microservices in Agoda.

KOrca provides a catalog of our database services and available snapshots exposed as images and tags. When users make a deployment, the connection endpoint of the new copy gets produced accordingly.

Our use cases

We have implemented DB Snapshot for over 100 Terabytes of testing/QA and production SQL Server databases. On average, we handle around 400 requests for creating database copies daily, with peaks reaching up to 1,200 copies in a single day.

DB Snapshot in the development life cycle

DB Snapshot has become a part of our development life cycle. Starting from development, developers can explore data or schema and make changes by requesting their copy. They can create copies along with the application or database code development branch. A copy is completely isolated exclusively for a specific branch. We can execute automated testing, and it is very convenient to run several test iterations since creating new copies is quick and easy. We can thrash them upon task completion.

Our staging databases also use snapshot/copy images for the deployment stage. Now, it does not need the effort to re-setup, reload, and roll back again. We use staging for pre-deployment testing. Every deployment package must pass validation on staging, e.g., pre-deployment check, schema migration check, and performance validation before going live. And yes, it tests against a complete set of production data.

Multi-versions of production snapshots can also be used in the recovery scenario when someone or an application updates production data incorrectly. We can quickly spawn a copy from the snapshot with the version before the incident and fix only the affected data using the snapshot instead of completely restoring it.

Conclusion

There are many more open opportunities for using database snapshots. It helps us improve the quality of testing and deployment and reduces the time needed to maintain the testing and staging environment.

This concept could be adjusted and applied to different use cases or implementations. Other Enterprise SAN or object storage also provide similar features but may have different performance and capabilities. It all depends on your requirements, the database you use, and your familiarity with tools and technologies.