Improving Database Development, CI/CD with Storage Snapshot

Published in

Agoda Engineering & Design

8 min readFeb 6, 2022

Database (DB) Development is a mandatory part of software development. However, DB development consumes a lot of effort and can cause difficulties during the development or testing stage. Agoda has put a methodology and processes to make development faster and easier. In this blog, I will share our strategies, experiences and techniques of implementation.

Developers face multiple problems with database development.

In a typical scenario, developers who what to develop or test their work use a shared testing database. Developers create branches for their work, and each developer makes required changes or runs tests on their branch. Once satisfied with the results, the developers merge the changes back to the main branch. This process can be manual or automated in CI/CD flow.

On the other hand, testers run different functionality tests on their branch. However, they need to still test against a (standard) shared table or database. Each test needs a specific data set. This can be hard to run simultaneously. In such situations, testers face various challenges.

Testers find it difficult to ensure if they are testing over the expected data.
Testing may fail due to unexpected changes from others
With Git, you can isolate your work and testing for application by working on a branch. But cannot do that easily for the database
Reverting data or schema changes on the database is substantial and time-consuming

You can create your testing area by setting up either a new database or using a docker container to deploy your package into your testing area. This is feasible when the database is still small and there is little effort to update and maintain the testing area. However, suppose you want to test with big data set. In that case, it is unlikely to create hundreds of gigabytes or even terabytes to include such an extensive data set for testing as the effort it takes to maintain, load, reload, or rollback is enormous. We are still to slow development velocity down.

How do we deal with this problem?

At Agoda, we get to the core of the problem; the database is shared and not isolated. We think of allowing developers to create isolated copies of the database on-demand.

We have implemented a platform we call “DB Snapshot”, which provides multi-version snapshots of the source databases. The source can either be on the test database or the production database. Snapshot has a static view of schema/data when taken. We can create multiple copies from the snapshot. Then, the developers and QA engineers can enjoy their work on isolated copies. We can also test over complete data set from production snapshot without bothering to prepare manually (load, reload and clean).

It is as simple as creating a new copy; do the work on that copy and throw it away when complete or merged.

Agoda DB Snapshot

Let me demonstrate our implementation. It consists of three main parts.

Source databases which are SQL Server in this case.
Storage cluster (Ceph) for storing snapshots and copies of databases
Dedicate/isolate user database (inside the container), which are mapped with copies of databases from the storage cluster

Ship data from the source into a storage cluster

We create a block device image (RBD) and use them as remote disks. Ceph provides copy-on-write snapshot, layering, and image cloning features to generate a multi-version snapshot and copy (clone). Snapshot is just a static view that isn’t writeable. Clone is created from the snapshot and provides the same data as the snapshot, from which it is cloned. It is completely isolated from others and is writeable.

We choose shipping data/logs by restoring data from those backups. The process is

a) Simple. It does not need advanced features and has no direct interaction with production instances.

b) It does not depend on the source base’s platform, whether SQL Server on Linux or Windows.

c) All production databases do an incremental backup and are securely stored in the file server.

We create our log shipping tool to read backup files and restore them. The tool checks hierarchy of directories and timestamps in the filename to figure out the structure of instances/databases and the sequence of files to restore. It also checks how backup files or directories are stored and how to read and restore them accurately.

We incrementally restore databases into SQL Server containers mounted with mapped RBD disks and create dedicated disk images for each database instance. That is how we ship databases into the storage cluster at Agoda.

Snapshot build workflow

We create a multi-version snapshot from databases already in the storage using a combination of the snapshot, layering, and cloning. As a scheduler and workflow controller, we use Apache Airflow. It helps us define workflow and configuration as a code that allows us to build a complex workflow.

Followings are steps of workflow on each disk image

Stop restoration process — is to check and stop restoration (from the log shipping tool) when we need to take a snapshot. It is necessary to check if logs restoration happens over an expected time frame; stop activities during the snapshot process because it may cause a corrupt disk block.
Create lv.1 snapshot — here, we take lv.1 snapshot to create a static view of data at a specific time. We create multiple snapshots on one disk image to have a multi-version of the snapshots.
Create lv.1 clone — in this step, we create a clone image from lv.1 snapshot. The cloned image has the same data as its corresponding snapshot, but it is readable and writeable. The clone gets plugged into a temporary SQL Server container. From here, it can run several processes against lv.1 clone. For example, database recovery (databases when a snapshot is in cold restoring state), bringing them online, running some validation tests, resetting login’s password, etc.
Create lv.2 snapshot — here, we take Lv.2 snapshot on lv.1 clone prepared from the previous step. At this stage, it is ready to get copied by the user.
Create image for catalog — in the final step, we create a small Alpine docker image that stores only the snapshot name created in the previous step and pushes it to our internal repository. Users can explore provided snapshots by listing available images and tags instead of directly listing snapshots from the storage cluster. For example, db_instance_a:2021.01.01 is snapshot of db_instance_a at 2021.01.01. This image is used during deployment when the user requests a copy

Request database copy

Snapshot build workflow produces snapshots, puts them on the catalogue (we expose them as image and tag), and users can choose the provided snapshots to create a copy.

The process of coping database (inside the container) is dynamically spawned on Kubernetes upon the user’s request. With Helm, we create a deployment package that is easy to deploy and dynamically configurable using templating and Values file.

When deployment starts, it deploys the init container using the specified image and tag from the catalogue to let deployment knows which snapshot the user wants to clone first. Then, it creates a new clone image from the snapshot (Lv.2) on the fly. Then it maps that new clone image into a pod that is getting deployed. Lastly, it starts an app container (database container) that uses database files mapped from newly cloned images.

Users can deploy charts with overridden configuration in the Values file to spawn a new pod and SQL server container mounted with data from the chosen clone image. They can also add this as a dependency of their application chart to build end-to-end deployment for automating CI/CD or testing pipelines.

At Agoda, we have a self-service platform, “KOrca”, to help users manually request temporary namespace or deployment for development and testing in a few clicks by choosing a pre-defined package. We also provide DB snapshots via this tool to easily request database copies. Check Developing And Testing Against a Bunch of Microservices in Agoda blog for more detail about KOrca.

KOrca provides a catalogue of our database services and available snapshots exposed as image and tags. When users make a deployment, the connection endpoint of the new copy gets produced accordingly.

Our use cases

We have implemented DB Snapshot for over 100 Terabytes of testing/QA and production SQL Server databases. The number of requests for creating copies each day we have served is 400 copies on an average and even went to a maximum of 1,200 copies!

DB Snapshot in the development life cycle

DB Snapshot has become a part of the development life cycle. Starting from development, developers can explore data or schema and make changes by requesting their copy. They can create copies along with the application or database code development branch. A copy is completely isolated exclusively for a specific branch. We can execute automated testing, and it is very convenient to run several test iterations since creating new copies is quick and easy and, we can thrash them upon task completion.

Our staging databases also use snapshot/copy images for the deployment stage. Now it does not need the effort to re-setup, reload, and roll back again. We use staging for pre-deployment testing. Every deployment package needs to pass validations on staging, e.g., pre-deployment check, schema migration check, performance validation before going live. And yes, it tests against a complete set of production data.

Multi-versions of production snapshots can also be used in the recovery scenario when someone or an application updates production data incorrectly. We can quickly spawn a copy from the snapshot with the version before the incident and fix only the affected data using the snapshot instead of complete restoration.

In conclusion

There are many more open opportunities for using database snapshots. It helps us improve the quality of testing and deployment; reduce time to maintain testing and staging environment.

This concept could be adjusted and applied to different use cases or implementations. Other Enterprise SAN or object storage also provides similar features but maybe with different performance and capabilities. It all depends on your requirements, the database you use, and your familiarity with tools and technologies.

Not just DB Snapshot, but we have multiple platforms and ideas to make development and operations better and easier. If you are interested in joining our team, check out Agoda’s Careers Page.