Azure Data Factory — CosmosDB Backups

santhosh kumar
Nov 4 · 4 min read

As part of my day-to-day job, I have a requirement to look at Cosmos DB backups. For beginners: Azure Cosmos DB is a fully managed database service with turnkey global distribution and transparent multi-master replication. It provides virtually unlimited scalability and multiple APIs to access data from your applications (SQL, MongoDB, Cassandra, and Spark, to name a few). We will discuss Cosmos DB features in another post. In this post, we will look at the native backups provided by Azure and other backup solutions we can use.

Azure Cosmos DB backs up data once every four hours, and at any point in time only the last two backups are retained. If you notice corruption, stale data, or accidental deletion, you must contact Microsoft within this window to get the data back. This does not meet the requirements of critical business applications where losing data is disastrous. Fortunately, Azure provides other ways to back up Cosmos DB, and everyone running production applications on Cosmos should use them to back up data regularly and avoid surprises.

Backing up CosmosDB:

Azure provides two methods for backing up Cosmos DB.

  1. Azure Cosmos DB Change Feed.
  2. Azure Data Factory.

Change feed support in Azure Cosmos DB works by listening to an Azure Cosmos container for any changes. It then outputs the sorted list of documents that were changed in the order in which they were modified. The changes are persisted, can be processed asynchronously and incrementally, and the output can be distributed across one or more consumers for parallel processing.

Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. We chose Data Factory because the change feed does not work in our case: it stores only the last updated value of an item, irrespective of how many updates happened to it, which does not serve our purpose of restoring back to a point in time.

Concepts:

  1. Linked Service: Linked services are much like connection strings; they define the connection information Data Factory needs to connect to external resources. These resources can be Azure services or resources outside Azure.
  2. Datasets: A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read data, and a Cosmos DB dataset identifies the source collection in the same way.
  3. Integration Runtime: The Integration Runtime (IR) is the compute infrastructure Azure Data Factory uses to execute the activities in a pipeline. Azure provides a native integration runtime that is managed by Azure; as customers, we don’t need to provision or manage this infrastructure, which is analogous to serverless functions. Any resource that is publicly accessible can use it. If some of your resources are not publicly accessible, you should create your own (self-hosted) integration runtime. It can be installed on a Windows server and registered with Data Factory. Make sure the infrastructure you provision can handle the copy operations you create.
  4. Pipeline: A pipeline is a logical grouping of activities that together perform a task. For example, a pipeline could contain a set of activities that ingest and clean log data, and then kick off a Spark job on an HDInsight cluster to analyze the log data. The beauty of this is that the pipeline allows you to manage the activities as a set instead of each one individually. For example, you can deploy and schedule the pipeline, instead of the activities independently.
  5. Pipeline runs: A pipeline run in Azure Data Factory defines an instance of a pipeline execution. For example, say you have a pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the pipeline, or pipeline runs. Each pipeline run has a unique pipeline run ID.

Procedure:

We need to provide Cosmos DB credentials so the pipeline can read the data and write it either to Cosmos DB (for replication) or to a storage blob (for long-term retention). Providing them inline in a script is not good practice. Since Azure already provides Key Vault, we should leverage it to store credentials and fetch them from our script.

Azure Data Factory does not support incremental backups natively. With MongoDB, if each record has a column holding its creation timestamp, that column can be used for incremental backups. If it does not, the ObjectId of each record contains the timestamp of record creation. Here is the composition of an ObjectId.

The 12-byte ObjectId value consists of:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 5-byte random value, and
  • a 3-byte counter, starting with a random value.

So, we can leverage the first 4 bytes of the ObjectId to identify the timestamp at which the object was created.
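As an illustration, here is a small pure-Python sketch (no Azure SDK required) of turning a time window into ObjectId boundaries for such an incremental filter. The function names and the Mongo-style filter shown are my own, not taken from the scripts described below.

```python
from datetime import datetime, timedelta, timezone

def objectid_lower_bound(ts: datetime) -> str:
    """Build the smallest possible ObjectId hex string for a timestamp.

    The first 4 bytes of an ObjectId encode seconds since the Unix epoch;
    zero-padding the remaining 8 bytes gives a 24-character lower bound
    usable as a range boundary on _id.
    """
    seconds = int(ts.timestamp())
    return format(seconds, "08x") + "00" * 8

def objectid_timestamp(oid_hex: str) -> datetime:
    """Recover the creation time from the first 4 bytes of an ObjectId."""
    return datetime.fromtimestamp(int(oid_hex[:8], 16), tz=timezone.utc)

# Example: boundaries for a 4-hour incremental window ending "now".
now = datetime(2019, 11, 4, 12, 0, tzinfo=timezone.utc)
start = objectid_lower_bound(now - timedelta(hours=4))
end = objectid_lower_bound(now)
# A Mongo-style filter over that window would then look like:
# {"_id": {"$gte": ObjectId(start), "$lt": ObjectId(end)}}
```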

I have prepared Python scripts using the Azure Data Factory SDK; they are here.
There are two scripts: one for replicating a Cosmos DB database and one for copying Cosmos DB to Azure Storage. For incremental backups, the script converts the current timestamp and the timestamp from four hours ago into ObjectId strings and backs up all objects created between those two ObjectIds.

README contains detailed explanation on how to use them.
