by Praveen Sadhu, Vijay Parthasarathy & Aditya Jami
We talked in the past about our move to NoSQL and Cassandra has been a big part of that strategy. Cassandra hit a big milestone recently with the announcement of the v1 release. We recently announced Astyanax, Netflix’s Java Cassandra client with an improved API and connections management which we open sourced last month.
Today, we’re excited to announce another milestone on our open source journey with an addition to make operations and management of Cassandra easier and more automated.
As we embarked on making Cassandra one of our NoSQL databases in the cloud, we needed tools for managing configuration, providing reliable and automated backup/recovery, and automating token assignment within and across regions. Priam was built to meet these needs. The name ‘Priam’ refers to the king of Troy, in Greek mythology, who was the father of Cassandra.
What is Priam?
Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:
- Backup and recovery
- Bootstrapping and automated token assignment.
- Centralized configuration management
- RESTful monitoring and metrics
We are currently using Priam to manage several dozen Cassandra clusters and counting.
Backup and recovery
A dependable backup and recovery process is critical when choosing to run a database in the cloud. With Priam, a daily snapshot and incremental data for all our clusters is backed up to S3. S3 was an obvious choice for backup data due to its simple interface and ability to access any amount of data from anywhere.
Priam leverages Cassandra’s snapshot feature to have an eventually consistent backup. Cassandra flushes data to disk and hard-links all SSTable files (data files) into a snapshot directory. SSTables are immutable files and can be safely copied to an external source. Priam picks up these hard-linked files and uploads them to S3. Snapshots are run on a daily basis for the entire cluster, ideally during non-peak hours. Although snapshot across cluster is not guaranteed to produce a consistent backup of cluster, consistency is recovered upon restore by Cassandra and running repairs. Snapshots can also be triggered on demand via Priam’s REST API during upgrades and maintenance operations.
During the backup process, Priam throttles the data read from disk to avoid contention and interference with Cassandra’s disk IO as well as network traffic. Schema files are also backed up in the process.
When incremental backups are enabled in Cassandra, hard-links are created for all new SSTables in the incremental backup directory. Priam scans this directory frequently for incremental SSTable files and uploads them to S3. Incremental data along with the snapshot data are required for a complete backup.
Compression and multipart uploading
Priam uses snappy compression to compress SSTables on the fly. With S3’s multi-part upload feature, files are chunked, compressed and uploaded in parallel. Uploads also ensure the file cache is unaffected (set via: fadvise). Priam reliably handles file sizes on the order of several hundred GB in our production environment for several of our clusters.
Priam supports restoring a partial or complete ring. Although the latter is less likely in production, restoring to a full test cluster is a common use case. When restoring data from backup, the Priam process (on each node) locates snapshot files for some or all keyspaces and orchestrates the download of the snapshot, incremental backup files, and starting of the cluster. During this process, Priam strips the ring information from the backup, allowing us to restore to a cluster of half the original size (i.e., by skipping alternate nodes and running repair to regain skipped data). Restoring to a different sized cluster is possible only for the keyspaces with replication factor more than one. Priam can also restore data to clusters with different names allowing us to spin up multiple test clusters with the same data.
Restoring prod data for testing
Using production data in a test environment allows you to test on massive volumes of real data to produce realistic benchmarks. One of the goals of Priam was to automate restoration of data into a test cluster. In fact, at Netflix, we bring up test clusters on-demand by pointing them to a snapshot. This also provides a mechanism for validating production data and offline analysis. SSTables are also used directly by our ETL process.
Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.
To survive failures in the AWS environment and provide high availability, we spread our Cassandra nodes across multiple availability zones within regions. Priam’s token assignment feature uses the locality information to allocate tokens across zones in an interlaced manner.
One of the challenges with cloud environments is replacing ephemeral nodes which can get terminated without warning. With token information stored in an external datastore (SimpleDB), Priam automates the replacement of dead or misbehaving (due to hardware issues) nodes without requiring any manual intervention.
Priam also lets us add capacity to existing clusters by doubling them. Clusters are doubled by interleaving new tokens between the existing ones. Priam’s ring doubling feature does strategic replica placement to make sure that the original principle of having one replica per zone is still valid (when using a replication factor of at least 3).
We are also working closely with the Cassandra community to automate and enhance the token assignment mechanisms for Cassandra.
For multi-regional clusters, Priam allocates tokens by interlacing them between regions. Apart from allocating tokens, Priam provides a seed list across regions for Cassandra and automates security group updates for secure cross-regional communications between nodes. We use Cassandra’s inter-DC encryption mechanism to encrypt the data between the regions via public Internet. As we span across multiple AWS regions, we rely heavily on these features to bring up new Cassandra clusters within minutes.
In order to have a balanced multi-regional cluster, we place one replica in each zone, and across all regions. Priam does this by calculating tokens for each region and padding them with a constant value. For example, US-East will start with token 0 where as EU-West will start with token 0 + x. This allows us to have different size clusters in each regions depending on the usage.
When a mutation is performed on a cluster, Cassandra writes to the local nodes and forwards the write asynchronously to the other regions. By placing replicas in a particular order we can actually withstand zone failures or a region failures without most of our services knowing about them. In the below diagram A1, A2 and A3 are AWS availability zones in one region and B1, B2 and B3 are AWS availability zones in another region.
Note: Replication Factor and the number of nodes in a zone are independent settings for each region.
All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.
Priam’s REST API
One of goals of Priam was to support managing multiple Cassandra clusters. To achieve that, Priam’s REST APIs provides hooks that support external monitoring and automation scripts. They provide the ability to backup, restore a set of nodes manually and provide insights into Cassandra’s ring information. They also expose key Cassandra JMX commands such as repair and refresh.
For comprehensive listing of the APIs, please visit the github wiki here.
Key Cassandra facts at Netflix:
- 57 Cassandra clusters running on hundreds of instances are currently in production, many of which are multi-regional
- Priam backs up tens of TBs of data to S3 per day.
- Several TBs of production data is restored into our test environment every week.
- Nodes get replaced almost daily without any manual intervention
- All of our clusters use random partitioner and are well-balanced
- Priam was used to create the 288 node Cassandra benchmark cluster discussed in our earlier blog post.
- Amazon S3
- Datastax blog on backup and restore
- Netflix Cassandra benchmark
- Cassandra replication
- Priam on Github
Originally published at techblog.netflix.com on February 21, 2012.