Automated cluster management and recovery for Rocksplicator

Bo Liu | Engineering Manager, Serving Systems team

Rocksplicator at Pinterest

Rocksplicator (GitHub) was open sourced by Pinterest one and a half years ago. It was designed to solve common problems in building large-scale online stateful data systems on commodity hardware, such as RocksDB replication and request routing. Technical details about how Rocksplicator solves these problems can be found in the blog post Open-sourcing Rocksplicator, a real-time RocksDB data replicator.

Since the first Rocksplicator-powered system went into production two years ago, a small team of engineers has built nine different systems on top of Rocksplicator. They’re now running on thousands of hosts and serving tens of millions of queries per second in production. These systems are critical components powering Pinterest products. Here are a few examples:

  • Some of them collectively changed the Pinterest newsfeed from a push-based model to a pull-based model and made it responsive to user actions in real time.
  • One of the systems is deriving over 50 million online ML inferences per second to power Ads, newsfeed, related Pins and more.
  • Other systems are tracking site events for Pin deduplication and various types of ML real-time signals.

Operational burden started to be a problem

Originally we managed our clusters by manually running scripts. This was error prone and became a big burden for the team as Rocksplicator started to run on more and more hosts. We knew it was time to implement automated cluster management and recovery for Rocksplicator to replace the original script-driven method. We built an in-house cluster management prototype and also researched existing open source solutions. Eventually we decided to adopt Apache Helix, which was open sourced by LinkedIn. It was exactly what we were looking for, as Helix was designed as “a cluster management framework for partitioned and replicated distributed resources”. However, we couldn’t find much detailed information about how to apply Helix to Master-Slave, real-time replicated, storage-like systems. We had to ask questions on the Helix mailing list and dig into the Helix docs and code to find out which features were available, and then build our own logic on top of what Helix provides. This post shares our experiences and lessons learned.

Overall architecture

Figure 1. Overall Architecture

Figure 1 depicts the overall architecture of how Helix works with Rocksplicator-powered services. Zookeeper sits in the center and stores resource mappings, configurations, messages between the Helix controller and Rocksplicator, etc. All Helix components are stateless. This design is nice as it delegates critical data management to Zookeeper, which has been proven to be highly reliable. The Helix controller constantly monitors events happening in the cluster, including configuration changes and Rocksplicator hosts joining or leaving. Based on the latest status of the cluster, the Helix controller computes an ideal state of resource mappings and sends messages to Rocksplicator services through Zookeeper to gradually bring the cluster into the ideal state. Each Rocksplicator host maintains a connection to Zookeeper to let others know it is alive. In the meantime, it pulls messages from the Helix controller through Zookeeper and changes its local states based on the messages received. Rocksplicator employs the Master-Slave Helix state model for online mutable clusters and the Online-Offline Helix state model for batch-updated or cache-like clusters. This post focuses on the Master-Slave model as it’s more challenging to implement. The Helix UI and RESTful API are used to inspect cluster status and change cluster configurations.

Apply Helix to Rocksplicator

The first problem we needed to solve was that Helix is a Java framework while Rocksplicator is written in C++. Helix provides a standalone agent that can be used as a bridge between non-Java services and Helix. We didn’t take this approach because it requires careful design of lifecycle management for the service and agent processes. Instead, we embed a JVM into the Rocksplicator service process through JNI to run the Helix library. This single-process approach is simpler and more reliable than the agent-based solution.

In order to fully leverage the power of Helix, we decided to run it in FULL_AUTO mode, in which Helix decides both the location and the state of every data partition. To run Rocksplicator services efficiently and safely in FULL_AUTO mode, several Helix configurations are critical. First of all, DELAY_REBALANCE_ENABLE and DELAY_REBALANCE_TIME need to be set properly so that Helix will not move shards away from an offline host too soon, because hosts need to go offline temporarily during a deploy and moving data partitions between hosts is expensive for storage-like systems. On the other hand, we want to override this behavior for safety when the number of available replicas for a partition in the cluster is too low. Therefore we also need to set MIN_ACTIVE_REPLICAS properly. Note that all these configurations apply to partition movement only. Helix is free to change partition states (Master, Slave, etc.) immediately when a host goes offline so that the number of write failures is minimized. Another important configuration is MAX_OFFLINE_INSTANCES_ALLOWED. When the number of offline hosts is above this threshold, Helix puts the managed cluster into a maintenance state in which no partition movement is allowed. This helps us protect our data during a disaster such as a network partition or service crashes caused by a software bug.
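
The interplay of these settings can be sketched as a small decision function. This is an illustrative model in Python, not Helix’s rebalancer or API: the RebalancePolicy fields and the decide function are hypothetical names that only mirror the configurations named above, and the exact comparison operators are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RebalancePolicy:
    # Hypothetical fields mirroring the settings discussed in the text;
    # these are not Helix classes or property names.
    delay_rebalance_enabled: bool
    delay_rebalance_time_s: float
    min_active_replicas: int
    max_offline_instances_allowed: int


def decide(policy, offline_for_s, active_replicas, offline_instances):
    """Decide what to do with a partition that has a replica on an offline host.

    Returns "maintenance" (no movement allowed), "wait" (host may come back),
    or "move_now" (bring up a replacement replica elsewhere).
    """
    # Too many hosts are offline: assume a disaster and freeze all movement.
    if offline_instances > policy.max_offline_instances_allowed:
        return "maintenance"
    # Safety override: too few live replicas, so do not wait for the delay.
    if active_replicas < policy.min_active_replicas:
        return "move_now"
    # Normal case: give a restarting host time to come back.
    if policy.delay_rebalance_enabled and offline_for_s < policy.delay_rebalance_time_s:
        return "wait"
    return "move_now"
```

With a made-up policy of a 600 second delay, a minimum of two active replicas, and at most five offline hosts, a host that has been offline for 30 seconds leaves its partitions in place, unless one of them has dropped to a single live replica.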

Rocksplicator changes the states of its local data partitions purely based on the state transition messages it receives. There are two different scenarios in which an “OFFLINE -> SLAVE” message is sent. One is when a partition is being moved to a host for the first time. The other is when a Rocksplicator service process has restarted for a deploy and is coming back online. Unfortunately, Helix doesn’t provide a way to distinguish between these two scenarios. Rocksplicator could handle both cases in a unified way by first restoring a backup of the partition to the local DB and then starting to replicate updates from the Master. Obviously, the restore is unnecessary in the second scenario. To fix this, we added logic in Rocksplicator to check whether a local DB requires a rebuild when changing it from the OFFLINE to the SLAVE state. This is achieved by checking the timestamp attached to the most recent update in the local WAL and comparing the latest local sequence number with that on remote replicas.
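
A minimal sketch of such a rebuild check follows, written in Python for brevity. It is not Rocksplicator’s actual code: the function name, its parameters, and both thresholds (max_staleness_s and max_seq_lag) are assumptions made for illustration, and the post does not say which thresholds the real check uses.

```python
from typing import Optional


def needs_rebuild(
    local_latest_update_ts: Optional[float],
    local_seq: Optional[int],
    remote_seq: int,
    now: float,
    max_staleness_s: float = 3600.0,  # assumed threshold, not from the post
    max_seq_lag: int = 1_000_000,     # assumed threshold, not from the post
) -> bool:
    """Return True if the local DB should be restored from a backup
    before it starts replicating as a SLAVE."""
    # No local DB or empty WAL: this is a first-time placement.
    if local_latest_update_ts is None or local_seq is None:
        return True
    # The most recent local update is too old to catch up from the WAL.
    if now - local_latest_update_ts > max_staleness_s:
        return True
    # The local DB is too far behind the remote replicas.
    if remote_seq - local_seq > max_seq_lag:
        return True
    # Recent and close enough: resume replication without a restore.
    return False
```

A process that restarted a few seconds ago during a deploy would pass both checks and resume replication directly, while a host receiving the partition for the first time has no local DB and always rebuilds.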

Rocksplicator typically has multiple replicas for each data partition. When changing the state of a local replica, Rocksplicator needs to know the states of the other replicas in the system so that it can set up the replication topology correctly. For example, when changing a replica from the OFFLINE to the SLAVE state, Rocksplicator needs to set up its local DB to replicate updates from the Master, or from another Slave if no Master exists. The fact that Helix may change the states of multiple replicas simultaneously makes it almost impossible for Rocksplicator to reliably figure out the current states of other replicas. To solve this problem, we tried using distributed Zookeeper locks to synchronize state transitions for replicas of the same partition. Unfortunately, this alone is not enough to fully solve the problem. After a Rocksplicator process finishes a state transition and releases the lock, another process may enter a state transition handler and acquire the lock before the Helix controller exposes the state change. Because of this race, the state of a remote replica observed inside a transition handler may be either its current state or its previous state. We had to carefully design the state transition logic in Rocksplicator to be resilient to either case.
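
The upstream selection described above can be sketched as follows. This Python function is a hypothetical illustration, not Rocksplicator’s implementation: pick_upstream and the shape of the observed mapping are assumptions. Since an observed state may be one transition behind, the result should be treated as a hint that the caller re-evaluates, rather than as ground truth.

```python
from typing import Dict, Optional


def pick_upstream(self_host: str, observed: Dict[str, str]) -> Optional[str]:
    """Pick the host a new SLAVE replica should replicate from.

    `observed` maps host name to the replica state seen inside the
    transition handler, which may be current or one transition stale.
    """
    others = {h: s for h, s in observed.items() if h != self_host}
    # Prefer the Master when one is visible.
    masters = sorted(h for h, s in others.items() if s == "MASTER")
    if masters:
        return masters[0]
    # Otherwise fall back to another Slave (sorted for determinism).
    slaves = sorted(h for h, s in others.items() if s == "SLAVE")
    return slaves[0] if slaves else None
```

For example, pick_upstream("h3", {"h1": "SLAVE", "h2": "MASTER", "h3": "OFFLINE"}) selects "h2", and the same call with "h2" reported as OFFLINE falls back to "h1".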

To improve data availability, Rocksplicator distributes data replicas across different Availability Zones or Placement Groups. Helix has a nice concept called TOPOLOGY to help us implement this data distribution. One thing to note is that we needed to change our deploy system to deploy Rocksplicator services zone by zone or group by group. Without this support, we could only deploy with a very small step size. With a larger step size, some partitions could have ended up with fewer live replicas than MIN_ACTIVE_REPLICAS, which would make Helix unnecessarily bring up new replicas or could even cause data loss.
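
To see why deploying zone by zone is safe while an arbitrary batch is not, consider this small Python check. It is an illustrative sketch only: batch_is_safe, the host names, and the replica layout are made up, and the real deploy system (Teletraan) is not modeled here.

```python
from typing import Dict, Iterable, List


def batch_is_safe(
    replicas: Dict[str, List[str]],
    batch: Iterable[str],
    min_active_replicas: int,
) -> bool:
    """Return True if restarting every host in `batch` at once still leaves
    each partition with at least `min_active_replicas` live replicas."""
    down = set(batch)
    return all(
        sum(1 for host in hosts if host not in down) >= min_active_replicas
        for hosts in replicas.values()
    )


# Hypothetical layout: each partition has one replica per zone (a, b, c).
layout = {"p0": ["a1", "b1", "c1"], "p1": ["a2", "b2", "c2"]}
zone_a_ok = batch_is_safe(layout, ["a1", "a2"], 2)   # whole zone at once
cross_zone_ok = batch_is_safe(layout, ["a1", "b1"], 2)  # same size, two zones
```

When replicas of a partition are spread across zones, taking down an entire zone removes at most one replica per partition, so zone_a_ok is true. A batch of the same size that spans two zones can take out two replicas of p0, so cross_zone_ok is false.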

A partition mapping file is generated for every Rocksplicator cluster based on the external views exposed by Helix. This file is needed to route traffic to Rocksplicator services. We don’t use Helix Spectator to route traffic because we have clients built in languages other than Java and we want to reuse our internal configuration propagation system. To reduce the impact of service deploys, we exclude hosts that are being deployed or restarted from this file.
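
Generating such a file can be sketched as follows. This Python example is hypothetical: the post does not describe the actual file format, so the JSON layout, the function name build_partition_map, and the shape of the external_view input are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable


def build_partition_map(
    external_view: Dict[str, Dict[str, str]],
    deploying_hosts: Iterable[str],
) -> str:
    """Serialize a routing map from an external view.

    `external_view` maps partition name to {host: state}. Hosts that are
    being deployed or restarted are left out so clients avoid them.
    """
    excluded = set(deploying_hosts)
    mapping = {}
    for partition, host_states in sorted(external_view.items()):
        mapping[partition] = {
            host: state
            for host, state in sorted(host_states.items())
            # Only route to replicas that can serve traffic.
            if host not in excluded and state in ("MASTER", "SLAVE")
        }
    return json.dumps(mapping, sort_keys=True)


view = {
    "seg00001": {"h1": "MASTER", "h2": "SLAVE", "h3": "OFFLINE"},
    "seg00002": {"h1": "SLAVE", "h2": "MASTER"},
}
routing = json.loads(build_partition_map(view, deploying_hosts=["h2"]))
```

In this made-up view, h2 is mid-deploy and h3 is offline, so routing sends all traffic for both partitions to h1 until the file is regenerated.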

Helix managed Rocksplicator clusters at Pinterest

Helix-managed Rocksplicator services have been running in production for months now, which has greatly reduced both the operational burden on the team and the number of operational errors we make. We are in the process of migrating the rest of our Rocksplicator services onto Helix. By combining it with AWS ASG (Auto Scaling Group), we can replace a dead host in a stateful cluster automatically. To add more capacity to a cluster, we simply bump the ASG limit. We’re very happy with the decision to adopt Helix for Rocksplicator.

Acknowledgements: We’d like to thank the Helix team for open sourcing this great cluster management framework and answering our questions on the mailing list. We also want to thank Suli Xu from the CMP team at Pinterest for adding the group-by-group deploy policy to our deploy system, Teletraan.

Pinterest Engineering Blog
