ScyllaDB in Practice at SmartNews

SmartNews, Inc · Published in SmartNews · Mar 9, 2023

As the world’s leading news aggregator, SmartNews handles a large volume of online requests and analyzes huge amounts of data every day, which creates a large demand for NoSQL databases. Since SmartNews’ infrastructure is built mainly on AWS, DynamoDB became our initial choice of NoSQL database. As our business grew, the volume of data and requests to DynamoDB grew quickly, and costs rose just as fast. At the same time, we found that DynamoDB did not serve many of our scenarios well: under provisioned mode, the actual request volume consistently fell short of our reserved capacity, wasting resources, and the schemaless data format was not conducive to building our data lineage and analysis. We therefore set out to find a more cost-effective, multi-model NoSQL database, and ultimately chose ScyllaDB.

The main reasons for choosing ScyllaDB are as follows.

  1. Simple Architecture. Scylla is a rewrite of Cassandra by the original developers of the KVM hypervisor. It uses the same peer-to-peer architecture, with every node running an instance of the same process, which makes it easy to deploy, operate, and maintain.
  2. Excellent Performance. ScyllaDB is developed in C++20 on top of the Seastar framework, which is tailored for modern multi-core processors. Following a shared-nothing design, each CPU core is solely responsible for its own part of the data set and communicates with the others only through efficient lock-free queues, yielding high concurrency and asynchrony. ScyllaDB also provides user-space CPU and I/O schedulers to keep the system running stably and efficiently.
  3. Good Compatibility. Both CQL and the DynamoDB API are supported, so users can choose whichever access method fits their situation. DynamoDB API compatibility lets existing DynamoDB users migrate at minimal cost (see the sketch after this list).
  4. Active & Mature Community. ScyllaDB has reached 8.9K stars on GitHub, and the community is active, with a dedicated Slack channel and mailing list that handle user feedback in a timely manner.
  5. Low Cost. Our internal benchmark showed that ScyllaDB can reduce cost by at least 50% compared to DynamoDB while meeting the same functional and performance requirements.
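
As a quick illustration of item 3’s DynamoDB API compatibility (Scylla’s Alternator interface), the sketch below points the standard boto3 DynamoDB client at a Scylla endpoint instead of AWS. The endpoint URL, table name, and key schema are placeholders, and the table is assumed to already exist with a (user_id, feature) key.

```python
import boto3

# Point the standard DynamoDB SDK at Scylla's DynamoDB-compatible (Alternator) endpoint
# instead of AWS. The endpoint below is a hypothetical internal address.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.internal.example:8000",  # placeholder
    region_name="us-east-1",      # required by the SDK, not used for routing here
    aws_access_key_id="dummy",    # assumes Alternator is running without IAM-style auth
    aws_secret_access_key="dummy",
)

# Assumes a table created with hash key "user_id" and range key "feature".
table = dynamodb.Table("user_features")  # hypothetical table name
table.put_item(Item={"user_id": "u123", "feature": "clicks_7d", "value": 42})
resp = table.get_item(Key={"user_id": "u123", "feature": "clicks_7d"})
print(resp.get("Item"))
```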

Due to limited space, the selection details are not elaborated in this article. Instead, this article focuses on how we deploy ScyllaDB at SmartNews, the technical difficulties we encountered, and the corresponding solutions.

K8S Deployment Solution

On EC2 or On K8S?

SmartNews chose the on-K8S deployment method after weighing maintainability against performance.

  1. The Scylla community provides a production-grade Scylla CRD and Scylla Operator, and SmartNews has already built a mature K8S platform for deploying internal applications. We only need to deploy the ScyllaDB-related K8S components once to make Scylla clusters easy to manage:
    a. Scylla CRD + Scylla Operator: manages changes to Scylla clusters.
    b. NodeGroup: provides the nodes that run Scylla Pods, currently mainly AWS i3-series instances.
    c. local-volume-provisioner: creates local-disk-based PVs.
    When we need to operate a Scylla cluster, we simply write the corresponding Scylla CR and submit it to K8S; the Scylla Operator then carries out the whole change, which greatly simplifies operation and maintenance (a minimal CR-submission sketch follows this list).
  2. There is no significant difference in performance. YCSB benchmark results show no significant performance gap between the on-EC2 and on-K8S deployments. The on-K8S deployment architecture is as follows.
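
As a concrete illustration of item 1, the sketch below submits a minimal ScyllaCluster CR through the Kubernetes Python client. The resource group/version/plural follow the upstream Scylla Operator CRD, but the name, namespace, Scylla version, and spec values are placeholders rather than our production configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # inside a cluster, use config.load_incluster_config()

# A deliberately minimal ScyllaCluster manifest; real specs carry many more fields
# (agent version, placement, sysctls, etc.). All names and sizes are illustrative.
scylla_cluster = {
    "apiVersion": "scylla.scylladb.com/v1",
    "kind": "ScyllaCluster",
    "metadata": {"name": "feature-store", "namespace": "scylla"},
    "spec": {
        "version": "5.1.0",
        "datacenter": {
            "name": "dc1",
            "racks": [
                {
                    "name": "rack-a",
                    "members": 3,
                    "storage": {"capacity": "1Ti", "storageClassName": "local-raid-disks"},
                    "resources": {
                        "requests": {"cpu": "7", "memory": "56Gi"},
                        "limits": {"cpu": "7", "memory": "56Gi"},
                    },
                }
            ],
        },
    },
}

# The Scylla Operator watches this CR and reconciles the cluster to match it.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="scylla.scylladb.com",
    version="v1",
    namespace="scylla",
    plural="scyllaclusters",
    body=scylla_cluster,
)
```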

Direct Network Connection Based on Pod IPs

In the native Scylla-on-K8S solution, client traffic reaches Scylla nodes through additional LoadBalancers, which brings two problems:

  1. Increased cost. The additional LoadBalancers add to our expenses.
  2. Reduced performance. The Scylla client SDK has built-in load balancing: using the cluster topology, it can spread requests evenly across the nodes, and even the CPU cores, that own the data shards. A LoadBalancer breaks this client-side load balancing and increases the network overhead between Scylla nodes.

To address these issues, we have modified the native Scylla on K8S solution.

  1. We modified the Scylla Operator so that Scylla nodes serve traffic directly on their Pod IPs.
  2. Using AWS networking capabilities, the Pod IP ranges of the different K8S clusters are connected through VPC Peering, so that Pods in different K8S clusters can reach each other directly via Pod IPs.
Fig. 2 Enable Client Load Balancing with Pod IP Connectivity

These modifications avoid the LoadBalancer overhead and preserve the effectiveness and high performance of client-side load balancing.
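
For reference, here is how the client side can take advantage of that direct connectivity with the stock Python driver (ScyllaDB also publishes a shard-aware fork, scylla-driver, that extends token awareness down to the CPU core). The contact points, DC name, keyspace, and table below are placeholders.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# With direct Pod IP connectivity, the driver sees the real node addresses and
# topology, so token-aware routing works end to end.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1"))
)
cluster = Cluster(
    contact_points=["10.0.1.11", "10.0.2.12"],  # Pod IPs of Scylla nodes (illustrative)
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("feature_store")  # hypothetical keyspace

# Queries are routed to replicas owning the partition, skipping extra network hops.
row = session.execute(
    "SELECT value FROM user_features WHERE user_id=%s AND feature=%s",
    ("u123", "clicks_7d"),
).one()
```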

Customization of the Cluster Autoscaler

For performance reasons, ScyllaDB uses local SSDs as its storage media. Since the Scylla CR requests storage by specifying a StorageClass of the local PV type, we must deploy a local-volume-provisioner (DaemonSet) to create local PVs so that each Pod can be bound to one.

Ideally, we create a ScyllaDB cluster with the following steps.

Fig. 3 Steps to create a Scylla Cluster on K8S

In practice, at step 3 the CA does not scale out a new Scylla node, because steps 3 and 4 form a deadlock: the CA needs the local PV to exist before it will generate a scale-out plan, while the local-volume-provisioner needs the new Scylla node to exist before it can create the local PV. Each side is blocked by the other.

Fig. 4 Blocking Issue of Scaling out Scylla Node

The following is the standard procedure for K8S CA to trigger a scale out.

Fig 5. Node Scaling Process by CA

Reviewing the CA code, we found that before generating a node-scaling plan, the CA uses the VolumeBinding plugin to check whether any node can satisfy the Pod’s storage requirements. Since the local-PV storage requested by the Scylla Pod does not exist yet at that point, the CA never generates a scaling plan for it.

Fig. 6 Code Snippet of VolumeBinding Checking in K8s CA

For this reason, we modified the CA code to bypass the VolumeBinding check for Scylla Pods and ensure that the deployment process can proceed smoothly.

CPU Affinity

Scylla’s excellent performance is due to its ability to bind resident worker threads (reactors) to CPU cores, thus avoiding frequent context switches. Ideally, each worker thread is fixed to a single CPU core to continuously process its own set of tasks and datasets.

Fig. 7 Code Snippet of CPU Binding in Scylla

To support CPU affinity on K8S, we upgraded the K8S version and enabled the node’s static CPU management policy, so the on-K8S deployment achieves almost the same high performance as the on-EC2 deployment. Below is the 1:1 mapping between reactors and CPU cores on a Scylla node.

Fig. 8 Shard Per CPU Core
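
A quick way to spot-check that pinning, assuming you can exec into the Scylla Pod, is to list each thread’s allowed CPU set from /proc. The script below is a generic Linux check, not anything Scylla-specific:

```python
import os
import sys

# Usage: python check_affinity.py <scylla-pid>
# With the static CPU policy and correct pinning, each reactor thread should report
# a single CPU in Cpus_allowed_list.
pid = sys.argv[1]

for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
    with open(f"/proc/{pid}/task/{tid}/comm") as f:
        name = f.read().strip()  # thread name
    with open(f"/proc/{pid}/task/{tid}/status") as f:
        allowed = next(line.split(":")[1].strip()
                       for line in f if line.startswith("Cpus_allowed_list"))
    print(f"{tid}\t{name}\t{allowed}")
```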

Monitoring & Alerting

SmartNews has built its own distributed monitoring and alerting platform on the open-source Prometheus/Grafana/Loki ecosystem. ScyllaDB is not only compatible with Prometheus metrics collection but also provides a complete set of Grafana dashboards that display multi-dimensional metrics. The following is an extract from the ScyllaDB monitoring pages inside SmartNews.

Fig. 9 Scylla Detailed Dashboard
Fig. 10 Scylla Overview Dashboard

We also customized the SLO monitoring dashboard to track the cluster availability metrics in real time.

Fig. 11 Scylla SLO Dashboard
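
Dashboards aside, the same availability numbers can be pulled programmatically from Prometheus’ HTTP API, for example to feed alerts or reports. The PromQL expression and metric names below are placeholders, not the actual Scylla metric names used on our dashboards:

```python
import requests

# Query Prometheus for a simple availability-style ratio over the last 5 minutes.
PROM_URL = "http://prometheus.internal.example:9090/api/v1/query"  # hypothetical address
query = (
    "sum(rate(scylla_requests_ok_total[5m]))"      # placeholder metric name
    " / sum(rate(scylla_requests_total[5m]))"      # placeholder metric name
)

resp = requests.get(PROM_URL, params={"query": query}, timeout=5)
resp.raise_for_status()
result = resp.json()["data"]["result"]
availability = float(result[0]["value"][1]) if result else None
print(f"5m availability estimate: {availability}")
```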

Chaos Project

SmartNews has built its own Chaos Mesh platform, which we use to run drills for various failure scenarios.

Fig. 12 Chaos Mesh Portal

The following is an example of random Scylla node failures, for reference. Through monitoring, we can see how Scylla’s system load, connection count, read/write throughput, and latency trend while nodes are down. The results show that in the on-K8S setup, at traffic levels between 5K/s and 20K/s, the system keeps serving normally with a small number of failed nodes, and performance usually returns to normal within 6 minutes.

Fig. 13 Monitoring of Random Pod Kill Scenario
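
A pod-kill drill of this kind can be expressed as a Chaos Mesh PodChaos object. The sketch below submits one through the Kubernetes Python client; the namespace, experiment name, and label selector are illustrative, and our internal portal generates equivalent objects:

```python
from kubernetes import client, config

config.load_kube_config()

# Kill one randomly selected Scylla Pod.
pod_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "PodChaos",
    "metadata": {"name": "scylla-random-pod-kill", "namespace": "chaos-testing"},
    "spec": {
        "action": "pod-kill",
        "mode": "one",  # pick one matching Pod at random
        "selector": {
            "namespaces": ["scylla"],
            "labelSelectors": {"app.kubernetes.io/name": "scylla"},  # illustrative label
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="chaos-testing",
    plural="podchaos",
    body=pod_chaos,
)
```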

In addition to node failures, we also run drills for scenarios such as disk failures and network partitioning, which will be covered in more detail in a future blog post.

Upgrading to a Highly Available Deployment Across K8S Clusters

Availability Issues for Single K8S Deployments

Initially, we deployed Scylla in a single K8S cluster, so whenever that cluster became unavailable, the Scylla service was affected as well. More importantly, since Scylla uses local disks, if the K8S cluster stays unavailable beyond a certain length of time, the Scylla nodes are likely to be reclaimed by the CA, resulting in data loss.

The K8S clusters SmartNews uses are AWS EKS clusters, and as the AWS website states, EKS only provides 99.95% availability, which allows roughly 4.4 hours of downtime per year, versus about 53 minutes at 99.99%. As a storage product, we are committed to providing >=99.99% availability, so a single-K8S-cluster deployment is clearly not enough to meet that requirement. To improve availability, we upgraded our deployment to span multiple K8S clusters.

Fig. 14 SLA of AWS EKS

Deploying across K8S Clusters with Multi-DC Features

Scylla’s DCs are a logical concept: multiple Scylla clusters can be combined into one large cluster spanning multiple logical DCs, as long as the cluster names are identical, the DC names are unique, and network connectivity and application-level awareness between the clusters are guaranteed. By placing these logical DCs in different K8S clusters, we can deploy across K8S clusters (and the same approach can be used to deploy across physical DCs when needed).

Fig. 15 Cross K8S Scylla Cluster

The specific deployment process is as follows.

  1. Deploy Scylla clusters in different K8S clusters with the same cluster name and configuration, but with different DC names.
  2. When deploying each Scylla cluster, set its seed nodes to node IPs of the existing cluster, so that the DCs become aware of each other at the application level.
  3. Create keyspaces (analogous to databases in MySQL) according to business requirements, and configure the number of replicas per DC at the keyspace level to ensure high availability at the data storage level (see the sketch after this list).
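
A minimal sketch of step 3, assuming a keyspace named feature_store and two logical DCs named dc1 and dc2 (all illustrative; the DC names must match what nodetool status reports):

```python
from cassandra.cluster import Cluster

# Create a keyspace whose data is replicated into both logical DCs (one per K8S cluster).
cluster = Cluster(contact_points=["10.0.1.11"])  # any reachable Scylla node (illustrative)
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS feature_store "
    "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}"
)
```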

After the cluster is created, you can check its topology with Scylla’s built-in nodetool status command. The following shows the topology of a multi-DC cluster whose DCs are distributed across multiple K8S clusters.

Fig. 16 Nodetool Status Result

Load Balancing Policy across K8S Clusters

Changing the server-side deployment architecture alone is not enough to achieve high availability across K8S clusters, because:

  1. Automatic failover is not possible. The ScyllaDB SDK requires a local DC to be specified; if the current local DC fails, the client configuration has to be changed to point at another healthy DC and the client application redeployed. The whole procedure is complicated and involves periods of service unavailability.
  2. Cost issues. The stock SDK only supports load balancing within a single DC, which leaves the other DCs of the cluster in cold standby, with resources sitting idle and wasted.

For this reason, we modified the client SDK to support routing and distributing traffic across multiple DCs. The client can be configured with a set of DCs and load-balances traffic across them; it also monitors node status in real time and removes unhealthy nodes from the load-balancing list, achieving automatic failover.

Fig. 17 Load Balancing Strategy for Cross K8s Deployment
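
Our changes live in our internal SDK, but a rough approximation can be built from stock Python driver policies: filter hosts to an allowed DC set, balance across them, and let the driver’s own host-state tracking stop routing to nodes it marks as down. DC names and contact points below are illustrative:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import HostFilterPolicy, RoundRobinPolicy, TokenAwarePolicy

ALLOWED_DCS = {"dc1", "dc2"}  # the DC set this client is allowed to use (example)

policy = TokenAwarePolicy(
    HostFilterPolicy(
        RoundRobinPolicy(),                            # balance across all allowed hosts
        lambda host: host.datacenter in ALLOWED_DCS,   # drop hosts outside the DC set
    )
)
profile = ExecutionProfile(load_balancing_policy=policy)
cluster = Cluster(
    contact_points=["10.0.1.11", "10.1.1.11"],  # one node per DC (illustrative)
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("feature_store")
```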

The following shows cross-cluster load balancing in action: initially the two DCs each take 50% of the online traffic; when DC1 fails at 15:02, its traffic is automatically switched to DC2; when DC1 recovers at 15:07, traffic is again shared equally between the two DCs.

Fig. 18 Load Balancing and Failover between 2 Logical DCs

Note that the load-balancing scheme above is mainly for multiple logical DCs deployed in the same physical DC. If the requirement is to deploy across physical DCs, we can avoid cross-physical-DC network overhead and high latency by restricting each client application to the Scylla DCs located in its own physical DC.

Resource Isolation

Different workloads place different demands on a Scylla cluster: some emphasize high throughput, others low latency; some are read-heavy, others write-heavy; some require more replicas because data loss is unacceptable, while others can reduce the replica count for performance; some need strong consistency, others less so. If all of these workloads run on the same machines, they interfere with one another.

For example, news and advertising are SmartNews’ two main businesses, and both news recommendation and ad placement rely on feature data to drive their algorithm strategies. We have run into cases where the large volume of feature data written by the news business severely affected the read latency of the advertising business’s feature data: even though the two datasets do not overlap, they still compete fiercely for compute and storage resources.

Through the logical-DC-based deployment, we can assign different businesses to different logical DCs so that each accesses its own sub-cluster, achieving resource isolation. As shown in the figure below, we can isolate the resources of critical businesses in separate logical DCs while letting non-critical businesses share resources, which also improves the cluster’s availability (a keyspace-level sketch follows the figure).

Fig. 19 Resource Isolation for Ads and News on Scylla
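
At the keyspace level, this isolation simply means replicating each business’s keyspaces only into its own logical DC(s). The keyspace and DC names below are illustrative; each business’s clients would additionally restrict their DC set as shown earlier:

```python
from cassandra.cluster import Cluster

# Ads and news keyspaces land on disjoint logical DCs, i.e. disjoint sub-clusters.
session = Cluster(contact_points=["10.0.1.11"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS ads_features "
    "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc-ads': 3}"
)
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS news_features "
    "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc-news': 3}"
)
```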

Data Backup and Service Degradation

Since local disks have much higher I/O performance than cloud disks, we prefer local-disk Scylla clusters for online services; the disadvantage of local disks is that their data is lost if the node fails.

We solve the data backup problem by placing the Scylla nodes that use cloud disks (EBS) in a separate logical DC. Normally, these nodes do not serve external queries; they only receive the replicated writes in real time. Because data is written append-only, EBS can meet our write needs in most scenarios. If a major incident makes nodes unavailable, the data on the cloud-disk nodes is still there: once the failure is mitigated, the cloud-disk nodes can start quickly and provide a degraded service externally, while the local-disk nodes recover their data from the cloud-disk nodes and resume online service after the data is fully restored (a sketch of this replication setup follows).
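
One way to express this pattern with standard Scylla features, reusing the illustrative keyspace, table, and DC names from earlier, is to add the EBS-backed DC to the keyspace’s replication map while clients write at LOCAL_QUORUM: writes are acknowledged by the local-disk DC but still fan out to ebs-dc3, and reads stay off ebs-dc3 thanks to the client-side DC filter shown above.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(contact_points=["10.0.1.11"]).connect()

# Replicate into the two local-disk DCs plus the EBS-backed backup DC (names illustrative).
session.execute(
    "ALTER KEYSPACE feature_store WITH replication = "
    "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'ebs-dc3': 3}"
)

# LOCAL_QUORUM waits only for replicas in the client's local DC; replicas in the
# other DCs, including ebs-dc3, are written asynchronously.
write = SimpleStatement(
    "INSERT INTO feature_store.user_features (user_id, feature, value) "
    "VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, ("u123", "clicks_7d", 42))
```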

The following figure shows the ebs-dc3 DC that provides the data backup and degraded-service capability.

Fig. 20 Deployment with Backup and Degrading Features

In the figures below, DC1 and DC2 are Scylla DCs using local disks, and DC3 is a Scylla DC using cloud disks. You can see that online query requests are never sent to DC3, while data writes are always synchronized across the three DCs.

Requests from clients are only sent to the Scylla DCs backed by local disks.

Fig. 21 Client Traffic Monitoring to ScyllaCluster with 2 Online Logical DCs and 1 Backup Logical DC

Data writes are synchronized across the three DCs (the green line spikes at 15:07 because DC1, recovering from its failure, needs to catch up on the data updated in the other DCs during the outage).

Fig. 22 Server Replica Traffic Monitoring of ScyllaCluster with 2 Online Logical DCs and 1 Backup DC

Summary and Outlook

Beyond what is covered above, we also ran into code-level issues such as Scylla Operator crashes and a 40 ms delay in Scylla, as well as system-tuning topics such as parameter tuning, compaction policy selection, and high availability. These will be introduced in more detail in subsequent articles.

In the future, we will continue to deepen and expand our use of ScyllaDB in the following directions.

  1. Provide more instance-type choices. The instances we currently use are based on the x86 architecture; we will try ARM-based instances in the future.
  2. Finer-grained SLA classification and control within a single cluster, based on the open-source version.
  3. Scylla officially provides Scylla Manager, a service for day-to-day operational tasks that can automatically back up, repair, and monitor data. However, the open-source license limits it to managing at most 5 nodes, so we will develop our own service to automate these operations.
  4. PII (personally identifiable information) and compliance management of sensitive user data will be one of our important directions in the future.
  5. Graph database support. With ScyllaDB as the base, finding a suitable graph database solution for SmartNews’ graph-relational data is another direction we will explore.
