Redis is an in-memory data structure server that is widely popular¹. Recently (Oct 2019), ScyllaDB's master branch accepted contributions supporting² the basic Redis protocol³ from Peng Jian⁴. Naturally, one would like to compare the cost of running such an alternative caching solution in the cloud. This blog compares ScyllaDB's Redis API with AWS ElastiCache, a fully managed Redis service from Amazon, with a focus on reducing cloud TCO for a highly available, high-capacity (TB-scale) cache.
Redis is fast, but has limitations:
- Single-threaded process: While multithreaded Redis, in the form of multiple I/O threads, is in progress, this is not the same as the Seastar⁵ architecture, which is fundamentally designed for performance on modern multi-core systems and therefore scales well.
- Basic on-disk/SSD persistence provided via RDB and/or AOF based snapshots. Storage and persistence are central to ScyllaDB's design. Further, ScyllaDB provides replication and multi-datacenter support out of the box; similar support for Redis is available only through Redis Enterprise⁶.
- Compared to Redis, ScyllaDB can serve requests out of fast local storage such as NVMe, thereby offering higher density.
Redis support in ScyllaDB, however, is currently limited to basic APIs such as Put/Get/Delete/Ping. Nevertheless, these APIs are robust, having been tested with millions of operations. Being open source, support for additional APIs will likely be added by the community.
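Since what ScyllaDB implements here is the Redis wire protocol, it helps to see what these basic commands look like on the wire. The sketch below is an illustrative RESP (Redis Serialization Protocol) encoder in Python; it is not ScyllaDB's or Redis's actual code, just a minimal example of how a client frames these requests:

```python
def encode_resp(*parts: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings.

    A command such as SET key value is framed as:
    *3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n
    """
    out = [f"*{len(parts)}\r\n".encode()]
    for p in parts:
        data = p.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

# The frames a client would send for the basic commands supported here.
print(encode_resp("PING"))
print(encode_resp("SET", "session:42", "active"))   # → b'*3\r\n$3\r\nSET\r\n...'
print(encode_resp("GET", "session:42"))
```

Any client speaking this framing (memtier included) can talk to the ScyllaDB Redis port unchanged, which is why off-the-shelf Redis tooling works in the benchmark below.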
Redis Labs published a high-throughput benchmarking tool for Redis called memtier. The numbers presented here are based on using memtier as a load generator. Memtier was chosen for the following reasons: (i) it is a benchmark maintained and supported by Redis Labs, the authors of Redis; (ii) it is designed to generate high-throughput load using a wide variety of combinations of clients, threads, and connections per thread; and (iii) its report includes throughput and latency profiles.
AWS ElastiCache has several use cases: caching, maintaining gaming leaderboards, pub-sub for chat-based apps, session stores, and ML model scoring, to name a few. From an end user's perspective, these services are real-time, responsive, and "always on". Hence the need for a high-throughput, low-latency memory-based store that can scale and is highly available.
Consider a use case where one needs to cache 1 billion objects, each about 1 KB in size (similar to a session store, status page, etc.). At 1 billion items of 1 KB each, we need a cache of 1 TB. Further, in a production environment such a cache needs to be available, and hence the cache data needs to be replicated. For this object-cache use case, assume we need a throughput SLA of 100K ops/sec and <10 ms average latency, while providing availability (via replication) across two zones in AWS. Note that this use case falls under the category of a "capacity" cache with replication in order to guarantee a fixed SLA. The workload is modeled with the following memtier parameters: 50 clients, each with 20 threads.
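As a back-of-the-envelope check, the numbers from the use case above work out as follows (all figures come from the text; this is just the arithmetic made explicit):

```python
# Back-of-the-envelope sizing for the use case above.
objects = 1_000_000_000        # 1 billion cached items
object_size_kb = 1             # ~1 KB each (session entry, status page, ...)
target_ops = 100_000           # throughput SLA, ops/sec
clients, threads = 50, 20      # memtier load-generator settings

cache_tb = objects * object_size_kb / 1_000_000_000   # KB -> TB
replicated_tb = cache_tb * 2                          # two availability zones
connections = clients * threads
ops_per_connection = target_ops / connections

print(f"cache capacity: {cache_tb:.1f} TB, replicated: {replicated_tb:.1f} TB")
print(f"{connections} connections -> {ops_per_connection:.0f} ops/sec each")
```

So the SLA amounts to roughly 100 ops/sec per connection across 1,000 connections, against 2 TB of replicated cache.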
AWS ElastiCache provides the ability to replicate the cache across availability zones. For a cache capacity of 1 TiB, AWS ElastiCache requires ~1.6 TB of total memory because of the overhead of caching metadata. Using 8x cache.r4.8xlarge instances, each of which has ~200 GB of RAM, the total cache capacity equals ~1.6 TB. Further, for high availability, we double the number of instances. Therefore, for AWS ElastiCache, a replication group with 8 instances in each of two availability zones leads to a total of 16x cache.r4.8xlarge instances⁷.
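The instance-count arithmetic above can be sketched directly (figures taken from the paragraph; the ~200 GB per node is the approximate usable memory stated there):

```python
import math

# Node-count arithmetic for the ElastiCache layout described above.
required_gb = 1600        # ~1.6 TB: 1 TiB of data plus metadata overhead
ram_per_node_gb = 200     # approximate usable memory of one cache.r4.8xlarge
az_count = 2              # duplicate the replication group across two AZs

nodes_per_az = math.ceil(required_gb / ram_per_node_gb)
total_nodes = nodes_per_az * az_count
print(f"{nodes_per_az} nodes per AZ, {total_nodes} cache.r4.8xlarge total")
```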
For ScyllaDB, the cache capacity is limited by NVMe Storage. A single i3.8xlarge instance provides 1.9 TiB of cache. Running this in cluster mode across two availability zones results in 2x i3.8xlarge instances⁸. The client benchmarking tool, memtier, runs on a c5.9xlarge instance. Details of memtier parameters and AWS instance details are below:
The memtier client ran a load against both AWS ElastiCache and ScyllaDB. Details of the script and notes on running it are available on GitHub⁹. First, a cache of 1 billion items was populated, followed by a 100% Get workload and an 80:20 Get:Put mixed workload, each running for half an hour. For both ScyllaDB and AWS ElastiCache, a throughput of 100K ops/sec with an average latency < 10 ms was attained. However, based on cloud cost, ScyllaDB is at least 12x cheaper at 1 TiB.
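The cost multiple can be sanity-checked with simple arithmetic. The hourly prices below are illustrative assumptions, not figures from the original benchmark; on-demand pricing varies by region and over time, so check the current AWS price list before relying on them:

```python
# Rough cost comparison; hourly prices are ASSUMED for illustration only.
elasticache_nodes = 16
elasticache_node_hr = 3.64   # assumed on-demand $/hr for cache.r4.8xlarge
scylla_nodes = 2
scylla_node_hr = 2.50        # assumed on-demand $/hr for i3.8xlarge

elasticache_hr = elasticache_nodes * elasticache_node_hr
scylla_hr = scylla_nodes * scylla_node_hr
print(f"ElastiCache: ${elasticache_hr:.2f}/hr, ScyllaDB: ${scylla_hr:.2f}/hr")
print(f"cost ratio: {elasticache_hr / scylla_hr:.1f}x")
```

The exact multiple depends on the prices plugged in; the dominant factor is the 16-vs-2 node count, which is why the ratio lands in the same ballpark regardless of modest price changes.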
To summarize, we see the following benefits of using ScyllaDB-based Redis as opposed to AWS ElastiCache. Note that the cost savings of using ScyllaDB increase non-linearly as the cache capacity grows, because ScyllaDB's capacity scales in increments of an SSD-backed instance's storage rather than its RAM.
¹ Redis DB rankings: https://db-engines.com/en/system/Redis
² Redis API Pull Request: https://github.com/scylladb/scylla/pull/5132
³ Redis API in Scylla: https://github.com/scylladb/scylla/blob/master/docs/redis/redis.md
⁵ Seastar: http://seastar.io/
⁶ Redis Enterprise Features: https://redislabs.com/redis-enterprise/technology/
⁷ AWS ElastiCache High Availability: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.html
⁸ ScyllaDB cluster mode on a single DC: https://docs.scylladb.com/operating-scylla/procedures/cluster-management/create_cluster/
⁹ Benchmarking script and notes on AWS: https://github.com/githubsid/scylla-redis