Riak: not your average KV Store

Published in The Ksquare Group on May 10, 2019

Authors: Jaqueline Caamal & Mike Uc

Think of a key-value store as a map or a dictionary. It is a simple database that uses an associative array to map each unique key to a value in a collection. Regardless of how much data the collection holds, KV stores are well suited to a constant stream of low-latency read and write operations.
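To make the dictionary analogy concrete, here is a minimal in-memory sketch. The class name TinyKVStore and its methods are hypothetical and only illustrate the idea; this is not how Riak KV or any real store is implemented.

```python
from typing import Dict, Optional


class TinyKVStore:
    """Hypothetical toy store: a dictionary that maps unique keys to opaque values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True only if the key existed and was removed.
        return self._data.pop(key, None) is not None
```

A session record could then be written with store.put("user:42", b"...") and read back with store.get("user:42"), with no schema involved on either side.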

Companies often need a data model for managing customer data records, storing subscriber sessions or user preferences, and tracking behavioral data. Key-value storage is a good fit for rapid access to this volume of data under a constant stream of operations.

Key-value storage also makes it easy to distribute data across several database servers, which makes it a good fit for “data-intensive” applications. A-M-A-Z-I-N-G, right? The main feature of a key-value store is that it is simple but really fast: data is stored in a basic key-value (kv) structure, and the store is ignorant of the content of the value. Now let’s talk about a specific kv store: Riak KV.

Why is Riak KV special?

Riak KV is specifically designed to address the problem of data availability. Its advanced local and multi-cluster replication keeps reads and writes available even in the event of hardware failures or network partitions. It has a masterless architecture that makes it easy to add and remove nodes using commodity hardware.
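One way a masterless store can decide where a key lives without asking a coordinator is to hash the key onto a ring of partitions. The sketch below is a simplified illustration of that general idea, not Riak KV's actual code: the function names, the round-robin assignment of partitions to nodes, the 64-partition ring, and the 3 replicas are all assumptions chosen for the example.

```python
import hashlib
from typing import List

RING_SIZE = 2 ** 160  # SHA-1 produces 160-bit digests


def partition_for(bucket: str, key: str, num_partitions: int = 64) -> int:
    """Map a bucket/key pair to one of num_partitions equal slices of the ring."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position * num_partitions // RING_SIZE


def preference_list(bucket: str, key: str, nodes: List[str],
                    num_partitions: int = 64, replicas: int = 3) -> List[str]:
    """Return the nodes holding the key: the owning partition plus the next ones."""
    first = partition_for(bucket, key, num_partitions)
    owners = []
    for step in range(replicas):
        partition = (first + step) % num_partitions
        # Toy assignment (assumption): partitions are dealt to nodes round-robin.
        owners.append(nodes[partition % len(nodes)])
    return owners
```

Because every node can compute the same answer from the key alone, any node can accept a request, and losing one node still leaves the other replicas reachable.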

To be specific about scale:

Riak KV can handle millions of keys, petabytes of storage, millions of users, and billions of data points.

Need proof of its capabilities?

For this experiment, we are going to use an interesting project from Yahoo!, the Yahoo! Cloud Serving Benchmark (YCSB). Its goal is to provide a framework and a common set of workloads for evaluating the performance of different key-value stores (the link is included below). In this case, the example uses workload D (workloads A, B, C, E, and F exercise other access patterns). Workload D is the “Read Latest” workload: new records are inserted, and the most recently inserted records are the most popular. We see this pattern in applications such as user status updates, where people want to read the latest posts.
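The “read latest” pattern can be illustrated with a small simulation. This is only a sketch of the access pattern, not YCSB's generator: the function name is hypothetical, the 95% read share is an assumption, and an exponential distribution stands in for whatever skew the real benchmark applies.

```python
import random
from typing import Dict, Tuple


def simulate_read_latest(operations: int, initial_records: int = 100,
                         read_fraction: float = 0.95,
                         seed: int = 0) -> Tuple[int, Dict[int, int]]:
    """Mix inserts and reads, with reads skewed toward the newest records."""
    rng = random.Random(seed)
    record_count = initial_records
    reads_by_key: Dict[int, int] = {}
    for _ in range(operations):
        if rng.random() < read_fraction:
            # Distance back from the newest record; small distances are most likely.
            offset = min(int(rng.expovariate(0.1)), record_count - 1)
            key = record_count - 1 - offset
            reads_by_key[key] = reads_by_key.get(key, 0) + 1
        else:
            # An insert creates a new "latest" record.
            record_count += 1
    return record_count, reads_by_key
```

Running it for a few thousand operations shows most reads landing on keys close to the current record count, which is the behavior a status feed produces.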

For the workload execution, the test uses these parameters:

threads: Number of client threads. In this example (Yahoo! provides the parameters; we chose the values) we set it to 10 and kept it fixed, because we increase the amount of loaded data between runs and want the runs to be comparable.
target: Target number of operations per second across all threads. We set this parameter to 100, which is about 10 operations per second per worker thread, or one operation every 100 milliseconds per thread on average.
s: Status report. This report helps us follow the performance statistics during the run.
P: Loads the workload parameter file.
measurementtype: We set this to the time series type to report the average latency of each interval.
p: Sets a specific parameter.
timeseries.granularity: We set a granularity of 200, which means one reading every 200 milliseconds. Dividing the run into small subintervals gives a closer approximation of the true latency over time.
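The parameters above can be assembled into a YCSB invocation roughly as follows. This is a sketch under assumptions, not the exact command from our runs: the launcher path bin/ycsb, the riak binding name, the workloads/workloadd file, and the recordcount property follow the stock YCSB layout as we understand it, and the helper names are hypothetical.

```python
from typing import List, Tuple


def build_ycsb_command(record_count: int,
                       workload_file: str = "workloads/workloadd",  # assumed path
                       binding: str = "riak",                       # assumed binding name
                       threads: int = 10,
                       target_ops_per_sec: int = 100,
                       granularity_ms: int = 200) -> List[str]:
    """Build the argument list for a YCSB 'run' phase with time series output."""
    return [
        "bin/ycsb", "run", binding,
        "-P", workload_file,
        "-threads", str(threads),
        "-target", str(target_ops_per_sec),
        "-p", f"recordcount={record_count}",
        "-p", "measurementtype=timeseries",
        "-p", f"timeseries.granularity={granularity_ms}",
        "-s",
    ]


def per_thread_pacing(threads: int, target_ops_per_sec: int) -> Tuple[float, float]:
    """Return (operations per second per thread, milliseconds between operations)."""
    ops_per_thread = target_ops_per_sec / threads
    interval_ms = 1000 / ops_per_thread
    return ops_per_thread, interval_ms
```

With 10 threads and a target of 100, per_thread_pacing gives 10 operations per second per thread, one every 100 milliseconds. The list from build_ycsb_command can be passed to subprocess.run from a YCSB checkout.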

The command we used.

Drumroll for the final results

In our experiment, we increased the record count from run to run to show that performance stays similar at every size.

The following table shows the standard deviation (SD) and 95% confidence interval for each record count.
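For reference, the SD and 95% confidence interval for one record count can be computed from its interval latencies roughly as follows. This is a sketch: the function name is hypothetical, and the normal approximation (z = 1.96) is an assumption, since other interval methods are possible.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_latencies(samples: Sequence[float]) -> Dict[str, object]:
    """Return the mean, sample SD, and a 95% confidence interval for the mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample SD, divides by n - 1
    # Normal approximation: mean +/- 1.96 standard errors (assumption).
    half_width = 1.96 * sd / math.sqrt(n)
    return {"n": n, "mean": mean, "sd": sd,
            "ci95": (mean - half_width, mean + half_width)}
```

Narrow, overlapping intervals across record counts are what "similar performance" looks like in numbers.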

As we can see, the lowest average latencies were observed for the 1,000,000 and 100,000,000 record counts, and the highest for the 10,000,000 and 1,000,000,000 record counts. Even so, we can conclude from this benchmark that the four record counts have a similar profile and behavior.

In practice, gaming companies use Riak KV to store session data for players; retail and e-commerce companies use it to store user profiles, preferences, and behavior data; and telecommunications companies use it to manage customer data records and save subscriber session information for mobile and web applications. The larger the volume of data an application handles, the more significant the benefit of migrating to Riak KV becomes.

Riak KV gives us significant scalability and good read performance, despite the massive increase in records.

To be fair to Riak, it offers more than great performance at scale. It is a complete kv database, and its value would become more apparent if we were to list all the features it offers. But, to cut to the chase: no matter how big you think your data is, if blazingly fast read access at scale is important to your business, Riak KV may be the solution you’re looking for.

This article is just a descriptive report of a benchmark.

Git repo: github.com/brianfrankcooper/YCSB (Yahoo! Cloud Serving Benchmark)