Operating a Riak KV cluster

Ivan Tuzhilkin
Sep 2, 2019 · 6 min read

When we talk about NoSQL databases, we expect high availability and linear scalability to work out of the box. We also assume that scaling writes shouldn't be a problem for a NoSQL cluster. However, not everything works as expected; some things work terribly or make routine tasks take far too much time.

Riak is a distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability. (Wikipedia)

All of the above is entirely true.

There is plenty of information about the use cases and successful implementations of Riak out there, but I would like to tell you about our experience. Our use case is a game server backend; the data we store in Riak is player data such as profiles and in-game progress saves. Currently, we have several Riak clusters; one of them consists of 20 nodes and contains about 5 TB of data.

Is Riak exceptional? Let's review some of Riak's benefits.

Operational simplicity

It just works, and this is true.

A typical Riak cluster is composed of identical nodes, and data is automatically replicated across them. In general, Riak nodes can be added or removed "automagically" and painlessly, and the data is rebalanced automatically in a non-blocking operation.

### Join an existing cluster; any cluster node can be picked to join.
riak-admin cluster join riak@node-01.name
### Riak creates a plan for how vnodes should be rebalanced across the cluster.
riak-admin cluster plan
### Finally, commit to start the join operation.
riak-admin cluster commit

Since every cluster node is identical, there is no single point of failure or bottleneck. If a machine failure occurs, Riak detects it and restores the node's consistency when the machine gets back online.

### When a node joins the cluster or gets back online, the handoffs can be observed with:
riak-admin transfers

The problems we had operating Riak clusters.

Out of disk space. Bitcask, Riak's default storage backend, may consume a lot of disk space and frees it up only after a Riak restart, and not always even then (sometimes it really is just a lot of data). LevelDB doesn't have this problem.

Since we do not collect and store players' personal information, we don't delete data from Riak at all; as far as I know, removing data from Riak is quite complicated anyway. In my experience, LevelDB never releases disk space, even when part of the data is moved to other nodes during cluster reconfiguration, or when a node leaves the cluster. In the latter case, the LevelDB data directory should be cleaned manually.

Out of RAM (swapping). Bitcask keeps all keys in memory, so you have to be sure there is enough RAM on your servers. LevelDB doesn't rely on RAM as much, but it's always better to have more than less.
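Since every Bitcask key lives in memory, a back-of-the-envelope estimate can tell you whether the key directory will fit in RAM. The helper below is just a sketch; the per-key overhead value is an assumption for illustration, so check the Bitcask capacity planning docs for the exact figure for your Riak version.

```shell
# Rough Bitcask RAM estimate: total keydir size is roughly
# keys * (average key size + per-key overhead).
# The overhead value passed in is an assumption, not an official figure.
bitcask_ram_bytes() {
  keys=$1
  avg_key_size=$2
  per_key_overhead=$3
  echo $((keys * (avg_key_size + per_key_overhead)))
}

# e.g. 100 million keys, 36-byte keys, ~45 bytes assumed overhead:
# bitcask_ram_bytes 100000000 36 45
```

If the result is close to (or above) the RAM available per node, Bitcask will start swapping long before you run out of disk.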

We use LevelDB by default on all current and new projects.

Despite some tradeoffs, operating Riak is straightforward and reliable; it's definitely one of the best solutions I've ever used.

Intelligent replication

No time for downtime?

Riak 2.2.5, released April 26, 2018, was the first community release. It added support for Multi-Datacentre Replication, which had not been part of open-source Riak before, added a grow-only set data type, improved data distribution across nodes, and cleaned up issues found in production testing.

Riak replicates key/value data across a cluster of nodes with a default n_val = 3 (the N value). While some nodes are unavailable, your application can still read and write data, but only if the count of active nodes still fits your read/write quorum (configurable from 1 to N). You can manage the default n_val across the cluster with the riak-admin bucket-type command.

If for some reason you want to run a single-node Riak cluster, you'd better create a new bucket type with n_val: 1; otherwise, you will have N copies (3 by default) of the data on one node.

riak-admin bucket-type create test_bucket_01 '{"props":{"n_val":1}}'
riak-admin bucket-type activate test_bucket_01

Important note: while raising the value of N for a bucket or object shouldn't cause problems, it's essential that you never lower N. If you do, you can wind up with dead, i.e. unreachable, data. This can happen because objects' preflists, i.e. the lists of vnodes responsible for the objects, can end up changed.

A Riak bucket is not a distinct entity like a database in an RDBMS; it's a property of an object. When you change a property such as n_val for an existing bucket, the N value changes only for new objects (for existing objects, the N value changes only after the object is overwritten).

An example query to set n_val:

# Set n_val: 5 for the buckets listed below, then check the result
for bucket in test_bucket_0 test_bucket_1 test_bucket_2; do
  curl -v -XPUT http://localhost:8098/buckets/$bucket/props -H "Content-Type: application/json" -d '{"props":{"n_val":5}}'
  curl -v http://localhost:8098/buckets/$bucket/props
done

The problems we had with replication.

One day we decided to set n_val to 5 for a bucket with important data. After that, we got a significantly higher number of transfers between cluster nodes. We also recorded significant growth in object merges and index writes and, as a result, high latency and timeouts for read and write requests. At that moment we had seven nodes in the cluster; we decided to raise the node count first to 15 (it didn't help) and then to 20 (N*4), while halving the number of CPU cores and the amount of RAM per instance. We added the new nodes one by one; it's not a fast process, so if you perform such an operation on a production cluster, it's better to wait until each node has successfully joined and the cluster is wholly rebalanced. Otherwise, you may see delays and a drop in performance due to a vast increase in internode transfers. A week later, our Riak cluster's indicators were back to normal.

The official recommendation says to use "no fewer than five nodes" for production workloads. But in our practical experience, if you have a highly loaded cluster with a large amount of data, make sure you have 3 to 4 times more nodes than your N value (n_val).
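That rule of thumb can be written down as a tiny helper. This is our experience-based heuristic, not an official Basho sizing formula:

```shell
# Experience-based sizing heuristic: plan for 3-4x as many nodes as n_val
# on a highly loaded cluster with a large amount of data.
recommended_nodes() {
  n_val=$1
  echo "$((n_val * 3)) to $((n_val * 4))"
}

# For n_val=5 this suggests 15 to 20 nodes, which matches where our
# cluster finally stabilized (20 nodes, i.e. N*4).
```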

Reaching the max concurrency limit. When adding new nodes, replication may sometimes be interrupted with the following error:

An outbound handoff of partition riak_kv_vnode 702...008 was terminated for reason: {shutdown,max_concurrency}

You can try to solve it by playing with the transfer-limit value: set it higher than the default (2), wait until transfers resume (and complete), then reset it to the default.

# The value can be set per node
riak-admin transfer-limit <node> <limit>
# Or across the whole cluster
riak-admin transfer-limit 4

Also, at peak moments LevelDB compactions can build up and block writes. That kind of issue can be solved by adding retries with a timeout between attempts (e.g. 5 retries with a 3-second timeout), if such an approach is acceptable for your application.
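A minimal retry wrapper along those lines might look like this. The function name and the 5-retries/3-second values are just the example figures from above, not anything Riak-specific:

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts;
# return 0 on the first success, 1 if every attempt fails.
retry() {
  attempts=$1; shift
  delay=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# e.g. retry a write 5 times with a 3-second pause while compactions settle
# (bucket and key names are illustrative):
# retry 5 3 curl -sf -XPUT http://localhost:8098/buckets/test_bucket_0/keys/some-key \
#   -H "Content-Type: application/json" -d '{"progress": 1}'
```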

A cluster node breaks during data transfer. If one node of your cluster fails due to a hardware issue, you may find that the node then fails to start with an error like "Corruption: truncated record at the end of file". That means some of the files were not completely transferred during handoff. To solve the issue, find the broken index entry in crash.log, which looks like:

** Last event in was timeout
** When State == started
** Data == {state,125597796958124469533129165311555572001681702912,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}

Now, you need to remove that directory from the Riak data directory.
When using Bitcask:

rm -rf /var/lib/riak/bitcask/1255977969581244695331...72001681702912

Or, if you use LevelDB, the directory can be found in /var/lib/riak/leveldb.
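To avoid copying that long index by hand, the partition indexes mentioned in crash.log can be extracted with a small helper. This is a sketch; the log path and data directory layout shown in the usage comment are the defaults assumed throughout this article:

```shell
# List the partition indexes (the long integers in the {state,...} tuples)
# found in a crash.log, so the matching backend directory can be located.
partition_indexes() {
  grep -oE '[0-9]{30,}' "$1" | sort -u
}

# Usage (paths assumed):
# partition_indexes /var/log/riak/crash.log
# ls -d /var/lib/riak/bitcask/<index>    # or /var/lib/riak/leveldb/<index>
```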

This is undoubtedly a dangerous operation, and you should be very careful with rm -rf. However, in Riak it's safe to delete this data, because you have at least two copies of it on other cluster nodes. Riak automatically restores the data when the node gets back online.

As you can see, there are no severe replication problems in Riak. But keep in mind that it's better to plan your cluster capacity with a big enough margin of compute and storage resources to mitigate growth risks. And of course: backups, backups, backups.


Riak is not an ideal database; it lacks a query language and other SQL-like abilities that some other databases have. But it's one of the best solutions for certain cases, like storing a massive amount of unstructured data. Riak gives you infrastructure durability and data safety along with the ability to read and write data with high throughput.

So, here are my thoughts on the advantages of Riak:

  • Scalability: both vertical and horizontal scaling are easy, and adding/removing nodes works great.
  • Predictable performance.
  • Self-healing after a node failure.
  • Painless cluster operation, no apparent problems.
  • Well documented.
  • No configuration complexity.
  • You don't have to worry about the safety of your data, as long as you don't forget about backups.
  • The downsides are not that significant.

Further reading

Riak Setup planning
Riak use cases
Riak developing
