How Medium Detects Hotspots in DynamoDB using ElasticSearch, Logstash and Kibana
Putting the ELK stack to work.
One of the most common issues you face when using DynamoDB, or any similar “Big Data” type database, is that your workload doesn’t access the data in a uniform pattern. This kind of issue is commonly known as a hotspot or a hotkey.
Horizontal scaling and hotspots
Let me try to help you understand what I mean. (By the way, I’m using DynamoDB here as an example, but this is applicable to virtually any distributed data store: Aerospike, Cassandra, Couchbase, you name it.)
As your database grows in size or is accessed more and more often, you start hitting limits. The limit could be the number of objects you are able to keep in memory, or on disk. Or it could be that your application starts sending too many concurrent requests, and your database has to queue more and more of them. Scaling vertically — adding more memory, more CPU — has its limits, so all modern databases get around these problems by scaling horizontally: adding more servers.


Let’s imagine that you have one database server that can handle around 1,000 queries per second (QPS). If you manage to break the data up into two sets, and store each set on its own similar server, you can essentially get up to 2,000 QPS out of the pair — as long as you access your data uniformly.


Let’s imagine we’re storing people’s names in a database. Because we have a lot of people to store and we want to be able to access each of them really fast, we scale our database to 26 servers, one server per letter. We then break up the data by storing people’s information on those 26 servers using the first letter of the last name.
Now you could potentially get to 26,000 QPS — if the requests are spread across all letters equally. However, if 99% of the requests are for “John Doe”, you will mostly hit the server handling the letter D, which means your actual QPS will hardly be above the 1,000 we started with.
This is what a hotspot or hotkey is.
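To make the skew concrete, here is a toy Python sketch of that letter-based sharding scheme (the names and traffic mix are made up for illustration):

```python
from collections import Counter

def shard_for(last_name: str) -> str:
    """Route a record to one of 26 shards by the first letter of the last name."""
    return last_name[0].upper()

# Simulate a skewed workload: 99% of the lookups are for "Doe".
requests = ["Doe"] * 990 + ["Smith", "Nguyen", "Garcia", "Kim", "Abbott"] * 2
load = Counter(shard_for(name) for name in requests)

print(load.most_common(3))  # [('D', 990), ('S', 2), ('N', 2)]
# The "D" shard absorbs almost all the traffic, so the effective throughput
# of the whole 26-server fleet collapses to that of a single server.
```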
In the world of Big Data, even though you usually talk to a single endpoint, the same concepts apply. A DynamoDB table might be served by 20–30 different servers, but accessing the same keys over and over will create hotspots on a small subset of them, dragging down overall performance.
Why are hotspots so hard to deal with in DynamoDB?
While hotspots are generally annoying to deal with, they are particularly hard in DynamoDB.
DynamoDB is a managed service, which means you can’t do much when it comes to administering it. In terms of capacity, you can only set a desired throughput for your reads and writes. Let’s say you create a table and provision it with 100 write capacity units (WCU). As long as your consumption stays below that capacity you will be fine, but when you try to consume more than you provisioned the table for — like 200 WCU — DynamoDB will start returning throttling errors on some of your queries.
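For the client, a throttled request surfaces as an error it has to handle. Here is a minimal sketch of what that looks like with boto3 (the table name and retry policy are illustrative assumptions, and the AWS SDKs already do some retrying of their own):

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def put_with_backoff(item, retries=5):
    """Write an item, backing off whenever DynamoDB throttles the request."""
    for attempt in range(retries):
        try:
            return dynamodb.put_item(TableName="postChunk", Item=item)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise  # not a throttling error, so re-raise it
            # Throttled: wait with exponential backoff, then try again.
            time.sleep(0.05 * 2 ** attempt)
    raise RuntimeError("still throttled after %d retries" % retries)
```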
What makes DynamoDB more challenging than competing technologies is that it takes the Read Capacity Unit (RCU) and Write Capacity Unit (WCU) values and divides them by the number of partitions your table has — without ever exposing that partition count. This makes it very difficult to optimize the provisioned capacity units, and you’re forced to over-provision the entire table to satisfy the needs of your busiest partition.
Take a look at this:


Looking at the last 24 hours, it seemed that we had more than enough capacity (the red line) to handle all the writes (the blue line). But when we look at the Throttled Write Requests graph on the right-hand side, we see spikes of 150,000 throttled requests. Why?
Basically, if your table becomes too big, or if you need to handle a lot of concurrent requests, it gets partitioned automatically (this document explains it in more detail). The number of partitions isn’t exposed anywhere, but it directly impacts how you should set your read and write capacity units, because the provisioned capacity is divided evenly across the partitions. Looking back at the graph above, we set the provisioned write capacity to 1,400 — but if the table is stored on 28 partitions, each partition only gets 50 WCU (1,400/28). That brings the WCU per partition a lot closer to the average consumed capacity, and is a likely reason why we were seeing such a huge number of throttled requests.
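As a back-of-the-envelope sketch, here is that division in Python, together with the rough partition-count estimate from the sizing guidance AWS published at the time (roughly 10 GB, 3,000 RCU or 1,000 WCU per partition). The real partition count is internal to DynamoDB, so treat this strictly as an approximation:

```python
import math

def estimate_partitions(table_size_gb, rcu, wcu):
    """Rough estimate: a partition holds ~10 GB and serves ~3,000 RCU or ~1,000 WCU."""
    by_size = math.ceil(table_size_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    return max(by_size, by_throughput, 1)

provisioned_wcu = 1400
partitions = 28  # whatever the table actually ended up with
print(provisioned_wcu / partitions)  # 50.0 WCU effectively available per partition
```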
So, getting back to the concept of a uniform access pattern: if you happen to mostly access a certain subset of records, and if those records happen to live on a small subset of partitions, your table will suddenly start throttling, because the capacity actually available to those records is only a fraction of what you provisioned.
It just goes downhill from here:
- DynamoDB doesn’t support the merging of partitions if you remove data or lower your RCU / WCU
- CloudTrail doesn’t capture access logs for DynamoDB reads and writes.
- CloudWatch, until very recently, only reported metrics every 5 minutes; it’s now down to 1 minute.
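Those CloudWatch metrics are still worth pulling programmatically, so you can line up consumed write capacity against throttle events yourself. A possible sketch with boto3 (the table name is illustrative):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def dynamo_metric(table, metric, hours=24):
    """Fetch a per-minute DynamoDB metric for the last `hours` hours."""
    now = datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName=metric,
        Dimensions=[{"Name": "TableName", "Value": table}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=60,  # 1-minute granularity
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

writes = dynamo_metric("postChunk", "ConsumedWriteCapacityUnits")
throttles = dynamo_metric("postChunk", "WriteThrottleEvents")
```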
ELK to the rescue
Knowing that a lot of this is out of sight and out of our control, a team at Medium worked on a project to detect when we have a bad access pattern and are hitting certain keys too often.
To do that, we created our own access log for DynamoDB. Here’s an example of the output.


That log file (db.log) is produced by every service that accesses DynamoDB. The files get shipped to a pool of Logstash servers using LogstashForwarder. Logstash doesn’t really do much except send the logs on to ElasticSearch, and we then use Kibana 4 to retrieve the data and build a nice visualization out of it.
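The snippet below is not our actual implementation, just a minimal Python sketch of the idea: wrap each DynamoDB call and append one JSON line per request to db.log, where the forwarder can pick it up. The field names and log path are assumptions.

```python
import json
import time

import boto3

dynamodb = boto3.client("dynamodb")
LOG_PATH = "/var/log/db.log"  # illustrative path

def get_item_logged(table, key):
    """Call DynamoDB and append one JSON access-log line for the request."""
    start = time.time()
    resp = dynamodb.get_item(TableName=table, Key=key,
                             ReturnConsumedCapacity="TOTAL")
    entry = {
        "ts": int(start * 1000),
        "op": "GetItem",
        "table": table,
        # Flatten the key into a string so it can be aggregated on later.
        "key": json.dumps(key, sort_keys=True),
        "latency_ms": round((time.time() - start) * 1000, 1),
        "consumed_rcu": resp.get("ConsumedCapacity", {}).get("CapacityUnits"),
    }
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return resp
```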
Here are two examples.
Example 1: Uniform access pattern


Here, I’m looking at the mediumRequestTrace_2015_08 table. We can see a very uniform access pattern in the bottom graph: the top keys were each accessed only twice. This is very much a best-case scenario.
Example 2: Hot key


Here we are looking at a table called postChunk. The top key in the DynamoDB Hot keys graph spikes at 400, totally dwarfing every other key. This is definitely an issue, and if DynamoDB starts throttling, chances are that it’s because of that key. It also means that increasing the RCU (we are doing too many query operations against a single key) will probably not help.
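Under the hood, a hot-keys panel like this boils down to a terms aggregation over the logged key field. Here is a sketch of that query against the Elasticsearch search API; the index and field names are assumptions about the log schema, not our actual mapping:

```python
import requests

# Top 10 most frequently accessed keys in the last hour, for one table.
query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"term": {"table": "postChunk"}},
                {"range": {"ts": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {"hot_keys": {"terms": {"field": "key", "size": 10}}},
}

resp = requests.post("http://localhost:9200/dynamodb-access-*/_search", json=query)
for bucket in resp.json()["aggregations"]["hot_keys"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```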
In conclusion
As the trite W. Edwards Deming quote goes: “You can’t improve what you don’t measure.” By going through this exercise, we’re now a lot more efficient at detecting issues in our data access patterns. This helps us catch bugs, and — when we are alerted about throttling — we have more insight into what to do next.