Organising your DynamoDB metrics into a dashboard

Saswat Raj
3 min read · Apr 28, 2018


I’m not obsessed with cleanliness, nor OCD about organising everything around me, but one thing I’ve learnt in my line of work as an SDE is disaster management. When that odd piece of code you approved breaks in production, you find yourself in a situation straight out of a Counter-Strike round: the opposition has planted the bomb and you are the sole survivor. You need to carefully assess your every action, figure out where to go next and how to fix the situation. But unlike the virtual world, in the software world you have pointers to help you out: logs and metrics. They are like the ‘Force’ in all the code that you write, and may they be with you!

Now, my team has been working on data storage for a particular project and has been considering DynamoDB, provided by AWS, for specific reasons. But whether you like surprises or not, in a world written in code there are situations that need swift or automated action to make sure your persistence layer is working as expected. This is where the DynamoDB metrics provided by AWS CloudWatch come in really handy. However, it’s genuinely confusing to figure out which metrics to look at for a particular issue, or how the metrics relate to one another when viewed individually. A good metrics dashboard is one that relates metrics together, so that you can root-cause an issue through the dashboard itself. So I spent some “Monica time” and organised the metrics into a nice dashboard for a sample project. Here’s what the end sections were.

The Throughput Section

This section contains the metrics relevant to monitoring provisioned throughput. It is important to be notified if the total consumed capacity for DynamoDB operations crosses the allotted capacity; if this were to happen, subsequent requests/events would be throttled, thus impacting your customers. The following metrics can be grouped (keeping the table as a dimension) to provide insights on the throughput of the DynamoDB table.

  • Provisioned Write Capacity Units (Dimension: Table, Metric: Maximum and Minimum)
  • Provisioned Read Capacity Units (Dimension: Table, Metric: Maximum and Minimum)
  • Consumed Read Capacity Units (Dimension: Table, Metric: Sum and Maximum)
  • Consumed Write Capacity Units (Dimension: Table, Metric: Sum and Maximum)
  • Online Index Consumed Write Capacity (Dimension: Table, Metric: Sum)

These metrics help show:

  • Distribution of read/write load to the provisioned capacities.
  • Distribution of RCU and WCU per request
  • Consumption of capacity for a new index creation
  • Notifying for AutoScaling when consumed RCU/WCU crosses a particular threshold
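As a sketch of how this section might be wired up programmatically, the snippet below builds the CloudWatch dashboard widget JSON for the throughput metrics. The table name `Orders`, the widget title, and the sizing are placeholder choices of mine; the metric and dimension names are the ones DynamoDB publishes to the `AWS/DynamoDB` namespace. Passing the resulting body to boto3’s `put_dashboard` (commented out, since it needs AWS credentials) would create the section.

```python
import json

def throughput_widget(table_name, region="us-east-1"):
    """Build a CloudWatch dashboard widget covering the throughput metrics."""
    ns = "AWS/DynamoDB"
    dim = ["TableName", table_name]
    metrics = [
        [ns, "ProvisionedReadCapacityUnits", *dim, {"stat": "Maximum"}],
        [ns, "ProvisionedWriteCapacityUnits", *dim, {"stat": "Maximum"}],
        [ns, "ConsumedReadCapacityUnits", *dim, {"stat": "Sum"}],
        [ns, "ConsumedWriteCapacityUnits", *dim, {"stat": "Sum"}],
        [ns, "OnlineIndexConsumedWriteCapacity", *dim, {"stat": "Sum"}],
    ]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Throughput",   # placeholder section title
            "region": region,
            "metrics": metrics,
            "period": 300,           # 5-minute resolution
        },
    }

dashboard_body = json.dumps({"widgets": [throughput_widget("Orders")]})
# To actually create the dashboard (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ddb-sample", DashboardBody=dashboard_body)
```

The same widget-builder pattern can be repeated for each of the sections below, so the whole dashboard lives in version control rather than in the console.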

The Errors Section

To err is not just human; the applications humans develop err too. Accounting for the error metrics is an integral part of any software development process. DynamoDB publishes the following error metrics:

  • System Errors
  • User Errors
  • Conditional Check Failed Requests

All these errors are mutually exclusive, customer impacting and need to be alarmed on.
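A minimal sketch of alarming on one of these, assuming boto3: the dictionary below holds `put_metric_alarm` parameters for the `SystemErrors` metric. The alarm name, threshold, period, and SNS topic are placeholders I’ve invented for illustration; the metric name, namespace, and `TableName` dimension are the ones DynamoDB publishes.

```python
def system_errors_alarm(table_name, sns_topic_arn):
    """Parameters for a CloudWatch alarm that fires on any SystemErrors."""
    return {
        "AlarmName": f"{table_name}-system-errors",  # hypothetical naming scheme
        "Namespace": "AWS/DynamoDB",
        "MetricName": "SystemErrors",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 60,                 # evaluate each minute
        "EvaluationPeriods": 1,
        "Threshold": 0,               # any system error is worth paging on
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = system_errors_alarm("Orders", "arn:aws:sns:us-east-1:123456789012:oncall")
# To create the alarm (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Analogous dictionaries, swapping in `UserErrors` and `ConditionalCheckFailedRequests`, cover the other two.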

The Latency Section

All of us have been to the grocery store and have waited in the checkout line for our turn. If you’ve ever been pissed off by that, imagine the same for your customers, waiting for a response back from your application. In the persistence layer, latency is measured using the following metrics:

  • Returned Bytes (Dimension: Table and Operation, Metric: Sum and Maximum)
  • Returned Item Count (Dimension: Table and Operation, Metric: Sum and Maximum)
  • Successful Request Latency (Dimension: Table and Operation, Metric: Maximum)
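Since `SuccessfulRequestLatency` is dimensioned on both table and operation, one widget can chart it per operation. The sketch below builds that widget; the table name and operation list are example inputs, while the metric and dimension names are the published ones.

```python
def latency_widget(table_name, operations, region="us-east-1"):
    """Widget charting maximum SuccessfulRequestLatency for each operation."""
    metrics = [
        ["AWS/DynamoDB", "SuccessfulRequestLatency",
         "TableName", table_name, "Operation", op, {"stat": "Maximum"}]
        for op in operations
    ]
    return {
        "type": "metric",
        "properties": {
            "title": "Latency (ms)",  # SuccessfulRequestLatency reports milliseconds
            "region": region,
            "metrics": metrics,
            "period": 300,
        },
    }

widget = latency_widget("Orders", ["GetItem", "PutItem", "Query"])
```

`ReturnedBytes` and `ReturnedItemCount` can be added to the same widget (or a sibling one) in exactly the same way, which makes it easy to spot a latency spike caused by an oversized response.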

The Throttle Section

While the metrics of this section can be derived from the throughput section, why over-engineer when DynamoDB emits these directly to CloudWatch?

  • Throttled Requests
  • Read Throttle Events
  • Write Throttle Events
  • Online Index Throttle Events

All of these metrics can be logged at a table level, their sum indicating the total number of events that have been throttled. A single request can contain many events, and that is where the Throttled Requests metric differs from the rest. It is the one that should definitely be alarmed on.
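Following the same pattern as the errors section, here is a hedged sketch of `put_metric_alarm` parameters for `ThrottledRequests`, scoped to one table and one operation (the metric carries both dimensions). The name, threshold, and SNS topic are again placeholders.

```python
def throttled_requests_alarm(table_name, operation, sns_topic_arn):
    """Parameters for an alarm on ThrottledRequests for one table/operation."""
    return {
        "AlarmName": f"{table_name}-{operation}-throttled",  # hypothetical name
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ThrottledRequests",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "Operation", "Value": operation},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,  # any throttled request is customer-impacting
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = throttled_requests_alarm(
    "Orders", "PutItem", "arn:aws:sns:us-east-1:123456789012:oncall")
# To create the alarm (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```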

In addition to these metrics, make sure to record query-level, client-side metrics for response size, latency, error count, throttles, etc. I hope the above helps reduce your time creating the OE dashboard for your application. For suggestions and queries, please reach out to me at saswatrj2010@gmail.com.
