Building Talko with DynamoDB

Lessons Learned

Ransom Richardson
Tap the Mic

--

At Talko we’ve been using DynamoDB since it was first released. This post explains why we chose it and why we’ve stuck with it. I give examples of some of the more interesting ways we use it. Finally, I share some lessons we’ve learned about using DynamoDB effectively.

Some cursory knowledge of the Talko app, and of what a Talko call is, may help with parts of this post. You can take a peek at the Introducing Talko post or our website.

Why DynamoDB?

Managed

The Talko service team is very small and we are responsible for everything from design to code to operations. Having a fully managed database has allowed us to focus our energy on building our application.

Prior to using DynamoDB we had been using RDS (Amazon managed MySQL). DynamoDB has been even simpler to manage, with no need to worry about configuration or upgrades. Most importantly we haven’t had to manage performance beyond deciding how much throughput to provision on each table.

Reliable

From the beginning Talko has prioritized building a reliable and scalable service. Even with RDS configured for multi-AZ operation, we would see the database become unavailable for a couple of minutes during a failover. That length of outage was unacceptable for our service, so we needed to find a different solution.

Since switching to DynamoDB we have not had any outages due to database unavailability. We do regularly see some operations time out or take much longer than normal. There have also been a couple of short periods when a large number of operations had transient failures. However, unlike the complete outage during MySQL failover, we have been able to continue functioning during these events. See Lessons Learned below for some tips that have helped us keep everything running.

Scalable

Even when using MySQL we designed with future scalability needs in mind. We were careful to avoid MySQL features like multiple indexes and JOINs that would make sharding difficult. As a result we were using MySQL basically as a key/value store. However, we were concerned that without actually sharding, scaling might not go as smoothly as hoped when the time came. It also seemed likely that, over time, expedience would win out over schema discipline and we would begin to use SQL features that would make it hard to scale later.

With DynamoDB the API is limited to operations that will scale, so the temptation to take short-cuts has been removed. Sharding takes place automatically with no need for us to manage it.

Consistent

All DynamoDB write operations are strongly consistent, and most read operations have the option of strong consistency. This was a key feature for us in choosing a database. We take advantage of strong consistency to guarantee uniqueness of items in the database. There are examples in the next section.

Why Not?

DynamoDB has been a great fit for our usage, but it has some significant limitations. There is no ability to run your own code within the database, so doing complex operations over an entire table requires pulling the data out of the table for processing. That can be done using the DynamoDB integration with EMR and Redshift, but DynamoDB is not the best option when that is the primary type of workload. Reads in DynamoDB are much cheaper than writes, so a write-heavy workload could get expensive. DynamoDB works best with small items. Despite the features in the latest release that make it easier to deal with larger items, the cost of operations is still determined by the item size.

Example Usage

We use over 30 DynamoDB tables. Most of these are pretty simple and in line with documented examples. In this section I’m going to give a few more advanced examples from our real usage.

Team Tag Call

A key Talko feature is the ability to tag calls and then look up all calls with a given tag within a team. To implement this we needed to support lookup by team id and tag, returning the ids of all calls with that tag.

The simplest solution would be to use the team id as the hash key and the tag as the range key, and then store the call ids as a set within the DynamoDB item. However, this design makes it hard to store any additional information about the usage of the tag in that call. In our case we wanted to store a count of how many times the tag was used within the call.

Our implementation uses the team id as the hash key and a concatenation of the tag and call id as the range key. This gives us an item for each team/tag/call that can store the count. We can still query efficiently for all calls with a given tag by using a begins_with condition on the range key to restrict the lookup to the tag we are interested in.
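To make the key design concrete, here is a minimal sketch using boto3 (a Python client; the post doesn't say which library Talko's service actually uses). The table and attribute names (TeamTagCalls, TeamId, TagCall, TagCount) and the "#" separator are hypothetical, chosen only to illustrate the concatenated range key and the begins_with query.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: hash key "TeamId", range key "TagCall" built as
# "<tag>#<call_id>", plus a "TagCount" attribute for the per-call usage count.
table = boto3.resource("dynamodb").Table("TeamTagCalls")

def record_tag_use(team_id, tag, call_id):
    # One item per team/tag/call; ADD creates the item on first use and
    # increments the count on subsequent uses.
    table.update_item(
        Key={"TeamId": team_id, "TagCall": f"{tag}#{call_id}"},
        UpdateExpression="ADD TagCount :one",
        ExpressionAttributeValues={":one": 1},
    )

def calls_with_tag(team_id, tag):
    # begins_with on the range key restricts the query to a single tag,
    # returning one item (and count) per call that uses it.
    resp = table.query(
        KeyConditionExpression=Key("TeamId").eq(team_id)
        & Key("TagCall").begins_with(f"{tag}#")
    )
    return resp["Items"]
```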

Log Tables

Talko is designed to work well on mobile clients even in poor networking conditions. It supports offline usage and synchronizes only what has changed when a client reconnects. This is implemented using a synchronization algorithm where the client tells the service what items it has and the service sends any items that the client is missing.

For example, there is a log for each Talko call that contains items for each short splice of audio and each post within that call. Each item has a sequence number assigned sequentially (1, 2, 3, …). It is possible for the client to receive updates out of order, so it uses ranges to tell the service what items it has; for example, it might send a message saying that it has items with sequence 1–97, 105 and 110–200. Using sequential numbers makes these ranges possible; if we had used something like timestamps, there would be no way to know where the gaps were.

To store this in DynamoDB we have a table that contains the log for every call. The call id is the hash key and the sequence is the range key. To write a new item we first do a lookup to find the highest sequence used so far. (This query can be done efficiently by setting ScanIndexForward to false with a Limit of 1.) We then do a conditional write with the next sequence to guarantee that each sequence is used only once.
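The append path might look roughly like the sketch below, again assuming boto3 and hypothetical names (table "CallLog", hash key "CallId", range key "Seq"). The retry loop is one plausible way to resolve the race where two writers try to claim the same sequence; it is not necessarily how Talko's service handles it.

```python
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

log = boto3.resource("dynamodb").Table("CallLog")

def append_log_item(call_id, payload, attempts=3):
    for _ in range(attempts):
        # Newest item first, limit 1: an efficient lookup of the highest sequence.
        resp = log.query(
            KeyConditionExpression=Key("CallId").eq(call_id),
            ScanIndexForward=False,
            Limit=1,
            ConsistentRead=True,
        )
        next_seq = resp["Items"][0]["Seq"] + 1 if resp["Items"] else 1
        try:
            # Conditional write: only succeeds if no item has claimed this sequence.
            log.put_item(
                Item={"CallId": call_id, "Seq": next_seq, "Payload": payload},
                ConditionExpression="attribute_not_exists(CallId)",
            )
            return next_seq
        except ClientError as e:
            # Another writer won the race; look up the new highest sequence and retry.
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    raise RuntimeError("could not append log item after several attempts")
```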

The query API supports range conditions on the range key, which we use to look up only the items that need to be synced to the client.
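Continuing the hypothetical "CallLog" table from the sketch above, filling a gap the client reported (say it has 1–97 and 105, so sequences 98–104 are missing) is a single range query:

```python
import boto3
from boto3.dynamodb.conditions import Key

log = boto3.resource("dynamodb").Table("CallLog")

def items_in_gap(call_id, first_missing, last_missing):
    # Query only the sequence range the client is missing, e.g. (98, 104).
    resp = log.query(
        KeyConditionExpression=Key("CallId").eq(call_id)
        & Key("Seq").between(first_missing, last_missing)
    )
    return resp["Items"]
```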

Service Discovery

For scalability and reliability we expect that our cloud instances can stop at any time and new instances can be started. In fact, every time we deploy new code we start new instances and take down the old ones. This makes service discovery an important problem for us to solve.

Originally we had used ZooKeeper to keep track of which instances were running. But managing ZooKeeper in the cloud turned out to be a bit of a headache, and we had other issues getting it to work well and to detect instance stops as quickly as we wanted.

Since DynamoDB already provided us with a reliable, distributed source of truth, we decided to use it to store a table of all of our instances. We use the instance type as the hash key and the IP address as the range key. Each instance updates its item in the table every second (the cost of this is small relative to the cost of the instance), and we detect instances that are down by looking for rows that have not been updated for a while. The implementation lets the service continue using cached values if there is an issue updating DynamoDB, and it makes careful use of conditional updates so that we never accidentally remove a live instance from the table.
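A minimal sketch of that scheme follows, assuming boto3 and hypothetical names (table "Instances", hash key "InstanceType", range key "IpAddress", attribute "Heartbeat") and an illustrative 10-second staleness threshold; the conditional delete is one way to realize "never remove a live instance", not necessarily Talko's exact mechanism.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

instances = boto3.resource("dynamodb").Table("Instances")
STALE_AFTER = 10  # seconds without a heartbeat before an instance is presumed down

def heartbeat(instance_type, ip):
    # Each instance refreshes its own row roughly once per second.
    instances.put_item(Item={
        "InstanceType": instance_type,
        "IpAddress": ip,
        "Heartbeat": int(time.time()),
    })

def live_instances(instance_type):
    # All rows for one instance type, keeping only recently updated ones.
    now = int(time.time())
    resp = instances.query(
        KeyConditionExpression=Key("InstanceType").eq(instance_type),
        ConsistentRead=True,
    )
    return [i for i in resp["Items"] if now - i["Heartbeat"] <= STALE_AFTER]

def remove_if_dead(instance_type, ip, last_seen_heartbeat):
    # Conditional delete: only remove the row if the heartbeat has not advanced,
    # so an instance that came back in the meantime is never removed.
    try:
        instances.delete_item(
            Key={"InstanceType": instance_type, "IpAddress": ip},
            ConditionExpression="Heartbeat = :seen",
            ExpressionAttributeValues={":seen": last_seen_heartbeat},
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```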

Note that this solution won’t scale indefinitely since the number of distinct primary keys is limited by the number of instance types we have. But so far this solution has worked well for us, allowing us to do service discovery without requiring additional instances or managing a separate application.

Lessons Learned

  1. Use Lots of Tables: We use over 30 tables. Having lots of tables allows us to specialize the key schemas and indexes to efficiently implement multiple access patterns. For example, the call tagging feature described above uses three tables. In addition to the Team/Tag/Call table, we have a Team/Tag table that holds a subset of the data in a form that is more efficient for querying which tags are used in a team, and a Call/Tag table that we use to determine where in the call each tag is used. Although this is good for efficiency, more tables can cause consistency issues, since there is no way to do updates consistently across multiple tables. We have found that we can minimize the impact of this by carefully ordering operations.
  2. Use Lots of Items: We have lots of tables that track members of a set (e.g. members of a team). We could implement this using a single item per team with a set attribute containing the members, and that may be cheaper for some common access patterns. But we have found that it is almost always preferable to have an item per member. That gives more flexibility to store additional data per member and can still be queried efficiently.
  3. Estimate Costs: DynamoDB has a transparent cost structure that makes it easy to estimate the cost of various features or APIs in terms of the number of read and write units they will consume. By thinking about this as we are designing the APIs we are able to understand the cost of features and optimize our table schemas where necessary.
  4. Avoid Large Items: If you aren’t careful, large items can cause your costs to increase significantly. In our original implementation we stored image thumbnails inline in the call log table. Even though they were only a few KB, this significantly increased the size of items. We found it much more effective to store the thumbnails in S3 and store links to them in DynamoDB. There are other options that make large items more tractable, such as moving the large part of the item to a separate DynamoDB table or using the JSON document support, but we have found it better to avoid large items when possible.
  5. Indexes: Indexes are powerful, but getting them right can be complicated. There are a lot of options and they affect the cost of operations in many ways. Beware of subtleties; for example, the item key must be unique, but there is no way to require uniqueness on a secondary index. We use relatively few indexes, but that is partly because we created many of our tables before indexes were available. Using indexes has allowed us to combine multiple tables into one, improving consistency and ease of use.
  6. Practice Data Migration: When we moved our data from MySQL to DynamoDB we did it table by table and didn’t have any service outages. Even though we only had a handful of users at the time and probably could have done the migration more efficiently, the practice of making significant changes to our live database has been very useful. Since then we have done a number of other data migrations from one table to another within DynamoDB. This has allowed us to adjust table schemas to take advantage of indexes and changes in expected usage.
  7. Retries: Any DynamoDB operation may fail and return an error. DynamoDB client libraries will automatically retry operations for you, and it is worth understanding how the library you are using does this. We have found it very useful to log each failure and retry. That has helped us debug some issues as well as understand the behavior of our system when DynamoDB is failing more often than usual.
  8. Timeouts: The timeout on DynamoDB requests is also important. If the timeout is too long, a failed request takes too long to time out and retry. We have often seen a long timeout followed by quick success on the next retry. To optimize this case we use a shorter timeout (currently one second, although that is too short for some large scan operations) on the first attempt. Then we use a longer timeout on retries in case DynamoDB is just slow.
  9. Conditionals: Because of timeouts and retries, it is important that DynamoDB requests are idempotent. This can be an issue if you use conditional writes. The first attempt may succeed, but the client times out and retries before receiving the response. If this is a conditional operation, for example creating an item with a check that it doesn’t already exist, the retry can fail the condition check and return an error. It is important to recognize and handle this case; one way to do so is sketched after this list.
  10. Understand Behavior at the Limits: In addition to our production deployment we have a completely separate deployment that we use internally to test. Given the small number of users, the load in this environment is uneven. We keep the DynamoDB provisioned throughput low so that we regularly hit our throughput limits. Hitting the limit results in failures and retries and looks much like other types of DynamoDB failures. So hitting limits regularly gives us experience with how failures are handled. That has given us confidence to not provision our production deployment to handle any possible traffic spike — we know that it will retry correctly and continue to function even with some failures.
  11. Prefer Reads Over Writes: Reads are a lot cheaper than writes. For some operations, like ensuring an item exists when it is likely that it already exists, you can optimize your costs by reading before writing. A conditional write to create the item if it doesn’t exist always costs you a write unit. But if you read first to see if the item exists then you can save the write in the case when it already does.
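Here is a minimal sketch combining points 9 and 11, assuming boto3 and a hypothetical TeamMembers table (hash key "TeamId", range key "UserId"). It reads before writing to avoid an unnecessary write unit, and treats a failed condition check on a retried create as the item already existing rather than as an error. This is an illustration of the pattern, not Talko's actual code.

```python
import boto3
from botocore.exceptions import ClientError

members = boto3.resource("dynamodb").Table("TeamMembers")

def ensure_member(team_id, user_id):
    key = {"TeamId": team_id, "UserId": user_id}

    # Cheap strongly consistent read: if the item already exists, skip the write.
    existing = members.get_item(Key=key, ConsistentRead=True)
    if "Item" in existing:
        return

    try:
        # Conditional create so concurrent callers can't produce duplicates.
        members.put_item(
            Item=key,
            ConditionExpression="attribute_not_exists(TeamId)",
        )
    except ClientError as e:
        # A retried request may find its earlier attempt already succeeded;
        # for an "ensure it exists" operation, that is the desired end state.
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```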

I’d love to hear how others are using DynamoDB. Let me know what you think here on Medium, or I can be reached at ransomr@talko.com.
