From Zero to 40 Billion Links: Our Journey Migrating to DynamoDB
At Branch, links are our lifeblood. Every day, our partners create more than 100 million links on our platform, and we serve roughly the same number of link requests. For nearly three years, we have stored these links (currently over 40 billion) in a commercial key value store. It is high performance, with clustering, secondary indexing, UDFs, and scanning functionality. When we started with this legacy system (replacing our previous sharded PostgreSQL setup), we were particularly impressed with its speed and clustering capabilities. It was fast, very fast, and the headroom allowed us to scale successfully through a period of extreme growth as our total service request volume grew from around 50 million requests a day to over 5 billion. Early on, we configured a maximum cluster size of around 60 nodes for our main cluster, and set a replication factor of 2 (one duplicate copy of each data item), which gave us a ton of room to grow.
Fast forward several years to 2017, and we began to run into a variety of challenges as our overall traffic scaled 100x. Our legacy cluster was reaching the maximum node count, and we had already gone through an iteration of upsizing our EC2 instances. Upsizing involved swapping out one node at a time, and letting the data rebalance, which had both a heavy operational cost and some risk. Because of the replication factor of two, data loss could occur if there was another failure during the rebalancing.
We had a wall ahead of us — we could only increase our EC2 instance size one more time before we would need to reconfigure the cluster to allow more than 60 nodes, which required a full shutdown, config edit, and restart: many hours of downtime.
Equally critical, our EC2 instance and storage costs were growing much too rapidly as our business expanded. So we began to explore alternatives. We wanted to be able to:
- Grow both our storage and our throughput at least 10x, ideally with a clear path to even greater growth
- Cut overall expenses
- Have a predictable cost model for that growth, with no worse than linear cost increases despite rapid and/or uneven growth
- Increase our reliability
- Increase data durability
- Decrease operational costs
One area where we could make some trade-offs was in latency. The legacy system was providing consistent sub-millisecond latency, which was great, but faster than we actually needed.
We looked at the usual key-value store suspects, as well as several options for building our own special purpose storage engine. We run a very lean operations team, and we wanted to be sure that a new solution would also be easier on them. Due to its reliability, multiple Availability Zone (AZ) support, scalability, dramatically decreased operational overhead, and also huge cost savings, we went with DynamoDB, Amazon’s native key-value store.
We did not have any operational experience with DynamoDB, and we were moving our most critical dataset into something new with a hard stop on the horizon. There were a lot of things we needed to get right, and very few things that we could afford to have go wrong. Perhaps one of our founders put it best: “I’ve never had a good first experience with a database.”
As we began to dig into DynamoDB, the first issue we faced was designing an appropriate schema to fit our access patterns. In general, our recent links are requested far more frequently than older links, and the link ids have a time-based component. However, we do have a smaller number of popular links that remain very active, sometimes for years. With DynamoDB, we provision read and write capacity per table and need to balance load across the table.
We looked at creating new tables periodically, keeping them small and with much higher provisioned throughput, and then lowering throughput as they aged. However, we were still concerned about uneven load. DynamoDB divides your data into 10GB partitions. At the time we were looking into it, DynamoDB did not have Adaptive Capacity, meaning that each partition is allocated an equal fraction of the total read and write capacity. Even if you have 5000 Write Capacity Units provisioned for your table, if you have 1000 partitions (only 10TB of data), each partition will only have 5 guaranteed writes per second. Since then, DynamoDB has introduced Adaptive Capacity, which dynamically adjusts per partition throughput based on real-time traffic. Because of our data size, we were anticipating thousands of partitions, meaning that each partition might only have an IOP or less per second available. This was potentially fine for old and rarely used data, but could cause problems if too many old but popular links ended up clustered together.
We also looked at strategies for grouping data by using both a partition key and a sort key in DynamoDB, which would have potentially allowed us to, for example, easily iterate over links for a particular partner. However, the downside of grouping links this way was that the read/write volume would not be evenly distributed across our partitions. We also looked at several strategies to move the smaller number of popular links out of the older tables and back into the hot table(s). We discarded this approach because of the increased complexity around data movement, race conditions, and key lookup.
However, in retrospect, we were simply confused by the new set of choices presented by DynamoDB. We were able to solve the problem simply: using only a single primary key instead of the partition+sort key and/or an aging strategy. With proper hashing of the primary key, we would get an even data spread while writing. Distinct reads would also be evenly spread across the partitions (with the exception of some variance), which left only the problem of repeated reads for popular links, which was easily solved using caching.
DynamoDB has an integrated caching solution called DAX (DynamoDB Accelerator), with clients in multiple languages. Because we wanted increased reliability, DAX’s cross-AZ replication also simplified the setup and maintenance of our critical caching layer. With DAX, the read latency for popular and recent links was sub-millisecond, similar enough to what we were getting from the legacy system, and more than fast enough for our needs. The DAX setup process was very simple. The DAX client offers an identical interface to DynamoDB, so we just swapped in the cache-aware client, and immediately began to take advantage of write-through caching.
Rebuilding the Moving Train
Data safety and integrity were our top priorities. Whenever possible, we opted for smaller, simpler changes that were easy to reverse. At Branch, we have a sophisticated system for doing partial feature activation, allowing us to activate for either a percentage of our traffic or specific accounts. We had a basic mechanism in place to gain confidence with the new code, starting with as small as a one percent rollout.
Stage one, we began to dual-write our new data and updates. We wrote each link to the legacy system, and also to DynamoDB, and rolled this out gradually. One significant change was that we have much more control over DynamoDB’s read and write throughput, and needed to carefully pre-provision our capacity, typically done with a few clicks or a CLI command.
Once we had hammered out the schema design, and rolled out our dual-write system, we then began to turn on dual reads, with DynamoDB being the primary. For data not in DynamoDB (initially nearly everything), we’d fall back to reading from the legacy system.
Once we had data flowing in and out of DynamoDB cleanly, we then had a huge set of legacy data to migrate.
Simply scanning through all of our data in the legacy system took several days for each iteration. We did each migration in two steps, one full scan to obtain a list of link IDs to be migrated, and then a second job to migrate individual records in parallel. Although this was inefficient for the legacy system (its scans are faster than individual record lookup), it allowed us to continue the migration in the event of any hiccups — it didn’t provide any capability to restart a scan that has been interrupted, and rewriting the entire dataset was a significant expense.
We ran a single scanning node, and then, once the IDs were retrieved, had six worker nodes that were doing the actual data migration, reading from the legacy system and then immediately writing to DynamoDB. We dramatically scaled up the DynamoDB write capacity during the migration, using conditional writes to avoid overwriting existing data with a stale copy. The dual-write update code path already handled ongoing updates and creates, and ran throughout the migration process.
After completing the initial migration, we did a complete scan of the DynamoDB data to sanity check that nothing had fallen through the cracks. This was one of the places where DynamoDB shined most brightly. We were able to manually scale up the read capacity of a table by nearly 40x (nearly 100k RPS) in a matter of minutes to run a fast scan, maxing out the capacity of our worker nodes. Each scan worker had its own pointer, and could resume in the event of an interruption. Once we were done with the scan, DynamoDB auto-scaled the table back down. (To be fair, the legacy system handled relatively similar throughput without scaling, but we were paying a continuous cost for that Ferrari-like performance.)
Furthermore, we saw no indications that we couldn’t have scaled up DynamoDB all the way up to a million IOPS per second if we needed to (and wanted to pay for it). Because we have thousands of partitions, getting several hundred thousand aggregate reads or writes per second wouldn’t be a dramatic increase on any particular partition.
We can also add new metrics with only two lines of code, and we took advantage of this to track a variety of indicators related to migration, dual writes, and dual reads. Our key metric was the percentage of reads that came from DynamoDB vs. the legacy system. It was very reassuring to add additional nines to that number, and to track down a few additional write paths and measurement errors before we were finally able to stop adding new data to the old system.
After we had migrated all of the data, and checked and double checked our processes and our data, we were ready for the point of no return: stopping the dual writes. After stopwrite, data would no longer be written to the legacy system, and it would be much harder to backfill missing data (we’d need to rerun our migration process, or pull data, from other archival systems). As we slowly scaled up the stopwrite feature activation over a period of several days, we continued to watch our metrics and were pleased to see throughput, latency, and error rates remain constant. In addition, we were very happy to to see DynamoDB traffic reach 100%, meaning we were not writing new data to the legacy system.
Our final step was to stop dual reads once all the data had been deleted from the legacy system and we verified no new data was flowing in from undocumented paths. We then downsized the legacy cluster a node at a time until we had only non-link data — we continue to use that same (but much smaller) cluster for other data.
- We need to be more careful about read/write capacity in DynamoDB, as we don’t have nearly as much headroom as we did with the legacy system. Autoscaling is very good for gradual traffic pattern waves, but responds over a period of minutes. Our own scan jobs need to have their capacity pre-provisioned, or ramp up slowly to avoid throttling.
- One clear mistake was our initial attempt to run the entire migration through DAX. Although DynamoDB scaled easily with all our capacity needs and has been very reliable, we found that we could overwhelm DAX pretty quickly, particularly with our 1 primary + 3 replicas setup. Eventually we just wrote legacy data directly to DynamoDB, bypassing DAX. In retrospect, we didn’t want to pollute our cache layer with a large amount of older data anyway.
- We have spent considerable time fine-tuning our DAX setup, and it continues to be a work in progress as we balance load, throttling, and DAX vs. DynamoDB read costs.
- We ended up moving from a Global Secondary Index to a separate lookup table to enforce a unique constraint: GSIs do not provide this.
Many thanks to my team members Ryan Churaman and Michael Nordberg, who did a ton of work to make this all go smoothly!
The new DynamoDB system:
- Costs about ⅓ as much as the legacy system
- Has 3x replication in multiple Availability Zones (vs. 2x in a single AZ for the legacy system)
- Can (and has) scaled I/O throughput up 40x+ when necessary in a matter of minutes
- Takes advantage of DynamoDB’s auto-scaling feature throughout the day as our traffic ebbs and flows
- Gives us a very predictable runway for at least another 10x growth cycle
- Runs as a managed service, freeing up many cycles for our operations team.
- Provides consistent ~2-3ms read latency, with sub-millisecond latency for new and popular items in the DAX cache
We still use the legacy system effectively for smaller, higher-churn data sets where we need the speed and/or where paying by IOP is more expensive. However, with tens of billions of links, and tens of terabytes of data migrated and in active use, we can say that we’ve had a good first experience with DynamoDB.
Interested in the engineering challenges discussed here? Come help us out, and checkout https://branch.io/careers/