How to Change a Production Database with Confidence and No Downtime
A story from the LIFE Fasting Tracker mobile development team
I first heard about our database trouble from a PagerDuty alert at 7 AM on a Saturday. Not yet out of bed, I had just finished planning how I would spend my day. Instead, the day was consumed with analyzing, debugging and tweaking SQL queries until our performance metrics were back in an acceptable range.
The immediate performance of our system was good enough, but I knew that the next couple of months would be filled with designing an architecture that would need to scale for our future growth. After implementing the new design, seamlessly switching databases without disrupting 50k app users was going to be the next challenge.
An architecture for the future
After that day of hacking, our database was performing well enough to give us some breathing room, and we could look at the bigger picture. When we analyzed our mobile app's performance trend over a couple of days, it was clear that the current system would not scale to fit the growth we expected over the next year. Our LIFE Fasting Tracker app is heavily focused on social interactions, and the time to load the social feed updates of others was getting slower as more users signed up.
What we needed was a storage system that would stay fast even as our users multiplied. The system that we had in place was built around PostgreSQL and involved many JOINs to collect all of the updates from people in one’s fasting Circles. As the number of users grew, the cost of some of those JOINs was growing too.
Here is a picture that shows the ever-increasing worst-case request time for one of the services that was slowing down:
A new design and dark deployment
We had originally used AWS Aurora PostgreSQL for this particular system because it wasn’t clear exactly what sort of data shape or queries would be best for our use cases. The flexibility of SQL was nice, but we needed to shift to the less flexible, more scalable behavior given by DynamoDB. We have a long history of DynamoDB at LifeOmic and we’ve always had great results with both performance and cost. After some design sessions, we agreed on a new architecture based on Kinesis and DynamoDB. The central tenet was to move all slowness into background event processing and to make each query a constant time lookup from DynamoDB. Maybe the details of that design will come out in a future story, but I think the process of the real production migration was the more interesting part of the work.
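The shape of that idea can be sketched with in-memory stand-ins for the real services. Everything here is illustrative: the names (`publishUpdate`, `processStream`, `getFeed`) and the follower lookup are hypothetical, and a real implementation would use the Kinesis and DynamoDB clients. The point is that the expensive fan-out happens in the background consumer, so a feed read is a single key lookup with no JOINs.

```typescript
// In-memory sketch of fan-out-on-write (hypothetical names, not our real API).
type Update = { id: string; authorId: string; createdAt: number };

// Stand-in for the Kinesis stream: events awaiting background processing.
const stream: Update[] = [];

// Stand-in for a DynamoDB table keyed by userId, holding precomputed feeds.
const feedTable = new Map<string, Update[]>();

// Circle membership: which users follow which author (assumed lookup).
const followers = new Map<string, string[]>([["alice", ["bob", "carol"]]]);

// Writes are cheap: just append the event to the stream.
function publishUpdate(update: Update): void {
  stream.push(update);
}

// The background consumer does the expensive fan-out, writing one feed
// entry per follower so reads never need to join across users.
function processStream(): void {
  while (stream.length > 0) {
    const update = stream.shift()!;
    for (const followerId of followers.get(update.authorId) ?? []) {
      const feed = feedTable.get(followerId) ?? [];
      feed.unshift(update); // newest first
      feedTable.set(followerId, feed);
    }
  }
}

// Reading a feed is a single constant-time key lookup.
function getFeed(userId: string): Update[] {
  return feedTable.get(userId) ?? [];
}
```

The trade-off is classic: writes do more work (one feed entry per follower) so that reads do almost none, which is exactly the property that keeps query time flat as users multiply.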
With a new design in hand and the mobile app already heavily in use, we had to roll things out without causing any trouble. The first step was to get the new DynamoDB tables and Kinesis streams deployed and in use. To ensure that everything was working, each write was modified to send updates to both the old and new systems in parallel. To make sure that no existing clients would be affected, all code associated with the new writes was protected with aggressive timeouts and error handling so that any failure or unexpected slowness would go to the logs and not our users. Of course there were a couple cases of both, but with the logs they were quickly found and fixed. Soon enough, both storage systems were being populated, and performance looked good. But the real test would be if the data could be read quickly.
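A guarded dual write along those lines might look like the following sketch (function names and the timeout value are assumptions, not our production code). The legacy write stays on the critical path, while the new-system write races a timeout and swallows its own errors so that failures reach the logs instead of the user.

```typescript
// Reject the given promise if it does not settle within `ms` milliseconds.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// Write to both systems in parallel. Only the old-system write can fail the
// caller; the new-system write is capped by a timeout and its errors are
// logged rather than propagated.
async function dualWrite(
  writeOld: () => Promise<void>,
  writeNew: () => Promise<void>
): Promise<void> {
  const guardedNew = withTimeout(writeNew(), 250).catch((err) => {
    console.warn("new-system write failed", err); // logs, not users
  });
  await Promise.all([writeOld(), guardedNew]);
}
```

Because the guard catches both slowness and errors, the worst a broken new-system write can do is add a log line, which is exactly what made the early failures easy to find and fix.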
Even fast queries aren’t good if the data returned is wrong, and with so many changes we wanted to be sure the data was consistent. Like the dual writes, we implemented parallel reads from both systems for all queries so that we could compare the data and performance. The results of the queries are essentially an ordered list of UUIDs, so the two results can easily be compared. With some code added to compare the results, the logs were immediately full of warnings about data mismatches.
With thousands of queries, we needed a way to quickly find out how inconsistent the data was. Here is an example of what we ended up logging:
In the picture above you can see some analytics that were added to all mismatches to understand what sort of differences were found. We added the overlapDetails attribute to log messages that had partially overlapping update lists. With the extra attribute, we used SumoLogic filtering to find the most interesting mismatches. It turned out that, while there were a lot of logged mismatches, almost all of them were like the one above: one of the data sources was slightly ahead of the other and included one extra entry at the start, a long string of matches, and then one less element at the end. That sort of mismatch is fine, because it was caused by a slightly slower write time in the new system. Other than those time shifts, there were only a handful of other inconsistencies, and they were also minor reorderings caused by timing differences.
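A comparison like ours can be sketched as follows; the shape of the details object is hypothetical, but it captures the idea of classifying a mismatch as either a benign time-shifted window or a real inconsistency worth investigating.

```typescript
// Illustrative mismatch analytics for two ordered UUID lists (assumed shape,
// not our exact production overlapDetails).
type OverlapDetails = {
  identical: boolean;
  extraAtStart: number;   // entries only at the head of the new list
  missingAtEnd: number;   // entries only at the tail of the old list
  shiftedWindow: boolean; // true when the overlap is a clean time shift
};

function compareResults(oldIds: string[], newIds: string[]): OverlapDetails {
  if (oldIds.length === newIds.length && oldIds.every((id, i) => id === newIds[i])) {
    return { identical: true, extraAtStart: 0, missingAtEnd: 0, shiftedWindow: false };
  }
  // Count new-list head entries absent from the old list.
  let extraAtStart = 0;
  while (extraAtStart < newIds.length && !oldIds.includes(newIds[extraAtStart])) {
    extraAtStart++;
  }
  // Count old-list tail entries absent from the new list.
  let missingAtEnd = 0;
  while (
    missingAtEnd < oldIds.length &&
    !newIds.includes(oldIds[oldIds.length - 1 - missingAtEnd])
  ) {
    missingAtEnd++;
  }
  // A clean time shift: after trimming, the remaining sections match exactly.
  const newMiddle = newIds.slice(extraAtStart);
  const oldMiddle = oldIds.slice(0, oldIds.length - missingAtEnd);
  const shiftedWindow =
    newMiddle.length === oldMiddle.length &&
    newMiddle.every((id, i) => id === oldMiddle[i]);
  return { identical: false, extraAtStart, missingAtEnd, shiftedWindow };
}
```

Logging a structured object like this is what makes the log filtering work: benign shifts can be excluded with one query, leaving only the mismatches that matter.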
At this point, both storage systems were fully deployed and both were being used for all reads and all writes to our system. It was a worst case scenario in terms of cost — double work being done for all requests — but it gave us a very good environment to compare performance. Here is a graph of the two systems running in parallel with average and worst case times for both systems. We saw just what we wanted: the new system performance is much better than before and is staying just as fast as users and data grow.
Gradually migrating users
The data from the new system looked correct and the performance looked great. It was time to start switching users to the new system as their primary source. It might seem like switching users should be risk-free, because so much data validation had already been done, but there were some slight differences between the two approaches that could not be easily validated on the server alone. Like most new applications, we already had a system for feature toggles, but the existing system only allowed for toggling features on or off by account or by individual user. Instead of enabling individuals, we wanted to enable the new storage system for 5% or 10% of the users at a time. It took an extra day of work to add support for non-binary feature toggles to our system, but it paid off for us, and the rest of the company can use it in the future.
With a system of gradual feature toggle in place, we started by enabling the new storage system for 5% of users and let it sit for a day. No errors were detected, so we upped to 10% and then 20%. We found a couple of edge case bugs, quickly fixed them, and let the system cruise at 20% for a bit.
The details of feature enablement by percentage are interesting enough to justify a little more explanation. With boolean feature toggles, a user is explicitly enabled or not and the answer doesn’t change unless the feature toggle value is changed. With non-binary toggle values, like 5%, deciding whether the toggle is enabled for a user is less obvious. There are a couple of behaviors that we wanted to ensure.
- At a toggle value of 5%, 5% of users should be enabled, and the toggle should be enabled for every request to the server for those users.
- If an individual user is enabled at the 5% mark, then they are also enabled at >5%.
- The user enablement is spread out uniformly across the range of toggle values. A bad distribution would be all @gmail.com users being enabled at X% and all Facebook users being enabled at Y%. Ideally, gmail and Facebook users would be spread across the whole range of values.
We open sourced a small library to help with that consistent and uniform distribution of users in case others are looking for the same behavior.
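One common way to get all three properties, sketched below, is to hash the user id to a stable bucket in the range [0, 100) and enable the toggle when the bucket falls below the percentage. This is an illustration of the technique, not the code of the open-sourced library: the answer is the same on every request because the hash is deterministic, a user enabled at 5% stays enabled at higher values because their bucket does not move, and a decent hash spreads users uniformly regardless of email domain or sign-up source.

```typescript
// Map a user id to a stable bucket in [0, 100) using FNV-1a, a simple,
// well-known non-cryptographic hash.
function bucketFor(userId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

// A user is enabled when their bucket is below the toggle percentage, so
// raising the percentage only ever adds users, never removes them.
function isEnabled(userId: string, percent: number): boolean {
  return bucketFor(userId) < percent;
}
```

Rolling back is just as clean: setting the percentage to 0 disables everyone immediately, which mattered later in this story.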
Excitement just before the end
We were approaching the end of the migration. We raised the percentage of users on the new system to 30%, 50% and then 75%. We started to get reports from users that one part of the application didn’t look quite right. The list of updates from their friends had the right entries, but it was showing that the lengths of people’s fasts were always ‘0 hours.’
Thanking ourselves for using feature toggles to control storage use, we immediately changed the feature toggle value back to 0%, and within minutes the application was working correctly again. Without the stress of bad behavior weighing on us, the problem could be debugged and fixed calmly and carefully. There was no need for another all-day hack-fest from the start of this story.
After a peaceful night’s sleep, a fix for the problem was implemented, pushed to production, and we were back to 75% users on the new system by lunch time the next day. Soon enough we were at 100% and we were not hearing any complaints about any unexpected behavior.
Picture of success
Up until now, all reads and writes were still using both systems so that we could compare and safely roll back if needed. That had already paid off in terms of confidence and our ability to quickly react to the ‘0 hour’ goal bug. After a couple days of quiet, it was time to make the final switch to stop using the old system completely. A change was promoted that stopped looking at the feature toggle value and always used the new system. See if you can tell when the switch over happened:
That picture validated that our new architecture was worth the work! Building on it, we’ve been able to add more social features, including fasting Circle specific announcements, one-to-one status updates and filtering updates by author, without seeing reduced performance.