How to Change a Production Database with Confidence and No Downtime

A story from the LIFE Fasting Tracker mobile development team

An architecture for the future

After that day of hacking, our database was performing well enough to give us some breathing room, and we could look at the bigger picture. When we analyzed the performance trend of our mobile app over a couple of days, it was clear that our current system would not scale to the growth we expected over the next year. Our LIFE Fasting Tracker app is heavily focused on social interactions, and loading the social feed of other users' updates was getting slower as more users signed up.

A new design and dark deployment

We had originally used AWS Aurora PostgreSQL for this particular system because it wasn’t clear exactly what sort of data shape or queries would be best for our use cases. The flexibility of SQL was nice, but we needed to shift to the less flexible, more scalable behavior of DynamoDB. We have a long history with DynamoDB at LifeOmic and we’ve always had great results with both performance and cost. After some design sessions, we agreed on a new architecture based on Kinesis and DynamoDB. The central tenet was to move all slowness into background event processing and to make each query a constant-time lookup from DynamoDB. Maybe the details of that design will come out in a future story, but I think the process of the real production migration was the more interesting part of the work.
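The general idea of trading write-time work for constant-time reads can be sketched in a few lines. This is only an illustration of the principle, not the design we deployed: the in-memory `events` queue stands in for a stream such as Kinesis, the `feeds` dictionary stands in for a key-value table such as DynamoDB, and every name in it is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for a stream and a key-value table.
followers = defaultdict(set)   # author_id -> ids of users who follow that author
feeds = defaultdict(list)      # user_id -> precomputed update ids, newest first
events = deque()               # pending (author_id, update_id) events


def publish_update(author_id, update_id):
    """Request path: enqueue the event and return without touching any feed."""
    events.append((author_id, update_id))


def process_events():
    """Background path: fan each update out to every follower's stored feed."""
    while events:
        author_id, update_id = events.popleft()
        for follower_id in followers[author_id]:
            feeds[follower_id].insert(0, update_id)


def read_feed(user_id):
    """Read path: one key lookup, regardless of how many users exist."""
    return list(feeds.get(user_id, []))
```

The slow part (walking the follower list) happens in `process_events`, away from the user's request, while `read_feed` only fetches one precomputed entry.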

Data correctness

Even fast queries aren’t good if the data returned is wrong, and with so many changes we wanted to be sure the data was consistent. As with the dual writes, we implemented parallel reads from both systems for all queries so that we could compare the data and performance. The result of each query is essentially an ordered list of UUIDs, so the two results can easily be compared. Once we added code to compare the results, the logs were immediately full of warnings about data mismatches.
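A comparison like this can be quite small. The following is a minimal sketch under stated assumptions, not our production code: `read_old` and `read_new` are hypothetical callables that each return the ordered list of UUID strings for a user, and the old store stays the source of truth while the new one is only observed.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("feed_compare")  # hypothetical logger name


def first_mismatch(old_ids: List[str], new_ids: List[str]) -> int:
    """Return the first index where the ordered lists differ, or -1 if equal."""
    for index, (old_id, new_id) in enumerate(zip(old_ids, new_ids)):
        if old_id != new_id:
            return index
    if len(old_ids) != len(new_ids):
        # One list is a strict prefix of the other.
        return min(len(old_ids), len(new_ids))
    return -1


def read_with_comparison(
    user_id: str,
    read_old: Callable[[str], List[str]],
    read_new: Callable[[str], List[str]],
) -> List[str]:
    """Read from both stores, log any difference, and return the old result."""
    old_ids = read_old(user_id)
    try:
        new_ids = read_new(user_id)
    except Exception:
        # A failure in the store under test must never break the request.
        logger.exception("new store read failed for user %s", user_id)
        return old_ids
    index = first_mismatch(old_ids, new_ids)
    if index != -1:
        logger.warning(
            "feed mismatch for user %s at index %d (old=%d ids, new=%d ids)",
            user_id, index, len(old_ids), len(new_ids),
        )
    return old_ids
```

Logging the position of the first difference, rather than just "mismatch", makes it easier to tell ordering problems apart from missing or extra entries.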

Comparing performance

At this point, both storage systems were fully deployed and both were being used for all reads and all writes to our system. It was a worst-case scenario in terms of cost, with double the work being done for every request, but it gave us a very good environment to compare performance. Here is a graph of the two systems running in parallel with average and worst-case times for both systems. We saw just what we wanted: the new system performed much better than the old one and stayed just as fast as users and data grew.
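Collecting the numbers behind a graph like that does not take much. As a rough sketch, assuming the reads are plain function calls and that keeping samples in memory is acceptable (a real service would more likely emit them to a metrics system), the hypothetical `LatencyStats` and `timed` helpers below record one duration per call and report the average and worst case:

```python
import time
from statistics import mean


class LatencyStats:
    """Keeps call durations (in seconds) for one storage system."""

    def __init__(self):
        self.samples = []

    def record(self, seconds):
        self.samples.append(seconds)

    def average(self):
        return mean(self.samples)

    def worst(self):
        return max(self.samples)


def timed(stats, func, *args):
    """Call func(*args), recording how long it took even if it raises."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        stats.record(time.perf_counter() - start)
```

With one `LatencyStats` instance per storage system, wrapping each read in `timed` gives directly comparable average and worst-case figures for the same traffic.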

Gradually migrating users

The data from the new system looked correct and the performance looked great. It was time to start switching users to the new system as their primary source. It might seem like switching users should be risk free, because so much data validation had already been done, but there were some slight differences between the two approaches that could not be easily validated on the server alone. Like most new applications, we already had a system for feature toggles, but it only allowed toggling features on or off by account or by individual user. Instead of enabling users one at a time, we wanted to enable the new storage system for 5% or 10% of the users at a time. It took an extra day of work to add support for non-binary feature toggles to our system, but it paid off for us, and the rest of the company can use it in the future.

We wanted the percentage-based toggles to have two properties:

  • If an individual user is enabled at the 5% mark, then they are also enabled at every value above 5%.
  • The user enablement is spread out uniformly across the range of toggle values. A bad distribution would be all Gmail users being enabled at X% and all Facebook users being enabled at Y%. Ideally, Gmail and Facebook users would be spread across the whole range of values.
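One common way to get both properties is to hash each user into a fixed bucket and compare the bucket with the toggle percentage. The sketch below shows that general technique and is not necessarily how our toggle service does it; `rollout_bucket`, `is_enabled`, and the feature name used as a salt are all hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99.

    Hashing the ID (salted with the feature name) spreads users evenly
    no matter how the IDs were issued or which login provider they use.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percentage: int) -> bool:
    """True when the user's bucket falls below the rollout percentage."""
    return rollout_bucket(user_id, feature) < percentage
```

Because a user's bucket never changes, anyone enabled at 5% stays enabled at every higher value, and because the bucket comes from a hash rather than from the email domain or sign-up order, each increase pulls in a mixed sample of users. Salting with the feature name means different features do not always pick the same early users.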

Excitement just before the end

We were approaching the end of the migration. We raised the share of users on the new system to 30%, then 50%, and then 75%. We started to get reports from users that one part of the application didn’t look quite right. The list of updates from their friends had the right entries, but it was showing that the lengths of people’s fasts were always ‘0 hours.’

Picture of success

Up until now, all reads and writes were still using both systems so that we could compare and safely roll back if needed. That had already paid off in terms of confidence and our ability to quickly react to the ‘0 hour’ goal bug. After a couple of days of quiet, it was time to make the final switch and stop using the old system completely. We promoted a change that stopped looking at the feature toggle value and always used the new system. See if you can tell when the switchover happened:

Life and Tech @ LifeOmic

LifeOmic and LIFE Fasting Tracker blog

Written by Matt Lavin

A software engineer from birth who's slowly becoming a geek in all aspects of life. Spending my free time trying to improve my health, relationships and finances.
