Wave Invoicing Migration — Part 3

Emmanuel Ballerini
Engineering @ Wave
Nov 24, 2020 · 10 min read

By the end of our first project to migrate invoice reminders, we felt accomplished and optimistic about the future of the Wave Invoicing migration. We had real users working with an important part of our new system, and the migration engine we’d built to get them there had done its job well.

But, as with most early victories, the most difficult times were yet to come. We have many stories we could tell about our journey, but in this post we’ll describe the parts we thought would be most valuable to others attempting something similar.

Incremental Problems

As described previously, we decided to take an incremental approach to this migration. We told the story of building our new invoice reminders implementation and migrating all of our users to it. The end result was that migrated users were using a hybrid of the new and old invoicing systems whenever they logged into Wave.

In theory, this is a helpful approach. One of the major benefits of delivering a project in small increments is that it reduces your delivery risk: Mistakes or design flaws that go unnoticed in construction are much more likely to be revealed when real users get their hands on the product. By delivering in increments, you can prevent these unnoticed flaws from invisibly piling up on one another, and the feedback you get from your customers can be incorporated into the project much earlier.

You may be expecting us to say that we continued to proceed in this fashion: building another feature in the new system, then migrating everyone to it, until the old system was completely gone.

In reality, things didn’t quite work out this way.

After reaching our first milestone, we quickly ran out of features that we could easily “extract” from the old system without dragging a bunch of related things along. This is the reality of working with legacy systems: Much of the time, the complexity of breaking dependencies and rebuilding small, self-contained parts of the system results in a prohibitive amount of overhead.

What we ended up doing instead was taking a bit of a gamble: We spent the next several months building the entirety of the new system in more or less one shot. We still continuously deployed changes to the new system to production so that we could test them internally, but none of these changes were exposed to our customers. These long months spent without concrete user feedback were very difficult for the team, and for the rest of the organization.

This decision certainly increased our delivery risk. The more we built without concrete user feedback and live production experience, the more likely it was that some big mistakes would fly under the radar. However, this risk was mitigated by the fact that the problem we were solving wasn’t entirely new to us: The team’s experience building and maintaining the original invoicing system made us more confident in some of the harder technical decisions we were making in the new one.

Eventually, when we were done with this “big build”, we rolled it out in phases to new customers.

We were able to do this part in smaller increments:

  • We first enabled the new system in regions where Wave does not support online payments. This gave us real experience in monitoring and operating the new system while we rebuilt the online payment integrations from the old one.
  • We then ported our two online payment integrations (one powered by Stripe, the other by Wave Payments) one by one, making the new system available where those payment options are used: outside North America first (Stripe), then in the US and Canada (Wave Payments).

We want to emphasize the courage and resilience it takes both at the team and organizational level to see a project essentially “stand still” in terms of user adoption for months while a significant rebuild is going on. We were as transparent as possible at each stage about what we knew and what we didn’t, and were open about how we were making the decisions that were driving us forward. This hard work on the “soft stuff” maintained trust and connection with the rest of the organization, and resulted in a lot of support from all of Wave when things grew especially difficult.

It was also extremely hard for our team to go for so long without the satisfaction of shipping our work to customers. Prior to this migration project, we would deploy new features or fixes to our users many times a week, and a lot of our motivation came from seeing our work be put to good use. It was a long time before we could do this again.

It Works–Now Scale It

Our problems didn’t end here! The daunting task of migrating the rest of our customers still remained before us.

Early estimates suggested that it could take months to move them all. We needed to find ways to reduce that timeline, and we needed to do it quickly.

In the early stages of the migration, we were uncertain how numerous and how severe the problems in the source data might be. The original invoicing system had been running for years, and old data always contains new surprises.

For this reason, we made the (possibly) surprising decision to migrate our most active users first. These users were more likely to have large, complex data. We believed that by migrating them first, we would be able to detect the worst problems as early as possible, and better understand what we were facing.

At the same time, we also needed to increase the throughput of our migration process. The first version that we’d built for Invoice Reminders had served more as a proof-of-concept, and we needed to mature this tool to handle the scale of the data we were migrating.

We had many ideas of how to do this, so we established some early criteria to help us evaluate and prioritize them:

  • Data integrity: Above all else, we wanted to make sure the data we migrated was 100% correct. We could not afford to have our customers encounter data integrity issues that would shake their confidence in Wave.
  • User experience: Since our customers would be fully logged out of their account during the migration process, we wanted to do our best to make sure this happened when they were offline.
  • Safety: As much as possible, we wanted to handle migration problems when the whole team was available to provide support.

Using these principles, we made the following changes to the migration process:

  • Users were only migrated outside of their business hours, which we approximated based on their timezone (a sketch of these scheduling checks follows this list).
  • Users were not migrated in time periods where they had important tasks scheduled to run (such as sending a pre-scheduled or recurring invoice). There was some complexity in making sure these scheduled tasks happened safely mid-migration, so we chose to avoid the problem altogether.
  • Wave’s infrastructure is under heavy load most mornings due to a number of upkeep processes that run at that time, so we forbade migrations from running during this 4-hour period to prevent them from being slowed down or disrupted by resource scarcity.
  • We improved the operational tools for the engineers who were responsible for defining and scheduling migration cohorts.
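To make the scheduling constraints concrete, here is a minimal sketch of the kind of eligibility check described above. It is illustrative only: the helper names and the exact windows (business hours of 9 a.m. to 6 p.m., a two-hour migration window, a 6:00–10:00 UTC busy period) are assumptions for the example, not our actual implementation.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Hypothetical daily window (UTC) when upkeep processes put the platform
# under heavy load; migrations are not allowed to run during it.
INFRA_BUSY_START_UTC = time(6, 0)
INFRA_BUSY_END_UTC = time(10, 0)


def is_infra_busy(now_utc: datetime) -> bool:
    """True during the daily 4-hour window of heavy platform load."""
    return INFRA_BUSY_START_UTC <= now_utc.time() < INFRA_BUSY_END_UTC


def is_outside_business_hours(now_utc: datetime, user_tz: str) -> bool:
    """Approximate the user's business hours (9 a.m. to 6 p.m.) from their timezone."""
    local = now_utc.astimezone(ZoneInfo(user_tz))
    return not (time(9, 0) <= local.time() < time(18, 0))


def can_migrate_now(user, now_utc: datetime) -> bool:
    """Apply the scheduling constraints from the list above."""
    if is_infra_busy(now_utc):
        return False
    if not is_outside_business_hours(now_utc, user.timezone):
        return False
    # Skip users with important work (e.g. a recurring invoice) scheduled to
    # run near the migration window. has_scheduled_tasks_between is assumed.
    window_end = now_utc + timedelta(hours=2)  # assumed upper bound on duration
    return not user.has_scheduled_tasks_between(now_utc, window_end)
```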

This brought us to the point where we could continuously run migrations at high throughput for about 20 hours every day, seven days a week, which drastically reduced the migration timeline.

Incidental Details

While we managed to get through the bulk of this risky project without causing any major problems for ourselves or our customers, we ran into a couple of serious production incidents when we reached our maximum migration throughput.

The first issue occurred when we began to push the limits of our AWS infrastructure. Our primary mechanism for controlling migration throughput was the number of background workers that were running migrations at any given time. As we progressively increased the number of workers, we used our Datadog dashboards and alerts to closely monitor several important system health metrics, including the CPU utilization and the number of IOPS (input/output operations per second) of our database.

These metrics noticeably increased at first, but not to the point of worrying us. However, at a certain point, some of our customers began experiencing errors while trying to use Wave Invoicing.

We noticed on our dashboard that IOPS had suddenly increased significantly. After doing some research, we realized that we had depleted our burst balance. In a nutshell, based on the RDS instance type, there is an IOPS limit that should not be exceeded for sustained periods of time. Once above it, you start consuming what is referred to as the burst balance, which drains until it reaches zero (which can happen in a matter of minutes, depending on the load). Once that happens, database performance grinds to a halt, resulting in read/write errors, which was exactly what our customers were experiencing. The only recourse was to reduce I/O as much as possible so that the burst balance could replenish.
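To give a rough sense of how quickly this can happen, consider the published parameters for a gp2 EBS volume (a common RDS storage type): a baseline of 3 IOPS per GiB (minimum 100), bursting up to 3,000 IOPS against a credit bucket of 5.4 million I/O credits. The volume size and load below are made up for illustration; they are not our actual configuration.

```python
# Back-of-the-envelope gp2 burst depletion (illustrative numbers only).
volume_gib = 500                               # hypothetical volume size
baseline_iops = max(100, 3 * volume_gib)       # gp2 baseline: 3 IOPS/GiB, min 100
burst_iops = 3000                              # gp2 burst ceiling
credit_bucket = 5_400_000                      # I/O credits when the bucket is full

sustained_iops = burst_iops                    # e.g. migration workers pinning I/O
drain_per_second = sustained_iops - baseline_iops
minutes_to_empty = credit_bucket / drain_per_second / 60

print(f"Burst balance empty after ~{minutes_to_empty:.0f} minutes")  # ~60 here
```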

The graphs below illustrate some of our key metrics during this incident. We can see the burst balance drop to zero shortly after 2 p.m., but we had coincidentally stopped the migration process at that time and didn’t even notice the problem. This allowed the burst balance to recover until we resumed around 3 p.m. It dropped to zero once more, but this time it stayed there for a long period of time–long enough for someone to get paged and stop the process.

Diagram 3: graphs during an incident

At the beginning of the migration, we monitored a large number of metrics to make sure we weren’t doing anything unsafe. This incident taught us a hard lesson: pay special attention to the burst balance, and alert on large decreases in it.
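We did this monitoring in Datadog, but the underlying signal is the BurstBalance metric (a percentage) that RDS publishes to CloudWatch. For anyone setting up something similar, a minimal sketch of checking it directly with boto3 might look like the following; the instance identifier and the 50% threshold are placeholders, not our real values.

```python
from datetime import datetime, timedelta, timezone

import boto3  # AWS SDK for Python

cloudwatch = boto3.client("cloudwatch")


def latest_burst_balance(db_instance_id: str) -> float:
    """Fetch the most recent RDS BurstBalance datapoint (a percentage)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else 100.0


# Placeholder identifier and threshold; page someone well before zero.
if latest_burst_balance("invoicing-db") < 50.0:
    print("BurstBalance below 50% -- throttle or pause the migration workers")
```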

Our second major production incident was caused by a bug in one of our migration scripts.

In our first post, we described how it was very important for us to answer the question, “Has this user been migrated?” The primary store of this information was our central Identity system, but for the purposes of the Invoicing migration, we also kept a finer-grained migration status for each user locally.

Occasionally, a small number of users would start their migration, only to get stuck in an “in progress” state. Among other things, this meant they would be locked out of their account until someone manually fixed the issue for them.

This happened infrequently enough that we decided to just fix the symptom rather than investigate all of the minor things that could be causing the issue in the first place. We wrote a script that would return users to a state that permitted them to log in, and we ran it periodically to clean up the few accounts that had gotten stuck since the last run.

We knew that some of the users stuck in this state had been “partially” migrated: their data had been copied over to the new Invoicing system, but their verification checks had failed. Before we retried their migration at a later date, we wanted to delete this unused data from the new system.

One morning, we noticed that more users had fallen into this problematic state, so we ran our script. We had used it many times before, and as usual, everything seemed to work fine–until we got reports from users who were concerned that all of their invoices had disappeared!

It turned out that these users had already been migrated. We discovered that our script relied too heavily on the local migration status instead of consulting the “source of truth” in Identity. In this case, Identity knew that these users had been migrated, but our less reliable local migration state suggested they hadn’t. As a result, the script treated their accounts as partially migrated and deleted their data from the new system, the system they were now actively using, in preparation for re-running their migration. This affected about 170 of our most active users and significantly disrupted their business operations.
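In retrospect, the missing guard was small. Here is a hedged sketch of the kind of check the script needed, with hypothetical names (identity_client, local_status) standing in for our central Identity client and the local migration record:

```python
def is_safe_to_clean_up(user_id: str, identity_client, local_status) -> bool:
    """Decide whether a stuck account can be treated as partially migrated.

    identity_client and local_status are hypothetical stand-ins for the central
    Identity service client and the local, finer-grained migration record.
    """
    # Consult the source of truth first: if Identity says this user is migrated,
    # their data in the new system is live and must never be deleted.
    if identity_client.is_migrated(user_id):
        return False
    # Only then trust the local record to tell us the migration is incomplete.
    return local_status.state in ("in_progress", "verification_failed")
```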

With the help of our operations team and other experienced Wavers who pitched in to help, we spent the next 2 days painfully restoring their data from backups.

The lessons we suggest you take away from this are:

  • When you delete data, be even more careful than you think you need to be. For example, consider adding a dry-run mode to scripts that delete data, so that you can run them beforehand to verify the outcome is what you expect (see the sketch after this list).
  • Decide on a source of truth, and then listen to it! If synchronization to a local system is required for one reason or another, consider checking the source of truth in addition to the local state when making risky decisions on that information.
  • This shouldn’t be news, but make sure to always have a recent backup and know how to restore from it. In this case, ours was five minutes old so we didn’t lose out on much.
  • Be transparent with your users when something like this happens. It was difficult to let our customers know that we’d made a mistake that put their accounts at risk, but we wanted them to understand the severity of the problem so that they could plan their business operations around it.
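As an illustration of the first point, here is a minimal sketch of a destructive script that defaults to a dry run and only deletes when explicitly told to. delete_new_system_data is a hypothetical helper; the point is the guard around it.

```python
import argparse


def clean_up_partial_migrations(user_ids, dry_run=True):
    """Delete copied-but-unverified data for partially migrated users."""
    for user_id in user_ids:
        if dry_run:
            # Report what would happen so a human can verify the outcome first.
            print(f"[dry-run] would delete new-system data for {user_id}")
        else:
            delete_new_system_data(user_id)  # hypothetical destructive helper


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("user_ids", nargs="+")
    # Destructive behaviour must be requested explicitly; the default is safe.
    parser.add_argument("--execute", action="store_true",
                        help="actually delete data (default is a dry run)")
    args = parser.parse_args()
    clean_up_partial_migrations(args.user_ids, dry_run=not args.execute)
```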

Conclusion

From start to finish, this project took more than a year: we built a new invoicing service and migrated more than ninety million invoices and over half a billion related records. While this was a huge undertaking, it also unlocked the ability to finally improve our invoicing product. The team can now focus on what matters most: building new features to help our business owners get paid more easily and faster.

We hope that you enjoyed learning about our journey through this series of posts and that perhaps it will help you in a future endeavour!

Co-authored by Jason Colburne and Emmanuel Ballerini, edited by Michael DiBernardo. Thanks to Joe Pierri, Nick Presta and Joe Crawford for the reviews.
