Wave Invoicing Migration — Part 2

Emmanuel Ballerini
Engineering @ Wave
Oct 15, 2020 · 9 min read

In the last post, we introduced the extensibility problem we faced with our original Wave Invoicing system in 2018 and how difficult it was to improve it.

Our biggest obstacle was our fear of the unknown. It is extremely difficult to reason about the issues you might cause by making even a simple change to a legacy system. When faced with the prospect of rebuilding it from the ground up, this problem is much scarier.

We believed that the best way to build confidence in our ability to succeed was to approach the problem in small but challenging steps.

It’s one thing to address “toy” problems that are easy to solve and ship, but that doesn’t build confidence: It instead creates a worse fear that we are kicking the harder problems down the road. We wanted the first problem we solved to look as much as possible like a small but complete version of the overarching problem at hand.

After some debate, we decided that we could rebuild one feature of the invoicing application and then migrate all of our users to it. With this approach, Wave Invoicing would become a hybrid of old and new for all of our users, and over time the proportion of “new” would grow until the legacy component was completely gone.

While simple in theory, it is quite difficult to find a meaningful part of a legacy system that can be moved without breaking a large number of dependencies. For example, the core feature of creating and sending invoices was deeply embedded into both the software and data architecture, which made it frightening to extract at this stage.

We looked carefully at the data model for hints of where the strongest dependencies could be. While it’s possible to run into serious problems disentangling tightly-coupled code, our experience was that isolating and moving tightly-coupled data was a much worse problem to have.

Two entities in our Invoicing data model stood out as potential candidates:

  • Invoice settings, which specify defaults such as the logo and accent colour of an invoice.
  • Invoice reminders, which business owners schedule in advance to remind their customer to pay an invoice.

We chose invoice reminders as our starting point because:

  • The feature was used substantially in code, but referenced only about half as often as invoice settings.
  • Downtime for this feature would be reasonably easy to recover from, and would be less immediately noticeable. Errors in invoice settings were seen as riskier, because they could substantially impact the invoice experience.

The plan was to:

  1. Rebuild the invoice reminder feature in the new system.
  2. Build a migration engine in the new system.
  3. Use the engine to migrate all users to the new invoice reminders.

We’ll save the discussion of the thinking that went into our brand-new invoicing system for another series. In this post, we’ll focus on the migration process.

Designing The Migration Process

When we hear the term “migration engine”, we might naturally start thinking of things like job queues and data copying. However, our primary concern was actually quite a bit simpler than this. Because Wave users were going to be moving from the legacy system to a hybrid of legacy and new, we needed to be able to safely and reliably answer the question, “Has this user been migrated?”

It’s natural to think of this as a straightforward yes-or-no question, but there is an important third possibility to consider: At any given time, the user could be in the process of being migrated.
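
To make that concrete, here's a minimal sketch of a three-state migration status; the enum name and values are illustrative, not the ones in our actual schema.

```python
from enum import Enum


class MigrationStatus(Enum):
    """The three possible answers to "has this user been migrated?"."""

    NOT_STARTED = "not_started"    # still entirely on the legacy system
    IN_PROGRESS = "in_progress"    # data is being copied and verified right now
    COMPLETED = "completed"        # now served by the hybrid new/old system
```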

While the invoice reminder feature represented a small fraction of the overall Wave Invoicing data footprint, we knew that the data migration for later features could take a long time to complete and verify, so we needed to consider the experience a user might have if they attempted to use Wave Invoicing during their migration.

For example, consider a situation where a user is able to create an invoice reminder while they are being migrated. If this reminder is created in the old system while their data is being copied, it’s possible that this reminder won’t be copied to the new system during the migration. It would be as if the operation had never happened in the first place. Business owners trust Wave to be extremely careful with their data, as our software has a huge impact on the financial well-being of their businesses. This class of problems would only become more severe as we migrated more important features, so we wanted to take this smaller instance of the problem seriously.

The easiest and safest way to solve this problem was to effectively ‘pause’ the user’s Wave account for the duration of their migration. Again, this is more complicated than it first sounds. In a fully-featured distributed system like Wave, safely pausing a user’s account can require changes at multiple sites, such as:

  • Displaying a splash-screen to the user in the web or mobile app during the migration.
  • Stopping or delaying any relevant background jobs that could affect Invoicing.
  • Refusing API requests from 3rd-party applications that may be trying to change the state of the user’s invoicing data.

This multi-site issue was another important reason for us to make it easy to determine what stage of the migration a user was in.

Wave has an internal system called “Identity” that maintains a user’s authentication and authorization state for the entire distributed system. Each request to the Wave API results in an internal RPC call to Identity to fetch the user’s login state.

We used this system to track the status of a user’s migration, and to make decisions in our application code to determine what course of action to take at each relevant site. Diagram 1 illustrates how this all works together.

Diagram 1: The hybrid Wave Invoicing application architecture with new invoice reminder system
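
To give a sense of what the check at each of those sites might look like, here's a rough sketch of a guard that could run before any invoicing change. The `get_migration_status` call is an illustrative stand-in for the real Identity RPC, and `MigrationStatus` is the enum sketched earlier.

```python
class UserMigrationInProgress(Exception):
    """Raised so the caller can show a splash screen, pause a job, or reject an API call."""


def ensure_not_migrating(identity_client, user_id: str) -> None:
    """Refuse invoicing changes while a user's migration is running.

    `identity_client.get_migration_status` is an illustrative stand-in for
    the internal RPC exposed by Identity; `MigrationStatus` is the enum
    sketched earlier in this post.
    """
    status = identity_client.get_migration_status(user_id)
    if status == MigrationStatus.IN_PROGRESS:
        raise UserMigrationInProgress(
            f"user {user_id} is being migrated; invoicing changes are paused"
        )
```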

Migrating The Data

We’ve been using the term “migrate” without defining exactly what it means. The outcome of migrating a user is that they move from using the legacy system to using the hybrid new/old one; however, the process required to do this is data-centric.

It requires:

  • Copying the source data from the legacy database.
  • Transforming the data from the old schema to the new one.
  • Verifying the correctness of that data in the new system.
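
The transformation step can be pictured as a plain field-by-field mapping from the old schema to the new one. The sketch below is illustrative only; the real field names and schemas aren't shown in this post.

```python
def transform_reminder(legacy_row: dict) -> dict:
    """Map one legacy reminder row onto the new schema.

    The field names on both sides are invented for illustration; the point
    is only that every legacy row is mapped, field by field, into the shape
    the new system expects.
    """
    return {
        "invoice_id": legacy_row["invoice_pk"],
        "send_on": legacy_row["reminder_date"],
    }
```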

Copying And Transforming

We had two concerns in the copy/transformation step.

First, we did not want to disrupt any of our production traffic. As we described in the previous section, the migrating user would be locked out of Wave for the duration of their migration, but at any given time there are many, many other business owners using Wave to operate their business. Reading and writing large amounts of data on a production database can impact its responsiveness and cause general service degradation.

In order to reduce the burden on the source database, we instead used a read replica that doesn’t serve production traffic. This option was unfortunately not available to us on the destination database, since it also used a typical leader-follower setup where writes needed to be made to the leader.

Second, we wanted to make sure that each user’s data was moved atomically: That is, either all of their data would be migrated, or none of it would be. We did this by running the data migration in a single atomic database transaction.
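
Here's a simplified sketch of the copy step under those two constraints, assuming a psycopg2-style database driver and hypothetical table names: read from the replica, then write to the destination leader inside a single transaction.

```python
from contextlib import closing


def copy_user_reminders(replica_conn, destination_conn, user_id: str) -> None:
    """Copy one user's reminders from the legacy read replica to the new system.

    `replica_conn` points at a read replica of the legacy database, so the
    bulk read never touches the production leader; `destination_conn` points
    at the new system's primary. Table and column names are hypothetical.
    """
    # Read the source rows from the replica only.
    with closing(replica_conn.cursor()) as cur:
        cur.execute(
            "SELECT invoice_pk, reminder_date FROM legacy_invoice_reminders "
            "WHERE user_id = %s",
            (user_id,),
        )
        legacy_rows = [
            {"invoice_pk": invoice_pk, "reminder_date": reminder_date}
            for invoice_pk, reminder_date in cur.fetchall()
        ]

    # Write inside one transaction on the destination leader: either every
    # reminder is copied or none of them are. With a psycopg2-style driver,
    # the `with` block commits on success and rolls back on error.
    with destination_conn:
        with closing(destination_conn.cursor()) as cur:
            for row in legacy_rows:
                new_row = transform_reminder(row)  # the mapping sketched earlier
                cur.execute(
                    "INSERT INTO invoice_reminders (invoice_id, send_on) "
                    "VALUES (%s, %s)",
                    (new_row["invoice_id"], new_row["send_on"]),
                )
```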

We felt confident in using this method for invoice reminders because the data volume was small. However, holding a transaction open on the new system’s primary database for the duration of a user’s migration concerned us, because we weren’t sure what sorts of issues might arise when data grew larger in later stages of the project. User activity on this primary database would also increase as the project continued.

We ended up carrying forward these uncertainties into later stages of the project rather than sorting them all out in this first phase. In a later post, we’ll discuss how these risks evolved as we made more progress.

Verification

Wave has been through a large migration or two before, and we knew from experience that there are many subtle errors that can arise when copying and transforming data from one system to another. We wanted to be very sensitive to these errors, because they had the potential to cause hard-to-diagnose bugs that shake the confidence of our users.

The data we were migrating at this stage was fairly simple, so a direct record-level comparison would likely have worked well. However, we hoped to find a solution that could scale better to the more complicated nested entities we’d be migrating later.

We implemented this comparison by hashing. Before migrating the source data, we converted it to an intermediate format: a Python dictionary that contained all of its relevant fields. We hashed this data structure to produce a bytestring, called a digest, that could be passed on to the copy phase.

After the data was copied, we read it back from the destination database, converted it to the intermediate format, hashed it, and compared its digest to that of the source. This was a relatively efficient way of verifying that the source and destination data matched.
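
A minimal sketch of that digest step, assuming JSON as the canonical serialization and SHA-256 as the hash; both are illustrative choices, and the records are assumed to already be in the shared intermediate format.

```python
import hashlib
import json


def digest_reminders(reminders: list[dict]) -> bytes:
    """Deterministically hash a user's reminders in the intermediate format.

    Serializing each record with sorted keys, and sorting the serialized
    records themselves, makes the digest independent of row order and of
    dict insertion order.
    """
    canonical = sorted(
        json.dumps(record, sort_keys=True, default=str) for record in reminders
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).digest()


def verify_migration(source_digest: bytes, destination_reminders: list[dict]) -> None:
    """Compare the digest captured before the copy with what was actually written."""
    if digest_reminders(destination_reminders) != source_digest:
        raise ValueError("destination data does not match the source digest")
```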

While the hashing operation itself is efficient, this verification step did have a notable cost: it had to read all of the migrated data back into memory. This did not consume much time, but it did have a significant I/O cost, which can cause issues in hosted environments where I/O is metered. We’ll talk more about this in a later post.

The tradeoff proved to be worth it: the verification caught several errors later in the migration, such as fields that we had missed in the copy operation.

The Migration Engine

The larger outcome of this project increment was a migration engine that orchestrated the actions described above.

The engine was primarily made up of command-line tools and asynchronous workers that would serve us well in the next, larger effort. The tools were built to migrate cohorts of users in parallel.

These tools could:

  • Add users to a cohort (from a text file or command-line parameter).
  • Launch a migration for a cohort.
  • Display the status of a cohort (how many users were not started, queued, being copied, being verified, completed).
  • Roll back a migration for a subset of users.
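
As a rough illustration of that interface, here's what a cohort tool might look like if sketched with argparse; the command and option names are invented for this example, and the real tool would dispatch to the migration engine instead of printing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical interface: the post lists the capabilities, not the exact commands.
    parser = argparse.ArgumentParser(prog="reminder-migration")
    commands = parser.add_subparsers(dest="command", required=True)

    add = commands.add_parser("add-users", help="add users to a cohort")
    add.add_argument("cohort")
    add.add_argument("--user-id", action="append", default=[])
    add.add_argument("--from-file", help="text file with one user id per line")

    launch = commands.add_parser("launch", help="enqueue migration jobs for a cohort")
    launch.add_argument("cohort")

    status = commands.add_parser("status", help="show per-stage counts for a cohort")
    status.add_argument("cohort")

    rollback = commands.add_parser("rollback", help="roll back a subset of users")
    rollback.add_argument("cohort")
    rollback.add_argument("--user-id", action="append", required=True)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # the real tool dispatches to the migration engine here
```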

We organized the migration process into discrete jobs. Each job was equipped with a dedicated queue so that we could inspect what was happening at each stage of the migration.

As each user moved through the migration process, we updated the state of their migration in a migration tracking database table.
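
For illustration, the tracking table and its status updates might look something like the sketch below; the schema, column names, and status values are hypothetical, and the SQL assumes a PostgreSQL-style driver.

```python
# Hypothetical shape of the migration tracking table.
CREATE_TRACKING_TABLE = """
CREATE TABLE IF NOT EXISTS reminder_migrations (
    user_id    TEXT PRIMARY KEY,
    cohort     TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'not_started',
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def record_status(conn, user_id: str, status: str) -> None:
    """Advance a user's row as they move through queued / copying / verifying / completed."""
    with conn:  # commit on success, roll back on error
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE reminder_migrations SET status = %s, updated_at = NOW() "
                "WHERE user_id = %s",
                (status, user_id),
            )
```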

Diagram 2 shows how the migration engine fits into the overall architecture of both systems.

Diagram 2: System architecture including migration engine

Bringing It All Together

We’ve discussed the relevant components and the larger challenges we faced in building the first version of our migration system. To get an idea of how everything fit together, let’s see how it ran a migration from beginning to end.

Given a user to migrate, we would:

  • Inform Identity that we were beginning a migration for a given user
  • Read from a read replica of the database on the source system
  • Create a hash digest in a deterministic way for the source data
  • Start a transaction on our destination database
  • Transform and copy the user’s data to the destination system
  • Read the destination data back into memory, hash it, and verify the digest matches that of the source
  • Inform Identity that the migration completed successfully for this user

At this point, the migrated user would be using our invoicing product as a combination of old and new!
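
Expressed as code, the per-user flow might look like the sketch below. Every name hanging off the `engine` facade is an illustrative stand-in for the real components: the Identity client, the legacy read replica, the destination database, and the digest helper.

```python
class VerificationError(Exception):
    pass


def migrate_user(engine, user_id: str) -> None:
    """Run one user's migration end to end, mirroring the steps listed above.

    `engine` is a hypothetical facade over the real components; none of
    these method names come from our actual codebase.
    """
    engine.identity.mark_migration_started(user_id)           # pause the user's account
    try:
        legacy = engine.read_source_reminders(user_id)         # from the legacy read replica
        source_digest = engine.digest(legacy)                  # deterministic hash of the source

        with engine.destination_transaction():                 # all-or-nothing copy
            engine.write_destination_reminders(user_id, engine.transform(legacy))

        copied = engine.read_destination_reminders(user_id)    # read back what was written
        if engine.digest(copied) != source_digest:
            raise VerificationError(f"digest mismatch for user {user_id}")

        engine.identity.mark_migration_completed(user_id)      # unpause; the user is now hybrid
    except Exception:
        engine.identity.mark_migration_failed(user_id)         # leave the user on the legacy system
        raise
```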

Success!

After we built the new invoice reminder system, we diverted all new Wave signups to it. This let us validate our new hybrid system with a small but growing set of users.

Once this version of our migration system was complete, the migration process proceeded without any problems. Our framework had experienced its first battle test, and victory was ours! This relatively short project made us confident that we’d be able to tackle the larger migration ahead of us, which we’ll talk about in our upcoming post.

Co-authored by Jason Colburne and Emmanuel Ballerini, edited by Michael DiBernardo. Thanks to Joe Pierri, Nick Presta and Joe Crawford for the reviews.
