Validating Data Migrations — time to dust off your statistics

How to use statistical validation when normal methods won’t work

Roman Soboliev
Slalom Build
6 min read · Jan 10, 2022


The prospect of data migration can be daunting, and for good reason. The process is complex and risky, and failing to migrate the data properly can mean severe consequences for the business. As more companies embark on the journey of infrastructure modernization, retiring expensive and cumbersome legacy applications and moving to the cloud, this challenge is becoming more common. And migrating years and years of data means digging through our toolbox for the right validation tool for the job.

Occasionally we get lucky. Sometimes the schema is straightforward, the dataset is manageable (a couple of billion rows), and legacy and modernized applications will be live side-by-side for comparison, allowing us to brute force our way through record-by-record validation and cross-reference new reports against legacy.

But not always.

It’s hard to pinpoint the origin of the phrase “Hope is not a good strategy.” Depending on the source, it can be movie director James Cameron, economist Dr. Benjamin Ola Akande, Green Bay Packers coach Vince Lombardi, President Barack Obama, even Darth Vader himself.

What is clear though is that we can’t rely on the hope that we get lucky and avoid the myriad “what if” scenarios that can crop up. Like what if the dataset is so large that brute force is not an option? Or what if the schema is convoluted with decades worth of “temporary patches?” Better yet, what if it’s both? Ouch.

Luckily, when normal data validation methods won’t work, we can leverage statistics to heuristically validate the success of data migrations. So there’s no need to panic—unless you’re running low on coffee, of course. If so, brew a fresh pot, and let’s get to it!

Central Limit Theorem

Caffeinated? Good. The next part is important, but can be a bit dry. It lays the theoretical foundation for the proposed approach. This is also the part where I will make my calculus professor proud. (Hello Mrs. Efremova! See, I was paying attention!)

We will use Wikipedia and my notes from almost two decades ago to define the Central Limit Theorem (CLT). CLT establishes that in many situations, when independent random variables are added, the properly normalized sum tends towards a normal distribution—even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions (a bell curve) can apply to many problems involving other types of distributions.

The above is important because we will rely on the normal (Gaussian) distribution when making key assumptions, specifically the folded normal distribution (a half bell curve).

Gaussian (or normal) distribution

Are you still with me? In other words, random events tend towards a bell curve. The scientific community widely accepts this statement. And if it’s good enough for NASA to send people to space, it’s good enough for us.
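If you would like to see the theorem in action before we move on, here is a quick simulation sketch in Python (NumPy, with parameters I picked purely for illustration): sums of draws from a decidedly non-normal uniform distribution still pile up into a bell curve.

import numpy as np

# Add up independent draws from a non-normal (uniform) distribution.
# Per the CLT, the distribution of these sums approaches a bell curve.
rng = np.random.default_rng(seed=42)
variables_per_sum = 50     # independent random variables added together
num_sums = 100_000         # how many such sums we observe

sums = rng.uniform(0, 1, size=(num_sums, variables_per_sum)).sum(axis=1)

# A normal distribution keeps ~68% of values within one standard deviation
# of the mean and ~95% within two; check how close our sums come to that.
mean, std = sums.mean(), sums.std()
print(f"within 1 std dev: {np.mean(abs(sums - mean) <= std):.1%} (normal: ~68.3%)")
print(f"within 2 std dev: {np.mean(abs(sums - mean) <= 2 * std):.1%} (normal: ~95.4%)")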

Selecting a statistically significant sample

The theory above allows us to examine a statistically significant sample and draw conclusions about the entire dataset. What does “statistically significant sample” mean? Glad you asked. For our purposes, we’ll focus on three essential parts. First, the sample size must be appropriate, and there are plenty of online calculators to help determine it. Next, we want the highest confidence level (99%) and the smallest margin of error (1%) we can get. According to one such calculator, validating one million records requires a random sample of 14,220 of them. The “random” part is important: it’s not enough to simply take the top 14,220 records.
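If you prefer to compute the number yourself, the sketch below uses Cochran’s formula with a finite population correction. Different calculators use slightly different conventions (z-values, rounding, correction terms), which is why this result lands in the same ballpark as, but not exactly on, the 14,220 quoted above.

import math

def sample_size(population: int, z: float = 2.576, margin: float = 0.01,
                proportion: float = 0.5) -> int:
    """Cochran's formula with finite population correction.

    z: z-score for the desired confidence level (2.576 for ~99%).
    margin: margin of error (0.01 = 1%).
    proportion: assumed population proportion; 0.5 is the most conservative choice.
    """
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(1_000_000))  # 16319 with these conventions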

Finally, the sampled dataset needs to be homogeneous. For instance, migrating a system with sales and current inventory data requires selecting a representative random sample from each subset, sized according to that subset’s record count. If the sales data has one million records and the current inventory data has one million records, we will need to select 14,220 records from each to perform validation.
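Assuming the migrated data can be pulled into pandas DataFrames (the file names below are hypothetical), drawing an independent random sample from each homogeneous subset is short work, reusing the sample_size helper from the previous sketch:

import pandas as pd

# Hypothetical extracts of the two homogeneous subsets from the target system.
sales = pd.read_parquet("target/sales.parquet")          # ~1M records
inventory = pd.read_parquet("target/inventory.parquet")  # ~1M records

# Size each sample for its own subset, then draw records at random --
# never just the top N.
sales_sample = sales.sample(n=sample_size(len(sales)), random_state=42)
inventory_sample = inventory.sample(n=sample_size(len(inventory)), random_state=42)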

Not all data is created equal

Now that we’ve established the why and how of selecting a statistically significant sample, let’s talk about the dataset itself. Intuitively, we understand that some data matters more to the business than other data. For instance, a social security number is more important than the last login date.

Or is it? As engineers, we don’t always have the business insight to say definitively one way or the other. Hence we need help from stakeholders to identify the business-critical data, or Unit of Migration (UoM). In fancy speak, a UoM is a top-level entity and all of its dependent entities. Confusing, right? In practical terms, the UoM is the subset of data we can’t afford to lose. To complicate things further, UoM data can sometimes live across multiple tables.

Let’s imagine we are migrating a database with the following columns: First Name, Last Name, Social Security Number, Address, Signup Date, Last Login, and Notes.

In conversation with business stakeholders, we discovered that First and Last names, SSN, and Address are critical. Migration won’t be successful if we lose a single value in any of those columns. Hence, these columns make up our Unit of Migration. It’s important to note that data loss in any of the UoM elements results in a rollback of the entire UoM. For instance, if we fail to migrate the First Name, we’ll have to roll back all four UoM columns for the record in question. That’s essential to keep in mind when the UoM spans multiple tables. Conversely, loss of Notes is something we can tolerate, and as a result, a rollback won’t be necessary.
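As a sketch of how that rollback rule might look in code (the column and key names are mine, not prescribed): compare each sampled legacy record to its migrated counterpart, and if any UoM column was lost or altered, flag the whole UoM for that record.

import pandas as pd

# Hypothetical schema: adjust to your own column and key names.
UOM_COLUMNS = ["first_name", "last_name", "ssn", "address"]
KEY = "record_id"

def uom_rollback_keys(legacy_sample: pd.DataFrame, migrated: pd.DataFrame) -> pd.Series:
    """Return keys of sampled records whose entire UoM must be rolled back.

    A record is flagged if any single UoM column was lost or altered,
    even if the other UoM columns migrated fine.
    """
    compared = legacy_sample.merge(migrated, on=KEY, how="left",
                                   suffixes=("_legacy", "_migrated"))
    mismatch = pd.Series(False, index=compared.index)
    for col in UOM_COLUMNS:
        mismatch |= compared[f"{col}_legacy"].ne(compared[f"{col}_migrated"])
    return compared.loc[mismatch, KEY]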

Assigning weights

In simple terms, the weight is the score we will use to determine where migrated records fall within the distribution. When assigning weights, it’s important to remember that the combined weight of the non-UoM columns must stay below the combined weight of the UoM columns: we shouldn’t be rolling back records over the loss of non-essential data. So, if we assign a weight of 1 to each column, the combined weight of the UoM columns is 4, and that of the non-UoM columns is 3.

With the normal distribution in mind, our post-migration validation half curve would look something like this:

Data migration distribution

Because the UoM is validated as a unit, a record either earns the full UoM weight of four or none of it. Post-migration validation should therefore return records with a weight of four or greater for us to consider the migration successful. All records with a weight of three or below should be examined, rolled back, and re-migrated.
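Here is a minimal sketch of that scoring, treating the UoM as an all-or-nothing block per the rollback rule above; the column names and weights follow the example schema, while the function names and sample values are mine.

# Weights per the discussion above: 1 per column.
UOM_WEIGHTS = {"first_name": 1, "last_name": 1, "ssn": 1, "address": 1}   # totals 4
NON_UOM_WEIGHTS = {"signup_date": 1, "last_login": 1, "notes": 1}         # totals 3
PASS_THRESHOLD = sum(UOM_WEIGHTS.values())                                # 4

def record_weight(legacy: dict, migrated: dict) -> int:
    """Score one migrated record against its legacy counterpart.

    The UoM is all-or-nothing: losing a single UoM column forfeits the whole
    UoM weight, which guarantees the record lands below the threshold.
    """
    uom_intact = all(migrated.get(col) == legacy.get(col) for col in UOM_WEIGHTS)
    score = sum(UOM_WEIGHTS.values()) if uom_intact else 0
    score += sum(weight for col, weight in NON_UOM_WEIGHTS.items()
                 if migrated.get(col) == legacy.get(col))
    return score

legacy = {"first_name": "Ada", "last_name": "Lovelace", "ssn": "000-00-0000",
          "address": "12 Main St", "signup_date": "2020-01-01",
          "last_login": "2021-12-31", "notes": "VIP"}

print(record_weight(legacy, {**legacy, "notes": None}))  # 6 -> passes, only non-essential loss
print(record_weight(legacy, {**legacy, "ssn": None}))    # 3 -> roll back and re-migrate

Run against the sampled records from earlier, anything under the threshold goes into the examine, roll back, and re-migrate bucket.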

Conclusion

Navigating an ever-growing number of data migration projects, each with severe consequences for failure, can be unpleasant. The Central Limit Theorem (CLT) allows us to select a statistically significant sample within homogeneous data, and defining the Unit of Migration (UoM) in collaboration with business stakeholders lets us assess the quality of the entire data migration, significantly reducing the complexity of validation.

However, statistical analysis is just a tool, and we are the artisans. So it’s our responsibility to select the correct one for the job.
