The Data Kitchen
Where does Quality Engineering fit in a Data Engineer’s World?
Forrest Gump sure was right about those chocolates: you never know what you’re going to get. That untouched, slightly strange-looking chocolate just might be the one that changes the trajectory of your career.
A year ago, I was handed a chocolate that I politely accepted, not knowing that this fairly misunderstood flavor would become my area of expertise. That flavor turned out to be Quality Engineering for Data Engineering (yum!).
Cooool…so what is it?
Good question. Great question.
Well…for software, quality engineering is a well-understood and vital practice — simulating real-world interactions, verifying that everything in scope has been built, running test scripts, and generally trying to break things. So why is QE for data engineering so much harder to pin down?
Data engineering projects can be wildly different from each other (in data sources, targets, tools, complexity, you name it), which means that QE will also look wildly different. From project to project, the QE might go from testing a straightforward transfer of data to the cloud, to testing the complete overhaul of a data warehouse and its business logic, to testing a real-time pipeline with complex partitioning. In projects like these, variation in scope and terminology among the parties involved can lead to confusion, delayed releases, and unnecessary work and rework.
The demand for data engineering projects is growing like crazy. (I get it, data is the best.) To keep up with that growth, it is important to agree on definitions, which is why I’m writing to differentiate two big concepts.
Quality Assurance and Data Quality
To do so, let’s take a trip to Analogy-Land and discover the differences!
Your client just signed the Statement of Work to design and build a brand new kitchen for their restaurant. They have their menu set, and are eager to get started!
The architects have spent time analyzing and planning the best layout and tools, and the developers build out one part of the kitchen at a time. In the end, when all is said and done, how will you know that the kitchen your team built does what the client needs and will continue to do so?
Enter the Quality Engineer. (That’s you!)
What are your options?
Door #1: Quality Assurance—functional testing of the ETL pipeline.
As the devs start to build out sections of the kitchen, you take time to scope it out and see what’s there. What’s on your list to check? Which parts will need to be tested at the same time? Which ingredients will you need to be sure to bring? (You see an industrial dough mixer: better bring flour and yeast!)
For Quality Assurance, it’s your job to make sure everything the team has built works as intended. Yes, you will test that all the burners work, make sure each knife is sharp, and check that the sink has hot water. But you also need to start a fire to check that the fire extinguisher and vents work, spill water to make sure the floor has good drainage, and lock someone in the walk-in to verify that it is also a walk-out.
If you know the client is an Italian restaurant, you’ll want to make sure they have good sturdy pots that can handle continuously boiling pasta water. If you know the client is a pub, you’ll want to make sure those fryers are still working in tip-top shape at 1 AM.
The more you test in that specific kitchen, the more familiar you get with how everything works together, which in turn makes the kitchen easier to test. In the end, when you release their restaurant into the wild, you can feel confident that every item in the kitchen works as intended.
Plot holes:
- It’s true, you didn’t make any of the real menu items exactly as they will be made. But if you used a juicer to juice a lemon, you can reasonably expect it to work just as well on a lime.
- You made sure that every knife works the way it should, but you didn’t check that every knife that the client needs is there. That responsibility belongs to the architects. If you know fish is on the menu but you don’t see a filet knife, you can, and should, ask “Are you sure they don’t need a filet knife?” However, it’s up to the architects and clients to agree on which tools are required in the kitchen.
Door #2: Data Quality—validation of consumption-ready data.
The client has provided you with all the ingredients and recipes for every menu item. You go through each recipe and prep it, cook it, taste it, and validate that it’s good.
It is very important, though, that the definition of a “good” outcome has been decided and communicated by the client beforehand. If you like to peel cucumbers completely, but they prefer to leave some skin on, this could affect the outcome of the dish, which means they need to tell you their preference.
The best situation is having the old kitchen next door. In this case, you can make a dish and then immediately compare it against the same dish made in the old kitchen. It makes the testing process much easier, but not every client has this available.
The goal in Data Quality is to make sure that all the dishes served in the restaurant have the raw ingredients they need, are cooked correctly, and taste good. You are using their ingredients to cook all of their dishes.
Plot holes:
- If none of the current recipes call for citrus zest, you won’t know if the citrus zester works correctly until there is a recipe that needs it. Maybe someone actually got a microplane instead of a zester — no one will know until much later if there is a mismatch.
- Clients have a lot to do and don’t always have time to determine the passing result of a dish. If you’re reliant on their “OK” for every iteration of every dish, this will increase the time it takes to fully test everything.
- It can be overkill to make sure that 2 billion rows of potatoes can be cut by the French fry cutter. Unless you’ve been specifically asked to test large volumes, the majority of performance testing should be done separately from functional testing. In some cases, it is done during UAT (user acceptance testing) when the client tests out the restaurant themselves. But again, this should be agreed upon ahead of time.
Question.
If a dish turns out bad, is it because of the ingredients, the recipe, or the kitchen? The point of Quality Assurance is to assure that the kitchen and its tools are not the problem. If you made a cappuccino that turned out great and then the client made a latte that turned out not so great, we can say with confidence that it wasn’t the espresso machine that led to that outcome. It could be that their milk had spoiled, they used too many coffee grounds, or their coffee-to-milk ratio was wrong. But if the espresso machine can make a great cappuccino, it’s expected that it should also make a great latte.
Secret Door #3: Quality Assurance and Data Quality!
In the ideal world, you test everything in the kitchen using your own ingredients and you create a process allowing the client to check all food before it is served in the restaurant. This can be things like automatic alerts when raw ingredients spoil, continuous temperature checks for steaks, a timer for every pan, and a dedicated fry taster.
Now that we’ve landed back in Reality Land, let’s do a reality check:
For Quality Assurance, a QE would verify that the pipeline built by the devs and architects works as intended. For an ETL pipeline, they’ll need to review the transformation logic, create raw test data to check all aspects of that logic, run raw test data through the pipeline, and see if the output data matches what is expected.
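To make that a little more concrete, here is a minimal sketch of what a Quality Assurance check could look like in Python. Everything in it is an assumption for illustration: the `transform` function stands in for whatever logic the devs actually built, and the column names and rules (trim and title-case names, convert cents to dollars, drop rows with no id) are made up. The point is the shape of the test: you bring your own ingredients, run them through the kitchen, and compare the result to what you expected.

```python
# Hypothetical transformation under test. In a real project this is the
# pipeline logic built by the devs; the rules here are invented for the example.
def transform(rows):
    """Drop rows with no id, tidy up names, and convert cents to dollars."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # rule 1: rows without an id are dropped
        cleaned.append({
            "id": row["id"],
            "name": row["name"].strip().title(),                 # rule 2
            "amount_usd": round(row["amount_cents"] / 100, 2),   # rule 3
        })
    return cleaned


# Hand-made raw test data: each row exists to exercise a specific rule.
raw_test_data = [
    {"id": 1, "name": "  ada lovelace ", "amount_cents": 1999},  # whitespace, casing
    {"id": None, "name": "ghost row", "amount_cents": 500},      # should be dropped
    {"id": 2, "name": "GRACE HOPPER", "amount_cents": 0},        # zero amount edge case
]

# What the QE expects to come out the other end, worked out by hand.
expected = [
    {"id": 1, "name": "Ada Lovelace", "amount_usd": 19.99},
    {"id": 2, "name": "Grace Hopper", "amount_usd": 0.0},
]

actual = transform(raw_test_data)
assert actual == expected, f"pipeline output did not match: {actual}"
```

Notice that none of this is the client’s real data. Like the lemon in the juicer, the test rows are chosen to prove the tool works, not to reproduce the actual menu.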
For Data Quality, a QE would make sure that the actual raw data, once run through the transformation logic, produces the expected production data. It is paramount that the QE knows exactly what is expected of the output data: all the rules and all the expected values.
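Here is a rough sketch of what a Data Quality check could look like, again in Python and again under stated assumptions. The rules below (unique `order_id`, non-negative `total`, a fixed list of allowed `status` values) are hypothetical stand-ins for whatever the client has actually agreed a “good” dish looks like, and `sample_output` is invented data playing the role of real pipeline output.

```python
# Assumed, client-approved list of valid values (hypothetical for this example).
ALLOWED_STATUSES = {"open", "shipped", "cancelled"}


def find_violations(rows):
    """Return a list of (row_index, problem) tuples for rows that break a rule."""
    violations = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            violations.append((i, "order_id is missing"))
        elif order_id in seen_ids:
            violations.append((i, f"duplicate order_id {order_id}"))
        else:
            seen_ids.add(order_id)

        if row.get("total") is None or row["total"] < 0:
            violations.append((i, "total is missing or negative"))

        if row.get("status") not in ALLOWED_STATUSES:
            violations.append((i, f"unexpected status {row.get('status')!r}"))
    return violations


# Invented rows standing in for the real output of the pipeline.
sample_output = [
    {"order_id": 1, "total": 25.0, "status": "open"},
    {"order_id": 1, "total": 10.0, "status": "shipped"},    # duplicate id
    {"order_id": 2, "total": -5.0, "status": "refunded"},   # bad total, bad status
]

for row_index, problem in find_violations(sample_output):
    print(f"row {row_index}: {problem}")
```

The code is the easy part. The hard part is the list of rules at the top, which only works if the client has told you how they like their cucumbers peeled.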
Because Quality Assurance and Data Quality mean different things, everyone involved should agree on what is in scope when a project starts.
- Is it purely validation that the ETL pipeline functions as intended?
- Does it need to include data validation?
- Is Quality Engineering expected to be integrated with CI/CD?
- Will the Quality Engineering testing framework be passed to the client?
In the end, it’s understandable that data pipelines do not start out perfect because we humans are not perfect, but that’s why well-defined, nuanced, and comprehensive Quality Engineering is so important. When we deliver the product to the client, we can be sure that when they untie the bow and reach in, there are no bugs, dropped columns, or transformation errors, only caramel crunch, vanilla buttercream, and chocolate peanut butter swirl.