Data quality is often neglected in the early stages of product development because a minimal viable product or pilot usually doesn’t require thorough data validation to justify its functionality. Even in a mature, well-tested software product it is often overlooked, because code test coverage doesn’t reflect coverage of the possible inputs to the system, and the output of most input combinations is left unverified. That would only be fine in an ideal world, where the data we collect (by humans and machines) or that our clients provide are clean and contain only expected values. Real life, however, is full of human errors and signal noise: people don’t always get the spelling right, and systems glitch or hit their limits when countering random factors.
Data wrangling is often called to the rescue: it processes the data before they are fed to the system. The dirtier your data are, the more human-eye inspection this step needs, and because human involvement is nearly indispensable, data wrangling usually happens in a preliminary stage, before the data enter the data pipeline. On the other end of the pipeline, data quality assessment, or data tests, can help as an addition to conventional software quality assurance, which focuses on whether the software functions correctly under different circumstances, not on the logic behind the data the system derives. For example, tools like Great Expectations put data quality under the microscope with data profiling, testing, and reporting, and dbt provides testing alongside data modeling. While we are glad that these tools catch data errors for us, we still need to fix them by spotting their causes in the data pipeline. That is where things get tricky: the more complex your data processing is, the harder it is to locate the error. After all, you don’t have a compiler to tell you which line of code produced the wrong data.
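To make the idea of a “data test” concrete, here is a minimal sketch in plain Python, without relying on any particular tool: an expectation is just a predicate over the dataset, and the test reports which rows violate it. All column and field names here are made up for illustration.

```python
# A "data test" as a predicate over the dataset: instead of testing the
# code, we test the data themselves and report the violating rows/values.

def expect_column_values_not_null(rows, column):
    """Return the rows whose `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]

def expect_column_values_unique(rows, column):
    """Return the values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)

# Hypothetical sample data with the kinds of errors described above.
suppliers = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": None},    # a human error slipped in
    {"id": 2, "name": "Apex"},  # a duplicate id from a second source
]

null_names = expect_column_values_not_null(suppliers, "name")
duplicate_ids = expect_column_values_unique(suppliers, "id")
print(len(null_names), duplicate_ids)  # → 1 [2]
```

Tools like Great Expectations offer the same idea at scale, with a large catalog of built-in expectations plus profiling and reporting around them; the sketch above only shows the underlying principle.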
Sam Bail shows in the link how you can test your data every time they are handed off between teams or systems, using data pipeline tooling such as Airflow. I would go one step further and test my data after every logical data processing step. For example, my data come from multiple sources, they may duplicate or conflict with each other, and my de-duplicating and merging logic is complex. What I do is add a testing step after de-duplication to verify that the resulting data are as expected, before the pipeline proceeds to the merging step and another round of testing. Say I find duplicate data after the merging step: I can then assume the logic error lies in the merging step and not in any earlier one, given that my previous tests are solid. The beauty of it is that data quality assurance is tightly integrated into the data pipeline by running “data unit testing” after each step. If you put data wrangling into the picture, the first step of the data pipeline should be testing the input, i.e. the outcome of the wrangled data. This model even allows test-driven development at the data level (if you like that).
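The de-duplicate → test → merge → test flow described above can be sketched as follows. This is a toy illustration, not our actual pipeline: the record shapes, step functions, and uniqueness keys are all hypothetical stand-ins for whatever your real steps guarantee.

```python
# "Data unit testing" after each logical step: each step declares what
# must hold for its output, so a failing assertion names the guilty step.

def deduplicate(records):
    """Keep the first record seen for each (source, id) pair."""
    seen, out = set(), []
    for r in records:
        key = (r["source"], r["id"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def merge(records):
    """Merge records sharing an id across sources (later fields win)."""
    merged = {}
    for r in records:
        merged.setdefault(r["id"], {}).update(r)
    return list(merged.values())

def assert_unique(records, key_fields, step):
    """The data test run after a step: fail fast with that step's name."""
    keys = [tuple(r[f] for f in key_fields) for r in records]
    assert len(keys) == len(set(keys)), f"duplicate keys after step: {step}"

raw = [
    {"source": "A", "id": 1, "name": "Acme"},
    {"source": "A", "id": 1, "name": "Acme"},   # exact duplicate
    {"source": "B", "id": 1, "country": "DE"},  # same entity, other source
]

deduped = deduplicate(raw)
# After de-duplication, (source, id) must be unique; ids may still repeat
# across sources, because resolving that is the merging step's job.
assert_unique(deduped, ("source", "id"), "deduplicate")

merged = merge(deduped)
# After merging, id must be globally unique. If this assertion fires,
# the bug is in merge(), not in any earlier step.
assert_unique(merged, ("id",), "merge")
print(len(merged))  # → 1
```

The key design point is that each test encodes the *invariant* the preceding step is supposed to establish, so a failure pinpoints the step whose logic broke, exactly the localization the compiler can’t give you for data errors.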
Does this sound interesting to you? Alpas AI is looking for a data analyst to work with us on enhancing data quality in our global supplier search platform. The role involves close collaboration with our business domain and data engineering teams, and we see it as crucial to our mission of making the supply chain transparent for our clients. Feel free to reach out to me/us to learn more. :D