Data quality is often neglected in the early stages of product development, because a minimal viable or pilot product usually doesn’t require thorough data validation to justify its functionality. Even in a mature, well-tested software product it is often overlooked, because code test coverage doesn’t reflect coverage of the possible inputs to the system, and the output of most input combinations is left unverified. That would only be fine in an ideal world, where the data we collect (by humans and machines) or that our clients provide are clean and contain only expected values. Real life, however, is full of human errors and signal noise: people don’t always get the spelling right, and systems glitch or hit their limits when confronted with random factors.

Data wrangling is often called to the rescue: it processes the data before they are fed into the system. The dirtier your data, the more human-eye inspection this step requires, and because human involvement is nearly indispensable, data wrangling usually happens in a preliminary stage before the data enter the pipeline. On the other end of the pipeline, data quality assessment, or data tests, can complement software quality assurance, which conventionally focuses on how the software functions under different circumstances, not on the logic behind the data the system derives. For example, tools like Great Expectations put data quality under the microscope with data profiling, testing, and reporting, and dbt provides testing alongside data modeling. While we are glad these tools catch the data errors for us, we still need to fix them by spotting their causes in the data pipeline. That is where things get tricky: the more complex your data processing, the harder it is to locate the error. After all, you don’t have a compiler to tell you which line of code produced the wrong resulting data.
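To make the idea concrete, here is a minimal, hand-rolled sketch of what such a data test does (this is not the actual Great Expectations or dbt API; the supplier records, column names, and rules are hypothetical):

```python
def check_records(records, checks):
    """Run each named check against every record and collect failures."""
    failures = []
    for i, record in enumerate(records):
        for name, predicate in checks.items():
            if not predicate(record):
                failures.append((i, name))
    return failures

# Hypothetical supplier records with typical real-life dirt in them.
records = [
    {"supplier_id": "S-001", "country": "DE"},
    {"supplier_id": None,    "country": "DE"},      # human error: missing id
    {"supplier_id": "S-003", "country": "Gemany"},  # spelling mistake
]

# Expectations about the data, not about the code.
checks = {
    "supplier_id is present": lambda r: r["supplier_id"] is not None,
    "country is a known code": lambda r: r["country"] in {"DE", "FR", "US"},
}

print(check_records(records, checks))
# -> [(1, 'supplier_id is present'), (2, 'country is a known code')]
```

A code test suite with 100% coverage would happily pass on all three records; only a test that encodes expectations about the data itself flags rows 1 and 2.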

Sam Bail shows in the link how you can test your data at every hand-off between teams or systems, using data pipeline tooling such as Airflow. I would go one step further and test my data after every logical processing step. For example, my data come from multiple sources, they may duplicate or conflict with each other, and my de-duplication and merging logic is complex. So I add a testing step after de-duplication to verify that the resulting data are as expected, before the pipeline moves on to the merging step and another round of testing. Say I find duplicate data after the merging step: I can then assume the logic error lies in the merging step and not in any earlier one, given that my previous tests are solid. The beauty of it is that data quality assurance is tightly integrated into the data pipeline by running “data unit tests” after each step. If you put data wrangling into the picture, the first step of the pipeline should be testing the input, i.e. the outcome of the wrangled data. This model even allows test-driven development at the data level (if you like that sort of thing).
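The per-step testing described above could be sketched like this (the de-duplication and merge logic, and the `assert_unique` helper, are illustrative stand-ins, not our actual pipeline):

```python
def deduplicate(records):
    # Keep the first occurrence of each supplier_id within one source.
    seen, result = set(), []
    for r in records:
        if r["supplier_id"] not in seen:
            seen.add(r["supplier_id"])
            result.append(r)
    return result

def merge(sources):
    # Deliberately naive merge: concatenate all sources. A correct merge
    # would also resolve cross-source duplicates -- forgetting that is
    # exactly the kind of bug the second test below localizes.
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def assert_unique(records, key):
    keys = [r[key] for r in records]
    dupes = {k for k in keys if keys.count(k) > 1}
    assert not dupes, f"duplicate {key} after this step: {dupes}"

source_a = [{"supplier_id": "S-001"}, {"supplier_id": "S-001"}]
source_b = [{"supplier_id": "S-001"}]

# Step 1: de-duplicate each source, then test.
deduped = [deduplicate(s) for s in (source_a, source_b)]
for s in deduped:
    assert_unique(s, "supplier_id")  # passes: each source is clean

# Step 2: merge, then test again.
merged = merge(deduped)
try:
    assert_unique(merged, "supplier_id")
except AssertionError as e:
    # The step-1 test passed, so the bug must be in the merge step.
    print(e)
```

Because the test after de-duplication passed, the duplicate surfacing after the merge points straight at the merge logic, with no need to re-audit the earlier steps.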

Does this sound interesting to you? Alpas AI is looking for a data analyst to work with us on enhancing data quality in our global supplier search platform. The role is in close collaboration with our business domain and data engineering teams, and we see it as crucial to our mission of making the supply chain transparent for our clients. Feel free to reach out to me/us to learn more. :D




I-Feng Lin, data science engineer at home and Brazilian jiu-jitsu fighter at Checkmat Berlin
