Data Quality — An Important Aspect of Data Engineering
As data engineers, we are constantly dealing with data. Collecting, processing, and possibly applying some business knowledge on top perfectly describes our everyday work.
As many Data Science (DS) memes put it, though: your work/model is only as good as your data.
So what can we as data engineers do to make sure that our data is as good as it gets?
We need to apply some kind of data assurance/quality on top. We need to make sure that everything is — to the best of our knowledge — ready to be consumed by either DS teams or end users.
There are many solutions to this problem; from this point on, we will discuss only the ones I have experience with.
#1 Identify important properties
Before we dive in: if you think about it, data quality issues come down to the properties of the collected records. These properties fall into the following categories:
- properties you know beforehand,
- properties you should definitely identify,
- properties that are nice-to-have.
Keeping these in mind you can go on and design your ETL workflows.
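One way to keep these categories explicit in a workflow is to encode them directly. Here is a minimal sketch; the enum, field names, and mapping are all hypothetical, chosen only to illustrate the taxonomy:

```python
from enum import Enum, auto

class PropertyKind(Enum):
    """Hypothetical categories for a record's properties."""
    KNOWN_BEFOREHAND = auto()  # set by the workflow itself, before collection
    MUST_IDENTIFY = auto()     # required for a record to be accepted
    NICE_TO_HAVE = auto()      # useful, but a record is valid without it

# Illustrative mapping from record fields to their category.
PROPERTY_KINDS = {
    "source": PropertyKind.KNOWN_BEFOREHAND,
    "published_at": PropertyKind.MUST_IDENTIFY,
    "keywords": PropertyKind.NICE_TO_HAVE,
}
```

Making the categories explicit like this keeps the design decision visible in code, rather than buried in ad hoc `if` statements.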
#2 Apply your rules directly on your ETL workflows
(Unfortunately) our day-to-day work is not done for the sake of computer science itself; we work in a specific sector, with strict requirements on our data.
Why not embed these requirements in your ETL workflows?
Let’s go through an example. Say you are building a news aggregator; there are some very important aspects of the data you are collecting, the most notable of which are:
- website that published the story,
- date the content piece was published,
- category it falls under,
- publisher/journalist name,
- significant/important words in the text.
Ok, so you have to identify (ideally) five properties in each of your data records.
Let’s sort these to-be-identified properties into the categories we defined earlier.
Properties known beforehand: website publishing the story. We know this before our workflow starts.
Properties that should be identified: the date of the story and possibly the category of the content piece. Regardless of the use case, chances are you should always include a publication date in the data you collect!
As for the article’s category, you most probably need it, but whether it is a must or not depends on the actual use case.
Nice-to-have properties: the publisher/journalist name and the significant words. In some cases (as discussed above), the article’s category falls here too.
Having identified the important properties in your data, you should encode these rules in your workflow(s). Store every record that fails these checks for future evaluation (we never throw away data!); there may be a bug in one of your scripts!
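A minimal sketch of such checks for the news aggregator example, assuming records are plain dicts; the field names and the two destination lists are illustrative assumptions, not a prescribed design:

```python
def validate_article(record: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means the record passes."""
    problems = []
    # Known beforehand: the workflow itself sets the source website.
    if not record.get("website"):
        problems.append("missing website")
    # Must identify: a publication date should always be present.
    if not record.get("published_at"):
        problems.append("missing published date")
    # Nice-to-have fields (journalist, significant words) do not block a record.
    return problems

def route_record(record: dict, accepted: list, quarantine: list) -> None:
    """Keep failing records for later review instead of throwing them away."""
    problems = validate_article(record)
    if problems:
        quarantine.append({"record": record, "problems": problems})
    else:
        accepted.append(record)
```

The key point is the quarantine path: a failed check may mean bad source data, but it may just as well mean a bug in your own pipeline, so the evidence is kept.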
#3 Apply human/manual review steps
(Unfortunately) manual review steps are necessary, whether they are performed by a computer science expert or a domain expert.
Chances are you are going to need an admin environment for this. And you should definitely set one up from the start!
It is indeed an overhead in terms of go-to-market time, but you will surely be glad you did it.
Always include an admin review environment in your data project!
You may want to include a human expert in the field you are working in (the optimal case), or you may simply want an easy way to review your collected data. Either way, reviewing the collected data is going to surface problems.
Some to watch out for:
- problems specific to a particular data source,
- automations that can easily be applied to the collected data,
- rules that, when met, mark a record as passing data-quality checks.
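The third point can cut reviewers' workload considerably: records that satisfy all the rules are accepted automatically, and humans only see the rest. A small sketch, where the allow-list and field names are purely illustrative assumptions:

```python
def needs_manual_review(record: dict) -> bool:
    """Auto-accept records meeting every quality rule; queue the rest for humans.

    The trusted-source list and required fields below are hypothetical.
    """
    trusted_sources = {"example-news.com"}  # sources vetted during past reviews
    auto_accept = (
        record.get("website") in trusted_sources
        and bool(record.get("published_at"))
        and bool(record.get("category"))
    )
    return not auto_accept
```

As reviewers vet more sources and rules, the trusted set grows and the manual queue shrinks, which is exactly the feedback loop the review step is meant to create.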
Each of these can help you build an even more robust, quality-assured data stack/platform.
You can even use an open-source CMS for this, which also lets you make your collected data available to anyone without writing a single line of front-end code!
In a nutshell (in case you quickly scrolled to the end of the article), you should always keep data quality in mind in every data-related problem you are trying to solve.
- Identifying important properties in the data you are handling,
- Applying domain specific rules on top, and
- Having an admin env to review the data.
are going to make your data platform more robust and also make it easier to offer a direct business view of the data!
In upcoming articles, we will discuss specific technologies/frameworks that can help with this process!