A principal part of data engineering is to provide valid data to users. When ingesting data from APIs, files, or other data feeds, it is essential to check that the data conforms to what you expect.
It is difficult to trust an application that supplies questionable data. The same goes for a predictive analysis that relies on questionable data. Hence the saying "garbage in, garbage out". When we have data verification checks in place, we can maintain (with confidence) that our data at scale is good (and getting better) rather than treating quality as an afterthought.
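To make "verification checks" concrete, here is a minimal sketch in Python. The field names, types, and the non-negative rule are hypothetical stand-ins for whatever your feed actually promises:

```python
from typing import Any

# Hypothetical expectations for an ingested record; adjust to your own feed.
EXPECTED_FIELDS = {
    "event_id": str,
    "timestamp": str,   # ISO 8601 string expected
    "amount": float,
}

def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Example domain rule: amounts should never be negative.
    if isinstance(record.get("amount"), float) and record["amount"] < 0:
        problems.append("amount: must be non-negative")
    return problems

# Usage: quarantine or reject failing records instead of loading them silently.
record = {"event_id": "abc-123", "timestamp": "2021-01-01T00:00:00Z", "amount": -5.0}
for problem in validate_record(record):
    print(problem)  # -> "amount: must be non-negative"
```

The point is less the specific checks than where they live: at the ingestion boundary, so bad records are caught before anything downstream depends on them.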
Let’s dive in.
When we talk about data from a software engineering perspective, we often focus on how big it is: the number of events (volume), the rate at which it is generated (velocity), and the different formats it comes in (variety). These terms are especially helpful for boasting about the power of our machines.
But more important than all of those things is the actual usefulness (value) of the data to the people consuming it. In other words, the power of humans to drive insight and take action as a result of the work the machines are doing.
In our case at a…