Data product design — Data verification and cleaning

Bryan Yang
A multi hyphen life
5 min read · Feb 3, 2023

Even after rice is harvested, it cannot be eaten right away; it has to go through a series of processing steps before it reaches the consumer.

https://www.sciencedirect.com/science/article/pii/S1672630821000998

This is not so different from how data is processed. The most common steps in a data pipeline include the following.

Data Acquisition Model

When data is generated, we usually do not process it where it is generated; instead we "push" it somewhere else for processing through data acquisition. For example, after a user produces interaction data in a mobile app, the phone pushes the data to a back-end server through a RESTful API. This is the most basic form of data collection.
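As a rough illustration of the push model, here is a minimal sketch of an app client pushing an interaction event to a back-end collector over a RESTful API. The endpoint URL and event fields are made up for the example:

```python
import requests

# Hypothetical ingestion endpoint; in practice this would be
# your back end's event-collection API.
INGEST_URL = "https://api.example.com/v1/events"

event = {
    "user_id": "u_12345",
    "action": "click",
    "screen": "home",
    "timestamp": "2023-02-03T10:00:00Z",
}

# The client "pushes" the event to the server for later processing.
resp = requests.post(INGEST_URL, json=event, timeout=5)
resp.raise_for_status()
```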

If there were only one user, this simple approach would of course be enough. But as the data volume grows, say to millions of users, a single machine usually cannot handle it, so the collection method has to change: a distributed technology such as Apache Kafka is used to collect the data.
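At that scale, the same event would typically be pushed into a distributed log instead of a single endpoint. A minimal sketch using the kafka-python client, where the broker address and topic name are assumptions:

```python
import json
from kafka import KafkaProducer

# Assumed broker address and topic name, for illustration only.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "u_12345", "action": "click", "screen": "home"}

# Events are appended to a topic; downstream consumers read them at
# their own pace, so a single collector is no longer the bottleneck.
producer.send("app-events", value=event)
producer.flush()
```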

In addition to having the source actively push data out as described above, there is also a "pull" model of collection.

https://ithelp.ithome.com.tw/upload/images/20210904/20141140SUUyqltqtl.png
(https://community.xvigil.com/t/how-to-use-prometheus-to-monitor-complex-applications-and-infrastructure/1813)

Prometheus is a tool that collects data from machines. Machines whose data needs to be collected can run a Prometheus exporter, which can be thought of as an external interface to that machine's data.
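In the pull model, the machine simply exposes its metrics over HTTP and Prometheus scrapes them on a schedule. A minimal custom exporter sketch using the official prometheus_client library, where the metric name and port are arbitrary choices for the example:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Example metric; a real exporter would expose CPU, memory,
# request counts, and so on.
queue_depth = Gauge("app_queue_depth", "Number of jobs waiting in the queue")

if __name__ == "__main__":
    # Expose /metrics on port 8000; Prometheus pulls from this endpoint.
    start_http_server(8000)
    while True:
        queue_depth.set(random.randint(0, 100))  # stand-in for a real measurement
        time.sleep(15)
```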

Data verification
Although we call it "big data", not all data is worth collecting. In practice there is a lot of "bad" or unclean data, and letting it into the database only degrades data quality.

Source verification
Since we want to receive data, the server that collects it is usually exposed to the outside world, so it is very likely to be hit by traffic other than legitimate clients, such as crawlers or malicious attacks. Without basic authentication, we may receive a lot of useless messages or even have the server go down.
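One common form of source verification is to require callers to sign their payloads with a shared secret, so unauthenticated traffic can be rejected before it enters the pipeline. A minimal sketch using an HMAC signature from the standard library; the secret handling and how the signature is transported are assumptions:

```python
import hashlib
import hmac

SHARED_SECRET = b"replace-with-a-real-secret"  # assumed: one shared key per client

def is_trusted_source(body: bytes, signature_hex: str) -> bool:
    """Return True only if the request body was signed with our shared secret."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing attacks when comparing signatures.
    return hmac.compare_digest(expected, signature_hex)

# In the ingestion endpoint, drop (or divert) anything that fails
# this check instead of letting it into the database.
```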

Field verification
Usually, before collecting data we agree on a specification (spec) for it, to make subsequent analysis and processing easier. In practice the spec is sometimes violated, because of different versions of the client app or a typo made during development, so basic field validation is needed.
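A simple field check against the agreed spec might look like the sketch below; the required field names are only examples:

```python
REQUIRED_FIELDS = {"user_id", "action", "timestamp"}  # example spec

def missing_fields(event: dict) -> set:
    """Return the spec fields that the incoming event failed to provide."""
    return REQUIRED_FIELDS - event.keys()

event = {"user_id": "u_12345", "action": "click"}
problems = missing_fields(event)
if problems:
    print(f"rejecting event, missing fields: {problems}")
```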

Format Validation
For example, height is numeric data, while an email address is a string that must fit a specific pattern. When the format of such data is wrong, it directly affects downstream processing.
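A simple format check for the height and email examples above might look like this sketch; the "plausible" height range and the simplified email pattern are assumptions:

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified email pattern

def validate_formats(record: dict) -> list:
    """Collect format errors instead of failing on the first one."""
    errors = []
    try:
        height = float(record.get("height", ""))
        if not 0 < height < 300:
            errors.append("height out of plausible range")
    except ValueError:
        errors.append("height is not numeric")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email does not match expected pattern")
    return errors

print(validate_formats({"height": "abc", "email": "not-an-email"}))
```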

How to deal with verification failures
Although we do not want bad data to enter downstream processing, bad data has analytical value of its own, for example for figuring out who is tampering with the data, or whether the front end has a problem with re-sending data. Therefore, in some cases we do not discard records that fail verification; instead we import them into a separate data store for later analysis and observation.
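In code, this often means routing records that fail validation to a separate quarantine (dead-letter) store rather than dropping them. A rough sketch, where plain lists stand in for real storage targets:

```python
def route(event: dict, errors: list, good_store: list, quarantine: list) -> None:
    """Send valid events downstream and keep invalid ones for later analysis."""
    if errors:
        # Keep the original payload plus the reasons it failed, so analysts
        # can later see who sent it and why it broke.
        quarantine.append({"event": event, "errors": errors})
    else:
        good_store.append(event)

good_store, quarantine = [], []
route({"height": "abc"}, ["height is not numeric"], good_store, quarantine)
```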

Data Cleaning
After the data has passed basic validation, it still needs to be cleaned before it can really be used. Data cleaning is diverse and complex, so here we only introduce a few of the more commonly encountered scenarios.

Parsing
Data parsing means splitting raw data apart and extracting the pieces inside it, for example dividing an "address" into "city", "area", and "street", or dividing a "sentence" into words and parts of speech (a basic operation in natural language processing).
https://ithelp.ithome.com.tw/upload/images/20210904/20141140zyvRtrOJvJ.png
(Multi-parser architecture for query processing. 2001)
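As a toy example of parsing, here is a sketch that splits an address string into coarser fields; the comma delimiter and the field order are assumptions about the input format:

```python
def parse_address(address: str) -> dict:
    """Split a comma-separated address into city / area / street fields."""
    parts = [p.strip() for p in address.split(",")]
    city, area, street = (parts + ["", "", ""])[:3]  # pad if parts are missing
    return {"city": city, "area": area, "street": street}

print(parse_address("Taipei, Da'an District, Roosevelt Road"))
# {'city': 'Taipei', 'area': "Da'an District", 'street': 'Roosevelt Road'}
```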

Duplicate elimination
Duplicate values may appear when data is re-sent because of network retransmission or program bugs. Duplicates may look harmless on the surface, but they can skew subsequent calculations, so they need to be removed explicitly. Removing duplicates has its own difficulties, for example:
Over what time window do two identical records count as duplicates? An hour, a day, or a month?

How similar do two records have to be to count as duplicates? If the content of an event is exactly the same except for the time of occurrence, is it a duplicate?
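A rough sketch of one possible answer to the questions above: treat two events as duplicates if everything except the timestamp matches within a chosen time window. Both the key fields and the one-hour window are assumptions for the example:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed deduplication window

def deduplicate(events: list) -> list:
    """Keep the first event per (user, action) key seen within the window."""
    last_seen = {}
    kept = []
    for e in sorted(events, key=lambda e: e["timestamp"]):
        key = (e["user_id"], e["action"])
        ts = e["timestamp"]
        if key in last_seen and ts - last_seen[key] <= WINDOW:
            continue  # duplicate within the window, drop it
        last_seen[key] = ts
        kept.append(e)
    return kept

events = [
    {"user_id": "u1", "action": "click", "timestamp": datetime(2023, 2, 3, 10, 0)},
    {"user_id": "u1", "action": "click", "timestamp": datetime(2023, 2, 3, 10, 20)},
]
print(len(deduplicate(events)))  # 1
```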

Checking Outliers
Some data may look structurally valid but be abnormal when viewed as a whole, such as unusually high response times, implausible user heights, or too many clicks. It is hard to tell on the surface whether such data is real or anomalous; it usually takes statistical methods or even visualization to decide. For example, in a previous job I had to check whether the company's services were being used normally or were being hacked or hijacked, and that relied on outlier-detection algorithms.
https://ithelp.ithome.com.tw/upload/images/20210904/20141140jCs5LNPkWr.jpg
(https://medium.com/@praveengovi.analytics/outliers-what-it-say-during-data-analysis-75d664dcce04)
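A common first pass is the box-plot (IQR) rule: flag values that fall well outside the interquartile range. In this sketch the 1.5 multiplier is the conventional choice rather than a rule, and the sample numbers are made up:

```python
from statistics import quantiles

def iqr_outliers(values: list, k: float = 1.5) -> list:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

response_times_ms = [120, 130, 110, 125, 118, 122, 5000]  # one suspicious spike
print(iqr_outliers(response_times_ms))  # [5000]
```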

The procedures above may seem tedious and uninteresting, but it comes back to the old saying **"Garbage in, garbage out"**: data needs to be verified and cleaned after collection to ensure its quality.
