ML Series: Day 6 — Preprocessing (Part 1) — Elevating Data Quality for Effective Machine Learning

Cleaning levels 1 and 2

Ebrahim Mousavi
5 min read · Jan 22, 2024
Figure 1. Preprocessing

🔙 Previous: Nonlinear regression

🔜 Next: Preprocessing (Part 2)

Note: In this chapter, the techniques and methodologies discussed are based on the book ‘Hands-On Data Preprocessing in Python’ by Roy Jafari, which served as a valuable reference for understanding and implementing effective data preprocessing workflows.

Data preprocessing refers to the techniques and procedures for preparing data for analysis before applying machine learning algorithms. These steps include data cleaning, data normalization, data reduction, and transforming and organizing raw data into a format that algorithms can use easily. The purpose of data preprocessing is to improve data quality, reduce noise, and eliminate inconsistencies and possible errors.

Data cleaning involves removing or correcting errors, inconsistencies, or missing values in the data. Data transformation involves converting data into a format more suitable for analysis, such as converting categorical data to numerical data or reducing the dimensionality of the data. Data normalization involves scaling the data so that it falls within a certain range, which is important for some machine learning algorithms. Data integration involves combining data from multiple sources into a single dataset. Data reduction involves shrinking a dataset by selecting only the most relevant features or samples.
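A minimal sketch of three of these steps (cleaning, transformation, normalization) using pandas; the DataFrame, its column names, and its values are made up purely for illustration:

```python
import pandas as pd

# Toy dataset; columns and values are hypothetical, for illustration only.
df = pd.DataFrame({
    "age": [25.0, None, 47.0, 31.0],
    "city": ["Tehran", "Shiraz", "Tehran", None],
    "income": [400, 520, 610, 480],
})

# Cleaning: handle missing values
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("unknown")

# Transformation: categorical -> numerical (one-hot encoding)
df = pd.get_dummies(df, columns=["city"])

# Normalization: rescale numeric columns to the [0, 1] range
num = ["age", "income"]
df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())
print(df)
```

After these steps, every column is numeric, there are no missing values, and the numeric features share a common scale.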

The process of data cleaning can be quite different from one project to another, so it is not possible to give universal step-by-step instructions for performing it. However, we can divide the data-cleaning process into three levels, as follows:

Cleaning level one:

This level refers to the surface-level cleaning of the data, and most readily available datasets do not need this stage of cleaning. We consider a dataset to be at level one if it has the following characteristics:

- The data has a standard structure.

- The column titles are clear and codable.

- Each row of data has a unique identifier.

As an example of this type of cleaning, suppose we have several text files, as shown in Figure 2. These data do not have a proper structure and should be consolidated into a standard format before they can be analyzed.

Figure 2. Raw text data

In Figure 3, the text files have been consolidated into a single, integrated table.

Figure 3. Integrating data to reach level one of data cleaning

In Figure 3, each row corresponds to one text file, and the data can now be said to be at the first level of data cleaning.
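The consolidation step above can be sketched as follows; the folder, file names, and file contents are assumptions standing in for the raw files of Figure 2:

```python
import os
import tempfile
import pandas as pd

def consolidate_text_files(folder):
    """Read every .txt file in `folder` into one row of a DataFrame."""
    rows = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                rows.append({"file_name": name, "content": f.read()})
    df = pd.DataFrame(rows)
    df.index.name = "id"  # unique identifier for each row
    return df

# Demo with two throwaway files standing in for the raw data
tmp = tempfile.mkdtemp()
for fname, text in [("Austin_11-1.txt", "vote tax ..."),
                    ("Boston_11-2.txt", "campaign economy ...")]:
    with open(os.path.join(tmp, fname), "w", encoding="utf-8") as f:
        f.write(text)

df = consolidate_text_files(tmp)
print(df)
```

The result is one table with a standard structure, clear column titles, and a unique identifier per row, i.e. the three level-one characteristics listed above.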

Cleaning level two:

This level refers to unpacking, restructuring, and reformulating tables. Unlike level one, where the samples did not require special analysis, this level of cleaning often does require some analysis. As mentioned before, the data in Figure 3 is at the first level of cleaning. If we look at the name of each row, we can extract the city name, month, and day. Applying the second level of cleaning, specifically unpacking the row names, yields Figure 4.

Figure 4. Unpacking data names and adding different columns to the data
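This unpacking step can be sketched with pandas string methods; the file-name pattern `City_month-day.txt` is an assumption for illustration:

```python
import pandas as pd

# Hypothetical row names following a "City_month-day.txt" pattern.
df = pd.DataFrame({"file_name": ["Austin_11-1.txt", "Boston_11-2.txt"]})

stem = df["file_name"].str.replace(".txt", "", regex=False)
parts = stem.str.split("_", expand=True)     # -> city | "month-day"
df["city"] = parts[0]
date = parts[1].str.split("-", expand=True)  # -> month | day
df["month"] = date[0].astype(int)
df["day"] = date[1].astype(int)
print(df)
```

Each piece of information that was packed into the file name now lives in its own column, which is exactly what makes the table ready for grouping and filtering later on.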

At the second level of data cleaning, further analysis is possible. For example, we can unpack the data content by counting the repetitions of the four words [‘vote’, ‘tax’, ‘campaign’, ‘economy’] in each row and adding one column per word; Figure 5 shows the result of this analysis.

Figure 5. Unpacking the data content and adding different columns to the data
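A sketch of that word-counting step; the text snippets below are placeholders, not the actual file contents:

```python
import pandas as pd

# Placeholder text contents standing in for the real files.
df = pd.DataFrame({
    "content": ["vote vote tax campaign", "economy tax tax vote"],
})

# One new column per keyword, holding its number of repetitions per row.
for word in ["vote", "tax", "campaign", "economy"]:
    df[word] = df["content"].str.lower().str.count(rf"\b{word}\b")
print(df)
```

Using a word-boundary pattern (`\b`) keeps, for example, “taxes” from being counted as “tax”; whether that is desirable depends on the analysis goal.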

For further analysis, the values of the four added columns can be visualized as shown in Figure 6.

Figure 6. Visualization of the number of repetitions of four words in different months
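A visualization in the spirit of Figure 6 can be produced with pandas plotting; the per-month totals below are made-up placeholders, not the real counts:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals of each keyword per month.
counts = pd.DataFrame(
    {"vote": [12, 18], "tax": [7, 9], "campaign": [15, 11], "economy": [5, 14]},
    index=pd.Index([10, 11], name="month"),
)

ax = counts.plot(kind="bar", figsize=(8, 4))
ax.set_xlabel("month")
ax.set_ylabel("number of repetitions")
ax.set_title("Keyword repetitions per month")
plt.tight_layout()
plt.savefig("word_counts_by_month.png")
```

In a real run, `counts` would come from grouping the Figure 5 table by the `month` column and summing the four keyword columns.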

Conclusion

In level 1, we cleaned up the table without paying attention to the data structure or the recorded values. In level 2, our attention was on having a data structure that supports our analytic goal, but we still did not pay much attention to the correctness or appropriateness of the recorded values. That is the objective of data cleaning level 3. In level 3, we focus on the recorded values and take measures to address three matters. First, we make sure that missing values in the data have been detected, that we know why they occurred, and that appropriate measures have been taken to address them. Second, we ensure that appropriate measures have been taken so that the recorded values are correct. Third, we ascertain that the extreme points in the data have been detected and appropriately addressed.

In the next session, Machine Learning Series: Day 7 — Preprocessing (Part 2), we discuss missing values and outliers, the core ideas of level 3 cleaning.

If you like the article and would like to support me make sure to:

👏 Clap for the story (as much as you liked it) and follow me 👉
📰 View more content on my medium profile
🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter

References:

[1] Roy Jafari, Hands-On Data Preprocessing in Python, Packt Publishing.