Data Preprocessing: where it all begins.

MLamine Guindo
Published in unpack
Mar 15, 2021

If today we’re talking about machine learning, it’s mainly due to the amount of data we generate. Without data, machine learning would not exist. If both were humans, we would say there is a love story between them.

Before cooking, we preprocess our food (washing, removing some parts…)

The problems

(There is no love story without problems :P)

Real-world data is gathered from multiple, disparate sources; therefore, it rarely arrives in the form we expect. The collected data is often incomplete, inconsistent, and may contain many errors.

The solution

To deal with those problems, we turn to data preprocessing: a set of techniques for turning unclean data into a clean dataset.

In Machine Learning, data preprocessing refers to cleaning and sorting raw data to make it appropriate for constructing and training Machine Learning models. In basic terms, data preprocessing is a data mining technique used in Machine Learning that converts raw data into a readable and understandable format.

Why is data preprocessing important?

In 2014, The New York Times published an article entitled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, which stated:

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

This statement shows the importance of data preprocessing: it can take between 50 and 80 percent of a data scientist’s time. In fact, the performance of a model relies heavily on how good the data is.

Let’s see the different techniques of data preprocessing

Data preprocessing includes several techniques such as data cleaning, integration, transformation, and reduction.

Data cleaning is the starting point of data preprocessing. It helps to handle missing values, remove noisy data, detect outliers, and correct inconsistencies. When the data is dirty, it negatively impacts the model’s performance.
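To make this concrete, here is a minimal sketch of two common cleaning steps, using only Python’s standard library. The choice of median imputation and the 1.5 × IQR outlier rule are my assumptions for illustration, and the `ages` list is made-up data; other strategies may suit your dataset better.

```python
from statistics import median, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical column with one missing value and one suspicious entry
ages = [23, 25, None, 27, 24, 26, 120]
filled = fill_missing(ages)
outliers = iqr_outliers(filled)
print(filled)
print(outliers)
```

Whether a flagged outlier should be removed, corrected, or kept depends on the domain, so the sketch only reports it.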

Data integration: This approach combines data from many diverse sources into a single coherent data store. The sources can include several databases, files, or data cubes.
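As a rough illustration, the sketch below merges two hypothetical sources that describe the same customers under different key names (`customer_id` in one, `cust` in the other). The record layout and the `integrate` function are invented for this example; real integration usually also has to reconcile units, formats, and duplicates.

```python
def integrate(crm, billing):
    """Combine customer records with the total billed per customer."""
    totals = {}
    for row in billing:
        totals[row["cust"]] = totals.get(row["cust"], 0.0) + row["total"]
    # Keep every CRM customer; customers without bills get a total of 0.0
    return [{**c, "total_spent": totals.get(c["customer_id"], 0.0)} for c in crm]

# Two made-up sources using different names for the same key
crm_records = [
    {"customer_id": 1, "name": "Awa"},
    {"customer_id": 2, "name": "Ben"},
]
billing_records = [
    {"cust": 1, "total": 40.0},
    {"cust": 1, "total": 15.5},
    {"cust": 3, "total": 9.0},
]

merged = integrate(crm_records, billing_records)
print(merged)
```

Note that the billing row for customer 3 has no match in the CRM data and is dropped here; deciding what to do with such unmatched records is part of the integration work.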

Data transformation: This approach converts the data into a form suitable for analysis. It comprises several sub-techniques, such as:

· Smoothing: It removes noise from the data.

· Aggregation: It summarizes data using statistical measures such as the mean, median, and variance.

· Generalization: It uses concept hierarchies to replace lower-level (primitive) data with higher-level data. For example, a region, which is a categorical attribute, may be replaced with its country, which is a higher-level concept.

· Normalization: It rescales the data values to a given range, such as 0 to 1 or -1 to 1. This is helpful for techniques like artificial neural networks, classification, and clustering.
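The four sub-techniques above can be sketched in a few lines of plain Python. This is only one possible reading of each: I assume a moving average for smoothing and min-max rescaling for normalization, and the `REGION_TO_COUNTRY` mapping and sample values are hypothetical.

```python
from statistics import mean, median, pvariance

def moving_average(values, window=3):
    """Smoothing: replace each window of readings with its mean."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def aggregate(values):
    """Aggregation: summarize a column with a few statistics."""
    return {"mean": mean(values), "median": median(values), "variance": pvariance(values)}

# Hypothetical concept hierarchy: region -> country
REGION_TO_COUNTRY = {"Bavaria": "Germany", "Saxony": "Germany", "Normandy": "France"}

def generalize(regions):
    """Generalization: replace each region with its higher-level country."""
    return [REGION_TO_COUNTRY.get(region, "Unknown") for region in regions]

def min_max(values, low=0.0, high=1.0):
    """Normalization: rescale values linearly into the range [low, high]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid dividing by zero on a constant column
    return [low + (v - vmin) * (high - low) / span for v in values]

readings = [2.0, 4.0, 9.0, 4.0, 6.0]
print(moving_average(readings))
print(aggregate(readings))
print(generalize(["Bavaria", "Normandy"]))
print(min_max(readings))
print(min_max(readings, low=-1.0, high=1.0))
```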

Data reduction: This approach reduces the size of a dataset’s representation while preserving the structure of the original dataset. As a result, mining the reduced data is more efficient while producing the same, or nearly the same, analytical results.
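There are many ways to reduce a dataset (sampling, feature selection, dimensionality reduction, and so on). As one simple example, and purely as an assumption on my part, the sketch below drops columns whose values never change, since a constant column carries no information for a model. The `samples` data and column names are made up.

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=0.0):
    """Keep only the columns whose variance is above the threshold."""
    keep = [key for key in rows[0] if pvariance([row[key] for row in rows]) > threshold]
    return [{key: row[key] for key in keep} for row in rows]

# Hypothetical measurements: "sensor_ok" is identical in every row
samples = [
    {"temperature": 20.5, "humidity": 40.0, "sensor_ok": 1},
    {"temperature": 22.0, "humidity": 42.5, "sensor_ok": 1},
    {"temperature": 19.0, "humidity": 39.0, "sensor_ok": 1},
]

reduced = drop_low_variance(samples)
print(reduced)
```

The number of rows is unchanged; only the uninformative column disappears.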

Example of preprocessing an image

1-Delete corrupted files

2-Resize the images so they all have the same size

3-Scale the image

4-Normalize the image

5-Reduce the dimension

Note: This is just an example; it does not mean your data must follow the same path.
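Here is a toy sketch of the image steps listed above, with images represented as nested lists of (R, G, B) tuples so that it runs without any library. Several choices are my own assumptions: a "corrupted" image is one that is missing or has rows of unequal length, "scaling" maps pixel values from 0..255 to 0..1, "normalizing" gives each image zero mean and unit variance, and "reducing the dimension" means averaging the three color channels into one. In practice you would use an imaging library instead.

```python
from statistics import mean, pstdev

def is_valid(image):
    """An image is usable if it is non-empty and all rows have equal length."""
    return bool(image) and bool(image[0]) and all(len(row) == len(image[0]) for row in image)

def resize_nearest(image, height, width):
    """Resize with nearest-neighbour sampling so every image has the same shape."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]

def scale(image):
    """Map 8-bit channel values from 0..255 to 0.0..1.0."""
    return [[tuple(ch / 255 for ch in px) for px in row] for row in image]

def normalize(image):
    """Shift and rescale so the channel values have zero mean and unit variance."""
    values = [ch for row in image for px in row for ch in px]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on a flat image
    return [[tuple((ch - mu) / sigma for ch in px) for px in row] for row in image]

def to_grayscale(image):
    """Reduce three color channels to one by averaging them."""
    return [[sum(px) / len(px) for px in row] for row in image]

def preprocess(images, height=2, width=2):
    """Apply the five steps in order to every usable image."""
    usable = [img for img in images if is_valid(img)]
    return [
        to_grayscale(normalize(scale(resize_nearest(img, height, width))))
        for img in usable
    ]

# One valid 4x4 image, one missing file, one image with ragged rows
photo = [[(r * 60, c * 60, 128) for c in range(4)] for r in range(4)]
batch = [photo, None, [[(0, 0, 0)], [(0, 0, 0), (255, 255, 255)]]]

result = preprocess(batch)
print(len(result), len(result[0]), len(result[0][0]))
```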

References:

https://www.researchgate.net/publication/319990923_Review_of_Data_Preprocessing_Techniques_in_Data_Mining

https://becominghuman.ai/image-data-pre-processing-for-neural-networks-498289068258

https://towardsdatascience.com/data-preprocessing-and-network-building-in-cnn-15624ef3a28b

https://machinelearningmastery.com/best-practices-for-preparing-and-augmenting-image-data-for-convolutional-neural-networks/

