Data Pre-processing
In this post I try to cover data pre-processing, which plays a major role in machine learning model development. Data is essential for building a model, but it has to be well processed first.
Data plays a very important role in the development of machine learning models; we can’t even imagine machine learning without it. About 2.5 quintillion bytes of data are generated every day, and we can use this data to develop machine learning models. However, raw data is usually not processed well enough to be used directly for development, so we have to pre-process it and make it useful for building models. Before going through the pre-processing methods, I would like to say a few words about the problems, i.e. the impurities, found in raw data.
Impurities In Raw Data
Let’s first look at the impurities in raw data, and then we will see the methods to sort them out.
- Some records may have empty values, i.e. no value for a particular attribute. These null values cause big issues during the development of machine learning models: they may be informative to us, but not to the model’s calculations. So we have to manage the empty or null values. We will see the solution for this later in this post.
- The dataset provided to train the model may contain non-numeric data. Machines can deal only with numerical data and do not understand non-numeric values. There is more than one method to convert non-numeric data into numeric data, and we will see them in this post.
- The dataset may contain data that is not useful, such as names, serial numbers, etc. We have to remove these attributes from the dataset, or at least avoid using them to train the model.
- One attribute may span a very large range of values while another spans a very small one. This can cause problems when the model makes decisions or predictions, so we have to bring all the data into a standard range such as 0 to 1 or -1 to 1. We will see the methods to deal with this problem.
So, these are some of the problems with raw data, and we can’t use this kind of data directly to develop a machine learning model. Let’s go through the solutions to the above problems one by one.
Data Pre-processing
I am using the Titanic dataset to explain data pre-processing. We will need the Pandas library to make changes to the dataset.
Here I have imported the Pandas library and loaded the titanic.csv file with it. We saw the head() method in the last post as well; it returns the first 5 records of the dataset.
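A minimal sketch of this step; the file name titanic.csv and the variable name df are assumptions on my side:

import pandas as pd

# Load the Titanic data from a CSV file
df = pd.read_csv("titanic.csv")

# head() returns the first 5 records by default
print(df.head())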
Let’s start one by one.
1) Useless attributes:
The dataset may have attributes that are not useful for developing a model, so we simply remove them from the dataset and move on. Alternatively, we can keep the dataset intact and select the input data into one variable and the output data into another; this also solves the problem.
As the first few records of the Titanic data above show, there are attributes such as PassengerId, Name, Ticket, Cabin, etc. which are not useful for developing a model, so we can just remove them.
The drop() method removes attributes from the dataset. Alternatively, we can put the input data in one variable and the output in another without editing the imported dataset, as sketched below.
Either of the above approaches solves this problem.
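A rough sketch of both options, using the df loaded above and the usual Titanic column names; the variable names df_clean, X, and y are mine:

# Option 1: remove columns that do not help the model (axis=1 means columns)
df_clean = df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

# Option 2: without editing df, put the inputs in one variable and the output in another
X = df.drop(["PassengerId", "Name", "Ticket", "Cabin", "Survived"], axis=1)
y = df["Survived"]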
2) Null values:
If the dataset has null values, they will cause major issues during model development, so we have to manage them. Let’s see the methods for handling null values.
The info() method gives information about the null values and the data types of all attributes present in the dataset. You may also have seen Python’s len() function; it returns the number of records in the dataset. Running these checks on the Titanic data (see the sketch below) shows that Age, Cabin, and Embarked have some null values. Cabin is one of the attributes we are not using, so we will not take care of it.
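A minimal sketch of these checks, continuing with the df loaded earlier:

# info() reports the non-null count and dtype of every attribute
df.info()

# len() returns the number of records (rows)
print(len(df))

# An explicit per-column count of null values
print(df.isnull().sum())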
- We can simply remove the records that contain null values. This would solve the problem, but it is not recommended because we lose information.
- If the attribute in that column is numeric (and not categorical), we can put the mean value into the empty instances. This handles the numeric attributes. In the Titanic data there are null values in the Age attribute, so let’s fill the null instances with the mean age.
The fillna() method fills a value into the null instances. We fill the null values with the mean of the Age attribute and then run info() again to confirm that Age no longer has any null values.
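A minimal sketch of this fill, continuing with the same df:

# Replace the null Age values with the mean of the Age column
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Age should now report zero null values
df.info()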
- If the data in the attribute is numeric but categorical, we put in the value of the category with the highest frequency in the dataset. Non-numeric data is generally categorical as well, so we can use the same method for non-numeric, categorical attributes. Let’s try to manage the null values of the Embarked attribute.
The value_counts() method counts how many records fall into each category of a particular attribute. We fill the null values with the class that has the maximum count.
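A rough sketch, continuing with the same df; the helper name most_frequent is mine:

# Count how many records fall into each Embarked category
print(df["Embarked"].value_counts())

# Take the category with the maximum count and fill the nulls with it
most_frequent = df["Embarked"].value_counts().idxmax()
df["Embarked"] = df["Embarked"].fillna(most_frequent)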
3) Non-numeric data:
There may be useful data that is in non-numeric form. Machines understand only numeric data, so we have to convert non-numeric data into numeric data. There is more than one method to do this, and we will see them one by one. In the Titanic dataset, the Sex and Embarked attributes are non-numeric, so let’s try to convert them into numeric form.
- Using the map() method we can map the non-numeric data to numeric data. Let’s see how.
The map() method takes a dictionary with the original non-numeric value as the key and the value we want to replace it with as the value, as in the sketch below.
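A minimal sketch, assuming Sex and Embarked still hold their original string values; the specific numbers in the dictionaries are my choice:

# Map each category of Sex to a number of our choosing
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})

# The same idea works for Embarked
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})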
- Another method that does the same thing is replace(). It works the same way as the map() method.
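A sketch of the same conversion with replace(), again assuming the Sex column still holds its original strings:

# replace() also takes a dictionary of original value -> replacement
df["Sex"] = df["Sex"].replace({"female": 0, "male": 1})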
- We can use the LabelEncoder class from the sklearn library. It replaces the values with numbers starting from 0, assigned in alphabetical order.
Alphabetically, female comes first, so LabelEncoder replaces female with 0 and male with 1.
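A minimal sketch with scikit-learn’s LabelEncoder, assuming the Sex column still holds its original strings:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers in alphabetical order: female -> 0, male -> 1
le = LabelEncoder()
df["Sex"] = le.fit_transform(df["Sex"])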
- The get_dummies() method creates a separate column for each category of the attribute. Let’s see how it works.
It adds one column per category and puts a 1 in the column of the record’s own category and a 0 in the others. For example, for a male passenger the Sex_male column gets the value 1 and Sex_female gets 0.
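A rough sketch, continuing with the pd and df from above and assuming Sex still holds its original strings:

# One column per category (Sex_female, Sex_male), filled with 1/0 integers
dummies = pd.get_dummies(df["Sex"], prefix="Sex", dtype=int)

# Replace the original Sex column with the new dummy columns
df = pd.concat([df.drop("Sex", axis=1), dummies], axis=1)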
So, those are some methods for converting non-numeric data into numeric data, which solves this problem.
4) Feature scaling:
Feature scaling means converting all features into the same range, such as 0 to 1 or -1 to 1. It helps the model train faster and keeps attributes with large ranges from dominating those with small ranges. There is more than one method to do this; let’s see them.
- MinMaxScaler(): The MinMaxScaler() method rescales all numeric data into the range 0 to 1.
It works only on numeric data, since we are converting values into a specific range and non-numeric values do not have a range.
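A minimal sketch with scikit-learn, where the choice of columns to scale (Age and Fare) is my assumption:

from sklearn.preprocessing import MinMaxScaler

# Rescale the selected numeric columns into the range 0 to 1
numeric_cols = ["Age", "Fare"]
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])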
- The StandardScaler() method standardizes the numeric values to zero mean and unit variance, so most values end up roughly between -1 and 1, though they are not strictly bounded. Its implementation is similar to the above method.
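A similar sketch for StandardScaler(), with the same assumed column choice:

from sklearn.preprocessing import StandardScaler

# Standardize the selected numeric columns to zero mean and unit variance
numeric_cols = ["Age", "Fare"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])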
I hope you get the idea of the StandardScaler() method.
These are the solutions to the problems we listed for raw data. Before feeding data to a model, we have to apply these data pre-processing methods and convert the data into the proper format so the model can read it properly and train well.
Conclusion
I have given a brief overview of data preprocessing: the problems present in raw data and how to solve them, along with code demos. I hope anyone who reads this post can implement data preprocessing with confidence.
Code is available here on GitHub.
Thank you.
Originally published at https://omkarraut.substack.com.