Machine Learning Data Preprocessing Steps with Python

Abdul Rehman Wahlah
Jun 12, 2020 · 6 min read


A beginner's guide to data preprocessing

“Data preprocessing is a process of converting raw data into a clean data set.”

Data preprocessing plays a crucial role in delivering quality, useful datasets to machine learning models. When we gather large datasets from different sources, they arrive in a raw format that is not ready for analysis. If we feed this raw data to machine learning models anyway, we end up with poorly trained models and very low accuracy. So we should always look into the dataset and apply preprocessing steps before training models.

The following steps will be covered in this article:

1- Import Libraries

2- Import the Dataset

3- Identify and Handle Missing Values

4- Encoding Categorical Data

5- Splitting the Dataset into Training and Test Sets

I will use Google Colab to implement these steps and include the code along with screenshots of the output.

Data set: [image: a sample dataset with Country, Age, Salary, and Purchased columns]

1- Import Libraries:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays.
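As a quick, purely illustrative sketch of how the three libraries work together (the tiny made-up table below is not the dataset used in this article):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# A tiny made-up table, just to show each library in action
df = pd.DataFrame({'Age': [25, 32, 47], 'Salary': [40000, 52000, 61000]})

ages = np.array(df['Age'])            # pandas column converted to a NumPy array
print(ages.mean())                    # NumPy computation on the array

plt.scatter(df['Age'], df['Salary'])  # quick matplotlib scatter plot
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()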

2- Import the Dataset:

dataset = pd.read_csv('your_dataset.csv')  # replace with the name of your dataset file
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

This reads the dataset file using the pandas library and stores it in a variable. If your dataset is in an Excel file, you can use pd.read_excel() instead.
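For example, a minimal sketch of the Excel case might look like this (the file and sheet names are placeholders for your own file, and reading .xlsx files requires the openpyxl package to be installed):

import pandas as pd

# Placeholder file and sheet names — replace with your own
dataset = pd.read_excel('your_dataset.xlsx', sheet_name='Sheet1')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values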

The next two lines store all columns from the dataset except the last one (the independent columns) in the X variable, and the last column (the dependent column) in the y variable.

Now if we print the X variable, it will show the following output:

Extracted Independent variable

If we print the y variable, it will show the following output:

Extracted dependent variable

According to the above dataset, the first three columns (independent) are stored in the X variable and the last column, Purchased (dependent), in the y variable.

3- Identify and Handle Missing Values:

It is very important to identify null values, and even more important to decide how to deal with them.

You can identify the number of null values by using the following code:

dataset.isna().sum()  # returns each column name with its number of null values

For the above dataset, the output will be:

There is one missing value in the Age column and one in the Salary column

There are a couple of ways to deal with null values: for example, if more than 70% of a column's values are null, you can drop the column. If the number is not that big, you can fill in the missing values.

If you have numeric data like age, salary, or year, you can take the mean, median, or mode of that column and use it to replace the missing values. There are many other ways to handle null values, such as more advanced imputation techniques. You can read about them and use them according to the dataset's requirements.
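As a rough sketch of these two options using plain pandas (the 70% threshold comes from the paragraph above; which columns you fill, and with which statistic, depends on your data):

import pandas as pd

# Option 1: drop any column where more than 70% of the values are missing
threshold = 0.7
dataset = dataset.loc[:, dataset.isna().mean() <= threshold]

# Option 2: fill missing numeric values with the column mean (or median/mode)
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
dataset['Salary'] = dataset['Salary'].fillna(dataset['Salary'].median())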

For this dataset, we will use the following method:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

This will replace the missing values in the Age and Salary columns.

The missing values are replaced by these highlighted values

You can see that the null values have been replaced by the highlighted values.

4- Encoding Categorical Data:

OneHotEncoding Example

One-hot encoding is a type of vector representation in which all of the elements in a vector are 0 except for one, which has the value 1; the position of that 1 indicates the category of the element.
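A tiny toy example may make this clearer; here pandas' get_dummies is used purely for illustration (the actual code for our dataset, using scikit-learn, follows below):

import pandas as pd

countries = pd.Series(['France', 'Germany', 'Spain', 'France'])
print(pd.get_dummies(countries, dtype=int))
#    France  Germany  Spain
# 0       1        0      0
# 1       0        1      0
# 2       0        0      1
# 3       1        0      0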

For the dataset we are using in this article, we will write the following code:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

It will convert the first column (Country) into three different columns.

Now there are three columns in place of the one Country column: the first is for France, the second for Germany, and the third for Spain. The first column has a 1 in every row where the country was France, the second has a 1 where the country was Germany, and the third has a 1 where the country was Spain.

Data set

If we apply the above code to our example dataset, it will convert the first column (Country) into three columns, and when we print the X variable the output will be as follows.

Now there are 5 columns in the dataset

And this is how we will encode the dependent variable:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

If we print the y variable, we will see the following output:

0 = ‘No’ and 1 = ‘Yes’

It will convert our target column (Purchased) into 0 (No) and 1 (Yes).
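If you ever need to map the encoded labels back to the original strings, the same le object from the snippet above keeps that mapping (a small illustrative check):

print(le.classes_)                    # ['No' 'Yes'] — the original categories
print(le.inverse_transform([0, 1]))   # ['No' 'Yes'] — 0/1 mapped back to labels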

5- Splitting the Dataset into Training and Test Sets:

Splitting the dataset into training and testing sets

The image above explains the train/test split well enough, so I will give you only a very quick review of it.

Train/test splitting is a method for measuring the accuracy of your model. It is called train/test because you split the dataset into two sets: a training set and a testing set. We usually split the data roughly 80%-20% between the training and testing stages. In supervised learning, we always split the dataset into training data and test data before fitting the model.

This is how you can implement it on your dataset:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Simply import train_test_split and call it with the right parameters. X and y are the variables that hold the independent features and the target, while the test_size parameter decides what percentage of the data you want to hold out. As per the above code (test_size = 0.2), 80% of the dataset will be used to train the model and 20% will be saved for testing the trained model.
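A quick sanity check on the resulting split sizes (the exact row counts depend on your dataset; the comments are illustrative):

print(X_train.shape, X_test.shape)  # roughly 80% and 20% of the rows, respectively
print(y_train.shape, y_test.shape)  # matching numbers of target values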

Note: There is one more step, feature scaling, which is not always required; for the dataset we are using in this article it is not needed.
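In case your dataset does need it, a minimal sketch with scikit-learn's StandardScaler might look like this (here I assume the numeric Age and Salary columns sit at indices 3 and 4 after the one-hot encoding step above, and the scaler is fit on the training set only to avoid leaking test information):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Scale only the numeric columns (Age and Salary), not the one-hot encoded ones
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])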

I hope this is helpful for you; using it, you can apply preprocessing steps to any dataset as its requirements dictate.

Best Regards,
