Data Cleaning and Preprocessing

Ujjawal Verma
Published in Analytics Vidhya · Nov 19, 2019

Data preprocessing involves transforming a raw dataset into an understandable format. It is a fundamental stage in data mining, and the preprocessing methods used directly affect the outcome of any analytic algorithm.

Data preprocessing is generally carried out in 7 simple steps:

Steps In Data Preprocessing:

  1. Gathering the data
  2. Import the dataset & Libraries
  3. Dealing with Missing Values
  4. Divide the dataset into Dependent & Independent variables
  5. Dealing with Categorical values
  6. Split the dataset into training and test set
  7. Feature Scaling

1. Gathering the data

Data is raw information; it is the representation of both human and machine observations of the world. The dataset you need depends entirely on the problem you want to solve, since each machine learning problem has its own unique approach.

Here I am sharing some websites where you can get datasets:

  1. Kaggle: Kaggle is my personal favorite place to get datasets.
    https://www.kaggle.com/datasets
  2. UCI Machine Learning Repository: one of the oldest sources on the web for datasets.
    http://mlr.cs.umass.edu/ml/
  3. This awesome GitHub repository has high-quality datasets.
    https://github.com/awesomedata/awesome-public-datasets
  4. And if you are looking for governments' open data, here are a few sources:
    Indian Government: http://data.gov.in
    US Government: https://www.data.gov/
    British Government: https://data.gov.uk/
    French Government: https://www.data.gouv.fr/en/

2. Import the dataset & Libraries

The first step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used.

Libraries can be imported into Python code with the help of the ‘import’ keyword.
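For a typical preprocessing workflow, the commonly used libraries are NumPy, pandas and Matplotlib; a minimal sketch (your project may need others):

```python
import numpy as np               # numerical operations on arrays
import pandas as pd              # loading and manipulating tabular data
import matplotlib.pyplot as plt  # plotting, useful for exploring the data
```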

Importing the dataset

We load the data with the pandas library using the read_csv() method. Here our data is in CSV format, but many other kinds of files can be read with pandas as well, as shown below.
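A minimal sketch, assuming the dataset is stored in a file called ‘Data.csv’ (the file name is just a placeholder):

```python
# Load the dataset into a pandas DataFrame
dataset = pd.read_csv("Data.csv")

# pandas offers readers for many other formats too, for example:
# pd.read_excel("Data.xlsx"), pd.read_json("Data.json"),
# pd.read_html("page.html"), pd.read_sql(query, connection)
```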

3. Dealing with Missing Values

Sometimes we may find that some data are missing in the dataset. If so, we can either remove those rows or calculate the mean, median or mode of the feature and use it to fill in the missing values. This is an approximation, so it can add variance to the dataset.

#Check for null values:

We can check the null values in our dataset with the pandas library as shown below.
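A quick way to get an overview (assuming the DataFrame is called dataset, as above):

```python
dataset.info()   # total entries, non-null counts and dtype of every column
```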

With the help of info() we can find the total number of entries as well as the count of non-null values and the datatype of every feature.

We can also use dataset.isna() to see the null values in our dataset.

But we usually work on large datasets, so it is more useful to get the count of null values for each feature, which can be done by chaining sum().
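For example:

```python
dataset.isna()        # boolean mask: True wherever a value is missing
dataset.isna().sum()  # number of missing values per column
```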

As we can see, ‘Age’ and ‘Salary’ contain null values.

#Drop Null values:

Pandas provides a dropna() function that can be used to drop either rows or columns with missing data. We can use dropna() to remove all the rows with missing data.
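A minimal sketch (assigning to a new variable here so the original dataset stays intact for the next step):

```python
# Drop every row that contains at least one missing value
dataset_dropped = dataset.dropna()
```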

Earlier we saw that the rows at index 4 and 6 had null values; after dropping, both rows with missing data have been removed. But this is not always a good idea. Sometimes we have a small dataset, as in our example, and removing a whole row means deleting valuable information from the dataset.

#Replacing Null values with Strategy:

To replace null values we use a strategy that can be applied to a feature with numeric data: we calculate the mean, median or mode of the feature and fill the missing values with it.
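A minimal sketch filling every numeric column with its own mean (numeric_only=True is an assumption here, used to skip the text column):

```python
# Replace missing values in every numeric column with that column's mean
dataset.fillna(dataset.mean(numeric_only=True), inplace=True)
```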

The above line of code affects the entire dataset and replaces every variable's null values with their respective mean; ‘inplace=True’ applies the changes directly to the dataset.
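If only one column should be filled, a sketch like the following can be used (shown here for the ‘Age’ column):

```python
# Replace missing values in the 'Age' column only, using its mean
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
```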

If we need to replace a particular variable using one of these strategies, we can use a line of code like the one above.

4. Divide the dataset into Dependent & Independent variable

After importing the dataset, the next step would be to identify the independent variable (X) and the dependent variable (Y).

Basically, a dataset might be labeled or unlabeled. Here I am considering a labeled dataset for a machine learning classification problem, and a small dataset for better understanding. Our dataset has four columns: Country, Age, Salary and Purchased. It is the dataset of a shopping complex that records whether each customer purchased a product or not.

In our dataset there are three independent variables (Country, Age and Salary) and one dependent variable (Purchased) that we have to predict.

To read the columns, we will use pandas iloc (which selects by index position) and which takes two parameters: [row selection, column selection].

Note:
‘:’ selects everything, and using [] lets you select multiple columns or rows; this is how we slice the dataset.

You can read more about the usage of iloc in the pandas documentation.
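A minimal sketch of the selection for our four-column dataset:

```python
# X: all rows, every column except the last (Country, Age, Salary)
X = dataset.iloc[:, :-1].values

# Y: all rows, only the last column (Purchased)
Y = dataset.iloc[:, -1].values
```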

This is how we were able to select the dependent variable (Y) and the independent variable (X).

5. Dealing with Categorical values

Now let’s see how to deal with categorical values.

In our dataset there is one categorical variable, ‘Country’. It is complicated for machines to understand and process text rather than numbers, since the models are based on mathematical equations and calculations. Therefore, we have to encode the categorical data.

The library that we are going to use for this task is scikit-learn's preprocessing module. It has a class called LabelEncoder which we will use for the encoding.

The next step is usually to create an object of that class. We will call our object lEncoder.

As you can see, the first column contains data in text form. We can observe that there are 3 categories: France, Spain and Germany. To convert this into numerical values, we can use the following code:
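A minimal sketch, assuming Country is column 0 of X:

```python
from sklearn.preprocessing import LabelEncoder

# Create the encoder object and encode the 'Country' column in place
lEncoder = LabelEncoder()
X[:, 0] = lEncoder.fit_transform(X[:, 0])
```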

If we look at our variable X now, we can see that all three text values have been converted into numeric values.

As you can see, the categorical values have been encoded. But there's a problem!

The problem is still the same. Machine learning models are based on equations, and it is good that we replaced the text with numbers. However, since 1 is greater than 0 and 2 is greater than 1, the equations in the model will assume that Spain has a higher value than Germany and France, and that Germany has a higher value than France. That is certainly not the case: these are simply three categories with no relational order between them. To prevent this, we are going to use dummy variables.

What are the Dummy variables?

A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.

That means instead of having one column, we are going to have three columns, one for each category, with values of 1 and 0.

Number of Columns = Types of Categories

In our case we have 3 types, so we are going to have 3 columns. To do this we will import yet another class called OneHotEncoder.

Note: OneHotEncoder (in older scikit-learn versions) requires that all values are integers, not strings as we have, which is why we first had to encode all the possible values as integers.

The next step is to create an object of that class and tell it which column to encode. In older scikit-learn versions this was done with a parameter called categorical_features, which takes the index of the column; in recent versions the column index is passed through a ColumnTransformer instead.

We then call fit_transform(), just as we did before for LabelEncoder.
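A minimal sketch, written for a recent scikit-learn version where ColumnTransformer replaces the old categorical_features parameter:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X = ct.fit_transform(X)
```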

Now our independent variable X is a NumPy array, which can be checked with type(X). Just to show what X looks like, we convert it into a pandas DataFrame with the code below.
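A sketch of the conversion (dtype=int is used only for a cleaner display of the dummy columns):

```python
# View X as a DataFrame; dtype=int casts the float values to integers for readability
pd.DataFrame(X, dtype=int)
```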

This means there are three dummy variables corresponding to the three categories. In the code above we converted all float values to integers by using dtype.

Now we will do the label encoding on the dependent variable (Y).
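A minimal sketch (the object name labelencoder_Y is just a convention):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the dependent variable, e.g. 'Yes'/'No' becomes 1/0
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
```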

Now both X and Y are in encoded form, and both can be fed to a machine learning model.

6. Split the dataset into training and test set

In machine learning we usually split the data into training and test sets before applying models.

Generally we split the dataset 70:30 or 80:20 (as per the requirement), meaning 70 percent of the data is taken for training and 30 percent for testing.

For this task, we will import train_test_split from scikit-learn's model_selection module.

Now, to build our training and test sets, we will create 4 sets: X_train (the training part of the features), X_test (the test part of the features), Y_train (the training part of the dependent variable, with the same indices as X_train) and Y_test (the test part of the dependent variable, with the same indices as X_test). We assign them from train_test_split, which takes as parameters the arrays (X and Y) and test_size (a common choice is to allocate 20% of the dataset to the test set, written as 0.2; 0.25 would mean 25%).
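A minimal sketch (random_state is an optional addition here, used only to make the split reproducible):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
```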

7. Feature Scaling

The final step of data preprocessing is to apply the very important feature scaling.

Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data preprocessing.

Why scaling? Most of the time, your dataset will contain features that vary greatly in magnitude, units and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

Feature scaling affects a machine learning model in many ways. There are situations where feature scaling is optional or not required, but there are also many algorithms where it is a must-have step, for instance regression, logistic regression, SVMs, k-means, k-nearest neighbors, PCA and neural networks.

There are several ways to scale a feature or column value, and which scaler performs best is completely scenario dependent. Let's explore them one by one.

Standardization :

This is one of the most used types of scaling in data preprocessing. It is known as the z-score. It redistributes the data so that the mean (μ) = 0 and the standard deviation (σ) = 1. The formula is given below.
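For a value x of a feature with mean μ and standard deviation σ, the z-score is:

x′ = (x − μ) / σ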

Normalization:

Normalization scales the features into the range 0.0 to 1.0, retaining their proportions relative to each other.

A related variant, mean normalization, rescales values to roughly the range [-1, 1] with mean 0.
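Assuming mean normalization is the variant meant here, one common formulation, for a feature with mean x̄, minimum x_min and maximum x_max, is:

x′ = (x − x̄) / (x_max − x_min)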

Min-Max Scalar Technique:

This technique is especially useful when you need to transform the feature magnitudes into the [0, 1] range; min-max feature scaling is one of the best options for that. The formula is given below.
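For a feature with minimum x_min and maximum x_max, the min-max formula is:

x′ = (x − x_min) / (x_max − x_min)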

In this approach, the data is scaled to a fixed range, usually 0 to 1.

We will apply the standardization formula and fit our data to that scale. To accomplish the job, we will import the StandardScaler class from the scikit-learn preprocessing module and, as usual, create an object of that class.

Now we will transform all the data (X_train and X_test) to the same standardized scale.
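A minimal sketch (the object name sc_X is just a convention; the scaler is fit on the training set only and then reused for the test set):

```python
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # fit on the training data, then scale it
X_test = sc_X.transform(X_test)        # scale the test data with the same parameters
```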

Let's check what X_train looks like now.

Do we need to apply feature scaling to the dependent variable (Y)?

Ans: The dependent variable here is categorical, taking only the values 0 and 1, and this is a classification problem, so in this case we do not scale that vector.
For a regression problem, however, we would scale the dependent variable as well.

These were the general steps for preprocessing the data; the exact steps you need will depend on the dataset you have.

Thank you for making it until here! I hope you enjoyed this! 😃

If you have any questions or suggestions, please let me know!

Thank You! 😃😃😃
