
5 Data pre-processing techniques essential for your ML model

Shreyansh Jain · Published in Analytics Vidhya · 6 min read · Apr 28, 2020


If you have been involved in data science projects, you may have realized that the first and most important step in data mining is data pre-processing. In real-life problems, the raw data we are given is quite untidy, and machine learning models cannot recognize patterns or extract information from it. So let us look one by one at various approaches to neaten your data:

1. Handling null values: Null values are values that are missing from your data in some row or column. A null value may be present because it was never recorded or because the data became corrupted. In Python they are marked as NaN. You can check for them by running the following code:
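For instance, a minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy DataFrame with a few missing entries
df = pd.DataFrame({
    "age": [25, None, 31, 47],
    "city": ["Delhi", "Mumbai", None, "Pune"],
})

print(df.isnull().sum())   # number of NaNs in each column
print(df.isnull().any())   # True for every column containing at least one NaN
```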

We can fill these null values with the mean of that column or with the most frequently occurring item in that column, or we can replace NaN with some sentinel value like -999. The fillna() function from the pandas library fills NaNs with the desired value. But if a column has an enormous number of null values, say more than 50%, then it is better to drop that column from your dataframe. You can also fill a null value using the corresponding values from its k nearest neighbours, that is, rows that are similar on the other features; sklearn's KNNImputer() can help you with this task.
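A short sketch of these options, on a hypothetical two-column DataFrame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric DataFrame with missing entries
df = pd.DataFrame({"age": [25, np.nan, 31, 47],
                   "salary": [50000, 62000, np.nan, 58000]})

df_mean = df.fillna(df.mean())          # fill with each column's mean
df_mode = df.fillna(df.mode().iloc[0])  # fill with each column's most frequent value
df_flag = df.fillna(-999)               # fill with a sentinel value like -999

# Drop columns that are more than 50% null
df_kept = df.loc[:, df.isnull().mean() <= 0.5]

# Impute each NaN from its k nearest neighbouring rows (here k=2)
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```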

2. Handling outliers: Outliers are data points that lie far away from the rest of the values in the data; they appear separated from the crowd. We can detect outliers using visualization tools such as boxplots:

Boxplot

and by plotting scatter plots of one feature against another:

Outliers in scatterplot

You can drop outliers if you are aware of the scientific facts behind the data, such as the range in which the data points must lie. For example, if people's age is a feature in your data, then you know it must lie between 0–100, or in some cases 0–130 years. But if a value of age in the data is absurd, say 300, then it must be removed. Outliers do not always point to errors, though; they can sometimes point to some meaningful phenomenon. If the predictions of your model are critical, i.e. small changes matter a lot, then you should not drop them. Also, if outliers are present in large quantity, say 25% or more, then it is highly probable that they represent something useful. In that case you must examine those outliers carefully.
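As an illustration, here is one common way to flag outliers with the same 1.5 × IQR rule a boxplot uses for its whiskers (the age values are hypothetical):

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 300])  # 300 is an absurd age

# Boxplot-style rule: flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

print(ages[is_outlier])       # the flagged outliers
cleaned = ages[~is_outlier]   # drop them only when domain knowledge justifies it
```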

3. Normalizing or scaling data: If you are using distance-based machine learning algorithms such as k-nearest neighbours and k-means clustering, or models such as linear regression and neural networks, then it is good practice to normalize your data before feeding it to the model. Normalization means modifying the values of numerical features to bring them to a common scale without altering the correlation between them. Values in different numerical features lie in different ranges, which may degrade your model's performance; normalization ensures that weights are assigned to features properly while making predictions. Some popular normalization techniques are:

a) Min-max normalization: It scales a feature to a given range between minimum and maximum values. It is formulated as:

X_scaled = a + (b − a)(X − X_min) / (X_max − X_min)

where a and b are the minimum and maximum of the target range; with the common [0, 1] range this reduces to (X − X_min) / (X_max − X_min).

b) Z-score normalization: We subtract the mean from each feature and then divide by its standard deviation, so that the resulting scaled feature has zero mean and unit variance. It is formulated as:

X_scaled = (X − mean(X)) / σ

Note that this does not reshape your data into a normal distribution; it only rescales the feature to zero mean and unit variance, leaving the shape of the distribution unchanged.
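Both techniques are available in scikit-learn; here is a minimal sketch on a made-up two-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max normalization; feature_range=(a, b) defaults to (0, 1)
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score normalization: zero mean, unit variance per column
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))  # approximately 0 and 1
```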

4. Encoding categorical features: Categorical features are features that contain discrete data values. If a categorical feature has characters, words, symbols, or dates as its data values, then these have to be encoded as numbers for machine learning models to understand them, since models only process numeric data. There are three approaches to encode your data:

a) Label encoding: In this type of encoding, each discrete value in the categorical feature is assigned a unique integer based on alphabetical ordering. In the example below you can see that each fruit is assigned a corresponding integer label:

Label encoding fruit names array

Because label encoding imposes an arbitrary ordering on the categories, it generally works better with tree-based models, which can split on the encoded values, than with linear models such as linear regression and logistic regression or with neural networks, which may read a spurious order into the labels.
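A minimal sketch with scikit-learn's LabelEncoder on a hypothetical fruit array (sklearn intends LabelEncoder for target labels; OrdinalEncoder plays the same role for input features):

```python
from sklearn.preprocessing import LabelEncoder

fruits = ["Apple", "Orange", "Banana", "Apple", "Banana"]  # hypothetical array

le = LabelEncoder()
encoded = le.fit_transform(fruits)

print(list(le.classes_))  # ['Apple', 'Banana', 'Orange'], i.e. alphabetical order
print(encoded)            # [0 2 1 0 1]
```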

b) One-hot encoding: In this type of encoding, each discrete value in the categorical feature is assigned a unique one-hot (binary) vector consisting of 1s and 0s. Only the index of the discrete value is marked 1 in the one-hot vector, and all other positions are 0. In the example below you can see that every fruit is assigned a corresponding one-hot vector of length 5:

One-hot encoding fruit names array

One-hot encoding generally works well with linear models and neural networks, since it does not impose any spurious ordering on the categories; tree-based models such as random forests and gradient-boosted machines often do just as well with plain label encoding.
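A minimal sketch with pandas' get_dummies on the same hypothetical fruit array (scikit-learn's OneHotEncoder does the same job inside pipelines):

```python
import pandas as pd

fruits = pd.Series(["Apple", "Orange", "Banana", "Apple", "Banana"])

# Each distinct category becomes its own 0/1 column
one_hot = pd.get_dummies(fruits, dtype=int)
print(one_hot)
```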

c) Mean encoding: In this type of encoding, every discrete value in your categorical feature is encoded with the mean of the corresponding target labels. To understand this better, let's look at the example below:

Data frame consisting of mean encoded feature

We have three fruit labels [‘Apple’, ‘Banana’, ‘Orange’]. With a binary target, the mean encoding of each fruit label is formulated as:

Encoded value = (number of positive targets for that label) / (total number of targets for that label)

For Apple the positive targets are 3 and the total targets are 4, hence the mean encoding for Apple is 3/4 = 0.75. Similarly, for Orange the mean encoding is 1/2 = 0.5, and for Banana it is 3/3 = 1. Mean encoding is an extended version of label encoding, and is more logical in comparison since it takes the target label into consideration.
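Here is a sketch of mean encoding with pandas, using made-up rows that reproduce the counts above:

```python
import pandas as pd

# Made-up rows matching the fruit example (binary target, 1 = positive)
df = pd.DataFrame({
    "fruit":  ["Apple"] * 4 + ["Orange"] * 2 + ["Banana"] * 3,
    "target": [1, 1, 1, 0,    1, 0,    1, 1, 1],
})

# The mean of the target within each category is that category's encoding
means = df.groupby("fruit")["target"].mean()
df["fruit_encoded"] = df["fruit"].map(means)

print(means)  # Apple 0.75, Banana 1.00, Orange 0.50
```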

5. Discretization: This is another useful pre-processing technique, one that can sometimes improve a model's performance by reducing the size of the data. It is mainly used for numerical features. In discretization, a numerical feature is divided into bins/intervals, where each bin contains the numeric values falling within a certain range. The number of values in each bin can be the same or different. Each bin is then treated as a categorical value; thus discretization converts a numerical feature into a categorical one.
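A minimal sketch using pandas' cut and qcut on a hypothetical age series (scikit-learn's KBinsDiscretizer implements the same idea for pipelines):

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 51, 68, 80])  # hypothetical ages

# Equal-width bins: every bin spans the same value range
equal_width = pd.cut(ages, bins=4)

# Equal-frequency bins: every bin holds roughly the same number of values
equal_freq = pd.qcut(ages, q=4)

# Bins can also be labelled, turning the numeric feature into a categorical one
labelled = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                  labels=["child", "young adult", "adult", "senior"])
print(labelled)
```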

So, these were different approaches you can use to pre-process your data while implementing an ML model. I hope you find this article useful.
