DATA CLEANING & DEALING WITH OUTLIERS USING DATA IMPUTATION TECHNIQUES

DATA CLEANING:

Real-world data is collected from multiple sources, so there is a high chance of it containing corrupt data. A data set might have missing values and might need some transformations before it becomes useful. Cleaning this data can be as simple as filling those voids using a technique called data imputation.
Imputation is one of the crucial parts of data cleaning, because we will later extract information from this data in order to build an efficient model. This stage plays a significant role in the accuracy of the final model.
Whenever we find values missing, it is the responsibility of the data scientist to handle the missing data effectively in the data cleaning step. Common imputation techniques use one of the three strategies below:
I, Mean
II, Median
III, Mode

Mean is the average of all the values and median is the middle value of the sorted data, while mode is the value which is repeated the most number of times. A small sketch follows below.
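To make these concrete, here is a tiny sketch with a made-up series (the numbers are purely illustrative):

import pandas as pd

# Toy series, purely for illustration
s = pd.Series([23, 23, 31, 32, 40])

print(s.mean())     # 29.8 -> sum of all the values divided by their count
print(s.median())   # 31   -> the middle value once the data is sorted
print(s.mode()[0])  # 23   -> the most frequently repeated value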

Consider the snippet below:

This is the basic code which we follow to read and import a CSV file. We have learnt this step in detail in our first blog, whose link is given at the end of this post. The data set used here: https://docs.google.com/spreadsheets/d/1fDnpbPMIDFaojSYWSCu7QtTz1Lch6I6zkhjd2O8Q11A/edit?usp=sharing
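A minimal sketch (the file name airquality.csv is an assumption; export the linked sheet as a CSV first):

import pandas as pd

# Hypothetical file name -- export the linked sheet as airquality.csv
data1 = pd.read_csv("airquality.csv")
print(data1.head())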

→ What do these NaN values indicate?
They are the missing values in the data set.
We can handle them in two ways:
1, By eliminating the rows with missing values. (Generally not recommended, as it shrinks the data set and leaves less data to train on; see the sketch after this list.)
2, By using an Imputer to find the best possible substitute to replace missing values
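A quick sketch of option 1, assuming data1 from the earlier snippet:

# Drop every row that contains at least one NaN (shrinks the data set)
data1_dropped = data1.dropna()
print(data1_dropped.shape)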

Count of the total number of NaN values

→ In order to know the total number of NaN values, we use a lambda function.
Syntax: data1.apply(lambda x: sum(x.isnull()), axis=0)
isnull() checks whether a value is null or not.

axis=0 applies the function to all the columns & axis=1 applies it to all the rows, as shown in the figure.

Any other value given to axis results in an error, as shown.
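Putting this together, a small sketch of the column-wise NaN count (assuming data1 from above):

# Count NaN values in every column (axis=0 applies the lambda column-wise)
print(data1.apply(lambda x: sum(x.isnull()), axis=0))

# The same count, written in more idiomatic pandas
print(data1.isnull().sum())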

→ Let us now start to fill these values using the Imputer function present in the sklearn.preprocessing module, as shown in fig. 1. By writing that line, the missing values are replaced by the mean of all the other values present in the data set. Now, we need to convert our data set (i.e. data1) into an array in order to perform operations on it.

fig. 1
Syntax: newdataframe = olddataframe.iloc[:,:].values
Syntax: imputer_variable.fit(data2[:, [0, 1]])
data2[:, [0, 1]] = imputer_variable.transform(data2[:, [0, 1]])

Displaying data2
→ [:, :] indicates all the rows and all the columns respectively.
→ As shown in the fig., the data is converted into arrays in order to perform operations on it, since arrays are part of the NumPy library.

Here, we are fitting all the rows of the 0th & 1st index columns into the imputer, so that the NaN values get replaced according to the strategy used.
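A minimal end-to-end sketch of the mean strategy (note: in sklearn ≥ 0.22 the old Imputer class was replaced by SimpleImputer in sklearn.impute, which is what is used here; all the columns are assumed to be numeric, as in the airquality data set):

import numpy as np
from sklearn.impute import SimpleImputer  # formerly Imputer in sklearn.preprocessing

# Convert the DataFrame into a NumPy array so we can operate on it
data2 = data1.iloc[:, :].values

# strategy="mean" replaces every NaN in a column with that column's mean
imputer_variable = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer_variable.fit(data2[:, [0, 1]])
data2[:, [0, 1]] = imputer_variable.transform(data2[:, [0, 1]])
print(data2)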

fig. mean_imputation

→ Now, as we see in the fig. ‘mean_imputation’, we find that the voids (NaN) are successfully filled by the imputer.

→ Placing mean values in place of the missing values is just one assumption. We need to test other strategies as well to choose the best one.

Now, let us go for the median strategy as shown below (the syntax of the imputer remains the same):

Median strategy
most_frequent / mode strategy
The value of the median is 31.5 and the value of most_frequent is 23
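The same workflow applies; only the strategy argument changes. A sketch, reusing data1 from above:

# Median strategy
imputer_median = SimpleImputer(missing_values=np.nan, strategy="median")
data3 = data1.iloc[:, :].values
data3[:, [0, 1]] = imputer_median.fit_transform(data3[:, [0, 1]])

# Most-frequent (mode) strategy
imputer_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data4 = data1.iloc[:, :].values
data4[:, [0, 1]] = imputer_mode.fit_transform(data4[:, [0, 1]])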

Now, after all this, we need to choose which strategy suits our algorithm. We can find this out using the plots shown below:

→ This plotting is before removing the outliers.
→ Outliers are values which exceed the expected range; they are also referred to as out-of-bound data (as we have seen in our previous blog).
→ Here we find that our output Ozone is mostly affected by Solar.
→ We plot the graphs of all three imputation techniques, with outliers, for future comparison. The code plots Ozone on the y-axis & Solar on the x-axis.
→ The horizontal line crossing the blue dots (data points) is drawn with axhline. The ‘r’ indicates that the line should be red in colour, and the ‘-’ gives it a solid (linear) line style.
→ Finally, we display the graph. The same procedure is followed for all the imputers, as shown below for median and mode.
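A minimal plotting sketch (assumptions: column 0 of the imputed array is Ozone and column 1 is Solar.R, matching the airquality data set):

import matplotlib.pyplot as plt

# Ozone on the y-axis, Solar on the x-axis
plt.scatter(data2[:, 1], data2[:, 0])
# Horizontal red ('r'), solid ('-') line at the mean of Ozone
plt.axhline(y=data2[:, 0].mean(), color="r", linestyle="-")
plt.xlabel("Solar.R")
plt.ylabel("Ozone")
plt.show()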

Graphs of median & mode are shown respectively

Note: The more points the red line crosses, the better its substitution is.
The next step is to check for the outliers & repeat the same process after eliminating the outliers (if outliers are found).
Outliers are denoted by tiny circles outside the range in the graph.
Checking for outliers is carried out using boxplot(), as shown below:
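A quick sketch of the check, assuming data1 from above:

# Box plot of every column; outliers appear as small circles beyond the whiskers
data1.boxplot()
plt.show()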

From the above figure we get to know that outliers are present in Ozone, and we need to eliminate them.

Normalization:
→ Normalization is a technique often applied as a part of data preparation.
→ The goal is to scale the numeric values in the columns.
→ Not every data set in ML requires normalization; only data sets whose features have different ranges require it.
→ When we normalize a data set, we scale down all the feature values to the range 0–1.
Eg: 1. Consider the given air-quality data set: we notice the attribute ranges vary from single digits to triple digits, so we use normalization to keep all the values in one range.
2. Another real-world scenario is the age & salary of a person. Both attributes’ ranges vary widely (i.e. age lies between 1–100, while salary might range between $1–$90,000 or even more).

To perform the operation we take the Ozone values into y.
Steps to perform normalization:
i) Reshape y into a 2-D array using reshape(-1, 1)
ii) from sklearn import preprocessing
iii) In the next step we assign the variable y with the transformed values
iv) Display the result. In the output we can observe that our values lie between (0, 1) & the output is in the form of an array.

To perform the operations we need to convert the values into arrays; after scaling, all the values will lie in the range 0 to 1, as sketched below.
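A minimal sketch of these steps (MinMaxScaler is an assumption here, chosen because the text scales values to 0–1; column 0 of the imputed array is assumed to be Ozone):

from sklearn import preprocessing

# Take the imputed Ozone column and reshape it into a 2-D (n, 1) array
y = data2[:, 0].reshape(-1, 1)

# MinMaxScaler rescales every value into the range 0-1 by default
min_max_scaler = preprocessing.MinMaxScaler()
y = min_max_scaler.fit_transform(y)
print(y)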
Mean and Median values

→ In most cases we use the mean as the best value because it is the average of all the values. But you can try the other strategies as well to allot the best values to the missing data for an efficient model. Check the red line to see which strategy has the maximum points coinciding with the line, and choose that strategy.

😃 Well, then that’s it for this blog.
Hope you liked it 👍
Any queries? Please feel free to ask Renish Sundrani, Kiran Lakhani.
You can also connect with us via LinkedIn- https://www.linkedin.com/in/renish-sundrani-6a748317a & https://www.linkedin.com/in/kiran-lakhani-20

→ The link to our previous blog is: https://medium.com/@reni.sundrani/installation-of-jupyter-notebook-getting-started-with-machine-learning-7443a5e09309?sk=10132b40b011596877d43818238db352
