How to Perform Exploratory Data Analysis (EDA) and Clean Your Data for Model Training

Ashish Kumar
6 min read · Feb 20, 2022

EDA, or Exploratory Data Analysis, is the first step in almost any machine learning problem statement, whether your data arrives in a text format like .csv or .xlsx, or you're fetching it from SQL or NoSQL databases. Before training a model in any machine learning project, we have to do an exploratory data analysis to understand the insights hidden in the data.

To make this concrete, we'll take on a problem statement, and I'll show you everything you should be doing to perform EDA and clean your data before training machine learning models on it.

Problem Statement: To build a classification methodology to predict the type of Thyroid a person has based on the given features.

Let's talk: if I told you that all your EDA work could be done in 2–3 lines of code, would you believe me? Let's do it using a powerful library named Pandas Profiling.

So, how do you use Pandas Profiling?

The first step is to install it with this command:

pip install pandas-profiling

Then we generate the report using these commands:

from pandas_profiling import ProfileReport # import after installing pandas-profiling
prof = ProfileReport(df)
prof.to_file(output_file='output.html') # the report is written to output.html
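
If you're working in a Jupyter notebook, recent versions of pandas-profiling can also render the report inline instead of writing it to a file:

prof.to_notebook_iframe() # display the report inside the notebook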

With this, your EDA work is almost DONE!! You just need to read the report and gather insights as per your needs. The generated report contains all the information you need to proceed with feature engineering, scaling, and data cleaning.

What to do now?

After your profiling report is generated and you have gathered the basic insights from it, move ahead and explore the data.

Perform checks for the below things in your data:

1. Are there any null or missing values? How should we handle them?

In the thyroid detection problem statement, the data didn't contain any NULL values; instead, the missing values had been replaced with '?', so always inspect your data carefully before moving ahead. There are many techniques for replacing missing values in columns, which we'll talk about further below. In this scenario, we can first convert the '?' markers to np.nan, as shown below:

import numpy as np

for column in data.columns:
    count = data[column][data[column] == '?'].count()
    if count != 0:
        data[column] = data[column].replace('?', np.nan) # mark '?' entries as missing

Great!! Now all the values that were '?' appear as NaN values.
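
As a quick sanity check (a minimal sketch, assuming the same data DataFrame as above), count the NaN values per column to confirm the replacement worked:

data.isna().sum().sort_values(ascending=False) # columns with the most missing values first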

Since the values are categorical in the Thyroid dataset, we’ll change them to numerical before using any imputation techniques.

What should your ideal approach be when you see categorical and numerical data in your dataset?

Use mapping for columns with two distinct values, and get dummies for columns with more than two values. Why so?

Because when a column has only two categories, the two columns formed by getting dummies are perfectly correlated with each other (one is just the complement of the other), since they both explain the same thing. So we'd have to drop one of the columns anyway. That's why we use mapping for such columns.
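
Here is a minimal sketch of both approaches. The column names ('sex' as a two-valued column and 'referral_source' as a multi-valued one) match the thyroid dataset, but treat them as placeholders for your own columns:

import pandas as pd

# two distinct values: map them to 0/1 directly
data['sex'] = data['sex'].map({'F': 0, 'M': 1})

# more than two values: one-hot encode with get_dummies
data = pd.get_dummies(data, columns=['referral_source'])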

Now, looking at the output class in the thyroid data, you'll realize this is a multi-class classification problem. For handling your output class you can use LabelEncoder, as shown below:

from sklearn.preprocessing import LabelEncoder

lblEn = LabelEncoder()
data['Class'] = lblEn.fit_transform(data['Class']) # encode each class label as an integer

This completes the encoding of all the categorical values.
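
If you want to see which integer was assigned to each class (handy later, when decoding the model's predictions), you can inspect the encoder's learned classes; this is standard scikit-learn behaviour:

print(dict(zip(lblEn.classes_, range(len(lblEn.classes_))))) # original label -> encoded integer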

Now we can focus on the missing values. Imputing them can be done with an imputer; here we'll use KNNImputer.

What the KNN imputer does is look at the three nearest neighbouring rows (n_neighbors=3) and fill each missing value from their values for that feature; with uniform weights, it takes their average. You can also go with a basic approach and impute the missing values with the mean, median, or mode.

import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3, weights='uniform', missing_values=np.nan)
new_array = imputer.fit_transform(data) # impute the missing values
# convert the nd-array returned in the step above to a DataFrame;
# rounding keeps the encoded categorical columns as whole numbers
new_data = pd.DataFrame(data=np.round(new_array), columns=data.columns)

Now there should be no missing values in the new dataset. You can verify this with the command below:

new_data.isna().sum()

This is how you should deal with missing or wrongly encoded values in your data before training on it.

2. Checking the distribution of your dataset

While checking the distribution of your data, you should always consider the skewness of each column. If a column is skewed, i.e. your data leans towards the left or right, you should try to transform it.
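
A quick way to quantify this (a small sketch, assuming the new_data DataFrame and this dataset's numeric columns) is pandas' built-in skew():

numeric_cols = ['age','TSH','T3','TT4','T4U','FTI']
print(new_data[numeric_cols].skew()) # ~0 means symmetric; large positive/negative values mean skewed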

One caveat before doing a log transformation: add 1 to each value in the column so you never try to take the log of 0. You can do it simply as shown below:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plot

columns = ['age','TSH','T3','TT4','T4U','FTI'] # select the numeric columns
plot.figure(figsize=(10,15), facecolor='white')
plotnumber = 1
for column in columns:
    new_data[column] += 1 # adding one to prevent log(0)
    ax = plot.subplot(3,2,plotnumber)
    sns.distplot(np.log(new_data[column])) # distribution after the log transform
    plot.xlabel(column, fontsize=10)
    plotnumber += 1
plot.show()

After the log transformation, the rest of the columns in the dataset look fine, but 'TSH' still shows a weird trend. Since it's just one column, we can drop it before moving ahead, because it won't be providing much information.
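
Dropping it is a one-liner (assuming the new_data DataFrame from above):

new_data = new_data.drop(['TSH'], axis=1) # drop the column with the odd distribution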

3. Check for balanced and imbalanced data in your dataset

If you have imbalanced data, try to balance it before going further; if your dataset is already balanced, you're good to go. In the thyroid detection problem statement we have an imbalanced dataset, so we balance it.

So how to make an imbalanced dataset balanced?

There are many techniques for handling imbalanced datasets; here we'll use the one named random oversampling.

from imblearn.over_sampling import RandomOverSampler

x = new_data.drop(['Class'], axis=1)
y = new_data['Class']
rdsmple = RandomOverSampler()
x_sampled, y_sampled = rdsmple.fit_resample(x, y) # duplicate minority-class rows until all classes are equal

[Plots: output class distribution before and after oversampling]
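
You can also confirm the effect without plots by comparing class counts before and after resampling (assuming a recent version of imbalanced-learn, which returns pandas objects when given pandas input):

print(y.value_counts()) # skewed counts before oversampling
print(y_sampled.value_counts()) # equal counts after oversampling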

4. Checking the VIF score for multicollinearity

In the next step, you should check whether any features exhibit multicollinearity and, if they do, deal with it. You can do this simply with the function below; here I am using StandardScaler for scaling purposes and then checking the VIF score for each column. Columns with a VIF score of more than 10 get dropped.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_score(x):
    scaler = StandardScaler() # scale the features (output class excluded)
    arr = scaler.fit_transform(x)
    return pd.DataFrame([[x.columns[i], variance_inflation_factor(arr, i)] for i in range(arr.shape[1])], columns=["FEATURE", "VIF_SCORE"])

vif_score(x_sampled)
[Table: VIF scores for the features in the thyroid dataset]

We can see that the data has multicollinearity: the features showing it are referral_source_STMW, referral_source_SVHC, referral_source_SVHD, referral_source_SVI, referral_source_other, TT4, and FTI. So we drop referral_source_STMW, referral_source_SVHC, referral_source_SVHD, referral_source_SVI, TT4, and FTI to prevent multicollinearity.
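
The drop itself is straightforward (assuming the x_sampled DataFrame from the oversampling step):

x_sampled = x_sampled.drop(['referral_source_STMW','referral_source_SVHC',
                            'referral_source_SVHD','referral_source_SVI',
                            'TT4','FTI'], axis=1) # drop the high-VIF features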

Once you've performed all the above checks, you're good to go ahead and train your model using any type of classification algorithm.

To access the full notebook, Click here

To see the complete end-to-end project, Click Here

Find me on GitHub, ResearchGate, Tableau, and LinkedIn to see more exciting projects and how to deal with new data.
