Deep Dive in Machine Learning with Python

Part — XV: Initial Data Analysis — II

Rajesh Sharma
Analytics Vidhya
4 min read · Feb 16, 2020


Welcome to another blog in the Deep Dive in Machine Learning with Python series. In the previous blog, we worked with the Autism Spectrum Disorder (Children) dataset to understand how to perform the first two stages of Initial Data Analysis (IDA) effectively.

In today’s blog, we will continue with the remaining stages of IDA and fill the missing values in ‘ETHNICITY’ and ‘WHOS_COMPLETING_TEST’. We will also build our first Machine Learning regression model to predict the missing values in ‘AGE’.


First-hand cleaned DataFrame achieved in the previous blog

Step-3: Fill Missing Values

‘ETHNICITY’ and ‘WHOS_COMPLETING_TEST’ contain missing values

Step-3.1: Fixing Missing values in ‘Ethnicity’

Countries where ETHNICITY is Missing

Ethnicity count distribution where Country is ‘Jordan’

Ethnicities in JORDAN

Ethnicity count distribution where Country is ‘Egypt’

Ethnicities in EGYPT

Ethnicity count distribution where Countries are ‘Qatar’, ‘Saudi Arabia’, ‘Russia’ and ‘Pakistan’

Ethnicities in ‘Qatar’, ‘Saudi Arabia’, ‘Russia’ and ‘Pakistan’

Ethnicity count distribution where Countries are ‘Syria’, ‘United Arab Emirates’, ‘Lebanon’ and ‘Libya’

Ethnicities in ‘Syria’, ‘United Arab Emirates’, ‘Lebanon’ and ‘Libya’

Ethnicity count distribution where Countries are ‘China’, ‘Kuwait’, ‘Iraq’, ‘Latvia’, ‘Austria’ and ‘Malaysia’

Ethnicities in ‘China’, ‘Kuwait’, ‘Iraq’, ‘Latvia’, ‘Austria’ and ‘Malaysia’

Here, we explored which ethnicities the participants from the above countries belong to. Based on this trend, we will replace each missing value with the dominant ethnicity of the participant’s country.
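The idea of filling with the dominant ethnicity can be sketched as below. This is only a minimal illustration on toy data: the column name ETHNICITY_NEW, the helper dominant() and the sample rows are my assumptions, and the original notebook may have filled the values country by country instead.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative and not taken from the real dataset.
df = pd.DataFrame({
    "COUNTRY": ["Jordan", "Jordan", "Jordan", "Russia", "Russia", "Russia"],
    "ETHNICITY": ["Middle Eastern", "Middle Eastern", np.nan,
                  "White-European", np.nan, "White-European"],
})

def dominant(values: pd.Series):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Most common ethnicity of each country, repeated on every row of that country.
dominant_by_country = df.groupby("COUNTRY")["ETHNICITY"].transform(dominant)

# Keep the original column and write the filled values to a new (hypothetical) column.
df["ETHNICITY_NEW"] = df["ETHNICITY"].fillna(dominant_by_country)
```

If a country has no known ethnicity at all, its rows stay missing with this approach and need a separate decision.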

New Ethnicity column with no missing values
Comparison: With and Without missing values

Step-3.2: Fixing Missing values in ‘Whos_completing_test’

Count distribution in ‘Whos_completing_test’

In this feature, there is a large gap between the count of ‘Parent’ and the counts of the other categories. So, we label ‘Parent’ as 1 and all other categories as 0.

‘Parent’ labeled as 1 and other categories as 0
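A minimal sketch of this labeling is shown below. The new column name and the sample categories are assumptions for illustration, and I am also assuming that missing entries fall into the 0 group.

```python
import numpy as np
import pandas as pd

# Toy data; the category values are illustrative.
df = pd.DataFrame({
    "WHOS_COMPLETING_TEST": ["Parent", "Self", np.nan, "Relative", "Parent"],
})

# The comparison is False for every non-'Parent' value, including missing ones,
# so those rows become 0 after the cast to int.
df["WHOS_COMPLETING_TEST_NEW"] = (df["WHOS_COMPLETING_TEST"] == "Parent").astype(int)
```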

Step-4: Drop not-required columns

In this step, we will remove the original columns that contain missing values, because we have already fixed them by creating new columns without any missing values.

Dropped ‘Ethnicity’ and ‘Whos_completing_test’
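Dropping the two original columns could look like the sketch below. The names of the replacement columns are assumptions carried over for illustration.

```python
import pandas as pd

# Toy frame holding both the original columns and their filled replacements.
df = pd.DataFrame({
    "ETHNICITY": ["Asian", None],
    "ETHNICITY_NEW": ["Asian", "Asian"],
    "WHOS_COMPLETING_TEST": ["Parent", None],
    "WHOS_COMPLETING_TEST_NEW": [1, 0],
})

# drop() returns a new DataFrame, so assign the result back.
df = df.drop(columns=["ETHNICITY", "WHOS_COMPLETING_TEST"])
```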

Step-5: Convert categorical data into numerical form

In this step, we will convert the qualitative data in ‘Ethnicity’, ‘Country’ and ‘Age_desc’ into numerical values.

Step-5.1: ETHNICITY column in numerical form

Here, you can also use LabelEncoder to perform the labeling. However, I defined the mapping in a dictionary to keep things easy to follow.
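To compare the two options, here is a small sketch on toy data. The dictionary values and column names are assumptions and not the mapping used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "ETHNICITY_NEW": ["Asian", "Middle Eastern", "White-European", "Asian"],
})

# Option 1: an explicit dictionary, where we choose every code ourselves.
ethnicity_map = {"Asian": 1, "Middle Eastern": 2, "White-European": 3}
df["ETHNICITY_CODE"] = df["ETHNICITY_NEW"].map(ethnicity_map)

# map() yields NaN for any category missing from the dictionary, so check for that.
assert df["ETHNICITY_CODE"].notna().all()

# Option 2: LabelEncoder, which assigns codes 0..n-1 in sorted order of the labels.
encoder = LabelEncoder()
df["ETHNICITY_LE"] = encoder.fit_transform(df["ETHNICITY_NEW"])
```

The dictionary makes the codes visible and stable across runs, while LabelEncoder saves typing when there are many categories.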

Step-5.2: COUNTRY column in numerical form

Count distribution in ‘COUNTRY’

In the ‘Country’ column, the category counts differ widely. Hence, we bucketed all the countries with a count of ≤ 7 into a single class and labeled it as 9.
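The bucketing can be sketched as follows. Only the threshold of 7 and the label 9 come from this blog. The codes given to the frequent countries, the column names and the toy counts are assumptions, and the sketch only works when fewer than 9 countries are frequent.

```python
import pandas as pd

# Toy data: three frequent countries and two rare ones.
countries = (["United Kingdom"] * 10 + ["India"] * 9 + ["Jordan"] * 8
             + ["Latvia"] * 2 + ["Austria"])
df = pd.DataFrame({"COUNTRY": countries})

counts = df["COUNTRY"].value_counts()   # sorted by descending count
frequent = counts[counts > 7].index     # countries that keep their own class

# Assumed scheme: frequent countries get codes 1, 2, 3, ... by descending count.
country_map = {name: code for code, name in enumerate(frequent, start=1)}

# Countries absent from the map (count <= 7) become NaN and are bucketed as 9.
df["COUNTRY_CODE"] = df["COUNTRY"].map(country_map).fillna(9).astype(int)
```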

Step-5.3: AGE_DESC column in numerical form

All the rows in AGE_DESC are marked as 0.

Step-6: ML Model to predict the missing values in the AGE variable

Records where we marked missing values in ‘AGE’ as 0

Here, we divided the DataFrame into two sections:

  1. Test Records where AGE is 0
  2. Train Records where AGE is not 0

First, we checked the dimensions of the Parent, Train and Test records DataFrames. Then, we instantiated the RandomForestRegressor and trained it on a subset (created using train_test_split) of the Train Records DataFrame.

Here, the model predicted the age for the subset records. Then, we ran it on the Test DataFrame, which contains the records where AGE is 0.

This gave us the values that can replace the missing data in ‘AGE’.
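The whole of Step-6 can be sketched as below. This is a minimal illustration on random toy data: the feature names, the hyperparameters, the 80/20 split and the rounding of predictions are all my assumptions and may differ from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Random toy data; feature names are hypothetical stand-ins for the real columns.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "A1_SCORE": rng.integers(0, 2, n),
    "A2_SCORE": rng.integers(0, 2, n),
    "ETHNICITY_CODE": rng.integers(1, 4, n),
    "AGE": rng.integers(4, 12, n),        # ages 4 to 11
})
df.loc[df.index[:5], "AGE"] = 0           # missing ages were marked as 0

features = [col for col in df.columns if col != "AGE"]
train_records = df[df["AGE"] != 0]        # AGE is known
test_records = df[df["AGE"] == 0]         # AGE has to be predicted
print(df.shape, train_records.shape, test_records.shape)

# Hold out part of the known records to see how the model behaves.
X_train, X_val, y_train, y_val = train_test_split(
    train_records[features], train_records["AGE"],
    test_size=0.2, random_state=42,
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)                        # predictions on the subset
predicted_age = model.predict(test_records[features])  # predictions where AGE is 0

# Replace the 0 placeholders with the rounded predictions.
df.loc[test_records.index, "AGE"] = np.round(predicted_age).astype(int)
```

Comparing val_pred with y_val (for example with mean absolute error) before trusting predicted_age is a sensible extra check.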

Congratulations, we have come to the end of this blog. To summarize, we covered the remaining stages of Initial Data Analysis (IDA).

Follow to get notified about the upcoming posts, where we will build an understanding of Exploratory Data Analysis using an example dataset.

If you want to download the Jupyter Notebook of this blog, you can find it in the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Autism_Spectrum_Disorder

Thank you and happy learning!!


It can be messy, it can be unstructured but it always speaks, we only need to understand its language!!