Comprehensive Guide to Machine Learning (Part 1 of 3)

Tapas Das · Published in Analytics Vidhya · 7 min read · Aug 23, 2020

Pic courtesy: https://expertsystem.com/machine-learning-definition/

In this comprehensive guide, I’ll explore the different gears and pinions that make a machine learning model tick. If you google the definition of “machine learning”, most sources will show the statement below:

“Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed”

I like to think of machine learning as raising a newborn baby. At first, it’s innocent of the ways of the world; it doesn’t know what to do and needs help at every step. But as it gradually acquires data and knowledge, it evolves and learns to make decisions of its own.

In order to give machines “the ability to learn”, we need to jump through several hoops, or, as I like to call them, the “Ten Commandments of Machine Learning”. These are listed below:

  • Acquiring data
  • Data Cleansing
  • Exploratory Data Analysis
  • Feature Engineering
  • Feature Selection
  • Train/Validation/Test split
  • Baseline model building
  • Hyper-parameters Tuning
  • Model validation
  • Making Predictions

Now let’s walk through these commandments one by one. I hope that by the end you’ll have acquired sufficient knowledge to build a machine learning model of your own.

1) Acquiring data

There are plenty of free, open-source datasets widely available on Kaggle. I’d strongly recommend exploring the link below to browse the wide variety of machine learning datasets:

On the other hand, if you’d like to create your own dataset for building a machine learning model, you can perform web scraping on multiple websites and data sources. That’s out of scope for this post; however, you can check out the link below to explore web scraping further.

For this post, I’ll be using the “Pet Adoption” dataset hosted on HackerEarth. You can check it out at the link below:

2) Data Cleansing

Let’s start by exploring the training dataset. The image below shows the top-5 records from the training data.

We can also get a verbose description of the dataset using Pandas’ built-in functionality, as shown below.
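As a minimal sketch (the file name “train.csv” is my assumption; substitute the path to the downloaded dataset), these two views come from the head() and info() calls:

```python
import pandas as pd

# Load the training data (file path is an assumption for illustration)
train = pd.read_csv('train.csv')

# Show the top-5 records
print(train.head())

# Verbose description: column names, non-null counts, and dtypes
train.info()
```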

From the image we can see that the “condition” field has only 17357 non-null entries, which means there are 1477 null records in this field. Most machine learning models can’t deal with null records, so we need to handle such entries manually before feeding the data into the model.

Let’s start with a deep-dive into the values in the “condition” column. The image below shows the Pandas command and its result.
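The command is presumably along these lines; passing dropna=False to value_counts keeps the null bucket visible alongside the real categories:

```python
# Count records per category in "condition", including nulls
print(train['condition'].value_counts(dropna=False))
```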

As we can see, there are 3 categories in the “condition” field. So for the null records, we can create a 4th category with the value “3.0” to keep them distinct from the other records. We can use Pandas’ “fillna” functionality to replace the null entries.
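A minimal sketch of that replacement (assuming the DataFrame is named train, as above):

```python
# Replace null entries with a new, distinct category (3.0)
train['condition'] = train['condition'].fillna(3.0)

# Verify that no nulls remain
assert train['condition'].isnull().sum() == 0
```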

You can check out the post below for a deeper understanding of data-cleansing techniques.

3) Exploratory Data Analysis

This is one of the most crucial steps in developing a machine learning model. In this step, we take a deep-dive into the training data and get acquainted with the different relations and dependencies within it.

The Matplotlib and Seaborn Python libraries are our best friends here for getting insight into the training data. I’ll show a few of the basic EDA steps in this post.

  • Continuous Variables Analysis

For continuous numerical fields, we can get the distribution of the data using the “distplot” functionality of the Seaborn library.
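A minimal sketch for the “length” and “height” fields (column names as referenced below; note that newer Seaborn releases deprecate distplot in favour of histplot/displot):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distributions of the two continuous fields side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.distplot(train['length'], ax=axes[0])
sns.distplot(train['height'], ax=axes[1])
plt.show()
```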

As we can see from the images above, the data in the “length” and “height” fields is more or less uniformly distributed.

  • Categorical Variables Analysis

Typically, categorical data attributes represent discrete values belonging to a specific finite set of categories or classes. These discrete values can be textual or numeric in nature; examples include movie, music, and video-game genres, country names, and food and cuisine types.

We can use “countplot” to understand the data distribution of categorical fields. This functionality is readily available in the Seaborn library.

In the images below, we can see the data distribution for the “pet_size” and “condition” fields.
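The corresponding calls would look roughly like this (column names as referenced above):

```python
# Category counts for the two categorical fields
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x='pet_size', data=train, ax=axes[0])
sns.countplot(x='condition', data=train, ax=axes[1])
plt.show()
```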

  • Outliers Analysis

An outlier is an observation that deviates significantly from the rest of the observations. Outliers can have many causes, such as:

  1. Measurement or input error
  2. Data corruption
  3. True outlier observation (e.g. Michael Jordan in basketball)

We can use “boxplot” to detect outliers in the data. You can refer to the link below to understand more about boxplots and their functionality.

As we can see from the image below, the dots towards the right of the plot indicate outliers in the data.
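A horizontal boxplot like that one takes a single Seaborn call (the “length” column here is my choice for illustration):

```python
# Points beyond the whiskers are flagged as potential outliers
sns.boxplot(x=train['length'])
plt.show()
```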

There are several methods to detect and treat outliers, which are explained in great detail in the post below. I’d recommend going through it to get a better grasp of outlier handling.

For our purposes, we can use the NumPy library’s cube-root function to reduce the number of outliers, as shown in the image below.
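A sketch of that transform, assuming it is applied to the “length” field (my assumption for illustration); np.cbrt takes the element-wise cube root, which compresses the long tail:

```python
import numpy as np

# Cube-root transform pulls extreme values closer to the bulk of the data
train['length'] = np.cbrt(train['length'])

# Re-plot to confirm that fewer points now fall beyond the whiskers
sns.boxplot(x=train['length'])
plt.show()
```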

  • Data Correlation Analysis

In layman’s terms, correlation is a measure of how strongly one variable depends on another.

Consider a hypothetical dataset containing information about IT professionals. We might expect a strong relationship between age and salary, since senior project managers will tend to be paid better than young pup engineers. On the other hand, there is probably a very weak, if any, relationship between shoe size and salary.

Correlations can be positive or negative. Our age and salary example is a case of positive correlation. Individuals with a higher age would also tend to have a higher salary. An example of negative correlation might be age compared to outstanding student loan debt: typically older people will have more of their student loans paid off.

We can use Seaborn’s “heatmap” functionality to display the correlation between the different input features in the training data.
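A minimal sketch: corr() computes pairwise correlations over the numeric columns (on newer pandas versions you may need to pass numeric_only=True), and annot=True prints the coefficient in each cell:

```python
# Pairwise correlations of the numeric features, rendered as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(train.corr(), annot=True, cmap='coolwarm')
plt.show()
```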

As we can see in the image, there is a high positive correlation between the “X1” and “X2” fields. There is also a high negative correlation between the “condition” and “breed category” fields.

I’d recommend going through the post below for a better understanding of correlation analysis and its impact on machine learning models.

Concluding Remarks

This concludes the first part of this comprehensive machine learning guide. In the next post, I’ll cover feature engineering tips and techniques, feature selection methods, and the train/validation/test split.

You can find the codebase for this post at the link below. I’d highly recommend getting your own dataset (either from Kaggle or via web scraping) and trying out the different data-cleansing and EDA methods detailed in this post.

Please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
