Comprehensive Guide to Machine Learning (Part 1 of 3)

Tapas Das · Published in Analytics Vidhya · 7 min read · Aug 23, 2020

Pic courtesy: https://expertsystem.com/machine-learning-definition/

In this comprehensive guide, I’ll explore the different gears and pinions that make a machine learning model tick. If you google the definition of “machine learning”, most sources will show the statement below:

“Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed”

I like to think of machine learning as raising a newborn baby. At first, it’s innocent of the ways of the world; it doesn’t know what to do and needs help at every step. But as it gradually acquires data and knowledge, it evolves and learns to make decisions of its own.

In order to give machines “the ability to learn”, we need to jump through several hoops, or, as I like to call them, the “Ten Commandments of Machine Learning”. These are listed below:

  • Acquiring data
  • Data Cleansing
  • Exploratory Data Analysis
  • Feature Engineering
  • Feature Selection
  • Train/Validation/Test split
  • Baseline model building
  • Hyper-parameters Tuning
  • Model validation
  • Making Predictions

Now let’s walk through these commandments one by one. I hope that by the end you’ll have acquired sufficient knowledge to build a machine learning model of your own.

1) Acquiring data

There are plenty of free, open-source datasets widely available on Kaggle. I’d strongly recommend exploring the link below to browse the wide variety of machine learning datasets:

On the other hand, if you’d like to create your own dataset for building a machine learning model, you can perform web scraping on multiple websites and data sources. That’s out of scope for this post; however, you can check out the link below to explore web scraping further.

For this post, I’ll be using the “Pet Adoption” dataset hosted on HackerEarth. You can check it out at the link below:

2) Data Cleansing

Let’s start by exploring the training dataset. The image below shows the top-5 records from the training data.

We can also get a verbose description of the dataset using Pandas’ built-in functionality, as shown below.
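As a minimal sketch (the file name “train.csv” is my assumption; substitute the path to the downloaded dataset), these two views come from the head() and info() calls:

```python
import pandas as pd

# Load the training data (file path is an assumption for illustration)
train = pd.read_csv('train.csv')

# Show the top-5 records
print(train.head())

# Verbose description: column names, non-null counts, and dtypes
train.info()
```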

From the image we can see that the “condition” field has only 17357 non-null entries, which means there are 1477 null records in this field. Most machine learning models can’t deal with null records, so we need to handle such entries manually before feeding the data into the model.

Let’s start with a deep-dive into the values in the “condition” column. The image below shows the Pandas command and its result.
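The command is presumably along these lines; passing dropna=False to value_counts keeps the null bucket visible alongside the real categories:

```python
# Count records per category in "condition", including nulls
print(train['condition'].value_counts(dropna=False))
```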

As we can see, there are 3 categories in the “condition” field. So for the null records, we can create a 4th category with the value “3.0” to keep them distinct from the other records. We can use Pandas’ “fillna” functionality to replace the null entries.
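A minimal sketch of that replacement (assuming the DataFrame is named train, as above):

```python
# Replace null entries with a new, distinct category (3.0)
train['condition'] = train['condition'].fillna(3.0)

# Verify that no nulls remain
assert train['condition'].isnull().sum() == 0
```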

You can check out the post below for a deeper understanding of data-cleansing techniques.

3) Exploratory Data Analysis

This is one of the most crucial steps in developing a machine learning model. In this step, we take a deep-dive into the training data and get acquainted with the different relations and dependencies within it.

The Matplotlib and Seaborn Python libraries are our best friends here for getting insight into the training data. I’ll show a few of the basic EDA steps in this post.

  • Continuous Variables Analysis

For continuous numerical fields, we can get the distribution of the data using the “distplot” functionality of the Seaborn library.
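A minimal sketch for the “length” and “height” fields (column names as referenced below; note that newer Seaborn releases deprecate distplot in favour of histplot/displot):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distributions of the two continuous fields side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.distplot(train['length'], ax=axes[0])
sns.distplot(train['height'], ax=axes[1])
plt.show()
```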

As we can see from the images above, the data in the “length” and “height” fields is more or less uniformly distributed.

  • Categorical Variables Analysis

Typically, categorical data attributes represent discrete values belonging to a specific finite set of categories or classes. These discrete values can be textual or numeric in nature; examples include movie, music, and video-game genres, country names, and food and cuisine types.

We can use “countplot” to understand the data distribution of categorical fields. This functionality is readily available in the Seaborn library.

In the images below, we can see the data distribution for the “pet_size” and “condition” fields.
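The corresponding calls would look roughly like this (column names as referenced above):

```python
# Category counts for the two categorical fields
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(x='pet_size', data=train, ax=axes[0])
sns.countplot(x='condition', data=train, ax=axes[1])
plt.show()
```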

  • Outliers Analysis

An outlier is an observation that deviates significantly from the rest of the observations. Outliers can have many causes, such as:

  1. Measurement or input error
  2. Data corruption
  3. True outlier observation (e.g. Michael Jordan in basketball)

We can use “boxplot” to detect outliers in the data. You can refer to the link below to understand more about boxplots and their functionality.

As we can see from the image below, the dots towards the right of the plot indicate outliers in the data.
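A horizontal boxplot like that one takes a single Seaborn call (the “length” column here is my choice for illustration):

```python
# Points beyond the whiskers are flagged as potential outliers
sns.boxplot(x=train['length'])
plt.show()
```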

There are several methods to detect and treat outliers, which are explained in great detail in the post below. I’d recommend going through it to get a better grasp of outlier handling.

For our purposes, we can use the NumPy library’s cube-root function to reduce the number of outliers, as shown in the image below.
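A sketch of that transform, assuming it is applied to the “length” field (my assumption for illustration); np.cbrt takes the element-wise cube root, which compresses the long tail:

```python
import numpy as np

# Cube-root transform pulls extreme values closer to the bulk of the data
train['length'] = np.cbrt(train['length'])

# Re-plot to confirm that fewer points now fall beyond the whiskers
sns.boxplot(x=train['length'])
plt.show()
```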

  • Data Correlation Analysis

In layman’s terms, correlation is a measure of how strongly one variable depends on another.

Consider a hypothetical dataset containing information about IT professionals. We might expect a strong relationship between age and salary, since senior project managers will tend to be paid better than young pup engineers. On the other hand, there is probably a very weak, if any, relationship between shoe size and salary.

Correlations can be positive or negative. Our age and salary example is a case of positive correlation. Individuals with a higher age would also tend to have a higher salary. An example of negative correlation might be age compared to outstanding student loan debt: typically older people will have more of their student loans paid off.

We can use Seaborn’s “heatmap” functionality to display the correlation between the different input features in the training data.
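A minimal sketch: corr() computes pairwise correlations over the numeric columns (on newer pandas versions you may need to pass numeric_only=True), and annot=True prints the coefficient in each cell:

```python
# Pairwise correlations of the numeric features, rendered as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(train.corr(), annot=True, cmap='coolwarm')
plt.show()
```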

As we can see in the image, there is a high positive correlation between the “X1” and “X2” fields. There is also a high negative correlation between the “condition” and “breed category” fields.

I’d recommend going through the post below for a better understanding of correlation analysis and its impact on machine learning models.

Concluding Remarks

This concludes the first part of this comprehensive machine learning guide. In the next post, I’ll cover feature engineering tips and techniques, feature selection methods, and the train/validation/test split.

You can find the codebase for this post at the link below. I’d highly recommend getting your own dataset (either from Kaggle or via web scraping) and trying out the different data-cleansing and EDA methods detailed in this post.

Please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
