Machine Learning & Deep Learning Guide

Part 1: Key terms, Definitions, and starting off with Supervised Learning (Linear Regression)

Mohammad Hatoum
Analytics Vidhya
9 min read · Nov 9, 2019


Part 1: Key terms, Definitions, and starting off with Supervised Learning (Linear Regression).

Part 2: Supervised Learning: Regression (SGD) and Classification (SVM, Naïve Bayes, KNN, and Decision Tree).

Part 3: Unsupervised Learning (KMeans, PCA), Underfitting vs Overfitting, and cross-validation.

Part 4: Deep Learning: Definitions, Layers, Metrics and Loss, Optimizer and Regularization

Almost everyone who wants to start studying or working on machine learning and deep learning gets overwhelmed by theoretical concepts, pages of mathematical rules, or unnecessary detail. In this tutorial, we will cover almost all the concepts related to machine learning and deep learning. Moreover, we will work through hands-on examples to build some cool models for computer vision, speech recognition, and artificial-intelligence game agents. After that, you will be able to participate in machine learning challenges, and maybe even achieve a high rank.

Introduction

In this part of the tutorial, we will start by clearing up the confusion between Artificial Intelligence, Machine Learning, and Deep Learning. Then we will move on to understand different types of machine learning and start to create examples for each. We will do that without getting too deep into the definitions and concepts.

[Image: Deep Learning as a subset of Machine Learning, which is in turn a subset of Artificial Intelligence. Source: NVIDIA]

What is Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL)?

Artificial intelligence (AI) is the intelligence demonstrated by machines, in contrast to the intelligence displayed by humans.

Machine learning (ML) may be defined as the field of computer science, more specifically an application of artificial intelligence, that gives computer systems the ability to learn from data and improve from experience without being explicitly programmed.

Deep Learning (a subset of Machine Learning) works similarly to our brain, using a mesh network technically termed a Deep Neural Network.
Just as our brain identifies patterns to classify things and learns from mistakes, Deep Learning does the same: it compares unknown data with known data and classifies it accordingly.

Now that we know the differences, we will discuss machine learning first and then move on to deep learning.

Types of Machine Learning and their usages

[Image: Machine Learning algorithms]

Machine learning can be categorized into the following:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

I) Supervised Learning:

The data is labeled, meaning that for each input x we have the corresponding output y.

It is mainly split into two concepts:

  1. Regression: Predicts a continuous numerical value. How much will that house sell for?
  2. Classification: Estimates discrete values and assigns a label. Is it a cat or a dog?

Here are some popular algorithms used in supervised learning.

A. Regression:

  1. Linear Regression: We try to find a ‘linear’ relation between the inputs and outputs. Consider that we are working in a two-dimensional space and we plot the inputs on the X-axis and the outputs on the Y-axis. Then linear regression finds the straight line that best fits the data points.
  2. Stochastic Gradient Descent (SGD) Regressor: It implements a plain stochastic gradient descent learning routine that supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000). A quick illustrative sketch follows below.
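As a minimal, hedged sketch of the SGDRegressor API (the synthetic data and every parameter here are assumptions for illustration, not taken from the article's notebook):

    # Fit SGDRegressor on synthetic data where y = 3x + noise.
    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(42)
    X = rng.rand(1000, 1)                    # 1,000 samples, 1 feature
    y = 3 * X.ravel() + rng.normal(scale=0.1, size=1000)

    model = SGDRegressor(max_iter=1000)      # default squared-error loss, L2 penalty
    model.fit(X, y)
    print(model.coef_, model.intercept_)     # roughly [3.] and a near-zero intercept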

B. Classification:

  1. Support Vector Machines (SVM): Mainly used for classification problems. It separates the data into classes with a hyperplane in a multi-dimensional space.
  2. Naïve Bayes: A classification algorithm. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
  3. k-Nearest Neighbors: Widely used to solve classification problems. The idea is to label a test data point x with the mode of the labels of its k closest data points (or their mean if the labels are continuous).
  4. Decision Tree: Mostly used for classification. As the name suggests, we build a tree; at each branch (decision node) we split the data based on one of the independent variables. There are advanced algorithms built on top of decision trees, such as random forest, which is basically a collection of decision trees. A quick sketch of all four classifiers follows below.
[Image: Algorithms of Supervised Learning]
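Since all four classifiers share the same fit/predict interface in Scikit-Learn, here is a minimal, illustrative sketch comparing them on the built-in iris dataset (the dataset and parameters are assumptions for illustration, not part of the original article):

    # Train each classifier on iris and report its test accuracy.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    for clf in (SVC(), GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier()):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))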

II) Unsupervised learning:

So far, we have considered the case where the data is labeled. Sometimes the data is not labeled; here, the model must discover patterns and structure in the data on its own.

It is mainly split into two concepts:

  1. Clustering: Cluster the data into groups by similarities.
  2. Dimensionality Reduction: Reduce dimensionality to compress the data while maintaining its structure and usefulness.

Here are some popular algorithms used in unsupervised learning.

A. Clustering:

K-means: Clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified, as the sketch below shows.
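A minimal, illustrative sketch (the synthetic data and all parameters are assumptions for illustration):

    # Cluster synthetic "blob" data into 3 groups with K-means.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)      # cluster index assigned to each sample
    print(kmeans.cluster_centers_)      # the 3 learned cluster centers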

B. Dimensionality Reduction:

Principal component analysis: A technique used to emphasize variation and bring out strong patterns in a dataset. Dimensionality reduction itself can be approached with two techniques:

  1. Feature elimination: is what it sounds like: we reduce the feature space by eliminating features.
  2. Feature extraction: Say we have ten independent variables. In feature extraction, we create ten “new” independent variables, where each “new” independent variable is a combination of each of the ten “old” independent variables. However, we create these new independent variables in a specific way and order these new variables by how well they predict our dependent variable.
[Image: Unsupervised learning: Clustering]
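And as a minimal sketch of feature extraction with PCA (again, the built-in iris dataset and the parameters are illustrative assumptions):

    # Project the 4 iris features onto 2 principal components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)     # 150 samples, 4 features

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)      # shape: (150, 2)
    print(pca.explained_variance_ratio_)  # variance captured by each component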

Finally, some coding

Before going into reinforcement learning, it is time to do some coding.

Keep in mind that our main goal is to solve machine learning and deep learning problems and challenges. To do so, we will follow these steps:

  1. Data definition: We have some data. This data consists of inputs or features, represented by the variable X, and sometimes outcomes (the labels or predictions to be produced by the model), represented by the variable Y.
  2. Train/Test Split: We split the data into a training set and a test set (later on we will introduce the dev/validation set, but let us stick with the basics).
  3. Preprocessing: We perform some preprocessing, cleaning, and correction on each data set.
  4. Algorithm Selection: We choose one or more algorithms depending on the type of the problem and on the benefits and usages of each algorithm. The resulting model has some parameters that can be set.
  5. Training: We ‘train’ our algorithm on the training set, that is, we learn the mapping from the features (X) to the labels (y). During this phase, the weights of the model are updated.
  6. Prediction: We test our model, with its updated weights, on the test set.
  7. Evaluate Model’s Performance: We compute a specific metric (for example, accuracy) to evaluate our model.
  8. Fine Tuning: Based on the performance results, we tune and update our parameters to increase the accuracy and decrease the loss. Then we repeat steps 5–7 for some iterations or until the accuracy and loss become stable.

We will be using Python with Scikit-Learn, an open-source Python library that implements a range of machine learning, preprocessing, cross-validation, and visualization algorithms using a unified interface.

Supervised learning - Linear Regression:

You can download the complete Kaggle notebook from here

  1. Data definition: We will use the Boston Housing Data. We want to predict house prices based on attributes such as the per capita crime rate by town, the proportion of residential land zoned for lots over 25,000 sq. ft., the average number of rooms per dwelling, and others. I downloaded the file, renamed it to boston.csv, and added the following line as the header of the file:
    "CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
[Image: top five records from the Boston dataset]

2. Train/Test split: As the dataset is very small, we will split it into 10% for testing and 90% for training.
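A sketch of the split (the random_state seed is an assumption to make the example reproducible; the original notebook's seed is not shown here):

    # 90/10 train/test split of the features X and the label y (MEDV).
    from sklearn.model_selection import train_test_split

    X = df.drop("MEDV", axis=1)   # the 13 feature columns
    y = df["MEDV"]                # the house-price target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42)

    print("Training features shape:", X_train.shape)
    print("Testing features shape:", X_test.shape)
    print("Training label shape:", y_train.shape)
    print("Testing label shape:", y_test.shape)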

Result:
Training features shape: (455, 13)
Testing features shape: (51, 13)
Training label shape: (455,)
Testing label shape: (51,)

3. Preprocessing: We didn’t need to do any data preprocessing here, but it will be applied in later examples.

4. Algorithm Selection: We will use LinearRegression from Scikit-Learn.

5. Training: We simply call the function fit and give it X_train and y_train as parameters.
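A minimal sketch of steps 4 and 5 together:

    # Select the algorithm and train it on the training set.
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)   # learn the mapping from features to prices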

6. Prediction: We simply call predict on the test set.
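Continuing the sketch:

    # Predict prices for the unseen test set.
    y_pred = model.predict(X_test)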

7. Evaluate Model’s Performance:
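A sketch of computing the three metrics reported below:

    # Evaluate the predictions with MSE, R², and MAE.
    from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

    print("Mean squared error:", mean_squared_error(y_test, y_pred))
    print("Variance score:", r2_score(y_test, y_pred))
    print("Mean absolute error:", mean_absolute_error(y_test, y_pred))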

Result:
Mean squared error: 41.72457625585755
Variance score: 0.5149662051867079
Mean absolute error: 3.9357920841192797

8. Fine Tuning: At this stage we will not do any fine tuning, to keep things smooth and simple.

Errors and Metrics:

As you can see from step 7 above, we used three functions: mean_squared_error, r2_score, and mean_absolute_error. But what does each one mean? Again, we will use the definitions from Scikit-learn.

  1. Mean squared error: It is a risk metric corresponding to the expected value of the squared (quadratic) error or loss.
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} (y_i - \hat{y}_i)^2$$

Source: https://scikit-learn.org

2. R² score, the coefficient of determination: It represents the proportion of the variance (of y) that is explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

Source: https://scikit-learn.org

3. Mean absolute error: It is a risk metric corresponding to the expected value of the absolute error loss or l1-norm loss.

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} \left| y_i - \hat{y}_i \right|$$

Source: https://scikit-learn.org

Recap

We have reached the end of part 1 of our series. In this part, we learned:

  1. The difference between Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL).
  2. The most popular types of Machine Learning: Supervised, Unsupervised and Reinforcement.
  3. The different concepts of supervised learning: Regression and Classification. Along with definitions of some popular algorithms for Regression: Linear Regression and Stochastic Gradient Descent (SGD) Regressor. For Classification: Support Vector Machines, Naïve Bayes, k-Nearest Neighbors, and Decision Tree.
  4. The different concepts of unsupervised learning: Clustering and Dimensionality Reduction. Along with definitions of some popular algorithms for Clustering (K-means) and for Dimensionality Reduction (Principal component analysis).
  5. The steps to solve machine learning and deep learning problems and challenges:
    1. Data definition
    2. Train/Test split
    3. Preprocessing
    4. Algorithm Selection
    5. Training
    6. Prediction
    7. Evaluate the Model’s Performance
    8. Fine Tuning
  6. Popular errors and metrics used in Regression: Mean squared error, R² score (the coefficient of determination), and Mean absolute error.

Finally, we had a hands-on example of how to use Scikit-Learn and we performed a complete example of Linear Regression.

In Part 2, we will continue with examples of the remaining supervised learning algorithms along with the corresponding error and metrics used for classification. We will also perform some preprocessing for our data and we will learn new concepts such as cross-validation, over-fitting, and under-fitting.

Thanks for reading!
