Machine Learning & Deep Learning Guide

Published in Analytics Vidhya, Nov 17, 2019 (15 min read)

Welcome to part 2 of the Machine Learning & Deep Learning Guide where we learn and practice machine learning and deep learning without being overwhelmed by the concepts and mathematical rules.

Part 1: Key terms, Definitions and starting off with Supervised Learning (Linear Regression).
Part 2: Supervised Learning : Regression (SGD) and Classification (SVM, Naïve Bayes, KNN and Decision Tree).
Part 3: Unsupervised Learning (KMeans,PCA), Underfitting vs Overfitting and cross validation.
Part 4: Deep Learning: Definitions, Layers, Metrics and Loss, Optimizer and Regularization

Learning Objectives

In this part, we will continue with examples of the remaining supervised learning algorithms along with the corresponding error and metrics used for classification.

Types of Machine Learning and their usages

Supervised learning — Stochastic Gradient Descent (SGD) Regressor:

In Part 1, we built our first regression model, Linear Regression. Now we will look at the SGD Regressor.

You can download the complete Kaggle notebook from here

We will also follow the steps we mentioned to solve a machine learning problem:

Data definition
Train/Test split
Preprocessing
Algorithm Selection
Training
Prediction
Evaluate Model’s Performance
Fine Tuning

1. Data definition: We will use the “Crowdedness at the Campus Gym” dataset. Given a time of day (and possibly some other features, such as the weather), the goal is to predict how crowded the gym will be. We will download the data and save it as crowdedness.csv in the folder data.

import numpy as np # linear algebra
import pandas as pd # data processing
df = pd.read_csv("data/crowdedness.csv")

Print the columns to have an overview of the data

print(df.columns.values)

Result:
[‘number_people’ ‘date’ ‘timestamp’ ‘day_of_week’ ‘is_weekend’
‘is_holiday’ ‘temperature’ ‘is_start_of_semester’ ‘is_during_semester’
‘month’ ‘hour’]

Print the info of the data to see the type of each column

print(df.info())

We can see that there is one float column, temperature, and, more interestingly, one object column, date.

Let us print the first few rows and see what we can find.

print(df.head())

Show top five records from crowdedness dataset

The date column holds the date and time at which each record was collected, and the timestamp column encodes the same information. To make sure, we will print the unique values in both columns.

print(df['date'].unique())

print(df['timestamp'].unique())

We were right: both columns only repeat the collection time, which the remaining columns (month, day_of_week, hour) already largely capture, so we can drop both.

df = df.drop(['date','timestamp'],axis=1)
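To see why a date-time column adds little once month, day_of_week and hour are present, here is a minimal sketch using only the standard library. The sample string is a made-up value in an assumed ISO-like format, not a row copied from the dataset; the point is only that the other columns can be re-derived from a single date-time value.

```python
from datetime import datetime

# Hypothetical example value; the exact format of the real column is an assumption.
sample = "2015-08-14 17:00:11-07:00"

moment = datetime.fromisoformat(sample)

# Each of these corresponds to a separate column that we keep.
derived = {
    "month": moment.month,
    "day_of_week": moment.weekday(),  # Monday is 0 in Python's convention
    "hour": moment.hour,
    "seconds_since_midnight": moment.hour * 3600 + moment.minute * 60 + moment.second,
}
print(derived)
```

If the real columns follow the same conventions, keeping date and timestamp would only feed the model the same information twice.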

2. Train/Test split:

# Extract the training and test data
data = df.values
X = data[:, 1:]  # all rows, no label
y = data[:, 0]  # all rows, label only

from sklearn.model_selection import train_test_split
X_train_original, X_test_original, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# View the shape (structure) of the data
print(f"Training features shape: {X_train_original.shape}")
print(f"Testing features shape: {X_test_original.shape}")
print(f"Training label shape: {y_train.shape}")
print(f"Testing label shape: {y_test.shape}")

Result:
Training features shape: (46638, 9)
Testing features shape: (15546, 9)
Training label shape: (46638,)
Testing label shape: (15546,)
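Where do these row counts come from? The sketch below is a simplified, standard-library stand-in for the split, not scikit-learn's implementation: `split_indices` is a hypothetical helper, and its shuffle will not pick the same rows as train_test_split with random_state=42. It only shows the idea of shuffling the row indices and rounding the test share up.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=42):
    """Shuffle row indices and cut them into a train part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # the test part is rounded up
    return indices[n_test:], indices[:n_test]

# 62184 rows in total (46638 + 15546 from the result above)
train_idx, test_idx = split_indices(62184, test_size=0.25)
print(len(train_idx), len(test_idx))  # 46638 15546
```

Every row lands in exactly one of the two parts, so the model never sees a test row during training.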

3. Preprocessing: We will use StandardScaler, which Scikit-learn defines as: “Standardize features by removing the mean and scaling to unit variance.” In other words, each feature will be centered around 0 with a standard deviation of 1. We use this method to help the model train faster and to bring all our features to the same scale.

from sklearn.preprocessing import StandardScaler
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X_train_original)
X_train = scaler.transform(X_train_original)
X_test = scaler.transform(X_test_original)

To understand more about what happened, the code below displays the temperature for the first 60 records in our training set. I am going to use Matplotlib.

# Import library
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

# Specify the number of plots to be displayed and the figure size
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 10))

# Set a title and plot the data
ax1.set_title('Before Scaling')
ax1.plot(X_train_original[:60,3])

ax2.set_title('After Standard Scaler')
ax2.plot(X_train[:60,3])

# Display the graph
plt.show()

Comparison between original and scaled data

As you can see, the two graphs have the same shape; the only difference is the values on the Y-axis. In the original data, the temperature ranges between 46.26 and 67.91, while in the scaled data it ranges between -1.9505 and 1.47793.

For now, let us say that this makes the calculations in the background much easier, which leads to the model running faster. Gradient-based methods such as SGD are sensitive to the scale of the features, so putting all features on a comparable scale helps them converge.

Important notes:

I want to draw your attention to the below notes:

We did the preprocessing after we split the data into training and test sets
We applied the fit only on the training set and not the test set. This means we calculated the mean and standard deviation of the data in the training set and then applied it to both the training and test set.

scaled_train =  (train - train_mean) / train_std_deviation
scaled_test = (test - train_mean) / train_std_deviation

This is a standard procedure in machine learning. The reason behind it is that we want to treat the test set as new, unseen data. This way we test how our model will perform in real applications with new, unseen data.
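The two formulas above can be tried out on a handful of numbers. This is a minimal standard-library sketch with made-up temperatures; `fit_scaler` and `transform` are hypothetical helpers that mimic what fit and transform do for a single feature, using the population standard deviation as StandardScaler does.

```python
import statistics

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply (value - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

train = [50.0, 55.0, 60.0, 65.0]  # made-up temperatures
test = [58.0, 70.0]

# Statistics come from the training values only
train_mean, train_std = fit_scaler(train)
scaled_train = transform(train, train_mean, train_std)
scaled_test = transform(test, train_mean, train_std)

print(round(statistics.mean(scaled_train), 6), round(statistics.pstdev(scaled_train), 6))
print([round(v, 3) for v in scaled_test])
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the scaled test values need not: a test temperature hotter than anything in the training set simply lands beyond the training range, just as new data would in production.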

4. Algorithm Selection: We will use SGDRegressor with some parameters.

# Establish a model
from sklearn.linear_model import SGDRegressor
sgd_huber=SGDRegressor(alpha=0.01, learning_rate='optimal', loss='huber',penalty='elasticnet')

5. Training:

sgd_huber.fit(X_train, y_train)
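What does fit do behind the scenes? The sketch below is a deliberately simplified illustration of stochastic gradient descent for one feature, and it is not what SGDRegressor runs internally: it assumes a plain squared loss, a constant learning rate and no regularization, instead of the huber loss, elasticnet penalty and 'optimal' schedule chosen above. The function name and the toy data are made up for the example.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.1, epochs=50, seed=0):
    """Fit y ~ w * x + b by updating w and b after every single sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # "stochastic": visit the samples in random order
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b

rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + 2.0 for x in xs]  # noise-free toy data with w=3, b=2

w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

Because the inputs here lie between -1 and 1, a fixed learning rate of 0.1 is stable; with unscaled features the same update can overshoot, which is one reason we standardized the data first.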

6. Prediction:

y_pred_lr = sgd_huber.predict(X_test)  # Predict labels

7. Evaluate Model’s Performance:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# The mean squared error
print(f"Mean squared error: {round(mean_squared_error(y_test, y_pred_lr), 3)}")

# R^2 (coefficient of determination): 1 is perfect prediction
print(f"Variance score: {round(r2_score(y_test, y_pred_lr), 3)}")

# The mean absolute error
print(f"Mean absolute error: {round(mean_absolute_error(y_test, y_pred_lr), 3)}")

Result:
Mean squared error: 348.267
Variance score: 0.324
Mean absolute error: 14.617
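To make the three numbers less abstract, here is a small plain-Python sketch that computes them by hand on four made-up values. `regression_metrics` is a hypothetical helper written for this illustration; in practice we keep using the Scikit-learn functions.

```python
def regression_metrics(y_true, y_pred):
    """Return (mean squared error, mean absolute error, R^2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error_sum = sum(e ** 2 for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - squared_error_sum / total_ss  # 1 means a perfect fit
    return mse, mae, r2

mse, mae, r2 = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(mse, mae, round(r2, 3))  # 0.375 0.5 0.949
```

The squared error punishes the one large miss (7 vs 8) more than the absolute error does, and R^2 compares the model against always predicting the mean of the labels.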

8. Fine Tuning: As you can see, the variance score (R²) is low. Can we do something to increase it? Yes. Remember that in step 4 we set some parameters (alpha, learning_rate, loss, and penalty). Let us try different values.

# Try different parameters
# Note: scikit-learn 1.0 renamed loss='squared_loss' to 'squared_error'
# (use 'squared_loss' on older versions)
sgd_l2 = SGDRegressor(alpha=0.01, learning_rate='optimal', loss='squared_error',
                      penalty='l2')
sgd_l2.fit(X_train, y_train)
print(f"Score on training set {round(sgd_l2.score(X_train, y_train),3)}")

y_pred_lr = sgd_l2.predict(X_test)  # Predict labels

# The mean squared error
print(f"Mean squared error: {round(mean_squared_error(y_test, y_pred_lr), 3)}")

# R^2 (coefficient of determination): 1 is perfect prediction
print(f"Variance score: {round(r2_score(y_test, y_pred_lr), 3)}")

# The mean absolute error
print(f"Mean absolute error: {round(mean_absolute_error(y_test, y_pred_lr), 3)}")

Result:
Score on training set 0.506
Mean squared error: 249.126
Variance score: 0.517
Mean absolute error: 12.064

Cool, the error decreased and the variance score increased to more than 0.5.
But the most important question here is: how can we know what to change? The answer is hyper-parameter tuning. The idea is simple: we prepare a few combinations of the hyper-parameters, train a model with each of them, and then keep the combination that gives the best result.

# Establish a model
model = SGDRegressor(learning_rate='optimal', penalty='l2')
from sklearn.model_selection import GridSearchCV
# Grid search - this will take about 1 minute.
# Note: 'squared_error' was called 'squared_loss' before scikit-learn 1.0
param_grid = {
    'alpha': 10.0 ** -np.arange(1, 7),
    'loss': ['squared_error', 'huber', 'epsilon_insensitive'],
}
clf = GridSearchCV(model, param_grid)
clf.fit(X_train, y_train)
print(f"Best Score: {round(clf.best_score_,3)}")
print(f"Best Estimator: {clf.best_estimator_}")
print(f"Best Params: {clf.best_params_}")
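How many models does this grid describe? The sketch below only enumerates the candidates with the standard library; it trains nothing. The loss names follow recent scikit-learn versions ('squared_error' was 'squared_loss' in older ones), and the variable names are made up for the illustration.

```python
from itertools import product

param_grid = {
    "alpha": [10.0 ** -k for k in range(1, 7)],  # 0.1 down to 0.000001
    "loss": ["squared_error", "huber", "epsilon_insensitive"],
}

# Every combination of one alpha with one loss is a candidate model
param_names = list(param_grid)
combinations = [dict(zip(param_names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 6 alphas x 3 losses = 18 candidates
print(combinations[0])
```

GridSearchCV additionally cross-validates each candidate, so with k folds it fits 18 × k models before reporting the best mean validation score. This is why even a small grid can take a while.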

So what is next? So far we saw how our SGDRegressor model performed on the dataset, and it was not very good. How can we do better?
One way would be to use a different range of parameters, but I don’t think it will improve the score much.
Another way would be to switch to a different regression model. RandomForestRegressor would be a good option.
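As a hedged illustration of why a different model family can help, the sketch below compares the two regressors on synthetic data, not on the gym dataset. The assumption baked into the toy data is that crowdedness rises and falls over the day, which a straight line cannot follow; the variable names and the numbers are made up, and the result says nothing about how the real data will behave.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: the number of people follows a daily wave plus noise
rng = np.random.default_rng(0)
hour = rng.uniform(0, 24, size=2000)
people = 50 + 40 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 5, size=2000)

X = hour.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, people, test_size=0.25, random_state=42)

# Linear model trained with SGD (scaled first) versus a tree ensemble
linear = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
forest = RandomForestRegressor(n_estimators=50, random_state=0)

linear_r2 = linear.fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = forest.fit(X_tr, y_tr).score(X_te, y_te)
print(f"SGDRegressor R2: {linear_r2:.3f}")
print(f"RandomForestRegressor R2: {forest_r2:.3f}")
```

On data like this the forest scores higher because it can model the non-linear shape; whether the same holds for the gym data has to be checked by running it there.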

Remember, our goal is not to solve one problem by itself but to have the knowledge, intuitions, procedures, and tools to solve such problems and challenges.

Supervised learning — Classification:

In order to have a better knowledge of the subject, we will consider one classification problem and then use and compare more than one classification algorithm.

You can download the complete Kaggle notebook from here

1. Data definition: We will use the Titanic data. In this challenge, we are asked to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

import warnings
warnings.filterwarnings("ignore")

# Load the Titanic dataset
import pandas as pd
train_df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")

Print the columns to have an overview of the data

print(train_df.columns.values)

Result:
[‘PassengerId’ ‘Survived’ ‘Pclass’ ‘Name’ ‘Sex’ ‘Age’ ‘SibSp’ ‘Parch’
‘Ticket’ ‘Fare’ ‘Cabin’ ‘Embarked’]

print(test_df.columns.values)

Result:
[‘PassengerId’ ‘Pclass’ ‘Name’ ‘Sex’ ‘Age’ ‘SibSp’ ‘Parch’ ‘Ticket’ ‘Fare’
‘Cabin’ ‘Embarked’]

Let us print the first few rows and see what we can find.

train_df.head()

test_df.head()

Before giving feedback on the findings let us print the info of the data to see the type of each column then we will call the describe function for the columns of the type object.

train_df.info()

test_df.info()

Let us see some stats related to the fields of type object

train_df.describe(include=['O'])

test_df.describe(include=['O'])

Now we will check which features have null values

train_df.columns[train_df.isnull().any()]

Result:
[‘Age’, ‘Cabin’, ‘Embarked’]

Interestingly, from the above, we could have some basic knowledge of the data. We can notice that:

We have 5 columns of type object: Name, Sex, Ticket, Cabin, and Embarked.
PassengerId is the unique ID of the passenger so it should be dropped.
The name seems to have unique values and might have little effect on survival, so we might need to drop it. I also noticed that it includes some titles (Mr., Mrs., Miss…) which can be used to generate new features. But for now, we will ignore it.
The ticket is alphanumeric. It might have unique values, so it might be dropped as well.
We have the following data types:
A. Ordinal: Pclass
B. Categorical: Survived, Sex and Embarked
C. Discrete: SibSp and Parch
D. Continuous: Fare and Age
E. Alphanumeric: Ticket and Cabin

First, let us extract the Title feature from the feature Name so we can drop Name. I will use the feature engineering method suggested by Kaggle. I will also drop PassengerId and Ticket.

import numpy as np

def substrings_in_string(big_string, substrings):
    # Return the first title from the list that occurs in the name
    for substring in substrings:
        if substring in big_string:
            return substring
    return np.nan

title_list=['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
            'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
            'Don', 'Jonkheer']
train_df['Title']=train_df['Name'].map(lambda x: substrings_in_string(x, title_list))
test_df['Title']=test_df['Name'].map(lambda x: substrings_in_string(x, title_list))

# Replace all titles with Mr, Mrs, Miss, Master
def replace_titles(x):
    title=x['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 'Mr'
    elif title in ['Countess', 'Mme']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title =='Dr':
        # The Sex column holds lowercase values ('male' / 'female')
        if x['Sex']=='male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title

train_df['Title']=train_df.apply(replace_titles, axis=1)
test_df['Title']=test_df.apply(replace_titles, axis=1)

# Drop the columns 'Name', 'PassengerId' and 'Ticket'
train_df = train_df.drop(['Name','PassengerId','Ticket'],axis=1)
test_df = test_df.drop(['Name','PassengerId','Ticket'],axis=1)
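As an aside, the title can also be pulled out with a regular expression instead of the substring search used above. This is a different technique, shown only as a sketch: `extract_title` is a hypothetical helper, and it assumes every name follows the "Surname, Title. Given names" layout.

```python
import re

# Titanic names look like "Surname, Title. Given names"
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the text between the comma and the first period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
print([extract_title(n) for n in sample_names])  # ['Mr', 'Mrs', 'Miss']
```

Because the pattern anchors on the comma and the period, it cannot be fooled by a title-like substring elsewhere in the name, but it returns whatever title it finds, so rare titles would still need to be grouped afterwards.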

OK, now from the definition of the features provided by Kaggle we see that
“SibSp is the number of siblings/spouses aboard the Titanic” and “Parch is the number of parents/children aboard the Titanic”.
So let us create a new feature named Family_Size that holds the sum of SibSp and Parch.

train_df['Family_Size']=train_df['SibSp']+train_df['Parch']
test_df['Family_Size']=test_df['SibSp']+test_df['Parch']

Regarding the missing values, I have the following plan to fill them:
1. Age: I will use the mean age
2. Cabin: I will fill them with ‘N’
3. Embarked: There are only two missing values so I will use mode

Moreover, I will convert Age and Fare from a continuous variable to a categorical variable. This can be done using the cut function from the pandas library.

import numpy as np

for df in [train_df, test_df]:

    # Fill missing ages with the mean age, then bin the ages
    meanAge=np.mean(df.Age)
    df.Age=df.Age.fillna(meanAge)
    bins = (-1, 0,  50, 100)
    group_names = ['Unknown', 'Under_50', 'More_Than_50']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories

    # Keep only the first letter of the cabin; 'N' marks a missing cabin
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x: x[0])

    # Fill missing ports with the most frequent one (pandas' mode() skips NaN)
    modeEmbarked = df.Embarked.mode()[0]
    df.Embarked = df.Embarked.fillna(modeEmbarked)

    # Mark missing fares with -0.5 so that they fall into the 'Unknown' bin
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
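To see how the cut function assigns the groups, here is a small sketch with made-up fares, one of them missing. It reuses the fare bins from the code above; only the sample values are invented.

```python
import pandas as pd

fares = pd.Series([None, 7.25, 8.0, 30.0, 512.33])  # made-up fares, one missing

bins = (-1, 0, 8, 15, 31, 1000)
group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']

# Missing fares become -0.5, which lands in the first interval (-1, 0]
binned = pd.cut(fares.fillna(-0.5), bins, labels=group_names)
print(list(binned))
# ['Unknown', '1_quartile', '1_quartile', '3_quartile', '4_quartile']
```

Each interval includes its right edge by default, which is why a fare of exactly 8 still belongs to the first quartile group.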

What we did till now is basic feature engineering. We could continue and create new features or check the correlations and dependencies between features but we will leave it for now.

2. Train/Test split:

# Extract the training and test data
y = train_df['Survived']
X = train_df.drop('Survived',axis=1)

from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size= 0.2, random_state=0)

# View the shape (structure) of the data
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_val.shape}")
print(f"Training label shape: {y_train.shape}")
print(f"Testing label shape: {y_val.shape}")

Result:
Training features shape: (712, 10)
Testing features shape: (179, 10)
Training label shape: (712,)
Testing label shape: (179,)

3. Preprocessing: Now it is time to convert the fields of the type object into numerics. Here are the fields that are objects.

train_df.describe(include=['O'])

The features Sex, Embarked and Title have few distinct values, so we will use OrdinalEncoder (the code below applies it to Cabin as well).
This way the values will become integers as follows:
1. Sex: Female will be replaced by 0 and male by 1
2. Embarked: C will be replaced by 0, Q by 1, and S by 2
3. Title: Master will be replaced by 0, Miss by 1, Mr by 2, and Mrs by 3

For the features Age and Fare, I will use LabelEncoder

# Print top 10 records before transformation
X_train[0:10]

from sklearn.preprocessing import OrdinalEncoder
encoder_sex = OrdinalEncoder()
X_train['Sex'] = encoder_sex.fit_transform(X_train['Sex'].values.reshape(-1, 1))
X_val['Sex'] = encoder_sex.transform(X_val['Sex'].values.reshape(-1, 1))

encoder_cabin = OrdinalEncoder()
X_train['Cabin'] = encoder_cabin.fit_transform(X_train['Cabin'].values.reshape(-1, 1))
X_val['Cabin'] = encoder_cabin.transform(X_val['Cabin'].values.reshape(-1, 1))

encoder_embarked = OrdinalEncoder()
X_train['Embarked'] = encoder_embarked.fit_transform(X_train['Embarked'].values.reshape(-1, 1))
X_val['Embarked'] = encoder_embarked.transform(X_val['Embarked'].values.reshape(-1, 1))

encoder_title = OrdinalEncoder()
X_train['Title'] = encoder_title.fit_transform(X_train['Title'].values.reshape(-1, 1))
X_val['Title'] = encoder_title.transform(X_val['Title'].values.reshape(-1, 1))

from sklearn.preprocessing import LabelEncoder
features = ['Fare',  'Age']

for feature in features:
    le = LabelEncoder()
    le = le.fit(X_train[feature])
    X_train[feature] = le.transform(X_train[feature])
    X_val[feature] = le.transform(X_val[feature])
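The idea behind these encoders can be sketched in a few lines of plain Python. This is not Scikit-learn's code: `fit_ordinal` and `transform_ordinal` are hypothetical helpers that imitate the default behaviour, where the categories found in the training data are sorted and numbered from 0.

```python
def fit_ordinal(values):
    """Number the distinct categories in sorted order, starting at 0."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

def transform_ordinal(values, mapping):
    # A category that was not seen during fitting raises KeyError here
    return [mapping[v] for v in values]

train_embarked = ['S', 'C', 'S', 'Q']  # made-up sample values
val_embarked = ['Q', 'S']

mapping = fit_ordinal(train_embarked)
print(mapping)                                   # {'C': 0, 'Q': 1, 'S': 2}
print(transform_ordinal(val_embarked, mapping))  # [1, 2]
```

As with the scaler, the mapping is learned on the training set only and then reused for the validation set. One consequence is that a category appearing only in the validation set has no code, and by default the Scikit-learn encoder also reports an error in that case.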

Now we will print the result after transformation

# Print top 10 records after transformation
X_train[0:10]

4. Algorithm Selection: We will use Support Vector Machines (Kernel SVM), Naïve Bayes (GaussianNB), k-Nearest Neighbors (KNeighborsClassifier) and Decision Tree (DecisionTreeClassifier), then compare the results.

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

names = ["Kernel SVM", "Naive Bayes", "K Nearest Neighbor",
         "Decision Tree"]

classifiers = [
    SVC(kernel = 'rbf',gamma='scale'),
    GaussianNB(),
    KNeighborsClassifier(3),
    DecisionTreeClassifier(max_depth=5)]

5. Training and Prediction :

# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)
    # Here we will add the error and evaluation metrics

6. Evaluation of each Model’s Performance: Here we will check the errors and evaluation metrics for classification:

We will use 3 functions: accuracy_score, classification_report, and confusion_matrix. The definitions are from Scikit-learn

1. Accuracy Score: It computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

from sklearn.metrics import accuracy_score

data = []
# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_val)
    print(f"Accuracy for {name} : {accuracy_score(y_val, y_pred)*100.0}")
    data.append(accuracy_score(y_val, y_pred)*100.0)

models = pd.DataFrame({
    'Model': names,
    'Score': data})
models.sort_values(by='Score', ascending=False)

As you can see, K Nearest Neighbor has the highest accuracy.
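The accuracy definition quoted above is simple enough to write by hand. The sketch below is a plain-Python illustration with made-up labels; `accuracy` is a hypothetical helper that mirrors the fraction and count variants.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Fraction of correct predictions, or their count when normalize is False."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))                   # 0.8
print(accuracy(y_true, y_pred, normalize=False))  # 4
```

Accuracy treats every mistake the same way, which is why the next two reports look at which kind of mistake each model makes.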

2. Classification Report: Build a text report showing the main classification metrics (precision, recall, f1-score, and support)

a. The precision (also called positive predictive value) is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

b. The recall (also known as sensitivity) is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

Source: Wikipedia

c. The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

d. The support is the number of occurrences of each class in y_true
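The four definitions above can be checked with a few lines of plain Python. The counts are hypothetical and `precision_recall_f1` is a helper written only for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from the raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

In this example the support of the positive class is tp + fn = 60, the number of samples that are truly positive. Note how the F1 score sits closer to the lower of the two values, which is the effect of using the harmonic mean.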

from sklearn.metrics import classification_report

# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_val)

    print(f"Classification Report for {name}")
    print(classification_report(y_val, y_pred))
    print('_'*60)

Classification Report for K Nearest Neighbor

3. Confusion Matrix (also known as an error matrix): a specific table layout that allows visualization of the performance of an algorithm. It reports the number of false positives, false negatives, true positives, and true negatives.

# I will use the code from : https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import matplotlib.pyplot as plt
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

class_names = np.array([0,1])

np.set_printoptions(precision=2)

# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_val)

    print(f"Confusion Matrix for {name}")
    # Plot non-normalized confusion matrix
    plot_confusion_matrix(y_val, y_pred, classes=class_names,
                          title='Confusion matrix, without normalization')
    # Plot normalized confusion matrix
    plot_confusion_matrix(y_val, y_pred, classes=class_names, normalize=True,
                          title='Normalized confusion matrix')
    plt.show()
    print('_'*60)
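Leaving the plotting aside, the numbers inside a confusion matrix are just counts. The plain-Python sketch below builds the 2x2 matrix for made-up labels and then normalizes each row, as the normalize=True option above does. Both helpers are hypothetical and only handle the labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for labels 0 and 1: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def normalize_rows(matrix):
    """Divide every cell by its row total, so each row sums to 1."""
    return [[cell / sum(row) for cell in row] for row in matrix]

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)                  # [[3, 1], [1, 3]]
print(normalize_rows(counts))  # [[0.75, 0.25], [0.25, 0.75]]
```

With this layout the first row holds the true negatives and false positives, and the second row holds the false negatives and true positives.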

So what is next? If we want to improve the performance, we can change the parameters used for each algorithm (for example, use a different n_neighbors in KNeighborsClassifier), or we can use different algorithms such as Random Forest or XGBoost.

Recap

We have reached the end of part 2 of our series. In this part we were able to learn:

The second regression model: Stochastic Gradient Descent (SGD) Regressor
Preprocessing for our dataset: StandardScaler, OrdinalEncoder and LabelEncoder.
Data Analysis and Feature Engineering.
Hyper-Parameters Tuning.
The basic algorithms used for Supervised Learning — Classification: Support Vector Machines, Naïve Bayes, k-Nearest Neighbors and Decision Tree.
The different error measures and metrics used for Classification: Accuracy Score, Classification Report and Confusion Matrix.

In part 3 of our tutorial we will discuss Unsupervised Learning and how to use it with Supervised Learning. We will also learn how to perform cross-validation and the difference between over-fitting and under-fitting. After that, we will give a brief overview of Reinforcement Learning. Then we can start with Deep Learning.

Thanks for reading!