Machine Learning & Deep Learning Guide

Mohammad Hatoum
Published in Analytics Vidhya · Nov 20, 2019

Welcome to part 3 of the Machine Learning & Deep Learning Guide where we learn and practice machine learning and deep learning without being overwhelmed by the concepts and mathematical rules.

Part 1: Key terms, Definitions and starting off with Supervised Learning (Linear Regression).

Part 2: Supervised Learning: Regression (SGD) and Classification (SVM, Naïve Bayes, KNN and Decision Tree).

Part 3: Unsupervised Learning (KMeans, PCA), Underfitting vs Overfitting and Cross-Validation.

Part 4: Deep Learning: Definitions, Layers, Metrics and Loss, Optimizer and Regularization

Learning Objectives

In this part, we will discuss Unsupervised Learning with examples and how to use it with Supervised Learning. We will also learn how to perform cross-validation and the difference between over-fitting and under-fitting.

Source: Scikit-Learn

Unsupervised Learning

Until now we have considered supervised learning, where we have a set of features (X) and labels (y) and we want to learn the mapping from features to labels. In unsupervised learning, we only have the features (X) and we want to find patterns in the data.

As mentioned in part 1, we split unsupervised learning into two concepts:

  1. Clustering: Cluster the data into groups by similarities.
  2. Dimensionality Reduction: reduce dimensionality to compress the data while maintaining its structure and usefulness.

Clustering

Clustering groups the data by similarity. We will consider k-means for clustering. It clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. This algorithm requires the number of clusters to be specified.

You can download the complete Kaggle notebook from here

1. Data Definition: We will use the handwritten digits dataset.

%matplotlib inline
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # for plot styling
from scipy.stats import mode
digits = load_digits()
print(digits.data.shape)

Result: (1797, 64)

2. Algorithm Selection: We will use KMeans

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)

Result: (10, 64)
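Note that we had to specify n_clusters=10 up front. When the number of clusters is not known in advance, a common heuristic (not part of the original notebook) is the elbow method: fit KMeans for several values of k and look for the point where the inertia curve flattens. A minimal sketch:

# Elbow-method sketch (illustrative, not from the original notebook):
# fit KMeans for several k and inspect inertia_, the within-cluster
# sum of squares that the algorithm minimizes.
inertias = []
ks = range(2, 16)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(digits.data)
    inertias.append(km.inertia_)
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()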

Let us plot the learned cluster centers.

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Digits Result

Cool, KMeans recovered recognizable centers for most of the digits. As you can see, it had trouble with 1 and 8. To measure accuracy, we now map each cluster to the most common true label among its members.

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

We will test the accuracy for the KMeans model. Then we will plot the confusion matrix.

from sklearn.metrics import accuracy_score
print(f"Accuracy for KMeans : {accuracy_score(digits.target, labels)}")

Result: Accuracy for KMeans : 0.7935447968836951

from sklearn.metrics import confusion_matrix
print("Confusion Matrix KMeans")
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
Confusion Matrix KMeans

Dimensionality Reduction (Feature Elimination)

It is what it sounds like: we reduce the feature space by eliminating features. We will use Principal component analysis (PCA). It is a technique used to emphasize variation and bring out strong patterns in a dataset.

You can download the complete Kaggle notebook from here

  1. Data Definition: We will use the Olivetti faces dataset which contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.
data=np.load("../input/olivetti_faces.npy")
labels=np.load("../input/olivetti_faces_target.npy")
print(f"Shape of inputs: {data.shape}")
print(f"Shape of labels: {labels.shape}")
print(f"Unique values for labels: {np.unique(labels)}")

Result:

Shape of inputs: (400, 64, 64)
Shape of labels: (400,)
Unique values for labels: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]

So we have 400 images of 40 people, each 64×64 pixels.

Let us see the first image.

from skimage.io import imshow
imshow(data[0])
Olivetti image

Let us reshape the data.

X=data.reshape((data.shape[0],data.shape[1]*data.shape[2]))
print("After reshape:",X.shape)

Result:

After reshape: (400, 4096)

2. Train/Test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, labels, test_size=0.25, stratify=labels, random_state=0)
print("X_train shape:",X_train.shape)
print("y_train shape:{}".format(y_train.shape))

Result:

X_train shape: (300, 4096)
y_train shape:(300,)

3. Algorithm Selection: We will use PCA

from sklearn.decomposition import PCA
pca=PCA()
pca.fit(X)
plt.figure(1, figsize=(12, 8))
plt.plot(pca.explained_variance_, linewidth=2)
plt.xlabel('Components')
plt.ylabel('Explained Variances')
plt.show()
Explained Variances of Components

As you can see, the explained variance flattens out after roughly 60 components, which means around 60 components capture most of the useful information.
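Instead of eyeballing the curve, we can also read the number of components off the cumulative explained_variance_ratio_. A minimal sketch (the 95% threshold is an arbitrary choice for illustration):

# Pick the number of components that explains a chosen share of the variance.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% of the variance: {n_components_95}")
# scikit-learn can also do this directly: PCA(n_components=0.95) keeps just
# enough components to explain 95% of the variance.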

4. Training:

n_components=60
pca=PCA(n_components=n_components, whiten=True)
pca.fit(X)

Result:

PCA(copy=True, iterated_power='auto', n_components=60, random_state=None,
    svd_solver='auto', tol=0.0, whiten=True)

We will plot the average face, i.e. the mean face computed by PCA (pca.mean_).

fig,ax=plt.subplots(1,1,figsize=(8,8))
ax.imshow(pca.mean_.reshape((64,64)), cmap="gray")
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Average Face')
Average Face

Let us plot the eigenfaces (the principal components reshaped back into images).

number_of_eigenfaces = len(pca.components_)
eigen_faces = pca.components_.reshape((number_of_eigenfaces, data.shape[1], data.shape[2]))
cols = 10
rows = int(number_of_eigenfaces / cols)
fig, axarr = plt.subplots(nrows=rows, ncols=cols, figsize=(15, 15))
axarr = axarr.flatten()
for i in range(number_of_eigenfaces):
    axarr[i].imshow(eigen_faces[i], cmap="gray")
    axarr[i].set_xticks([])
    axarr[i].set_yticks([])
    axarr[i].set_title("eigen id:{}".format(i))
plt.suptitle("All Eigen Faces")
Eigen Faces

Transform the training and testing sets using PCA.

X_train_pca=pca.transform(X_train)
X_test_pca=pca.transform(X_test)
print(f"Shape before {X_train.shape} vs shape after {X_train_pca.shape}")

Result:

Shape before (300, 4096) vs shape after (300, 60)

So we were able to decrease the dimensionality of the inputs from 4096 to 60.

5. Prediction and Evaluation: We will use LogisticRegression to study the accuracy of the model after we applied the PCA transformation.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
clf = LogisticRegression()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print("Accuracy score:{:.2f}".format(metrics.accuracy_score(y_test, y_pred)))

Result:

Accuracy score:0.95

Impressive, we got an accuracy of 0.95. Now I will let you try to do the same without the PCA dimensionality reduction: load the data, split it into train and test sets, then fit LogisticRegression directly on the 4096 raw pixel values.

The accuracy will be about the same (0.95), but the model will take much longer to train.
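If you want to verify this for yourself, a rough timing sketch along these lines could work (illustrative only; max_iter=1000 is an assumption to avoid convergence warnings, and exact timings depend on your machine):

import time
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Compare fit time and accuracy on raw pixels vs. the 60 PCA components.
for name, (tr, te) in {"raw pixels (4096 features)": (X_train, X_test),
                       "PCA (60 components)": (X_train_pca, X_test_pca)}.items():
    clf = LogisticRegression(max_iter=1000)
    start = time.time()
    clf.fit(tr, y_train)
    elapsed = time.time() - start
    acc = metrics.accuracy_score(y_test, clf.predict(te))
    print(f"{name}: accuracy={acc:.2f}, fit time={elapsed:.2f}s")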

This is the main goal: decrease the dimensionality and keep the important information without hurting accuracy.

Feature Extraction

Say we have ten independent variables. In feature extraction, we create ten “new” independent variables, where each “new” independent variable is a combination of each of the ten “old” independent variables. However, we create these new independent variables in a specific way and order these new variables by how well they predict our dependent variable.

You can download the complete Kaggle notebook from here

I used the example from the amazing book: Python Data Science Handbook.
Implementation is also done by Jake VanderPlas.

We will do the following :

  1. Obtain a set of image thumbnails of faces to constitute “positive” training samples.
  2. Obtain a set of image thumbnails of non-faces to constitute “negative” training samples.
  3. Extract [Histogram of Oriented Gradients (HOG)](https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients) features from these training samples.
  4. Train a linear SVM classifier on these samples.
  5. For an “unknown” image, pass a sliding window across the image, using the model to evaluate whether that window contains a face or not.
  6. If detections overlap, combine them into a single window.

1. Get positive images: We will use the Labeled Faces in the Wild (LFW) dataset. It is a collection of JPEG pictures of famous people collected over the internet, all details are available on the official website:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people()
positive_patches = faces.images
print(f"Shape of positive data {positive_patches.shape}")

Result:

Shape of positive data (13233, 62, 47)

So we have more than 13,000 images, each 62×47 pixels. Let us plot the first image:

from skimage.io import imshow
imshow(faces.images[0])
Sample of Positive Images

2. Get negative images: Now we will prepare some images that don't contain faces. We will use 10 of the sample images shipped with scikit-image ('camera', 'text', 'coins', 'moon', 'page', 'clock', 'immunohistochemistry', 'chelsea', 'coffee', 'hubble_deep_field').

from skimage import data, transform, color, feature
imgs_to_use = ['camera', 'text', 'coins', 'moon',
               'page', 'clock', 'immunohistochemistry',
               'chelsea', 'coffee', 'hubble_deep_field']
images = [color.rgb2gray(getattr(data, name)())
          for name in imgs_to_use]

Let us plot them along with their labels

for i, im in enumerate(images):
    print(imgs_to_use[i])
    imshow(im)
    plt.show()
Sample of Negative Images

Now we will use PatchExtractor, which extracts patches from a collection of images, to generate negative samples at several scales.

from sklearn.feature_extraction.image import PatchExtractor

def extract_patches(img, N, scale=1.0, patch_size=positive_patches[0].shape):
    extracted_patch_size = tuple((scale * np.array(patch_size)).astype(int))
    extractor = PatchExtractor(patch_size=extracted_patch_size,
                               max_patches=N, random_state=0)
    patches = extractor.transform(img[np.newaxis])
    if scale != 1:
        patches = np.array([transform.resize(patch, patch_size)
                            for patch in patches])
    return patches

negative_patches = np.vstack([extract_patches(im, 1000, scale)
                              for im in images for scale in [0.5, 1.0, 2.0]])
print(f"Shape of negative data {negative_patches.shape}")

Result:

Shape of negative data (30000, 62, 47)

We now have 30,000 image patches that do not contain faces. Let us plot a few:

fig, ax = plt.subplots(6, 10)
for i, axi in enumerate(ax.flat):
    axi.imshow(negative_patches[500 * i], cmap='gray')
    axi.axis('off')

3. Combining the images and applying HOG: Now we will combine the positive and negative images and apply HOG to them. We will also label images with faces as 1 and images without faces as 0.

from itertools import chain
X_train = np.array([feature.hog(im)
                    for im in chain(positive_patches,
                                    negative_patches)])
y_train = np.zeros(X_train.shape[0])
y_train[:positive_patches.shape[0]] = 1
print(f"Shape after combining the images: {X_train.shape}")

Result: Shape after combining the images: (43233, 1215)

4. Training a support vector machine: We will use LinearSVC.

from sklearn.svm import LinearSVC

model = LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                  intercept_scaling=1, loss='squared_hinge', max_iter=1000,
                  multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
                  verbose=0)
model.fit(X_train, y_train)

5. Get a new image: We will use the astronaut image from skimage.

import skimage.data
test_image = skimage.data.astronaut()
test_image = skimage.color.rgb2gray(test_image)
test_image = skimage.transform.rescale(test_image, 0.5)
test_image = test_image[:160, 40:180]
plt.imshow(test_image, cmap='gray')
plt.axis('off');
Astronaut

Next, let’s create a window that iterates over patches of this image, and compute HOG features for each patch:

def sliding_window(img, patch_size=positive_patches[0].shape,
                   istep=2, jstep=2, scale=1.0):
    Ni, Nj = (int(scale * s) for s in patch_size)
    for i in range(0, img.shape[0] - Ni, istep):
        for j in range(0, img.shape[1] - Ni, jstep):
            patch = img[i:i + Ni, j:j + Nj]
            if scale != 1:
                patch = transform.resize(patch, patch_size)
            yield (i, j), patch

indices, patches = zip(*sliding_window(test_image))
patches_hog = np.array([feature.hog(patch) for patch in patches])
print(f"Patches Hog shape: {patches_hog.shape}")

Result: Patches Hog shape: (1911, 1215)

Finally, we can take these HOG-featured patches and use our model to evaluate whether each patch contains a face:

labels = model.predict(patches_hog)
print(f"labels: {labels.sum()}")

Result: labels: 49.0

6. Face Detection: We see that out of nearly 2,000 patches, we have found 49 detections. Let’s use the information we have about these patches to show where they lie on our test image, drawing them as rectangles:

fig, ax = plt.subplots()
ax.imshow(test_image, cmap='gray')
ax.axis('off')
Ni, Nj = positive_patches[0].shape
indices = np.array(indices)
for i, j in indices[labels == 1]:
    ax.add_patch(plt.Rectangle((j, i), Nj, Ni, edgecolor='red',
                               alpha=0.3, lw=2, facecolor='none'))
Face Detection Result

As you can see, using HOG feature extraction and a linear SVC we were able to build a simple face detection model.
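Step 6 of our plan, combining overlapping detections into a single window, was not implemented above. A minimal greedy non-maximum-suppression sketch, reusing the indices, labels, Ni and Nj from the detection step, could look like this (illustrative; the 30% overlap threshold is an arbitrary choice):

import numpy as np

def non_max_suppression(boxes, overlap_thresh=0.3):
    # Greedy NMS: keep a box, then drop remaining boxes that overlap it too much.
    boxes = boxes.astype(float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    area = (x2 - x1) * (y2 - y1)
    order = np.argsort(y2)
    keep = []
    while len(order) > 0:
        last = order[-1]
        keep.append(last)
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[last], x1[order[:-1]])
        yy1 = np.maximum(y1[last], y1[order[:-1]])
        xx2 = np.minimum(x2[last], x2[order[:-1]])
        yy2 = np.minimum(y2[last], y2[order[:-1]])
        w = np.maximum(0, xx2 - xx1)
        h = np.maximum(0, yy2 - yy1)
        overlap = (w * h) / area[order[:-1]]
        order = order[:-1][overlap <= overlap_thresh]
    return boxes[keep].astype(int)

# Build (x1, y1, x2, y2) boxes from the window origins flagged as faces
detected = indices[labels == 1]
boxes = np.array([[j, i, j + Nj, i + Ni] for i, j in detected])
merged = non_max_suppression(boxes)
print(f"{len(boxes)} raw detections -> {len(merged)} merged windows")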

Overfitting vs Underfitting

Consider that we have two sets of data: a training set and a test set. We define the following:

Underfitting: this is where a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data. It happens when the model is not complex enough to capture the underlying trend in the data.

Overfitting: this is where a model matches the training data almost perfectly but does poorly on test and other new data. We say that it doesn't generalize well to unseen data.
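The quickest way to spot either problem is to compare training and test scores. A small illustrative sketch (a decision tree on the digits data, not taken from the original article):

# Comparing training and test accuracy reveals under- and overfitting.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for depth in [1, 5, None]:  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
# A very shallow tree scores poorly on both sets (underfitting); an unconstrained
# tree scores near 1.0 on training but noticeably lower on test (overfitting).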

Solving UnderFitting:

Underfitting can be addressed by making the model more complex or flexible, adding (or engineering) more informative features, reducing regularization, or simply moving on and trying alternative machine learning algorithms.

Solving Overfitting:

One solution was shown earlier when we used train_test_split to create a validation (holdout) set. The drawback is that we lose a portion of our data for model training: with a validation size of 0.3, we hold out 30% of the training data. This also introduces an element of luck, because important patterns may end up only in the held-out portion and never be seen during training.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but they generally follow the same principles).

The following procedure is followed for each of the k “folds”:

1) A model is trained using k-1 of the folds as training data

2) The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

k “folds”
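A minimal sketch of this procedure written out by hand with KFold (illustrative model and data, not from the original notebook):

# What cross_val_score automates: split into k folds, train on k-1, validate on the rest.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_digits(return_X_y=True)
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])  # train on k-1 folds
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))  # validate on the held-out fold
print("Fold scores:", [round(s, 3) for s in scores])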

In order to evaluate the result of cross-validation, we can use cross_val_score. It is easy to use: we only need to set the cv parameter to the number of folds.

from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
Source: Scikit-learn

Bias-variance trade-off

Bias: the amount of error introduced by approximating a real-world phenomenon with a simplified model. It can be seen as the gap between the training set error and the human-level error (or any other optimal or base error).

Variance: how much your model's test error changes based on variation in the training data. It reflects the model's sensitivity to the idiosyncrasies of the data set it was trained on.

As a model increases in complexity and becomes more flexible, its bias decreases (it does a better job of explaining the training data), but its variance increases (it doesn't generalize as well).

Technically speaking, underfitting happens when we have high bias. Overfitting occurs when we have high variance.

Underfitting (high bias) vs Overfitting (high variance)

Ultimately, in order to have a good model, you need one with low bias and low variance. This means we need to find the best hyper-parameters. We already mentioned this in part two of this series where we used GridSearchCV.

# Establish a model (recalled from part 2 of this series)
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

model = SGDRegressor(learning_rate='optimal', penalty='l2')
# Grid search - this will take about 1 minute.
param_grid = {
    'alpha': 10.0 ** -np.arange(1, 7),
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive'],
}
clf = GridSearchCV(model, param_grid)
clf.fit(X_train, y_train)
print(f"Best Score: {round(clf.best_score_, 3)}")
print(f"Best Estimator: {clf.best_estimator_}")
print(f"Best Params: {clf.best_params_}")

However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
The function validation_curve can help in this case.

Source: Scikit-Learn
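A minimal sketch of such a call (the SVC estimator and gamma range below are illustrative choices in the spirit of the scikit-learn documentation, not from the original article):

# Sweep one hyperparameter (here: SVC's gamma) and compare training scores
# against cross-validation scores.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
param_range = np.logspace(-6, -1, 5)
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
print("Mean train scores:", train_scores.mean(axis=1).round(3))
print("Mean CV scores:   ", valid_scores.mean(axis=1).round(3))
# Where the training score is high but the CV score drops, the model is
# overfitting for that gamma; where both are low, it is underfitting.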

We can also use a learning curve to see whether adding more training data would improve the performance of our model, and to check whether it suffers from high bias or high variance.

Source: Scikit-learn
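A minimal sketch (GaussianNB on the digits data is purely an illustrative choice):

# Train on increasing fractions of the data and compare training vs. CV scores.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"{n:4d} samples: train={tr:.2f}, cv={va:.2f}")
# If the two curves converge at a low score, the model suffers from bias
# (underfitting); a persistent gap between them signals variance (overfitting),
# where more data may help.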

Recap

We have reached the end of part 3 of our series. In this part we were able to learn:

  1. Unsupervised Learning: Clustering with KMeans
  2. Dimensionality Reduction (Feature Elimination) with Principal Component Analysis (PCA)
  3. Feature Extraction, which we used to build a face detector
  4. Overfitting vs Underfitting
  5. Validation and Cross-Validation
  6. Bias-variance trade-off
  7. Validation Curves and Learning Curves

Throughout the previous parts (1, 2 and 3) of this tutorial, we covered the different aspects of supervised and unsupervised machine learning. In part 4 of our tutorial, we will start with Deep Learning.

Thanks for reading!
