<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Imad Dabbura on Medium]]></title>
        <description><![CDATA[Stories by Imad Dabbura on Medium]]></description>
        <link>https://medium.com/@ImadPhd?source=rss-793eeb87d3ee------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*rfYnAoCF6n9f8e-oACHu0Q.jpeg</url>
            <title>Stories by Imad Dabbura on Medium</title>
            <link>https://medium.com/@ImadPhd?source=rss-793eeb87d3ee------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:12:37 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ImadPhd/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Conda Essentials Notes]]></title>
            <link>https://medium.com/@ImadPhd/conda-essentials-notes-1771e79ad584?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/1771e79ad584</guid>
            <category><![CDATA[conda]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Mon, 18 Feb 2019 13:53:39 GMT</pubDate>
            <atom:updated>2022-09-27T12:16:01.402Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WbZdPVnERZYk16M0.png" /></figure><p><strong>Conda</strong> is an open-source package management system that works on all platforms. It is a tool that helps manage packages and environments for different programming languages. Developing a high-level understanding of how Conda works has helped me on so many levels, especially when it comes to managing environments and making my work more reproducible. Below are the notes that I wrote down during my journey of learning Conda, and I always refer back to them:</p><h3>General</h3><ul><li>Conda packages are files and executables that can in principle contain images, data, notebooks, files, etc.</li><li>Conda is mainly used in the Python ecosystem; however, it can be used with other languages such as R, Julia, Scala, etc.</li><li>When installing a package using Conda, it installs its dependencies with it. Also, Conda is able to figure out the platform you’re using without the need to specify the platform when installing packages.</li><li>When installing a package, Conda:</li></ul><ol><li>Checks the platform.</li><li>Checks the Python version.</li><li>Installs the latest version of the package that is compatible with Python.</li><li>If it has dependencies, installs the latest versions of the dependencies that are also compatible with each other.</li></ol><ul><li>Under semantic versioning, software is labeled with a three-part version identifier of the form MAJOR.MINOR.PATCH; the label components are non-negative integers separated by periods. Assuming all software starts at version 0.0.0, the MAJOR version number is increased when significant new functionality is introduced (often with corresponding API changes). Increases in the MINOR version number generally reflect improvements (e.g., new features) that avoid backward-incompatible API changes. For instance, adding an optional argument to a function API (in a way that allows old code to run unchanged) is a change worthy of increasing the MINOR version number. An increment to the PATCH version number is appropriate mostly for bug fixes that preserve the same MAJOR and MINOR revision numbers. Software patches do not typically introduce new features or change APIs at all (except sometimes to address security issues).</li><li>We can specify MAJOR, MAJOR.MINOR, or MAJOR.MINOR.PATCH when installing any package.</li><li>We can use logical operators to install specific versions of a package. Examples:</li></ul><ol><li>conda install &#39;python=3.6|3.7&#39;</li><li>conda install &#39;python=3.6|3.7*&#39;</li><li>conda install &#39;python&gt;=3.6, &lt;=3.7&#39;</li></ol><h3>Common Commands</h3><ul><li>To update a package, conda update pckg.</li><li>To uninstall a package, conda remove pckg.</li><li>To see which versions of a specific package are available, use conda search pckg.</li><li>conda list will list all installed packages.</li><li>conda list -n env-name will list all packages in the environment env-name.</li><li>conda list pckg will give information about pckg.</li><li>When installing a pckg without including a channel, it defaults to the main channel that is maintained by Anaconda Inc.</li><li>There are other channels to which people can upload their packages, and we can point to those channels when installing packages, such as fastai. We use conda install -c fastai fastai. Here the channel is fastai and the pckg is also fastai.</li><li>conda search -c conda-forge -c fastai --override-channels --platform osx-64 fastai means:</li><li>Search for fastai in two channels: conda-forge, fastai.</li><li>override-channels means do not go to the default main channel.</li><li>platform specifies which platform.</li><li>Sometimes we don’t know the channel of the pckg; we can use anaconda search pckg, which will return all the channels that host the pckg and their versions.</li><li>conda-forge, which is led by the community, is almost as good as the main channel. It has a lot more packages than the main channel.</li><li>There is no system that rates channels, so be careful when installing packages from any channel.</li><li>We can list all packages in a channel, e.g., conda search -c conda-forge --override-channels will list all packages for the conda-forge channel.</li></ul><h3>Environments</h3><ul><li>Environments are a good practice for documenting data science/software development work.</li><li>An environment is nothing more than a directory that contains all the packages, so that when trying to import them, it imports them from this directory only. We can use conda env list to see all the available environments on our machine.</li><li>To get the packages from a specific environment by name, use conda list -n env-name. Otherwise, we get the packages from the current environment.</li><li>To activate an environment, use conda activate env-name. To deactivate, conda deactivate.</li><li>Environments usually don’t take a lot of space.</li><li>We can remove environments using conda env remove -n env-name.</li><li>To create an environment, use conda create -n env-name. We can also add package names to install after creation, such as conda create -n env-name python=3.6* numpy&gt;=1.1.</li><li>To export an environment, use conda env export -n env-name. This will print the output to the terminal. We can also export to a file; for that, use conda env export -n env-name -f env-name.yml. The ‘.yml’ extension is strongly encouraged. Doing this will ensure that all the packages used can be installed by others exactly.</li><li>We can also create an environment from a .yml file using conda env create -f env-name.yml. Note that if we only use conda env create, it will look for a file that has the .yml extension and the same name as env-name in the current local directory. Moreover, we can write the .yml file ourselves, instead of exporting it, and specify only what is important in our environment, as sketched below.</li></ul>
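<p>For illustration, a minimal hand-written .yml file might look like the following (the environment name and the package pins here are hypothetical):</p><pre>name: env-name<br>channels:<br>  - defaults<br>dependencies:<br>  - python=3.6<br>  - numpy&gt;=1.1</pre><p>Running conda env create -f env-name.yml would then recreate the same environment on another machine.</p>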
<p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/conda-essentials/conda-essentials.html"><em>imaddabbura.github.io</em></a><em> on February 18, 2019.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks]]></title>
            <link>https://medium.com/data-science/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/aa03e644b48a</guid>
            <category><![CDATA[clustering]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Mon, 17 Sep 2018 17:39:22 GMT</pubDate>
            <atom:updated>2022-09-27T12:16:43.209Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bexl-xVqrOFffC6ydpkxnA.png" /></figure><h3>Clustering</h3><p><strong>Clustering</strong> is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.</p><p>Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. We’ll cover here clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other whether in terms of behaviors or attributes; image segmentation/compression, where we try to group similar regions together; document clustering based on topics; etc.</p><p>Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.</p><p>In this post, we will cover only <strong>Kmeans</strong>, which is considered one of the most used clustering algorithms due to its simplicity.</p><h3>Kmeans Algorithm</h3><p>The <strong>Kmeans</strong> algorithm is an iterative algorithm that tries to partition the dataset into <em>K</em> pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to <strong>only one group</strong>. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance between the data points and the cluster’s centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.</p><p>The way the kmeans algorithm works is as follows:</p><ol><li>Specify the number of clusters <em>K</em>.</li><li>Initialize centroids by first shuffling the dataset and then randomly selecting <em>K</em> data points for the centroids without replacement.</li><li>Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn’t changing.</li></ol><ul><li>Compute the sum of the squared distance between data points and all centroids.</li><li>Assign each data point to the closest cluster (centroid).</li><li>Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.</li></ul><p>The approach kmeans follows to solve the problem is called <strong>Expectation-Maximization</strong>. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).</p>
<p>The objective function is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/548/1*myXqNCTZH80uvO2QyU6F5Q.png" /></figure><p>where wik=1 for data point xi if it belongs to cluster <em>k</em>; otherwise, wik=0. Also, μk is the centroid of xi’s cluster.</p><p>It’s a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk as fixed. Then we minimize J w.r.t. μk and treat wik as fixed. Technically speaking, we differentiate J w.r.t. wik first and update the cluster assignments (<em>E-step</em>). Then we differentiate J w.r.t. μk and recompute the centroids after the cluster assignments from the previous step (<em>M-step</em>). Therefore, the E-step is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/783/1*dvqCl-vFbxQp7Lx-2RpnEA.png" /></figure><p>In other words, assign the data point xi to the closest cluster judged by its sum of squared distance from the cluster’s centroid.</p><p>And the M-step is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*fjTYV5rKOtMjCUsk3LIp_w.png" /></figure><p>Which translates to recomputing the centroid of each cluster to reflect the new assignments.</p><p>A few things to note here:</p><ul><li>Since clustering algorithms, including kmeans, use distance-based measurements to determine the similarity between data points, it’s recommended to standardize the data to have a mean of zero and a standard deviation of one, since almost always the features in any dataset would have different units of measurement, such as age vs. income.</li><li>Given kmeans’ iterative nature and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the kmeans algorithm may <em>get stuck in a local optimum and may not converge to the global optimum</em>. Therefore, it’s recommended to run the algorithm using different initializations of centroids and pick the results of the run that yielded the lowest sum of squared distance.</li><li>No change in the assignment of examples is the same thing as no change in within-cluster variation:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/506/1*zXm8f5juDf2mBO7odJpKkA.png" /></figure><h3>Implementation</h3><p>We’ll use a simple implementation of kmeans here just to illustrate some concepts (a sketch follows the embedded gist below). Then we will use the sklearn implementation, which is more efficient and takes care of many things for us.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4caa123d4fdc5c5085cefdaceb86bd89/href">https://medium.com/media/4caa123d4fdc5c5085cefdaceb86bd89/href</a></iframe>
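<p>The embedded gist above holds the full implementation. As a minimal sketch of the E-step/M-step loop in plain numpy (the function and variable names here are my own, and empty-cluster handling is omitted):</p><pre>import numpy as np<br><br>def kmeans(X, k, num_iters=100, seed=42):<br>    # X: (num_examples, num_features), ideally standardized beforehand<br>    rng = np.random.default_rng(seed)<br>    # initialize centroids by sampling k data points without replacement<br>    centroids = X[rng.choice(len(X), size=k, replace=False)]<br>    for _ in range(num_iters):<br>        # E-step: assign each point to the closest centroid<br>        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)<br>        labels = dists.argmin(axis=1)<br>        # M-step: recompute each centroid as the mean of its assigned points<br>        new_centroids = np.array(<br>            [X[labels == j].mean(axis=0) for j in range(k)])<br>        if np.allclose(new_centroids, centroids):<br>            break<br>        centroids = new_centroids<br>    return centroids, labels</pre>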
<h3>Applications</h3><p>The kmeans algorithm is very popular and used in a variety of applications such as market segmentation, document clustering, image segmentation and image compression, etc. The goal usually when we undergo a cluster analysis is either:</p><ol><li>Get a meaningful intuition about the structure of the data we’re dealing with.</li><li>Cluster-then-predict, where different models will be built for different subgroups if we believe there is a wide variation in the behaviors of different subgroups. An example of that is clustering patients into different subgroups and building a model for each subgroup to predict the risk of having a heart attack.</li></ol><p>In this post, we’ll apply clustering on two cases:</p><ul><li>Geyser eruptions segmentation (2D dataset).</li><li>Image compression.</li></ul><h3>Kmeans on Geyser’s Eruptions Segmentation</h3><p>We’ll first implement the kmeans algorithm on a 2D dataset and see how it works. The dataset has 272 observations and 2 features. The data covers the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. We will try to find <em>K</em> subgroups within the data points and group them accordingly. Below is the description of the features:</p><ul><li>eruptions (float): Eruption time in minutes.</li><li>waiting (int): Waiting time to next eruption.</li></ul><p>Let’s plot the data first:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8a96d5a87d4d170b0f0b528b1c3dd775/href">https://medium.com/media/8a96d5a87d4d170b0f0b528b1c3dd775/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/426/1*1VPCjrIGEZTzy_Y9i50SzQ.png" /></figure><p>We’ll use this data because it’s easy to plot and visually spot the clusters since it’s a 2-dimensional dataset. It’s obvious that we have 2 clusters. Let’s standardize the data first and run the kmeans algorithm on the standardized data with K=2.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9c653be608b0a4d7dadcb7662b401c24/href">https://medium.com/media/9c653be608b0a4d7dadcb7662b401c24/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/396/1*dLrogayyf5UaLXD_8cmzpw.png" /></figure><p>The above graph shows the scatter plot of the data colored by the cluster they belong to. In this example, we chose K=2. The symbol <strong>‘*’</strong> marks the centroid of each cluster. We can think of those 2 clusters as the geyser having different kinds of behavior under different scenarios.</p><p>Next, we’ll show that different initializations of centroids may yield different results. I’ll use 9 different random_state values to change the initialization of the centroids and plot the results. The title of each plot will be the sum of squared distance of each initialization.</p><p>As a side note, this dataset is considered very easy and converges in less than 10 iterations. Therefore, to see the effect of random initialization on convergence, I am going to go with 3 iterations to illustrate the concept. However, in real-world applications, datasets are not at all that clean and nice!</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f2b37473bdbba552b07ee8e2f49d3dfe/href">https://medium.com/media/f2b37473bdbba552b07ee8e2f49d3dfe/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*smb3nXFMihSmbJGO3kS0Ww.png" /></figure><p>As the graph above shows, we ended up with only two different clusterings, depending on the initialization. We would pick the one with the lowest sum of squared distance.</p><h3>Kmeans on Image Compression</h3><p>In this part, we’ll implement kmeans to compress an image. The image that we’ll be working on is 396 x 396 x 3. Therefore, for each pixel location we would have three 8-bit integers that specify the red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and represent (compress) the photo using those 30 colors only. To pick which colors to use, we’ll run the kmeans algorithm on the image and treat every pixel as a data point. That means reshaping the image from height x width x channels to (height * width) x channels, i.e. we would have 396 x 396 = 156,816 data points in 3-dimensional space, which are the RGB intensities. Doing so will allow us to represent the image using the 30 centroids as a lookup table for each pixel and would significantly reduce the size of the image, by a factor of roughly 5. The original image size was 396 x 396 x 24 = 3,763,584 bits; the new compressed image would be 30 x 24 + 396 x 396 x 5 = 784,800 bits, since 30 colors need a 5-bit index (2^5 = 32 ≥ 30). The big difference comes from the fact that we’ll be using the centroids as a lookup table for pixels’ colors, which reduces each pixel location to a 5-bit index instead of 24 bits.</p><p>From now on we will be using the sklearn implementation of kmeans. A few things to note here:</p><ul><li>n_init is the number of times kmeans is run with different centroid initializations. The result of the best run will be reported.</li><li>tol is the within-cluster variation metric used to declare convergence.</li><li>The default for init is <strong>k-means++</strong>, which is supposed to yield better results than just random initialization of centroids.</li></ul>
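<p>Putting this together, a sketch of the reshape-then-cluster step might look like the following (the image array img here is a hypothetical placeholder loaded elsewhere):</p><pre>import numpy as np<br>from sklearn.cluster import KMeans<br><br># img: hypothetical (396, 396, 3) uint8 array<br>h, w, c = img.shape<br>pixels = img.reshape(-1, c)          # 156,816 data points in RGB space<br>kmeans = KMeans(n_clusters=30, n_init=10, random_state=42).fit(pixels)<br># replace every pixel with its centroid: the 30-color lookup table<br>compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(h, w, c)<br>compressed = compressed.astype(np.uint8)</pre>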
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9442dd9df8aa9ff9bb88e81a7ee58987/href">https://medium.com/media/9442dd9df8aa9ff9bb88e81a7ee58987/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/856/1*JUzQ9Xq4dTpiO0D6Nj8glA.png" /></figure><p>We can see the comparison between the original image and the compressed one. The compressed image looks close to the original one, which means we’re able to retain the majority of the characteristics of the original image. With a smaller number of clusters, we would have a higher compression rate at the expense of image quality. As a side note, this image compression method is called <em>lossy data compression</em> because we can’t reconstruct the original image from the compressed image.</p><h3>Evaluation Methods</h3><p>Contrary to supervised learning, where we have the ground truth to evaluate the model’s performance, clustering analysis doesn’t have a solid evaluation metric that we can use to evaluate the outcome of different clustering algorithms. Moreover, since kmeans requires <em>k</em> as an input and doesn’t learn it from data, there is no right answer in terms of the number of clusters that we should have in any problem. Sometimes domain knowledge and intuition may help, but usually that is not the case. In the cluster-then-predict methodology, we can evaluate how well the models are performing based on different <em>K</em> clusters, since clusters are used in the downstream modeling.</p><p>In this post we’ll cover two metrics that may give us some intuition about <em>k</em>:</p><ul><li>Elbow method</li><li>Silhouette analysis</li></ul><h3>Elbow Method</h3><p>The <strong>elbow</strong> method gives us an idea of what a good <em>k</em> number of clusters would be, based on the sum of squared distance (SSE) between data points and their assigned clusters’ centroids. We pick <em>k</em> at the spot where SSE starts to flatten out, forming an elbow. We’ll use the geyser dataset and evaluate SSE for different values of <em>k</em>, and see where the curve might form an elbow and flatten out; see the sketch below.</p>
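<p>A sketch of that sweep, assuming a standardized feature matrix X_std (the variable name is illustrative):</p><pre>import matplotlib.pyplot as plt<br>from sklearn.cluster import KMeans<br><br># X_std: hypothetical standardized (272, 2) geyser feature matrix<br>sse = []<br>ks = range(1, 11)<br>for k in ks:<br>    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std)<br>    sse.append(km.inertia_)   # inertia_ is the SSE to the closest centroids<br>plt.plot(ks, sse, &quot;o-&quot;)<br>plt.xlabel(&quot;Number of clusters k&quot;)<br>plt.ylabel(&quot;SSE&quot;)</pre>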
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c887f0579495f48fccbca8dd745012aa/href">https://medium.com/media/c887f0579495f48fccbca8dd745012aa/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/433/1*9z8erk4kvsnxkfv-QhsHZg.png" /></figure><p>The graph above shows that k=2 is not a bad choice. Sometimes it’s still hard to figure out a good number of clusters to use because the curve is monotonically decreasing and may not show any elbow or have an obvious point where the curve starts flattening out.</p><h3>Silhouette Analysis</h3><p><strong>Silhouette analysis</strong> can be used to determine the degree of separation between clusters. For each sample:</p><ul><li>Compute the average distance from all data points in the same cluster (ai).</li><li>Compute the average distance from all data points in the closest cluster (bi).</li><li>Compute the coefficient:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/274/1*SAdv13fu4zgWRwRvGBrIWQ.png" /></figure><p>The coefficient can take values in the interval [-1, 1].</p><ul><li>If it is 0 –&gt; the sample is very close to the neighboring clusters.</li><li>If it is 1 –&gt; the sample is far away from the neighboring clusters.</li><li>If it is -1 –&gt; the sample is assigned to the wrong cluster.</li></ul><p>Therefore, we want the coefficients to be as big as possible and close to 1 to have good clusters. We’ll use the geyser dataset here again because it’s cheaper to run the silhouette analysis and it is actually obvious that there are most likely only two groups of data points.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/633aeea8968e4a1cb89b5ff5263849f4/href">https://medium.com/media/633aeea8968e4a1cb89b5ff5263849f4/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XahHUTFzTuuSthrJPTXZDg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jKvK1YIJAPbS7S9R2DDU_g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tPwe3liQmQlNV4-Qvkq-2g.png" /></figure><p>As the above plots show, n_clusters=2 has the best average silhouette score of around 0.75, and all clusters being above the average shows that it is actually a good choice. Also, the thickness of the silhouette plot gives an indication of how big each cluster is. The plot shows that cluster 1 has almost double the samples of cluster 2. However, as we increased n_clusters to 3 and 4, the average silhouette score decreased dramatically to around 0.48 and 0.39 respectively. Moreover, the thickness of the silhouette plots started showing wide fluctuations. The bottom line is: a good n_clusters will have an average silhouette score well above 0.5, as well as all of the clusters scoring higher than the average.</p>
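<p>A sketch of how such scores can be computed, again assuming a standardized matrix X_std:</p><pre>from sklearn.cluster import KMeans<br>from sklearn.metrics import silhouette_samples, silhouette_score<br><br>for k in (2, 3, 4):<br>    labels = KMeans(n_clusters=k, n_init=10,<br>                    random_state=42).fit_predict(X_std)<br>    avg = silhouette_score(X_std, labels)           # mean coefficient over all samples<br>    per_sample = silhouette_samples(X_std, labels)  # one coefficient per sample<br>    print(k, avg)</pre>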
<h3>Drawbacks</h3><p>The kmeans algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape. It always tries to construct a nice spherical shape around the centroid. That means that the minute the clusters have complicated geometric shapes, kmeans does a poor job of clustering the data. We’ll illustrate three cases where kmeans will not perform well.</p><p>First, the kmeans algorithm doesn’t let data points that are far away from each other share the same cluster, even though they obviously belong to the same cluster. Below is an example of data points on two different horizontal lines that illustrates how kmeans tries to group half of the data points of each horizontal line together.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/76f314e2337c4f64d96c472dae4b743f/href">https://medium.com/media/76f314e2337c4f64d96c472dae4b743f/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/357/1*IddfgypTilwxPjXvKwEVQA.png" /></figure><p>Kmeans considers the point ‘B’ closer to point ‘A’ than to point ‘C’ since the clusters have a non-spherical shape. Therefore, points ‘A’ and ‘B’ will be in the same cluster, but point ‘C’ will be in a different cluster. (Note that the <strong>Single Linkage</strong> hierarchical clustering method gets this right because it doesn’t separate similar points.)</p><p>Second, we’ll generate data from multivariate normal distributions with different means and standard deviations. So we would have 3 groups of data, where each group was generated from a different multivariate normal distribution (different mean/standard deviation). One group will have a lot more data points than the other two combined. Next, we’ll run kmeans on the data with K=3 and see if it will be able to cluster the data correctly. To make the comparison easier, I am going to plot the data first colored based on the distribution it came from. Then I will plot the same data, but now colored based on the clusters they have been assigned to.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3571b1dd578baaecdb37689b727afa0c/href">https://medium.com/media/3571b1dd578baaecdb37689b727afa0c/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/1*elZiRfdqFP_7XrN4zIAIWw.png" /></figure><p>It looks like kmeans couldn’t figure out the clusters correctly. Since it tries to minimize the within-cluster variation, it gives more weight to bigger clusters than smaller ones. In other words, data points in smaller clusters may be left far away from the centroid in order to focus more on the larger cluster.</p><p>Last, we’ll generate data that have complicated geometric shapes, such as moons and circles within each other, and test kmeans on both of the datasets.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d03437fe0bde2cec42f5d96ddbf81688/href">https://medium.com/media/d03437fe0bde2cec42f5d96ddbf81688/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lSQ7_9nFIcg6LBC6MmCppA.png" /></figure><p>As expected, kmeans couldn’t figure out the correct clusters for both datasets. However, we can help kmeans perfectly cluster these kinds of datasets if we use kernel methods. The idea is that we transform the data to a higher-dimensional representation that makes it linearly separable (the same idea that we use in SVMs). Different kinds of algorithms, such as SpectralClustering, work very well in such scenarios; see below:</p>
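<p>For instance, a minimal SpectralClustering run on a moons-style dataset (the parameter values here are illustrative) could look like:</p><pre>from sklearn.cluster import SpectralClustering<br>from sklearn.datasets import make_moons<br><br># two interleaved half-moons like the ones plotted above<br>X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)<br>sc = SpectralClustering(n_clusters=2, affinity=&quot;nearest_neighbors&quot;,<br>                        random_state=42)<br>labels = sc.fit_predict(X)</pre>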
<iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d6a899e13682c11d21719ec083ed774a/href">https://medium.com/media/d6a899e13682c11d21719ec083ed774a/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*agpTwuxQC8qL6oS6NUtSgA.png" /></figure><h3>Conclusion</h3><p>Kmeans clustering is one of the most popular clustering algorithms and usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. The goal of kmeans is to group data points into distinct non-overlapping subgroups. It does a very good job when the clusters have a kind of spherical shape. However, it suffers as the geometric shapes of the clusters deviate from spherical shapes. Moreover, it doesn’t learn the number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it’s good to know the assumptions behind algorithms/methods so that you have a pretty good idea about the strengths and weaknesses of each method. This will help you decide when to use each method and under what circumstances. In this post, we covered the strengths, weaknesses, and some evaluation methods related to kmeans.</p><p>Below are the main takeaways:</p><ul><li>Scale/standardize the data when applying the kmeans algorithm.</li><li>The elbow method for selecting the number of clusters doesn’t usually work because the error function is monotonically decreasing for all <em>k</em>s.</li><li>Kmeans gives more weight to the bigger clusters.</li><li>Kmeans assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are of different shapes, such as elliptical clusters.</li><li>If there is overlapping between clusters, kmeans doesn’t have an intrinsic measure of uncertainty for the examples that belong to the overlapping region in order to determine which cluster to assign each data point to.</li><li>Kmeans may still cluster the data even if it can’t be clustered, such as data that comes from <em>uniform distributions</em>.</li></ul><p>The notebook that created this post can be found <a href="https://github.com/ImadDabbura/blog-posts/blob/master/notebooks/Kmeans-Clustering.ipynb">here</a>.</p><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/clustering/Kmeans-Clustering.html"><em>imaddabbura.github.io</em></a><em> on September 17, 2018.</em></p><hr><p><a href="https://medium.com/data-science/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a">K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Coding Neural Network — Dropout]]></title>
            <link>https://medium.com/data-science/coding-neural-network-dropout-3095632d25ce?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/3095632d25ce</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Sun, 20 May 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:17:36.142Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/513/1*dEi_IkVB7IpkzZ-6H0Vpsg.png" /><figcaption><strong>Figure 1:</strong> Dropout</figcaption></figure><p><strong>Dropout</strong> is a regularization technique. On each iteration, we randomly shut down some neurons (units) on each layer and don’t use those neurons in either forward propagation or back-propagation. Since the units that are dropped out on each iteration are random, the learning algorithm has no idea which neurons will be shut down on any iteration; this forces the learning algorithm to spread out the weights and not focus on some specific features (units). Moreover, dropout helps improve generalization error in two ways:</p><ul><li>Since we drop some units on each iteration, this leads to a smaller network, which in turn means a simpler network (regularization).</li><li>It can be seen as an approximation to bagging techniques. Each iteration can be viewed as a different model since we’re randomly dropping different units on each layer. This means that the error would be the average of errors from all the different models (iterations). Therefore, averaging errors from different models, especially if those errors are uncorrelated, would reduce the overall error. In the worst case where errors are perfectly correlated, averaging among all models won’t help at all; however, we know that in practice errors have some degree of uncorrelation. As a result, it will generally improve generalization error.</li></ul><p>We can use different probabilities on each layer; however, the output layer always has keep_prob = 1, and the input layer has a high keep_prob such as 0.9 or 1. If a hidden layer has keep_prob = 0.8, this means that, on each iteration, each unit has an 80% probability of being included and a 20% probability of being dropped out, as the sketch below shows.</p>
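<p>As a quick sketch of this “inverted dropout” idea on a single layer’s activations (the array shape here is arbitrary):</p><pre>import numpy as np<br><br>A = np.random.rand(5, 4)                   # activations of some hidden layer<br>keep_prob = 0.8<br>D = np.random.rand(*A.shape) &lt; keep_prob   # keep each unit with probability 0.8<br>A = np.multiply(A, D)                      # shut down the dropped units<br>A /= keep_prob                             # rescale so the expected value is unchanged</pre>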
<p>Dropout is used a lot in computer vision problems because we have a lot of features and not a lot of data. Also, features (pixels) next to each other usually don’t add a lot of information. Therefore, models tend to suffer from overfitting.</p><p>To illustrate how dropout helps us reduce generalization error, we’ll use the same dataset we’ve used in the previous posts. The dataset has images of cats and non-cats. We’ll try to build a neural network to classify whether the image has a cat or not. Each image is 64 x 64 pixels on RGB scale. Let’s import the data and take a look at the shape as well as a sample of a cat image from the training set.</p><pre>import h5py<br>import numpy as np<br>import matplotlib.pyplot as plt<br><br># Import training data<br>train_dataset = h5py.File(&quot;../data/train_catvnoncat.h5&quot;)<br>X_train = np.array(train_dataset[&quot;train_set_x&quot;])<br>Y_train = np.array(train_dataset[&quot;train_set_y&quot;])<br><br># Plot a sample image<br>plt.imshow(X_train[50])<br>plt.axis(&quot;off&quot;);<br><br># Import test data<br>test_dataset = h5py.File(&quot;../data/test_catvnoncat.h5&quot;)<br>X_test = np.array(test_dataset[&quot;test_set_x&quot;])<br>Y_test = np.array(test_dataset[&quot;test_set_y&quot;])<br><br># Transform data: flatten each image into a column vector and rescale to [0, 1]<br>X_train = X_train.reshape(209, -1).T<br>X_train = X_train / 255<br>Y_train = Y_train.reshape(-1, 209)<br><br>X_test = X_test.reshape(50, -1).T<br>X_test = X_test / 255<br>Y_test = Y_test.reshape(-1, 50)<br><br># print the new shape of both training and test datasets<br>print(&quot;Training data dimensions:&quot;)<br>print(&quot;X&#39;s dimension: {}, Y&#39;s dimension: {}&quot;.format(X_train.shape, Y_train.shape))<br>print(&quot;Test data dimensions:&quot;)<br>print(&quot;X&#39;s dimension: {}, Y&#39;s dimension: {}&quot;.format(X_test.shape, Y_test.shape))</pre><pre>Training data dimensions:<br>X&#39;s dimension: (12288, 209), Y&#39;s dimension: (1, 209)<br>Test data dimensions:<br>X&#39;s dimension: (12288, 50), Y&#39;s dimension: (1, 50)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*n4kWtFMXmO9LgwFW0szo3A.png" /><figcaption><strong>Figure 2:</strong> Sample image</figcaption></figure><p>Now, we’ll write the functions needed to apply dropout in both forward propagation and back-propagation. Note that we’ll utilize the functions we wrote in previous posts, such as initialize_parameters.</p><pre>def drop_out_matrices(layers_dims, m, keep_prob):<br>    np.random.seed(1)<br>    D = {}<br>    L = len(layers_dims)<br><br>    for l in range(L):<br>        # initialize the random values for the dropout matrix<br>        D[str(l)] = np.random.rand(layers_dims[l], m)<br>        # Convert it to 0/1 to shut down the neurons corresponding to each element<br>        D[str(l)] = D[str(l)] &lt; keep_prob[l]<br>        assert(D[str(l)].shape == (layers_dims[l], m))<br>    return D<br><br><br>def L_model_forward(<br>   X, parameters, D, keep_prob, hidden_layers_activation_fn=&quot;relu&quot;):<br>    A = X                           # the input matrix is A0<br>    A = np.multiply(A, D[str(0)])<br>    A /= keep_prob[0]<br>    caches = []                     # initialize the caches list<br>    L = len(parameters) // 2        # number of layers in the network<br><br>    for l in range(1, L):<br>        A_prev = A<br>        A, cache = linear_activation_forward(<br>            A_prev, parameters[&quot;W&quot; + str(l)], parameters[&quot;b&quot; + str(l)],<br>            hidden_layers_activation_fn)<br>        # shut down some units<br>        A = np.multiply(A, D[str(l)])<br>        # scale the value of the kept units to keep the expected value the same<br>        A /= keep_prob[l]<br>        caches.append(cache)<br><br>    AL, cache = linear_activation_forward(<br>        A, parameters[&quot;W&quot; + str(L)], parameters[&quot;b&quot; + str(L)], &quot;sigmoid&quot;)<br>    AL = np.multiply(AL, D[str(L)])<br>    AL /= keep_prob[L]<br>    caches.append(cache)<br>    assert(AL.shape == (1, X.shape[1]))<br><br>    return AL, caches<br><br><br>
def L_model_backward(<br>   AL, Y, caches, D, keep_prob, hidden_layers_activation_fn=&quot;relu&quot;):<br>    Y = Y.reshape(AL.shape)<br>    L = len(caches)<br>    grads = {}<br><br>    # dA for the output layer<br>    dAL = np.divide(AL - Y, np.multiply(AL, 1 - AL))<br>    dAL = np.multiply(dAL, D[str(L)])<br>    dAL /= keep_prob[L]<br><br>    grads[&quot;dA&quot; + str(L - 1)], grads[&quot;dW&quot; + str(L)], grads[<br>        &quot;db&quot; + str(L)] = linear_activation_backward(<br>            dAL, caches[L - 1], &quot;sigmoid&quot;)<br>    grads[&quot;dA&quot; + str(L - 1)] = np.multiply(<br>        grads[&quot;dA&quot; + str(L - 1)], D[str(L - 1)])<br>    grads[&quot;dA&quot; + str(L - 1)] /= keep_prob[L - 1]<br><br>    for l in range(L - 1, 0, -1):<br>        current_cache = caches[l - 1]<br>        grads[&quot;dA&quot; + str(l - 1)], grads[&quot;dW&quot; + str(l)], grads[<br>            &quot;db&quot; + str(l)] = linear_activation_backward(<br>                grads[&quot;dA&quot; + str(l)], current_cache,<br>                hidden_layers_activation_fn)<br><br>        grads[&quot;dA&quot; + str(l - 1)] = np.multiply(<br>            grads[&quot;dA&quot; + str(l - 1)], D[str(l - 1)])<br>        grads[&quot;dA&quot; + str(l - 1)] /= keep_prob[l - 1]<br><br>    return grads<br><br><br>def model_with_dropout(<br>        X, Y, layers_dims, keep_prob, learning_rate=0.01, num_iterations=3000,<br>        print_cost=True, hidden_layers_activation_fn=&quot;relu&quot;):<br>    # get number of examples<br>    m = X.shape[1]<br><br>    # to get consistent output<br>    np.random.seed(1)<br><br>    # initialize parameters<br>    parameters = initialize_parameters(layers_dims)<br><br>    # initialize cost list<br>    cost_list = []<br><br>    # implement gradient descent<br>    for i in range(num_iterations):<br>        # Initialize dropout matrices<br>        D = drop_out_matrices(layers_dims, m, keep_prob)<br><br>        # compute forward propagation<br>        AL, caches = L_model_forward(<br>            X, parameters, D, keep_prob, hidden_layers_activation_fn)<br><br>        # compute cost<br>        cost = compute_cost(AL, Y)<br><br>        # compute gradients<br>        grads = L_model_backward(<br>            AL, Y, caches, D, keep_prob, hidden_layers_activation_fn)<br><br>        # update parameters<br>        parameters = update_parameters(parameters, grads, learning_rate)<br><br>        # print cost<br>        if (i + 1) % 100 == 0 and print_cost:<br>            print(f&quot;The cost after {i + 1} iterations : {cost:.4f}.&quot;)<br>        # append cost<br>        if i % 100 == 0:<br>            cost_list.append(cost)<br><br>    # plot the cost curve<br>    plt.plot(cost_list)<br>    plt.xlabel(&quot;Iteration (per hundreds)&quot;)<br>    plt.ylabel(&quot;Cost&quot;)<br>    plt.title(f&quot;Cost curve for the learning rate = {learning_rate}&quot;)<br><br>    return parameters</pre><p>Finally, we’re ready to build our neural network. First, we’ll build a fully connected network without dropout, that is, with keep_prob = 1. Next, we’ll build another network where keep_prob &lt; 1. Lastly, we’ll compare the generalization error of both networks and see how the dropout technique can help us improve our generalization error.</p>
<pre># set up layers&#39; dimensions, number of examples, and keep probabilities list<br>m = X_train.shape[0]<br>keep_prob = [1, 1, 1, 1]<br>layers_dims = [m, 10, 10, 1]<br><br># train NN with no dropout<br>parameters = model_with_dropout(X_train, Y_train, layers_dims,      keep_prob=keep_prob, learning_rate=0.03, num_iterations=1000, hidden_layers_activation_fn=&quot;relu&quot;)<br><br># print the training and test accuracy<br>print(&quot;The training accuracy rate: {}&quot;.format(accuracy(X_train, parameters, Y_train, &quot;relu&quot;)[-7:]))<br>print(&quot;The test accuracy rate: {}&quot;.format(accuracy(X_test, parameters, Y_test, &quot;relu&quot;)[-7:]))</pre><pre>The cost after 100 iterations : 0.6555.<br>The cost after 200 iterations : 0.6468.<br>The cost after 300 iterations : 0.6447.<br>The cost after 400 iterations : 0.6442.<br>The cost after 500 iterations : 0.6440.<br>The cost after 600 iterations : 0.6440.<br>The cost after 700 iterations : 0.6440.<br>The cost after 800 iterations : 0.6440.<br>The cost after 900 iterations : 0.6440.<br>The cost after 1000 iterations : 0.6440.<br>The training accuracy rate: 65.55%.<br>The test accuracy rate: 34.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/436/1*gzLgnh7Fmz3CiGagLW4Mmw.png" /><figcaption><strong>Figure 3:</strong> Cost curve with no dropout</figcaption></figure><pre># set up keep probabilities list<br>keep_prob = [1, 0.5, 0.5, 1]<br><br># train NN with dropout<br>parameters = model_with_dropout(X_train, Y_train, layers_dims,      keep_prob=keep_prob, learning_rate=0.03, num_iterations=1000, hidden_layers_activation_fn=&quot;relu&quot;)<br><br># print the training and test accuracy<br>print(&quot;The training accuracy rate: {}&quot;.format(accuracy(X_train, parameters, Y_train, &quot;relu&quot;)[-7:]))<br>print(&quot;The test accuracy rate: {}&quot;.format(accuracy(X_test, parameters, Y_test, &quot;relu&quot;)[-7:]))</pre><pre>The cost after 100 iterations : 0.6555.<br>The cost after 200 iterations : 0.6467.<br>The cost after 300 iterations : 0.6445.<br>The cost after 400 iterations : 0.6437.<br>The cost after 500 iterations : 0.6412.<br>The cost after 600 iterations : 0.6338.<br>The cost after 700 iterations : 0.6108.<br>The cost after 800 iterations : 0.5367.<br>The cost after 900 iterations : 0.4322.<br>The cost after 1000 iterations : 0.3114.<br>The training accuracy rate: 74.16%.<br>The test accuracy rate: 44.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/436/1*8ojJ-qcm86mpMXLgyA2wGg.png" /><figcaption><strong>Figure 4:</strong> Cost curve with dropout</figcaption></figure><p>As the results above show, the network with dropout improved the test accuracy rate by 10 percentage points (from 34% to 44%). Note that this is just an illustrative example to show the effectiveness of the dropout technique. We chose arbitrary probabilities in this example; however, we can tune the dropout probabilities on each layer to yield the best validation loss and accuracy.</p><h3>Conclusion</h3><p>Dropout is a very effective regularization technique that is used a lot in <em>Convolutional Neural Networks</em>. Below are some takeaways:</p><ul><li>Set keep_prob = 1 when using gradient checking; otherwise, it won’t work.</li><li>Dropout is used only during training. Don’t use it when testing/predicting new examples.</li><li>The lower the keep_prob → the simpler the neural network. 
As keep_prob decreases, the bias increases and the variance decreases. Therefore, layers with more neurons are expected to have a lower keep_prob to avoid overfitting.</li><li>It’s a computationally cheap way to improve generalization error and help resolve overfitting.</li><li>One can tune keep_prob to get the best results out of the task at hand.</li></ul><p>The source code that created this post can be found <a href="https://github.com/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Dropout.ipynb">here</a>. The post is inspired by the deeplearning.ai courses.</p><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/coding-nn/dropout/Coding-Neural-Network-Dropout.html"><em>imaddabbura.github.io</em></a><em> on May 20, 2018.</em></p><hr><p><a href="https://medium.com/data-science/coding-neural-network-dropout-3095632d25ce">Coding Neural Network — Dropout</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Coding Neural Network — Regularization]]></title>
            <link>https://medium.com/data-science/coding-neural-network-regularization-43d26655982d?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/43d26655982d</guid>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Tue, 08 May 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:18:39.555Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*i1qQHrAuhde1MoN7QxFi5Q.jpeg" /><figcaption><a href="https://www.pyimagesearch.com/2016/09/19/understanding-regularization-for-image-classification-and-machine-learning/">Source</a></figcaption></figure><h3>Bias-Variance Trade-off</h3><p>Generalization (test) error is the most important metric in Machine/Deep Learning. It gives us an estimate of the performance of the model on unseen data. Test error is decomposed into 3 parts (see figure 1): <strong>Variance, Squared Bias, and Irreducible Error</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/492/1*kADA5Q4al9DRLoXck6_6Xw.png" /><figcaption><strong>Figure 1:</strong> Bias and variance as a function of model complexity (flexibility). <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Source</a></figcaption></figure><p>Models with high bias are not complex enough (too simple) for the data and tend to underfit. The simplest model is taking the average (or mode) of the target variable and assigning it to all predictions. On the contrary, models with high variance overfit the training data by closely following (mimicking) it, so the learning algorithm follows the signal as well as the noise. Note that as the complexity (flexibility) of the model increases ⟹ the model becomes less interpretable, as with Neural Networks. Below is the bias-variance decomposition:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qoMG5LR_Us3S9UxOemjzAA.png" /></figure><p>Where:</p><ul><li>var(ε): Irreducible error that results from omitted features and unmeasured variation within each example.</li><li>Bias(<em>ŷ</em>): Error that is introduced by approximating a real-life problem with a simple model.</li><li>var(<em>ŷ</em>): The amount by which <em>ŷ</em> would change if we estimated it using a different data set.</li></ul><p>Therefore, we can control only the variance and the bias of <em>ŷ</em> <strong>BUT NOT</strong> the irreducible error. As a result, our job is to try to estimate the right level of complexity to achieve the lowest test error.</p><h3>Regularization</h3><p>Regularization adds stability to the learning algorithm by making it less sensitive to the training data and the training process. Since we don’t know, and have no access to, the true function that we could compare our estimated function with, the best strategy is to build a very complex model that fits the training data really well (overfitting) and regularize it so that it has a good generalization (test) error. When using regularization, we try to reduce the generalization error, and that may increase the training error in the process, which is okay because what we care about is how well the model generalizes. With regularization, we try to bring the very complex model that suffers from overfitting back to a good model by increasing bias and reducing variance. This builds on the assumption that a complex model has large parameters and a simple model has small parameters.</p><p>Below are some methods used for regularization:</p><ol><li><strong>L2 Parameter Regularization</strong>: It’s also known as <strong>weight decay</strong>. This method adds an L2 norm penalty to the objective function to drive the weights towards the origin. 
Even though this method shrinks all weights by the same proportion towards zero, it will never make any weight exactly zero.</li><li><strong>L1 Parameter Regularization (Lasso)</strong>: It can be seen as a feature selection method because, in contrast to L2 regularization, some weights will actually be zero. It shrinks all weights by the same amount by adding an L1 norm penalty to the objective function.</li><li><strong>Dropout</strong>: Dropout can be seen as an approximation to bagging techniques. On each iteration, we randomly shut down some neurons on each layer and don’t use those neurons in either forward propagation or back-propagation. This forces the neural network to spread out the weights and not focus on specific neurons, because it never knows which neurons will show up on each iteration. Therefore, it can be seen as training a different model on each iteration. Also, since we drop some neurons on each iteration, this leads to a smaller network, which in turn means a simpler network.</li><li><strong>Augmentation</strong>: Add fake data by using the training examples and adding distortions to them, such as rescaling and rotating the images in the case of image recognition. The idea here is that it’s always better to train the model on more data to achieve better performance. Note that augmented examples don’t add as much information to the model as independent examples do, but they are still a valid alternative when collecting more data is not feasible.</li><li><strong>Early Stopping</strong>: This method tries to optimize the cost function and regularize it so that it has a lower generalization error. The way it works is that on each iteration we record the validation error. If the validation error improves, we store a copy of the parameters, and we continue until the optimization algorithm terminates. It’s a good method if computational time and resources are an issue for us.</li></ol><p>In this post, we’ll cover L2 parameter regularization.</p><h3>L2 Parameter Regularization</h3><p>We normally regularize the weights only and don’t regularize the bias. We can use the Hessian matrix and its eigenvalues and eigenvectors to see the sensitivity of the weights to the weight decay. The weight <em>wi</em> will be rescaled by a factor of λi/(λi + <em>α</em>), where λi (eigenvalue) measures the sensitivity of the Hessian matrix in that direction (eigenvector) and <em>α</em> is the regularization hyperparameter. Therefore:</p><ul><li>If λi ≫ <em>α</em>, the cost function is very sensitive in that direction and the corresponding weight reduces the cost significantly ⟹ don’t decay (shrink) it much.</li><li>If λi ≪ <em>α</em>, the cost function is not sensitive in that direction and the corresponding weight doesn’t reduce the cost significantly ⟹ decay (shrink) it away towards zero.</li></ul><p>The objective function (binary cross-entropy) would then change from:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_tKyxIjgN9HTLgO9Vu7X-Q.png" /></figure><p>To:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-U-7Q1YLouuEurbRpYZHew.png" /></figure><p>Also, the new gradients and the update equation would be:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*1fFVBRa2ekfjXqAzGGcSgA.png" /></figure><p>Note that here <em>α</em> is the learning rate and λ is the regularization hyperparameter.</p>
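<p>As a rough numpy-style sketch of that update rule for a single weight matrix, assuming the penalty added to the cost is (λ/2m)·‖W‖² (the variable names here are illustrative):</p><pre># W: weight matrix; dW: gradient of the unregularized cost w.r.t. W<br># lambd: λ; lr: the learning rate α; m: number of examples<br>def update_weights_l2(W, dW, lambd, lr, m):<br>    dW_reg = dW + (lambd / m) * W    # extra term contributed by the L2 penalty<br>    return W - lr * dW_reg           # same as decaying W, then a plain gradient step</pre>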
<p>As λ increases, the bias increases (and the model becomes less flexible), with the following extreme cases (see figure 2):</p><ul><li>λ = 0: no regularization.</li><li>λ → ∞: the model becomes very simple, where all weights are essentially zero. In the case of regression, we would end up with only the intercept, which is equal to the average of the target variable.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pLEzcztd6pyCyGCt.png" /><figcaption><strong>Figure 2:</strong> Model complexity (underfitting/overfitting) as a function of regularization parameter λ</figcaption></figure><p>It may sometimes be helpful to see how L2 parameter regularization works using the normal equation. The normal equation is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rh7GA0ypDUfusOPKypRX_A.png" /></figure><p>This means that:</p><ul><li>Adding λ to the variance would decrease the weight, since</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/208/1*v59Gjr5O0vHmun-mn9lFqw.png" /></figure><ul><li>Even if <em>X^TX</em> is not invertible, adding λ to each feature will make it a full-rank matrix ⟹ invertible.</li></ul><p>To illustrate how regularization helps us reduce generalization error, we’ll use the cats_vs_dogs dataset. The dataset has images of cats and dogs. We’ll try to build a neural network to classify whether the image has a cat or a dog. Each image is 64 x 64 pixels on RGB scale.</p><p>We’ll be using the functions we wrote in the <a href="https://nbviewer.jupyter.org/github/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Forwad-Back-Propagation.ipynb"><em>“Coding Neural Network — Forward Propagation and Backpropagation”</em></a> post to initialize parameters, compute forward propagation, cross-entropy cost, gradients, etc.</p><p>Let’s import the data and take a look at the shape as well as a sample of a cat image from the training set.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/24e345a84eb9e7ed6e311fcb79599cd6/href">https://medium.com/media/24e345a84eb9e7ed6e311fcb79599cd6/href</a></iframe><pre>Training data dimensions:<br>X&#39;s dimension: (12288, 209), Y&#39;s dimension: (1, 209)<br>Test data dimensions:<br>X&#39;s dimension: (12288, 50), Y&#39;s dimension: (1, 50)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/0*iGQFYV9JuZSikAAt.png" /><figcaption><strong>Figure 3:</strong> Sample image</figcaption></figure><p>The training set has 209 examples and the test set has 50 examples. Let’s first write all the helper functions that will help us write the multi-layer neural network.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8fd2b84f7df0df6f87c00f5ba06c1971/href">https://medium.com/media/8fd2b84f7df0df6f87c00f5ba06c1971/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c37b05108484b745d667bde9388c7a55/href">https://medium.com/media/c37b05108484b745d667bde9388c7a55/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/41348499a6c18a49c03529933f416c85/href">https://medium.com/media/41348499a6c18a49c03529933f416c85/href</a></iframe><p>Now we’re ready to train the neural network. We’ll first build a neural network with no regularization, and then one with regularization, to see which one has a lower generalization error. 
<p>Now we’re ready to train the neural network. We’ll first build a neural network with no regularization and then one with regularization to see which one has a lower generalization error. Note that λ should be tuned to get the best results, but here we’ll choose an arbitrary value to illustrate the concept. Both neural networks would have 2 hidden layers where each hidden layer has 5 units.</p><pre># set up layers dimensions<br>layers_dims = [X_train.shape[0], 5, 5, 1]</pre><pre># train NN<br>parameters = model_with_regularization(X_train, Y_train, layers_dims, learning_rate=0.03, num_epochs=2500, print_cost=True, hidden_layers_activation_fn=&quot;tanh&quot;, lambd=0)</pre><pre>print(&quot;The training accuracy rate: {}&quot;.format(accuracy(X_train, parameters, Y_train, &quot;tanh&quot;)[-7:]))</pre><pre>print(&quot;The test accuracy rate: {}&quot;.format(accuracy(X_test, parameters, Y_test, &quot;tanh&quot;)[-7:]))</pre><pre>The cost after 100 iterations: 0.6555634398145331<br>The cost after 200 iterations: 0.6467746423961933<br>The cost after 300 iterations: 0.6446638811282552<br>The cost after 400 iterations: 0.6441400737542232<br>The cost after 500 iterations: 0.6440063101787575<br>The cost after 600 iterations: 0.6439697872317176<br>The cost after 700 iterations: 0.6439570623358253<br>The cost after 800 iterations: 0.6439491872993496<br>The cost after 900 iterations: 0.6439407592837082<br>The cost after 1000 iterations: 0.6439294591543208<br>The cost after 1100 iterations: 0.6439131091764411<br>The cost after 1200 iterations: 0.6438883396380859<br>The cost after 1300 iterations: 0.6438489715870495<br>The cost after 1400 iterations: 0.6437825798034876<br>The cost after 1500 iterations: 0.6436617691190204<br>The cost after 1600 iterations: 0.6434191397054715<br>The cost after 1700 iterations: 0.642864008138056<br>The cost after 1800 iterations: 0.6413476000796884<br>The cost after 1900 iterations: 0.6360827945885947<br>The cost after 2000 iterations: 0.6124050450908987<br>The cost after 2100 iterations: 0.511236045905345<br>The cost after 2200 iterations: 0.5287658028657057<br>The cost after 2300 iterations: 0.43124104856359174<br>The cost after 2400 iterations: 0.38213869447364884<br>The cost after 2500 iterations: 0.3386708692392079</pre><pre>The training accuracy rate: 82.30%.<br>The test accuracy rate: 78.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/440/0*22Pwoy1NZMXh0EqP.png" /><figcaption><strong>Figure 4:</strong> Cost curve with no regularization</figcaption></figure><p>The training accuracy is 82.30% but the test accuracy is 78%. The difference between training and test accuracy is not that big, i.e. we don’t have a lot of overfitting. Therefore, a little bit of regularization may help, such as λ = 0.02. 
Values of λ that practitioners often recommend trying are: 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24.</p><pre># train NN with regularization<br>parameters = model_with_regularization(X_train, Y_train, layers_dims, learning_rate=0.03, num_epochs=2500, print_cost=True, hidden_layers_activation_fn=&quot;tanh&quot;, lambd=0.02)<br><br>print(&quot;The training accuracy rate: {}&quot;.format(accuracy(X_train, parameters, Y_train, &quot;tanh&quot;)[-7:]))</pre><pre>print(&quot;The test accuracy rate: {}&quot;.format(accuracy(X_test, parameters, Y_test, &quot;tanh&quot;)[-7:]))</pre><pre>The cost after 100 iterations: 0.6558634554205135<br>The cost after 200 iterations: 0.6470807090618383<br>The cost after 300 iterations: 0.6449737235917311<br>The cost after 400 iterations: 0.6444519406797673<br>The cost after 500 iterations: 0.6443191828114609<br>The cost after 600 iterations: 0.6442831256251426<br>The cost after 700 iterations: 0.6442705985766486<br>The cost after 800 iterations: 0.6442628048800636<br>The cost after 900 iterations: 0.6442544325786784<br>The cost after 1000 iterations: 0.6442432311807257<br>The cost after 1100 iterations: 0.6442270988055475<br>The cost after 1200 iterations: 0.6442027847231018<br>The cost after 1300 iterations: 0.6441643410411311<br>The cost after 1400 iterations: 0.6440998547029029<br>The cost after 1500 iterations: 0.6439832000181198<br>The cost after 1600 iterations: 0.6437505375793907<br>The cost after 1700 iterations: 0.6432228625403317<br>The cost after 1800 iterations: 0.6417982979158361<br>The cost after 1900 iterations: 0.6369273437378263<br>The cost after 2000 iterations: 0.6152774362019153<br>The cost after 2100 iterations: 0.5207828651496548<br>The cost after 2200 iterations: 0.5145012356446598<br>The cost after 2300 iterations: 0.40757220705507585<br>The cost after 2400 iterations: 0.517757346098386<br>The cost after 2500 iterations: 0.4574831239241244</pre><pre>The training accuracy rate: 65.55%.<br>The test accuracy rate: 80.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/440/0*urQNfAysTf94LIES.png" /><figcaption><strong>Figure 5:</strong> Cost curve with regularization</figcaption></figure><p>As the results above show, we improved the generalization error by increasing the test accuracy from 78% to 80%. On the other hand, the training accuracy decreased from 82.30% to 65.55%.</p><h3>Conclusion</h3><p>Regularization is an effective technique to resolve overfitting. Since we don’t know the true distribution of the data, the empirical risk, which is based on the empirical distribution, is prone to overfitting. Therefore, the best strategy is to fit the training data really well and then use a regularization technique so that the model generalizes well. L2 parameter regularization and Dropout are two of the most widely used regularization techniques in machine learning.</p><ul><li>One of the implicit assumptions of regularization techniques such as L2 and L1 parameter regularization is that the value of the parameters should be close to zero, so they try to shrink all parameters towards zero. The goal is to avoid following the training data so closely that the learning algorithm picks up noise that is not helpful when applied to unseen data.</li><li>The value of λ should be tuned to get the best generalization error. 
We typically use a validation set when comparing models with different values of λ and pick the one with the lowest validation error.</li><li>Only use regularization if the model suffers from overfitting, i.e. training error ≪ validation error.</li><li>If after using regularization the validation error is still high, then we’re most likely in the underfitting region. In other words, our model is still too simple and already has high bias. Therefore, add complexity to the model and then use regularization.</li><li>Since the majority of tasks we try to solve don’t have enough data (or it’s expensive to collect more data), overfitting will be more prevalent in Deep Learning than underfitting, given the complexity of neural networks.</li></ul><p>The source code that created this post can be found <a href="https://github.com/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Regularization.ipynb">here</a>. The post is inspired by deeplearning.ai courses.</p><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/coding-nn/regularization/Coding-Neural-Network-Regularization.html"><em>imaddabbura.github.io</em></a><em> on May 8, 2018.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=43d26655982d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/coding-neural-network-regularization-43d26655982d">Coding Neural Network — Regularization</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Coding Neural Network — Parameters’ Initialization]]></title>
            <link>https://medium.com/data-science/coding-neural-network-parameters-initialization-f7c2d770e874?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/f7c2d770e874</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Fri, 20 Apr 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:21:05.969Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LfeJa5FDSKne0a_Q.png" /></figure><p>Optimization, in Machine Learning/Deep Learning contexts, is the process of changing the model’s parameters to improve its performance. In other words, it’s the process of finding the best parameters in the predefined hypothesis space to get the best possible performance. There are three kinds of optimization algorithms:</p><ul><li>Optimization algorithms that are not iterative and simply solve for one point.</li><li>Optimization algorithms that are iterative in nature and converge to an acceptable solution regardless of the parameters’ initialization, such as gradient descent applied to logistic regression.</li><li>Optimization algorithms that are iterative in nature and applied to a set of problems that have non-convex loss functions, such as neural networks. Therefore, parameters’ initialization plays a critical role in speeding up convergence and achieving lower error rates.</li></ul><p>In this post, we’ll look at three different cases of parameters’ initialization and see how this affects the error rate:</p><ol><li>Initialize all parameters to zero.</li><li>Initialize parameters to random values from a standard normal distribution or uniform distribution and multiply them by a scalar such as 10.</li><li>Initialize parameters based on:</li></ol><ul><li>Xavier recommendation.</li><li>Kaiming He recommendation.</li></ul><p>We’ll be using functions we wrote in <a href="https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76"><em>“Coding Neural Network — Forward Propagation and Backpropagation”</em></a> post to initialize parameters, compute forward propagation and back-propagation as well as the cross-entropy cost.</p><p>To illustrate the above cases, we’ll use the cats vs dogs dataset, which consists of 50 images for cats and 50 images for dogs. Each image is 150 pixels x 150 pixels on RGB color scale. 
Therefore, we would have 67,500 features where each column in the input matrix would be one image, which means our input data would have dimensions 67,500 x 100.</p><p>Let’s first load the data and show a sample of two images before we start writing the helper functions.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/4c3df18aac8cb1534267ae7703d3cfd2/href">https://medium.com/media/4c3df18aac8cb1534267ae7703d3cfd2/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/416/0*kl3KynxAtsGgmXuw.png" /><figcaption><strong>Figure 1:</strong> Sample images</figcaption></figure><p>We’ll now write all the helper functions that will help us initialize parameters based on different methods, as well as the L-layer model that we’ll be using to train our neural network.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0025e6381037900c79774e52b6c2048e/href">https://medium.com/media/0025e6381037900c79774e52b6c2048e/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/065a1c79a8c5d1fa7e7d351a2e3a57de/href">https://medium.com/media/065a1c79a8c5d1fa7e7d351a2e3a57de/href</a></iframe><h3>Initializing all parameters to zero</h3><p>Here, we’ll initialize all weight matrices and biases to zeros and see how this would affect the error rate as well as the learned parameters.</p><pre># train NN with zeros initialization parameters<br>layers_dims = [X.shape[0], 5, 5, 1]<br>parameters = model(X, Y, layers_dims, hidden_layers_activation_fn=&quot;tanh&quot;, initialization_method=&quot;zeros&quot;)<br><br>accuracy(X, parameters, Y, &quot;tanh&quot;)</pre><pre>The cost after 100 iterations is: 0.6931471805599453<br>The cost after 200 iterations is: 0.6931471805599453<br>The cost after 300 iterations is: 0.6931471805599453<br>The cost after 400 iterations is: 0.6931471805599453<br>The cost after 500 iterations is: 0.6931471805599453<br>The cost after 600 iterations is: 0.6931471805599453<br>The cost after 700 iterations is: 0.6931471805599453<br>The cost after 800 iterations is: 0.6931471805599453<br>The cost after 900 iterations is: 0.6931471805599453<br>The cost after 1000 iterations is: 0.6931471805599453</pre><pre>The accuracy rate is: 50.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/0*f1-5N730cjTpc0kT.png" /><figcaption><strong>Figure 2:</strong> Cost curve using zero initialization method</figcaption></figure><p>As the cost curve shows, the neural network didn’t learn anything! That is because of the symmetry between all neurons, which leads to all neurons having the same update on every iteration. Therefore, regardless of how many iterations we run the optimization algorithm, all the neurons would still get the same update and no learning would happen. As a result, we must <strong>break symmetry</strong> when initializing parameters so that the model starts learning on each update of gradient descent.</p><h3>Initializing parameters with big random values</h3><p>There is no big difference whether the random values are initialized from a standard normal distribution or a uniform distribution, so we’ll use the standard normal distribution in our examples. Also, we’ll multiply the random values by a big number such as 10 to show that initializing parameters to big values may cause our optimization to have higher error rates (and even diverge in some cases). 
Let’s now train our neural network where all weight matrices have been initialized using the following formula: np.random.randn() * 10</p><pre># train NN with random initialization parameters<br>layers_dims = [X.shape[0], 5, 5, 1]<br>parameters = model(X, Y, layers_dims, hidden_layers_activation_fn=&quot;tanh&quot;, initialization_method=&quot;random&quot;)<br><br>accuracy(X, parameters, Y, &quot;tanh&quot;)</pre><pre>The cost after 100 iterations is: 1.2413142077549013<br>The cost after 200 iterations is: 1.1258751902393416<br>The cost after 300 iterations is: 1.0989052435267657<br>The cost after 400 iterations is: 1.0840966471282327<br>The cost after 500 iterations is: 1.0706953292105978<br>The cost after 600 iterations is: 1.0574847320236294<br>The cost after 700 iterations is: 1.0443168708889223<br>The cost after 800 iterations is: 1.031157857251139<br>The cost after 900 iterations is: 1.0179838815204902<br>The cost after 1000 iterations is: 1.004767088515343</pre><pre>The accuracy rate is: 55.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/0*zmKY_2ncqivLaYzX.png" /><figcaption><strong>Figure 3:</strong> Cost curve using random initialization method</figcaption></figure><p>Random initialization here is helping, but the loss function still has a high value and may take a long time to converge and achieve a significantly low value.</p><h3>Initializing parameters based on He and Xavier recommendations</h3><p>We’ll explore two initialization methods:</p><ul><li>The Kaiming He method is best applied when the activation function used on the hidden layers is the Rectified Linear Unit (ReLU), so that the weights on each hidden layer would have the following variance: var(W^l) = 2/n^(l-1). We can achieve this by multiplying the random values from the standard normal distribution by</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/546/1*IcXsQJ1LZk92YJTSkOe4Wg.png" /></figure><ul><li>The Xavier method is best applied when the activation function used on the hidden layers is the Hyperbolic Tangent, so that the weights on each hidden layer would have the following variance: var(W^l) = 1/n^(l-1). We can achieve this by multiplying the random values from the standard normal distribution by</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*MUMqFci5xMJHXHXkA7MJRg.png" /></figure>
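<p>To make the two scalings concrete alongside the zeros and big-random cases from earlier, here is a minimal sketch of an initializer that supports all four methods. It is a simplified version of what the helper functions above implement, and the names are illustrative:</p><pre>import numpy as np

def initialize_parameters(layers_dims, method="he"):
    np.random.seed(1)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        shape = (layers_dims[l], layers_dims[l - 1])
        if method == "zeros":
            W = np.zeros(shape)
        elif method == "random":
            # big random values: standard normal scaled by 10
            W = np.random.randn(*shape) * 10
        elif method == "he":
            # var(W^l) = 2/n^(l-1), recommended for ReLU
            W = np.random.randn(*shape) * np.sqrt(2 / layers_dims[l - 1])
        elif method == "xavier":
            # var(W^l) = 1/n^(l-1), recommended for tanh
            W = np.random.randn(*shape) * np.sqrt(1 / layers_dims[l - 1])
        parameters["W" + str(l)] = W
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters</pre>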
<p>We’ll train the network using both methods and look at the results.</p><pre># train NN where all parameters were initialized based on He recommendation<br>layers_dims = [X.shape[0], 5, 5, 1]<br>parameters = model(X, Y, layers_dims, hidden_layers_activation_fn=&quot;tanh&quot;, initialization_method=&quot;he&quot;)</pre><pre>accuracy(X, parameters, Y, &quot;tanh&quot;)</pre><pre>The cost after 100 iterations is: 0.6300611704834093<br>The cost after 200 iterations is: 0.49092836452522753<br>The cost after 300 iterations is: 0.46579423512433943<br>The cost after 400 iterations is: 0.6516254192289226<br>The cost after 500 iterations is: 0.32487779301799485<br>The cost after 600 iterations is: 0.4631461605716059<br>The cost after 700 iterations is: 0.8050310690163623<br>The cost after 800 iterations is: 0.31739195517372376<br>The cost after 900 iterations is: 0.3094592175030812<br>The cost after 1000 iterations is: 0.19934509244449203</pre><pre>The accuracy rate is: 99.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/0*tU_r_1h3fwqk4xze.png" /><figcaption><strong>Figure 4:</strong> Cost curve using He initialization method</figcaption></figure><pre># train NN where all parameters were initialized based on Xavier recommendation<br>layers_dims = [X.shape[0], 5, 5, 1]<br>parameters = model(X, Y, layers_dims, hidden_layers_activation_fn=&quot;tanh&quot;, initialization_method=&quot;xavier&quot;)</pre><pre>accuracy(X, parameters, Y, &quot;tanh&quot;)</pre><pre>The cost after 100 iterations is: 0.6351961521800779<br>The cost after 200 iterations is: 0.548973489787121<br>The cost after 300 iterations is: 0.47982386652748565<br>The cost after 400 iterations is: 0.32811768889968684<br>The cost after 500 iterations is: 0.2793453045790634<br>The cost after 600 iterations is: 0.3258507563809604<br>The cost after 700 iterations is: 0.2873032724176074<br>The cost after 800 iterations is: 0.0924974839405706<br>The cost after 900 iterations is: 0.07418011931058155<br>The cost after 1000 iterations is: 0.06204402572328295</pre><pre>The accuracy rate is: 99.00%.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/0*3U8LegG1wneLtMP9.png" /><figcaption><strong>Figure 5:</strong> Cost curve using Xavier initialization method</figcaption></figure><p>As shown from applying the four methods, the parameters’ initial values play a huge role in achieving low cost values as well as converging and achieving lower training error rates. The same would apply to the test error rate if we had test data.</p><h3>Conclusion</h3><p>Deep Learning frameworks make it easy to choose between different initialization methods without worrying about implementing them ourselves. Nonetheless, it’s important to understand the critical role the initial values of the parameters play in the overall performance of the network. 
Below are some key takeaways:</p><ul><li>Well-chosen initialization values of the parameters lead to:</li></ul><ol><li>Speeding up the convergence of gradient descent.</li><li>Increasing the likelihood of gradient descent finding lower training and generalization error rates.</li></ol><ul><li>Because we’re dealing with iterative optimization algorithms and non-convex loss functions, different initializations lead to different results.</li><li>Random initialization is used to break symmetry and make sure different hidden units can learn different things.</li><li>Don’t initialize to values that are too large.</li><li>Kaiming He (He) initialization works well for neural networks with the ReLU activation function.</li><li>Xavier initialization works well for neural networks with the Hyperbolic Tangent activation function.</li></ul><p>The source code that created this post can be found <a href="https://github.com/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Parameters-Initialization.ipynb">here</a>.</p><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/coding-nn/param-initialization/Coding-Neural-Network-Parameters-Initialization.html"><em>imaddabbura.github.io</em></a><em> on April 20, 2018.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f7c2d770e874" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/coding-neural-network-parameters-initialization-f7c2d770e874">Coding Neural Network — Parameters’ Initialization</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Coding Neural Network — Gradient Checking]]></title>
            <link>https://medium.com/data-science/coding-neural-network-gradient-checking-5222544ccc64?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/5222544ccc64</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Sun, 08 Apr 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:22:06.878Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*D-_BDtYmueIITNt9iEfTOA.png" /></figure><p>In the previous post,<a href="https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76"> Coding Neural Network — Forward Propagation and Backpropagation</a>, we implemented both forward propagation and backpropagation in numpy. However, implementing backpropagation from scratch is usually more prone to bugs/errors. Therefore, before running the neural network on training data, it’s necessary to check that our implementation of backpropagation is correct. Before we start, let’s revisit what back-propagation is: we loop over the nodes in reverse topological order, starting at the final node, to compute the derivative of the cost with respect to each edge’s node tail. In other words, we compute the derivative of the cost function with respect to all parameters, i.e. ∂J/∂θ where θ represents the parameters of the model.</p><p>The way to test our implementation is by computing numerical gradients and comparing them with the gradients from backpropagation (analytical). There are two ways of computing numerical gradients:</p><ul><li>Right-hand form:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*NXBG9YMx2vullfvBgi_NAw.png" /></figure><ul><li>Two-sided form (see figure 1):</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/738/1*0u5WeZaYu1oDikwstpw3Bw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RyRy2pd7UcpynLvR.png" /><figcaption><strong>Figure 1:</strong> Two-sided numerical gradients</figcaption></figure><p>The two-sided form of approximating the derivative is closer to the true derivative than the right-hand form. Let’s illustrate that with the following example using the function <em>f(x) = x²</em> by taking its derivative at <em>x = 3</em>.</p><ul><li>Analytical derivative:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/321/1*r0HbB20Nn_IDz7EEFMGYIg.png" /></figure><ul><li>Two-sided numerical derivative:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/1*tn6tiYqYG6M8hUMgzjEwhg.png" /></figure><ul><li>Right-hand numerical derivative:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*vk4NYEFJIF6MrxfeGuuryQ.png" /></figure><p>As we see above, the difference between the analytical derivative and the two-sided numerical gradient is almost zero; however, the difference between the analytical derivative and the right-sided derivative is 0.01. Therefore, we’ll use the two-sided epsilon method to compute the numerical gradients.</p><p>In addition, we’ll normalize the difference between the numerical gradients and the analytical gradients using the following formula: difference = ‖grad − grad_approx‖₂ / (‖grad‖₂ + ‖grad_approx‖₂). If the difference is ≤ 10e-7, then our implementation is fine; otherwise, we have a mistake somewhere and have to go back and revisit the backpropagation code.</p><p>Below are the steps needed to implement gradient checking:</p><ol><li>Pick a random number of examples from the training data to use when computing both the numerical and analytical gradients.</li></ol><ul><li>Don’t use all examples in the training data because gradient checking is very slow.</li></ul><p>2. Initialize parameters.</p><p>3. Compute forward propagation and the cross-entropy cost.</p><p>4. Compute the gradients using our back-propagation implementation.</p><p>5. Compute the numerical gradients using the two-sided epsilon method.</p><p>6. Compute the difference between the numerical and analytical gradients.</p>
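<p>To make the two-sided versus right-hand comparison concrete, here is a tiny standalone check on the <em>f(x) = x²</em> example above:</p><pre>def f(x):
    return x ** 2

x, eps = 3.0, 1e-2   # a relatively large epsilon so the error is visible
analytical = 2 * x                                  # f'(3) = 6
right_hand = (f(x + eps) - f(x)) / eps              # error is O(eps)
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)   # error is O(eps^2)

print(abs(right_hand - analytical))   # ~0.01, as computed above
print(abs(two_sided - analytical))    # ~0 up to floating-point error</pre>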
<p>We’ll be using functions we wrote in <em>“Coding Neural Network — Forward Propagation and Backpropagation”</em> post to initialize parameters, compute forward propagation and back-propagation as well as the cross-entropy cost.</p><p>Let’s first import the data.</p><pre># Loading packages<br>import sys<br>import h5py<br>import matplotlib.pyplot as plt<br>import numpy as np<br>from numpy.linalg import norm</pre><pre>import seaborn as sns<br><br>sys.path.append(&quot;../scripts/&quot;)<br>from coding_neural_network_from_scratch import (initialize_parameters, L_model_forward, L_model_backward, compute_cost)</pre><pre># Import the data<br>train_dataset = h5py.File(&quot;../data/train_catvnoncat.h5&quot;)<br>X_train = np.array(train_dataset[&quot;train_set_x&quot;]).T<br>y_train = np.array(train_dataset[&quot;train_set_y&quot;]).T<br>X_train = X_train.reshape(-1, 209)<br>y_train = y_train.reshape(-1, 209)</pre><pre>X_train.shape, y_train.shape</pre><pre>((12288, 209), (1, 209))</pre><p>Next, we’ll write helper functions that facilitate converting the parameters and gradients dictionaries into vectors and re-converting them back to dictionaries. The first function below re-creates a parameters dictionary from a vector; its counterpart, dictionary_to_vector (used later), does the reverse.</p><pre>def vector_to_dictionary(vector, layers_dims):<br>    L = len(layers_dims)<br>    parameters = {}<br>    k = 0<br>    for l in range(1, L):<br>        # Create temp variable to store dimension used on each layer<br>        w_dim = layers_dims[l] * layers_dims[l - 1]<br>        b_dim = layers_dims[l]<br>        # Create temp var to be used in slicing parameters vector<br>        temp_dim = k + w_dim<br>        # Add parameters to the dictionary<br>        parameters[&quot;W&quot; + str(l)] = vector[k:temp_dim].reshape(layers_dims[l], layers_dims[l - 1])<br>        parameters[&quot;b&quot; + str(l)] = vector[temp_dim:temp_dim + b_dim].reshape(b_dim, 1)<br>        k += w_dim + b_dim<br>    return parameters</pre><pre>def gradients_to_vector(gradients):<br>    # Get the number of indices for the gradients to iterate over<br>    valid_grads = [key for key in gradients.keys() if not key.startswith(&quot;dA&quot;)]<br>    L = len(valid_grads) // 2<br>    count = 0<br>    # Iterate over all gradients and append them to new_grads list<br>    for l in range(1, L + 1):<br>        if count == 0:<br>            new_grads = gradients[&quot;dW&quot; + str(l)].reshape(-1, 1)<br>            new_grads = np.concatenate((new_grads, gradients[&quot;db&quot; + str(l)].reshape(-1, 1)))<br>        else:<br>            new_grads = np.concatenate((new_grads, gradients[&quot;dW&quot; + str(l)].reshape(-1, 1)))<br>            new_grads = np.concatenate((new_grads, gradients[&quot;db&quot; + str(l)].reshape(-1, 1)))<br>        count += 1<br>    return new_grads</pre><p>Finally, we’ll write the gradient checking function that will compute the difference between the analytical and numerical gradients and tell us if our implementation of back-propagation is correct. 
We’ll randomly choose 1 example to compute the difference.</p><pre>def forward_prop_cost(X, parameters, Y, hidden_layers_activation_fn=&quot;tanh&quot;):<br>    # Compute forward prop<br>    AL, _ = L_model_forward(X, parameters, hidden_layers_activation_fn)<br>    # Compute cost<br>    cost = compute_cost(AL, Y)<br>    return cost</pre><pre>def gradient_check(parameters, gradients, X, Y, layers_dims, epsilon=1e-7, hidden_layers_activation_fn=&quot;tanh&quot;):<br>    # Roll out parameters and gradients dictionaries<br>    parameters_vector = dictionary_to_vector(parameters)<br>    gradients_vector = gradients_to_vector(gradients)<br>    # Create vector of zeros to be used with epsilon<br>    grads_approx = np.zeros_like(parameters_vector)<br>    for i in range(len(parameters_vector)):<br>        # Compute cost of theta + epsilon<br>        theta_plus = np.copy(parameters_vector)<br>        theta_plus[i] = theta_plus[i] + epsilon<br>        j_plus = forward_prop_cost(X, vector_to_dictionary(theta_plus, layers_dims), Y, hidden_layers_activation_fn)<br>        # Compute cost of theta - epsilon<br>        theta_minus = np.copy(parameters_vector)<br>        theta_minus[i] = theta_minus[i] - epsilon<br>        j_minus = forward_prop_cost(X, vector_to_dictionary(theta_minus, layers_dims), Y, hidden_layers_activation_fn)<br>        # Compute numerical gradients<br>        grads_approx[i] = (j_plus - j_minus) / (2 * epsilon)</pre><pre>    # Compute the difference of numerical and analytical gradients<br>    numerator = norm(gradients_vector - grads_approx)<br>    denominator = norm(grads_approx) + norm(gradients_vector)<br>    difference = numerator / denominator</pre><pre>    if difference &gt; 10e-7:<br>        print(&quot;\033[31mThere is a mistake in back-propagation &quot; +\<br>              &quot;implementation. The difference is: {}&quot;.format(difference))<br>    else:<br>        print(&quot;\033[32mThe implementation of back-propagation is fine! &quot; +\<br>              &quot;The difference is: {}&quot;.format(difference))</pre><pre>    return difference</pre><pre># Set up neural network architecture<br>layers_dims = [X_train.shape[0], 5, 5, 1]</pre><pre># Initialize parameters<br>parameters = initialize_parameters(layers_dims)</pre><pre># Randomly selecting 1 example from training data<br>perms = np.random.permutation(X_train.shape[1])<br>index = perms[:1]</pre><pre># Compute forward propagation<br>AL, caches = L_model_forward(X_train[:, index], parameters, &quot;tanh&quot;)</pre><pre># Compute analytical gradients<br>gradients = L_model_backward(AL, y_train[:, index], caches, &quot;tanh&quot;)</pre><pre># Compute difference of numerical and analytical gradients<br>difference = gradient_check(parameters, gradients, X_train[:, index], y_train[:, index], layers_dims)</pre><pre>The implementation of back-propagation is fine! The difference is: 3.0220555297630148e-09</pre><p>Congratulations! Our implementation is correct :)</p><h3>Conclusion</h3><p>Below are some key takeaways:</p><ol><li>The two-sided numerical gradient approximates the analytical gradients more closely than the right-hand form.</li></ol><p>2. Since gradient checking is very slow:</p><ul><li>Apply it on one or a few training examples.</li><li>Turn it off when training the neural network after making sure that the backpropagation implementation is correct.</li></ul><p>3. Gradient checking doesn’t work when applying the drop-out method. Set keep-prob = 1 when running gradient checking, and then change it back when training the neural network.</p><p>4. Epsilon = 10e-7 is a common value used for the difference between the analytical gradient and the numerical gradient. 
If the difference is less than 10e-7, then the implementation of backpropagation is correct.</p><p>5. Thanks to <em>Deep Learning</em> frameworks such as TensorFlow and PyTorch, we may find ourselves rarely implementing backpropagation because such frameworks compute it for us; however, it’s a good practice to understand what happens under the hood to become a good Deep Learning practitioner.</p><p>The source code that created this post can be found <a href="https://github.com/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Gradient-Checking.ipynb">here</a>.</p><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/coding-nn/gradient-checking/Coding-Neural-Network-Gradient-Checking.html"><em>imaddabbura.github.io</em></a><em> on April 8, 2018.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5222544ccc64" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/coding-neural-network-gradient-checking-5222544ccc64">Coding Neural Network — Gradient Checking</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Coding Neural Network — Forward Propagation and Backpropagtion]]></title>
            <link>https://medium.com/data-science/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/ccf8cf369f76</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Sun, 01 Apr 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:22:45.948Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Why Neural Networks?</strong></p><p>According to the <em>Universal Approximation Theorem</em>, neural networks can approximate, as well as learn and represent, any function given a large enough hidden layer and a desired error margin. The way a neural network learns the true function is by building complex representations on top of simple ones. On each hidden layer, the neural network learns a new feature space by first computing the affine (linear) transformation of the given inputs and then applying a non-linear function, which in turn will be the input of the next layer. This process continues until we reach the output layer. Therefore, we can describe a neural network as information flowing from the inputs through the hidden layers towards the output. For a 3-layer neural network, the learned function would be: <em>f(x) = f_3(f_2(f_1(x)))</em> where:</p><ul><li><em>f_1(x)</em>: Function learned on first hidden layer</li><li><em>f_2(x)</em>: Function learned on second hidden layer</li><li><em>f_3(x)</em>: Function learned on output layer</li></ul><p>Therefore, on each layer we learn a different representation that gets more complicated with later hidden layers. Below is an example of a 3-layer neural network (we don’t count the input layer):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/791/0*hzIQ5Fs-g8iBpVWq.jpg" /><figcaption><strong>Figure 1:</strong> Neural Network with two hidden layers</figcaption></figure><p>For example, computers can’t understand images directly and don’t know what to do with pixel data. However, a neural network can build a simple representation of the image in the early hidden layers that identifies edges. Given the first hidden layer’s output, it can learn corners and contours. Given the second hidden layer’s output, it can learn parts such as a nose. Finally, it can learn the object identity.</p><p>Since <strong>truth is never linear</strong> and representation is very critical to the performance of a machine learning algorithm, neural networks can help us build very complex models and leave it to the algorithm to learn such representations, without worrying about feature engineering, which takes practitioners a very long time and a lot of effort to curate a good representation.</p><p>The post has two parts:</p><ol><li>Coding the neural network: This entails writing all the helper functions that would allow us to implement a multi-layer neural network. While doing so, I’ll explain the theoretical parts whenever possible and give some advice on implementation.</li><li>Application: We’ll implement the neural network we coded in the first part on an image recognition problem to see if the network we built will be able to detect if the image has a cat or a dog, and see it working :)</li></ol><p>This post will be the first in a series of posts that cover implementing a neural network in numpy, including <em>gradient checking, parameter initialization, L2 regularization, dropout</em>. The source code that created this post can be found <a href="https://nbviewer.jupyter.org/github/ImadDabbura/blog-posts/blob/master/notebooks/Coding-Neural-Network-Forwad-Back-Propagation.ipynb">here</a>.</p><pre># Import packages<br>import h5py<br>import matplotlib.pyplot as plt<br>import numpy as np<br>import seaborn as sns</pre><h3>I. Coding The Neural Network</h3><h3>Forward Propagation</h3><p>The input <em>X</em> provides the initial information that then propagates to the hidden units at each layer and finally produces the output y^.</p>
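<p>As a minimal sketch of that flow (names here are illustrative), each layer applies an affine transformation followed by a non-linearity, and the network is just the composition of these layers:</p><pre>import numpy as np

def layer(A_prev, W, b, g):
    # One layer: affine transformation followed by a non-linearity g
    return g(W @ A_prev + b)

# A 3-layer network as the composition f3(f2(f1(x))), where x has shape
# (n_features, m_examples) and each (W, b) pair belongs to one layer:
# A1 = layer(x, W1, b1, np.tanh)
# A2 = layer(A1, W2, b2, np.tanh)
# AL = layer(A2, W3, b3, sigmoid)   # sigmoid on the output layer</pre>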
<p>The architecture of the network entails determining its depth, width, and the activation functions used on each layer. <strong>Depth</strong> is the number of hidden layers. <strong>Width</strong> is the number of units (nodes) on each hidden layer, since we control neither the input layer nor the output layer dimensions. There are quite a few activation functions, such as <em>Rectified Linear Unit, Sigmoid, Hyperbolic Tangent, etc</em>. Research has shown that deeper networks outperform networks with more hidden units. Therefore, it’s always better (and won’t hurt) to train a deeper network, with diminishing returns.</p><p>Let’s first introduce some notations that will be used throughout the post:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/948/1*eJyGOkakBkSC466ilLGslQ.png" /></figure><p>Next, we’ll write down the dimensions of a multi-layer neural network in the general form to help us in matrix multiplication, because one of the major challenges in implementing a neural network is getting the dimensions right.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/764/1*oM7DK8Yf8ev-pjpBTErNHg.png" /></figure><p>The two equations we need to implement forward propagation are <em>z^l = W^l a^(l-1) + b^l</em> and <em>a^l = g^l(z^l)</em>. These computations will take place on each layer.</p><h3>Parameters Initialization</h3><p>We’ll first initialize the weight matrices and the bias vectors. It’s important to note that we shouldn’t initialize all the parameters to zero, because doing so will lead the gradients to be equal, and on each iteration the output would be the same and the learning algorithm wouldn’t learn anything. Therefore, it’s important to randomly initialize the parameters to small values. It’s also recommended to multiply the random values by a small scalar such as 0.01 to make the activation units active and be in the regions where the activation functions’ derivatives are not close to zero.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f5b2e5831b2043c03d7348ae3c16ca5c/href">https://medium.com/media/f5b2e5831b2043c03d7348ae3c16ca5c/href</a></iframe><h3>Activation Functions</h3><p>There is no definitive guide for which activation function works best on specific problems. It’s a trial and error process where one should try a different set of functions and see which one works best on the problem at hand. We’ll cover 4 of the most commonly used activation functions:</p><ul><li><strong>Sigmoid function (</strong>σ<strong>)</strong>: <em>g(z)</em> =<em> 1 / (1 + e^{-z})</em>. It’s recommended to be used only on the output layer so that we can easily interpret the output as probabilities, since its output is restricted between 0 and 1. One of the main disadvantages of using the sigmoid function on hidden layers is that the gradient is very close to zero over a large portion of its domain, which makes it slow and harder for the learning algorithm to learn.</li><li><strong>Hyperbolic Tangent function</strong>: <em>g(z)</em> = <em>(e^z -e^{-z}) / (e^z + e^{-z})</em>. It’s superior to the sigmoid function in that the mean of its output is very close to zero; in other words, it centers the output of the activation units around zero, which makes the range of values very small and learning faster. The disadvantage that it shares with the sigmoid function is that the gradient is very small on a good portion of the domain.</li><li><strong>Rectified Linear Unit (ReLU)</strong>: <em>g(z)</em> <em>= max{0, z}</em>. Models that are close to linear are easy to optimize. 
Since ReLU shares a lot of the properties of linear functions, it tends to work well on most problems. The only issue is that the derivative is not defined at <em>z = 0</em>, which we can overcome by assigning the derivative to 0 at <em>z = 0</em>. However, this means that for z ≤ 0 the gradient is zero and, again, the network can’t learn.</li><li><strong>Leaky Rectified Linear Unit</strong>: <em>g(z)</em> = <em>max{α*z, z}</em>. It overcomes the zero-gradient issue of ReLU by assigning a small value <em>α</em> to the slope for <em>z</em> ≤ 0.</li></ul><p>If you’re not sure which activation function to choose, start with ReLU. Next, we’ll implement the above activation functions and draw a graph for each one to make it easier to see the domain and range of each function.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9cef07a51dc0ea616ced1107a994f4b0/href">https://medium.com/media/9cef07a51dc0ea616ced1107a994f4b0/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/187d141895ea630ffcbc11cc6b252c14/href">https://medium.com/media/187d141895ea630ffcbc11cc6b252c14/href</a></iframe><h3>Feed Forward</h3><p>Given its inputs from the previous layer, each unit computes the affine transformation <em>z = W^Tx + b</em> and then applies an activation function <em>g(z)</em> such as ReLU element-wise. During the process, we’ll store (cache) all the variables computed and used on each layer to be used in back-propagation. We’ll first write two helper functions that will be used in the L-model forward propagation to make it easier to debug. Keep in mind that on each layer we may have a different activation function.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3026c3f90a97f2a067e07d21e0c21309/href">https://medium.com/media/3026c3f90a97f2a067e07d21e0c21309/href</a></iframe><h3>Cost</h3><p>We’ll use the binary <strong>Cross-Entropy</strong> cost. It uses the log-likelihood method to estimate its error. The cost is: <em>J = −(1/m) Σ_i [y^i log(ŷ^i) + (1 − y^i) log(1 − ŷ^i)]</em>. The above cost function is convex in the network’s output; however, as a function of the parameters it is not, so a neural network usually gets stuck in a local minimum and is not guaranteed to find the optimal parameters. We’ll use gradient-based learning here.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5cbfff65fe24d029b7ee902557e74e4e/href">https://medium.com/media/5cbfff65fe24d029b7ee902557e74e4e/href</a></iframe><h3>Back-Propagation</h3><p>Back-propagation allows the information to go back from the cost through the network in order to compute the gradient. Therefore, we loop over the nodes starting at the final node in reverse topological order to compute the derivative of the final node output with respect to each edge’s node tail. Doing so will help us know who is responsible for the most error and change the parameters in that direction. The following derivative formulas will help us write the back-propagation functions: <em>dW^l = (1/m) dZ^l A^(l-1)T</em>, <em>db^l = (1/m) Σ dZ^l</em>, and <em>dA^(l-1) = W^lT dZ^l</em>. Since <em>b^l</em> is always a vector, the sum is taken over the examples, i.e. across the columns (since each column is an example).</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e982ef4c4f5319dcae77c51dd1f87701/href">https://medium.com/media/e982ef4c4f5319dcae77c51dd1f87701/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/850339d4c359873ea703f3707a1f40e3/href">https://medium.com/media/850339d4c359873ea703f3707a1f40e3/href</a></iframe>
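<p>As a minimal sketch of the cost and the per-layer backward formulas above (illustrative names; the full versions live in the embedded gists and the linked notebook):</p><pre>import numpy as np

def binary_cross_entropy(AL, Y):
    # J = -(1/m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    m = Y.shape[1]
    return float(-(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)).sum() / m)

def linear_backward(dZ, A_prev, W):
    # Gradients of the affine step z = W a_prev + b for one layer
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m
    db = dZ.sum(axis=1, keepdims=True) / m   # sum over the examples (columns)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db</pre>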
<h3>II. Application</h3><p>The dataset that we’ll be working on has 209 images. Each image is 64 x 64 pixels on RGB scale. We’ll build a neural network to classify if the image has a cat or not. Therefore,<em> y^i</em> ∈ <em>{0, 1}</em>.</p><ul><li>We’ll first load the images.</li><li>Show a sample image for a cat.</li><li>Reshape the input matrix so that each column would be one example. Also, since each image is 64 x 64 x 3, we’ll end up having 12,288 features for each image. Therefore, the input matrix would be 12,288 x 209.</li><li>Standardize the data so that the gradients don’t go out of control. Also, it will help hidden units have a similar range of values. For now, we’ll divide every pixel by 255, which shouldn’t be an issue. However, it’s better to standardize the data to have a mean of 0 and a standard deviation of 1.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/087265b1faf014892ce282629f2c5524/href">https://medium.com/media/087265b1faf014892ce282629f2c5524/href</a></iframe><pre>Original dimensions:<br>--------------------<br>Training: (209, 64, 64, 3), (209,)<br>Test: (50, 64, 64, 3), (50,)</pre><pre>New dimensions:<br>---------------<br>Training: (12288, 209), (1, 209)<br>Test: (12288, 50), (1, 50)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/0*9gUGxKkZneMIbWIu.png" /><figcaption><strong>Figure 3:</strong> Sample image</figcaption></figure><p>Now our dataset is ready to be used to test our neural network implementation. Let’s first write the <strong>multi-layer model</strong> function to implement gradient-based learning using a predefined number of iterations and learning rate.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6d5d485f7044c017e8de6035be6b2ea6/href">https://medium.com/media/6d5d485f7044c017e8de6035be6b2ea6/href</a></iframe><p>Next, we’ll train two versions of the neural network, where each one will use a different activation function on the hidden layers: one will use the rectified linear unit (<strong>ReLU</strong>) and the second one will use the hyperbolic tangent function (<strong>tanh</strong>). 
Finally, we’ll use the parameters we get from both neural networks to classify the examples and compute the accuracy rates for each version to see which activation function works best on this problem.</p><pre># Setting layers dims<br>layers_dims = [X_train.shape[0], 5, 5, 1]</pre><pre># NN with tanh activation fn<br>parameters_tanh = L_layer_model(X_train, y_train, layers_dims, learning_rate=0.03, num_iterations=3000, hidden_layers_activation_fn=&quot;tanh&quot;)</pre><pre># Print the accuracy<br>accuracy(X_test, parameters_tanh, y_test, activation_fn=&quot;tanh&quot;)</pre><pre>The cost after 100 iterations is: 0.6556<br>The cost after 200 iterations is: 0.6468<br>The cost after 300 iterations is: 0.6447<br>The cost after 400 iterations is: 0.6441<br>The cost after 500 iterations is: 0.6440<br>The cost after 600 iterations is: 0.6440<br>The cost after 700 iterations is: 0.6440<br>The cost after 800 iterations is: 0.6439<br>The cost after 900 iterations is: 0.6439<br>The cost after 1000 iterations is: 0.6439<br>The cost after 1100 iterations is: 0.6439<br>The cost after 1200 iterations is: 0.6439<br>The cost after 1300 iterations is: 0.6438<br>The cost after 1400 iterations is: 0.6438<br>The cost after 1500 iterations is: 0.6437<br>The cost after 1600 iterations is: 0.6434<br>The cost after 1700 iterations is: 0.6429<br>The cost after 1800 iterations is: 0.6413<br>The cost after 1900 iterations is: 0.6361<br>The cost after 2000 iterations is: 0.6124<br>The cost after 2100 iterations is: 0.5112<br>The cost after 2200 iterations is: 0.5288<br>The cost after 2300 iterations is: 0.4312<br>The cost after 2400 iterations is: 0.3821<br>The cost after 2500 iterations is: 0.3387<br>The cost after 2600 iterations is: 0.2349<br>The cost after 2700 iterations is: 0.2206<br>The cost after 2800 iterations is: 0.1927<br>The cost after 2900 iterations is: 0.4669<br>The cost after 3000 iterations is: 0.1040</pre><pre>&#39;The accuracy rate is: 68.00%.&#39;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/0*iS4jIq0zdJ7pKYfd.png" /><figcaption><strong>Figure 4:</strong> Loss curve with tanh activation function</figcaption></figure><pre># NN with relu activation fn<br>parameters_relu = L_layer_model(X_train, y_train, layers_dims, learning_rate=0.03, num_iterations=3000, hidden_layers_activation_fn=&quot;relu&quot;)</pre><pre># Print the accuracy<br>accuracy(X_test, parameters_relu, y_test, activation_fn=&quot;relu&quot;)</pre><pre>The cost after 100 iterations is: 0.6556<br>The cost after 200 iterations is: 0.6468<br>The cost after 300 iterations is: 0.6447<br>The cost after 400 iterations is: 0.6441<br>The cost after 500 iterations is: 0.6440<br>The cost after 600 iterations is: 0.6440<br>The cost after 700 iterations is: 0.6440<br>The cost after 800 iterations is: 0.6440<br>The cost after 900 iterations is: 0.6440<br>The cost after 1000 iterations is: 0.6440<br>The cost after 1100 iterations is: 0.6439<br>The cost after 1200 iterations is: 0.6439<br>The cost after 1300 iterations is: 0.6439<br>The cost after 1400 iterations is: 0.6439<br>The cost after 1500 iterations is: 0.6439<br>The cost after 1600 iterations is: 0.6439<br>The cost after 1700 iterations is: 0.6438<br>The cost after 1800 iterations is: 0.6437<br>The cost after 1900 iterations is: 0.6435<br>The cost after 2000 iterations is: 0.6432<br>The cost after 2100 iterations is: 0.6423<br>The cost after 2200 iterations is: 0.6395<br>The cost after 2300 iterations is: 0.6259<br>The cost after 
2400 iterations is: 0.5408<br>The cost after 2500 iterations is: 0.5262<br>The cost after 2600 iterations is: 0.4727<br>The cost after 2700 iterations is: 0.4386<br>The cost after 2800 iterations is: 0.3493<br>The cost after 2900 iterations is: 0.1877<br>The cost after 3000 iterations is: 0.3641</pre><pre>&#39;The accuracy rate is: 42.00%.&#39;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/0*mIpo_T_3vw9gpZtN.png" /><figcaption><strong>Figure 5:</strong> Loss curve with ReLU activation function</figcaption></figure><p>Please note that the accuracy rates above are expected to overestimate the generalization accuracy rates.</p><h3>Conclusion</h3><p>The purpose of this post is to code a deep neural network step-by-step and explain the important concepts while doing that. We don’t really care about the accuracy rate at this moment since there are tons of things we could’ve done to increase the accuracy, which would be the subject of following posts. Below are some takeaways:</p><ul><li>Even if a neural network can represent any function, it may fail to learn it for two reasons:</li></ul><ol><li>The optimization algorithm may fail to find the best values for the parameters of the desired (true) function. It can get stuck in a local optimum.</li><li>The learning algorithm may find a functional form that is different from the intended function due to overfitting.</li></ol><ul><li>Even though a neural network rarely converges and often gets stuck in a local minimum, it is still able to reduce the cost significantly and come up with very complex models with high test accuracy.</li><li>The neural network we used in this post is a standard fully connected network. However, there are two other kinds of networks:</li></ul><ol><li>Convolutional NN: Where not all nodes are connected. It’s best in class for image recognition.</li><li>Recurrent NN: There are feedback connections where the output of the model is fed back into itself. It’s used mainly in sequence modeling.</li></ol><ul><li>The fully connected neural network, in contrast, forgets what happened in previous steps and doesn’t know anything about the output.</li><li>There are a number of hyperparameters that we can tune using cross validation to get the best performance of our network:</li></ul><ol><li>Learning rate (α): Determines how big the step is for each update of the parameters.</li></ol><p>A. Small α leads to slow convergence and may become computationally very expensive.</p><p>B. Large α may lead to overshooting, where our learning algorithm may never converge.</p><p>2. Number of hidden layers (depth): The more hidden layers the better, but it comes at a computational cost.</p><p>3. Number of units per hidden layer (width): Research has shown that a huge number of hidden units per layer doesn’t add much to the improvement of the network.</p><p>4. Activation function: Which function to use on hidden layers differs among applications and domains. It’s a trial and error process to try different functions and see which one works best.</p><p>5. 
Number of iterations.</p><ul><li>Standardizing the data helps activation units have a similar range of values and avoids gradients going out of control.</li></ul><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/coding-nn/fwd-bkwd-propagation/Coding-Neural-Network-Forwad-Back-Propagation.html"><em>imaddabbura.github.io</em></a><em> on April 1, 2018.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ccf8cf369f76" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76">Coding Neural Network — Forward Propagation and Backpropagtion</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predicting Loan Repayment]]></title>
            <link>https://medium.com/data-science/predicting-loan-repayment-5df4e0023e92?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/5df4e0023e92</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Thu, 15 Mar 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:23:58.042Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8YE6hEXyYBjF5bei.jpg" /></figure><h3>Introduction</h3><p>The two most critical questions in the lending industry are: 1) How risky is the borrower? 2) Given the borrower’s risk, should we lend to him/her? The answer to the first question determines the interest rate the borrower would have. The interest rate measures, among other things (such as the time value of money), the riskiness of the borrower, i.e. the riskier the borrower, the higher the interest rate. With the interest rate in mind, we can then determine if the borrower is eligible for the loan.</p><p>Investors (lenders) provide loans to borrowers in exchange for the promise of repayment with interest. That means the lender only makes profit (interest) if the borrower pays off the loan. However, if he/she doesn’t repay the loan, then the lender loses money.</p><p>We’ll be using publicly available data from <a href="https://www.lendingclub.com/info/download-data.action">LendingClub.com</a>. The data covers the 9,578 loans funded by the platform between May 2007 and February 2010. The interest rate is provided to us for each borrower. Therefore, we’ll address the second question indirectly by trying to predict whether the borrower will repay the loan by its maturity date or not. Through this exercise we’ll illustrate three modeling concepts:</p><ul><li>What to do with missing values.</li><li>Techniques used with imbalanced classification problems.</li><li>How to build an ensemble model using two methods, blending and stacking, which will most likely give us a boost in performance.</li></ul><p>Below is a short description of each feature in the data set:</p><ul><li><strong>credit_policy</strong>: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.</li><li><strong>purpose</strong>: The purpose of the loan such as: credit_card, debt_consolidation, etc.</li><li><strong>int_rate</strong>: The interest rate of the loan (proportion).</li><li><strong>installment</strong>: The monthly installments ($) owed by the borrower if the loan is funded.</li><li><strong>log_annual_inc</strong>: The natural log of the annual income of the borrower.</li><li><strong>dti</strong>: The debt-to-income ratio of the borrower.</li><li><strong>fico</strong>: The FICO credit score of the borrower.</li><li><strong>days_with_cr_line</strong>: The number of days the borrower has had a credit line.</li><li><strong>revol_bal</strong>: The borrower’s revolving balance.</li><li><strong>revol_util</strong>: The borrower’s revolving line utilization rate.</li><li><strong>inq_last_6mths</strong>: The borrower’s number of inquiries by creditors in the last 6 months.</li><li><strong>delinq_2yrs</strong>: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.</li><li><strong>pub_rec</strong>: The borrower’s number of derogatory public records.</li><li><strong>not_fully_paid</strong>: Indicates whether the loan was not paid back in full (the borrower either defaulted or was deemed unlikely to pay it back).</li></ul><p>Let’s load the data and check:</p><ul><li>Data types of each feature</li><li>If we have missing values</li><li>If we have imbalanced data</li></ul><p>Source code that created this post can be found <a href="https://nbviewer.jupyter.org/github/ImadDabbura/blog-posts/blob/master/notebooks/Predicting-Loan-Repayment.ipynb">here</a>.</p>
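<p>For readers following along outside the notebook, those first checks can be sketched as follows (the file name is hypothetical; use whatever name you saved the LendingClub data under):</p><pre>import pandas as pd

df = pd.read_csv("loan_data.csv")  # hypothetical file name

# Data types and missing values per feature
print(df.dtypes)
print(df.isnull().sum())

# Class balance of the label
counts = df["not_fully_paid"].value_counts()
print("Positive examples = {}".format(counts[1]))
print("Negative examples = {}".format(counts[0]))
print("Proportion of positive to negative examples = {:.2%}".format(counts[1] / counts[0]))</pre>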
frameborder="0" scrolling="no"><a href="https://medium.com/media/488e457de9851748a19d6acb92550171/href">https://medium.com/media/488e457de9851748a19d6acb92550171/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/866/1*h65ukMsMrJdltKCMqmgRyg.png" /><figcaption><strong>Figure 1</strong>: Data types/missing values</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xkrD0-AX7CnM3R_m__4CIA.png" /></figure><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e9956ebd570afcfc751fcc5aebb2d143/href">https://medium.com/media/e9956ebd570afcfc751fcc5aebb2d143/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/561/0*UAJ5Zp1kYOaLeRck.png" /><figcaption><strong>Figure 2:</strong> Class counts</figcaption></figure><pre>Positive examples = 1533<br>Negative examples = 8045<br>Proportion of positive to negative examples = 19.06%</pre><p>It looks like we have only one categorical feature (“purpose”). Also, six features have missing values (no missing values in labels). Moreover, the data set is pretty imbalanced as expected where positive examples (“not paid fully”) are only 19%. We’ll explain in the next section how to handle all of them after giving an overview of ensemble methods.</p><h3>Modeling</h3><p><strong>Ensemble methods</strong> can be defined as combining several different models (base learners) into final model (meta learner) to reduce the generalization error. It relies on the assumption that each model would look at a different aspect of the data which yield to capturing part of the truth. Combining good performing models the were trained independently will capture more of the truth than a single model. Therefore, this would result in more accurate predictions and lower generalization errors.</p><ul><li>Almost always ensemble model performance gets improved as we add more models.</li><li>Try to combine models that are as much different as possible. This will reduce the correlation between the models that will improve the performance of the ensemble model that will lead to significantly outperform the best model. In the worst case where all models are perfectly correlated, the ensemble would have the same performance as the best model and sometimes even lower if some models are very bad. As a result, pick models that are as good as possible.</li></ul><p>Diﬀerent ensemble methods construct the ensemble of models in diﬀerent ways. Below are the most common methods:</p><ul><li>Blending: Averaging the predictions of all models.</li><li>Bagging: Build different models on different datasets and then take the majority vote from all the models. Given the original dataset, we sample with replacement to get the same size of the original dataset. Therefore, each dataset will include, on average, 2/3 of the original data and the rest 1/3 will be duplicates. Since each model will be built on a different dataset, it can be seen as a different model. <em>Random Forest</em> improves default bagging trees by reducing the likelihood of strong features to picked on every split. In other word, it reduces the number of features available at each split from <em>n</em> features to, for example, <em>n/2</em> or <em>log(n)</em> features. This will reduce correlation –&gt; reduce variance.</li><li>Boosting: Build models sequentially. That means each model learns from the residuals of the previous model. 
<p>Since we’ll be using Random Forest (bagging) and Gradient Boosting (boosting) classifiers as base learners in the ensemble model, we’ll illustrate only the averaging (blending) and stacking ensemble methods. The modeling section therefore consists of three parts:</p><ul><li>Strategies to deal with missing values.</li><li>Strategies to deal with imbalanced datasets.</li><li>Building ensemble models.</li></ul><p>Before going further, the following data preprocessing steps apply to all models:</p><ol><li>Create dummy variables from the feature “purpose” since it’s a nominal (not ordinal) categorical variable. It’s also good practice to drop the first one to avoid linear dependency between the resulting features, since some algorithms may struggle with this issue.</li><li>Split the data into a training set (70%) and a test set (30%). The training set will be used to fit the models, and the test set will be used to evaluate the best model and estimate the generalization error. Instead of a separate validation set to tune hyperparameters and evaluate different models, we’ll use 10-fold cross-validation because it gives a more reliable estimate of the generalization error.</li><li>Standardize the data. We’ll be using RobustScaler so that the standardization is less influenced by outliers, i.e. more robust. It centers the data around the median and scales it using the <em>interquartile range (IQR)</em>. This step will be included in the pipeline for each model as a transformer, so we won’t do it separately.</li></ol><pre># Create dummy variables from the feature purpose<br>df = pd.get_dummies(df, columns=[&quot;purpose&quot;], drop_first=True)</pre>
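<p>A minimal sketch of steps 2 and 3 above (the variable names are mine, and the scaler is shown standalone here even though in the post it lives inside each model’s pipeline):</p><pre>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import RobustScaler<br><br># 70/30 split, stratified to preserve the ~19% positive rate<br>X = df.drop(columns=[&quot;not_fully_paid&quot;]).values<br>y = df[&quot;not_fully_paid&quot;].values<br>X_train, X_test, y_train, y_test = train_test_split(<br>    X, y, test_size=0.3, stratify=y, random_state=123)<br><br># Center on the median and scale by the IQR, fitting on training data only<br>scaler = RobustScaler().fit(X_train)</pre>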
<h3>Strategies to deal with missing values</h3><p>Real-world data sets almost always have missing values. This can happen because, for example, users didn’t fill in some parts of a form, or some transformations happened while the data was being collected and cleaned before it was sent to you. Sometimes missing values are informative and weren’t generated randomly, so it’s good practice to add a binary feature for each feature that has missing values, indicating whether each example is missing a value for it. In our case, six features have missing values, so we would add six binary features, one per feature. For example, the “log_annual_inc” feature has missing values, so we would add a feature “is_log_annual_inc_missing” that takes values ∈ {0, 1}. Fortunately, the missing values are in the predictors only, not the labels. Below are some of the most common strategies for dealing with missing values:</p><ul><li>Simply delete all examples that have any missing values. This is usually done if the missing values are a very small fraction of the data set and were generated randomly; in other words, if the added binary features did not improve the model. One disadvantage of this strategy is that the model will throw an error at prediction time if the test data has missing values.</li><li>Impute the missing values using the mean of each feature separately.</li><li>Impute the missing values using the median of each feature separately.</li><li>Use <em>Multivariate Imputation by Chained Equations (MICE)</em>. The main disadvantage of MICE is that we can’t use it as a transformer in sklearn pipelines, and it requires the full data set when imputing the missing values. This carries a risk of data leakage, since we’re using both the training and test sets to impute the missing values. The following steps explain how MICE works:</li></ul><ol><li>Impute the missing values using the mean of each feature separately.</li><li>For each feature that has missing values, take all other features as predictors (including the ones that had missing values) and predict the values of this feature, using linear regression for example. The predicted values replace the old values for that feature. We do this for all features that have missing values, i.e. each feature is used once as a target variable to predict its values and the rest of the time as a predictor to predict other features’ values. Therefore, one complete cycle (iteration) is done once we run the model <em>k</em> times to predict the <em>k</em> features that have missing values. For our data set, each iteration runs the linear regression 6 times to predict the 6 features.</li><li>Repeat step 2 until there is not much change between predictions.</li></ol><ul><li>Impute the missing values using K-Nearest Neighbors. We compute distances between all examples (excluding missing values) in the data set and take the average of the k-nearest neighbors of each missing value. There was no implementation for this in sklearn at the time of writing, and it’s pretty inefficient to compute since we have to go through all examples to calculate distances. Therefore, we’ll skip this strategy in this post.</li></ul><p>To evaluate each strategy, we’ll use a <em>Random Forest</em> classifier with hyperparameter values guided by <a href="https://arxiv.org/pdf/1708.05070.pdf">Data-driven Advice for Applying Machine Learning to Bioinformatics Problems</a> as a starting point.</p><p>Let’s first create the binary features for missing values and then prepare the data for each strategy discussed above. Next, we’ll compute the 10-fold cross-validation <em>AUC</em> score for all the models using the training data.</p>
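<p>As a rough sketch, the median-imputation variant of this evaluation might look like the following (the hyperparameter values shown are illustrative placeholders, not the exact ones from the paper):</p><pre>from sklearn.ensemble import RandomForestClassifier<br>from sklearn.impute import SimpleImputer<br>from sklearn.model_selection import cross_val_score<br>from sklearn.pipeline import Pipeline<br><br>pipe = Pipeline([<br>    (&quot;impute&quot;, SimpleImputer(strategy=&quot;median&quot;)),<br>    (&quot;rf&quot;, RandomForestClassifier(n_estimators=500, random_state=123)),<br>])<br>scores = cross_val_score(pipe, X_train, y_train, cv=10, scoring=&quot;roc_auc&quot;)<br>print(f&quot;Median imputation model&#39;s average AUC: {scores.mean():.3f}&quot;)</pre>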
<pre>Original data shapes: ((7662, 24), (1916, 24))<br>After dropping NAs: ((7611, 18), (1905, 18))<br>MICE data shapes: ((7662, 24), (1916, 24))</pre><pre>Baseline model&#39;s average AUC: 0.651 <br>Mean imputation model&#39;s average AUC: 0.651<br>Median imputation model&#39;s average AUC: 0.651<br>MICE imputation model&#39;s average AUC: 0.656</pre><p>Let’s plot the feature importances to check whether the added binary features contributed anything to the model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/795/0*l7nWgIEcuFQjPQ1K.png" /><figcaption><strong>Figure 3:</strong> Random Forest feature importance</figcaption></figure><p>Guided by the 10-fold cross-validation <em>AUC</em> scores, it looks like all strategies have comparable results and the missing values were generated randomly. Also, the added six binary features showed no importance in the <em>Random Forest</em> feature importance plot. Therefore, it’s safe to drop those features and use the <em>Median Imputation</em> method as a transformer later on in the pipeline.</p><pre># Drop generated binary features<br>X_train = X_train[:, :-6]<br>X_test = X_test[:, :-6]</pre><h3>Strategies to deal with imbalanced datasets</h3><p>Classification problems in most real-world applications have imbalanced data sets. In other words, the positive examples (minority class) are far fewer than the negative examples (majority class). We see this in spam detection, ad clicks, loan approvals, etc. In our example, the positive examples (people who haven’t fully paid) are only 19% of the total examples. Therefore, accuracy is no longer a good measure of performance: if we simply predict that all examples belong to the negative class, we achieve 81% accuracy. Better metrics for imbalanced data sets are <em>AUC</em> (area under the ROC curve) and the f1-score. However, metrics alone are not enough, because class imbalance also influences the learning algorithm during training: the decision rule becomes biased towards the majority class, since the model implicitly optimizes its predictions for the majority class. As a result, we’ll explore different methods to overcome the class imbalance problem.</p><ul><li>Under-sampling: Under-sample the majority class, with or without replacement, to make the number of positive and negative examples equal. One drawback of under-sampling is that it ignores a good portion of the training data, which contains valuable information; in our example, it would lose around 6,500 examples. However, it’s very fast to train.</li><li>Over-sampling: Over-sample the minority class, with or without replacement, to make the number of positive and negative examples equal. With this strategy we add around 6,500 samples to the training data set. It’s a lot more computationally expensive than under-sampling, and it’s also more prone to overfitting due to the repeated examples.</li><li>EasyEnsemble: Sample several subsets from the majority class, build a classifier on top of each sampled dataset, and combine the outputs of all classifiers. More details can be found <a href="http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tsmcb09.pdf">here</a>.</li><li>Synthetic Minority Oversampling Technique (SMOTE): Over-sample the minority class using synthesized examples. It operates in feature space, not data space. Here is how it works:</li></ul><ol><li>Compute the k-nearest neighbors for all minority samples.</li><li>Randomly choose a number between 1 and k.</li><li>For each feature: (a) compute the difference between the minority sample and its randomly chosen neighbor (from the previous step); (b) multiply the difference by a random number between 0 and 1; (c) add the obtained value to the synthesized sample’s feature.</li><li>Repeat the above until we get the number of synthesized samples needed. More information can be found <a href="https://www.jair.org/media/953/live-953-2037-jair.pdf">here</a>.</li></ol><p>There are other methods, such as EditedNearestNeighbors and CondensedNearestNeighbors, that we will not cover in this post and that are rarely used in practice.</p>
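<p>The resampling strategies above are implemented in the imbalanced-learn package; a minimal sketch of the two we can apply directly (assuming a recent version of the library):</p><pre>from imblearn.over_sampling import SMOTE<br>from imblearn.under_sampling import RandomUnderSampler<br><br># Balance the classes by under-sampling the majority class ...<br>X_under, y_under = RandomUnderSampler(random_state=123).fit_resample(X_train, y_train)<br><br># ... or by synthesizing new minority examples with SMOTE<br>X_smote, y_smote = SMOTE(random_state=123).fit_resample(X_train, y_train)</pre>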
<p>In most applications, misclassifying the minority class (a false negative) is a lot more expensive than misclassifying the majority class (a false positive). In the context of lending, losing money by lending to a risky borrower who is likely not to pay the loan back in full is a lot more costly than missing the opportunity of lending to a trustworthy (less risky) borrower. As a result, we can use class_weight, which changes the weight of misclassifying a positive example in the loss function. Also, we can use different cut-offs to assign examples to classes. By default the cut-off is 0.5, but in applications such as lending we often see cut-offs below 0.5. Note that changing the cut-off from the default 0.5 reduces the overall accuracy but may improve the accuracy of predicting positive/negative examples.</p><p>We’ll evaluate all the above methods, plus the original model without resampling as a baseline, using the same <em>Random Forest</em> classifier we used in the missing values section.</p><pre>Original model&#39;s average AUC: 0.652 <br>Under-sampled model&#39;s average AUC: 0.656 <br>Over-sampled model&#39;s average AUC: 0.651 <br>EasyEnsemble model&#39;s average AUC: 0.665 <br>SMOTE model&#39;s average AUC: 0.641</pre><p>The EasyEnsemble method has the highest 10-fold CV average AUC, at 0.665.</p><h3>Build Ensemble models</h3><p>We’ll build ensemble models using three different models as base learners:</p><ul><li>Gradient Boosting</li><li>Support Vector Classifier</li><li>Random Forest</li></ul><p>The ensemble models will be built using two different methods (a condensed sketch of both follows the lists below):</p><ul><li>Blending (averaging) ensemble model: fit the base learners to the training data and then, at test time, average the predictions generated by all the base learners. We use VotingClassifier from sklearn, which:</li></ul><ol><li>Fits all the base learners on the training data.</li><li>At test time, uses all base learners to predict the test data and then averages all the predictions.</li></ol><ul><li>Stacked ensemble model: fit the base learners to the training data, then use those trained base learners to generate predictions (meta-features) used by the meta-learner (assuming we have only one layer of base learners). There are a few different ways of training a stacked ensemble model:</li></ul><ol><li>Fit the base learners to all the training data and then generate predictions using that same training data. This method is more prone to overfitting, because the meta-learner gives more weight to whichever base learner memorized the training data best, i.e. the meta-learner won’t generalize well.</li><li>Split the training data into 2 to 3 different parts for training, validation, and generating predictions. This is suboptimal because held-out sets usually have higher variance, different splits give different results, and the learning algorithms have less data to train on.</li><li>Use k-fold cross-validation: split the data into k folds, fit the base learners on (k - 1) folds, and use the fitted models to generate predictions for the held-out fold. Repeat the process until we have generated predictions for all k folds. When done, refit the base learners on the full training data. This method is more reliable, gives less weight to models that memorize the data, and therefore generalizes better on future data.</li></ol>
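<p>Here is that sketch (hyperparameters omitted; the variable names are mine):</p><pre>import numpy as np<br>from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.model_selection import cross_val_predict<br>from sklearn.svm import SVC<br><br>base_learners = [(&quot;gb&quot;, GradientBoostingClassifier()),<br>                 (&quot;svc&quot;, SVC(probability=True)),<br>                 (&quot;rf&quot;, RandomForestClassifier())]<br><br># Blending: average the predicted probabilities of the base learners<br>blender = VotingClassifier(base_learners, voting=&quot;soft&quot;).fit(X_train, y_train)<br><br># Stacking: out-of-fold predictions become meta-features for the meta-learner<br>meta_features = np.column_stack([<br>    cross_val_predict(clf, X_train, y_train, cv=10, method=&quot;predict_proba&quot;)[:, 1]<br>    for _, clf in base_learners])<br>meta_learner = LogisticRegression().fit(meta_features, y_train)</pre>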
<p>We’ll use logistic regression as the meta-learner for the stacked model. Note that we can use k-fold cross-validation to validate and tune the hyperparameters of the meta-learner. We will not tune the hyperparameters of any of the base learners or the meta-learner; however, we will use some of the values recommended by the <a href="https://arxiv.org/pdf/1708.05070.pdf">Pennsylvania Benchmarking Paper</a>. Additionally, we won’t use EasyEnsemble in training because, after some experimentation, it didn’t improve the AUC of the ensemble model by more than 2% on average and it was computationally very expensive. In practice, we are sometimes willing to give up small improvements if the model would otherwise become a lot more computationally complex; therefore, we will use RandomUnderSampler instead. Also, we’ll impute the missing values and standardize the data beforehand, which shortens the code of the ensemble models and lets us avoid using Pipeline. Finally, we will plot ROC and PR curves using the test data and evaluate the performance of all models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*6MRU1bj1otztNAj2.png" /><figcaption><strong>Figure 4:</strong> ROC and PR curves</figcaption></figure><p>As we can see from the chart above, the stacked ensemble model didn’t improve the performance. One major reason is that the base learners are highly correlated, especially <em>Random Forest</em> and <em>Gradient Boosting</em> (see the correlation matrix below).</p><pre># Plot the correlation between base learners<br>probs_df = pd.DataFrame(meta_features, columns=[&quot;xgb&quot;, &quot;svm&quot;, &quot;rf&quot;])<br>corrmat(probs_df.corr(), inflate=True);</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/605/0*RsxwvhOm4x1JD0g2.png" /><figcaption><strong>Figure 5:</strong> Correlation matrix</figcaption></figure><p>In addition, in classification problems where False Negatives are a lot more expensive than False Positives, we may want a model with high recall rather than high precision. Below is the confusion matrix:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/268/0*xDRZp41NGUyPoZNK.png" /><figcaption><strong>Figure 6:</strong> Confusion matrix</figcaption></figure><p>Let’s finally check the partial dependence plots to see the most important features and their relationships with whether the borrower will pay the loan in full before its maturity date. We will plot only the top 8 features to make them easier to read. Note that the partial dependence plots are based on the Gradient Boosting model.</p>
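<p>A sketch of how such plots can be produced with scikit-learn (the partial-dependence utilities have moved across versions; recent releases expose them via sklearn.inspection, and gb_clf and X_train_df are my names for the fitted Gradient Boosting classifier and the training DataFrame):</p><pre>from sklearn.inspection import PartialDependenceDisplay<br><br># Feature names taken from the data description above<br>top_features = [&quot;int_rate&quot;, &quot;fico&quot;, &quot;log_annual_inc&quot;, &quot;installment&quot;]<br>PartialDependenceDisplay.from_estimator(gb_clf, X_train_df, top_features)</pre>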
<figure><img alt="" src="https://cdn-images-1.medium.com/max/922/0*xQtbeMZToPFflYEz.png" /><figcaption><strong>Figure 7:</strong> Partial dependence plots</figcaption></figure><p>As we might have expected, borrowers with lower annual incomes and lower FICO scores are less likely to pay the loan in full, while borrowers with lower interest rates (i.e. less risky borrowers) and smaller installments are more likely to pay it in full.</p><h3>Conclusion</h3><p>Most classification problems in the real world are imbalanced, and data sets almost always have missing values. In this post, we covered strategies to deal with both missing values and imbalanced data sets. We also explored different ways of building ensembles in sklearn. Below are some takeaway points:</p><ul><li>There is no definitive guide to which algorithm to use in a given situation. What works on some data sets may not work on others. Therefore, always evaluate methods using cross-validation to get reliable estimates.</li><li>Sometimes we are willing to give up some improvement to a model if it would increase the complexity much more than the improvement in the evaluation metrics justifies.</li><li>In some classification problems, <em>False Negatives</em> are a lot more expensive than <em>False Positives</em>. Therefore, we can lower the cut-off point to reduce the number of False Negatives.</li><li>When building ensemble models, try to use good models that are as different as possible to reduce the correlation between the base learners. We could have enhanced our stacked ensemble model by adding a <em>Dense Neural Network</em> and other kinds of base learners, as well as by adding more layers to the stacked model.</li><li>EasyEnsemble usually performs better than other resampling methods.</li><li>Missing values sometimes add more information to the model than we might expect. One way of capturing it is to add a binary indicator for each feature that has missing values, marking whether each example is missing a value for it.</li></ul><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/predict-loan-repayment/Predicting-Loan-Repayment.html"><em>imaddabbura.github.io</em></a><em> on March 15, 2018.</em></p><hr><p><a href="https://medium.com/data-science/predicting-loan-repayment-5df4e0023e92">Predicting Loan Repayment</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Character-Level Language Model]]></title>
            <link>https://medium.com/data-science/character-level-language-model-1439f5dd87fe?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/1439f5dd87fe</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Thu, 22 Feb 2018 00:00:00 GMT</pubDate>
            <atom:updated>2022-09-27T12:25:16.879Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*SJNbNoOx7Sx1Cl5bEwljGQ.jpeg" /><figcaption>iPhone’s text suggestion</figcaption></figure><h3><strong>Introduction</strong></h3><p>Have you ever wondered how Gmail’s automatic reply works? Or how your phone suggests the next word when texting? Or even how a neural network can generate musical notes? The general way of generating a sequence of text is to train a model to predict the next word/character given all previous words/characters. Such a model is called a <strong>Statistical Language Model</strong>. What is a statistical language model? A statistical language model tries to capture the statistical structure (latent space) of the text it’s trained on. Models from the <strong>Recurrent Neural Network (RNN)</strong> family are usually used for this because they are powerful and expressive: they remember and process past information through their high-dimensional hidden state units. The main goal of any language model is to learn the joint probability distribution over sequences of characters/words in the training text, i.e. to learn the joint probability function. For example, if we’re trying to predict a sequence of <em>T</em> words, we try to make the joint probability <em>P(w_1, w_2, …, w_T)</em> as big as we can, which is equal to the product of the conditional probabilities ∏ <em>P(w_t/w_1, …, w_{t-1})</em> over all time steps (t).</p><p>In this post, we’ll cover the <strong>Character-Level Language Model</strong>, where almost all the concepts carry over to other language models such as word-level language models. The main task of the character-level language model is to predict the next character given all previous characters in a sequence of data, i.e. to generate text character by character. More formally, given a training sequence (x¹, … , x^T), the RNN uses the sequence of its output vectors (o¹, … , o^T) to obtain a sequence of predictive distributions <em>P(x^{t+1}/x^1, …, x^t)</em> <em>= softmax(o^t)</em>.</p><p>Let’s illustrate how the character-level language model works using my first name (“imad”) as an example (see figure 1 for all the details of this example).</p>
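<p>The first two steps of the example below can be sketched in a few lines of Python (using None as a stand-in for the zero vector x¹):</p><pre>name = &quot;imad&quot;<br>vocab = {ch: i for i, ch in enumerate(sorted(set(name)))}  # {&#39;a&#39;: 0, &#39;d&#39;: 1, &#39;i&#39;: 2, &#39;m&#39;: 3}<br>y = [vocab[ch] for ch in name]  # [2, 3, 0, 1]<br>x = [None] + y[:-1]             # [zero vector, 2, 3, 0]</pre>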
<ol><li>We first build a vocabulary dictionary using all the unique letters of the names in the corpus as keys, assigning each letter an index starting from zero (since Python is a zero-indexed language) in ascending order. For our example, the vocabulary dictionary would be: {“a”: 0, “d”: 1, “i”: 2, “m”: 3}. Therefore, “imad” becomes the following list of integers: [2, 3, 0, 1].</li><li>Convert the input and the output characters to lists of integers using the vocabulary dictionary. In this post, we’ll assume that x¹ = \vec{0} for all examples. Therefore, <em>y = “imad”</em> and x = \vec{0} + <em>“ima”</em>. In other words, <em>x^{t + 1} = y^t</em>, which gives us: <em>y = [2, 3, 0, 1]</em> and <em>x = [\vec{0}, 2, 3, 0]</em>.</li><li>For each character in the input:</li><li>Convert the input character into a one-hot vector. Notice that the first input is the zero vector.</li><li>Compute the hidden state layer.</li><li>Compute the output layer and then pass it through softmax to get the results as probabilities.</li><li>Feed the target character at time step (t) as the input character at time step <em>(t + 1)</em>.</li><li>Go back to step 4 and repeat until we finish all the letters in the name.</li></ol><p>The objective is to make the green numbers as big as we can and the red numbers as small as we can in the probability distribution layer. The reason is that the true index should have the highest probability, as close to 1 as we can make it. The way to do that is to measure the loss using cross-entropy and then compute the gradients of the loss w.r.t. all parameters to update them in the direction opposite to the gradient. Repeating the process many times, each time adjusting the parameters based on the gradient direction, the model will learn to correctly predict the next character given all previous ones, using all the names in the training text. Notice that the hidden state h⁴ carries information about all past characters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9LxR_80giSfC-a9N.png" /><figcaption><strong>Figure 1:</strong> Illustrative example of a character-level language model using an RNN</figcaption></figure><p><em>Note</em>: To shorten the length of the post, I deleted all the docstrings of the Python functions and didn’t include some functions that I didn’t think were necessary to understand the main concepts. The notebook and the script that created this post can be found <a href="https://nbviewer.jupyter.org/github/ImadDabbura/blog-posts/blob/master/notebooks/Character-LeveL-Language-Model.ipynb">here</a> and <a href="https://github.com/ImadDabbura/blog-posts/blob/master/scripts/character_level_language_model.py">here</a>.</p><h3>Training</h3><p>The <a href="http://deron.meranda.us/data/census-derived-all-first.txt">dataset</a> we’ll be using has 5,163 names: 4,275 male names, 1,219 female names, and 331 names that can be both female and male names. The RNN architecture we’ll be using to train the character-level language model is called <strong>many to many</strong>, where the number of time steps of the input, <em>T_x</em>, equals the number of time steps of the output, <em>T_y</em>. In other words, the input and output sequences are synced (see figure 2).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/343/0*ddoz-c1havPz4UQW.PNG" /><figcaption><strong>Figure 2:</strong> RNN architecture: many to many</figcaption></figure><p>The character-level language model will be trained on names, which means that after we’re done training the model, we’ll be able to generate some interesting names :).</p><p>In this section, we’ll go over four main parts:</p><ol><li>Forward propagation.</li><li>Backpropagation.</li><li>Sampling.</li><li>Fitting the model.</li></ol><h3>Forward Propagation</h3><p>We’ll be using Stochastic Gradient Descent (SGD), where each batch consists of only one example. In other words, the RNN model learns from each example (name) separately, i.e. runs both the forward and backward passes on each example and updates the parameters accordingly. Below are all the steps needed for a forward pass:</p>
<ul><li>Create a vocabulary dictionary using the unique lowercase letters.</li><li>Create a character-to-index dictionary that maps each character to its corresponding index in ascending order. For example, “a” would have index 1 (since Python is a zero-indexed language and we’ll reserve index 0 for the EOS character “\n”) and “z” would have index 26. We will use this dictionary to convert names into lists of integers, where each letter will be represented as a one-hot vector.</li><li>Create an index-to-character dictionary that maps indices back to characters. This dictionary will be used to convert the output of the RNN model into characters, which will be assembled into names.</li><li>Initialize the parameters: weights are initialized to small random numbers from the standard normal distribution to break symmetry and make sure different hidden units learn different things, while biases are initialized to zeros.</li><li><em>W_hh</em>: weight matrix connecting the previous hidden state <em>h^{t - 1}</em> to the current hidden state <em>h^t</em>.</li><li><em>W_xh</em>: weight matrix connecting the input <em>x^t</em> to the hidden state <em>h^t</em>.</li><li><em>b</em>: hidden state bias vector.</li><li><em>W_hy</em>: weight matrix connecting the hidden state <em>h^t</em> to the output <em>o^t</em>.</li><li><em>c</em>: output bias vector.</li><li>Convert the input <em>x^t</em> and output <em>y^t</em> into one-hot vectors. The dimension of each one-hot vector is vocab_size x 1; everything is zero except for the entry at the index of the letter at (t), which is 1. In our case, <em>x^t</em> is the same as <em>y^t</em> shifted to the left, with x¹ = \vec{0}; starting from <em>t = 2</em>, <em>x^{t + 1} = y^t</em>. For example, if we use “imad” as the input, then y = [3, 4, 1, 2, 0] while x = [\vec{0}, 3, 4, 1, 2]. Notice that x¹ = \vec{0} and not the index 0. Moreover, we’re using “\n” as the EOS (end of sentence/name) marker for each name, so that the RNN learns “\n” like any other character. This helps the network learn when to stop generating characters. Therefore, the last target character for all names will be “\n”, representing the end of the name.</li><li>Compute the hidden state using the following formula:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/548/1*VGxIVHxoELp5mtiKq21htg.png" /></figure><p>Notice that we use the hyperbolic tangent as the non-linear function. One of the main advantages of the hyperbolic tangent function is that it resembles the identity function near zero.</p><ul><li>Compute the output layer using the following formula:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*4YJcPOagIscnXJER5pErqQ.png" /></figure><ul><li>Pass the output through a softmax layer to normalize it, which allows us to express it as a probability distribution, i.e. all outputs are between 0 and 1 and sum to 1. Below is the softmax formula:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/430/1*3JnHr9BQYkKWFPe8cp-xFA.png" /></figure><p>The softmax layer has the same dimension as the output layer, vocab_size x 1. As a result, <em>y^t[i]</em> is the probability of index <em>i</em> being the next character at time step (t).</p>
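<p>A numpy sketch of one time step of the forward pass, following the formulas above (the function name and signature are mine; W_hh, W_xh, W_hy, b, c match the parameter list):</p><pre>import numpy as np<br><br>def rnn_step_forward(x_t, h_prev, W_hh, W_xh, W_hy, b, c):<br>    # Hidden state: tanh non-linearity over the recurrent and input contributions<br>    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b)<br>    # Output layer, then a numerically stable softmax<br>    o_t = W_hy @ h_t + c<br>    e = np.exp(o_t - o_t.max())<br>    p_t = e / e.sum()  # p_t[i] = probability of character i being next<br>    return h_t, p_t</pre>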
<ul><li>As mentioned before, the objective of a character-level language model is to minimize the negative log-likelihood of the training sequence. Therefore, the loss function at time step (t) and the total loss across all time steps are:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/532/1*3R0RyI6VOPGrtaQYjs7r1g.png" /></figure><p>Since we’ll be using SGD, the loss will be noisy and have many oscillations, so it’s good practice to smooth it out using an exponentially weighted average.</p><ul><li>Pass the target character <em>y^t</em> as the next input <em>x^{t + 1}</em> until we finish the sequence.</li></ul><h3>Backpropagation</h3><p>With RNN-based models, the gradient-based technique used is called <strong>Backpropagation Through Time (BPTT)</strong>. We start at the last time step T and backpropagate the loss function w.r.t. all parameters across all time steps, summing the gradients up (see figure 3).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nP68E5zsKzkrIi4R.png" /><figcaption><strong>Figure 3:</strong> Backpropagation Through Time (BPTT)</figcaption></figure><p>In addition, since RNNs are known to have steep cliffs (sudden steep decreases in <em>L</em>), gradients may overshoot the minimum and undo a lot of the work that was done, even when using adaptive learning methods such as RMSProp. The reason is that the gradient is a linear approximation of the loss function and may not capture information beyond the point it was evaluated at, such as the curvature of the loss surface. Therefore, it’s common practice to clip the gradients to the interval [-maxValue, maxValue]. For this exercise, we’ll clip the gradients to the interval [-5, 5]: if a gradient component is &gt; 5 or &lt; -5, it is clipped to 5 or -5 respectively. Below are all the formulas needed to compute the gradients w.r.t. all parameters at all time steps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/981/1*Z7okk9GaQ4gpVZeYePIpBA.png" /></figure><p>Note that at the last time step <em>T</em>, we initialize <em>dh_next</em> to zeros, since we can’t get values from the future. To stabilize the updates at each time step, given that SGD may have many oscillations, we’ll use one of the adaptive learning rate optimizers. More specifically, we’ll use <a href="https://nbviewer.jupyter.org/github/ImadDabbura/Deep-Learning/blob/master/notebooks/Optimization-Algorithms.ipynb">Root Mean Squared Propagation (RMSProp)</a>, which tends to have acceptable performance.</p>
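<p>Clipping itself is a one-liner per gradient in numpy; a small self-contained sketch (the helper name is mine):</p><pre>import numpy as np<br><br>def clip_gradients(grads, max_value=5.0):<br>    # Clip every gradient in place to [-max_value, max_value] before the update<br>    for grad in grads:<br>        np.clip(grad, -max_value, max_value, out=grad)<br>    return grads</pre>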
<h3>Sampling</h3><p>Sampling is what makes the text generated by the RNN at each time step interesting and creative. On each time step (t), the RNN outputs the conditional probability distribution of the next character given all the previous characters, i.e. <em>P(c_t/c_1, c_2, …, c_{t-1})</em>. Let’s assume that we are at time step <em>t = 3</em> and we’re trying to predict the third character, with conditional probability distribution <em>P(c_3/c_1, c_2) = (0.2, 0.3, 0.4, 0.1)</em>. We have two extremes:</p><ol><li>Maximum entropy: the character is picked randomly using a uniform probability distribution, which means that all characters in the vocabulary dictionary are equally likely. We end up with maximum randomness in picking the next character, and the generated text will be neither meaningful nor real-sounding.</li><li>Minimum entropy: the character with the highest conditional probability is picked on each time step. That means the next character is whatever the model estimates to be the right one based on the training text and the learned parameters. As a result, the names generated will be both meaningful and real-sounding, but also repetitive and not very interesting, since all the parameters were optimized to learn the joint probability distribution for predicting the next character.</li></ol><p>As we increase randomness, the text loses local structure; as we decrease it, the generated text sounds more real and preserves its local structure better. For this exercise, we will sample from the distribution generated by the model, which can be seen as an intermediate level of randomness between maximum and minimum entropy (see figure 4). Using this sampling strategy on the above distribution, index 0 has a <em>20%</em> probability of being picked, while index 2 has a <em>40%</em> probability of being picked.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tPDYjf-hs8AwAD6J.png" /><figcaption><strong>Figure 4:</strong> Sampling: an example of predicting the next character using a character-level language model</figcaption></figure><p>Sampling will therefore be used at test time to generate names character by character.</p>
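<p>Sampling from the model’s distribution is a single numpy call (a sketch; p_t stands for the softmax output at the current time step, using the example distribution from the text):</p><pre>import numpy as np<br><br>rng = np.random.default_rng()<br>p_t = np.array([0.2, 0.3, 0.4, 0.1])<br>next_idx = rng.choice(len(p_t), p=p_t)  # index 2 is drawn 40% of the time</pre>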
<h3>Fitting the model</h3><p>Having covered all the concepts and intuitions behind the character-level language model, we’re now ready to fit it. We’ll use the default settings for RMSProp’s hyperparameters and run the model for 100 iterations. On each iteration, we’ll print out one sampled name and the smoothed loss, to see how the generated names get more interesting with more iterations and how the loss decreases. When done fitting the model, we’ll plot the loss function and generate some names.</p><p>Below is some of the output generated during training:</p><pre>There are 36121 characters and 27 unique characters.<br><br>Epoch 0<br>=======<br>Sampled name: Nijqikkgzst<br>Smoothed loss: 23.0709</pre><pre>Epoch 10<br>=======<br>Sampled name: Milton<br>Smoothed loss: 14.7446</pre><pre>Epoch 30<br>=======<br>Sampled name: Dangelyn<br>Smoothed loss: 13.8179</pre><pre>Epoch 70<br>=======<br>Sampled name: Lacira<br>Smoothed loss: 13.3782</pre><pre>Epoch 99<br>=======<br>Sampled name: Cathranda<br>Smoothed loss: 13.3380</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/436/1*j42RZoD6mo5d4ZDV5lxItw.png" /><figcaption>Figure 6: Smoothed loss</figcaption></figure><p>The names that were generated started to get more interesting after 15 epochs. I didn’t include the results of all epochs, to shorten the post; however, you can check the results in the <a href="https://nbviewer.jupyter.org/github/ImadDabbura/blog-posts/blob/master/notebooks/Character-LeveL-Language-Model.ipynb">notebook</a> associated with this post. One of the interesting names is “Yasira”, which is an Arabic name :).</p><h3>Conclusion</h3><p>Statistical language models are crucial in Natural Language Processing (NLP) tasks such as speech recognition and machine translation. In this post we demonstrated the main concepts behind statistical language models using a character-level language model. The task of this model is to generate names character by character, using names obtained from census data consisting of 5,163 names. Below are the main takeaways:</p><ul><li>If we have more data, a bigger model, and train longer, we may get more interesting results. However, to get really interesting results, we should instead use a <strong>Long Short-Term Memory (LSTM)</strong> model more than one layer deep. People have used 3-layer-deep LSTM models with dropout and were able to generate very interesting results when applied to cookbooks and Shakespeare poems. LSTM models outperform simple RNNs due to their ability to capture longer time dependencies.</li><li>With the sampling technique we’re using, don’t expect the RNN to generate meaningful sequences of characters (names).</li><li>In this post we used each name as its own sequence; however, we may be able to speed up learning and get better results if we increase the batch size, say, from one name to a sequence of 50 characters.</li><li>We can control the level of randomness using the sampling strategy. Here, we balanced between what the model thinks is the right character and the level of randomness.</li></ul><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/character-language-model/Character-LeveL-Language-Model.html"><em>imaddabbura.github.io</em></a><em> on February 22, 2018.</em></p><hr><p><a href="https://medium.com/data-science/character-level-language-model-1439f5dd87fe">Character-Level Language Model</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Gradient Descent Algorithm and Its Variants]]></title>
            <link>https://medium.com/data-science/gradient-descent-algorithm-and-its-variants-10f652806a3?source=rss-793eeb87d3ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/10f652806a3</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Imad Dabbura]]></dc:creator>
            <pubDate>Thu, 21 Dec 2017 18:54:51 GMT</pubDate>
            <atom:updated>2022-09-27T12:24:38.292Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*70f9PB-RwFaakqD6lfp4iw.png" /><figcaption><strong>Figure 1:</strong> Trajectory towards local minimum</figcaption></figure><p><strong>Optimization</strong> refers to the task of minimizing/maximizing an objective function <em>f(x)</em> parameterized by <em>x</em>. In machine/deep learning terminology, it’s the task of minimizing the cost/loss function <em>J(w)</em> parameterized by the model’s parameters <em>w</em> ∈ R^d. Optimization algorithms (in the case of minimization) have one of the following goals:</p><ul><li>Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.</li><li>Find the lowest possible value of the objective function within its neighborhood. This is usually the case when the objective function is not convex, as in most deep learning problems.</li></ul><p>There are three kinds of optimization algorithms:</p><ul><li>Algorithms that are not iterative and simply solve for one point.</li><li>Algorithms that are iterative in nature and converge to an acceptable solution regardless of the parameter initialization, such as gradient descent applied to logistic regression.</li><li>Algorithms that are iterative in nature and are applied to problems with non-convex cost functions, such as neural networks, where parameter initialization plays a critical role in speeding up convergence and achieving lower error rates.</li></ul><p><strong>Gradient Descent</strong> is the most common optimization algorithm in <em>machine learning</em> and <em>deep learning</em>. It is a first-order optimization algorithm, meaning it only takes the first derivative into account when performing updates on the parameters. On each iteration, we update the parameters in the direction opposite to the gradient of the objective function <em>J(w)</em> w.r.t. the parameters, since the gradient gives the direction of steepest ascent. The size of the step we take on each iteration towards the local minimum is determined by the learning rate α. Therefore, we follow the direction of the slope downhill until we reach a local minimum.</p><p>In this article, we’ll cover the gradient descent algorithm and its variants: <em>Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent</em>.</p><p>Let’s first see how gradient descent works on logistic regression before going into the details of its variants. For the sake of simplicity, let’s assume that the logistic regression model has only two parameters: a weight <em>w</em> and a bias <em>b</em>.</p><p>1. Initialize the weight <em>w</em> and bias <em>b</em> to any random numbers.</p><p>2. Pick a value for the learning rate α. The learning rate determines how big each step is on each iteration.</p><ul><li>If α is very small, it takes a long time to converge and becomes computationally expensive.</li><li>If α is large, it may fail to converge and overshoot the minimum.</li></ul><p>Therefore, plot the cost function against different values of α and pick the value of α that is right before the first value that didn’t converge, so that we get a fast learning algorithm that still converges (see figure 2; a sketch of this scan follows below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/1*rcmvCjQvsxrJi8Y4HpGcCw.png" /><figcaption><strong>Figure 2:</strong> Gradient descent with different learning rates. <a href="http://cs231n.github.io/neural-networks-3/">Source</a></figcaption></figure><ul><li>The most commonly used rates are: <em>0.001, 0.003, 0.01, 0.03, 0.1, 0.3</em>.</li></ul>
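<p>A minimal sketch of that scan, reusing the pseudocode conventions of this post (init_params and compute_cost are hypothetical helpers, like compute_gradient below):</p><pre>candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]<br>cost_histories = {}<br>for alpha in candidate_alphas:<br>    params = init_params()<br>    costs = []<br>    for _ in range(200):<br>        grad = compute_gradient(data, params)<br>        params = params - alpha * grad<br>        costs.append(compute_cost(data, params))<br>    cost_histories[alpha] = costs  # plot each curve; pick the largest alpha that still converges</pre>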
<a href="http://cs231n.github.io/neural-networks-3/">Source</a></figcaption></figure><ul><li>The most commonly used rates are : <em>0.001, 0.003, 0.01, 0.03, 0.1, 0.3</em>.</li></ul><p>3. Make sure to scale the data if it’s on a very different scales. If we don’t scale the data, the level curves (contours) would be narrower and taller which means it would take longer time to converge (see figure 3).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vXpodxSx-nslMSpOELhovg.png" /><figcaption><strong>Figure 3:</strong> Gradient descent: normalized versus unnormalized level curves.</figcaption></figure><p>Scale the data to have μ = 0 and σ = 1. Below is the formula for scaling each example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/573/1*2g6dhidPigWEuAFyNHL8iw.png" /></figure><p>4. On each iteration, take the partial derivative of the cost function <em>J(w)</em> w.r.t each parameter (gradient):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*WmuFVQbceFdNKO2Usl_O7A.png" /></figure><p>The update equations are:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VDTl0P6ongCcM0AgDPUR_g.png" /></figure><ul><li>For the sake of illustration, let’s assume we don’t have bias. If the slope of the current value of <em>w &gt; 0</em>, this means that we are to the right of optimal <em>w*</em>. Therefore, the update will be negative, and will start getting close to the optimal values of <em>w</em>*. However, if it’s negative, the update will be positive and will increase the current values of <em>w</em> to converge to the optimal values of <em>w</em>*(see figure 4):</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jNyE54fTVOH1203IwYeNEg.png" /><figcaption><strong>Figure 4:</strong> Gradient descent. An illustration of how gradient descent algorithm uses the first derivative of the loss function to follow downhill it’s minimum.</figcaption></figure><ul><li>Continue the process until the cost function converges. That is, until the error curve becomes flat and doesn’t change.</li><li>In addition, on each iteration, the step would be in the direction that gives the maximum change since it’s perpendicular to level curves at each step.</li></ul><p>Now let’s discuss the three variants of gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity to perform each parameter’s update (learning step).</p><h3>Batch Gradient Descent</h3><p>Batch Gradient Descent is when we sum up over all examples on each iteration when performing the updates to the parameters. Therefore, for each update, we have to sum over all examples:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/654/1*nu5id-pd3BNCl1KktBxP4g.png" /></figure><pre>for i in range(num_epochs):<br>    grad = compute_gradient(data, params)<br>    params = params — learning_rate * grad</pre><p>The main advantages:</p><ul><li>We can use fixed learning rate during training without worrying about learning rate decay.</li><li>It has straight trajectory towards the minimum and it is guaranteed to converge in theory to the global minimum if the loss function is convex and to a local minimum if the loss function is not convex.</li><li>It has unbiased estimate of gradients. 
<ul><li>For the sake of illustration, let’s assume we don’t have a bias. If the slope at the current value of <em>w</em> is &gt; 0, we are to the right of the optimal <em>w*</em>, so the update will be negative and <em>w</em> will start getting closer to <em>w*</em>. However, if the slope is negative, the update will be positive and will increase the current value of <em>w</em>, converging towards <em>w*</em> (see figure 4):</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jNyE54fTVOH1203IwYeNEg.png" /><figcaption><strong>Figure 4:</strong> Gradient descent. An illustration of how the gradient descent algorithm uses the first derivative of the loss function to follow it downhill to its minimum.</figcaption></figure><ul><li>Continue the process until the cost function converges, i.e. until the error curve becomes flat and doesn’t change.</li><li>In addition, on each iteration the step is in the direction of maximum change, since it’s perpendicular to the level curves at each step.</li></ul><p>Now let’s discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradient for each learning step. The trade-off is the accuracy of the gradient versus the time complexity of performing each parameter update (learning step).</p><h3>Batch Gradient Descent</h3><p>In Batch Gradient Descent we sum over all examples on each iteration when performing the parameter updates. Therefore, for each update, we have to sum over all examples:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/654/1*nu5id-pd3BNCl1KktBxP4g.png" /></figure><pre>for i in range(num_epochs):<br>    grad = compute_gradient(data, params)<br>    params = params - learning_rate * grad</pre><p>The main advantages:</p><ul><li>We can use a fixed learning rate during training without worrying about learning rate decay.</li><li>It has a straight trajectory towards the minimum, and in theory it is guaranteed to converge to the global minimum if the loss function is convex, and to a local minimum if it is not.</li><li>It has an unbiased estimate of the gradients. The more examples, the lower the standard error.</li></ul><p>The main disadvantages:</p><ul><li>Even with a vectorized implementation, it may be slow to go over all examples, especially on large datasets.</li><li>Each learning step happens only after going over all examples, some of which may be redundant and contribute little to the update.</li></ul><h3>Mini-batch Gradient Descent</h3><p>Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller number of examples given by the batch size. Therefore, learning happens on each mini-batch of <em>b</em> examples:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/1*9AlSFxKa6iyqR7qqlOQSQQ.png" /></figure><ul><li>Shuffle the training data set to avoid pre-existing ordering of the examples.</li><li>Partition the training data set into mini-batches of <em>b</em> examples each. If the training set size is not divisible by the batch size, the remaining examples form their own, smaller batch.</li></ul><pre>for i in range(num_epochs):<br>    np.random.shuffle(data)<br>    for batch in random_minibatches(data, batch_size=32):<br>        grad = compute_gradient(batch, params)<br>        params = params - learning_rate * grad</pre>
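<p>The random_minibatches helper used above might be sketched as follows (an assumption on my part; since the data is shuffled before each epoch, the helper only needs to slice):</p><pre>def random_minibatches(data, batch_size=32):<br>    # Yield consecutive mini-batches from the (already shuffled) data;<br>    # the last batch may be smaller than batch_size<br>    for start in range(0, len(data), batch_size):<br>        yield data[start:start + batch_size]</pre>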
<p>The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, 512, etc. The reason is that some hardware, such as GPUs, achieves better run time with batch sizes that are powers of 2.</p><p>The main advantages:</p><ul><li>Faster than the Batch version, because each update goes through far fewer examples than all of them.</li><li>Randomly selecting examples helps avoid redundant or very similar examples that contribute little to the learning.</li><li>With batch size &lt; size of the training set, it adds noise to the learning process that helps improve the generalization error.</li><li>Even though more examples give an estimate with a lower standard error, the return is less than linear compared to the computational burden we incur.</li></ul><p>The main disadvantages:</p><ul><li>It won’t converge exactly. On each iteration, the learning step may go back and forth due to the noise, so it wanders around the minimum region but never settles.</li><li>Due to the noise, the learning steps have more oscillations (see figure 5) and require adding learning rate decay to decrease the learning rate as we get closer to the minimum.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5mHkZw3FpuR2hBNFlRxZ-A.png" /><figcaption><strong>Figure 5:</strong> Gradient descent: batch versus mini-batch loss function</figcaption></figure><p>With large training datasets, we don’t usually need more than 2–10 passes over all training examples (epochs). Note: with batch size <em>b = m</em> (the number of training examples), we get Batch Gradient Descent.</p><h3>Stochastic Gradient Descent</h3><p>Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameter update on each example (x^i, y^i). Therefore, learning happens on every example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/681/1*ecOF_YWCDJE9GFcEzKzlPw.png" /></figure><ul><li>Shuffle the training data set to avoid pre-existing ordering of the examples.</li><li>Loop over the <em>m</em> training examples one at a time.</li></ul><pre>for i in range(num_epochs):<br>    np.random.shuffle(data)<br>    for example in data:<br>        grad = compute_gradient(example, params)<br>        params = params - learning_rate * grad</pre><p>It shares most of its advantages and disadvantages with the mini-batch version. Below are the ones specific to SGD:</p><ul><li>It adds even more noise to the learning process than mini-batch, which helps improve the generalization error but also increases the run time.</li><li>We can’t use vectorization over a single example, so training becomes very slow. Also, the variance is large, since we only use one example for each learning step.</li></ul><p>Below is a graph that shows the gradient descent variants and their trajectories towards the minimum:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/908/1*PV-fcUsNlD9EgTIc61h-Ig.png" /><figcaption><strong>Figure 6:</strong> Gradient descent variants’ trajectories towards the minimum</figcaption></figure><p>As the figure above shows, the SGD direction is very noisy compared to mini-batch.</p><h3>Challenges</h3><p>Below are some challenges regarding the gradient descent algorithm in general, as well as its variants (mainly batch and mini-batch):</p><ul><li>Gradient descent is a first-order optimization algorithm, which means it doesn’t take into account the second derivatives of the cost function. However, the curvature of the function affects the size of each learning step: the gradient measures the steepness of the curve, while the second derivative measures its curvature. Therefore, if:</li></ul><ol><li>The second derivative = 0 → the function is locally flat (linear), and a step behaves as the learning rate α predicts.</li><li>The second derivative &gt; 0 → the function curves upward, so the cost decreases more slowly than the gradient predicts, and too large a step may overshoot the minimum and even lead to divergence.</li><li>The second derivative &lt; 0 → the function curves downward, so the cost decreases faster than the gradient predicts.</li></ol><p>As a result, a direction that looks promising to the gradient may not actually be so, which may slow the learning process or even cause divergence.</p><ul><li>If the Hessian matrix has a poor condition number, i.e. the direction of most curvature has much more curvature than the direction of least curvature, the cost function will be very sensitive in some directions and insensitive in others. This makes things harder for the gradient, because a direction that looks promising to the gradient may not lead to big changes in the cost function (see figure 7).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/751/1*xoAZP424rr74lzYxPXxk0Q.png" /><figcaption><strong>Figure 7:</strong> Gradient descent fails to exploit the curvature information contained in the Hessian matrix. <a href="http://www.deeplearningbook.org/contents/numerical.html">Source</a></figcaption></figure><ul><li>The norm of the gradient g^Tg is supposed to decrease slowly with each learning step, because the curve gets flatter and its steepness decreases. However, we may instead see the norm of the gradient increasing because of the curvature of the curve. Nonetheless, even when the gradient norm is increasing, we are able to achieve very low error rates (see figure 8).</li></ul>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1017/1*8_u3K1hsxLveHuOVKBPG4A.png" /><figcaption><strong>Figure 8:</strong> Gradient norm. <a href="http://www.deeplearningbook.org/contents/optimization.html">Source</a></figcaption></figure><ul><li>In low dimensions, local minima are common; in high dimensions, however, saddle points are more common. A saddle point is where the function curves up in some directions and down in others; in other words, it looks like a minimum from one direction and a maximum from another (see figure 9). This happens when at least one eigenvalue of the Hessian matrix is negative and the rest are positive.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/773/1*6h_Q_BC_epUm6Fhoudpgew.png" /><figcaption><strong>Figure 9:</strong> Saddle point</figcaption></figure><ul><li>As discussed previously, choosing a proper learning rate is hard. Also, for mini-batch gradient descent, we have to adjust the learning rate during training to make sure it converges to the local minimum and doesn’t just wander around it. Figuring out the decay rate of the learning rate is also hard and changes with different data sets.</li><li>All parameter updates use the same learning rate; however, we may want to perform larger updates for some parameters whose directional derivatives are more in line with the trajectory towards the minimum than others.</li></ul><p><em>Originally published at </em><a href="https://imaddabbura.github.io/posts/optimization-algorithms/gradient-descent.html"><em>imaddabbura.github.io</em></a><em> on December 21, 2017.</em></p><hr><p><a href="https://medium.com/data-science/gradient-descent-algorithm-and-its-variants-10f652806a3">Gradient Descent Algorithm and Its Variants</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>