Deep learning series 1: Intro to deep learning

Dhanoop Karunakaran
Intro to Artificial Intelligence
12 min read · Apr 23, 2018

Currently, AI is advancing at a great pace, and deep learning is one of the main contributors to that. It is good to understand the basics of deep learning as they are changing the world we live in. This is the first article in the deep learning series; different deep learning models will be explained in coming articles in the series. The content is inspired by the Udacity deep learning course and some of the images are taken from the course. If you would like to learn deep learning in detail, I encourage you to enrol in Udacity’s course.

1. Deep Learning

Deep learning is a sub-field of machine learning dealing with algorithms inspired by the structure and function of the brain, called artificial neural networks. In other words, it mirrors the functioning of our brains. Deep learning algorithms are structured similarly to the nervous system, where each neuron is connected to other neurons and passes on information.

Deep learning models work in layers, and a typical model has at least three layers. Each layer accepts the information from the previous layer and passes it on to the next one.

Slide by Andrew Ng, all rights reserved.

Deep learning models tend to keep improving as the amount of data increases, whereas older machine learning models stop improving after a saturation point.

Source

One of the differences between machine learning and deep learning models is in the feature extraction area. Feature extraction is done by humans in machine learning, whereas a deep learning model figures out the features by itself.

2. Linear/Logistic Regression

We cannot start with deep learning without explaining linear and logistic regression, which are the basis of deep learning.

Linear regression

It is a statistical method that allows us to summarise and study relationships between two continuous (quantitative) variables.

Source: Udacity deep learning course

In this example, we have historical data based on the size of the house. We plot the data points on the graph, shown as dots. Linear regression is the technique of finding the straight line through these points with the least error (error will be explained later). Once we have a line with the least error, we can predict the house price based on the size of the house.

Source

Here is another, humorous example of how linear regression predicts.
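As a minimal sketch of the idea (with made-up house sizes and prices), a least-squares line fit in numpy looks like this:

import numpy as np

# Hypothetical data: house sizes (square metres) and prices
sizes = np.array([50, 75, 100, 125, 150])
prices = np.array([200000, 280000, 370000, 450000, 520000])

# Fit a straight line price = m * size + b with least squares
m, b = np.polyfit(sizes, prices, 1)

# Predict the price of a 110 square-metre house from the fitted line
print(m * 110 + b)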

Logistic regression

It is a statistical method for analysing a dataset in which there are one or more independent variables that determine an outcome, and the outcome can take only two possible values: True or False.

Source: Udacity deep learning course

In this example, we have a historical dataset of students who passed or did not pass based on their grades and test scores. If we need to know whether a student will pass based on grade and test score, logistic regression can be used. Similar to linear regression, logistic regression finds the best possible straight line that separates the two classes (passed and not passed).
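As a minimal sketch (with made-up grades and test scores; scikit-learn is used here purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [grade, test score] per student, 1 = passed, 0 = not passed
X = np.array([[9, 85], [4, 40], [8, 70], [3, 35], [7, 60], [2, 30]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# Predict whether a student with grade 6 and test score 65 will pass
print(model.predict([[6, 65]]))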

3. Activation Function

Activation functions are functions that decide, given the inputs into the node, what the node’s output should be. Because it’s the activation function that decides the actual output, we often refer to the outputs of a layer as its “activations”.

One of the simplest activation functions is the Heaviside step function. This function returns a 0 if the linear combination is less than 0. It returns a 1 if the linear combination is positive or equal to zero.

Source: Udacity deep learning course

The output unit returns the result of f(h), where h is the input to the output unit:
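h = w₁x₁ + w₂x₂ + … + b, the linear combination of the inputs, their weights, and the bias.

As a minimal sketch (with made-up weights and bias for the university example), a single node with the Heaviside step activation could look like this:

import numpy as np

def step_function(h):
    # Heaviside step: 1 if the linear combination is >= 0, otherwise 0
    return 1 if h >= 0 else 0

# Hypothetical inputs (test score, grade), weights, and bias
x = np.array([1.0, 8.0])
w = np.array([0.4, 0.6])
b = -4.0

h = np.dot(w, x) + b        # linear combination of inputs, weights, and bias
print(step_function(h))     # the node's output (activation)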

4. Weights

When input data comes into a neuron, it gets multiplied by a weight value that is assigned to this particular input. For example, the neuron in the university example above has two inputs, test scores and grades, so it has two associated weights that can be adjusted individually.

Use of weights

These weights start out as random values, and as the neural network learns more about what kind of input data leads to a student being accepted into a university, the network adjusts the weights based on any errors in categorization that the previous weights resulted in. This is called training the neural network.

Remember, we can associate the weight with m (the slope) in the original linear equation.

y = mx+b

5. Bias

Weights and biases are the learnable parameters of the deep learning models.

Bias is represented as b in the above linear equation.

Reference:

  1. https://www.quora.com/What-is-bias-in-artificial-neural-network

6. Neural Network

As explained above, deep learning is a sub-field of machine learning dealing with algorithms inspired by the structure and function of the brain, called artificial neural networks. I will explain here how we can construct a simple neural network from the example. In the above example, logistic regression is the technique used to separate the data with a single line. But most of the time we cannot classify the dataset using a single line with high accuracy.

Source: Udacity deep learning course

How about if we separate the data points with two lines?

In this case, we say anything below the blue line will be “No (not passed)” and anything above it will be “Yes (passed)”. Similarly, we say anything on the left side will be “No (not passed)” and anything on the right side “Yes (passed)”.

Source: Udacity deep learning course

Just as we have neurons in the nervous system, we can define each line as one neuron that, alongside the other neurons in the same layer, is connected to the neurons in the next layer. In this case we have two neurons that represent the two lines. The picture above is an example of a simple neural network where two neurons accept the input data, compute yes or no based on their own condition, and pass the results on to a second-layer neuron that combines them. For the specific input of test score 1 and grade 8, the output will be “Not passed”, which is accurate, whereas with logistic regression alone we may get “Passed”. To summarise, by using multiple neurons in different layers we can increase the accuracy of the model. This is the basis of a neural network.
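Here is a minimal sketch of this idea, with made-up line equations for the two boundaries (in a real network the lines would be learned from the data). Each first-layer neuron checks one line, and the second-layer neuron combines the two answers with a logical AND:

def step(h):
    return 1 if h >= 0 else 0

def passes(test, grade):
    # Hypothetical first-layer neurons: each checks which side of its line the point falls on
    above_horizontal_line = step(grade - 5)   # e.g. grade >= 5
    right_of_vertical_line = step(test - 3)   # e.g. test score >= 3

    # Second-layer neuron: both conditions must hold (logical AND)
    return step(above_horizontal_line + right_of_vertical_line - 2)

print(passes(test=1, grade=8))   # 0, i.e. "Not passed", matching the example above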

The diagram below shows a simple network. The linear combination of the weights, inputs, and bias form the input h, which passes through the activation function f(h), giving the final output, labeled y.

The nice thing about this architecture, and what makes neural networks possible, is that the activation function f(h) can be any function, not just the step function shown earlier.
For example, if you let f(h) = h, the output will be the same as the input. Now the output of the network is
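y = f(h) = Σᵢ wᵢxᵢ + b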

This equation should be familiar to you: it’s the same as the linear regression model!
Other activation functions you’ll see are the logistic (often called the sigmoid), tanh, and softmax functions.

sigmoid(x) = 1/(1 + e^(−x))

The sigmoid function is bounded between 0 and 1, and as an output can be interpreted as a probability for success. It turns out, again, using a sigmoid as the activation function results in the same formulation as logistic regression.

We can finally write the output of the simple neural network based on the sigmoid as below:
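y = f(h) = sigmoid(w₁x₁ + w₂x₂ + … + b)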

I will briefly touch on the learning process of neural networks later, but the in-depth details of how a particular model learns will be explained in coming articles in the series.

7. Other important concepts of neural networks

Training

Weights start out as random values, and as the neural network learns more about what kind of input data leads to a student being accepted into a university (the example above), the network adjusts the weights based on any errors in categorisation that the previous weights resulted in. This is called training the neural network. Once we have the trained network, we can use it to predict the output for similar inputs.

Error

This is a very important concept that defines how well a network is performing during training. In the training phase, the network makes use of the error value to adjust the weights so that the error is reduced at each step. The goal of the training phase is to minimise the error.

Mean Squared Error (MSE) is one of the popular error functions. It is a modified version of the Sum of Squared Errors (SSE).

Source: Udacity deep learning course

Or we can write MSE as:

MSE formula, Source: Udacity deep learning course
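As a quick sketch in numpy (with made-up targets and predictions):

import numpy as np

# Hypothetical true values and network predictions
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6, 0.8])

# Mean squared error: the average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)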

Forward Propagation

By propagating values from the first layer (the input layer) through all the mathematical functions represented by each node, the network outputs a value. This process is called a forward pass.

Code for implementing the forward propagation using numpy:

import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

# Initialise the weights with small random values
weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

# Make a forward pass through the network
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Output-layer Output:')
print(output_layer_out)

Gradient Descent

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimise a cost function. It is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm. In neural networks, gradient descent is used to find the minimum error by minimising the cost function.

In the university example (explained in the neural network section), the correct lines to divide the dataset were already defined. But how do we find the correct lines? As we know, weights are adjusted during the training process. Adjusting the weights enables each neuron to correctly divide the given dataset.

To figure out how we’re going to find these weights, start by thinking about the goal. We want the network to make predictions as close as possible to the real values. To measure this, we need a metric of how wrong the predictions are, the error. A common metric is the sum of the squared errors (SSE):
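E = (1/2) Σμ Σj [ y_j^μ − ŷ_j^μ ]²

(The factor of 1/2 is a common convention that simplifies the derivative later.)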

where ŷ is the prediction and y is the true value, and you take the sum over all output units j and another sum over all data points μ.

The SSE is a good choice for a few reasons. The square ensures the error is always positive and larger errors are penalized more than smaller errors. Also, it makes the math nice, always a plus.
Remember that the output of a neural network, the prediction, depends on the weights

and accordingly the error depends on the weights

We want the network’s prediction error to be as small as possible and the weights are the knobs we can use to make that happen. Our goal is to find weights wij that minimize the squared error E. To do this with a neural network, typically we use gradient descent.

With gradient descent, we take multiple small steps towards our goal. In this case, we want to change the weights in steps that reduce the error. Continuing the analogy, the error is our mountain and we want to get to the bottom. Since the fastest way down a mountain is in the steepest direction, the steps taken should be in the direction that minimizes the error the most.
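As a minimal sketch (a single sigmoid unit with made-up input, target, and starting weights), one gradient descent step looks like this:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical single data point, target, starting weights, and learning rate
x = np.array([1.0, 2.0, 3.0])
y = 0.8
weights = np.array([0.5, -0.5, 0.3])
learnrate = 0.5

# Forward pass
output = sigmoid(np.dot(weights, x))

# One gradient descent step: move the weights in the direction that reduces the squared error
error = y - output
error_term = error * output * (1 - output)   # the sigmoid derivative is output * (1 - output)
weights += learnrate * error_term * x

print(weights)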

Back Propagation

In neural networks, you forward propagate to get the output and compare it with the real value to get the error. Now, to minimise the error, you propagate backwards by finding the derivative of error with respect to each weight and then subtracting this value from the weight value. This is called back propagation.

Before, we saw how to update weights with gradient descent. The back propagation algorithm is just an extension of that, using the chain rule to find the derivative of the error with respect to the weights connecting the input layer to the hidden layer (for a two-layer network).

Here is the back propagation algorithm from Udacity:

Source: Udacity deep learning course

Code for implementing back propagation in numpy:

import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # Calculate the output
        hidden_input = np.dot(x, weights_input_hidden)
        hidden_output = sigmoid(hidden_input)

        output = sigmoid(np.dot(hidden_output,
                                weights_hidden_output))

        ## Backward pass ##
        # Calculate the network's prediction error
        error = y - output

        # Calculate the error term for the output unit
        output_error_term = error * output * (1 - output)

        ## Propagate errors to the hidden layer

        # Calculate the hidden layer's contribution to the error
        hidden_error = np.dot(output_error_term, weights_hidden_output)

        # Calculate the error term for the hidden layer
        hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)

        # Accumulate the change in weights
        del_w_hidden_output += output_error_term * hidden_output
        del_w_input_hidden += hidden_error_term * x[:, None]

    # Update weights
    weights_input_hidden += learnrate * del_w_input_hidden / n_records
    weights_hidden_output += learnrate * del_w_hidden_output / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(x, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, " WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Reference:

  1. https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
  2. https://www.youtube.com/watch?v=59Hbtz7XgjM

Regularisation

Regularisation is a technique used to solve the over-fitting problem. Over-fitting happens when the model becomes biased towards one type of dataset. There are different types of regularisation techniques; I think the most commonly used one is dropout.

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks.[1] The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.(Definition from wikipedia)

During training, randomly selected neurons are not considered. We can set the fraction of neurons to drop. Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and weight updates are not applied to those neurons on the backward pass. I think the best practice is to drop around 20% of the neurons.
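As a minimal numpy sketch of the idea (real frameworks provide a dropout layer for this; the scaling here is the common “inverted dropout” variant):

import numpy as np

def dropout(activations, drop_prob=0.2):
    # Randomly zero out a fraction of the activations during training
    mask = np.random.rand(*activations.shape) > drop_prob
    # Scale the surviving activations so their expected value stays the same (inverted dropout)
    return activations * mask / (1 - drop_prob)

hidden_output = np.array([0.2, 0.7, 0.5, 0.9, 0.1])
print(dropout(hidden_output, drop_prob=0.2))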

Reference:

  1. https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

Optimisation

Optimisation is the technique used to minimise the loss function of the network. There are different types of optimisation algorithms. However, gradient descent and its variants are the popular ones these days.

Deep learning series

  1. Deep learning series 2 — simple image classification using deep learning
  2. Deep learning series 3 — traffic sign detection self-driving car

Reference

  1. https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
  2. https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
  3. https://www.udacity.com/course/deep-learning-nanodegree--nd101

If you would like to see the code in action for a neural network in Python, visit my GitHub repo.
