Intro and Background
This is an introduction to Neural Networks, where I attempt to distill high-level Artificial Intelligence concepts into plain English and everyday analogies.
This guide is intended for those who have no understanding of the sub-field of Deep Learning and would like to get a broad overview. This guide is not meant to be comprehensive!
Before we begin, we need to see the bigger picture of Artificial Intelligence (AI) and where neural networks fit in. Neural networks fall under the umbrella of Deep Learning (DL), a subset of Machine Learning (ML), which is in turn a subset of AI. Data Science (DS) overlaps with some areas of AI, ML, and DL. The image to the left illustrates the overlaps.
What We Will Cover
- Why use Artificial Neural Networks (ANNs)
- Popular uses of Neural Networks in Deep Learning
- Popular Neural Networks
- Biological Neurons, Artificial Neural Network, Perceptron
- Inputs, weights, activation function, output
- Weight initialization, training, and bias
- Simple mathematical representations
- Machine Learning concepts: classification, regression, hyper-parameter optimization, decision boundary, one-hot encoding, integer encoding
What We Will Not Cover
- Optimization, Gradient Descent, Back Propagation
- Exploding and vanishing gradients
Why Use An Artificial Neural Network?
Is it because ANNs are cool? Yes, ANNs are pretty cool, but that’s not why we use them. We can use many different machine learning methods instead of neural-network-inspired models. Note that in many scenarios it may be more beneficial to use traditional machine learning models such as Logistic Regression or Decision Trees, as these models offer the ability to interpret the features (inputs) and see how important (or useless) each feature is.
So one reason to use ANNs in general is that they often deliver higher performance than traditional modeling methods, especially on complex tasks such as image recognition (also called object/image classification): identifying a cat vs. a dog (a single object). This is not to be confused with object detection, where the goal is to find the locations of multiple objects that may exist in an image and put a bounding box around each. Putting a bounding box around a recognized object is known as object localization.
What is meant by complex tasks is this: when the ability to express features explicitly is inherently difficult (e.g. nested if-else conditions and logic), an ANN shines. Without requiring thousands of human-labeled features, an ANN will find the features that are hidden and automatically determine which features are important. Neural networks by default automatically perform feature engineering, a part of machine learning that largely requires human intelligence and a bit of industry knowledge. In many ways ANNs behave like black boxes, so it can be difficult to interpret the kind of feature engineering that occurs.
Popular Uses of Neural Networks (in 2018)
As mentioned above, neural networks are widely used in image recognition tasks. Other popular uses of neural networks include:
- Speech recognition
- Natural language processing (NLP): sentiment analysis, text classification, machine translation, etc.
- Art/music generation
Popular Neural Networks
- Convolutional Neural Network (ConvNet, CNN) → used extensively in image recognition tasks.
- Recurrent Neural Network (RNN) → used extensively in sequence prediction. Can be used for predicting the next value in time series data or predicting the next word in NLP tasks. Has the downside of not being capable of retaining long-term dependencies (or memory) between inputs and outputs, and suffers from exploding and vanishing gradients when sequences are very long.
- Long Short-Term Memory (LSTM) → a type of RNN that retains long-term dependencies using a concept known as the cell state. Also largely mitigates the exploding and vanishing gradient problems.
- Gated Recurrent Unit (GRU) → a recurrent net based on LSTM but faster to train.
Neurons (Technical Content Begins Here)
Before we discuss neural networks at length, we need to begin with the Neuron, the most basic unit in a neural network. Notice that biological neurons have a web of tentacles (dendrites) going in all sorts of directions. More importantly, there is information going into the neuron (in) and information leaving the neuron (out). I am italicizing information here to emphasize that information is an abstraction which can represent biological signals or digital signals. The Perceptron was inspired by the biological neuron. Below, we will see how a perceptron behaves.
Next, we need to understand the perceptron. The perceptron builds on the idea of a biological neuron. Primarily, the perceptron is a way to mathematically represent a biological neuron. It seeks to mimic how neurons fire and connect to other neurons. The main idea of a perceptron is that it receives information and either fires (1) or it doesn’t (0). We will see why this matters shortly.
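To make the fire-or-not idea concrete, here is a minimal sketch of a perceptron in Python. The weights and bias are hypothetical values chosen by hand for illustration, not learned by training:

```python
def perceptron(inputs, weights, bias):
    # Weighted sum of inputs plus a bias term
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The perceptron either fires (1) or does not (0)
    return 1 if weighted_sum >= 0 else 0

# With these hand-picked weights the perceptron computes a logical AND:
# it only fires when both inputs are 1.
and_gate = [perceptron([a, b], [1, 1], -1.5) for a in (0, 1) for b in (0, 1)]
# and_gate is [0, 0, 0, 1]
```

Notice that the "decision" is nothing more than checking whether a weighted sum crosses a threshold.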
Inputs are simply any kind of feature being passed into a neuron. They could be 1s or 0s. They could be dogs or cats (we can represent dogs as 1 and cats as 0). Inputs can be anything; however, in many scenarios numbers are best suited, especially in neural networks.
I feel the need to elaborate on how cats and dogs could be represented in numerical terms such as 0s and 1s. This part is important, so pay attention. A categorical feature such as cat or dog can be expressed in numeric terms: for example, [0,1] can represent a cat feature and [1,0] a dog feature. Representing a categorical feature in numeric terms allows a neural network to perform numeric calculations, namely vector/matrix multiplication. Such a data transformation is known as one-hot encoding. Another form of data transformation is converting categorical features into integer values such as 1, 2, 3, and so forth. An example would be representing days of the week, i.e. Monday, Tuesday, Wednesday, as 1, 2, 3, and so forth. This is known as integer encoding. The main difference between one-hot encoding and integer encoding is that integer encoding assumes there exists some natural ordering between the values and one-hot encoding does not. In our dog and cat example, there is no natural ordering between the two categories. On the other hand, in the days-of-the-week example, there is a natural ordering and relationship between each day, so integer encoding may be the better representation. Outputs may be encoded in this manner as well. We will discuss outputs in more detail below.
What are weights? Weights can be understood as how strongly a particular piece of information (neuron) is connected to another neuron. Weights can be any real number: 0, 1, 2.5, -0.0003, you get the idea. Some information, actually a lot of information, is useless! We can simply assign such an input a corresponding weight of 0. This way, anything multiplied by 0 will be 0. The further a weight’s value is from zero, the larger the impact it will have on the final weighted sum. If an input is important, then we can multiply its value by a weight of 1 instead of 0. In the simplest illustration of a perceptron, the weights can be thought of as taking on either 0 or 1, though in general they are real-valued.
What is a weighted sum? A weighted sum is the summation of all inputs multiplied by their corresponding weights. Said differently, the inputs are multiplied by their respective weights and then added together. This means each input is multiplied by an individual weight value based upon the importance of the information. We will cover the math in more detail below.
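Here is that definition written out in Python, with made-up numbers purely for illustration. Note how the 0 weight zeroes out the useless input, exactly as described above:

```python
inputs = [1.0, 0.5, 2.0]
weights = [0.8, 0.0, -0.3]  # the 0 weight cancels out a useless input

# Multiply each input by its weight, then add everything together.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))
# 1.0*0.8 + 0.5*0.0 + 2.0*(-0.3) = 0.8 + 0.0 - 0.6 = 0.2
```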
What is an activation function? Simply put, we have gathered all the useful and useless information, and now we are forced to make some kind of determination, judgement, or decision. Do we move forward with this information, or do we not? Yes or no? 1 or 0? The step function, as shown in the image above, is a type of activation function that takes on a binary value of either 0 or 1. See the discussion below on decision boundaries to gain a deeper understanding of activation functions.
There are several different kinds of activation functions. The most commonly used activation functions are: Sigmoid, TanH (Hyperbolic Tangent), and ReLU (Rectified Linear Unit).
Sigmoid function: information that passes through a Sigmoid activation function can range from 0 to 1.
TanH function: information that passes through a TanH activation function can range from -1 to 1.
ReLU function: information that passes through a ReLU activation function will result in either 0 (for negative inputs) or the input value x itself (for x greater than or equal to 0).
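The four activation functions discussed so far can each be written in one line of Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes any input into (0, 1)

def tanh(x):
    return math.tanh(x)                 # squashes any input into (-1, 1)

def relu(x):
    return max(0.0, x)                  # 0 for negative inputs, x otherwise

def step(x):
    return 1 if x >= 0 else 0           # the perceptron's step function
```

Sketching them like this makes the ranges obvious: `sigmoid(0)` is exactly 0.5 (the middle of its 0-to-1 range), `tanh(0)` is 0 (the middle of its -1-to-1 range), and `relu` passes positive values straight through.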
Currently, in 2018, the most widely used activation function is ReLU. A major advantage of using ReLU over Sigmoid or TanH is that ReLU mitigates the vanishing gradient problem (which is outside the scope of this article). A disadvantage of using ReLU is dead activation units, which you can read more about here. Since a discussion of dead activation units requires an understanding of gradients, I will briefly say that the effect is similar to when weights are multiplied by 0.
A reason we use the Sigmoid, TanH, and ReLU functions is that they are differentiable. This proves useful mainly because we want a function whose derivative we can compute, which allows us to find the minimum of a loss/cost (error) function. (This is the topic of gradient descent and optimization, which is outside the scope of this article.)
I need to briefly explain why the Sigmoid, TanH, and ReLU activation functions are used. The purpose an activation function serves is task-specific. In other words, if the task at hand is classification, then the whole goal of the neural network is to produce some signal telling us whether an image is a cat or a dog, for example. In order for our network to decide whether information is important in determining whether an image is a cat or a dog, it needs a decision boundary. This is an important and useful concept because we want a function that can decide where different categories of data can be separated by a line, a plane, or a surface. In the case of a Sigmoid function, we may assign 0 to be one category and 1 to be another. Similarly, in the case of TanH, we may assign -1 to be one category and 1 to be another. With either function, we are trying to find the decision boundary that separates the data points as accurately as possible.
Notice how a line can be drawn separating the blue circles from the red crosses? This is known as a decision boundary. Given enough information, a neural network can learn a decision boundary with a great degree of accuracy.
Okay, so what is a Neural Network? Let’s take a look at the image above. We see an input layer, a hidden layer, and an output layer. A neural network is simply an interconnected web of neurons (input, hidden, and output layers, etc.) sending signals to other neurons. Those signals are scaled by weights that represent how important a connection between two neurons is. A neural network is said to be fully-connected (FC) if all the neurons in one layer are connected to each of the neurons in the next layer, which is what the image above depicts.
Hidden layers are essentially layers of neurons that help pass information from the input layer to the output layer. Each hidden layer of neurons contains an activation function that ties it to the next layer. Such an architecture is known as a feed-forward network because information flows forward from each layer to the next until it reaches the output layer.
The image above is also known as a Multi-layer Perceptron (MLP), where there is at least a single hidden layer between the input and output layers.
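A forward pass through a small fully-connected MLP can be sketched in plain Python. All the weights and biases below are made-up numbers for illustration; a trained network would have learned them:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully-connected layer: weighted sum + bias, then activation."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

# Illustrative network: 2 inputs -> 3 hidden neurons -> 1 output.
x = [0.5, -1.0]
w_hidden = [[0.1, 0.4], [-0.2, 0.3], [0.7, -0.5]]  # 3 neurons x 2 inputs
b_hidden = [0.0, 0.1, -0.1]
w_out = [[0.6, -0.4, 0.2]]                         # 1 neuron x 3 inputs
b_out = [0.05]

hidden = dense(x, w_hidden, b_hidden)   # hidden layer activations
output = dense(hidden, w_out, b_out)    # a single value in (0, 1)
```

Each layer is just the weighted-sum-plus-activation pattern from the perceptron, applied to every neuron and chained layer to layer.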
By way of analogy, we’ll use cities, roads, and bridges. An important and large city will have tons of roads and bridges connecting it to other cities. Each of those cities will be connected to other cities, bridges, and so forth. Some roads will inevitably be more traveled, and some less. The more traveled a road, and the more cars on the road, the wider the roadway will become. This is to say that the weight of an important roadway will likely be greater than the weight of a road less traveled. Just look at the image to the left to get an idea of how important this road must be.
Similarly, weights that are large are like thick axons connecting two neurons together.
See how thick the connection is between xi,1 and f1,1? Such a thick connection will have a higher importance (and greater weight) when compared to the other two connections, xi,1 with f1,2 and f1,3. We can see that all the inputs are fully connected with each of the neurons in the two hidden layers, which are finally connected to the output layer. We can build any number of hidden layers, each with a varying number of neurons. All this is to say that there exists some optimal architecture of neurons and layers between the input and output that we can find! This machine learning process is known as hyper-parameter tuning. By treating the number of neurons and the number of layers in an MLP as parameters, we can search for the optimal values that yield the best performance and minimize error.
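The simplest form of hyper-parameter tuning is a grid search over candidate architectures. In the sketch below, `evaluate` is a hypothetical stand-in: in practice it would train an MLP with the given number of layers and neurons and return its validation error. Here it is a dummy function so the search logic itself is runnable:

```python
from itertools import product

def evaluate(n_layers, n_neurons):
    # Stand-in for "train the network and measure its error".
    # This dummy version pretends 2 layers of 16 neurons is optimal.
    return abs(n_layers - 2) + abs(n_neurons - 16) / 16.0

# Try every combination of layer count and neurons per layer,
# and keep the architecture with the lowest error.
grid = product([1, 2, 3], [8, 16, 32])
best = min(grid, key=lambda cfg: evaluate(*cfg))
# best holds the (layers, neurons) pair with the lowest error
```

Grid search is brute force; fancier strategies (random search, Bayesian optimization) exist, but the goal is the same: find the architecture that minimizes error.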
The ‘xi’ denotes the input of information. The ‘f’ denotes the activation function. The ‘yi’ with a hat over it denotes the predicted output that our network produces based on the weights of each connected neuron. The ‘i’ subscript denotes each value in a vector(list) of information. An input could be the following values [car, red, 1997, 3500lbs] and the output may be Honda Civic. This input could be the first value x1, second value x2, or the ith value xi. Again, car, red and Honda Civic would be encoded numerically.
I have left out the notation for weights in between each connection as doing so would easily get messy.
Weight Initialization, Update, and Bias
This is a good time for me to mention weight initialization. Weight initialization is kind of like first impressions. When we meet other individuals for the first time, we form an impression that we later update every time we gain new information about them (and the world around us). As we train the neural network, we likewise incrementally update our weights.
Upon meeting a well-dressed individual, we may assume that the individual is wealthy, to which we assign a high initial weight. As time goes on, we learn more information. If new information agrees with our initial assessment/judgement of the individual, we increase the initial weight in the original direction. Alternatively, if we find out that the individual dresses well but squanders everything they earn, we update our initial impression in the opposite direction, decreasing the weight we initially assigned. This is how neurons in a neural network behave. Every connected neuron stores information that is incrementally updated as the neural network learns through training. This update/training process is called back-propagation and uses variations of the gradient descent algorithm. I will leave a link below for more in-depth coverage of this topic.
As you can see, forming opinions and conclusions from given information can carry some form of bias. For example, if a network has seen mostly orange-pigmented dogs, it may lean towards predicting dog whenever an image contains more orange pixels. In reality, the orange pixels of a cat or dog may give no information about whether the image is a cat or a dog. This bias is incrementally updated during back-propagation as well.
Weight initialization is actually a bit more complicated than originally explained; a better analogy behaves more like this. Continuing the first impressions example, suppose each person in the United States represents a neuron. Each person has their own opinion and view of a presidential candidate; this we will represent as the first hidden layer. The second hidden layer can be represented by the House of Representatives, the third hidden layer by the Senate, and finally the output selects the winning candidate in a presidential election. You see, if the first layer of neurons is every person in the nation, there is no possible way for everyone to agree on who is best fit to take the highest office. Moreover, it would be terrible if every single person in the United States agreed on the same candidate. Likewise, it would be terrible if everyone in the House of Representatives and the Senate belonged to the same political party. This is precisely what weight initialization in a neural network seeks to accomplish: to make sure that there is enough randomness and that the weights do not all start at the same initial value. In mathematical and statistical terms, we want variability. But why? Remember when we discussed dead activation units? If all the weights are 0, there is no information to be learned in the network. Similarly, if all the weights are 1, there is again little for the network to learn. To fully appreciate why this is important, a discussion of gradient descent is required. This analogy of humans as neurons falls apart in the context of gradient descent, but we haven’t discussed gradient descent at length, so it isn’t a problem.
In summary, weight initializations are kind of like first impressions, but in a neural network, they should have a high degree of variability, otherwise the network won’t learn anything during the update process.
Recall that the weights (Wi) are multiplied by the inputs (xi). The weighted sum is then added to a bias term (b), a constant value. Notice how this function f(x) resembles y = mx + b? Curious, isn’t it? The bias term simply shifts the weighted sum in a fixed direction. f(x) here denotes the summation of the weights times the inputs, plus the bias term.
Let’s tie up everything we’ve learned so far into a simple mathematical representation of inputs, weights and biases.
The f(x) equation to the left passes through an activation function (as outlined above). We can use the Sigmoid function, for example, as g(x) and chain the equations together algebraically as g(f(x)).
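The chaining g(f(x)) looks like this in code, with illustrative numbers only:

```python
import math

def f(inputs, weights, bias):
    # Weighted sum plus bias: resembles y = mx + b
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def g(z):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# g(f(x)): the weighted sum feeds straight into the activation.
out = g(f([1.0, 2.0], [0.5, -0.25], 0.1))
# f = 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1, so out = sigmoid(0.1), roughly 0.525
```

This single composition, a linear function followed by a nonlinear activation, is the building block that entire networks repeat layer after layer.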
Why do we even train a neural network or update weights anyway? Well, for the most part, we would like to make determinations about information. For example, if we would like to know whether an image is a cat or a dog, our neural network should be able to take in some features (inputs) and tell us whether the image is a cat or a dog (output). Moreover, we should be able to mathematically quantify how accurate our neural network model is (performance metric: accuracy, mean squared error, etc.). Additionally, we would like to minimize error (loss) by learning and updating the weights of the network, a topic of Gradient Descent. The more accurate our model, the less error we have and the better our neural network model is. When our neural network generates an output from a fixed set of classes, we call this classification. When it generates outputs in a real number space, we call this regression. An example of a regression model would be forecasting unit sales in a given month, or predicting user subscriptions. A Linear Regression model seeks to find the line that best fits (minimizes the mean squared error over) all the data points in a multi-dimensional space.
Outputs (the output layer) can also use activation functions like the ones mentioned above. The main difference is that the output is the final layer of the neural network and tells us what the input is predicted to be, whereas all the activation functions (hidden layers) between the input and output are essentially just a black box.
It is often the case that outputs are not binary (either 0 or 1). In scenarios where there are multiple classes of outputs, a softmax output (paired with a logistic/cross-entropy loss, a probability-based metric) is used.
For example, the following example demonstrates single-class classification with a binary output (one versus rest):
The following example demonstrates multi-class classification with a softmax output:
In the first example, of the 5 classes, the input can take on only 1 class. If the input is an orange, it cannot be anything else. If the input is an apple, it cannot be anything else, and so on. In the second example, of the 5 classes, there is an 81% probability that the input is an orange. All the outputs of a softmax sum to 1, or 100% probability.
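The softmax function itself is short enough to sketch directly. The five scores below are made up, standing in for the raw outputs of a network's final layer:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Five classes; the largest score gets the largest probability.
probs = softmax([2.0, 1.0, 0.5, 0.2, -1.0])
# probs sums to 1.0, and probs[0] is the biggest entry
```

Subtracting the maximum score before exponentiating does not change the result but prevents overflow for large scores.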
So there you have it! Good job making it to the end of this article. If you were left wondering how to find the weights and biases, then you are on the right track. Such an explanation deserves a whole separate article on Gradient Descent, which I have purposely omitted from this article. I have left a link to a more comprehensive overview of Gradient Descent below, but I suspect it will be too math heavy for some.
Since I currently have no intention of going into the weeds of Gradient Descent, I’ll close with a quick, superficial overview of what it does. Gradient Descent is a way to find the weights and biases of a neural network using a mix of programming, calculus, and some linear algebra. In the most simplistic terms, gradient descent seeks to find the minimum of an error/cost function like the loss functions mentioned above. This is where differentiation comes into play. But the problem is that finding the solution analytically is not easy. Gradient Descent solves this challenging problem by approximating the solution, iteratively (programmatically, in a loop) updating the weights and biases using a gradient, which is a multi-variable calculus concept (hence Gradient Descent). After every time step, the weights and biases are updated incrementally using a learning rate (alpha) and the gradients of the weights and biases, respectively. This is where I stop, because gradient descent really deserves its own write-up in order to cover all the nuances, particularly the math.
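Even without the full math, the update loop can be sketched on a toy error function. Here the "network" is a single weight w, and the error function is (w - 3)^2, whose minimum is at w = 3 and whose gradient is 2*(w - 3):

```python
w = 0.0        # initial weight
alpha = 0.1    # learning rate

for _ in range(100):
    grad = 2 * (w - 3)     # gradient of the error (w - 3)^2
    w -= alpha * grad      # step in the direction that reduces the error

# After the loop, w has converged very close to 3, the minimum.
```

A real network does the same thing simultaneously for millions of weights, with the gradients computed by back-propagation.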
In closing, this post introduced many concepts that are widely used, including inputs, weights, activation functions, and introductory math that will build the foundation for understanding more complex neural network architectures in the future.
I have left a lot of gaps in this blog post, specifically in the areas involving a deep understanding of mathematics. This includes optimization and gradients, which require an understanding of multi-variable calculus and linear algebra.
I realize how difficult it is to explain many concepts without first laying down the foundation. I can only point my readers to the content that has helped me progress in my own journey. If you are truly serious about diving deep into the world of AI, I strongly recommend checking out the following resources to build up your foundation. Please follow the links below.
Linear Algebra | Khan Academy
Multivariable Calculus | Khan Academy
Probability & Statistics:
Introduction to Probability - The Science of Uncertainty
An introduction to probabilistic models, including random processes and the basic elements of statistical inference.
Algorithms & Data Structures:
Algorithm Design and Analysis
Learn about the core principles of computer science: algorithmic thinking and computational problem solving.
The Most Comprehensive Cheat Sheet on AI:
Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data
The Most Complete List of Best AI Cheat Sheets
Machine Learning & Deep Learning:
Online Artificial Intelligence / Machine Learning course
Introduction to Machine Learning | Machine Learning Crash Course | Google Developers
Gradient Descent - ML Cheatsheet documentation
CS231n Convolutional Neural Networks for Visual Recognition
Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.
RNNs and LSTMs:
The Unreasonable Effectiveness of Recurrent Neural Networks
If you learned something, please leave a clap and share it with others who are interested in the field of AI. If you have any questions or found this blog post confusing, please leave a comment below. Thanks for reading.
Special thanks to Brandon Tong for valuable feedback on my earlier draft and pressing for deeper explanations that I have taken for granted.
Special thanks to Srikanth Varma Chekuri and the AppliedAICourse team for helping me learn many of these difficult concepts. Without their teaching and guidance, I would not know 90% of ML and DL I know today. Furthermore, they have challenged me to dig deeper by teaching what I’ve learned (through this blog post). This has helped me solidify my own understanding of deep learning.
This blog post was submitted as part of a blogging competition for Applied AI Course in which I received 2nd place. Please refer to the link below: