Ideas behind artificial intelligence — without maths!

Artificial intelligence is all around us. You use artificial intelligence algorithms every day without even realizing it. Today, face recognition helps you unlock your phone, Google Translate helps you translate between more than a hundred languages, Alexa recognizes your voice and helps you play your music, and there are cars that can drive themselves. Have you ever wondered how Netflix so often manages to recommend just the right show for you? Behind all these things are powerful machine learning algorithms which are built upon really simple and clever ideas. Today, we are going to take a step back and try to understand the ideas behind these algorithms without going into the math behind them.

This article doesn’t assume any knowledge of machine learning from the reader, and I have attempted to present the ideas in as simple a manner as possible. We will quickly go over what artificial intelligence is and why we are talking about it today before moving on to the ideas behind the algorithms.

What is Artificial Intelligence and why are we talking about it?

AI is typically defined as the ability of a machine to perform cognitive functions we associate with human minds, such as perceiving, reasoning, learning, interacting with the environment, problem solving, and even exercising creativity. Examples of technologies that enable AI to solve business problems are robotics and autonomous vehicles, computer vision, language, virtual agents, and machine learning.

A convergence of algorithmic advances, data proliferation, and tremendous increases in computing power and storage has propelled AI from hype to reality. Most of the ideas that we use in our algorithms are more than two or three decades old. It is only with the advances in our ability to store and process large volumes of data that we have been able to harness the true power of these algorithms.

Figure 1

Types of Algorithms in Artificial Intelligence

There are three classes of algorithms in machine learning: supervised learning, unsupervised learning and reinforcement learning. The idea behind this division is very simple. Any mathematical model has three parts:

Figure 2
  • Input Data
  • Approach
  • Output Data

This basic division gives us three different algorithm classes of machine learning:

  • Unsupervised learning: Give the computer input data and a fixed approach, and let the computer figure out some patterns in the input data. Here we don’t have any output data.
  • Supervised learning: Give the computer input data, a fixed approach and the expected output data, and let the computer learn how to get from the input to the output.
  • Reinforcement learning: Give the computer input data and rules for an approach. In most cases, we won’t know what the output will look like, but we will have defined rules that tell the approach whether an output is good or bad.

And that’s pretty much it!

Figure 3

Let us dive deeper into these algorithm classes and look at some of the algorithms which are used around you without you even realizing it.

Unsupervised Learning

Some of the algorithms in this class are:

  • k-means
  • Hierarchical clustering
  • Gaussian mixture models

One of the most popular algorithms in unsupervised learning is called k-means. It is used to cluster a set of data points into a chosen number of groups. As you might have guessed by now, the algorithm groups the data points into ‘k’ groups, and each group is represented by the mean (the average position) of its points, hence the name k-means. A typical use case for this algorithm is customer segmentation.

Here is the idea behind the algorithm:

  • Imagine you are on a tennis court and you randomly throw 100 white tennis balls on the court, and you are tasked with finding some patterns in how they are spread out. This is your data set; try imagining it from the top.
  • Next, you decide that you will identify four groups on this landscape (four here represents the k in the k-means algorithm). You randomly throw four different coloured tennis balls on the court.
  • Next, you look at each white tennis ball and spot the nearest coloured ball to that white tennis ball. You paint the white ball the colour of the nearest coloured ball that you spot. So if the nearest coloured ball is blue, you paint the white tennis ball also blue. Do it for all white tennis balls.
  • This way you have created four groups, but let’s be honest, this is not a good grouping, as you had just randomly thrown the four coloured tennis balls. How can we improve this? Here the idea is that you find the centre of each colour group, i.e., the average position of all the balls of that colour, and move the coloured ball there. So imagine drawing a hypothetical circle around all the blue tennis balls and placing the blue ball at its centre. Do this for all four colour groups.
  • Apart from the coloured tennis balls which are now at the centres of the groups, repaint all the original balls white again and repeat the exercise of spotting the closest coloured ball.
  • This way you keep refining the groups. You stop when the balls no longer change colour from one round to the next; at that point the four groups are a good representation of the original white tennis balls.

Here is what it looks like:

Figure 4
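
For readers who like to see ideas as code, the tennis-ball procedure can be sketched in a few lines of plain Python. This is only a minimal sketch under some assumptions of mine: the names `nearest` and `k_means` are made up for illustration, the court is a made-up set of two clumps of balls, and the coloured balls start on top of randomly chosen white balls rather than anywhere on the court.

```python
import random

def nearest(point, centres):
    """Index of the coloured ball (centre) closest to this white ball."""
    return min(range(len(centres)),
               key=lambda j: (point[0] - centres[j][0]) ** 2
                           + (point[1] - centres[j][1]) ** 2)

def k_means(points, k, rounds=100, seed=0):
    rng = random.Random(seed)
    # Throw the k coloured balls (assumption: they land on k random white balls).
    centres = rng.sample(points, k)
    groups = None
    for _ in range(rounds):
        # Paint every white ball the colour of its nearest coloured ball.
        new_groups = [nearest(p, centres) for p in points]
        if new_groups == groups:   # no ball changed colour, so we are done
            break
        groups = new_groups
        # Move each coloured ball to the average position of its group.
        for j in range(k):
            members = [p for p, g in zip(points, groups) if g == j]
            if members:            # an empty group leaves its ball where it is
                centres[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centres, groups

# A made-up court: 50 balls near one corner and 50 near the opposite corner.
rng = random.Random(1)
court = ([(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(50)] +
         [(rng.uniform(9, 10), rng.uniform(9, 10)) for _ in range(50)])
centres, groups = k_means(court, k=2)
print(centres)
```

With two clumps this far apart, the two coloured balls should end up sitting in the middle of one clump each.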

A couple of considerations here:

  • How do you know that the initial toss of the four coloured tennis balls was a good one? You are right to ask. We don’t. That is why we repeat the whole exercise many times with different random tosses and keep the result with the tightest groups.
  • How do you know that four is a good number? Again, we don’t. We try numbers from 2 onward till we reach a point where we are convinced that we have found a good general representation of the data set.
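
Both considerations can be sketched in code as well. The sketch below is only an illustration under my own assumptions: the helper names (`dist2`, `k_means_once`, `best_of`), the number of tosses (20) and the made-up court with three clumps are not from any particular library. It scores each result by the total squared distance of every ball from the centre of its group (lower means tighter groups), keeps the best of several random tosses, and prints that score for several values of k.

```python
import random

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_means_once(points, k, rng, rounds=50):
    """One random toss of k coloured balls, refined for a fixed number of rounds."""
    centres = rng.sample(points, k)
    for _ in range(rounds):
        groups = [min(range(k), key=lambda j: dist2(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, g in zip(points, groups) if g == j]
            if members:
                centres[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    groups = [min(range(k), key=lambda j: dist2(p, centres[j])) for p in points]
    # Score: total squared distance from each ball to its group's centre.
    score = sum(dist2(p, centres[g]) for p, g in zip(points, groups))
    return score, centres, groups

def best_of(points, k, tosses=20, seed=0):
    """Repeat the toss several times and keep the tightest grouping."""
    rng = random.Random(seed)
    return min((k_means_once(points, k, rng) for _ in range(tosses)),
               key=lambda result: result[0])

# A made-up court with three clumps of 30 balls each.
court_rng = random.Random(42)
def clump(cx, cy, n=30):
    return [(cx + court_rng.uniform(-1, 1), cy + court_rng.uniform(-1, 1))
            for _ in range(n)]
court = clump(0, 0) + clump(10, 0) + clump(5, 9)

scores = {}
for k in range(2, 6):
    scores[k], _, _ = best_of(court, k)
    print(k, round(scores[k], 1))
```

The score always tends to shrink as k grows, so we are not looking for the smallest score. We are looking for the value of k after which the improvement becomes small, which on this court should be three.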

Supervised Learning

There are a few different algorithms that we are going to look at here, namely:

  • Support vector machines
  • Artificial neural networks
  • Convolutional neural networks
  • Recurrent neural networks

Typical use cases for supervised learning include customer churn prediction, face recognition, object detection, machine translation and chatbots.

Support vector machines

The idea behind this algorithm is actually quite elegant. Here is the problem statement: suppose you are given points on a straight line which belong to two classes, as in the diagram below, and are asked to find a single point which divides the two classes.

Figure 5

Pretty easy, right? Even for computers. The blue line in the middle divides the two sets of points fairly, we can say. But imagine your data looked like the diagram below and you were given the same task: divide the data points into two classes using a single point.

Figure 6

Not so easy this time. No matter which point you pick, you will not be able to divide the two sets of points using a single point. The reason we insist on a single point is that a point is the linear separator in one dimension, just as a straight line is the linear separator in a 2-d space. And the reason we want a linear separator is that a linear classifier is computationally easier to find.

Coming back to the problem at hand, the idea is to map the data points to a higher dimensional space where linear classification is possible. So in the example above, we map the points on the line onto a parabola in the 2-d plane and use a straight line to classify them.

Figure 7
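
The mapping trick is easy to try out in code. The sketch below uses made-up points and class names ("inner" and "outer"), and the height of the separating line is picked by hand; a real support vector machine would learn where to put the line from the data.

```python
# Points on a line: the "inner" class sits in the middle and the "outer" class
# on both sides, so no single cut point can separate them.
xs = [-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]
labels = ["outer", "outer", "outer", "inner", "inner", "inner",
          "outer", "outer", "outer"]

def separable_by_one_point(xs, labels):
    """True if one cut point leaves each class entirely on its own side."""
    ordered = [lab for _, lab in sorted(zip(xs, labels))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1

# Map each point x on the line to the point (x, x * x) on a parabola.
lifted = [(x, x * x) for x in xs]

def classify(point, height=2.0):
    """In the plane, the horizontal line y = height separates the classes
    (height is chosen by hand here, purely for illustration)."""
    return "outer" if point[1] > height else "inner"

print(separable_by_one_point(xs, labels))
print([classify(p) for p in lifted])
```

On the original line the check fails, but once the points are lifted onto the parabola a single straight line gets every point right.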

And that is one of the key ideas behind SVMs, or support vector machines. We take a set of input data points and map them to a higher dimension where linear classification is easier. Here is a video demonstrating it for the case where the data points are in 2-d space.

Figure 8

Artificial Neural Networks

Here is what the neural network architecture looks like.

Figure 9

A simple neural network without any hidden layers (more on that later) is just a glorified weighted average with some clever ideas. Just like any other mathematical or statistical model we have to train a neural network. Let us look at it step by step to understand it a bit further.

We will use the case of handwritten digits as an example to explain the ideas. We take a 28x28 photo of a handwritten digit and record the intensity of each pixel, i.e., 784 values, which we store in a linear array. As you can see in the diagram above, we have an array of 784 data points which are connected to 10 nodes, one for each digit from 0 to 9. A connection is denoted by a line and carries a weight. Each node will contain the probability that the input photo is of that digit. Here is how the training works.

  • Take a sample of data points and multiply each value by the weight of the corresponding connection.
  • In the example above, each node, denoted by a circle, will have 784 inputs. To aggregate the output from the previous layer, i.e., the 784 weighted inputs, we add them up and pass the sum through something called an activation function, which turns it into a single output value for that node.
  • We compare these outputs with the real answer, i.e., whether the highest-probability prediction is actually for the right digit or not. We use an error function to measure how far off we are. Our goal is to minimize the error function.
  • Now, initially our predictions will be way off. That is because the weights of those connections start out as arbitrary values (typically random). But how do we find good weights? Here is where an idea called gradient descent comes in. Without going into the math behind gradient descent, what it does is tell us how to change the weights so that the error function goes down.
  • So, we use gradient descent to change our weights and go through this process again and again with new data points, adjusting our weights each time, until we reach a desired level of accuracy. The algorithm for gradient descent also requires a parameter, usually called the learning rate, to dictate how fast we should change our weights.
Figure 10

And that is what broadly happens when we train a neural network. If we look back at the process, we had a few decisions to make which define our neural network:

  • The sample size
  • The choice of the mathematical function for aggregating the information from the previous layer
  • The choice of error function
  • The gradient descent parameter, which sets how fast the weights change
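
To make the training loop concrete, here is a minimal sketch of a network with no hidden layer in plain Python. Several things in it are my own simplifying assumptions rather than the setup described above: the "photos" are made-up 2x2 images (4 pixels) instead of 28x28, there are 2 classes instead of 10 digits, the activation function is softmax, the error function is cross-entropy, and the weights are adjusted after every single image. The function names are made up for illustration.

```python
import math
import random

def softmax(scores):
    """Activation function: turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, pixels):
    # Each output node adds up its weighted inputs ...
    scores = [sum(w * p for w, p in zip(node_weights, pixels))
              for node_weights in weights]
    # ... and the activation function turns the sums into probabilities.
    return softmax(scores)

def train(samples, n_classes, learning_rate=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Start from small random weights, so the first predictions are poor.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_classes)]
    for _ in range(epochs):
        for pixels, label in samples:
            probs = predict(weights, pixels)
            for c in range(n_classes):
                # Gradient of the cross-entropy error with softmax outputs:
                # (predicted probability - correct answer) * input value.
                error = probs[c] - (1.0 if c == label else 0.0)
                for i in range(n_inputs):
                    # Gradient descent: step against the gradient.
                    weights[c][i] -= learning_rate * error * pixels[i]
    return weights

# Made-up 2x2 images, flattened as [top-left, top-right, bottom-left, bottom-right].
# Class 0 has a bright top row, class 1 has a bright bottom row.
samples = [
    ([1.0, 1.0, 0.0, 0.0], 0), ([0.9, 0.8, 0.1, 0.0], 0), ([1.0, 0.7, 0.0, 0.2], 0),
    ([0.0, 0.0, 1.0, 1.0], 1), ([0.1, 0.0, 0.9, 0.8], 1), ([0.2, 0.1, 1.0, 0.9], 1),
]
weights = train(samples, n_classes=2)
probs = predict(weights, [0.8, 0.9, 0.0, 0.1])   # an unseen "bright top row" image
print(probs)
```

The decisions listed above show up directly as choices in the code: how many images we look at before changing the weights, `softmax` as the activation function, the cross-entropy gradient as the error signal, and `learning_rate` as the gradient descent parameter.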

People tried this model and it worked! Someone then suggested that we should add another layer in between using the same idea and then make predictions using the information in the intermediate layer. That intermediate layer is called the hidden layer. This also becomes a decision point: we need to decide how many hidden layers we need.

Another popular idea emerged when people started noticing that this architecture had started memorizing the data instead of finding patterns in the data. This is a typical case of over-fitting, for those who are aware of how modelling works. The idea to overcome over-fitting, known as dropout, was to randomly switch off some nodes during training, so that their connections contribute nothing for that round. And it worked. The models started improving. This again becomes a decision point: what fraction do you switch off?

Figure 11
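
The "randomly set things to zero" idea takes only a couple of lines of code. The sketch below is one common variant, applied to the values coming out of a layer. The function name `dropout` and the rescaling of the surviving values (so that their overall size stays similar on average) are my assumptions, not something specified above.

```python
import random

def dropout(values, drop_fraction, rng):
    """Randomly zero roughly drop_fraction of the values and scale up the rest."""
    keep = 1.0 - drop_fraction
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
layer = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dropped = dropout(layer, drop_fraction=0.5, rng=rng)
print(dropped)
```

This is only done while training. When the trained network is used to make predictions, nothing is switched off.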

It is the combination of these decision points that gives us the different architectures for using neural networks in various use cases.

Convolutional Neural Networks

The idea behind convolutional networks is actually quite simple. The network that we saw above is what is called a fully connected network. Now imagine we want to train our network on bigger images, let’s say 300x300 pixels. If we connect all these pixels to a single hidden layer with, let’s say, 10 nodes, then we will have 300x300x10 = 900,000 weights, i.e., almost a million. Now that is a lot of weights, and the count grows quickly with larger images and more nodes, so optimizing them becomes computationally very expensive. Hence a new idea of convolutional nets was born. The idea is that we will define layers which have a very small set of weights and scan the image using those weights. It is like holding a torch and scanning the photo, trying to find something in it.

Figure 12

And the idea is that you do this multiple times using a different torch each time. The objective for each torch is to learn to identify one distinct pattern. For example, in face recognition, one torch will only try to find a pair of eyes, and so on. You then take what each torch has found, aggregate that information and pass it to subsequent layers. This idea worked tremendously well for images and transformed the field of computer vision. Here is what the architecture looks like:

Figure 13
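
The scanning idea can be sketched in a few lines of plain Python. In this illustration the "torch" is a 3x3 grid of weights called `filt`; the function name `scan`, the tiny 5x5 image and the hand-picked weights are all my own assumptions. In a real network the weights of each torch are learned during training rather than chosen by hand.

```python
def scan(image, filt):
    """Slide a small grid of weights over the image, one position at a time."""
    size = len(filt)
    out = []
    for r in range(len(image) - size + 1):
        row = []
        for c in range(len(image[0]) - size + 1):
            # Weighted sum of the patch of the image currently under the torch.
            row.append(sum(filt[i][j] * image[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

# Weight counts: a fully connected layer versus a single 3x3 torch.
fully_connected_weights = 300 * 300 * 10   # 900,000
torch_weights = 3 * 3                      # 9, reused at every position

# A 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A torch that responds where brightness increases from left to right.
edge_filt = [[-1, 0, 1],
             [-1, 0, 1],
             [-1, 0, 1]]
response = scan(image, edge_filt)
print(response)   # each row is [3, 3, 0]: strong where the edge is, 0 where it is flat
```

The same nine weights are reused at every position of the image, which is why the number of weights stays small no matter how big the photo is.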

Recurrent Neural Networks

There is a common theme in all the neural network architectures that we have seen above. They take information from the previous layer and pass it to the next layer. These are called feed-forward neural networks. But someone asked the question: why can’t we take the information in the intermediate layers and pass it back into the neural network? Is there some information in the intermediate layers which might be useful? It turns out, yes. This has a natural application in predicting speech and text! It is important to know what the previous few words were before predicting the next word. And hence the idea of recurrent neural networks was born.

Figure 14
Figure 15
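
The feedback idea can be sketched with a single number playing the role of the network’s memory. The names `rnn_step` and `run`, the hand-picked weights and the use of tanh as the squashing function are assumptions for illustration only; in a real network the weights are learned and the memory is a whole array of numbers.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec):
    """New memory = squash(weight * current input + weight * previous memory)."""
    return math.tanh(w_in * x + w_rec * h_prev)

def run(sequence, w_in=1.0, w_rec=0.5):
    h = 0.0                      # the memory starts out empty
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec)
    return h

# Both sequences end with the same input (0), but their histories differ.
print(run([0.0, 0.0]))
print(run([1.0, 0.0]))
```

A feed-forward network that only sees the last input could not tell these two sequences apart. The recurrent version can, because what it saw earlier is still present in its memory.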

Researchers found that when this architecture is stacked one after the other multiple times, it gives really good results in applications like machine translation and word prediction. But researchers soon ran into another problem. Even though these architectures were able to predict, let’s say, the next word, they were not able to capture context over longer passages. That is because at every step we are essentially overwriting the information that we captured earlier before passing it on. The idea proposed was quite simple: have two channels as input, use one channel as an input for the neural network, and keep the other channel like a conveyor belt where you don’t mess with the data much and keep adding your information on top of it. Researchers tried various architectures based on this idea and it worked, and hence LSTMs (Long Short-Term Memory networks) were born. These models improved our handling of context in passages and helped us improve things like machine translation and other applications in natural language processing.
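
Here is a minimal sketch of the two-channel idea, again with single numbers instead of arrays. It follows the commonly used LSTM formulation with forget, write (input) and read (output) gates. The function names, the parameter dictionary and the hand-picked parameter values are assumptions for illustration; a real LSTM learns these values from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step: h is the working channel, c is the conveyor belt."""
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # how much of the belt to keep
    write = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])     # how much new information to add
    new = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])     # the new information itself
    read = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])      # how much of the belt to reveal
    c = forget * c_prev + write * new   # the belt is carried forward, with additions on top
    h = read * math.tanh(c)             # the working channel passed to the next step
    return h, c

# Hand-picked values: keep almost everything on the belt, write only when x is 1.
params = {"wf": 0.0, "uf": 0.0, "bf": 6.0,
          "wi": 10.0, "ui": 0.0, "bi": -5.0,
          "wg": 2.0, "ug": 0.0, "bg": 0.0,
          "wo": 0.0, "uo": 0.0, "bo": 0.0}

h, c = 0.0, 0.0
for x in [1.0] + [0.0] * 20:    # one interesting input followed by 20 empty ones
    h, c = lstm_step(x, h, c, params)
print(c)
```

Even after 20 empty inputs, most of what was written on the conveyor belt at the first step is still there, which is the kind of long-range context that a plain recurrent network tends to overwrite.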

Reinforcement Learning

Reinforcement learning is an area of machine learning, inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. An algorithm learns to perform a task simply by trying to maximize the rewards it receives for its actions. Reinforcement learning differs from supervised and unsupervised learning because we do not have to give the correct input/output pairs, and we don’t even define a correct course of action; we let the algorithm learn and evaluate the correct course of action.

A typical problem in reinforcement learning is defined by five things:

  • State of the environment which the algorithm interacts with
  • Action taken by the algorithm by evaluating the state
  • Reward that the algorithm receives for its actions
  • Value, an estimate of how much reward the various courses of action will bring in the long run
  • Policy, the rule the algorithm finally follows to choose an action in each state

The objective of reinforcement learning is to come up with a policy which will maximize the reward by taking actions based on its evaluation of the value of the various possible states.
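
To see the five pieces working together, here is a minimal sketch on a made-up toy problem: a corridor with five positions, where the algorithm starts at one end and is rewarded only for reaching the other end. The learning method used is Q-learning, which is one well-known reinforcement learning algorithm and my choice for this illustration. All names and numbers in the code are assumptions.

```python
import random

N_STATES = 5          # state: position 0..4 in the corridor; the reward is at position 4
ACTIONS = [-1, +1]    # action: step left or step right

def step(state, action):
    """The environment: move, stay inside the corridor, reward 1 at the far end."""
    new_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if new_state == N_STATES - 1 else 0.0
    return new_state, reward

def learn(episodes=500, learning_rate=0.5, discount=0.9, explore=0.2, seed=0):
    rng = random.Random(seed)
    # Value table: value[state][action index], all zero at the start.
    value = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Mostly take the action that looks best so far, sometimes explore.
            if rng.random() < explore or value[state][0] == value[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if value[state][0] > value[state][1] else 1
            new_state, reward = step(state, ACTIONS[a])
            # Nudge the value towards: reward now + discounted best value afterwards.
            target = reward + discount * max(value[new_state])
            value[state][a] += learning_rate * (target - value[state][a])
            state = new_state
    # Policy: in each state, pick the action with the highest value.
    return [ACTIONS[0] if v[0] > v[1] else ACTIONS[1] for v in value]

policy = learn()
print(policy[:N_STATES - 1])   # the action chosen in each non-final position
```

Nobody tells the algorithm that it should walk to the right. It only ever sees rewards, and the policy of stepping right at every position emerges from the values it learns.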

Reinforcement learning is typically used where the number of possibilities is huge and there is no defined rule for obtaining the result; however, you can assess your position at each stage and take an informed decision about the right course of action. The most common example of reinforcement learning that you would have heard of is AlphaGo, the algorithm which beat one of the world’s best players at the game of Go. Other applications of reinforcement learning include chess, self-driving cars and optimizing the returns of a portfolio.

Figure 16

Another example is that of the game of chess. Chess has a huge number of possibilities and it is computationally impossible to hard code all the possibilities onto a computer. Each game unfolds differently, so there is no fixed input/output mapping in the game of chess. Also, as the game progresses, one can only tell whether the latest move played is a good move or not by looking at the pieces and their positions on the board. It is difficult to formulate this problem as either supervised or unsupervised learning, as we are neither dealing with defined input/output pairs nor simply trying to understand patterns in a dataset. Keeping these challenges in mind, one can formulate the game of chess as a reinforcement learning problem where the algorithm evaluates intermediate positions to propose the best course of action. Similarly, in the case of self-driving cars, it is a tedious task to map all the roads and it is practically impossible to hard code all possible scenarios that can occur while driving. However, we can train an algorithm which gets rewarded for good driving and penalized for bad driving. Over time, with reinforcement learning, the algorithm learns how to drive. And in a nutshell, this is how reinforcement learning takes place: it is a loop of interactions and rewards that determines the course of action for an algorithm.

And voila! You have covered the tip of the iceberg of machine learning and artificial intelligence. Let me know what you think of the article! If you wish to ask something particular, feel free to connect with me on LinkedIn —

Check out this space for more articles in the future! Let me know in the comments section if you want me to cover any of these topics in more detail.