Artificial Intelligence, Machine Learning and Deep Learning Explained (As Best As I Can)

Kaan Ceylan
May 16, 2022

Welcome to the series where I try to introduce anyone who’s excited to join the Data Science / Machine Learning community to the tools used in the field and what you should know before diving in to learn about them. You can check out the latest article here, or head to my profile for the previous posts, where we talked about how Pandas can help you manipulate and structure your data and covered popular data visualization libraries.

This month I wanted to take a little break before we look at Machine Learning and Deep Learning frameworks. We’ll try to lift the curtain, or open the black box if you will, and see how things work. We’ll learn about the different levels of Artificial Intelligence (don’t worry, we’re not going to be hunted by robots anytime soon), about artificial neurons and how they work, and about different types of Artificial Neural Network architectures. Let’s dive in!

Let’s Start With Some Definitions

Defining Artificial Intelligence is often confusing for newcomers and for people who don’t work in tech, so let’s take a quick look at what AI is and how it differs from Machine Learning and Deep Learning.

To paint the bigger picture, we can say that Artificial Intelligence > Machine Learning > Deep Learning

Artificial Intelligence is, at its core, a computer’s ability to display human-like intelligence and to learn to do things over time without supervision. Okay, maybe that wasn’t a very good explanation, but I don’t want to scare you off, that’s all.

We can divide AI into three main categories, the first one being Artificial Narrow Intelligence, also called Narrow AI. Think of Apple’s personal assistant Siri or Tesla’s Autopilot, which can do some specific tasks really well but cannot generalize that intelligence to fields they are not familiar with or not trained in. Siri cannot write a book if you ask it to, and Tesla’s Autopilot cannot tell you what the weather looks like next week.

The next step, which many researchers expect at some point in the future, is Artificial General Intelligence, or AGI, which will be able to help you with almost anything you ask of it, like a personal assistant.

Lastly, we’ll end our journey with Artificial Super Intelligence, which will probably be the end of the road towards the singularity, when we’d be facing a system like Skynet in Terminator in terms of intelligence. (As a bonus, if you want to learn more about this, I’d highly recommend Ray Kurzweil’s book “The Singularity Is Near”.)

How About Machine Learning?

Machine Learning is a subfield of Artificial Intelligence, and its algorithms power most of the systems we use today. The most common use cases of Machine Learning (and of Deep Learning as well) are classification, regression (predicting a numeric value or a probability, in other words) and clustering (grouping things that are similar to each other). Let’s see how we can achieve these results using different approaches 👇🏻

We divide ML algorithms into three groups based on how they learn. I’m just going to go over them quickly without going into too much detail and ending up with a book instead of a Medium post.

Let’s say that we are dealing with a dataset that has information on 1,000 houses/apartments in a city. The features in the dataset are the number of rooms, the number of bathrooms, the size of the home in m², etc.

Supervised Learning

First on the list is Supervised Learning, which needs labelled data. By labelled I mean that the target feature we’re trying to predict has already been populated by humans. Someone took the time to identify each of the 1,000 samples as either a house or an apartment and formed a feature column with that data. We can say that a row with a room count of 5 and a bathroom count of 3 on a 300 m² footprint is probably going to be a house rather than an apartment.

When we give the data to a supervised learning algorithm, all of the columns except for the label column form the X variable and the labels form the Y variable. A part of X and Y is set aside as the test data, which the model is not allowed to see. This keeps our results free of bias, because we adjust our parameters according to the results on the training data, which is the remaining part of X and Y. The algorithm processes the training data and learns from it. Then, as features are fed to the trained model, we get outputs of either 0 or 1, indicating either a house or an apartment.

Because the data fed into the model is labelled, this type of learning is called Supervised Learning, and it is mostly used for classification or regression tasks. Some of the most popular supervised learning algorithms are Support Vector Machines, K-Nearest Neighbors, Decision Trees and regression algorithms.
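To make this concrete, here’s a minimal sketch of that workflow in plain Python: a tiny, invented house/apartment dataset, a train/test split, and a 1-nearest-neighbour classifier standing in for the model. All the numbers and names below are made up for illustration.

```python
import math

# Hypothetical labelled samples: (rooms, bathrooms, size_m2) -> label
# where 1 = house and 0 = apartment.
data = [
    ((5, 3, 300), 1), ((6, 4, 350), 1), ((4, 2, 220), 1),
    ((2, 1, 70), 0), ((1, 1, 45), 0), ((3, 1, 90), 0),
]

# Hold some samples out as test data the "model" never sees.
train, test = data[:4], data[4:]

def predict(features):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(train, key=lambda sample: math.dist(sample[0], features))
    return nearest[1]

# Evaluate only on the held-out test data.
correct = sum(predict(x) == y for x, y in test)
print(f"test accuracy: {correct}/{len(test)}")
```

A real project would use a library classifier and a proper train/test split helper, but the shape of the process (fit on training rows, judge on unseen rows) is the same.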

Unsupervised Learning

The main use case of unsupervised learning algorithms is clustering, which can be used for things like customer segmentation, product recommendations, anomaly detection and much more. We can accomplish the same task as above, deciding whether a datapoint is a house or an apartment, using unsupervised learning algorithms as well. In this case the algorithm goes through all the data without any labels indicating whether a datapoint belongs to an apartment or a house. What it does instead is group together the datapoints whose feature values are closer to each other. Like I said above, a datapoint with 5 rooms and 3 bathrooms is more likely to be a house than an apartment, and the algorithm uses the same intuition to split the data into 2 clusters, since we have two classes in this dataset. Some commonly used unsupervised learning algorithms are K-Means Clustering, Hierarchical Clustering and Principal Component Analysis (though PCA is a dimensionality reduction algorithm and not used for clustering or predictions).
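Here’s a bare-bones k-means sketch on the same invented datapoints, now without their labels. Production implementations do the same assign-and-update loop with far more care (multiple restarts, feature scaling), so treat this purely as an illustration.

```python
import math
import random

# The same hypothetical (rooms, bathrooms, size_m2) points, now unlabelled.
points = [(5, 3, 300), (6, 4, 350), (4, 2, 220),
          (2, 1, 70), (1, 1, 45), (3, 1, 90)]

def kmeans(points, k=2, steps=10):
    """Bare-bones k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    random.seed(0)  # fixed seed so the sketch is reproducible
    centroids = random.sample(points, k)
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Each centroid becomes the per-feature mean of its cluster.
        centroids = [
            tuple(sum(v) / len(cluster) for v in zip(*cluster)) if cluster
            else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return clusters

for i, cluster in enumerate(kmeans(points)):
    print(f"cluster {i}: {cluster}")
```

With no labels at all, the loop still separates the big homes from the small ones, because their feature values sit close together.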

Reinforcement Learning

Now for the last stop in ML algorithms, we’re going a bit further into the “AI territory” most people are familiar with as an idea: Reinforcement Learning. As the name suggests, it’s an approach where the model learns through a positive or negative feedback loop, much like how our brains learn, so it can be put into the category of reward-based learning. To explain it as simply as possible: when the desired action is performed, that behaviour is reinforced, and vice versa for any undesired action. The most popular use cases of RL are building agents that play games like chess or Go, and building systems for robotic processes and automation in manufacturing. Or, sometimes, when people are bored, teaching a robot arm how to flip a pancake.
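That feedback loop can be sketched with a toy two-armed bandit: the agent tries actions, collects rewards, and strengthens its estimate of whichever action pays off. The payoff probabilities, learning rate and action names below are all invented for illustration.

```python
import random

random.seed(42)                              # reproducible toy run
true_payoff = {"left": 0.2, "right": 0.8}    # hidden from the agent
value = {"left": 0.0, "right": 0.0}          # the agent's learned estimates
alpha = 0.1                                  # learning rate

for step in range(1000):
    # Explore 10% of the time, otherwise exploit the best-known action.
    if random.random() < 0.1:
        action = random.choice(["left", "right"])
    else:
        action = max(value, key=value.get)
    reward = 1.0 if random.random() < true_payoff[action] else 0.0
    # Positive feedback nudges the estimate up, negative nudges it down.
    value[action] += alpha * (reward - value[action])

print(value)  # "right" ends up valued higher than "left"
```

Real RL systems deal with states, sequences of actions and delayed rewards, but the core idea is this same reward-driven update.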

Deep Learning: The ✨Shiny✨ Part

Finally, let’s take a closer look at Deep Learning, which uses Artificial Neural Networks made up of layers that are in turn made up of many artificial neurons. The inputs given to a neuron are multiplied by the neuron’s weights, and then a bias is added. The neuron uses an activation function, which takes the result of this calculation and, based on that value, decides whether the neuron will be activated or not, meaning whether it will pass its output on to the next layer. The resulting values are passed through the layers until they reach the final output layer. This iterative process, combined with the back propagation and gradient descent algorithms, allows the network’s weights to be optimized.
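A single neuron’s forward pass is just a few lines of arithmetic. Here’s a sketch with a sigmoid activation; all the input, weight and bias values are arbitrary numbers chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus a bias,
    passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

out = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1)
print(round(out, 4))  # 0.3318
```

A full network is just many of these stacked in layers, with each layer’s outputs becoming the next layer’s inputs.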

In Deep Learning, rather than different algorithms, different network architectures are used, which require different activation functions depending on the task they’re trying to accomplish. These factors, along with the hyperparameters of the network, control how it learns and behaves. Some of the most popular ANN architectures are Convolutional Neural Networks (CNNs), which are mostly used in Computer Vision applications, and Recurrent Neural Networks (RNNs), which are used in the Natural Language Processing (NLP) field. But we’ll have to save those for another post.

Deep Learning Glossary

Up to this point, we’ve had a lot of technical terms thrown around. Lastly, let’s clear up some of the ambiguity that the jargon brings, like what an activation function is, how an artificial neuron works, and what weights and biases are. Then we’ll wrap this one up.

What is an artificial neuron exactly?

We can define an artificial neuron as a node in a neural network which takes one or more inputs, either from the neurons before it or from the input of the model, and then calculates an output based on the function chosen as a parameter. Because the neuron’s activation function is non-linear, a change in an input does not necessarily produce a proportional change in the output.

Artificial Neural Networks

An Artificial Neural Network is, well, as the name suggests, a network of neurons, made up of an input layer, an output layer and one or more hidden layers in the middle. For an ANN to be classified as deep, it has to have more than one hidden layer.

A diagram of an ANN that I’ve added just because it looks cool.

What is an Activation Function?

An activation function’s main task in a neural network is to take a neuron’s weighted input plus bias, transform it, and pass the result on to the next layer of neurons. Activation functions are essentially what give neural networks their non-linearity and allow them to use the back propagation algorithm to learn. One key point about activation functions is that they have to be differentiable for the model to improve its output and learn.

Without activation functions, all the layers of the network would just be performing one big linear computation, which a single layer could compute on its own.

Activation Functions — Drawing originally from https://sefiks.com/2020/02/02/dance-moves-of-deep-learning-activation-functions/
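The point about missing activation functions can be checked with a few lines of arithmetic: two stacked linear “layers” collapse into one linear function, while inserting a non-linearity such as ReLU breaks that collapse. The weights and inputs below are arbitrary.

```python
def relu(z):
    return max(0.0, z)

# Two stacked linear layers with no activation:
# layer2(layer1(x)) = w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w1, b1, w2, b2 = 2.0, 1.0, 3.0, -0.5

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

def one_linear_layer(x):
    return (w2 * w1) * x + (w2 * b1 + b2)

assert two_linear_layers(2.0) == one_linear_layer(2.0)  # identical maps

# With a non-linearity in between, the composition is no longer linear,
# so depth actually adds expressive power.
def with_activation(x):
    return w2 * relu(w1 * x + b1) + b2

print(with_activation(-1.0), with_activation(1.0))  # -0.5 8.5
```

ReLU zeroes out the negative pre-activation, so the two sides of the input space behave differently, something no single linear layer can do.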

What Are Weights, Biases and What Do They Mean For A Neuron?

A weight, usually denoted with a “w”, is the term that decides how much significance an input will have on the output of that neuron. The network adjusts the weight values using back propagation and gradient descent to fine-tune its outputs and improve the predictions it makes. Because a neuron’s inputs are multiplied by their weights, each weight adjusts how much influence its input has on the next layer and therefore on the output. Don’t worry, we’ll clear up what back propagation and gradient descent are in a minute.

Now let’s deal with the bias. Take a look at the function inside the neuron in the image below and imagine it without the bias. With only the weight left, we can adjust how steep the function is, but we cannot shift it to the left or to the right. The bias allows us to do exactly that, so that we can represent more complex inputs and get better prediction results.
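Here’s a quick numeric sketch of that shifting effect using a sigmoid neuron (arbitrary numbers): with weight alone the output crosses 0.5 at x = 0, while a bias b moves that crossing point to x = -b / w.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 1.0
# The output sigmoid(w*x + b) equals 0.5 exactly where w*x + b = 0,
# i.e. at x = -b / w: the bias slides the curve left or right.
for b in (-2.0, 0.0, 2.0):
    crossing = -b / w
    assert abs(sigmoid(w * crossing + b) - 0.5) < 1e-12
    print(f"bias {b:+.1f}: output crosses 0.5 at x = {crossing:+.1f}")
```

Changing w only stretches or compresses the curve around that crossing point; only the bias can move the crossing point itself.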

What Do We Mean By “Model”

A model, in the case of traditional machine learning algorithms, is a function that has been adjusted to give the most accurate output it can. That adjustment is made possible by passing the values in the training data through it one by one, comparing the function’s output with the desired output in the dataset, and then adjusting the parameters of the function so that its output gets closer and closer to the actual output we want to see.

In the case of Deep Learning, we can say that a model is the representation of a real-world problem as a function with the proper weights so that it can produce meaningful outputs.

Back Propagation, Gradient Descent and The Cost Function

Now, if we’re going to improve the performance of our model, we’ll need some sort of feedback that adjusts the weights of the neurons so that we get closer to the optimal output.

The cost function, or loss function, is the tool that lets us evaluate the performance of our model. It takes the prediction from the model and compares it to the output we are aiming for. When we find the minimum value of the cost function using gradient descent, it means we’ve most likely arrived at the optimal solution for our problem. Some of the most popular cost functions are Mean Squared Error, Mean Absolute Error and Binary Cross Entropy.
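For instance, Mean Squared Error is just the average of the squared differences between predictions and targets (the numbers below are made up):

```python
# Mean Squared Error on made-up predictions vs. targets.
predictions = [2.5, 0.0, 2.1, 7.8]
targets = [3.0, -0.5, 2.0, 7.0]

mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
print(round(mse, 4))  # 0.2875
```

Squaring makes every error positive and punishes large misses much harder than small ones, which is why MSE is such a common default for regression.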

As for Back Propagation and Gradient Descent, they are like the engine of the whole system; without them there is no learning or improvement in the model. For gradient descent to work, the cost function has to be differentiable, and if it is also convex, the minimum we find is guaranteed to be the global one. Gradient essentially means the slope of the function at a point, and we get it by taking the derivative.

Think of our prediction as a ball rolling down a hill. Gradient descent takes the gradient at the ball’s current position, multiplies it by the learning rate (a hyperparameter of the model), subtracts that from the ball’s current position, and so moves the ball downhill by that amount. It keeps doing this until the steps stop making meaningful progress, for example when the loss stops decreasing for a set number of steps. At that point, the ball’s final position could be the global minimum or just a local minimum. That’s the tricky bit to get right.

Our hypothetical balls trying to converge to the global minimum.
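The ball-on-a-hill loop can be sketched in a few lines for the convex function f(x) = (x - 3)², whose gradient is 2(x - 3) and whose minimum sits at x = 3:

```python
# Gradient descent on a simple convex "hill": f(x) = (x - 3)^2.
def grad(x):
    return 2.0 * (x - 3.0)

x = 10.0             # the ball's starting position
learning_rate = 0.1  # how big a step to take down the slope

for step in range(100):
    x -= learning_rate * grad(x)  # move against the gradient

print(round(x, 4))  # 3.0
```

Because this toy function is convex, the ball cannot get stuck in a local minimum; real neural network loss surfaces are not convex, which is where the trouble described above comes from.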

Let’s Wrap Up

If you’ve made it to the end of this not-so-well-structured post where I dumped whatever came to my mind on ML and DL, I really want to thank you! It’s probably been a messy and pretty long post, but I do hope you’ll leave this page with at least a rough idea of how things work behind the scenes, and some cool concepts/definitions that you can share with people. I publish an article at around the same time each month, but if you want to be notified when the next one comes out, you can just click here to get an email when it’s published. And if you’ve spotted any mistakes in the post, grammatical or technical, please feel free to reach out to me on Twitter or leave a comment here. Thank you again for reading and I’ll see you in the next post!


Kaan Ceylan

Aspiring ML Engineer. Enthusiast of data and everything about it. Doing my best to document my self-learning journey while hoping to help others.