An Introduction to Artificial Intelligence

Katie Silverman
Published in The Startup · Sep 26, 2020 · 16 min read

If you regularly use the internet, you interact with artificial intelligence every day. There’s no elf hiding inside your Alexa that knows what to do when you ask it to play that new BTS song, there’s no warehouse full of people who look at your search history and choose ads to send you, and the Spirit of Youtube Past doesn’t magically recommend the next video you’ll see. Artificial intelligence isn’t limited to The Terminator, Cleverbot, and that computer-generated Harry Potter chapter that was still better than Cursed Child.

But for all the buzz artificial intelligence gets, and how ingrained it is into our everyday lives, most people know very little about what it is, how it works, and why we should trust that it isn’t secretly plotting to impurify our precious bodily fluids. So, here is a no-prior-knowledge, light-on-math, stop-at-any-time-and-skip-what-you-know introduction to artificial intelligence to help you better understand its strengths and limitations, and, more importantly, look smart in front of your friends.

Level One: What is Artificial Intelligence, exactly?

Great question! I’m glad you asked. Simply put, artificial intelligence is an algorithm designed to solve a problem. That’s it. That problem can be virtually anything that would typically require human intelligence — from finding a face in a picture to figuring out whether or not an email is spam.

A word you might have heard tossed around is “machine learning.” This refers to how AI “learns” to do its task, and there are a few ways it could work. To start, let’s talk about the biggest and most basic distinction in machine learning methods: supervised vs. unsupervised learning.

Imagine, for a second, that you’re a preschool teacher. Your job: to teach a group of well-behaved and receptive (but not terribly bright) three-year-olds a few shapes: square, circle, and triangle. So you’ve got a bunch of little paper cut-outs of these shapes and a strong cup of coffee, and now you have to decide exactly how you’re going to transmit this information to a group of toddlers.

Method 1: Supervised Learning

One option would be to show them a bunch of examples and identify each of them. “That desk is a square. The clock is a circle. This tortilla chip is a triangle.” And if you do this for long enough, they’re going to start to notice patterns, and when the time comes for them to be thrust out into the real world of unlabeled shapes, they’re going to be able to tell a square from a circle. This is supervised learning — you give a toddler (or AI) a bunch of example inputs (the shape itself) and you tell it what the output is for each one (the name of the shape). The AI figures out the pattern and can apply it.
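The show-labeled-examples idea can be sketched in a few lines of code. Here is a minimal, illustrative supervised learner: a one-nearest-neighbor classifier. The features (corner count, side-length ratio) and the training data are invented for illustration, not a real shape-recognition system.

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbor classifier.
# The features (corner count, side-length ratio) are made up for illustration.

def nearest_neighbor(labeled_examples, new_point):
    """Return the label of the training example closest to new_point."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(labeled_examples, key=lambda ex: distance(ex[0], new_point))[1]

# Training data: (features, label) pairs, like labeled shape cut-outs.
training = [
    ((4, 1.0), "square"),    # 4 corners, equal sides
    ((3, 1.0), "triangle"),  # 3 corners
    ((0, 1.0), "circle"),    # no corners
]

print(nearest_neighbor(training, (4, 0.9)))  # → square
```

Once the labeled examples are in place, classifying a new, unlabeled shape is just a matter of asking which known example it most resembles — exactly what the toddlers end up doing.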

Method 2: Unsupervised Learning

But say you’re lazy, or need to teach more shapes than you have time to give examples of, or maybe you don’t fully understand shapes yet yourself. You could give all the paper cut-outs to the children, and just tell them to figure it out and put the shapes into piles. Even though they don’t know what any of these shapes are called or even that the goal is to divide shapes (they could just as easily go off of paper thickness or color), they’ll figure out some way of dividing them up based on their similarities and differences. This is unsupervised learning — where AI looks for patterns without being given labels.
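The sort-it-out-yourselves approach can be sketched as a tiny clustering algorithm. This is a minimal, illustrative k-means on one-dimensional data; the numbers and the naive "first k points" initialization are invented for illustration.

```python
# A minimal sketch of unsupervised learning: 1-D k-means with k=2.
# No labels are given; the algorithm just splits the data by similarity.

def kmeans_1d(points, k=2, iterations=10):
    centers = points[:k]  # naive initialization: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.8])
# The points separate into a "low" pile and a "high" pile — no labels needed.
```

Nobody told the algorithm what the two piles mean, only that there should be two of them — just like the toddlers sorting cut-outs by whatever similarities they happen to notice.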

Method 3: Semi-Supervised Learning

There are a few other types of machine learning that don’t quite fall into these categories. There is, for example, semi-supervised learning, which takes little bits of both supervised and unsupervised learning in a “best of both worlds” approach that Hannah Montana would approve of. This happy medium would give the toddler a few categorized shapes to go off of but would leave most of the dividing up to their innate ability to spot patterns.

Method 4: Reinforcement Learning

There is also reinforcement learning, which is considered one of the Big Three with supervised and unsupervised. Reinforcement learning is where the AI takes random actions and is rewarded for doing what it’s supposed to. This is helpful when you’re training AI to play a game; if you train it by showing it what human players do, it’s never going to be better than the best human it learned from, but if you reward it for winning (or taking actions that will lead to an eventual victory), it will naturally gravitate towards best practices and improve over time.
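The act-randomly-and-get-rewarded loop can be sketched with a classic toy problem: a "bandit" agent learning which of two slot-machine arms pays off more often. The payout probabilities, epsilon value, and step count below are all invented for illustration; real reinforcement learning systems are far more elaborate, but the explore-exploit-reward loop is the same.

```python
import random

# A minimal sketch of reinforcement learning: an epsilon-greedy agent
# learning which of two slot-machine arms pays off more often.

def train_bandit(payout_probs, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    values = [0.0] * len(payout_probs)  # estimated reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:      # explore: take a random action
            arm = rng.randrange(len(payout_probs))
        else:                           # exploit: take the best-known action
            arm = max(range(len(values)), key=lambda i: values[i])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return values

values = train_bandit([0.2, 0.8])  # arm 1 secretly pays off far more often
# After training, the agent's estimate for arm 1 is well above arm 0's.
```

The agent starts out acting at random, but because rewards nudge its estimates, it naturally gravitates toward the better arm over time — the same dynamic that lets game-playing AI exceed the humans it never even watched.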

Level Two: What are the applications of these different techniques?

Every machine learning technique has different strengths and weaknesses, and which one you use really comes down to what you’re using it for. Let’s start with supervised learning. To understand when supervised learning will yield the best results, let’s break it down further. There are two main types of supervised learning models: classification and regression.

Classification

Classification techniques do what it sounds like they do — classify data into set groups. This is our toddlers-learning-shapes example. Input w, lemon, and t-shirt, and presto-change-o, you get letter, fruit, and clothing (assuming those were the categories you provided and you gave the AI lots of examples in each category).

Regression

Regression techniques, on the other hand, predict future data based on past data. If you got bored of teaching pre-schoolers shapes and got a job teaching an eleventh grade English class, you could use a regression algorithm to predict your students’ future grades based on their past ones — without ever having to read a single essay on The Great Gatsby.
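The grade-prediction idea can be sketched with ordinary least squares, the textbook way to fit a line through past data points. The quiz numbers and grades below are invented for illustration.

```python
# A minimal sketch of regression: fit a line to past quiz grades
# (invented numbers) and predict the next one with ordinary least squares.

def fit_line(xs, ys):
    """Return slope m and intercept b minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

quiz_numbers = [1, 2, 3, 4]
grades = [70, 75, 80, 85]      # a perfectly steady student
m, b = fit_line(quiz_numbers, grades)
prediction = m * 5 + b         # predicted grade on quiz 5 → 90.0
```

Past data in, predicted future data out — and not a single Gatsby essay read.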

Here are some sample applications of classification:

  • Snapchat discerning that an image includes a face so that it can make you into an anime character
  • When you identify all the stop signs in a reCAPTCHA, you’re training AI to do the same by labeling the data
  • A video platform that could identify offensive gestures
  • Alexa, Google Home, and Siri understanding what you’re saying (as well as speech-to-text keyboards)
  • Gmail (or Yahoo mail, or Outlook, or… AOL?) identifying emails as spam
  • Grammarly telling you that your email sounds aggressive

On the other hand, here are some situations when regression would be helpful:

  • Predicting if (and when) a patient will have a heart attack based on their medical history
  • Predicting tomorrow’s weather
  • Predicting how many TikToks your twelve-year-old sister will watch in one sitting

Unsupervised Learning:

Unsupervised learning, on the other hand, relies not on classification or regression, but on clustering — detecting similarities or differences in order to group data without labels.

Here are some example applications of unsupervised learning:

  • Lumping Amazon users who buy diapers, pacifiers, and onesies into one group
  • Noticing that your Youtube subscribers from Canada watch your ASMR content, but users from the United States prefer slime videos

These patterns might seem obvious, but when each new parent, for example, is also buying office supplies and Stranger Things merch and hand sanitizer, it is incredibly helpful to have a way of grouping people together and offering them recommendations without having to create labeled categories that can encapsulate everyone on the internet. In many cases, an unsupervised learning algorithm can find patterns that a human just wouldn’t see.

Last but by no means least, here’s where you would see reinforcement learning put to use:

  • Self-driving cars might be trained with reinforcement learning
  • AI that sets the prices of products is trained by reinforcement learning to find the highest price at which people will still buy the product
  • Facebook uses reinforcement learning to personalize suggestions and notifications to a particular user
  • The AI AlphaGo Zero learned to play the game of Go better than the best humans without any labeled data about the rules or strategies or human techniques — just reinforcement learning

These are only a few examples of each type of machine learning, but hopefully, you can use this labeled data to draw some conclusions about other examples of AI for yourself. What do you think Instagram uses to recommend videos? How do you think AI that’s designed to give optimal bids at auctions is trained?

Level Three: Okay, that’s simple enough… but to understand how they work, I would need to understand some heavy-duty calculus, right?

Nope! In fact, you can grasp every method of machine learning with almost no math at all, at least to a certain extent. Of course, if you wanted to go make AI you’d probably need a solid understanding of linear algebra, vectors, and multi-variable calculus, but in envisioning how AI works, the math isn’t the problem. However, things are about to get a little confusing, so buckle your seatbelts.

Neural Networks

You may have heard the term “neural network.” This term is fairly self-explanatory: a computer system modeled on the human brain and nervous system; a network that simulates the behavior and activity of neurons. Neural networks are typically trained with supervised learning, so they’re not universal, but that’s what we’re going to focus on for right now.

A graphic comparing computer neural networks to human ones

The bottom image shows what goes on in your brain that allows you to identify a fuzzy, adorable creature. Light bounces off of the creature in question and into your eyes, and then your neurons do… something. What do they do? Who knows? We understand the parts of a neuron in general, and how they interact with other neurons, but exactly what turns cat-light into cat-image into cat-thought deserves an article of its own.

Neural networks follow a similar pattern, though. You put in your input (raw data), something happens to it, and it becomes an output. Now, since this is supervised learning, we have some examples of how other things have been classified, but we don’t exactly have rules for classification.

Say you walked into a math classroom and saw the following problem on the board.

f(2) = 4

For the math-averse, that translates to an input of two and an output of four. So, okay, cool. We know what happens when we put in a two. But what is f(3)?

Well, to do that, we need to know the function — we need to know exactly what changed that two into a four.

2 + 2 = 4,

so at first, we might decide that

f(3) = 3 + 2
f(3) = 5

But

2 x 2 = 4

is also true. So

f(3) = 3 x 2
f(3) = 6

But doesn’t 2² = 4? And doesn’t (2 + 1,896)/474.5 = 4? So how on earth does our hypothetical sadistic math teacher expect us to figure out f(3)?

Well, don’t panic, because we have a few things going for us. First of all, we have a lot of data. Like, a lot a lot. So our AI isn’t exactly flying blind. But we’re also looking at a really complicated function here. We’re dealing with text, or speech, or image recognition, not an eighth-grade math problem. So what’s a simple AI to do?

Say we’re setting up AI that’s designed to listen to music and tell us whether or not it’s a bop. We have two outputs: “bop” and “not_bop”. And we’ve given the AI a bunch of sample data with labeled inputs and outputs. “Bohemian Rhapsody”? Bop. “Baby”? Not_bop. So now our AI’s task is to listen to music, and tell us: bop or not?

Let’s examine this sample neural network. It’s cute and small and will let us envision our much more complex neural network. We start with input. Here, we see room for two inputs, maybe “genre” and “tempo”, but we could have as many as we like, such as “key signature”, “time signature”, “timbre”, etc.

Once we enter our inputs, they move through the hidden layer, and into the output layer, where we get our results. AI can also have multiple hidden layers. (By the way, neural networks with more than one hidden layer are said to be using deep learning, since there are many transformations happening from input to output.)
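The input-to-hidden-to-output flow can be sketched as a tiny forward pass. Everything here — the weights, the biases, and the idea of encoding “genre” and “tempo” as numbers — is invented for illustration; a real network would learn its weights rather than have them handed over.

```python
import math

# A minimal sketch of a forward pass through a tiny neural network:
# 2 inputs -> 2 hidden nodes -> 1 output. All numbers are made up.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, out_weights, out_bias):
    # Hidden layer: each node takes a weighted sum of the inputs, plus a bias.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    # Output layer: weighted sum of hidden activations, squashed to (0, 1).
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

song = [1.0, 0.7]  # e.g. genre score, tempo score (a made-up encoding)
bop_score = forward(song,
                    hidden_weights=[[0.5, -0.2], [0.3, 0.8]],
                    hidden_biases=[0.1, -0.1],
                    out_weights=[1.2, -0.6],
                    out_bias=0.05)
# bop_score is a single number between 0 and 1
```

With more hidden layers, the same pattern just repeats: each layer’s outputs become the next layer’s inputs — which is all “deep learning” means structurally.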

So what goes on in those hidden layers, anyway? The simple answer is that the function that will determine the song’s category is applied. But at first, that function isn’t what we want it to be. You see, AI isn’t actually a toddler. It isn’t a person that’s going to get a sense of your taste in music and use that to predict what you want to listen to. So the AI’s going to do the best it can — which isn’t very good. In fact, it’s basically going to divide songs up at random. But our AI is receptive and wants to improve, so we can employ a handy technique called a “cost function.”

Cost Functions (Part 1)

A cost function involves a lot of math, which we’ll get into in Level Four, but here's the basic concept: a cost function is going to compare the output that we get to the output that we wanted to get. So say that our algorithm has declared that “Gangnam Style” is not_bop, when obviously, it’s a bop. Upon closer inspection, we see that the total bop-ness, a number generated by the last hidden layer (more on how that works in Level Four), is much lower than it should be. Our cost function looks at exactly how far off that number is, and which of our “nodes” made it go wrong.

Nodes

What is a node? Each little circle. A node can hold an input (raw data), a transformation that is performed on that data in the hidden layer, or an output. So if our AI can figure out how far off it was, and which nodes are causing the problems, it can improve to do better next time, until, eventually, we can say with close to 100% certainty whether or not a given song is a bop.

Level Four: Light math, as promised

For those of you who want a slightly deeper understanding of what’s actually going on in a neural network, I’m going to take it just one step further. No need to break out your old math textbooks; this is just a slightly more complex look at a neural network’s inner workings.

First of all, the ultimate goal of our network is to produce outputs between zero and one for our two categories: “bop” and “not_bop”. Each of these categories has an activation threshold. Let’s say each song needs a value of at least 0.75 to be considered a bop — anything less, and it’s not_bop. A higher threshold is going to mean more false negatives, while a lower threshold will lead to false positives, but we don’t want to risk having some not_bops sneak onto our playlist. The way we derive this number is going to look something like this:

Y_hat = b_1*X_1 + b_2*X_2 + b_3*X_3 + a

This looks kind of terrible, but you might remember a little formula that you learned in middle school that looks like this:

y = mx + b

This is slope-intercept form and — surprise — it does have real-world applications! Now, thinking back to Algebra I, x and y represent your horizontal and vertical value on the coordinate plane respectively, m represents the slope of the line (the rate of vertical change with respect to horizontal change), and b represents the y-value when x=0; the y-intercept.

In other words: x is the input, and y is the output. To turn x into y, we multiply by the slope of our line (the rate at which output increases or decreases as input increases) and add the y-intercept — the output when our input is zero.

Y_hat = b_1*X_1 + b_2*X_2 + b_3*X_3 + a

Now, this formula looks much more complicated. But in reality, it’s just slope-intercept form in disguise! X_1, X_2, and so forth are our inputs — we have one for each piece of data that we put into the network. The b values are our slopes — the rate of change of each output with respect to that input (also known as “weights”). Finally, a is the y-intercept, also known as the “bias” — a constant that we add to make our function work better.

The “Squishing Function”

However, you may recall that we’re trying to get an output between 0 and 1. To do this, we need a “squishing function.” There are a couple of ways of doing this, but let’s talk about just one: the logistic function. Here it is, a formula that you might remember from pre-calculus:

f(x) = 1 / (1 + e^(-x))

e is Euler’s Number, an irrational constant approximately equal to 2.71828. The negative exponent makes e^(-x) very small (but never zero or less) when x is positive, and very large when x is negative. So if we work on the assumption that the boppier the song, the greater the Y_hat, let’s examine what our output will look like:

Our logistic function, applied at our very last node, takes our Y_hat (the output of the previous node layer) and plugs it into x. If we look at a bop [your favorite song here], we’re going to have a very large Y_hat. This is completely arbitrary since this is completely hypothetical, but let’s presume it’s 9,001. e to the negative 9,001 is going to be very, very small — almost zero. That means our function is going to be very close to one divided by one — just slightly below one. Since that’s over the bop threshold of 0.75, your favorite song is, definitively, a bop!

However, if we examine your least favorite song, we’re going to get a very small Y_hat — we’ll call it -9,001, for balance. Because of the double negative, we’re going to end up dividing one by one plus e to the positive 9,001, which is huge. 1 divided by a very, very large number will be very close to zero — therefore, the song in question is not_bop.
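The squishing described above can be checked directly in code. One wrinkle worth noting: for x = -9,001, naively computing e^9001 overflows floating-point numbers, so the sketch below uses the standard numerically stable form of the logistic function (the two branches are mathematically identical).

```python
import math

# The "squishing function" in action: a numerically stable logistic function.

def squish(x):
    if x >= 0:
        return 1 / (1 + math.exp(-x))         # e^(-x) safely underflows to 0
    return math.exp(x) / (1 + math.exp(x))    # avoids computing huge e^(-x)

print(squish(9001))   # effectively 1 -> well over the 0.75 bop threshold
print(squish(-9001))  # effectively 0 -> not_bop
print(squish(0))      # exactly in the middle: 0.5
```

However big or small Y_hat gets, the output stays pinned between 0 and 1 — which is exactly why this function sits at the very end of the network.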

But, of course, our AI isn’t going to assign high Y_hats to bops and low Y_hats to not_bops at first. It’s going to be random. So how can we get our AI to do better?

Cost Functions (Part 2)

The answer, as we saw earlier, is a cost function. But how does that actually work? It’s easier than it looks. Basically, you square the difference between the output you got and the correct output (1 for bops, 0 for not_bops). So if you wanted a 1 and got a 0.06, you would do (0.06 - 1)² = 0.8836. This is your cost. The better your AI gets, the smaller your cost will be, as your margin of error shrinks — but a not-so-great function will have bigger costs, because the margin of error is larger.
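In code, that squared-error calculation is a one-liner:

```python
# A minimal sketch of the squared-error cost from the example above.

def cost(predicted, desired):
    return (predicted - desired) ** 2

print(cost(0.06, 1))  # the cost from the example, about 0.8836
```

Summing this quantity over every training example (as the next paragraph describes) gives the total cost the network is trying to shrink.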

Because we have many, many weights and biases creating new outputs throughout each layer of our function, our cost function does this square-the-difference procedure on every single one — and then adds them all up. So if our goal is to minimize the cost, we have to find the inputs — that is, the weights and biases — that will give us the minimum cost value. If you’ve taken some calculus, you’ll probably remember that we’ll have a minimum where the derivative of our cost function — that is, the instantaneous rate of change of the cost relative to a given input — is equal to zero.

But when our cost function is really complex, solving directly for where the derivative equals zero isn’t always an option. However, the slope of the tangent line at a particular point will help you understand whether to raise or lower the input — if the slope is positive, lowering the input will also lower the cost, whereas a negative slope means that you must raise the input to lower the cost.

Here, a Babylonian Square-Root Algorithm-type method is probably more helpful than calculus; that is, guess, check, and revise. Check your slope, move in the necessary direction, and assess whether you’ve gone too far until your slope finally equals zero. Of course, a function can have multiple relative minimums, so there’s a good chance that the weights and biases you’ll obtain won’t give you the lowest possible cost. However, this system is good enough to give you a pretty accurate model.
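The guess-check-revise loop described above is, in essence, gradient descent. Here is a minimal sketch on a toy one-weight cost function, cost(w) = (w − 3)², whose minimum we know in advance is at w = 3; the learning rate and step count are arbitrary illustrative choices.

```python
# A minimal sketch of guess-check-revise (gradient descent) on a toy
# one-variable cost function whose minimum we know is at w = 3.

def cost(w):
    return (w - 3) ** 2

def gradient_descent(start, learning_rate=0.1, steps=100, h=1e-6):
    w = start
    for _ in range(steps):
        # Check the slope numerically (central difference)...
        slope = (cost(w + h) - cost(w - h)) / (2 * h)
        # ...then move downhill, proportionally to how steep it is.
        w -= learning_rate * slope
    return w

w = gradient_descent(start=0.0)
# w ends up very close to 3, the weight that minimizes the cost
```

A real neural network does this simultaneously for thousands or millions of weights and biases (via backpropagation rather than numerical slopes), but the downhill-until-flat idea is the same — including the caveat that it may settle into a relative minimum rather than the absolute lowest cost.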

So, to recap:

Level One

  • Artificial intelligence is an algorithm designed to solve a problem
  • AI can be trained to solve that problem through a process called machine learning
  • There are three main types of machine learning: supervised, unsupervised, and reinforcement learning
  • In supervised learning, AI uses labeled data to learn how to put similar unlabeled data into set categories
  • In unsupervised learning, AI looks for patterns and puts unlabeled data into its own groups
  • In reinforcement learning, AI initially behaves randomly, and is rewarded when it randomly opts to behave “correctly,” leading it to get better and better at its job
  • In semi-supervised learning, AI is given some labeled data and a lot of unlabeled data, offering the extra help of supervised learning without the hassle of labeling everything

Level Two

  • There are two main supervised learning techniques: classification and regression
  • Classification organizes data into set groups
  • Regression predicts future data based on past data
  • Unsupervised learning uses a technique called clustering
  • Clustering detects similarities or differences to group data without labels

Level Three

  • Neural networks are a type of classification model
  • They send inputs through one or many hidden layers to create an output
  • Deep learning is when a neural network has more than one hidden layer
  • Cost functions help neural networks “understand” where they went wrong so they can improve
  • Nodes are small pieces of neural network that perform tasks, such as holding or transforming input, and are organized in layers

Level Four

  • The output of a neural network is determined by whether or not the input, when transformed by the hidden layer(s), meets the activation threshold of a given output node
  • Input is transformed by weights (slopes) and biases (y-intercepts) in an equation best described as slope-intercept-form-from-hell
  • Cost functions determine how “wrong” an output is by comparing the actual output to the desired one
  • “Wrongness” is measured in cost
  • Cost functions also determine the weights and biases that will achieve the lowest “cost”

Conclusion

Congratulations, you made it! Artificial intelligence can be difficult to understand at first, and more often than not, it’s shrouded in mystery on purpose — do you really think Facebook wants everyone knowing its inner workings? However, hopefully you now know enough to impress your friends, and are inspired to go learn more!

Neural networks are just one example of the many types of artificial intelligence, and the first one was created in 1958. Artificial intelligence has a legacy of more than sixty years, and you could be the person who takes the next big step. When it comes to artificial intelligence, the future is wide open — so go out there and learn about it, and try not to create anything that will destroy humanity as we know it!


17-year-old human-longevity researcher, actress, songwriter, TKS Innovator, and marshmallow enthusiast