Maths Behind Machine Learning

Valerie Dela Cruz · Published in The Startup · May 12, 2020

Unboxing the Black Box

Opening a box sometimes brings a rush of excitement and the urge to describe everything you find inside. Here, I describe some of the fundamental maths I found inside the black box that is machine learning. My aim is to share a bit of insight into the mathematical tools needed for machine learning while saving the gory details for later.

Photo by Erda Estremera on Unsplash

Bayes' Theorem

Bayes' Theorem is extremely useful in analysing real-life problems. It may seem like a simple multiplication of numbers, but I think the tricky part is mathematically incorporating prior knowledge, that is, the knowledge one already has before learning new information. A simple way to think about it: when we have prior knowledge that fire is hot, we use it to weigh the chances before deciding whether or not to touch a flame.

There is a host of material out there for learning Bayes' Theorem. A reliable resource is Khan Academy, as they present complex ideas in a simple, visual, and non-intimidating way.

Once you are familiar with the idea, it can be fun to use Bayes' Theorem when answering ordinary situational questions. For example, “How do you know if a girl who smiles at you likes you?” Well, we can use Bayes' Theorem, and we will need the following four quantities.

  • the probability that the girl likes you: P(like)
  • the probability that the girl smiles at you: P(smile)
  • the probability that the girl smiles at you, given she likes you: P(smile|like)
  • the probability that the girl likes you, given she smiles at you: P(like|smile)

The question “How do you know if a girl who smiles at you likes you?” is represented by the fourth quantity, P(like|smile), and it is calculated with this ratio:
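Written out, the ratio is just Bayes' Theorem applied to the quantities above:

P(like|smile) = P(smile|like) × P(like) / P(smile)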

Chances she likes you given she smiled at you. Image by author

The resulting number for P(like|smile) represents how likely it is that she likes you, given what you know about her frequency of smiling and your prior probability that she likes you. In the table below, Scenario A is when you have assigned a probability of her liking you of 20%, and you know the girl smiles frequently, so you set the chance of smiling at 80%. The chance that she likes you, P(like|smile), comes out at 25%, which is not super. But if your prior knowledge is that she doesn't smile often, i.e. P(smile) = 20%, then her smiling at you becomes much stronger evidence that she likes you. This is what we see in the resulting P(like|smile) = 99%.

Scenario A and B. Image by author
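If you want to sanity-check the table yourself, here is a minimal Python sketch. Note that the value P(smile|like) = 99% is my assumption, chosen so the outputs match the numbers quoted above; the author's table may use slightly different inputs.

```python
def p_like_given_smile(p_smile_given_like, p_like, p_smile):
    """Bayes' Theorem: P(like|smile) = P(smile|like) * P(like) / P(smile)."""
    return p_smile_given_like * p_like / p_smile

# Scenario A: she smiles a lot, so a smile carries little information.
print(p_like_given_smile(0.99, 0.20, 0.80))  # ~0.25

# Scenario B: she rarely smiles, so a smile is strong evidence.
print(p_like_given_smile(0.99, 0.20, 0.20))  # 0.99
```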

Furthermore, Bayes' Theorem also explains a function typically used to decide between two different classes. I wrote about it here.

Linear Algebra

If you took linear algebra in school and found linear transformations to be a bunch of mechanical operations, I would strongly recommend the video series by 3Blue1Brown called The Essence of Linear Algebra. The videos present the ideas in a more intuitive way. I especially liked the one on Eigenvalues and Eigenvectors, as it reminded me of writing an undergraduate paper on matrices many, many years ago.

Tip: if you find yourself stuck on how to do a matrix operation, there's the Matrix Cookbook to turn to.
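To make the eigenvalue idea concrete, here is a small NumPy sketch (the matrix is just an arbitrary example I picked, not from the article) showing that a linear transformation only scales its eigenvectors:

```python
import numpy as np

# An arbitrary 2x2 linear transformation.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigenvalues and eigenvectors satisfy A @ v = lambda * v for each pair.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # [3. 1.]

# Each eigenvector is only scaled (not rotated) by the transformation.
v = eigenvectors[:, 0]
print(A @ v)               # same direction as v
print(eigenvalues[0] * v)  # identical result
```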

(Multivariate) Normal Distribution

The famous bell-shaped curve lends itself well to machine learning because of properties such as being unimodal, i.e. having a single maximum. And in machine learning, we are frequently interested in maximising functions (e.g. the likelihood). Here's a nice illustration of unimodality from Intro to Descriptive Statistics.

Intro to Descriptive Statistics by Niklas Donges

Another property is that its probability density function is an exponential function, which becomes convenient when independence is assumed. Independence, in this context, means that you assume the probability of one event or variable does not affect the probability of another. The probability of multiple independent events happening at the same time (i.e. jointly) then becomes a multiplication of exponential functions. Furthermore, the normal distribution dovetails nicely with the Central Limit Theorem: the average of a large number of independent samples from a dataset is approximately normally distributed.
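As a rough illustration (a minimal sketch of my own, not from the article): multiplying the densities of independent normal observations is the same as exponentiating a sum, which is why log-likelihoods are so pleasant to maximise.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2): an exponential function of x.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.array([0.5, -1.2, 0.3])  # three independent observations

# Joint density of independent events = product of individual densities...
joint = np.prod(normal_pdf(x))

# ...and because each density is an exponential, the log of the product
# collapses into a simple sum.
log_joint = np.sum(np.log(normal_pdf(x)))

print(np.isclose(np.log(joint), log_joint))  # True
```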

“Multivariate” is an important detail because, when trying to predict an event, we usually use more than one variable to explain it. Multiple variables are typically represented by vectors, so proficiency at operations on vectors and matrices, i.e. linear algebra, is useful knowledge.
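For instance, sampling from a multivariate normal in NumPy takes a mean vector and a covariance matrix rather than two scalars; the specific numbers below are placeholders I chose for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.array([0.0, 2.0])   # one entry per variable
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])  # variances on the diagonal, covariances off it

# 1000 draws, each a 2-dimensional vector.
samples = rng.multivariate_normal(mean, cov, size=1000)
print(samples.shape)         # (1000, 2)
print(samples.mean(axis=0))  # close to [0.0, 2.0]
```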

In a nutshell, for me, a good grasp of linear algebra, the normal distribution, and Bayes' Theorem makes a great set of preliminary mathematical tools to pack when setting out on the journey towards machine learning.
