Beginner’s Guide to the Maths behind Machine Learning
Kickstart your ML journey — Essentials of Linear Algebra, Calculus and Statistics for success in ML
To gain a deep understanding of how Machine Learning works, it’s important to understand the underlying maths powering some of these ML algorithms.
The purpose of this article is to acquaint you with some of the most important and fundamental mathematical concepts you need to succeed in ML.
You don’t need any ML knowledge to enjoy and understand this article. If you love learning and want to understand the essential maths powering ML, this is the perfect article for you.
The Maths behind Machine Learning
It’s often said that you need to know 3 types of maths for ML:
- Linear Algebra: Linear Algebra defines a system for arranging numerical data and performing computations between them. You may have heard of words like “vector”, “matrix” or “dot-product”.
- Calculus: I know it sounds complicated, but calculus simply studies how changes in one variable affect changes in another. For example, how does a change in weather affect the number of ice cream sales? Words such as “derivative”, “slope”, and “gradient” are all used to describe the same idea: change.
- Statistics: Statistics refers to how data is prepared, described and analysed. You may have heard of terms such as “standard deviation”, “distribution” or “Gaussian”.
My goal is that you walk away from this article feeling comfortable with the main ideas behind these topics.
Let’s begin.
Linear Algebra: Arranging data
- Vectors
- Dot Products
- Matrices
- Matrix multiplication
- Transposing a matrix
Vector
A vector is a list of numbers. For example, I can use a vector to record ice cream sales for the past week.
Given two vectors of the same shape, we can add, subtract, and multiply these vectors.
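If you’d like to try this yourself, here’s a quick sketch in Python using NumPy (the sales numbers are made up for illustration):

```python
import numpy as np

# A vector of daily ice cream sales for the past week (hypothetical numbers)
sales = np.array([12, 15, 9, 20, 18, 25, 30])
bonus = np.array([1, 1, 1, 1, 1, 2, 2])   # extra sales from a weekend promotion

print(sales + bonus)   # element-wise addition
print(sales - bonus)   # element-wise subtraction
print(sales * bonus)   # element-wise multiplication
```

Notice that all three operations work element by element, which is why the two vectors must have the same shape.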
Dot Product
We also have some special operations for vectors. You may have heard of the “dot-product”. The dot-product takes two equal-length vectors, multiplies their corresponding elements, and sums the results.
The dot-product has a range of uses, but it usually appears in some sort of similarity computation: the more similar two vectors are, the higher their dot-product (see cosine similarity).
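Here’s a small Python sketch of the dot-product, computed both with NumPy and “by hand” so you can see that they agree (the vectors are arbitrary examples):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Multiply corresponding elements, then sum: 1*4 + 2*5 + 3*6 = 32
print(np.dot(a, b))                       # NumPy's built-in dot product
print(sum(x * y for x, y in zip(a, b)))   # the same computation, by hand
```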
Matrix
A matrix is a table of numbers.
My ice-cream shop is now selling 3 flavours! Chocolate, strawberry and vanilla. I can use a matrix to record ice cream sales for the past week, for each of these flavours.
The above matrix has a shape of (3, 7) because it has 3 rows and 7 columns. The numbers ‘3’ and ‘7’ are called dimensions.
- A matrix has 2 dimensions.
- A vector has 1 dimension.
- A single number (called a scalar) is said to have 0 dimensions.
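You can check these shapes and dimension counts directly in Python with NumPy (again, the sales numbers below are invented for illustration):

```python
import numpy as np

# Sales for 3 flavours over 7 days: 3 rows, 7 columns (hypothetical numbers)
sales = np.array([
    [12, 15, 9, 20, 18, 25, 30],   # chocolate
    [8, 11, 7, 14, 13, 19, 22],    # strawberry
    [10, 9, 6, 12, 15, 17, 21],    # vanilla
])

print(sales.shape)               # (3, 7)
print(sales.ndim)                # 2 -> a matrix has 2 dimensions
print(np.array([1, 2, 3]).ndim)  # 1 -> a vector has 1 dimension
print(np.array(5).ndim)          # 0 -> a scalar has 0 dimensions
```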
We can add and subtract matrices of the same shape, although this operation is rarely used in ML.
Matrix Multiplication
There’s a special (and extremely useful) operation between matrices called “matrix multiplication”. It sounds complicated, but it’s really a bunch of dot products.
The only rules for matrix multiplication are:
- Matrix multiplication is between two matrices.
- If the first matrix has shape (a, b) then the shape of the second matrix must be (b, c). In other words, the number of columns in the first matrix must equal the number of rows in the second matrix.
- The matrix which is produced will have shape (a, c).
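The three rules above can be checked with a short Python sketch (the matrices here are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])       # shape (3, 2)

# (2, 3) times (3, 2): the inner dimensions match, so the result is (2, 2)
C = A @ B
print(C.shape)   # (2, 2)

# Each entry of C is a dot-product of a row of A with a column of B,
# e.g. C[0, 0] = 1*1 + 2*0 + 3*1 = 4
print(C)
```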
Take a moment to understand the diagram above. I’ve only shown the calculations for two of the result entries; can you figure out the rest?
Matrix multiplication is arguably the most foundational operation in all of machine learning, and especially deep learning.
Transposing a matrix
When you transpose a matrix, the rows and columns switch. This has the effect of transforming a matrix with shape (a, b) into a matrix with shape (b, a).
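In Python, transposing is a one-liner (the matrix below is an arbitrary example):

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(M.T)         # rows and columns switch places
print(M.T.shape)   # (3, 2)
```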
Learn More
Hoorah! Now you know about vectors, dot-products, matrices, matrix multiplication and matrix transposition, which are the terms you’ll most often hear about in ML.
To boost your knowledge, I highly recommend the following resources to learn more about linear algebra:
- Khan Academy’s linear algebra course
- 3Blue1Brown’s “Essence of Linear Algebra”, if you crave geometric intuition.
Calculus: Measuring change
- Slope
- Derivative
- Minimizing a cost function
- Commonly Used Symbols
Slope
I want to know how temperature affects my shop’s ice-cream sales. My intelligent advisor has generated the following graph:
Changing a variable (e.g. temperature) will often change another variable (e.g. number of ice cream sales) — slope measures the strength of this change.
Another word for slope is “gradient” — they mean exactly the same thing.
Derivative
Whereas for a straight line the slope is constant throughout the graph, for curved functions the slope is constantly changing.
That is, the rate of change itself is changing.
For the above plot, the function graphed in blue is f(x)=x². The derivative of this function happens to be 2x. That means we can calculate the slope of f(x) for any x. The slope at x=0 is 0, the slope at x=4 is 8, and the slope at x=-3 is -6.
In a nutshell:
The derivative tells you the slope at any point.
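If you want to verify this for yourself, here’s a tiny Python sketch (not from the article) that estimates the slope of f(x) = x² numerically, by measuring the rise over a tiny step h, and compares it to the derivative 2x:

```python
def f(x):
    return x ** 2

def slope(f, x, h=1e-6):
    # Approximate the slope: change in f divided by a tiny change in x
    return (f(x + h) - f(x)) / h

for x in [0, 4, -3]:
    print(x, round(slope(f, x), 3))   # approximately 0, 8, -6 -> matches 2x
```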
Minimizing a cost function
Arguably the most important application of calculus in ML is minimizing the cost function. It sounds scary, but it’s a straightforward concept.
When we want to minimize a function, we want to find the lowest point of that function, called the minimum point. We are usually given a starting point, and we want to iteratively approach the minimum point.
Given a function and a starting point, we can use the derivative (slope) of the function to gradually approach the minimum point.
Intuition:
- If the derivative is positive (we are on the right of the minimum), this means that if we increase x, we will get a higher value of y, which is the opposite of what we want. So we decrease x (jump left).
- In the same manner, if the derivative is negative (we are on the left of the minimum), this means that if we increase x, we will get a lower value of y, which is what we want. So we increase x (jump right).
That’s how we minimize a function.
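The jumping procedure above can be sketched in a few lines of Python. This is a minimal, hypothetical example: it minimizes f(x) = x², whose derivative is 2x, starting from x = 5 (in real ML, libraries handle this for you):

```python
def derivative(x):
    return 2 * x   # the derivative of f(x) = x**2

x = 5.0             # starting point, to the right of the minimum
learning_rate = 0.1 # how big each jump is

for _ in range(100):
    # Positive derivative -> decrease x; negative derivative -> increase x.
    # Subtracting the derivative does both automatically.
    x = x - learning_rate * derivative(x)

print(x)   # very close to the minimum at x = 0
```

Note how a single update rule, “subtract the derivative”, captures both bullet points: the sign of the derivative decides whether we jump left or right.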
But what’s a cost function?
A cost function simply tells us how bad an ML algorithm is performing. A good ML algorithm will produce a low cost, while a bad ML algorithm will produce a high cost. It follows that minimizing the cost function will help us find a good ML algorithm.
Another term that’s thrown around in this context is optimization. In simple terms, when someone says they are “optimizing an ML algorithm”, it just means that they are minimizing the cost function.
Commonly Used Symbols
Calculus uses a lot of special symbols, which can often be confusing. Here I want to show the most commonly used symbols, and provide a simple definition for each.
Note: the second and third symbols do have some nuanced meaning, but they still convey the same idea as the derivative. Whenever you see these symbols, just think “derivative” or “slope”.
Learn More
Congrats! Now you’re familiar with slope, gradient and derivative (they all mean the same thing), and the idea of minimizing a cost function.
That being said, there’s so much to learn to further your understanding of calculus. Here are some amazing resources:
- Khan Academy’s Differential Calculus course (highly recommended)
- 3Blue1Brown’s “Essence of Calculus” (highly recommended)
- Khan Academy’s Multivariable Calculus course, specifically the section on derivatives of multivariable functions
- My personal tutorials on basic calculus including derivatives
Statistics
- Mean and Median
- Standard deviation
- Distribution
- Gaussian Distribution
Mean
The mean, or the average, is found by summing together a group of values and then dividing by the number of values.
In fancy maths notation, the mean is calculated as: mean = (x₁ + x₂ + … + xₙ) / n, where n is the number of values.
Median
The median is the “middle” value. To calculate the median, the data must first be sorted (if there’s an even number of values, the median is the average of the two middle values).
In this example the mean and median were equal (166); however, this is usually not the case.
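Here’s a small Python sketch of both calculations. The heights are made up so that, like the article’s example, the mean and median both come out to 166:

```python
# Heights in cm (hypothetical, chosen so mean == median == 166)
heights = [150, 160, 166, 172, 182]

mean = sum(heights) / len(heights)

# The data must be sorted first; this list has an odd length,
# so the median is simply the middle value
median = sorted(heights)[len(heights) // 2]

print(mean, median)   # 166.0 166
```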
Standard deviation
Using the mean, we can calculate the average height given a list of heights.
Standard deviation is a measure of how spread out or dispersed a set of values is. In simpler terms, it tells you how much individual numbers in a group differ from the average (mean) of that group.
The standard deviation of 23 tells us that, on average, people are about 23 cm shorter or taller than the average height.
The standard deviation can also be expressed as a mathematical formula: take each value’s distance from the mean, square it, average those squared distances, and take the square root — std = √( ((x₁ − mean)² + … + (xₙ − mean)²) / n ).
Take a moment to make sure this formula makes sense. It’s not that important to remember the formula, but you should understand the information that standard deviation conveys: it measures the spread or dispersion of a group of values.
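To make the formula concrete, here’s the calculation in plain Python, using the same hypothetical heights as before (note these give a smaller spread than the article’s example of 23):

```python
import math

heights = [150, 160, 166, 172, 182]
mean = sum(heights) / len(heights)   # 166.0

# Average of the squared distances from the mean, then square root
variance = sum((h - mean) ** 2 for h in heights) / len(heights)
std = math.sqrt(variance)

print(round(std, 1))   # roughly how far a typical height is from the mean
```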
Distribution
We have been using the term “group of values” to describe, well, a group of values.
The mathematical term for a group of values is distribution.
Up until now, we’ve been viewing a distribution as a vector of numbers, but there’s a really nice way of visualizing these numbers, called a histogram.
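A histogram is built by splitting the number line into “bins” and counting how many values land in each one. Here’s a sketch of those counts in Python using NumPy (the values are hypothetical heights):

```python
import numpy as np

values = [150, 155, 158, 160, 162, 165, 166, 168, 170, 172, 180]

# Split the range into 3 equal-width bins and count values per bin --
# these counts are exactly the bar heights of a histogram
counts, bin_edges = np.histogram(values, bins=3)

print(counts)      # how many values fall in each bin
print(bin_edges)   # the boundaries of the bins
```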
Distributions come in all shapes and sizes. Some have a single peak, some have two peaks, and some have no peaks. Some have long tails, some are wide, and some are really narrow.
Here are some more distributions to feast upon:
Remember, a distribution is just a group of values.
Gaussian Distribution
Arguably the most famous distribution is the Gaussian Distribution, a.k.a. the Normal Distribution, a.k.a. the Bell Curve (so many names!).
Informally, a Gaussian Distribution is any group of values where:
- The values are symmetric about the mean.
- Most values are around the mean.
These conditions can be easily checked through a histogram:
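You can also check these two conditions numerically. Here’s a hypothetical sketch in Python that draws samples from a Gaussian (mean 166, standard deviation 10) and verifies both properties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 100,000 samples from a Gaussian with mean 166 and std 10
samples = rng.normal(loc=166, scale=10, size=100_000)

# Symmetric about the mean: the median sits right on top of the mean
print(round(samples.mean(), 1))      # close to 166
print(round(np.median(samples), 1))  # also close to 166

# Most values are around the mean: about 68% fall within one std of it
within_one_std = np.mean(np.abs(samples - 166) < 10)
print(round(within_one_std, 2))
```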
Gaussian distributions are famous because they appear frequently in the real world. The following real-world variables are all approximately normally distributed:
- Height
- IQ
- Birth weight
- SAT scores
- Reaction times
Learn More
Bravo! Now you know what the mean, median and standard deviation are. You’re also familiar with what a distribution is, and you know about the famous Gaussian distribution.
I’ve covered the most essential topics of Statistics for ML. However, as always, there’s a lot more you can learn to strengthen your understanding.
Summary
Phew, you made it! Give yourself a pat on the back for coming this far!
In this article we discussed the basics of:
- Linear Algebra: Vectors are just lists of numbers, matrices are tables, the dot product converts two vectors into a single number, matrix multiplication is just a bunch of dot products, and matrix transposition is just a good ol’ switcheroo!
- Calculus: Slope, derivative, and gradient all mean the same thing: they define the rate of change of one variable with respect to another. To minimize a function you use the derivative, and a cost function tells us how bad an ML algorithm is doing — we want to minimize this function.
- Statistics: In statistics, a group of values is called a distribution. Given a distribution, you know how to calculate its mean (average), median (middle) and standard deviation (spread). You also know about the famous Gaussian/normal/standard distribution, and that it appears often in the real world.
Conclusion
🎉 Well done dear learner! You’ve made a lot of progress! You should be proud of yourself!
Next Steps
Now that you know the basics, I highly recommend you strengthen your understanding of these topics by going through the additional resources I provided throughout the article.
Anything worthwhile takes time, so don’t be disheartened if you don’t understand a concept right away. Just keep going at it, do some more practice problems, and eventually you’ll have all the mathematical tools you need to succeed in ML.
P.S. I highly recommend Khan Academy for practising maths, it’s really great!
Thanks for reading my first article!
I hope you enjoyed it and gained valuable knowledge from it.
I’ll be posting once a week from now on, so if you enjoyed this article and want to see more, follow me on Medium.
If you have any questions, comments, or constructive feedback, drop me an email,
raj.pulapakura@gmail.com
Until next time, happy learning! 🤩