Four pillars of Machine Learning #2 — Linear algebra and calculus

Sarthak Malik · Published in CodeX · Jan 15, 2022 · 10 min read

In the earlier post, we discussed two pillars of machine learning, "statistics" and "probability", needed to grasp ML concepts clearly and easily. With those, we are set for the "data preprocessing" and "data analytics" steps discussed in Getting Familiar to The World of Machine Learning. In this post, we will go through the algebra and calculus required for model training and evaluation.

3. Algebra

Machine learning is largely about setting up and solving multi-variable equations, and that is exactly what linear algebra is for. Linear algebra is the branch of mathematics that deals with linear equations such as y = mx + c and their representation in vector spaces (a vector space is a set of vectors, together with scalars, used to describe the points of interest). Linear algebra is a vast field and can be approached in many ways, but we will study only the essentials here and skip any unnecessary detail.

3.1 Scalars

Even someone who is new to machine learning and has never heard of linear algebra or scalars will have done basic mathematics: operations like addition, multiplication, and so on. Every such operation manipulates single numbers, and these numbers are called scalars. The cost of an item, a temperature, or the quantities in an equation like y = 2*x + 3 are all scalars; in that equation, y, x, 2, and 3 are scalars.

Note: In this whole series, we will use lowercase, non-bold letters to represent scalars.

3.2 Vectors and matrices

3.2.1 Vectors

Many may wonder: if we can represent everything with scalars, why do we need vectors or matrices? Consider how we would represent a whole collection of scalars. Imagine a dataset of houses, each described by 150 or more features such as the number of rooms, size, and so on. If we want to refer to a single house, writing out all 150 scalars would be very tedious; this is where vectors come to the rescue. Think of a vector as a collection of scalars; it can be written as

Vector. Image source: Self-made

Here x₁, x₂, … xₙ are all scalars, and x is the vector representing their collection. The vector in the above image is called a column vector, and in most literature column vectors are used by default.

Note: In this whole series, vectors will be represented by bold uncapitalized letters like x
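
To make this concrete, here is a minimal sketch (using NumPy purely as an illustration; the feature values are made up) of a house represented as a column vector:

```python
import numpy as np

# A hypothetical house described by three features:
# [number of rooms, size in square metres, age in years]
x = np.array([[3], [120.0], [15]])  # shape (3, 1): a column vector

print(x.shape)  # (3, 1)
```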

3.2.2 Matrices

Using a similar thought process, how would we represent a collection of vectors, such as a whole dataset or a system of several equations? This requires something with two dimensions; by dimensions, we mean the dimensions of a matrix, just as the vector in the above example has dimension (n x 1).

Matrix representation. Image source: Self-made

Note: In this whole series, matrices will be represented by bold capitalized letters like A.

Note: Arrays can have higher dimensions, like (m x n x k); these are usually called tensors. But in this series, and for our purposes in ML, we will limit ourselves to two-dimensional matrices only.
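
Continuing the sketch from above, several such houses stacked row by row form a matrix (again NumPy, with made-up values):

```python
import numpy as np

# Each row is one house: [rooms, size, age] (hypothetical numbers)
A = np.array([[3, 120.0, 15],
              [2,  80.0,  5],
              [4, 200.0, 30]])

print(A.shape)  # (3, 3): 3 houses, 3 features each
```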

3.3 Vectors and matrix operations

3.3.1 Vector

a) Norms: In layman's terms, the norm of a vector tells us the vector's length. Earlier, we talked about a vector's size; size there means dimensionality, which is not the same as the vector's length. The norm, or length ||a||, of a vector a is given by:

Norm of a vector. Image source: Self-made
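
Assuming the figure shows the usual Euclidean norm (the square root of the sum of squared elements), a quick NumPy check looks like this:

```python
import numpy as np

a = np.array([3.0, 4.0])

# Norm as defined above: square root of the sum of squared elements
norm_manual = np.sqrt(np.sum(a ** 2))
norm_numpy = np.linalg.norm(a)      # NumPy's built-in Euclidean norm

print(norm_manual, norm_numpy)      # both print 5.0
```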

b) Vector addition and subtraction: Only vectors of equal length can be added or subtracted. The new vector is calculated by adding or subtracting the individual elements at each index. For two given vectors x and y:

Vector addition. Image source: Self-made
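
A tiny NumPy sketch of this element-wise addition and subtraction:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(x + y)  # [5 7 9]    element-wise addition
print(x - y)  # [-3 -3 -3] element-wise subtraction
```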

c) Dot product: In simple words, the dot product of two vectors is the sum of the products of corresponding elements, i.e., the first element is multiplied with the first, the second with the second, and so on. For two given vectors x and y, the dot product is given by:

Dot product of two vectors. Image source: Self-made
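
The same idea in a short NumPy sketch:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Sum of element-wise products, as described above
dot_manual = np.sum(x * y)
dot_numpy = np.dot(x, y)      # equivalently x @ y

print(dot_manual, dot_numpy)  # 32 32
```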

3.3.2 Matrix

a) Transpose: The transpose of a matrix is a new matrix with its rows and columns flipped. It is denoted by a superscript "T", i.e., Aᵀ.

Transpose of a matrix. Image source: Self-made
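
For example, in NumPy:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(A.T)                  # rows and columns flipped, shape (3, 2)
```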

b) Inversion: The inverse of a matrix is a matrix that, when multiplied with the original matrix, gives the identity matrix (the identity matrix, denoted by I, has ones on the diagonal and zeros everywhere else). The process of finding the inverse is called inversion.

Matrix inversion. Image source: Self-made
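
A minimal sketch with an arbitrarily chosen invertible matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)        # inverse of A
print(np.round(A @ A_inv, 6))   # multiplying back gives the identity matrix I
```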

c) Matrix-matrix multiplication: Matrix multiplication is one of the most important and most frequently used operations when solving or explaining machine learning models. To multiply two matrices, we take the dot product of the rows of the first with the columns of the second. The image below shows the process clearly:

Multiplication of two matrices. Image source: Self-made
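
A small sketch with made-up matrices, showing how each entry is a row-column dot product:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Entry (i, j) of A @ B is the dot product of row i of A and column j of B
print(A @ B)
# [[19 22]
#  [43 50]]
```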

d) Determinant: The determinant can only be calculated for a square matrix (a square matrix has the same number of rows and columns). The determinant of a square matrix, denoted by det(A) or |A|, is the volume of the box whose sides are given by the rows of A. It also tells us whether a square matrix can be inverted: if det(A) = 0, the matrix cannot be inverted. You can read more about determinants here.
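
As a quick sketch (NumPy, with illustrative values), the determinant signals invertibility:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(np.linalg.det(A))    # approximately 5.0: non-zero, so A is invertible

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # second row is twice the first
print(np.linalg.det(B))    # 0.0: B cannot be inverted
```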

e) Rank: The rank of a matrix is the number of linearly independent row vectors of the matrix. Independent means that the vector cannot be obtained by adding, subtracting, or scaling the remaining row vectors. This is all that will be required in machine learning, but anyone interested in going deeper can refer to this.
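
A quick NumPy sketch with a made-up matrix whose second row is just twice the first:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],   # 2 x row 1, so not independent
              [0, 1, 1]])

print(np.linalg.matrix_rank(A))  # 2: only two independent rows
```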

3.4 Eigenvectors and eigenvalues

Eigenvalues and eigenvectors are considered among the most important topics of linear algebra. Finding the eigenvalues and eigenvectors gives us a deeper insight into the properties of a matrix, and it makes various operations, like computing matrix powers, much easier.

Mathematically, for a matrix A, an eigenvalue is a scalar denoted by λ and the corresponding eigenvector, as its name suggests, is a vector v, related by the equation:

                               Av = λv

In machine learning, eigenvectors can be used to reduce the number of features, or dimensionality, which is one of the main steps of data preprocessing and can make model training faster and better. This is because eigenvectors can reveal the most important directions in the data. To get a deeper insight, refer to this.
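
A small illustrative sketch with NumPy, using a made-up matrix, to see the defining equation in action:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.]
print(eigenvectors)   # columns are the eigenvectors

# Check the defining equation A v = lambda v for the first pair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True
```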

4. Calculus

Have you ever wondered how the formulas for the areas of figures like squares and rectangles can be proved? Like me, I am sure many think these formulas are trivial, but they can in fact be proved with the help of calculus. Calculus is a vast field of mathematics used almost everywhere: calculating slopes, finding the area of any shape, physics, and machine and deep learning, which we will cover later. Anyone who wants to go in-depth into the mathematics behind models and how model training works must know at least the basics of calculus, which we will discuss here.

Note: For those wondering where we use calculus, or more precisely differentiation: we train our models through a gradual process called optimization, and this optimization is done with the help of differentiation.

4.1 Differentiation and Derivatives

In layman's terms, differentiation means finding the rate of change of a dependent variable y with respect to an independent variable x. Dependent means that y can be written in terms of x, for example y = 2x or any other form. As a familiar example, speed is the rate of change of distance with respect to time. It is denoted by the operator:

Differentiation of y wrt x. Image source: Self-made

It is also denoted by a small dash (prime) after the function, like f(x)'. A derivative is the result of differentiating a particular function. A few essential derivatives you will need are listed below:

Basic derivative. Image source: Self-made
Exponential derivative. Image source: Self-made
Product rule. Image source: Self-made
Quotient rule. Image source: Self-made

Note: If you cannot absorb this many formulas at once, skim them and bookmark this post for later use. Those who want a deeper understanding can refer to this.
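
If you prefer to check such rules programmatically, here is a small sketch using SymPy (the post itself does not use any code; this is purely illustrative) that reproduces the power, exponential, product, and quotient rules symbolically:

```python
import sympy as sp

x = sp.symbols('x')

print(sp.diff(x**3, x))              # 3*x**2                    (power rule)
print(sp.diff(sp.exp(x), x))         # exp(x)                    (exponential)
print(sp.diff(x**2 * sp.sin(x), x))  # x**2*cos(x) + 2*x*sin(x)  (product rule)
print(sp.diff(sp.sin(x) / x, x))     # cos(x)/x - sin(x)/x**2    (quotient rule)
```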

4.2 Partial derivatives and gradient

4.2.1 Partial derivatives

The derivative tells us what change will happen in the dependent variable y if we change the independent variable x by one unit. This works when y depends on a single independent variable x, but what if there are two or more independent variables and we want the change with respect to only one of them? For example, suppose we want to calculate the change in our knowledge with respect to the amount of time spent studying; this looks like a problem solved by a derivative, but other factors besides time also affect our knowledge, such as our concentration level, our source material, and many more. So here we calculate the partial derivative of knowledge with respect to time. It is denoted by the symbol:

Partial derivative of y wrt x. Image source: Self-made

Let us assume a dependent variable y depends on two independent variables x and z, with the relation y = 2x^2 + z, and we want to calculate the partial derivative of y w.r.t. (with respect to) x. An important point to note is that when calculating a partial derivative w.r.t. one variable, all other variables are treated as constants. This can be seen in the figure below:

Example of partial derivative. Image source: Self-made
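
The same example can be verified symbolically; a minimal sketch with SymPy (illustrative only):

```python
import sympy as sp

x, z = sp.symbols('x z')
y = 2 * x**2 + z

# Partial derivative w.r.t. x: z is treated as a constant
print(sp.diff(y, x))   # 4*x
# Partial derivative w.r.t. z: x is treated as a constant
print(sp.diff(y, z))   # 1
```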

4.2.2 Gradient

The gradient is simply the collection of partial derivatives of y w.r.t. all of its independent variables, represented as a vector. It is given by:

The gradient of y wrt all independent variables. Image source: Self-made

For the example y = 2x^2 + z from the partial derivative section, the gradient is:

Example of the gradient. Image source: Self-made
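
And the gradient of the same example, again as a short SymPy sketch:

```python
import sympy as sp

x, z = sp.symbols('x z')
y = 2 * x**2 + z

# The gradient collects all partial derivatives into one vector
gradient = [sp.diff(y, var) for var in (x, z)]
print(gradient)   # [4*x, 1]
```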

4.3 Chain rule

Now, consider a situation in which the variable y depends on the variable x, which in turn depends on the variable z. If we want the derivative of y w.r.t. z, we would first have to rewrite the equation for y, currently in terms of x, in terms of z, and for large equations this becomes a clumsy process. Here the chain rule comes to the rescue. According to the chain rule:

Chain rule. Image source: Self-made
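
A short SymPy sketch of the chain rule, assuming a made-up relation x = 3z with y = x^2:

```python
import sympy as sp

z = sp.symbols('z')
x = sp.symbols('x')
x_expr = 3 * z                 # hypothetical relation: x depends on z
y = x**2                       # y depends on x

# Chain rule: dy/dz = (dy/dx) * (dx/dz)
dy_dx = sp.diff(y, x)          # 2*x
dx_dz = sp.diff(x_expr, z)     # 3
dy_dz = dy_dx.subs(x, x_expr) * dx_dz
print(sp.simplify(dy_dz))      # 18*z

# Direct check: substitute first, then differentiate
print(sp.diff(y.subs(x, x_expr), z))  # 18*z
```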

Conclusion

In today's blog, we went through the two pillars of machine learning responsible for the training and evaluation parts of the machine learning pipeline. In the algebra section, we first went through the basics of vectors and matrices and why they are needed, then learned their basic operations, and finally discussed briefly the rank, eigenvalues, and eigenvectors of a matrix.

In the calculus section, we discussed derivatives in the easiest way possible and went through some of the most important and frequently needed derivatives. Then we learned how the derivative of a variable that depends on more than one independent variable can be calculated. After that, we discussed a way to represent all partial derivatives as a vector, i.e., the gradient. Finally, we discussed the all-important chain rule.

With this, we have completed the four pillars of machine learning, and in the next blog we will learn about supervised learning in depth. We will cover not only the theory and intuition behind a few basic algorithms like linear regression, lasso and ridge regression, and logistic regression, but also the maths behind them in the easiest way possible.

If you liked our post, please follow me, Sarthak Malik, and my colleague Harshit Yadav on Medium, and subscribe to our mailing list to get regular updates and stay with us on this journey.

Thank you,

Previous blog in the series: Four pillars of Machine Learning #1 — Statistics and Probability

Next blog in the series: Your guide to Supervised Machine Learning — Regression
