Role of Mathematics in Machine Learning

Ritwik Ghosh
Published in Analytics Vidhya
6 min read · Mar 13, 2020

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” -Josh Wills

In today’s world, ask any techie and they will tell you that the hottest jobs in the industry are all related to data and machine learning. No wonder this has caught a lot of interest among developers and companies in the industry. A wealth of resources and support has become available for machine learning and data science, and statistical tools such as the programming language R, along with libraries for various other languages, have been born out of these efforts.

Machine learning aspirants are many in number, yet there is great demand for Machine Learning Engineers in the market, as few end up pursuing the job title. Part of the reason is the intimidating mathematics behind the role. The popularisation of machine learning has given rise to many easy-to-use and widely supported libraries in Python and R; libraries such as scikit-learn, TensorFlow and OpenCV have indeed made everyone’s lives easier by providing a sort of shortcut to machine learning, seemingly bypassing the mathematics behind the algorithms. Yet the underlying principles of mathematics behind machine learning remain very much the same.

Let us discuss the major branches of mathematics that are used in machine learning.

There are four branches of mathematics that the concepts of machine learning heavily benefit from: linear algebra, calculus, statistics and probability. We shall see roughly why and when each of these branches is applied during the life cycle of a machine learning model.

Linear Algebra

Machine learning depends most heavily on linear algebra, which is used to solve systems of simultaneous linear equations. This is done using matrices and matrix operations. The data for a machine learning model is generally stored in the form of vectors and matrices, and the values they contain are treated as the coefficients of the linear equations.
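As a sketch of what solving simultaneous linear equations with matrices looks like in practice, here is a small example using NumPy (the equations and their values are made up purely for illustration):

```python
import numpy as np

# Two simultaneous linear equations:
#   2x + 3y = 8
#   1x - 1y = -1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])   # coefficient matrix
b = np.array([8.0, -1.0])     # right-hand side

solution = np.linalg.solve(A, b)  # array([x, y])
```

The same call scales to systems with thousands of unknowns, which is why the matrix form is the standard way to store and manipulate such data.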


Matrix operations are preferred because machine learning generally deals with large amounts of data; scalar operations such as multiplication and division, as well as operations between vectors, can be applied through matrix operations with great speed and ease.

Knowledge of linear algebra is important for deciding how the data shall be stored in matrices. For example, a colour picture may be stored in three matrices, each element holding the intensity of the red, green or blue value of the corresponding pixel. Operations on these pixels then become very easy, because the matrices let us apply linear algebra directly.
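A minimal sketch of the picture example above, assuming a tiny hypothetical 2×2 colour image with made-up intensities:

```python
import numpy as np

# One matrix per colour channel (red, green, blue) of a 2x2 image.
red   = np.array([[255,   0], [  0, 255]])
green = np.array([[  0, 255], [  0, 255]])
blue  = np.array([[  0,   0], [255, 255]])

# Stack the channels into a single array of shape (height, width, 3).
image = np.stack([red, green, blue], axis=-1)

# Pixel-wise operations become simple matrix operations,
# e.g. darkening the whole image by halving every intensity:
darker = image // 2
```

The halving touches every pixel of every channel in one vectorised operation, with no explicit loops over rows and columns.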

Calculus

Calculus is used to help machine learning algorithms improve the accuracy of the predictions they make. This is done by optimising the algorithms with the help of differential calculus: we can find the extrema of a function by following its gradient. Multivariate calculus is used when a function has multiple parameters that determine the model’s prediction. It also underpins neural network models, where differential calculus is used to compute the back-propagated error.

Gradient descent

Furthermore, integral calculus is used to calculate the loss in deep learning models, and also to compute the expectation of a variable in a probability distribution over continuous values.
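As a sketch of drawing an expectation from a continuous distribution, here is a numerical approximation of the integral E[X] = ∫ x·p(x) dx, using the uniform density on [0, 1] only because its answer, 0.5, is easy to verify by hand:

```python
import numpy as np

# E[X] = integral of x * p(x) dx, approximated with the trapezoidal rule.
x = np.linspace(0.0, 1.0, 10001)
pdf = np.ones_like(x)          # uniform density: p(x) = 1 on [0, 1]

f = x * pdf                    # the integrand x * p(x)
expectation = np.sum((f[:-1] + f[1:]) / 2.0 * np.diff(x))
```

Swapping in a different `pdf` array (for instance a Gaussian evaluated on `x`) gives the expectation for that distribution instead.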

For example, consider the classic gradient descent problem of finding the lowest position a ball will roll to in a bowl. This is solved with simple differential calculus.
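The ball-in-a-bowl example can be sketched with the one-dimensional bowl f(x) = x², whose derivative f′(x) = 2x tells us which way is downhill at every point:

```python
# Minimal gradient descent on f(x) = x**2 (the "ball in a bowl").
def gradient_descent(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        grad = 2 * x               # derivative of x**2 at the current point
        x -= learning_rate * grad  # step downhill, against the gradient
    return x

minimum = gradient_descent(start=5.0)  # converges towards x = 0
```

After 100 steps the ball has effectively settled at the bottom of the bowl, x = 0, which is exactly the point where the derivative vanishes.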

Probability

Probability is used to make decisions when an algorithm does not have a conclusive outcome but rather a probability distribution. An algorithm may output a range of values together with the probability that each value is expected or true. This is where probability comes in: a decision is made based on the probability of the expectation of a variable. No algorithm can give an output which is completely and blindly reliable, so probability is used to decide the outcome in the grey area.

For example, if we record the number of people affected by Parkinson’s disease and their ages in a sample, we obtain a probability distribution of the age of a person affected by Parkinson’s disease. If we are then required to select the most affected age, we can take the age range in which a person is most likely to be affected by the disease. This decision-making process over a continuous distribution requires probability.
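A rough sketch of that decision, using a synthetic sample of patient ages (the numbers are invented; a real study would use recorded data):

```python
import numpy as np

# Synthetic ages of affected patients, peaking near 70 for illustration.
rng = np.random.default_rng(0)
ages = rng.normal(loc=70, scale=8, size=1000)

# Histogram the sample into 5-year age ranges and pick the modal bin:
counts, edges = np.histogram(ages, bins=range(40, 101, 5))
peak = np.argmax(counts)
most_probable_range = (edges[peak], edges[peak + 1])
```

The modal bin is the age range where being affected is most probable, which is exactly the decision described above.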

Statistics

Statistics is used to draw conclusions from data. Various statistical methods may be applied to a data set to draw different conclusions and hence gain a better understanding of the data. Such conclusions may be the mean value, the extremes or the range of the data, or more complex ones such as checking for outliers, the degree of the function described by the data, the coefficient of correlation between various parameters, or the correlation between a parameter and the expected output of the algorithm. There are also hypothesis tests such as the chi-square test, z-test, t-test, ANOVA et cetera, which test the validity of a hypothesis we assume about the given data.
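As a sketch of one of these conclusions, here is the coefficient of correlation between a parameter and an expected output, on a tiny invented data set (the numbers are illustrative only):

```python
import numpy as np

# A made-up parameter and outcome that move together almost linearly.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score    = np.array([52.0, 60.0, 65.0, 71.0, 80.0])

# Pearson correlation coefficient: +1 is perfect positive correlation.
r = np.corrcoef(hours_studied, exam_score)[0, 1]
```

A coefficient close to +1 here tells us the parameter is strongly (positively) related to the outcome, which is the kind of understanding statistics provides before a model is ever trained.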

Let us now consider an easy-to-understand example that roughly includes all these branches of mathematics, so we may get an idea of their implementation.

A classic example would be that of a face recognition algorithm that relies on machine learning.

  • The pixels of the images in the example data set are stored in matrices in the form of vectors. This utilises linear algebra. If colour images are taken, then a number of matrices corresponding to the colour scheme is used, and the intensities of each pixel are stored in those vectors. This makes handling the data easy and facilitates vector operations on them, which here would be to compare an existing face with the one in the given picture.
  • Calculus is used here to handle the gradient of the error. The gradient of the error between the definition of a face (stored as a vector) and the given picture is computed. If the gradient exceeds the tolerance limit, the definition of the face is updated by updating the coefficients of the vector it is stored in.
  • Probability is used to decide whether a given picture contains a face, by calculating the probability that a face exists in the picture. No algorithm can give an output which is absolutely correct and 100% reliable, so probability is used to decide the outcome by comparing it against the tolerance factor of the algorithm.
  • Statistics is used throughout the various processes the algorithm goes through, such as calculating the correlation between the parameters of the images and the desired outcome. Statistics is also used to test the hypothesis that a face exists in the given image, and the tests reveal whether the null hypothesis or the alternative hypothesis is accepted.
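The comparison step in the list above can be sketched as a distance check between face vectors; the feature vectors and the tolerance below are entirely hypothetical, chosen only to show the shape of the computation:

```python
import numpy as np

# Hypothetical feature vectors: the stored "definition" of a face
# and the vector extracted from a new picture.
known_face = np.array([0.20, 0.80, 0.50, 0.10])
candidate  = np.array([0.25, 0.75, 0.50, 0.10])
tolerance  = 0.2   # made-up threshold for declaring a match

# Euclidean distance between the two vectors; small distance = similar faces.
distance = np.linalg.norm(known_face - candidate)
is_match = distance < tolerance
```

Real systems derive such vectors from learned embeddings rather than raw pixels, but the final decision is still a thresholded comparison like this one.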

Hence we can see that every bit of machine learning is heavily dependent on mathematics. This is why Machine Learning Engineers need to have a strong grasp of the above-mentioned branches of mathematics.
