Four pillars of Machine Learning #1 — Statistics and Probability

Sarthak Malik
Published in CodeX · 10 min read · Jan 8, 2022
Photo by Uriel SC on Unsplash

Hola, Machine learning lovers!

This blog is the second post of the “Complete Machine Learning and Deep Learning for Beginners” series, and it focuses on getting us ready for the journey ahead by laying the foundation of mathematics and statistics required to learn machine learning.

Just think: would it be fair if a fifth-grade child were forced to learn the theory of relativity and the mathematics behind it just because he finds black holes amazing? Hell no! Yet this is what many machine learning enthusiasts are doing these days. Jumping directly into training models and copy-pasting code may seem incredible in the short run, but it can be dangerous if you are looking for a career in ML. I suffered from the same problem, as have many other excited and enthusiastic people, so we decided to start with a strong base of mathematics and statistics before entering the fantastic world of ML. This post and its next part will cover the following preliminary topics required to start this tremendous machine learning journey.

Note: You may not understand these formulas now or be able to comprehend why they are important. But believe me, you will need them, and you can return to this post at any time during this series to have a look. Keep this post and its part 2 as your formula guide.

  1. Statistics
    1.1 Measure of central tendency.
    1.2 Measure of dispersion.
    1.3 Similarity measure.
  2. Probability
    2.1 Random variable and Probability Distributions.
    2.2 Joint probability.
    2.3 Conditional probability.
    2.4 Bayes’ theorem.
  3. Linear algebra
    3.1 Scalars.
    3.2 Vectors and matrices.
    3.3 Vectors and matrix operations.
    3.4 Eigenvectors and eigenvalues.
  4. Calculus
    4.1 Differentiation and Derivatives.
    4.2 Partial derivatives and gradient.
    4.3 Chain rule.

1. Statistics

If we look at any mind map of Statistics, we will see how vast this field is. But is it really necessary to learn all of these topics? Yes, for sure, but as beginners we will focus only on the essential topics needed for learning and explaining ML. As we go forward, we will encounter new topics, but mark my words: all of them can be easily understood with what we are going to discuss now.

Statistics is a field of mathematics that helps in giving meaning to our data. It tells us how the data is distributed and how diverse it is, what insights we can draw from it (i.e., whether or not the data supports a hypothesis), which machine learning models we can apply, and how we should proceed.

1.1 Measure of central tendency

A measure of central tendency can be thought of as a single value defining or representing the center of a data distribution.
Mean, median, and mode are the three main measures of central tendency. Many of you must have already heard of these terms, but let us briefly review what they are and how they are calculated.

Mean
The mean of a dataset is defined as the sum of all values divided by the number of values in the set. It is also referred to as the arithmetic average and is represented by the Greek letter μ (“mu”) for the population mean and x̄ for the sample mean.

Note: Here, population means the entire dataset about which we want to draw conclusions, and sample means a part of that population.

Mean: μ = (Σ xᵢ) / N for a population, and x̄ = (Σ xᵢ) / n for a sample.

Note: The problem with the mean is that it is affected by very large or very small values, which may distort the result.

Median
The median can be defined as the “middle element” of the whole dataset when arranged in ascending or descending order.

In the case of an odd number of elements, the middlemost element is the median of the whole set, while in the case of an even number of elements, the median is the average of the two central elements.

Mode
It is the most frequently occurring element of the dataset. There can be one mode or multiple modes. It can also be used when the data is not numerical.
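As a quick illustration, here is a minimal Python sketch of all three measures on a made-up sample (the data values are hypothetical, chosen so that the outlier’s effect on the mean is visible):

```python
# The three measures of central tendency, using only Python's standard library.
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10, 30]  # hypothetical sample; 30 is an outlier

print(mean(data))    # ~8.57 -- pulled upward by the outlier 30
print(median(data))  # 5 -- the middle element, unaffected by the outlier
print(mode(data))    # 3 -- the most frequent value
```

Notice how the single outlier drags the mean well above the median, exactly the distortion the note above warns about.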

1.2 Measure of dispersion

The measures of central tendency from the above section are insufficient to describe a dataset, as two datasets can have the same mean yet be entirely different. Thus, we need another factor to measure a dataset's variability. This can be measured by the range, interquartile range, standard deviation, and variance.

Range

The range of a set of data is the difference between the maximum and minimum values of the set.

Range = max(X) - min(X), where X is the set of data.

Quartiles and IQR
The median divides the whole dataset into two parts, while quartiles divide it into four parts; the division points are called Q1, Q2, and Q3.
Here,

  • Q1 is the element that divides the dataset such that 25% of the elements are smaller than Q1 and 75% are larger. It is also called the lower quartile.
  • Q2 is the same as the median, i.e., it divides the data 50-50.
  • Q3 divides the dataset such that 75% of the elements are smaller than Q3 and 25% are larger. It is also called the upper quartile.

The interquartile range (IQR) is another measure of data spread. It is the difference between the upper and lower quartiles (Q3 - Q1). It is very important because it removes the effect of outliers (abnormal values at the extremes of the data that can distort its spread).

Interquartile range = Q3 - Q1.

Note: While calculating Q1 and Q3, if the number of elements is even, then the average of two central elements is taken.

Note: Minimum, Q1, Median, Q3, and Maximum together make up the five-point summary.

The five-point summary. Image source: Self-made
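To make this concrete, here is a small sketch that computes the five-point summary and the IQR with NumPy (the sample values are made up; note that NumPy’s default percentile interpolation may differ slightly from the “average the two central elements” rule described above):

```python
# Five-point summary and interquartile range with NumPy.
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15])  # hypothetical sample

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)  # the five-point summary
print("IQR =", q3 - q1)                  # spread of the middle 50% of the data
```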

Standard Deviation and Variance

Standard deviation and variance are used to measure the distance of each data point from the mean. Standard deviation is the square root of the variance.

Population variance: σ² = Σ(xᵢ - μ)² / N, with standard deviation σ = √(σ²).
Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1), with standard deviation s = √(s²).

Note: There are two different formulas for the population and the sample variance and standard deviation.
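In NumPy, the difference between the two formulas is controlled by the ddof argument; here is a minimal sketch with a made-up sample:

```python
# Population vs. sample variance and standard deviation in NumPy:
# ddof=0 divides by N (population), ddof=1 divides by n - 1 (sample).
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

print(np.var(x, ddof=0), np.std(x, ddof=0))  # population: 4.0, 2.0
print(np.var(x, ddof=1), np.std(x, ddof=1))  # sample: ~4.571, ~2.138
```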

1.3 Similarity measure

In ML, we often need to measure how close or how similar two data points are. Surely you must have already heard of Euclidean distance; it is one way of measuring the distance or similarity between points. Here we will study the Minkowski distance, Euclidean distance, Manhattan distance, cosine similarity, and correlation.

The most commonly used measures of dissimilarity between two data points described by numeric values or attributes are the Minkowski, Euclidean, and Manhattan distances, given by the following formulas.

Euclidean distance: d(x, y) = √( Σ (xᵢ - yᵢ)² )
Manhattan distance: d(x, y) = Σ |xᵢ - yᵢ|
Minkowski distance: d(x, y) = ( Σ |xᵢ - yᵢ|ʰ )^(1/h)

Note: Minkowski distance is a generalization of the Euclidean and Manhattan distances (h = 2 and h = 1, respectively). Moreover, as we increase the value of h, the coordinate of the data point with the largest difference increasingly dominates the distance.
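A short sketch makes this concrete: the function below implements the Minkowski formula directly, and Manhattan (h = 1) and Euclidean (h = 2) fall out as special cases (the two points are arbitrary toy values):

```python
# Minkowski distance; h=1 gives Manhattan, h=2 gives Euclidean.
import numpy as np

def minkowski(x, y, h):
    return np.sum(np.abs(x - y) ** h) ** (1 / h)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski(x, y, 1))   # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 16 + 0) = 5.0
print(minkowski(x, y, 10))  # ~4.02: large h is dominated by the largest difference, 4
```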

These three distances can take any value from 0 to ∞. This makes it harder to judge how similar two data points are, so other well-known measures, which can be used for both numeric and binary data points, are cosine similarity and correlation. The important fact about them is that their values range from -1 to 1.

Cosine similarity: cos(x, y) = (x · y) / (‖x‖ ‖y‖)
Covariance: cov(x, y) = Σ (xᵢ - x̄)(yᵢ - ȳ) / n
Correlation: corr(x, y) = cov(x, y) / (σₓ σᵧ)

Note: The above formulas for covariance, correlation, and cosine similarity are essential and will come up again and again in this series.
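Here is a minimal sketch of both measures, written straight from the formulas above (the two vectors are toy values chosen to be perfectly linearly related):

```python
# Cosine similarity and Pearson correlation, implemented from their formulas.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # covariance
    return cov / (x.std() * y.std())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y = 2x: same direction, perfect linear relation

print(cosine_similarity(x, y))  # 1.0
print(correlation(x, y))        # 1.0
```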

2. Probability

Most people have already heard of probability, and the rest can guess from the word “probable.” Probability means the likelihood of the occurrence of an event, and the probability of any event ranges from 0 to 1. Since the main task in machine learning is predicting future outcomes, probability plays an essential role.

Mathematically, given an event E, the probability of its occurrence is denoted by P(E). It is calculated using the outcomes, i.e., “failure” or “success,” of repetitions of the event E, also called trials. The probability P(E) is given by:

P(E) = (number of favorable outcomes) / (total number of possible outcomes)
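As a sanity check of this counting formula, consider rolling an even number on a fair die: 3 favorable outcomes out of 6 gives P = 0.5, which a quick simulation confirms:

```python
# Estimating P(even roll of a fair die) by simulation;
# the counting formula gives 3/6 = 0.5.
import random

trials = 100_000
successes = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 0)
print(successes / trials)  # approximately 0.5
```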

2.1 Random variable and Probability distributions

In a fair trial, the outcome of an event can be anything, i.e., it is random, whether we are tossing a coin, rolling a die, or running any other experiment; such a variable is called a random variable.

A probability distribution is defined for a random variable, and it describes how the probabilities are distributed over the whole range of values of that variable. For a random variable x, the function f(x) denotes the probability of occurrence of each x; this is called its probability distribution.

In ML, the most crucial probability distribution to study is the Gaussian, or normal, distribution. The Gaussian distribution is said to be the main focus of statistics. Surprisingly, data from various fields can be expressed in the form of a Gaussian distribution, which is why it is called the “normal” distribution.

The normal distribution: f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))

The Gaussian distribution can be fully described using just two parameters: the mean and the standard deviation. As this is a very significant distribution, it is crucial to understand it clearly; you can use this link https://academo.org/demos/gaussian-distribution/ to play with the parameters and see what effect they have on the distribution.
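If you prefer code to an interactive demo, here is the density function written out directly from the formula above (the evaluation points are arbitrary):

```python
# The Gaussian probability density function from its formula:
# mu shifts the peak, sigma controls the spread.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))  # ~0.3989, the peak of the standard normal
print(gaussian_pdf(1.0, mu=0.0, sigma=1.0))  # ~0.2420, one standard deviation away
```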

Now, it is clear that the Gaussian distribution is a balanced (symmetric) distribution. Nevertheless, a random variable may be unbalanced [spoiler!], which can lead to bad results in ML. This unbalanced nature is called skewness, and it comes in two types: negative and positive. The following image shows the difference between these distributions.

Positively skewed (right), symmetrical distribution (middle), negatively skewed (left). Image source: Self-made

The skewness of any dataset can be calculated using Pearson's first coefficient of skewness.

Pearson's first coefficient of skewness: skewness = (mean - mode) / standard deviation
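A small sketch of the coefficient on a made-up right-skewed sample (the data values are hypothetical):

```python
# Pearson's first coefficient of skewness: (mean - mode) / standard deviation.
# Positive for right-skewed data, negative for left-skewed data.
import numpy as np
from statistics import mode

data = [1, 2, 2, 2, 3, 4, 8, 12]  # hypothetical right-skewed sample

skewness = (np.mean(data) - mode(data)) / np.std(data)
print(skewness)  # ~0.63 > 0, indicating positive (right) skew
```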

2.2 Joint probability

Consider two independent events A and B; by independent, we mean that the occurrence of one event doesn't affect the other. The probability of both events occurring together is P(A and B), or P(A ∩ B), where P(A ∩ B) = P(A) · P(B). This is called the joint probability of events A and B.
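For example (a toy calculation): if A is a coin landing heads, P(A) = 0.5, and B is a die showing a six, P(B) = 1/6, then since the two events are independent, P(A ∩ B) = 0.5 · (1/6) = 1/12 ≈ 0.083.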

2.3 Conditional probability

In layman's terms, if we consider two dependent events A and B, the conditional probability P(A|B) means the likelihood of event A occurring given that event B has already occurred. It is given by the formulas:

P(A|B) = P(A ∩ B) / P(B)    (1)

P(B|A) = P(A ∩ B) / P(A)    (2)
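A quick worked example of formula (1) with a fair die: let A be "the roll is a 2" and B be "the roll is even." Then P(A ∩ B) = 1/6 and P(B) = 1/2, so P(A|B) = (1/6) / (1/2) = 1/3: given that the roll is even, it is a 2 with probability 1/3.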

2.4 Bayes' Theorem

Bayes' theorem is one of the most essential theorems for ML; many algorithms are entirely based on it. Bayes' theorem defines the relation between the two conditional probabilities (1) and (2) given above. To derive it, we solve (1) and (2) for P(A ∩ B). This gives us

P(A ∩ B) = P(A|B) · P(B)    (3)

P(A ∩ B) = P(B|A) · P(A)    (4)

From (3) and (4) we get

P(B|A) · P(A) = P(A|B) · P(B)

Rearranging the above equation, we get the final form of Bayes' theorem.

Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)

In the above equation:

P(A|B) is called the posterior probability of A, i.e., the probability of A occurring when B has already occurred.

P(A) and P(B) are called the prior probabilities; they are called so because we may know this information beforehand.

And P(B|A) is called the likelihood, i.e., the probability of B occurring if A is true.

Now, let us say that we have to predict whether you will get an assignment on Monday, i.e., mathematically speaking, P(A = assignment | B = Monday). To calculate this with Bayes' theorem, we need the prior probability of getting an assignment on any day of the week, P(A = assignment); the prior probability that today is Monday, P(B = Monday) = 1/7; and the likelihood that if you get an assignment, then it is Monday, P(B = Monday | A = assignment), which can be calculated from previous data. Bayes' theorem is also widely used in medicine, for example in testing whether a person is positive for a particular disease, P(A = positive | B = disease).
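Here is a minimal sketch of that calculation with made-up numbers; the prior P(A = assignment) and the likelihood P(B = Monday | A = assignment) below are hypothetical, chosen only to show the mechanics:

```python
# Bayes' theorem on the assignment example: P(A|B) = P(B|A) * P(A) / P(B).
p_assignment = 0.3               # hypothetical prior P(A): assignment on a random day
p_monday = 1 / 7                 # prior P(B): today is Monday
p_monday_given_assignment = 0.4  # hypothetical likelihood P(B|A), from past data

p_assignment_given_monday = p_monday_given_assignment * p_assignment / p_monday
print(p_assignment_given_monday)  # 0.84: with these numbers, Mondays are bad news
```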

Conclusion

In this blog, we discussed the importance of the basics for becoming a pro in machine learning and data science. We then went through some important concepts with which we can summarize our data using statistics; we read about the five-point summary and the measures of similarity, without which machine learning models would be impossible. Then we jumped into probability, which forms the basis of prediction in ML; we went through some basic terms and, at last, studied Bayes' theorem. In the next part of this series, we will cover the next two basic concepts of ML: linear algebra and calculus. If you liked this series, do follow me and my colleague Harshit Yadav.

Thank you,

Previous blog in the series: Getting Familiar to The World of Machine Learning

Next blog in the series: Four pillars of Machine Learning #2 — Linear algebra and calculus

