Machine Learning Summary (Fundamental)

San Francisco Exploratorium

“One learning algorithm” hypothesizes that human intelligence may be due to one learning algorithm. Unfortunately, we are not there yet. Machine Learning (ML) starts from many disciplines and there are a lot of concepts to learn. Sometimes these concepts are explained with too much abstraction or on the contrary, too little substance. In this series, we summarize the fundamental like the information theory, probability, and distribution. We will look into some fundamental algorithms including Bayes inference, Expectation-maximization Algorithm, Naïve Bayes Theorem, Markov Chain, Hidden Markov Model, Matrix factorization, PCA, Moore-Penrose Pseudoinverse, Statistical Significance, etc... However, to shorten the length of the article, we assume that you have basic exposure to ML. If you have problems following the material, I suggest the section at the end for some ML learning resources. The articles are long. Feel free to skip some sections based on your level of knowledge and interests, in particular in the beginning.

Information theory

Let’s introduce some concepts in information theory. The Shannon information content is the amount of information gain when an event x occurs. Mathematically, it is defined as

For example, if a coin is biased and shows heads all the time, the events are predictable and we know nothing new. Therefore, the information content is zero. If the coin is unbiased, the probability distribution is uniform and what exactly happened is most unpredictable. The information content is the highest. Take a second to digest it, it may be at odds with your concept of information. In particular, uniform distribution leads to the highest content.

In computer science, we view the information content as the number of bits to encode the information most efficiently. Utilizing the event frequency, the optimal encoding scheme will be0b0 for head and 0b1for tail in a fair coin (or vice versa). For our discussion here, our log is in base 2. Therefore, the information content for head or tail in flipping a fair coin is -log₂(½) = 1. i.e. we gain 1-bit of information when we flip the coin once.

X is called a random variable if it holds a value generated from a random process (a stochastic process), e.g. the face value in rolling dice or the total number of heads after flipping a coin 20 times. We consider the value of X is generated from a distribution p (for example, from a Gaussian distribution).

Entropy H measures the expected information content of a random variable according to its data distribution. To calculate the value, we sum up all information content with a weight equals to its occurrence probability.

For example, in flipping a fair coin, P(X=head)=½ and P(X=tail)=½. Its entropy equals

The average content is one bit. We use an average of one bit to encode an event.


Cross-entropy H(P, Q) measures the expected number of bits to encode X with distribution P using an encoding scheme targeted for distribution Q.

In ML, we want our predictions to match the ground truth which P is the ground truth and Q is the model prediction. For example, our training objective is to train a model to minimize the cross-entropy.


KL-divergence measures the difference between two distributions P and Q — one for the ground truth and the other for the predictions made by our model.

Cross-entropy, entropy and KL Divergence are related by:

It measures the inefficiency of representing P with encoding scheme Q. However, entropy is not a function of the model parameters. Therefore, optimizing KL divergence w.r.t. model parameters is the same as optimizing cross-entropy.

We can optimize the cross-entropy or the KL-divergence in creating a model.

It takes extra-bits to encode information with a sub-optimal scheme. Therefore, KL-divergence is always greater or equal to zero. There are a few properties of the KL-divergence that need to be aware of.

KL(p, q) ≥ 0
KL(p, p)=0, and
KL(p, q) ≠KL(q, p) (non-symmetry)

Information Gain

Mutual information (or Information gain) I(X; Y) is the information obtained on the random variable X when Y is observed. The mathematical definition is non-intuitive.

Modified from source

But it will be easier with entropy

Intuitively, how much information do we gain by knowing X? For example, if we know the label (Y) of an object, we gain a lot of information about its raw image (X). We should not mistake its picture with other objects. Therefore the information gain I(X;Y) is high. This concept can be further elaborated with the relationship between the entropy and the mutual information described below.

If random variable X and Y are unrelated, their intersection is empty and the mutual information is zero. If random variables X and Y is the same, H(X)=H(Y)=I(X;Y), knowing Y is the same as knowing X. In a decision tree, we choose a condition to maximize I — a condition that gains the maximum information by separating data according to the branching condition. In another example, InfoGAN maximizes the mutual information between the generated image G and its intended label c. This encourages the GAN model to generate images related to its intended label.

Conditional Entropy

The conditional entropy H(Y|X) is the weighted entropy of Y given X=x. (E.g. H(car price | color = red).


Probability mass function is the probability distribution for discrete random variables. For example,

Probability density function (PDF) is the probability distribution for continuous random variables with lower case notation p. In principle, p(X=xᵢ)=0. The chance of having a weight exactly equals 10.0000000000 lb should be zero. Instead, we calculate the probability between two values by integrating PDF over a range (say between 9 and 10)

Cumulative distribution function (CDF)

CDF calculates p(X≤x).

CDF on the right (The CDF image source)

Conditional probability


Two random variables A and B are independent if

Marginal probability

The marginal probability P(X) is computed by summing (or integrating) the joint probability over other variables.

Sometimes, data is observed under this joint distribution or conditional distribution. In some other case, we are making a marginal inference by integrating over model parameters θ.

Chain rule

Bayes’ theorem

The equation looks simple but it is one of the most important equations in ML. So it is important to check out here to get familiar with the terminology. Researchers will explicitly use these terms in describing their concepts and make you miserable if you are not used to them.

Bayes’ Theorem can be formularized in different forms. We will use the following forms frequently.

Let’s say there is a genetic disease that only 0.1% of the population gets it. We develop a test with 99% accuracy for the positive and negative result. So, if you test positive, should you be overly worried. Intuition will say yes but in reality, the Bayes’ Theorem proves that there is only a 9% chance that you get the disease (proof). The intuition is not hard. Since the disease is so rare, most of the positive results from the lab come from false positives of people without the disease — a small portion of a large pie is still larger than a big portion of a very small pie.

Bayes inference v.s. Point estimate

In linear regression, we predict a definite value (a point estimate) based on the input features. For example, the price of the house can be calculated from the square footage and the number of rooms (price = a × square footage + b × # of rooms + c).

In Bayes’ Theorem, both the prior and likelihood is a probability density function (PDF). The calculated posterior is also a PDF. With Bayes’ Theorem, we predict a distribution of values and our certainty on these values. As shown later, this gives us flexibility in combining a prior belief or opens the door for further mathematical manipulation using probability and distribution theory. For engineers, our calculations are very deterministic. So please take some extra time to digest this probability concept. We are working on a curve rather than a single point estimation.

Probabilistic models & non-probabilistic models

In general, models can be categorized as probabilistic models and non-probabilistic models. Probabilistic models use probability distribution to model a problem and to make predictions. It is about reasoning about uncertainty. These models are usually built on MLE (Maximum likelihood estimation) or MAP (Maximum a posteriori estimation). For example, in linear regression, we optimize our model parameters w to maximize the likelihood of the observed data. In Bayes’ Theorem based solutions, we are interested in more than a point estimate. Instead, we are interested in estimating p(x), p(y|x) or p(model θ| X). We answer questions like what are different hourly rates of handymen and the corresponding probability. We model the uncertainty in our data, our model and/or our answers. Of course, this may not be easy. If the analytical solution is intractable, we approximate the solution in reducing the problem complexity. Non-probabilistic models do not use distribution to model them. Some of the examples include clustering, SVM, decision trees, etc …

Advantages and disadvantage of Bayes’ Theorem

In some ML problems, finding P(y|x₁, x₂, x₃,  ) can be hard. With Bayes’ Theorem, we compute P(x₁, x₂, x₃, …|y) instead. In some problems, this may be easier because the collected observations may answer them better. As shown later in Naive Bayes’ Theorem, this probability can be further broken down into independent components that can be solved easily. Also, in early experiment stage, the evidence is weak since we have not collected enough samples yet. But if we have a solid prior belief, we can still make compelling inferencing using Bayes’ Theorem. Don’t worry about the details now. More specific examples will come soon.

However, the equation for Bayes’ Theorem is deceptively simple.

The calculation in our disease example is very simple because the number of states is small. However, multiply the prior and the likelihood curve in high-dimensional space or continuous space is often intractable. The integral for the marginal probability in the denominator is hard to compute also.

In practice, we can solve them by sampling or approximation. For example, we can model the prior and the likelihood with Gaussian Distributions. The posterior distribution will be a Gaussian and easy to compute analytically.

In ML, we use Bayes inference to find the model w using the posterior distribution.

The marginal probability in the denominator sums over w which is no longer a function of w. Hence, it is a constant w.r.t. w. Therefore, we can ignore it. We can find the same optimal solution with or without the marginal probability.

Naive Bayes’ theorem

We can apply Naive Bayes’ theorem to solve classification problems. As explained before, Naive Bayes’ theorem helps us to solve P(y|x) when P(x|y) is easier to calculate. In particular, when x is composed of independent events, P(x₁, x₂, …|y) can be further simplified as P(x₁|y)P(x₂|y)… Therefore, P(y|x) can be computed as

Originally, the exponential combination of x₁, x₂, …, xn makes it too hard to collect data to estimate P(y | x₁, x₂, …). After applying the Bayes’ Theorem, we break down the likelihood into independent events. In ML, exploiting independency P(A, B) = P(A) P(B) or conditional independence P(A, B|C) = P(A|C) P(B|C) are important to avoid exponential combinations and to make a problem solvable.

Let’s use Naive Bayes’ theorem to detect spam emails. People often mark their spam emails. We know what words that spam emails may use frequently, like “getting rich”. We can use the P(“rich”|spam) and P(“rich”|not spam), etc… to determine whether we get a spam email.

By decomposing P(x₁, x₂, …| y) into P(x₁|y) , P(x₂|y) …, we can compare the corresponding probabilities above to determine whether it is a spam. (for more details on this example)

We can also use the frequency of words to detect spam. For example, the word “money” will likely occur more frequently in a spam email. Here, we use the Poisson distribution to model P(xⱼ | spam). Because spam and non-spam email have different word frequency, each classification will use a different λ for the corresponding Poisson distribution i.e. λ will be different for P(xⱼ | spam) and P(xⱼ | not spam) below.

Plugging P(x|spam) and P(x|not spam) into the Naive Bayes’ theorem, we solve the problem.

Bias & variance

In supervised learning, we supply labels for the training data to build a model for f(x).

As shown in the proof here, the mean square error (MSE) is composed of errors from the bias, variance and data noise. Since noise is unrelated to the model and irreducible when training a model, we will ignore it from our discussion.

Variance is the error that is sensitive to what data points we sampled in the model training. A complex model is vulnerable to overfitting if the sampled training dataset is too small. Once it happens, the model and the predictions may change significantly if different training data is sampled.

With N 1-D training data input, we may fit a model perfectly with an Nth-order polynomial. The red rectangle region below is overfitted and results in bad predictions. If the last 2 data points are not included in the training dataset or more data points are sampled in the red rectangle area, the prediction will be different. In short, the training has a high variance if the model is complex with not enough data to fit it.

On the other hand, bias happens when the model is not complex enough to make accurate predictions. If the ground truth for x is around the center circle below, bias shifts the answer in one particular direction while variance spread the predictions around the ground truth.

Below is a demonstration of the model complexity. As the complexity improves, it fits the training data better until it is overfitted. At that point, the training error continues to decrease. But when we validate the model with data points that are not used for training, the validation error is large and the it goes up as the number of iterations increases.

The following figure summarizes the relationship between model complexity, bias, and variance. The solution in reducing the variance is to increase the training dataset size, reduce the model complexity or add regularization.



In many ML problems, we assume data points are drawn from the same data distribution and the observed value does not impact the observed values of others.

This is a simple concept but the success of many ML and DL algorithms highly depends on it. Model training is more stable if data points and training batches are close to i.i.d.

When we roll a dice 100 times, each data point is originated from the same distribution. When throwing a dice, the result is independent of previous results and therefore, throwing dice is i.i.d. However, if we draw from a deck of cards without replacement, the sampling distribution will change and therefore it is not i.i.d. Such an assumption can simplify our math in many situations. For example, the joint distribution in the maximum likelihood estimation (MLE) can be simplified as independent events.

This is critical. By breaking down a model into manageable independent subcomponents, we reduce the complexity significantly.

Covariate shift

However, if we have a model trained with fair dice, the prediction will be wrong if our inference input has a different distribution, say from a biased dice. This is the covariate shift.

In some ML or DL problems, states may be highly correlated. This correlation is higher for time sequence model. If we use gradient descent in optimizing those models, the i.i.d. assumption is wrong and the corresponding training can be very unstable.

Let’s take a break for the fundamental concepts and interest you with some complex ML algorithms first before coming back to data distribution.

Expectation-maximization Algorithm (EM)


Many ML algorithms are originated from the EM algorithm. But it is one of the underrated algorithms in studying ML. To make our discussion concrete, let’s start with a clustering example. We want to find an optimal way to group 100 data points into two clusters. In this example, we use the Gaussian distribution to model data distribution. The chance of a data point xᵢ belonging to a Gaussian cluster (μ, σ²) is:

Our objective is to maximize the likelihood of our observation (x₁, x₂, …) based on the Gaussian clustering models and cluster assignments.

In inference, we want to predict the cluster assignment of a new data point. In this prediction, we need θ₁ only. So how do we find the optimal θ₁ from the training data?

We start with a random guess for the clustering model θ₁ (μa, σa², μb, σb²). We infer the probability of the cluster assignments P(θ₂) for each data point accordingly. Based on

, we conclude that for the data point xᵢ=3, it has a 0.2 chance to be in cluster A and 0.8 chance to be in cluster B. Then based on P(θ₂), we optimize the likelihood of the observation x w.r.t the clustering model θ₁ (arg max p(x |θ₁ )). We recompute the mean and the variance of each cluster by using the assignment probabilities. However, θ₂ depends on θ₁ and vice versa. We are working with multiple moving parts in a circular loop.

Because of the dependency, we cannot optimize θ₁ and θ₂ simultaneously by taking both derivatives at the same time. Instead, we fix θ₁ and derive θ₂. Then we optimize θ₁ with θ₂ fixed. We repeat the iterations until θ₁ converge. In each iteration, our objective will improve. Assuming a limited precision, we are dealing with a finite space for θ₁ and θ₂. Therefore, there is a finite step for θ₁ and θ₂ to improve and our iteration should reach a local optimal.

The EM-algorithm composes of E-step (likelihood estimation) and M-step (maximizing the likelihood). In the E-step, we fix the Gaussian models θ₁ (μa, σa², μb, σb²) of the clusters and compute the probabilities P(θ₂) (the chance xᵢ belongs to cluster A or cluster B).

Once, we compute P(θ₂), we can reconstruct the marginal likelihood P(x|θ₁) by summing P(x, θ₂|θ₁) over θ₂. Now, we have a function of likelihood w.r.t. θ₁ and we can optimize it to find θ₁.

Below is the E-step and the M-step in our example. Instead of the usual differentiation of the likelihood function, the optimization step can be done pretty simple. For example, the mean of cluster A can be computed by a weighted average of all data points with the weight equals to P(aᵢ) for the data point i. This algorithm is used in the Gaussian Mixture Model GMM. Don’t worry too much on the details as we will discuss it in here later.

Mathematical model

(Credit: the equations in this section are originated from here.)

Our GMM example is a special case for the EM algorithm. Let’s develop the math and generalize the concept a little bit more. Given xᵢ is i.i.d., data is drawn from the same distribution Ɲ and independent of each other.

Therefore, maximizing the likelihood of the observation is equivalent to maximizing the sum of log-likelihood of each observation.

Expectation-maximization (EM) algorithm is a general class of algorithm that composed of two sets of parameters θ₁, and θ₂. θ₂ are some un-observed variables, hidden latent factors or missing data. We don’t really care about θ₂ for inference usually. But if we try to solve the problem, we may find it much easier to break it into two steps and introduce θ₂ as a latent variable needed in the intermediate step. Since we don’t know both θ₁, and θ₂ initially, we need to fix one variable at a time and deriving or optimize the other variable in an alternating step. First, we derive θ₂ from the observation and θ₁. Then we optimize θ₁ with θ₂ fixed. We continue the iterations until the solution converges.

In the M-step, we maximize the likelihood of the observations w.r.t. θ₁. The log likelihood ln p(x|θ₁) can be decomposed as the following with the second term being the KL-divergence (proof).

where q(θ₂) can be any distribution.

Now, let’s make the R.H.S. as simple as possible so we can optimize it easily. First, let’s pick a choice for q(θ₂). A natural one will be using p(θ₂|x, θ₁) — find the probability of θ₂ given the observation and our model θ₁. By setting q(θ₂) = p(θ₂|x, θ₁), we get a nice bonus. The KL-divergence term becomes zero. Now our log-likelihood can be simplified to

Our optimal solution is the same as optimizing

Let’s recap. In the E-step (expectation estimation step), we compute q(θ₂) and re-establish the equation for the log-likelihood ln p(x |θ₁). Optimizing the log-likelihood will be the same as optimizing the expectation of the log of the joint distribution p(x, θ₂ |θ₁), i.e. the marginal log likelihood probability ln p(x |θ₁).

In the M-step (maximization step), we find the optimal θ₁ using the R.H.S above as the objective function.

Here is the EM-algorithm:

For the solution to work, we assume we know how to model p(θ₂ | x, θ₁) and p(x, θ₂ | θ₁). The success of the EM algorithm subjects to how simple are they and how easy to optimize the later one. In our example, we use a Gaussian distribution.

As mentioned before, each iteration improves the likelihood (proof). With finite state of θ₁, our solution will converge. However, it reaches a local optimal only unless the likelihood is a convex function. In practice, even a local optimal can show us some good results.

Narrative of the EM algorithm

In some ML problems, it is intractable to solve the problem with our perceived model parameter θ₁. Instead, we can solve the problems in alternating steps by adding latent parameters θ₂. By breaking down the problem, the equations involve θ₁ θ₂ θ₁. It can be solved much simpler. Our objective is solving either the L.H.S. or the R.H.S. below.

If we can model p(x | θ₁, θ₂) with a simple model that can be optimized easily, we don’t need the EM algorithm. Just use MLE. But as shown below, it is not the case for GMM.

For the GMM problem, the joint distribution p(x, θ₂ | θ₁) in the EM algorithm is modeled as a Gaussian. Its log is simple. q(θ₂) is set to p(θ₂ | x, θ₁) with θ₁ fixed. It is modeled as a Gaussian also. Optimizing this formularization is easy. On the contrary, the equations for p(x | θ₁, θ₂) is very complex. Therefore, we choose the EM algorithm to solve GMM.

So if the problem can be modeled easier with its subcomponents (the joint distribution) and we know how to model p(θ₂ | x, θ₁) easily, EM algorithm will have an edge over MLE (otherwise, no).

Another example in demonstrating this idea with the PCA can be found here.

In MLE or EM, θ₁ and θ₂ may depend on each other. Therefore, we optimize θ₁ or θ₂ one at a time with the other fixed. Iterating between alternating steps, the solution will converge.

Minorize-Maximization MM algorithm

The EM algorithm can be treated as a special case for the MM algorithm. With Jensen’s inequality, we can establish a lower bound for the log-likelihood as below.

where Jensen’s inequality for the log function is demonstrated below with k = 2 and q(θ₂_k) are the αᵢ.

Without too much elaboration, let’s understand the MM algorithm graphically. For a random guess on θ, we establish a lower bound function 𝓛 at θ. We find the optimal point for 𝓛 and use it as the next guess. We continue finding a lower bound function and optimize it as a way to optimize a function f. The MM algorithm focuses on the idea that f may be very hard to optimize but we can find lower bound functions that are much easier.

Modified from source

The EM algorithm becomes

Missing data

Previously, we introduce latent variable θ₂ to complement θ₁ to solve our optimization problem. The EM algorithm can be used to estimate missing data in the training dataset also. For example, the training data may be collected from user surveys. But users may answer a subset of questions only and there will be different missing data in each survey. In this situation, θ₂ will be the missing data that we want to model and θ₁ is the answers provided in the surveys. Here is how we formulate the log-likelihood using our EM equations.

We will not elaborate on the solution further but the steps to find the solution can be found here.


Here is the recap of some terminology.

Markov process/Markov chains

A first-order Markov process is a stochastic process which the future state solely depends on the current state only. When we discuss the first-order Markov process, we often reference it as the Markov process. If it is in a discrete space, it is called the Markov chain.

The assumption of the Markov process may not be true. But even it is not true, we can model extra states to make it closer to the Markov process sometimes. In general, the Markov process is a good approximation to solve complex problems in ML or reinforcement learning.

The probability of the transition from one state to another can be packed into a transition matrix like the one below:

This transition matrix is also called the Markov matrix. The element ij is the probability of transiting from state j to i. Note, some literature may use a transposed notation where each element is the probability of transiting from state i to j instead.

The columns of a Markov matrix add up to one — the probability of reaching a state from any possible state is one. Once, it is defined as a matrix, we can use linear algebra and eigenvector to determine its stable state if existed. As we keep continue the transition, what is the probability of being in each state?

Let’s have a quick review of eigenvectors. Eigenvector vᵢ and eigenvalue λ of the matrix A fulfill the following relation. Also matrix A can have many eigenvectors.

Our state at time k+1 is related to the previous step by the Markov matrix. The stable state is when t approaches ∞.

If uᵢ is an eigenvector of A,

where λᵢ is the corresponding eigenvalue for uᵢ.

Consider a vector v₁ in ℝ. We can decompose it using the eigenvectors of A. The state of v at time step k+1 will be

v will have a stable state if v converges in time. uᵢ are chosen to be unit vectors. In order for v to converge, eigenvalues λᵢ must be smaller or equal to 1. Otherwise, ‖v‖ will continue to grow.

A Markov matrix always has an eigenvalue 1. All other eigenvalues have their absolute value smaller or equal to 1. Let’s say, we have an eigenvector u₁ (say 0.2, 0.5, 0.3) with eigenvalue 1. u₁ will be the stable state, i.e. we have 0.2, 0.5 and 0.3 chance to be in state 1, 2 and 3 respectively as the time approaches infinity. The solution is independent of the initial state. We end up with the same target distribution regardless of where we start. (More details can be found here.) In theory, we can have more than one eigenvectors with eigenvalues equal to one. However, in practice, real problems usually have only one. In fact, if all elements in the matrix are greater than zero, there is exactly one eigenvector with eigenvalue equals one.

Random walk

Many Markov processes can be solved with random walks. Let’s say we drop off 100 shoppers randomly around the downtown area in San Franciso. We provide a transition matrix to show the probability of where the shoppers should head next. Eventually, we can spot where most interesting shops are located.

Instead of providing a global plan, the transition matrix may compose of local information only — the probability of moving ahead, turning right or turning left based on the stores that we can see from our current location. Therefore, the transition matrix is mostly sparse. As we allow shoppers to explore, we can locate the shopping area fast. This strategy allows us to use local information to understand the general structure of the data. In many ML problems, it is much easier to collect local and partially complete information. We don’t need to understand the structure of the data. We don’t need to understand how the city plans its shopping districts. Just look around and see what may be more interesting. So this random walk concept is very popular in ranking or making product recommendations. It fits well on how information is collected.

To find a stable state, we can build up a transition matrix and solve it analytically using eigenvector decomposition. Alternatively, we can use sampling to find the next state. As we continue the iterations, our random walk will converge to the stable state that we are interested in. For very large scale problems, this may be easier to execute and to compute.

Hidden Markov Model (HMM)

In many ML problems, we assume the sampled data is i.i.d. This simplifies the maximum likelihood estimation (MLE) and makes the math much simpler to solve. But for time sequence model, states are not completely independent. If I am happy now, I will likely stay happy tomorrow.

The state of a system may not be observable or fully observable. But we can get insights about this internal state through an observable. For example, if I am happy, there is a 40% chance that I will be found at a party. But let’s make things more real. There is a 10% chance that I will be found at a party when I am sad too. With HMM, we determine the internal state (happy or sad) by making observations — what I am doing.

HMM models a process with a Markov process which includes the initial state distribution π and the transition probabilities A from one state (xt) to another. HMM also contains the likelihood B of the observation (yt) given a hidden state. Matrix B is called the emission probabilities. It demonstrates the probability of our observation given a specific internal state. The complexity comes from the problem that different states can map to the same observation.

Two major assumptions are made in HMM. The next state and the current observation solely depend on the current state only.

Given all the observable and the initial state distribution, we can compute a pretty complex equation for the probability for the internal state xt P(xt| y₁, y₂, y, … , yt). For simplicity, we will not include π in our equation. All equations assume π is a given condition, like P(y) → P(y|π).

The equation above uses the transition probability and the emission probability to compute the probability of the internal state based on all observations. Depending on the situation, three different types of questions are often asked in HMM problems.

  • Likelihood: How likely is the observation based on the current model or the probability of being at a state at a specific time step.
  • Decoding: Find the internal state based on the current model and observation.
  • Learning. Learn the HMM model.

The remaining section details the solution. Read through it according to your level of interest.

Likelihood (likelihood of the observation)

Let’s define the term α as the probability of being in a state (e.g. happy) at time t for all the observations up to t.

Here are the steps in calculating the likelihood of the observations given the model λ. It will be easier to understand with the example followed.

Here is an example which we start with the initial state distribution on the left. I will let you dig out the information yourself.

Decoding (find the internal states — Viterbi algorithm)

Finding the internal states that maximize the likelihood of observations is similar to the likelihood method above, except α is computed by finding the value of the maximum path.

In this algorithm, we also record the maximum path (the red arrow) to track the state at time t. e.g. we are transited from a happy state H at t=1 to the happy state H at t=2.


The graph below is the famous dishonest casino example. There are one fair dice and one biased die that a dealer can use. We provide a transition matrix for using the fair dice or a biased dice next. To avoid detection, the switching between the die does not happen regularly. We also provide the value distribution for each dice. For example, for the fair dice, the dice value will be uniformly distributed. We use the algorithms described before to make some inference by observing the value of the rolling die given we don’t know which one is used.


Our internal state is whether we use the fair or the biased dice. The line curve is the likelihood to be at a particular state at time t. It fluctuates a lot. It gives a narrower view of what may happen at time t. The shaded area is the likely transition between the fair and the biased dice using the Viterbi algorithm. It does not fluctuate a lot and reflects the transition matrix better. It gives a global view on when states on transited. So both algorithms serve different purposes.

Learning (Baum–Welch algorithm — build the model)

HMM learning is about finding the model λ given the observation. We will quote the equations from Wikipedia and adopt its notation below. Spend a few seconds in the definition, in particular, γ & ξ.

γ & ξ are the latent factor θ₂ that we want to discover in the EM algorithm. γᵢ (t) is the probability of being at state i at time t given all the observation. ξᵢⱼ (t) is the probability of transiting from state i to state j at time t given all the observation.

αᵢ (t) is the probability of seeing the observations so far (up to time t) given the state is i at time t. (forward direction). Given the state i at time t, βᵢ(t) computes the probability of the “would-be” observations after time t. α is computed using the previous likelihood section. β can be computed similarly but instead of computing in a forward direction from t=1 to the end, we do it backward from the end t=T.

Start from time=T, compute β backward using the transition prob. a and observation prob. b

With α and β (θ₁ in EM), we can compute the latent factor (θ₂) γ and ξ using the EM algorithm. The first equation below is γᵢ (t) and the second equation is ξᵢⱼ (t).

In the E-step, fixing θ₁, we compute the latent variables (θ₂) — the chance to be in state i and time t (γ) and the transition probability (ξ) from i to j at time t given all the observations. In the M-step, we optimize the model parameter θ₁ given the fixed latent variables γ and ξ. We train them in alternative steps and the model parameters will improve and converge. The following is the detail equations. We will like you to dig deeper into the equations yourself.


Let’s get back to something easier. In this section, we will discuss types of distributions like the Gaussian, the Gamma, etc … They have different parameters and different shapes. A Gaussian has a bell shape, and a Gamma distribution has a unimodal shape or it is exponentially decaying. Since Bayes’ Theorem or many ML methods involve distributions heavily. It is important to shop around to find the appropriate distribution for your data and your models. Finding the right one in resembling the shape means you can simplify the calculation dramatically.


For a continuous random variable, the expectation of f(x) is

For a discrete random variable, it is

Variance and covariance

For a continuous random variable, it is

Covariance indicates whether two variables are positively or negatively related. If it is zero, they are unrelated.

Variance can be expressed as the equation below (proof).

This formalization is useful for many proves including finding the bias and the variance for MLE estimator.

The covariance for X & Y is

Sample variance

Sample variance is an estimator for the population’s variance by collecting m data points.

The mean of the sampling data is correlated with the data itself. (xᵢ-μ)² will be smaller than that of the population. To compensate that, we need to divide it by m-1 instead of m (proof). Note that, for the sample mean, we just divide the sum by m and it is already unbiased.


Covariance does not account for the variance of each variable. It shows how variables are related but it does not quantify the extent well. Each variable is kind of on a different scale (or units). Correlation normalizes the covariance such that each variable corresponding to the same scale.

Gaussian distribution/Normal distribution

In a Gaussian distribution, 68% of data is within 1 σ from the μ and 95% of data is within 2 σ. The standard normal distribution has μ=0 and σ =1.

Multivariate Gaussian distribution(Gaussian Distribution with N variables)

Multivariate Gaussian distribution models multiple variables distribution using a Gaussian model. The plot below is a bivariate Gaussian distribution which two random variables involved.


The Gaussian distribution for multivariate is

where Σ is the covariance matrix. Each element in this matrix records how two random variables are correlated. We can sample a multivariate Gaussian distribution by

Bernoulli distributions

The Bernoulli distribution is a discrete distribution with two possible outcomes, one with probability 𝜙 and the other 1-𝜙.

Binomial distributions

The binomial distribution is a discrete distribution from n Bernoulli trails.


Variance in estimating θ in Bernoulli Distribution (proof)

Poisson Distribution

Assuming a rare event with an event rate λ, the probability of observing x events within an interval t is (proof):

Dirac distribution

Gamma distribution

Here is the mathematical definition for the Gamma distribution.

Depend on the values of α and β, part of the equation is growing and part of the equation is decaying.

Therefore, in contrast to Gaussian distribution, Gamma distribution can generate distributions of many shapes. Depending on the values of α and β, the shape may resemble a single mode distribution or an exponential decay function


The expectation and the variance of the Gamma distribution are:

Modified from source

If we have some estimation on the mean and the variance of the data that we want to model, we can reverse engineering the equations above to find α and β. Unlike the Gaussian distribution, by definition, the PDF for the Gamma distribution is zero for x is negative. This is a nice constraint in modeling some real problems that x, like height, can never be negative.

Beta distribution

The beta distribution is defined as

Here are different distribution shapes using different parameters of a and b. As shown below, Beta distribution can model U shape and uniform distribution.

The mean and the variance of the Beta distribution are:

Modified from source

We will demonstrate the application of beta distribution in Bayesian inference later.

Dirichlet distribution

Dirichlet distribution is a distribution of probability distributions. Let’s make it less abstract. A radio station DJ is going to choose songs from the rock, alternative rock and pop music for the next hour. The ratio of the songs is about 0.3, 0.2 and 0.5 respectively. But one day, the DJ wants to readjust the ratio based on the feedbacks collected from the listeners. As we learn, probability theory deals with probability distributions rather than point estimations. So the DJ needs to replace the point estimation concept with a distribution concept. Now, the DJ has the random variable θ composed of 3 components (θ₁, θ₂, θ₃) which each one represents the probability distribution for a specific genre to be picked. The DJ wants to learn the distribution of p(θ) = p(θ₁, θ₂, θ₃) with the condition that θ₁, θ₂, θ₃ are all non-negative and add up to 1. The Dirichlet distribution notation is

where α is the parameters for the Dirichlet distribution. It controls the shape of the distribution. Since θ has 3 components, α composes of 3 values also so we can control the shape between θᵢ easily. The Dirichlet distribution is defined as:

Under the constraint that the probabilities of all possible choices add up to 1, the values for θ₁, θ₂, θ₃ lays on a triangular plane.

Let’s not get stuck at the equations and focus on the shapes of the distribution. The diagram below shows some of the possible shapes for K = 1 and K = 3 with different αᵢ.

Source 12

(For K=3, the viewing angle of the triangle is changed so the probability distribution can be plotted in the vertical direction.)

As shown, it can represent uniform distribution, skewed and unskewed unimodal shape and multiple modal shapes with different parameters. To understand the relationship between α and the shape, let’s summarize some of its property first.


For example, when all αᵢ are equal. E(θ₁)=E(θ₂)=E(θ₃). In fact, to have a uniform distribution for θ₁, θ₂, θ₃, we can set αᵢ all equal to 1.

The source of the plot

If we want the choice to gear towards one specific genre, we make the corresponding αᵢ to be much higher than 1 and much higher than other αⱼ (assume αⱼ ≥1). When αᵢ<1, p(θ) will approach infinity when θᵢ approaches 0. Therefore, this genre will be unlikely picked. Let’s make all αᵢ to be equal and name it as γ. As γ increase, the sampled distributions will get more and more uniform.

Modified from source with K=10

Now, our DJ has created a probability model for the choices. Indeed, Dirichlet distributions are commonly used as prior distributions or the conjugate priors for the Bayesian statistics. We will elaborate its usage further in topic modeling.

Chi-squared distribution

Modified from source

Exponential and Laplace distribution

Modified from Wikipedia


Central limit theorem

Consider rolling dice once, the value of the throw is uniformly distributed if the dice is fair. Let’s throw the dice many times and we average the result. We repeat the experiment many times and collect all the aggregated results. When we plot these results, we will realize the results are Gaussian distributed. The central limit theorem says that we can have many independent random variables, say each represents the result of a throw. When we add them together and normalize it (average it), the normalized results after many trials tend to be Gaussian Distributed. The theorem extends to any distribution for the independent variables. So no matter whether the dice is fair, biased or with any data distribution, the normalized results are Gaussian Distributed. This implies that individual variables may have different data distribution, we can use a Gaussian distribution to model its aggregate results.

If the random variable is already Gaussian Distributed with variance σ², the normalized results will be

i.e. the variance of the aggregated result drops as the result is average from a larger sample size.

Bayesian inference

Frequentist inference draws conclusions from the frequency of events. If we get two heads from flipping a coin twice, p(head) equals 100%. However, a frequentist will unlikely publish such a result because the sample size is too small. The result may be just by chance and it is too early to spot any trend.

Bayesian inference uses Bayes’ theorem to derive a posterior distribution from the likelihood and a prior belief. When more observation is available, we turn the posterior into the prior and compute the posterior given the new evidence. Since the posterior is modeled as a certainty distribution rather than a point estimate, we can continue combining it with new evidence to form a new belief.

For example, from the previous measurements (or observations), the dynamic model on how a car position changes and even naive believe, we can form a prior belief on the current position of a car. Given the current sensor reading (like GPS), we form a likelihood of our current sensor reading given different locations. Using Bayes inference, we can derive the probability distribution P(H|E) of the car location given the sensor readings.

We turn the posterior into the prior for our next iteration when new observations are made. If the sample size is small, the likelihood curve is wide with a lower peak. We have not drawn enough data to rule out many possibilities. Therefore, the posterior will resemble the prior belief if it is strong (narrow and peaked). When more data is collected, the likelihood will get pointy and the posterior distribution will get closer to the likelihood curve.

illustration purpose only

Frequentist v.s. Bayesian

Frequentist applies the maximum likelihood estimation to find an optimal model parameter in explaining the observations. The Bayesian put the spotlights on the model parameter θ and calculate the posterior for the model parameter using the Bayes’ Theorem.

It calculates the probabilities of different models given the observation. Of course, this can be extremely complex for high dimension or large continuous space. Further simplification on the likelihood and the prior model can be applied. Or we can solve the problem by sampling or approximation.

Depending on the way how samples are collected, it may be easier to answer P(x|y) than P(y|x). Sometimes, the probabilities are easy to model in the opposite direction. For example, P(y | x, θ) and P(θ) are often modeled with Gaussian or Beta distribution. Below is an example of the Bayesian linear regression.

Linear regression

We ignore the denominator P(y |X) in the Bayes’ Theorem because it is not a function of θ and can be dropped when we optimize the posterior w.r.t. θ. For both P(y | x, θ) and P(θ), we model them with separate Gaussian models in the Bayesian linear regression. In practice, P(y |X) or P(X) is often very hard to compute, so this is a very nice simplification in optimizing the posterior.

If you cannot solve P(y|x) easily, transform the problem with Bayes’ Theorem and try your luck in the opposite direction P(x|y).

In the Bayes’ Theorem, we have relatively large freedom in choosing the model for P(θ). But not every choice is equal. Since the posterior can be used as the prior in our next iteration, we may want the posterior and the prior to be in the same family of distribution. So we can apply the same math for the next iteration and compute the solution analytically. A prior is a conjugate prior if the corresponding posterior belongs to the same class of distribution. For example, both are Gaussian distributed.

Let’s look at them in detail but we will model the prior with a Beta distribution instead.

Bayesian inference with Beta distribution

For a Binomial distribution, we can model it with a beta distribution. If the likelihood is binomial or Bernoulli, we will choose our conjugate prior to be beta distributed. The posterior will be beta distributed also and easy to compute analytically.

The detailed example demonstrating the Bayesian inference with Beta distribution can be found here. Here is the skeleton on finding the posterior using beta distributions where we use beta distributions for both P(data|θ) and P(θ). The posterior P(θ |data) will be beta distributed also.

Let’s consider the infection rate for a person exposed to a virus. If we have no prior knowledge, we can start the prior with a uniform distribution (left below). The posterior in the Bayesian inference will resemble the frequentist result since our belief is weak.

Otherwise, we can start with some prior knowledge based on past experience, knowledge or even gut feeling. However, if our belief is off, we need to collect more data to gradually reshape the posterior to the correct curve.

While we can start with a non-informative prior, the strength of Bayes inference depends on how easy and reliable to form our prior belief.

Gamma distribution as a conjugate prior

We can use a Gamma distribution as the conjugate prior if our likelihood can be modeled by a Gaussian distribution.

The Gaussian distribution for our likelihood p(x|θ) can be expressed in the following form.

Apply the Bayes’ Theorem, we can derive the posterior in the form of Gamma distribution also.

Prediction & regularization

With Bayes’ Theorem, we calculate the posterior for the model θ given the observations. If we assume the model parameter θ to be a zero-centered Gaussian distribution, the prior p(θ) turns into an L2-regularization term in the objective function (details). Conceptually, p(θ) can be viewed as a regularization factor in penalizing a cost function if θ deviates from our prior belief. As shown below, it can be a very complex function.

To make a new prediction, we use the posterior p(θ | X, y) in our training as p(θ). Then we find the marginal probability p(y|x₀) by integrating over θ. This is Marginal inference. We compute the probability of a variable by summing everything else out.


Jacobian matrix & Hessian matrix

These matrices are the first order and the second order derivatives of f respectively.

This notation is called the numerator layout. Hessian matrix is symmetrical. The upper bound for a quadratic equation with the Hessian matrix and a vector v is

Below, we use a different notation called denominator-layout notation. It is the transpose of the numerator layout.

Here is the result in differentiating a vector and a matrix in such notation.

Matrix Decomposition

Graphical interpretation of the component analysis

We can represent a 2-D vector x by projecting x onto the x-axis and y-axis. So a data point can be represented as (xᵢ, yᵢ). The choice of x-axis and y-axis is pretty arbitrary. Instead, we can select a unit vector q and calculate the projection of x on q. The projection vector will be qqᵀx with its magnitude equals qᵀx.

In ML, we extract features from a high-dimension space into a low-dimension latent space (say with k-dimension). Conceptually, we project x onto k different vectors qⱼ. The choices of qⱼ are important. If it is done correctly, we can use fewer components in representing information. For example, if we choose q₁ and q₂ below, we may ignore the q₂ component (the blue dots). They may be too small and we can ignore them. That is not true if we pick the x-axis and y-axis.

SVD decomposes a matrix into independent components. All q selected in SVD are independent of each other (orthogonal), i.e. the extracted features are not correlated. Conceptually, SVD picks the first q that minimizes the least square error below if the rest of the components are dropped.

Modified from source

XXᵀ is symmetrical. The optimal q (named q₁) will be the eigenvector for XXᵀ with the largest eigenvalue λ or the largest singular value σ (λ = σ²). (proof).

Then we pick the next component based on the same principle with the condition that q are orthogonal to each other. The selected q₂ will have the second largest eigenvalue. We can continue the process until we run out of eigenvectors. But the description here is just for demonstration. We don’t solve this problem sequentially. It simply helps us to understand the principal of the feature extraction behind SVD.

Singular Value Decomposition (SVD)

SVD is presented differently in linear algebra. It is the same method but from a different angle. Any matrix A can be decomposed as (proof)

where U compose of uᵢ — the eigenvectors for AAᵀ and uᵢ are orthogonal (perpendicular/independent) to each other. (uᵢ is equivalent to qᵢ in the last section.) Similarly, V is composed of eigenvector vᵢ of AᵀA which is also orthogonal to each other.

From the equation above, A can also be written as

where uᵢ and vᵢ are unit vectors. So when we evaluate the importance of the decomposed components, we can ignore those terms with very small σᵢ.

If we keep only the topmost k terms with the largest σᵢ, we effectively reduce the rank of A into k, i.e. the extracted features is in k dimension only. We effectively reduce the dimension of the input with the consideration of the importance of each principal component. That is what PCA does.

Principal Component Analysis PCA

As shown below, PCA simply truncates those insignificant terms in SVD (terms with very small σᵢ).

Intuitively, two input features may be so correlated that you can create a new feature to represent both. For PCA, we want to find k independent features to represent our data.

PCA Example

In ML, SVD decomposes a matrix containing training data into independent features. For example, the row of the matrix contains the movie ratings from a user. The column contains the user ratings for a movie.

If we pick the top K eigenvalues of AAᵀ, its corresponding eigenvectors is equivalent to the top K optimized qk vectors below:

Modified from source

Recall that we project x into these principal components qk. Finding the top K optimized qk, we reduce the dimension of x into K. The projected vector is the latent factor k of x.

We can concatenate qᵢ to form the matrix Q. We can derive the latent features of user by multiplying Qᵀ with a user’s movie ratings. (qᵢ is M × 1 where M is the number of movies, and Q is M × K.)

SVD discovers the patterns (principal components) on user ratings. We can imagine some of the principal components may represent the genre of the movie or the years of the release. For example, the first component in zᵢ may indicate whether a user likes comedy.

Probabilistic PCA

In SVD, we decompose X to USVᵀ. Probabilistic PCA models X≈WZ. We will use the EM algorithm to learn W and Z which Z can be treated as the latent feature of X. Unlike SVD, W does not need to be orthonormal. Columns do not need to be of unit length or perpendicular to each other.

First, let’s assume the latent variable zᵢ is a zero-centered Gaussian distribution. With W, we can reconstruct the original data X by WZ where x is modeled by a Gaussian also.

Z is the latent variable θ₂ in the EM algorithm and W is θ₁. Our objective is

In the E-step, we compute the Gaussian distribution for q(zᵢ).

In the M-step, we optimize

Without proof here, the algorithm is:


Kernel PCA

From one perspective, PCA finds a set of vector q that maximize qᵀXXᵀq. Since XXᵀ is symmetrical, q will be the eigenvectors of XXᵀ with the largest eigenvalues (details).

So the problem statement can be rewritten as finding the eigenvectors with the largest eigenvalues.

In the article on ML algorithms, we replace XXᵀ with a kernel to map the input to a higher dimension. This allows us to create a linear boundary to classify data that is not linearly separable in the low dimension space (details). On the contrary, PCA is often considered as a dimension reduction technique. So both technique seems to go in the opposite direction. However, sometimes, we need to go big before go small. Going to a high-dimensional space allows us to cluster information with a simpler and clear boundary. Once the information is clearly cluster, it will be easier to map it to a lower dimensional space. This is the motivation behind PCA kernel. Let’s start with the following equation.

After some manipulating, we get

Modified from source

So given the matrix K is holding the kernel result, we can find aᵢ by finding the eigenvectors of K. Let’s define the kernel function with a Gaussian function. The corresponding latent factor for x can be computed as:


Below is how we make a prediction for a new input x₀ using Kernel PCA.


Cholesky decomposition

The Cholesky decomposition of a Hermitian positive-definite matrix A is


A Hermitian matrix is a square matrix that equals its transpose conjugate. A transpose conjugate takes the complex conjugate of each element and then transpose the matrix.

A covariance matrix is symmetric (a special case for Hermitian if values are all real) and semi-positive definite. Therefore, Cholesky decomposition is commonly used in ML for easier and more stable manipulation.

Moore-Penrose Pseudoinverse

For a linear equation system, we can compute the inverse of a square matrix A to solve x.

But not all matrices are invertible. In ML, it will be unlikely to find an exact solution with the presence of noise in data. But the solution for x can be estimated as,


Statistical Significance

Null hypothesis H₀ is a statement that there is no relationship between two measured phenomena, for example, no correlation between wealth and happiness. This idea can be further extended for statements that we want to nullify and repute. A null hypothesis will be rejected if the observed data is statistically significant, i.e. it is very rare to observe the collected data if the null hypothesis is true. For example, if we see 100 heads in flipping a coin 100 times, we can “nullify” the hypothesis that the coin is fair. Therefore, the alternative hypothesis H₁, one that contradicts with H₀, is likely true (the coin is biased). In practice, it is harder to quantify the relationship between two variables than computing the chance that the collected data happens by chance only. So a null hypothesis serves as a better mean to draw conclusions about two phenomena.

p-value (probability value) is the probability of the observed sample when the null hypothesis is true. A small p-value (typically ≤ 0.05 or ≤ 0.01) shows strong evidence against the null hypothesis, i.e. it is rare for that to be happened by chance.


For example, after collect 100 data points, we can calculate a correlation coefficient based on the data. As shown above, if the correlation is -0.25 for the 100 data points we collected, its corresponding PDF is about 0.012. Only 2.5% of the population may have correlation smaller than -0.2. Therefore, the null hypothesis is likely false.

Confidence interval

After conducting an experiment to collect a sample. We can use the sample data points to estimate a population parameter like the mean (called estimator). A confidence interval can be calculated as a range around this sample mean. A 95% confidence level means that in 95% of those experiments, its confidence interval contains the true mean of the population. In other words, 1 out of 20 chance that the confidence interval of an experiment does not contain the true mean.

Here is the skeleton in calculating the confidence interval for the sample mean.

Modified from source

And the sample variance:

Modified from source

Chi-square test

Chi-square test is a popular test in measuring the likelihood that the correlation in the observed data is by chance only, rather than some correlation between two variables.

We compute the Chi-square value with the formula above. We compare the real count from the sample and the expected count assuming no correlation exists. Start from the Chi-square, we can look up the corresponding p-value from a pre-computed table. If the Chi-square coefficient is close to zero, H₀ is true. Here is an example below in determining whether gender impacts the choice of a pet.

In this example, the p-value corresponding to the calculated Chi-square coefficient is small. Therefore, gender and choice of pet are related.

Exploratory data analysis

To explore the data, we may calculate the covariance between two variables or perform a scatter plot like the one below to spot trends.


For example, the green dots and the blue dots below are houses for SF and NY respectively. If we have a decision stump for elevation > 73 feet, the one satisfies this condition will likely in SF.

Modified from source (but the data can be outdated)


L1, L2-norm

Lp-norm, L∞-norm (max norm) & Frobenius norm


Jaccard Similarity

The Jaccard similarity measures the ratio between the size of the intersection over the size of the union.

Cosine Similarity

Cosine similarity measures the angle between two vectors.

Pearson Similarity

Pearson correlation coefficient ρ measures the correlation between two variables.


Here are some free courses in ML.

Andrew Ng Course

John W. Paisley Course

Nando de Freitas Course

Mark Schmidt Course


Each section in the article can be spawned off to a lengthy article. So congratulation that you make it so far. Next, we will dig into more traditional ML algorithms.