This story is the sequel to
Chapter 5: Machine learning basics
This story summarizes my intuitions from the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
- Random variable (Stochastic variable) — In statistics, a random variable is a variable whose possible values are the result of a random event. Therefore, each possible value of a random variable has some probability attached to it, representing the likelihood of that value.
- Probability distribution — The function that defines the probability of different outcomes/values of a random variable. The continuous probability distributions are described using probability density functions whereas discrete probability distributions can be represented using probability mass functions.
- Conditional probability — This is a measure of probability $P(A|B)$ of an event A given that another event B has occurred.
Frequentist vs Bayesian debate
The simplest difference between the two methods is that the frequentist approach estimates a single point, while the Bayesian approach estimates a distribution over the model weights and a distribution over the labels (more than one point).
Frequentist Linear Regression
The frequentist view of linear regression is probably the one you are familiar with from school: the model assumes that the response variable (y) is a linear combination of weights multiplied by a set of predictor variables (x). The full formula also includes an error term to account for random sampling noise. For example, if we have two predictors, the equation is:
$y = a + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
where $y$ is the single point estimate (the label), $x_1$ and $x_2$ are the data points, $a$ is what is called the bias, and $\epsilon$ is the noise term.
The goal of learning a linear model from training data is to find the coefficients, β, that best explain the data. In frequentist linear regression, the best explanation is taken to mean the coefficients, β, that minimize the residual sum of squares (RSS). RSS is the total of the squared differences between the known values (y) and the predicted model outputs (ŷ, pronounced y-hat, indicating an estimate). The residual sum of squares is a function of the model parameters:
$RSS(\beta) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
The summation is taken over the N data points in the training set. The closed-form solution, expressed in matrix form, is
$\hat{\beta} = (X^T X)^{-1} X^T y$
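The closed-form solution $\hat{\beta} = (X^T X)^{-1} X^T y$ can be checked on a tiny synthetic dataset; the data and the true coefficients below are made up purely for illustration:

```python
import numpy as np

# Fit frequentist linear regression in closed form on synthetic data.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # bias column + 1 predictor
true_beta = np.array([2.0, 3.0])                         # assumed true coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=50)       # noisy observations

# Solve the normal equations (X^T X) beta = X^T y instead of inverting X^T X,
# which is numerically more stable.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # close to the true coefficients [2, 3]
```

Note that `np.linalg.solve` is preferred over explicitly computing the matrix inverse, even though it implements the same closed-form estimate.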
This approach is based on the maximum likelihood estimate of β. Now I would like to give you the generic formula, based on maximum likelihood estimation, for any model and not just linear regression.
Consider data points $x^{(0)}$ to $x^{(10)}$. Below you can find the equation estimating the weights $\theta$:
$\theta_{ML} = \arg\max_{\theta} \prod_{i=0}^{10} p(x^{(i)}; \theta)$
This product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow/overflow. We observe that taking the logarithm of the likelihood does not change its arg max, but does conveniently transform the product into a sum, so the equation transforms into this form:
$\theta_{ML} = \arg\max_{\theta} \sum_{i=0}^{10} \log p(x^{(i)}; \theta)$
So when we hear that model weights should maximize the log likelihood of a certain label, the “log” enters the conversation precisely to alleviate this underflow problem.
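The underflow problem the logarithm fixes is easy to see numerically. Here is a minimal sketch, using a made-up coin-flip dataset, comparing the raw likelihood product with the log-likelihood sum; the true parameter value of 0.7 is an assumption for the simulation:

```python
import numpy as np

# Simulate 2000 coin flips with a true heads probability of 0.7,
# then estimate that probability by maximum likelihood over a grid.
rng = np.random.default_rng(0)
flips = rng.random(2000) < 0.7        # boolean array: True means heads
thetas = np.linspace(0.01, 0.99, 99)  # candidate parameter values

# Naive product of per-flip probabilities: underflows to exactly 0.0,
# so the arg max is no longer recoverable.
likelihood = np.array([np.prod(np.where(flips, t, 1 - t)) for t in thetas])

# Sum of log probabilities stays in a safe numeric range and keeps
# the same arg max as the original product.
log_likelihood = np.array(
    [np.sum(np.where(flips, np.log(t), np.log(1 - t))) for t in thetas]
)

print(likelihood.max())                   # product underflowed to 0.0
print(thetas[np.argmax(log_likelihood)])  # close to the true 0.7
```

Every candidate likelihood underflows to zero, while the log-likelihood recovers an estimate near the sample proportion of heads.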
Bayesian Approach
This approach is as simple as that: instead of estimating one value for the weights (w) as in the former approach, we have a set (distribution) of weights that produces a set (distribution) of predictions, and we assign a degree of certainty to those predictions and weights.
Imagine a situation where your friend gives you a new coin and asks you the fairness of the coin (or the probability of observing heads) without even flipping the coin once. In fact, you are also aware that your friend has not made the coin biased. In general, you have seen that coins are fair, thus you expect the probability of observing heads is $0.5$. In the absence of any such observations, you assert the fairness of the coin only using your past experiences or observations with coins.
Suppose that you are allowed to flip the coin $10$ times in order to determine the fairness of the coin. Your observations from the experiment will fall under one of the following cases:
- Case 1: observing $5$ heads and $5$ tails.
- Case 2: observing $h$ heads and $10-h$ tails, where $h\neq 10-h$.
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the probability of observing heads is $0.5$ with more confidence. If case 2 is observed you can either:
- Neglect your prior beliefs since now you have new data, decide the probability of observing heads is $h/10$ by solely depending on recent observations.
- Adjust your belief accordingly to the value of $h$ that you have just observed, and decide the probability of observing heads using your recent observations.
The first method suggests that we use the frequentist method, where we omit our beliefs when making decisions. However, the second method seems more convenient, because $10$ coin flips are insufficient to determine the fairness of a coin. Therefore, we can make better decisions by combining our recent observations with the beliefs we have gained through past experience. It is this thinking model, which combines our most recent observations with our prior beliefs, that is known as Bayesian thinking.
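The coin example above can be sketched in code. Assuming a Beta prior over the heads probability (the standard conjugate choice for coin flips, not something specified in the text), updating the prior with the 10 observed flips yields another Beta distribution:

```python
# Encode the prior belief "coins are usually fair" as a Beta(10, 10)
# distribution over the heads probability, then update it with data.
# For a Beta prior and coin-flip data, the posterior is simply
# Beta(prior_a + heads, prior_b + tails).
def posterior_after_flips(prior_a, prior_b, heads, tails):
    """Beta prior + coin-flip observations -> Beta posterior parameters."""
    return prior_a + heads, prior_b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Case 2 from the text with h = 7 heads out of 10 flips.
a, b = posterior_after_flips(10, 10, heads=7, tails=3)
print(beta_mean(a, b))  # ~0.567: between the prior 0.5 and the frequentist 7/10
```

With more flips the data would dominate the prior, matching the intuition that beliefs should be adjusted rather than discarded.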
Bayesian learning is now used in a wide range of machine learning models, such as:
- Regression models (e.g. linear, logistic, Poisson)
- Hierarchical Regression models (e.g. linear mixed effect, pooled/hierarchical regression)
- Mixture models (e.g. Gaussian Mixture models)
- Deep exponential families (e.g., deep latent Gaussian models)
- Linear dynamical systems (e.g., state space models, hidden Markov models)
Before introducing Bayesian inference, it is necessary to understand Bayes’ theorem. Bayes’ theorem is really cool. What makes it useful is that it allows us to use some knowledge or belief that we already have (commonly known as the prior) to help us calculate the probability of a related event. For example, if we want to find the probability of selling ice cream on a hot and sunny day, Bayes’ theorem gives us the tools to use prior knowledge about the likelihood of selling ice cream on any other type of day (rainy, windy, snowy etc.). We’ll talk more about this later so don’t worry if you don’t understand it just yet.
Mathematically, Bayes’ theorem is defined as:
$P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}$
How does Bayes’ Theorem allow us to incorporate prior beliefs?
Above I mentioned that Bayes’ theorem allows us to incorporate prior beliefs, but it can be hard to see how it allows us to do this just by looking at the equation above. So let’s see how we can do that using the ice cream and weather example above.
Let A represent the event that we sell ice cream and B be the event of the weather. Then we might ask what is the probability of selling ice cream on any given day given the type of weather? Mathematically this is written as P(A=ice cream sale | B = type of weather) which is equivalent to the left hand side of the equation.
P(A) on the right hand side is the expression that is known as the prior. In our example this is P(A = ice cream sale), i.e. the (marginal) probability of selling ice cream regardless of the type of weather outside. P(A) is known as the prior because we might already know the marginal probability of the sale of ice cream. For example, I could look at data that said 30 people out of a potential 100 actually bought ice cream at some shop somewhere. So my P(A = ice cream sale) = 30/100 = 0.3, prior to me knowing anything about the weather. This is how Bayes’ Theorem allows us to incorporate prior information.
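To make this concrete, here is the ice cream example with numbers plugged into Bayes’ theorem. The prior of 0.3 comes from the text; the other two probabilities are illustrative assumptions:

```python
# All probabilities below are hypothetical values for illustration.
p_sale = 0.3               # prior P(A): 30 of 100 people bought ice cream
p_sunny = 0.4              # evidence P(B): probability of a sunny day
p_sunny_given_sale = 0.8   # likelihood P(B|A): most sales happen on sunny days

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_sale_given_sunny = p_sunny_given_sale * p_sale / p_sunny
print(round(p_sale_given_sunny, 2))  # 0.6
```

Knowing the day is sunny doubles our estimated probability of a sale relative to the prior, which is exactly the kind of belief update Bayes’ theorem formalizes.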
Using Bayes’ theorem with distributions
Until now the examples that I’ve given above have used single numbers for each term in the Bayes’ theorem equation. This meant that the answers we got were also single numbers. However, there may be times when single numbers are not appropriate.
So, back to machine learning. In the ice cream example, suppose our prior knowledge is that the probability of selling ice cream on a sunny day is 0.3. But what if that was only our best guess, and the estimate really lies somewhere in a range from 0.25 to 0.3? That is what I am talking about: giving a distribution for every estimate. The same idea works in linear regression. Remember the equation $y = bx + c$? What if our model estimated that $b$ lies in a range from 0.4 to 0.6? That leads us to a range of predictions.
So why would you use this? When a domain expert has data they believe they understand well (that is what is called prior knowledge), giving them a range of plausible weights and a range of plausible predictions lets them choose the best parameters (weights) for the problem they are facing. This is widely used in machine learning: Bayesian model averaging is a common supervised learning algorithm, naïve Bayes classifiers are common in classification tasks, and Bayesian methods are used in deep learning these days, which allows deep learning algorithms to learn from small datasets.
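The “range of weights leads to a range of predictions” idea can be sketched as follows. Sampling the slope uniformly from [0.4, 0.6] and fixing the bias at 1.0 are illustrative assumptions, not a full Bayesian posterior:

```python
import numpy as np

# Instead of a single slope b, draw many plausible slopes from the
# estimated range and look at the spread of the resulting predictions.
rng = np.random.default_rng(42)
b_samples = rng.uniform(0.4, 0.6, size=10_000)  # distribution over the slope b
c = 1.0                                          # assumed fixed bias term
x = 5.0                                          # a new data point

predictions = b_samples * x + c  # y = b*x + c for every sampled slope
print(predictions.min(), predictions.max())  # predictions span roughly [3, 4]
```

A domain expert can then inspect this whole range of predictions, rather than a single point estimate, before committing to a decision.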
Converting a classical frequentist linear model to Bayesian
The difference here is that instead of each variable having one value, each variable now has its own distribution (a set of values):
- b— Normal distribution
- a — Normal distribution
Fitting the Bayesian model amounts to minimizing the dissimilarity between the distribution of the data and the model’s assumed (by default Gaussian) distribution. This dissimilarity is measured by the KL divergence, which is just
$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$
Notice the log again: logarithms are pretty helpful for mitigating the numerical problem we talked about earlier. So here the divergence is just the difference between the distribution of the data and that of the model. If you need more info about KL divergence, check this blog:
Demystifying KL Divergence
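As a minimal sketch, the KL divergence $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ between two discrete distributions can be computed directly from its definition; the two example distributions below are made up for illustration:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.4, 0.4, 0.2]        # "data" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # "model" distribution (uniform)

print(kl_divergence(p, q))  # small positive number: p is close to uniform
print(kl_divergence(p, p))  # 0.0: a distribution has zero divergence from itself
```

Note that KL divergence is asymmetric: $D_{KL}(P \| Q)$ generally differs from $D_{KL}(Q \| P)$, so it is a dissimilarity measure rather than a true distance.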
In the next blog, we will explore implementing models based on Bayesian inference using the Python language and the PyMC3 probabilistic programming framework.
This chapter will be completed in another story, where I will talk about supervised and unsupervised algorithms from the inside!