Bishop’s PRML book: review and insights, chapters 1–3

Alex Honchar
Published in techburst · Nov 4, 2017

Hi all again! Long story short, I decided to revisit the basics of classical machine learning: since Andrew Ng's famous Machine Learning MOOC on Coursera and a period of heavy scikit-learn use, I haven't done much of that and have mostly been working with neural networks (because of the tasks I am involved in). I bought the machine learning "bible" on Amazon, and even though I have read only a couple of chapters, I already have a better understanding of a lot of topics that seemed obvious to me but in fact weren't. I would like to share my insights and the most important moments from the book; you can consider this a sort of short version.

It might be interesting for practice-oriented data scientists who are looking to improve their theoretical background, for those who want to review the basics quickly, or for beginners who are just starting out.

1. Introduction

Usually the introduction is a chapter to skip, but not in this case. Bishop starts with an emphasis on the Bayesian approach, and it dominates all the other chapters. The first interesting moment for me was the curse of dimensionality. I suppose you know it in the form "the more dimensions we have, the more difficult the problem is". Bishop explains it with a much better mathematical example (all illustrations are taken from https://www.microsoft.com/en-us/research/people/cmbishop/, where permission is granted for non-commercial purposes).

For example, take a very simple classification problem that we can solve by breaking our space into sub-regions and simply counting how many points of each class fall into each of them. If a region contains a majority of points of some class B, then any new point in that region is assigned to class B. The problem is that as the dimension of the data grows, the number of regions in the grid grows exponentially, and so does the amount of data needed to fill them.

In 1D we don't need many regions to separate the data
But in 3D it gets much more difficult!
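A quick back-of-the-envelope check of this exponential blow-up (a minimal sketch, not from the book; the choice of 10 bins per dimension is my own assumption):

```python
# How many grid cells do we need if we split each axis into 10 bins?
BINS_PER_DIM = 10  # assumed granularity, not a value from the book

for d in (1, 2, 3, 10, 20):
    cells = BINS_PER_DIM ** d
    print(f"D = {d:2d} -> {cells:.3e} cells to fill with data")
```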

Another interesting property of data in high-dimensional spaces comes from a geometric point of view: if we consider data distributed inside a sphere of fixed radius in D-dimensional space, most of the volume (and hence most of the data) is concentrated in a thin shell close to the surface. Moreover, if you solve some of the exercises at the end of the chapter, you can show that for hypercubes the mass concentrates not just near the surface but near the "corners".
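A small numeric illustration of the shell effect (my own sketch: since the volume of a D-dimensional ball scales as r^D, the fraction lying in the outer shell of relative thickness eps is 1 - (1 - eps)^D):

```python
# Fraction of a unit D-ball's volume within a thin outer shell of thickness eps.
# Volume scales as r**D, so the shell fraction is 1 - (1 - eps)**D.
eps = 0.05  # shell is only 5% of the radius (my own choice)

for d in (1, 2, 3, 10, 100, 1000):
    shell_fraction = 1 - (1 - eps) ** d
    print(f"D = {d:4d}: {shell_fraction:.1%} of the volume is in the outer 5% shell")
```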

One more interesting concept that is often ignored is decision theory. Usually we just train some classifier and say that if the probability is higher than 0.5 it's class A, and so on. But there are important applications, for example medical ones, where if we aren't sure about our prediction we simply refuse to make it (the reject option).

Reject option example: we prefer not to make any decision in rather uncertain situations, to minimize the misclassification rate

In general, it is much smarter to split our classification problem into two stages: inference (learning the posterior probabilities p(class_i|x)) and decision, where we make the class assignments. There are three main ways to do it (a small decision-stage sketch follows the list):

  • Hard way (generative): solve the inference problem for the class-conditional densities and priors of each class, then use Bayes' theorem to obtain the posteriors
  • Medium (discriminative): infer p(class_i|x) directly and afterwards apply decision theory, possibly with reject regions
  • Easy (how we usually do it): learn a discriminant function f(x) that maps the data directly to a class
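Here is a minimal sketch of the decision stage with a reject option, assuming we already have posterior class probabilities from some inference step (the threshold value is my own choice, not from the book):

```python
import numpy as np

def decide(posteriors: np.ndarray, reject_threshold: float = 0.8):
    """Assign each sample to the most probable class, or reject it.

    posteriors: array of shape (n_samples, n_classes) with p(class_i | x).
    Returns class indices, with -1 meaning "reject" (too uncertain).
    """
    best_class = posteriors.argmax(axis=1)
    best_prob = posteriors.max(axis=1)
    return np.where(best_prob >= reject_threshold, best_class, -1)

# Toy example: three samples, two classes
p = np.array([[0.95, 0.05],   # confident -> class 0
              [0.55, 0.45],   # uncertain -> reject
              [0.10, 0.90]])  # confident -> class 1
print(decide(p))  # [ 0 -1  1]
```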

At the end of this chapter we get the generalized loss function concept (we will use it soon!) and basic concepts of information theory, such as entropy.

Summary of the introduction:

  • We want to study everything within the Bayesian framework
  • The curse of dimensionality matters
  • To minimize the misclassification rate, we should apply decision theory
  • Making predictions is about inference plus decision theory

2. Probability distributions

After getting straight "A"s in probability and statistics both in high school and during my bachelor's, I was at first pretty skeptical about whether I needed to read this chapter. I was mistaken again. It is an amazing bottom-up explanation of all the distributions and their conjugate priors, together with the idea of likelihood. This is the core of the Bayesian framework.

The general idea is clear: having some distribution (let's say Bernoulli with its parameter mu), we can set a prior over mu that is conjugate to the main distribution. After we see more data, we update this prior to obtain a posterior over mu. The posterior then combines our prior beliefs with the likelihood of the observed data, so the resulting model fits the data better (a small Beta-Bernoulli sketch follows the list below).

For example:

  • Bernoulli -> Beta
  • Multinomial -> Dirichlet
  • Gaussian -> Gaussian (for the mean), Gamma (for the precision), or Gaussian-Gamma (for both)
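A minimal sketch of the Beta-Bernoulli conjugate update (my own toy example: the prior pseudo-counts a, b and the data are assumed for illustration):

```python
# Conjugate update for a Bernoulli likelihood with a Beta(a, b) prior on mu.
# Observing m ones and l zeros gives the posterior Beta(a + m, b + l).
a, b = 2.0, 2.0               # prior pseudo-counts (assumed)
data = [1, 0, 1, 1, 1, 0, 1]  # toy coin flips (assumed)

m = sum(data)          # number of ones
l = len(data) - m      # number of zeros
a_post, b_post = a + m, b + l

print(f"posterior: Beta({a_post}, {b_post})")
print(f"posterior mean of mu: {a_post / (a_post + b_post):.3f}")
```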

This chapter has a lot of math, because for each distribution we want to work out the maximum likelihood estimates of the mean and/or variance. Another interesting moment is their sequential estimation, which leads us to the Robbins-Monro algorithm; if you check it out, it will remind you of good old gradient descent, so you can see where that idea comes from.
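A minimal sketch of the sequential (online) maximum likelihood estimate of a Gaussian mean, which already has the gradient-descent flavour the chapter hints at (the data stream here is simulated, my own assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, scale=1.0, size=1000)  # simulated data stream (assumed)

mu = 0.0
for n, x in enumerate(stream, start=1):
    # Sequential ML update: mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1}).
    # The 1/N factor acts as a decaying learning rate (the Robbins-Monro conditions).
    mu += (x - mu) / n

print(mu, stream.mean())  # the two estimates agree up to floating point
```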

A totally new moment for me was the existence of distributions for periodic variables, but the idea of switching to polar coordinates is pretty clear, and we get the von Mises distribution:

The same distribution on a 2D Cartesian plot and on a polar plot
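A small sketch evaluating the von Mises density directly from its definition, p(theta | theta0, m) = exp(m * cos(theta - theta0)) / (2 * pi * I0(m)); the concentration m and mean direction theta0 below are assumed for illustration:

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(theta, theta0=0.0, m=2.0):
    """Density of the von Mises distribution on [0, 2*pi)."""
    return np.exp(m * np.cos(theta - theta0)) / (2 * np.pi * i0(m))

theta = np.linspace(0, 2 * np.pi, 5)
print(von_mises_pdf(theta, theta0=np.pi, m=2.0))  # peaks at theta = pi
```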

There is a short but very important subchapter about mixtures of Gaussians; they always made sense to me visually, and now they do mathematically as well. Don't be lazy and check it out too.

Ah yes, and all the distributions I have mentioned above are members of the exponential family, which generalizes them.

The last part of the chapter is about non-parametric methods for estimating the distribution of given data. We cannot always rely on some Gaussian or Bernoulli if the distribution is rather complicated, has many peaks, etc. The author introduces kernel density estimators and nearest-neighbour methods to estimate them.
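A minimal sketch of a Gaussian kernel density estimator, written from the definition p(x) ≈ (1/N) * sum_n N(x | x_n, h^2); the bandwidth h and the bimodal sample are my own assumptions:

```python
import numpy as np

def gaussian_kde(x_query, samples, h=0.3):
    """Kernel density estimate: the average of Gaussian bumps centred on the samples."""
    diffs = x_query[:, None] - samples[None, :]                        # (n_query, n_samples)
    kernels = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200),
                          rng.normal(1, 1.0, 300)])   # bimodal toy data (assumed)
grid = np.linspace(-4, 4, 9)
print(gaussian_kde(grid, samples).round(3))  # two bumps, around -2 and 1
```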

I would also like to mention that this chapter has a lot of math exercises that are at least interesting to try to solve.

Summary of probability distributions:

  • The core of the Bayesian framework: for each distribution parameter we can set a prior and turn it into a posterior
  • Beta, Gamma and Dirichlet distributions serve as these conjugate priors
  • You'd better know what the likelihood is and where it comes from
  • Sequential estimation of distribution parameters is, conceptually, almost the same as gradient descent
  • The Wishart and von Mises distributions were totally new to me
  • Most of the distributions we are used to working with belong to the exponential family

3. Linear models for regression

In this chapter we generalize the idea of linear regression from the form f(w, x) = w_0 + w_1*x_1 + … to f(w, x) = w_0 + w_1*ф_1(x) + … + w_M*ф_M(x), where the ф_j(x) are called basis functions and can take many different forms, from Gaussian basis functions to wavelets.
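A small sketch of a design matrix built from Gaussian basis functions (the centres and the width s below are my own choices for illustration):

```python
import numpy as np

def gaussian_design_matrix(x, centres, s=0.5):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), plus a constant column for w_0."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])   # bias column first

x = np.linspace(0, 1, 10)
centres = np.linspace(0, 1, 4)   # 4 Gaussian bumps spread over [0, 1] (assumed)
Phi = gaussian_design_matrix(x, centres)
print(Phi.shape)  # (10, 5): one bias column + 4 basis functions
```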

When we perform maximum likelihood for Gaussian(y | f(w, x), beta), where f(w, x) is our linear basis function model and beta is the noise precision, and we want to estimate w, we end up with the normal equations, where we can apply the idea of the Moore-Penrose pseudo-inverse of a matrix. Sequential estimation of w gives the least-mean-squares (LMS) algorithm.
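A minimal sketch of both routes on a toy problem: the closed-form solution of the normal equations via the Moore-Penrose pseudo-inverse, and a sequential LMS fit (the polynomial basis, toy targets, noise level and learning rate are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=len(x))  # toy noisy targets (assumed)
Phi = np.vander(x, N=4, increasing=True)                     # polynomial basis 1, x, x^2, x^3

# Closed-form ML solution of the normal equations via the pseudo-inverse.
w_ml = np.linalg.pinv(Phi) @ t

# Sequential least-mean-squares: w <- w + eta * (t_n - w^T phi_n) * phi_n
w_lms = np.zeros(Phi.shape[1])
eta = 0.1                                   # learning rate (assumed)
for _ in range(500):                        # several passes over the data
    for phi_n, t_n in zip(Phi, t):
        w_lms += eta * (t_n - w_lms @ phi_n) * phi_n

print(np.round(w_ml, 2))
print(np.round(w_lms, 2))  # LMS converges towards the ML solution
```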

As for me, the most important part of this chapter is the bias-variance decomposition: taking the expected value of the generalized regression loss, we can split it into three main parts (a small simulation follows the list):

  • The intrinsic noise of the data, which shows the minimum achievable value of the loss
  • The squared bias: the squared difference between the average prediction over all possible datasets and the desired regression function
  • The variance, which tells us how much the solution for a particular dataset varies around that average
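A compact simulation of the decomposition (entirely my own toy setup: a sin(2*pi*x) ground truth, 100 noisy datasets, degree-1 vs degree-6 polynomial fits) that shows the usual trade-off: the rigid model has high bias and low variance, the flexible one the opposite:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
true_f = np.sin(2 * np.pi * x)          # the "desired" regression function (assumed)
n_datasets, noise_std = 100, 0.3        # experiment settings (assumed)

for degree in (1, 6):
    preds = []
    for _ in range(n_datasets):
        t = true_f + rng.normal(0, noise_std, size=len(x))   # a fresh noisy dataset
        coeffs = np.polyfit(x, t, deg=degree)
        preds.append(np.polyval(coeffs, x))
    preds = np.array(preds)             # shape (n_datasets, n_points)

    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_f) ** 2)    # squared bias
    variance = np.mean(preds.var(axis=0))          # variance around the average
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```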

Next we come to Bayesian linear regression. We could still perform the same maximum likelihood estimation, but we are not actually interested in finding w itself; what we want is to make predictions, and this brings us to the predictive distribution.

The following illustration shows how the variance of this distribution changes as we see more data:

Example of predictive distribution

Of course, if we have a distribution, we can sample from it as well:

Samples from posterior distribution
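A minimal sketch of the Bayesian linear regression posterior and the predictive distribution at a single test point. The toy data, basis, alpha and beta are my own assumptions; the formulas are the standard ones, S_N^{-1} = alpha*I + beta*Phi^T*Phi, m_N = beta*S_N*Phi^T*t, and predictive variance 1/beta + phi(x)^T S_N phi(x):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)  # toy data (assumed)
alpha, beta = 2.0, 25.0          # prior precision and noise precision (assumed)

def phi(x):                      # polynomial basis 1, x, ..., x^5
    return np.vander(np.atleast_1d(x), N=6, increasing=True)

Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # posterior covariance
m_N = beta * S_N @ Phi.T @ t_train                                      # posterior mean

x_star = np.array([0.5])
mean = phi(x_star) @ m_N
var = 1.0 / beta + phi(x_star) @ S_N @ phi(x_star).T   # predictive variance at x_star
print(mean, var)

# We can also draw plausible regression functions by sampling w from the posterior.
w_samples = rng.multivariate_normal(m_N, S_N, size=3)
print(phi(x_star) @ w_samples.T)  # three sampled predictions at x_star
```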

The predictive distribution of Bayesian linear regression has another interesting property: we can write its mean as a linear combination of kernel values multiplied by the target variables, where the kernel is called the equivalent kernel, and models that predict through such linear combinations are called linear smoothers.
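A tiny check of this linear-smoother view: the predictive mean at a point x* equals a weighted sum of the training targets, with weights given by the equivalent kernel k(x*, x_n) = beta * phi(x*)^T S_N phi(x_n). The basis, data and hyperparameters below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 15)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, len(x_train))
alpha, beta = 2.0, 25.0                  # prior and noise precision (assumed)

def phi(x):                              # polynomial features 1, x, x^2, x^3
    return np.vander(np.atleast_1d(x), N=4, increasing=True)

Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

x_star = 0.3
k = beta * phi(x_star) @ S_N @ Phi.T     # equivalent kernel values k(x_star, x_n)
print(k @ t_train, phi(x_star) @ m_N)    # the same prediction, written two ways
```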

This chapter ends with a rethinking of the concept of overfitting. If in the frequentist framework it is a problem of choosing the right hyperparameters, such as weight decay, here it is a problem of model selection. Once we know the posterior over our models, we can build a mixture distribution out of them, where the weights are the posterior probabilities p(M_i|D) of the M candidate models. A simplified approximation is to just use the single most probable model for predictions.
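A minimal sketch of the mixture-over-models idea, assuming we somehow already have the model posteriors p(M_i|D) and each model's predictive density at a new point (all the numbers are made up for illustration):

```python
import numpy as np

# Posterior probabilities of three candidate models given the data (assumed numbers).
model_posteriors = np.array([0.7, 0.2, 0.1])

# Each model's predictive density for a new target value t_star (assumed numbers).
predictive_per_model = np.array([0.9, 0.4, 0.1])

# Bayesian model averaging: p(t|x, D) = sum_i p(t|x, M_i, D) * p(M_i|D)
mixture_prediction = predictive_per_model @ model_posteriors

# Simplified approximation: just use the single most probable model.
map_model = model_posteriors.argmax()
print(mixture_prediction, predictive_per_model[map_model])
```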

Summary of linear models for regression:

  • We can generalize the usual linear combination of w_i and x_i using basis functions
  • Maximum likelihood for linear basis function models leads us to the normal equations and, in the sequential case, to least mean squares
  • The expected regression loss can be decomposed into noise, squared bias and variance
  • In Bayesian linear regression we are interested in the predictive distribution and samples from it
  • Fighting overfitting in the Bayesian framework is (approximately) about selecting the most probable model

Conclusion

I would like to keep writing about what I've learned. Do you have any suggestions for the next posts? I would like to add some code examples and maybe a bit more math. What do you think, guys? Stay tuned :)

P.S.
Follow me also on Facebook for AI articles that are too short for Medium, on Instagram for personal stuff, and on LinkedIn!
