Luckeciano Melo
9 min read · Feb 18, 2019

Mathematics for Machine Learning — Review (Part I)

In this post, I would like to share some ideas and opinions about Part I of the book “Mathematics for Machine Learning”, which covers the Mathematical Foundations: Linear Algebra, Calculus, and Probability/Statistics.

As mentioned on the book’s website, the book is not intended to cover advanced machine learning techniques, because there are already plenty of books doing that. Instead, it aims to provide the mathematical skills necessary to read those other books.

I found this book while searching the internet for resources to learn and review math concepts for machine learning: something that would not take a large portion of my time, but would still let me understand the principles and abstractions needed for ML research. After reading the first part, I can conclude that it was a great choice. It also provides exercises or programming tutorials at the end of each chapter.

The first chapter (“Introduction and Motivation”), as its name says, just introduces the book and defines some terms and concepts that will be covered in it, such as predictor, training, model, and learning. It also suggests two ways to read the book: a bottom-up strategy (“building up the concepts from foundational to more advanced”) and a top-down strategy (“drilling down from practical needs to more basic requirements”). I chose the bottom-up strategy because my idea was to build a solid base for research. Finally, the chapter ends by briefly describing all the chapters of the book, in approximately two pages.

The foundations and four pillars of machine learning. I will cover the foundations in this post. Source: mml-book.

Chapters 2, 3, and 4 cover the whole area of Linear Algebra, but the authors preferred to divide it into three topics: Linear Algebra (the general, foundational concepts), Analytic Geometry (the geometric intuitions), and Matrix Decomposition (matrix operations and the decompositions themselves). I think the reason is to keep each chapter concise and modular, instead of having one monolithic, massive chapter.

In Chapter 2 (“Linear Algebra”), the book starts with very basic concepts from high school math (Systems of Linear Equations, Matrices, Row Echelon Form and the Gaussian Elimination technique, the Moore-Penrose pseudo-inverse) and moves to more abstract concepts (Vector Spaces and Subspaces, Linear Independence, Homomorphisms, and Affine Mappings).

A mind map for Chapter 2. Source: mml-book.

I found this chapter very complete and helpful for refreshing some ideas. The mathematical definitions and concepts are, in general, well stated with the necessary rigor. Furthermore, each sub-chapter has some illustrative examples that facilitate comprehension. The figures are also great for understanding the abstractions. The chapter also describes the computational implementation and performance of some of the techniques it details. I really liked the fact that very important and interesting properties are called out (as an example, look for the Rank remark). In this chapter, some exercises are good (requiring a solid understanding of the concepts studied), but some are tedious and repetitive (matrix products, solving linear systems, and other direct applications of techniques).
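Just to make the computational side concrete, here is a minimal NumPy sketch (my own toy example, not from the book) of the kind of operations this chapter grounds: solving a linear system, using the Moore-Penrose pseudo-inverse when the system is not square, and checking the rank.

```python
import numpy as np

# Solving a small square linear system Ax = b (what Gaussian elimination does by hand).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print(x)  # [0.8 1.4]

# For a non-square (or rank-deficient) system, the Moore-Penrose pseudo-inverse
# gives the least-squares solution instead.
A_rect = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
b_rect = np.array([1.0, 2.0, 2.5])
x_ls = np.linalg.pinv(A_rect) @ b_rect
print(x_ls)

# The rank tells you how many linearly independent columns/rows the matrix has.
print(np.linalg.matrix_rank(A_rect))  # 2
```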

In Chapter 3 (“Analytic Geometry”), the book uses the abstract concepts from the previous chapter to define geometric elements: Norms, Inner Products, Lengths and Distances, Angles and Orthogonality, Projections, and Rotations.

Mind map from Chapter 3. Source: mml-book.

Like the previous chapter, this one covers these ideas very well, with the necessary formalism and generality. In Analytic Geometry, it is important to teach the concepts in full generality so that the student does not develop a narrow view restricted to Euclidean space (although the illustrations take place in it). The book is very clear on this point, which is very positive. For example, all operations are described first in 2D/3D and then generalized to n dimensions (rotations included!).

The figures are again very helpful for illustrating all the geometric ideas. Additionally, I think the matrices most important for ML within analytic geometry are detailed, like symmetric, positive (semi-)definite, and orthogonal matrices. The chapter also covers the inner product for functions, which is very important for continuous domains.

I found a few things that I did not like. The Gram-Schmidt process is only superficially mentioned, although I think it is important for solidifying the idea of projections and for important methods in traditional ML. Furthermore, the figures are sometimes far away from the text that mentions them. It sounds small, but it makes reading harder in an e-book, IMHO :P. Finally, the exercises follow the same idea as in the previous chapter, but there are substantially fewer of them.
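Since I missed it in the book, here is a rough sketch of the classical Gram-Schmidt process (my own illustration, not the book’s treatment), which makes the role of projections very concrete.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors (classical Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = v.astype(float).copy()
        # Subtract the projection of v onto each already-computed basis vector.
        for q in basis:
            w -= np.dot(q, v) * q
        norm = np.linalg.norm(w)
        if norm > 1e-10:  # skip (numerically) dependent vectors
            basis.append(w / norm)
    return np.array(basis)

Q = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                  np.array([1.0, 0.0, 1.0]),
                  np.array([0.0, 1.0, 1.0])])
print(np.round(Q @ Q.T, 6))  # (close to) the identity: the basis is orthonormal
```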

The last chapter about Linear Algebra is entitled “Matrix Decomposition”, and it is very interesting and important in the context of ML. It starts with some high-school ideas (Trace, Determinants, and Invertibility) and then reaches powerful ones (Eigendecomposition/Diagonalization, SVD, Cholesky Decomposition, and Low-Rank Approximation).

Mind map from Chapter 4. Source: mml-book.

This chapter keeps the quality level of the two previous chapters on Linear Algebra (formalism, intuition, illustrations, nice properties, and exercises as well), but it has another great quality: it includes illustrative examples of real-world applications (“Eigenspectrum of a biological neural network”, “Google’s PageRank — Webpages as Eigenvectors”, and “Finding Structure in Movie Ratings and Consumers”). It is so nice to see that such abstract ideas can be directly applied to problems through insightful mathematical modeling.
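As a taste of the movie-ratings example, here is a tiny SVD-based low-rank approximation in NumPy (the data is my own toy matrix, not the book’s example).

```python
import numpy as np

# A tiny "movie ratings" matrix: rows are users, columns are movies.
ratings = np.array([[5.0, 4.0, 1.0, 1.0],
                    [4.0, 5.0, 1.0, 2.0],
                    [1.0, 1.0, 5.0, 4.0],
                    [1.0, 2.0, 4.0, 5.0]])

# Full SVD: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keep only the top-k singular values for a rank-k approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(approx, 2))               # captures the two "taste" groups
print(np.linalg.norm(ratings - approx))  # reconstruction error (Frobenius norm)
```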

The functional phylogeny of matrices (next figure) is a great way to review all the properties from this chapter!

A functional phylogeny of matrices encountered in machine learning. Source: mml-book.

Chapter 5 is about Vector Calculus, and it seemed to me very focused on laying the groundwork for optimization theory in ML. It describes (Partial) Derivatives, Taylor Series, the Chain Rule, Gradients, Matrix Calculus, Backpropagation and Automatic Differentiation, and a bit about Higher-Order Derivatives and Linearization.

Mind map from Chapter 5. Source: mml-book.

The majority of these ideas can be found in any calculus book. However, the great advantage of this book is the emphasis on matrix calculus operations (which is not so common) and on the context of (Deep) Neural Networks. I think it is a good read for anyone starting out in Deep Learning, at least. It is also nice to see the example of the gradient of a least-squares loss here.
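To make that example concrete, here is a small sanity check of the least-squares gradient in NumPy (my own code and notation, writing the loss as L(theta) = ||y − X·theta||², so that the gradient is −2·Xᵀ(y − X·theta)).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
theta = rng.normal(size=3)

# Least-squares loss and its analytic gradient.
loss = lambda t: np.sum((y - X @ t) ** 2)
grad_analytic = -2 * X.T @ (y - X @ theta)

# Central finite-difference check of the gradient.
eps = 1e-6
grad_numeric = np.array([
    (loss(theta + eps * np.eye(3)[i]) - loss(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```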

This chapter is not as mathematically formal as the previous ones (for example, there is no rigorous definition of continuity or differentiability, no topology concepts, and some proofs are missing). There is nothing about integrals either. However, it maintains the conciseness of the book, and this kind of formalism can be seen as beyond its scope.

The illustrative examples felt a bit denser to me and sometimes not so clear. In my point of view, the problem is that the intuitions behind the ideas of this chapter were not presented to the reader as well as in the previous chapters; instead, we only get several operations described in a mechanical way.

I also felt a lack of a “Computer Science view” in the Automatic Differentiation section. Perhaps if the idea of a computational graph were better illustrated (especially in the example), along with some basic ideas about its implementation, that part would be richer. Finally, the exercises feel more “mechanical”, consisting of direct applications of derivative operations.
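To give a flavor of the “Computer Science view” I missed, here is a very small reverse-mode automatic differentiation toy (my own sketch, far simpler than any real framework, which would also traverse the graph in topological order instead of naive recursion): each operation records its local derivatives in a computational graph, and backward() applies the chain rule.

```python
class Value:
    """A node in a tiny computational graph for reverse-mode autodiff (toy example)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate the upstream gradient times each local derivative.
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

x = Value(2.0)
y = Value(3.0)
z = x * y + x            # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)    # 4.0 2.0
```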

Chapter 6 is about “Probability and Distributions”. Its main topics are basic Probability and Statistics, Random Variables, Discrete and Continuous Distributions, Bayes’ Theorem, Summary Statistics, Statistical Independence, the Gaussian Distribution, Conjugacy, and Change of Variables.

Mind map from Chapter 6. Source: mml-book.

As we can see in the mind map, this chapter covers lots of topics and has several positive points. The way random variables are detailed is great. Furthermore, the link between random variables and Linear Algebra (the geometry induced by the covariance inner product) is also very interesting. Another exciting idea detailed here concerns the “topology” of distributions (which live on statistical manifolds), answering the question “Why do we use the KL divergence instead of a simple Euclidean distance as a similarity measure?”. The book also explains the Gaussian Distribution very well and in much detail. In general, this chapter is a great resource for grounding traditional ML algorithms.
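Just to illustrate one piece of that discussion, here is a tiny NumPy example (mine, not the book’s, and it only shows the definition and the asymmetry of the KL divergence, not the manifold argument) contrasting it with the Euclidean distance between two discrete distributions.

```python
import numpy as np

# KL(p || q) = sum_i p_i * log(p_i / q_i). Unlike Euclidean distance, it is not symmetric.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)           # different values: KL(p||q) != KL(q||p)
print(np.linalg.norm(p - q))  # the Euclidean distance, symmetric by construction
```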

As improvements, I would suggest more details (and illustrative examples) in the Bayesian section. The same goes for the Conjugacy section, which was not so clear in my point of view. Additionally, there are several important topics the book does not cover (at least in the foundations part): Statistical Tests, the Central Limit Theorem, statistical distances (in the book, there is only the mention of the KL divergence described above), Bootstrapping, Confidence Intervals… I think such topics are all very important, especially for Reinforcement Learning and general ML research.

The last chapter of the Mathematical Foundations part covers Continuous Optimization. I really liked the idea of treating this topic as essential for ML, although it is not always covered in college (or is covered only superficially during calculus classes). It describes Gradient Descent, Momentum, Constrained Optimization (with Lagrange Multipliers), Convex Optimization, Linear/Quadratic Programming, and the Legendre-Fenchel Transform (and the Convex Conjugate).

Mind map from Chapter 7. Source: mml-book.

In terms of unconstrained optimization, the focus is on gradient-based optimization (which is very famous in the context of Neural Networks). In these sections, there is a focus on practical aspects of optimization, with analyses of computational efficiency in memory/time and of stochasticity in terms of batch size. This part, in general, sits at the boundary between math and machine learning theory (which means you could find these ideas in ML/DL books as well).
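For reference, here is a minimal sketch of gradient descent with momentum on a simple quadratic (my own toy problem and hyperparameters, not the book’s).

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T A x - b^T x with the heavy-ball momentum update.
A = np.array([[3.0, 0.2],
              [0.2, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b        # gradient of the quadratic

x = np.zeros(2)
velocity = np.zeros(2)
lr, momentum = 0.1, 0.9

for _ in range(200):
    velocity = momentum * velocity - lr * grad(x)
    x = x + velocity

print(x)                          # should approach the solution of A x = b
print(np.linalg.solve(A, b))
```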

As a book about math for ML, I think it needs more details about convergence and the theory behind gradient-based optimization. For example, considering all the calculus in this book, we do not find anything about the conditions for local/global minima in multivariate cases, the problem of saddle points for gradient optimization, or even the practical consequences of convexity. In this respect, the book fails to build an intuition about optimization by gradient. Furthermore, there is no link between the Jacobian/Hessian matrices and Linear Algebra, which would be nice for understanding the gradient behavior of a function.

In terms of constrained optimization, the problem is clearly defined, and the usage of Lagrange multipliers is illustrated in the sections on Linear/Quadratic Programming, where constrained problems are transformed into unconstrained ones via this technique.
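As a concrete reminder of how that transformation works, here is a tiny worked example of my own (not from the book): minimizing f(x, y) = x² + y² subject to x + y = 1 via a Lagrangian.

```latex
\min_{x, y}\; x^2 + y^2 \quad \text{s.t.}\quad x + y = 1
\qquad\Longrightarrow\qquad
\mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda\,(x + y - 1)

\nabla \mathcal{L} = 0:\quad 2x + \lambda = 0,\quad 2y + \lambda = 0,\quad x + y = 1
\;\Rightarrow\; x = y = \tfrac{1}{2},\; \lambda = -1
```

The constrained problem becomes a stationarity condition on an unconstrained function of (x, y, λ), which is exactly the trick reused in the Linear/Quadratic Programming sections.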

In general, Part I of Mathematics for Machine Learning is great. I refreshed several ideas, which helped me strengthen my knowledge and my skills for modeling ML problems. After writing this blog post, I understand that the mission of this book is very challenging. How do you write a book focused on Math for ML — which includes (semi-)supervised/unsupervised/reinforcement learning — while keeping it concise, up to date, and general enough for a diverse audience (composed of data scientists, research engineers, and scientists) with distinct objectives? Furthermore, it is very hard to find a balance between mathematical formalism and ML knowledge. After all, this book is neither a math book nor an ML book: it is a Math for ML book.

For the readers of this post, I recommend this book to those who are interested in revisiting college math and do not have time to reopen the specific books in each of the areas discussed, but are not satisfied with the quick treatment given by the good ML/DL books.

If you do not yet have time for this book (after all, Part I has about 200 pages of reading and exercises), I recommend Part I of the Deep Learning book by Goodfellow et al. That reading is more concise, more direct, and more focused on the main content of that book (Deep Learning). Reading both, although sometimes repetitive, is quite complementary.

To finish, keep in mind that my criticism is constructive and is solely my point of view (which is biased by my experience in math and machine learning classes and readings). It is the point of view of a student trying to learn, not of a PhD in the field. I hope this post engages people to read this book and helps build solid feedback for the authors :)

Luckeciano Melo

Towards General Artificial Intelligence. Primary focus on Deep Reinforcement Learning.