Deep Dives into Computer Science - Medium

How to design a high-performance neural network on a GPU

Kiran Achyutuni — Sat, 11 Jul 2020 04:50:27 GMT

GPUs are essential for machine learning. One could go to AWS or Google Cloud to spin up a cluster of these machines with ease. NVIDIA offers industry-leading GPUs with Tensor Cores in the Volta V100 and Ampere A100 accelerators. Which one should you pick for the best performance? In what configuration? Which works best for your neural network? These are some of the questions that the machine learning architect must deal with in order to design the fastest neural network at the lowest cost. Besides, just using a machine with a GPU and Tensor Cores does not guarantee peak performance. So, as a machine learning architect, how should one approach this problem? For sure, you cannot be hardware agnostic. You need to be aware of the capabilities of the hardware in order to get the most performance at the lowest cost.

As a machine learning architect, how should you design the neural network to maximize the performance on a GPU?

In this article, we will deep dive to understand the levers that the machine learning architect has to maximize performance. In particular, we will focus on matrix-matrix multiplication in this article because it is the most frequent and heavy-duty (O(n³)) mathematical operation in machine learning.

Let’s start with a simple fully connected 1 hidden layer neural network:

Figure 1: Matrix multiplications at each layer of the neural network with the shapes of the matrix multiply at each step indicated in parenthesis. For example, (B, L1) is the shape of the matrix with B rows and L1 columns. MM1, MM2, … MM5 are various matrix-matrix multiplications.

From the basic neural network, we can see that at layer L2, we perform 3 matrix-matrix multiplications (1 forward, and 2 backward). At layer L1, we perform 2 matrix-matrix multiplications (1 forward, and 1 backward). In fact, at every layer other than the first layer (L1), we perform 3 matrix multiplications. If there are n layers in a neural net, there are 3n-1 matrix-matrix multiplications, i.e., the count grows linearly with the number of layers.
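The count above can be checked with a minimal NumPy sketch. The layer widths, the random data, and the helper `mm` are all hypothetical, activations are left out for brevity, and the order of the multiplications need not match the MM1 to MM5 labels of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L0, L1, L2 = 4, 3, 5, 2          # hypothetical batch size and layer widths

X = rng.standard_normal((B, L0))    # input batch
W1 = rng.standard_normal((L0, L1))  # weights of layer L1
W2 = rng.standard_normal((L1, L2))  # weights of layer L2
dY = rng.standard_normal((B, L2))   # stand-in for the loss gradient at the output

calls = []                          # records the operand shapes of every multiplication

def mm(a, b):
    """Matrix multiply that also logs the shapes it was called with."""
    calls.append((a.shape, b.shape))
    return a @ b

# Forward pass: one multiplication per layer.
H = mm(X, W1)       # (B, L0) x (L0, L1) -> (B, L1)
Y = mm(H, W2)       # (B, L1) x (L1, L2) -> (B, L2)

# Backward pass at layer L2: weight gradient, plus the gradient handed to layer L1.
dW2 = mm(H.T, dY)   # (L1, B) x (B, L2) -> (L1, L2)
dH = mm(dY, W2.T)   # (B, L2) x (L2, L1) -> (B, L1)

# Backward pass at layer L1: only the weight gradient is needed.
dW1 = mm(X.T, dH)   # (L0, B) x (B, L1) -> (L0, L1)

n_layers = 2
print(len(calls), 3 * n_layers - 1)  # 5 5
```

The first layer skips one multiplication because there is no earlier layer that needs a gradient with respect to its input.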

One quick observation is the case when batch size B=1, i.e., we learn one data point at a time. In this case, matrix-matrix multiplication degenerates to matrix-vector multiplication. However, in practice, the batch size is rarely 1. In (batch) Gradient Descent, the entire dataset is considered during each learning step, whereas in mini-batch Stochastic Gradient Descent, a batch B > 1 (but much less than the entire dataset) is considered at each learning step.

Going forward in this article, let’s focus on a single matrix-matrix multiplication between 2 matrices A and B of dimensions (M, K) and (K, N) respectively resulting in a matrix C of dimension (M, N).

The dimensions M, N, K are determined by the architecture of the neural network at each layer. For example, in AlexNet, the batch size is 128 with a few dense layers of 4096 nodes and an output layer with 1000 nodes. This will result in multiplication of (128,4096) and (4096, 1000) matrices. These are pretty large matrices.

Figure 2. Source: Reference 1. Tiled matrix multiplication

What does “large” mean and how are these matrices multiplied? By “large”, we mean any matrix that does not fit in memory. Let’s dive deeper into large matrix multiplication. The matrix multiplication we have learnt in textbooks assumes that the matrix fits in memory. In reality, this may not be the case, especially in machine learning. In addition, finely tuned matrix multiplication algorithms must take into account the memory hierarchy in the computer for optimal performance. The workhorse for multiplying matrices that don’t fit in memory is the tiled/blocked matrix multiplication algorithm. In tiled/blocked matrix multiplication, the matrices are split into smaller tiles/blocks that fit into memory, and each pair of tiles is multiplied to compute its contribution to one portion of the resultant product matrix (see Figure 2). Figure 3 shows how the tiled/blocked matrix multiplication is recursively applied at every level of the memory hierarchy.
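As an illustration of the idea, here is a minimal single-level sketch of tiled multiplication in NumPy. The function name `tiled_matmul` and the tile size are hypothetical; production GEMM kernels apply the same pattern at several levels of the memory hierarchy and are far more elaborate.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: C = A @ B, computed one (tile x tile) block at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):          # row offset of the output tile
        for j in range(0, N, tile):      # column offset of the output tile
            for k in range(0, K, tile):  # walk along the shared dimension
                # Only these three small blocks need to sit in fast memory at once.
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((130, 70))   # sizes deliberately not multiples of the tile
B = rng.standard_normal((70, 50))
print(np.allclose(tiled_matmul(A, B, tile=32), A @ B))  # True
```

Each output tile is accumulated from products of small input tiles, so the working set stays bounded by the tile size rather than by the full matrix dimensions.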

Figure 3: Tiled/Block matrix-matrix multiplication recursively applied through the complete memory hierarchy of an NVIDIA CPU-GPU system. GEMM stands for General Matrix Multiplication. Figure source: AnandTech.

We won’t get into the precise tiled matrix multiplication algorithms here; the interested reader is referred to this paper. The library routine in BLAS for General Matrix Multiplication is called GEMM. cuBLAS is the Nvidia implementation of BLAS (including GEMM) that takes advantage of the internal GPU architecture and implements the tiled/block matrix multiplication; NVBLAS is a drop-in layer on top of it that routes standard BLAS calls to the GPU. PyTorch and TensorFlow link to cuBLAS on machines with an Nvidia GPU. The library does all the heavy lifting for you. But a poorly designed neural network will almost certainly reduce the performance.

For us, the immediate goal is to figure out the conditions under which matrix-matrix multiplication executes as fast as possible. This is possible only if the GPU is 100% busy and not idling for data tiles. For this, we need to look at the memory hierarchy and how fast data can potentially move through the memory hierarchy levels.

Figure 4: Roofline Model

A memory hierarchy offers key advantages for improving performance: 1) it hides the latency differences between the CPU, GPU, and memory components, and 2) it takes advantage of program locality. Also, a well-designed memory hierarchy offers the highest performance at the lowest cost/byte. To keep the GPUs continuously busy, the data tiles have to be fed to the GPUs rapidly. Whether this succeeds is determined by the data transfer bandwidth and by how fast the GPU processes the data. This performance metric is captured by the ops: bytes ratio in the Roofline model (Figure 4). Figure 5 shows how to calculate this from the vendor specs. We see that the ops: bytes ratio is 139 for Volta V100, and 416 for Ampere A100. The larger the ops: bytes ratio, the more speedup is possible if the computation was previously memory bound or arithmetic bound. In other words, a system with a higher ops: bytes ratio is more powerful than one with a lower ratio. This is why Ampere A100 is more powerful than Volta V100.

Figure 5: Calculating the ops: bytes ratio from the vendor specs.
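The V100 figure can be reproduced with a one-line calculation. This is a sketch under stated assumptions: it takes the V100's peak Tensor Core throughput as 125 TFLOPS and its memory bandwidth as 900 GB/s, and the helper name `ops_to_bytes` is hypothetical.

```python
def ops_to_bytes(peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Peak arithmetic throughput divided by peak memory bandwidth."""
    return peak_flops_per_s / mem_bandwidth_bytes_per_s

# Assumed V100 figures: 125 TFLOPS (Tensor Cores), 900 GB/s (HBM2).
v100_ratio = ops_to_bytes(125e12, 900e9)
print(round(v100_ratio))  # 139
```

The same function can be applied to any accelerator once its peak throughput and memory bandwidth are read off the datasheet.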

What does the ops: bytes ratio mean for machine learning and matrix multiply? To see this, we must now look at the compute and data requirements of matrix multiply. Arithmetic intensity is defined as the ratio of the number of floating-point operations an algorithm performs to the number of bytes it moves to and from memory. Figure 6 shows how to compute the arithmetic intensity.

Figure 6: Computing the arithmetic intensity for matrix multiplication

If arithmetic intensity > ops: bytes, then the matrix multiplication is arithmetic bound; otherwise, it is memory bound.

Therefore, the first takeaway is that the dimensions of the matrices involved should be designed such that the arithmetic intensity is greater than ops: bytes. This will ensure that the GPU is fully utilized. For example, if the three matrix dimensions are 512 (the batch size), 1024, and 4096, the arithmetic intensity will be 315, which is greater than 139 for the Volta V100 GPU. Therefore, this matrix multiplication is arithmetic bound on Volta V100 and the GPU will be fully utilized. Figure 7 shows the arithmetic intensities for some common operations in machine learning. The second row corresponds to batch size = 1. The linear layer becomes memory bound and not arithmetic bound in this case. This is the reason why a batch size of 1 is generally not used in production machine learning algorithms.
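The 315 figure can be reproduced with a short sketch. It assumes FP16 storage (2 bytes per element), counts one multiply and one add per term, and assumes each matrix is moved through memory exactly once; the function names are hypothetical.

```python
def arithmetic_intensity(M, N, K, bytes_per_element=2):
    """FLOPs per byte for C(M, N) = A(M, K) @ B(K, N), assuming FP16 by default."""
    flops = 2 * M * N * K                                      # multiply + add per term
    bytes_moved = bytes_per_element * (M * K + K * N + M * N)  # read A and B, write C
    return flops / bytes_moved

def bound(M, N, K, ops_bytes_ratio):
    """Classify a matrix multiply against a GPU's ops:bytes ratio."""
    if arithmetic_intensity(M, N, K) > ops_bytes_ratio:
        return "arithmetic bound"
    return "memory bound"

V100_OPS_BYTES = 139  # value quoted in the text for Volta V100

print(round(arithmetic_intensity(512, 1024, 4096)))  # 315
print(bound(512, 1024, 4096, V100_OPS_BYTES))        # arithmetic bound
print(bound(1, 1024, 4096, V100_OPS_BYTES))          # memory bound
```

With a batch size of 1 the intensity drops to roughly 1 FLOP per byte, which is why the same layer becomes memory bound.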

Figure 7. Arithmetic intensities for some common operations in machine learning. Source: NVIDIA documentation. See Reference 1.

However, ensuring the right arithmetic intensity by choosing the right matrix dimensions is not sufficient to achieve peak arithmetic performance, which requires keeping all the Tensor Cores busy as well. In order to use Nvidia’s Tensor Cores effectively, M, N, and K must be multiples of 8 for FP16 arithmetic or multiples of 16 for INT8 arithmetic. Nvidia core libraries check the dimensions of the matrices and, if the conditions are satisfied, route the operations to the Tensor Cores. This can result in a 6x speedup on Volta compared to running without the Tensor Cores. Therefore, the second takeaway is that if the dimensions are not a multiple of 8 or 16, then it is recommended that the dimensions are padded appropriately.
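One way to apply the padding advice is sketched below for the multiple-of-8 case. The helper names and the layer shape are hypothetical, and in a real network the padded outputs would also have to be masked or ignored downstream.

```python
import numpy as np

def pad_to_multiple(dim, multiple=8):
    """Round dim up to the next multiple (ceiling division, then scale back up)."""
    return -(-dim // multiple) * multiple

def pad_columns(W, multiple=8):
    """Zero-pad the output dimension of a weight matrix up to a multiple."""
    extra = pad_to_multiple(W.shape[1], multiple) - W.shape[1]
    return np.pad(W, ((0, 0), (0, extra)))  # np.pad fills with zeros by default

W = np.ones((4096, 1001), dtype=np.float16)  # hypothetical layer with 1001 outputs
W_padded = pad_columns(W)
print(W_padded.shape)  # (4096, 1008)
```

Because the added columns are zero, the first 1001 outputs of the padded layer are unchanged.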

In your quest to speed up the performance as a machine learning architect, you will invariably face the decision whether to upgrade from Volta to Ampere and pay a higher cost. To this end, you must determine if your neural network is either arithmetic bound or memory bound using the Roofline model. If it is neither, then there is no value in upgrading to a more powerful machine. This is the third takeaway. Nvidia offers tools such as Nsight Compute to perform application analysis.

To summarize:

Matrix-matrix multiply is the most frequent operation in neural network training and inference. The number of matrix multiplications is almost 3 times the number of layers in a neural network. Hence it is important to compute these as fast as possible.
The matrices are very large in a neural network. Therefore, we will invariably use a GPU to speed up the matrix multiplications. In order to do so, we must understand the ops: bytes ratio of the GPU and design the layers in such a way that the arithmetic intensity is greater than the ops: bytes ratio if possible.
In order to achieve peak arithmetic performance such that all the Tensor Cores are used, the dimensions of the matrices must also meet the requirements imposed by the NVIDIA architecture to use the Tensor Cores. Usually, each dimension must be a multiple of 8 (for FP16 arithmetic) or 16 (for INT8 arithmetic). It is best to check the documentation to ensure the requirement is met.
You should determine if the application is memory bound or arithmetic bound. If it is neither, then there is no point in upgrading to a more powerful GPU. Otherwise, we can accelerate further by upgrading.

Being aware of the hardware capabilities and the requirements it imposes for maximizing performance will help in choosing the matrix dimensions and batch size judiciously. This will lead to a design of the neural network such that the training can be completed in the shortest possible time at the lowest cost.

This article touched on one aspect of accelerating machine learning. There are other dimensions to this problem as well. We will look at some more in the future. Hope you found this article useful. Happy architecting.

References

Nvidia docs: Deep Learning performance documentation
Nvidia: It’s all about Tensor Cores, AnandTech Blog 2018
Fast Implementation of DGEMM on Fermi GPU, Proceeding of the 2011 International Conference on High Performance, Storage, and Analysis
Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures, Comm. of the ACM, April 2009
Performance Analysis of GPU-Accelerated Applications using the Roofline Model, NVIDIA, 2019
Intel Advisor Roofline Analysis
Nvidia Nsight Compute
Nvidia Ampere A100 Datasheet
Nvidia Volta V100 Datasheet

How to design a high-performance neural network on a GPU was originally published in Deep Dives into Computer Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deep dive into the internals of NumPy's linalg.polyfit()

Kiran Achyutuni — Thu, 21 May 2020 10:04:53 GMT

Image Source: Unsplash: Willian Justen de Vasconcellos

Today, a linear regression program in machine learning can be written in just a few lines so easily that we can fail to appreciate the depth of the progress in math and computer systems that enables this simplicity for us. Behind the powerful numpy and scipy libraries, a huge amount of math and numerical complexity is hidden from us. In this article, we do a deep dive into the internal implementation of linear regression to get a better understanding of what exactly is going on under the covers. In the process, we will discover a variety of elegant linear algebra techniques, the complexity of calling Fortran functions from python, etc.

In a linear regression problem, we need to solve Ax = b. Here (A, b) is the set of observed data, and we need to find the parameters x that fit this data. There are innumerable mathematical ways by which we can solve the system of linear equations. We are interested in the internal mathematical methods that underlie machine learning packages such as numpy/scipy.

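Let’s start with an example of how NumPy’s polyfit() is used. Note that polyfit() is exposed as numpy.polyfit() and numpy.polynomial.polynomial.polyfit(); the numpy.linalg module supplies the lstsq() solver it relies on. (You can download the Python code at Github linear regression using numpy polyfit).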

https://medium.com/media/5c03141b7a1bb00d03bb3651944d6fec/href

Here is the visualization of the result. The “*” is the observed data (or generated in our case) and the line is drawn by the parameters that were computed by polyfit. Clearly, polyfit is doing a good job of linear regression.
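For readers without access to the embedded script, here is a minimal sketch of such a fit. It is not the author's code: the synthetic data, the noise level, and the use of numpy.polynomial.polynomial.polyfit() are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy line: slope 2, intercept 1

# polyfit returns coefficients from the lowest degree upward: [intercept, slope].
intercept, slope = P.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, which is what the plot described above shows visually.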

Most articles on the web stop at this point and don’t go any deeper. However, diving deep and trying to understand what is going on will help us appreciate how the solution was computed. So let’s deep dive into polyfit().

The first observation we make is that polyfit() is not fully implemented in Python. The heart of its computation happens in a library that was written in highly optimized Fortran code! Let’s see how this is so. The call to polyfit() works its way through the following blocks as shown below. I have cut and pasted relevant calls from a variety of files so that you can see how it all stitches together:

In block 2, the call to polyfit() will construct a Vandermonde matrix via a call to polyvander() (numpy.polynomial.polynomial.polyvander()). This is a special matrix whose columns are successive powers of x, so that each row is a geometric progression. Assuming the user wants to fit a polynomial of degree 5, then columns up to x⁵ are constructed. Note that the user only needs to supply a 1-D column of x values and polyfit() generates the Vandermonde matrix. By doing so, it is preparing for a polynomial linear regression for this equation:

The original matrix A has been transformed as follows:

Note that this is a rectangular matrix with m > n (deg). Depending on the number of data observations, m can be very large whereas n is usually a small number. Once this matrix is set up, preparations begin to call the matrix implementation in LAPACK (linear algebra library).
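The construction can be seen directly in a small sketch. The sample points and the degree are hypothetical; the point is only that solving the least-squares problem on the Vandermonde matrix gives the same coefficients as polyfit().

```python
import numpy as np
from numpy.polynomial import polynomial as P

x = np.array([0.0, 1.0, 2.0, 3.0])   # m = 4 observations
V = P.polyvander(x, 2)               # columns are x^0, x^1, x^2
print(V)
# [[1. 0. 0.]
#  [1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]

y = 1.0 + 2.0 * x + 3.0 * x**2       # exact quadratic, so the fit is exact

# Least-squares solve on the Vandermonde matrix ...
coef_lstsq, *_ = np.linalg.lstsq(V, y, rcond=None)
# ... agrees with what polyfit returns (lowest degree first).
coef_polyfit = P.polyfit(x, y, 2)
print(np.allclose(coef_lstsq, coef_polyfit))  # True
```

Each row of V corresponds to one observation and each column to one power of x, which is why the matrix is tall and thin when there are many observations and a low degree.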

The LAPACK library was originally written in Fortran 77 in the late 1980s and early 1990s (it moved to Fortran 90 with version 3.2 in 2008) and is still updated regularly. The latest open-source LAPACK version is 3.9.0 and was released in 2019. It is also optimized by processor vendors such as Intel, Apple, etc. The Fortran library is compiled into a dynamic library based on the architecture of the platform and comes with the operating system. In addition, when you install numpy, 2 shared objects (.so) also come along with it to make the calls to the C API easy. To view these files, go to the /lib/python3.7/site-packages/numpy/linalg/ and you will see the following:

The two important files here:

_umath_linalg.cpython-37m-darwin.so: This file converts Python data structures into C data structures so that the C interface to LAPACK can be invoked.
lapack_lite.cpython-37m-darwin.so: This file is the C interface to the LAPACK routines. It calls the LAPACK routines and converts the return values back to Python data structures. The actual LAPACK routines are provided by the OS. For example, on Apple macOS, you can find the platform optimized version at /System/Library/Frameworks/vecLib.framework/libLAPACK.dylib

Let’s resume tracing the call flow.

The main entry point to computing the least-squares solution is np.linalg.lstsq(), as shown in block 3.
In block 4 in the above diagram, np.linalg.lstsq() calls the C interface in _umath_linalg.lstsq_m().
Blocks 5–8 are templates to generate the python-C interface. For example, in block 6, call_@lapack_func@ becomes call_sgelsd() defined in block 7. In block 7, LAPACK(@lapack_func@)(…) becomes the call to sgelsd_(…). This is an extern function that is implemented in the libLAPACK.dylib provided by Apple.

Now depending on the LAPACK method invoked, there are many ways to compute the linear least squares as shown in block 9.

QR or LQ factorization
Complete Orthogonal factorization
Singular Value Decomposition (SVD) factorization
Divide and conquer SVD

numpy.linalg.lstsq() has chosen to use the divide-and-conquer SVD methods. How do we know? Look in block 6, and you will find these functions that are specified for template generation. So the chosen functions are SGELSD, CGELSD, ZGELSD, and DGELSD.
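The alternatives can be compared from Python through SciPy, whose scipy.linalg.lstsq() exposes a `lapack_driver` argument. In this sketch, 'gelsy' corresponds to the complete orthogonal factorization, 'gelss' to the SVD, and 'gelsd' to the divide-and-conquer SVD; the plain QR/LQ routine is not among the drivers this SciPy function offers. The test problem is hypothetical.

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))      # tall matrix: 100 observations, 3 unknowns
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                         # noise-free right-hand side

solutions = {}
for driver in ("gelsy", "gelss", "gelsd"):
    # Each call returns (solution, residues, rank, singular values).
    x, _, rank, _ = lstsq(A, b, lapack_driver=driver)
    solutions[driver] = x
    print(driver, rank, np.round(x, 6))
```

All three drivers should agree on a well-conditioned problem like this one; they differ in speed and in how they cope with rank deficiency.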

How does the divide and conquer SVD work? A high-level description of the algorithm appears in the code comments shown in block 9, but you can look at all the details in the original paper. It shows exactly how the least-squares solution is computed. Here is the abstract of the paper:

In order to read the above paper, here are the core linear algebra concepts that one needs to know: 1) Singular Value Decomposition (SVD), 2) QR Factorization, 3) Moore-Penrose Pseudoinverse matrix. There are a number of good YouTube videos on linear algebra but the one I would recommend is the Gilbert Strang Lectures on Linear Algebra (MIT). His textbook is equally fabulous. (I am capturing all the essential math concepts required for machine learning here). From the paper, we see that the current implementation for least squares in LAPACK, i.e., SGELSD, is 9 times faster than its predecessor on bidiagonal matrices! This is the reason why, as a first step, SGELSD converts the input matrix to a bidiagonal matrix so that we can benefit from the speedups. Once the SVD is computed, it is trivially easy to compute the least-squares solution. We compute the Moore-Penrose pseudoinverse and then derive the least-squares solution from it. The final step is also detailed in the paper.
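The last step, going from an SVD to the least-squares solution through the pseudoinverse, can be sketched in a few lines of NumPy. This sketch uses NumPy's general-purpose SVD rather than the bidiagonalization and divide-and-conquer steps of SGELSD, and it assumes a full-rank matrix so that every singular value can be inverted.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))   # hypothetical full-rank tall matrix
b = rng.standard_normal(20)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Moore-Penrose pseudoinverse: invert the singular values and swap the factors.
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Least-squares solution of Ax = b.
x_svd = A_pinv @ b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_svd, x_ref))                # True
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
```

For rank-deficient matrices, singular values below a tolerance are treated as zero instead of being inverted, which is the role of the `rcond` argument in lstsq().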

In summary, we have traced the complete path of polyfit() all the way down to the core Fortran code. We have discovered that it uses important numerical linear algebra techniques such as singular value decomposition. We also looked at the original paper that came up with the algorithms that polyfit() relies on and found that the algorithms are significantly faster than the prior implementations. To interface Fortran code seamlessly with Python, the C and Python interfaces had to be implemented carefully so that the process does not crash. High-quality software engineering practices are required to make this robust. In addition, the OS vendors provide highly tuned libraries to make the LAPACK implementation more efficient for their processor architecture. Thus, advances in math, software engineering, computer architectures, and open source have made polyfit() possible and fast.

Hope you enjoyed the deep dive journey into NumPy’s polyfit().

References:

LAPACK — Linear Algebra Package, LAPACK User’s guide, LAPACK Least squares functions
Numpy source code on Github
Developer Reference for Intel® Math Kernel Library 2020 - C. Processor vendors like Intel offer optimized libraries for LAPACK and other mathematical libraries.
Intel Math Kernel Library
vecLib: Apple macOS implementation of LAPACK
You can download the source code of LAPACK here to see the Fortran files.
SGELSD Fortran source code
Efficient Computation of the Singular Value Decomposition with Applications to Least Squares Problems.


Essential math concepts to understand machine learning

Kiran Achyutuni — Sun, 10 May 2020 12:54:24 GMT

Machine Learning involves lots of mathematics: linear algebra, calculus, and probability and statistics. They work closely with each other in machine learning algorithms. Linear algebra concepts are used to represent and manipulate data and weights. Calculus is used in the iterative optimization algorithms (e.g., stochastic gradient descent) to arrive at the optimal weights from loss functions. Probability concepts are used to minimize the uncertainty between the optimal model and the real but unknown data generating process.

Understanding these math concepts and how they complement each other is fundamental to understanding why and how machine learning algorithms work. Having a firm grasp of these concepts will help you in your machine learning career. In this article, I list some of the important concepts in each of the math areas.

Important Concepts from Linear Algebra

https://medium.com/media/e30f058a9a27c70230de7c62bd27a0ea/href

Important Concepts from Probability and Statistics

https://medium.com/media/d54ca4cb9ae6ad5beb078e514bc8734a/href

Important Concepts from Calculus

https://medium.com/media/3ccf4ae53280e7c6858f1ef1b7236345/href

Here is a sample of some important research papers in machine learning that have made an impact, and the (pre-requisite) math concepts they use. Familiarize yourself with these math concepts and you will be able to read the paper first-hand and appreciate its contributions.

https://medium.com/media/d505c1b36ac20b639a171d0a0849a0d7/href

(Note: I intend to update this article regularly with more information. So please revisit frequently to be up-to-date).

Here are some excellent books to refresh your math concepts:

Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, To be published by Cambridge University Press
Linear Algebra and its Applications, Gilbert Strang (MIT)
Introduction to Probability, Bertsekas and Tsitsiklis


Like backpropagation, is there forward-propagation as well?

Kiran Achyutuni — Mon, 23 Mar 2020 19:04:33 GMT

By Berland at English Wikipedia — Transferred from en.wikipedia to Commons by JRGomà., Public Domain, https://commons.wikimedia.org/w/index.php?curid=4860262

The role of automatic differentiation in machine learning

Backpropagation is fundamental to machine learning. It is a technique for computing the gradient of a function. Given there is a technique called backpropagation, one wonders if there is a technique called forward-propagation as well? Turns out, there is. Let’s dive into differentiation techniques here to understand which differentiation technique is most relevant for a given machine learning architecture.

The derivative of a function is a fundamental tool of calculus. There are 3 fundamental ways to compute it:

1. Symbolic differentiation: Here, we manipulate mathematical expressions and produce closed-form derivatives. The output is what you would see if you were to compute the derivatives by hand. For example, you can visit Wolfram Alpha and see symbolic differentiation in action. The problem with symbolic differentiation is expression swell, which can result in high inefficiencies.

2. Numerical differentiation: This method uses finite differences to approximate the derivative of a function. It is quite common and is used in many areas of engineering such as aerospace engineering, fluid dynamics, etc. However, one needs to be careful about round-off errors, and these techniques go to great lengths to avoid them. In machine learning, such a high level of precision is not required in computing the gradient, since the purpose of machine learning is not optimizing the gradient but optimizing the parameters. So the learning method (such as stochastic gradient descent) can eventually learn the optimal parameters even if the gradient is not accurate. Also, the complexity of a finite-difference method is O(n), where n is the number of basic math operations in the function whose derivative is needed (e.g., the function in the neuron, such as a sigmoid), because each derivative requires re-evaluating the function. Given that there are a large number of derivatives to be computed in a neural network, an O(n) factor slows down the computation of the gradient. For these reasons, numerical differentiation is not used in machine learning.

3. Automatic differentiation: In this method, the basic insight is that any arbitrary function coded into a computer program can be broken down into basic mathematical functions such as arithmetic, trigonometric, and logarithmic functions. Any function can therefore be represented as a computational graph consisting of these basic functions. The derivative (also a function) can be represented as a computational graph as well, and the chain rule of calculus can be used to compute the derivative at any node in the graph. Both these graphs can be constructed simultaneously as we build out the neural network. In fact, this is what TensorFlow does. Let’s call these graphs the main graph (MG) and the derivative graph (DG).
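The round-off caveat mentioned for numerical differentiation is easy to see in a small experiment. This sketch differentiates sin at a hypothetical point x0 = 1 with forward and central differences; the step sizes are arbitrary choices for illustration.

```python
import math

def forward_difference(f, x, h):
    """First-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    """Second-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
exact = math.cos(x0)   # the true derivative of sin at x0

errors = {}
for h in (1e-2, 1e-5, 1e-8, 1e-13):
    err_fwd = abs(forward_difference(math.sin, x0, h) - exact)
    err_ctr = abs(central_difference(math.sin, x0, h) - exact)
    errors[h] = (err_fwd, err_ctr)
    print(f"h={h:.0e}  forward error={err_fwd:.2e}  central error={err_ctr:.2e}")
```

Shrinking the step first reduces the truncation error, but for very small steps the subtraction of nearly equal numbers lets round-off dominate and the error grows again.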

Before we proceed, let’s recap what we are trying to compute. In stochastic gradient descent (SGD), the machine learning algorithm iterates in batches of examples to learn the weights of the network. During each iteration, we need to compute the Jacobian before we proceed to the next iteration. We are trying to compute the Jacobian as follows:

Here the neural network has n inputs and m outputs. For example, in linear regression with sigmoid output, m = 1 and n ≥ 1. In an image recognition neural net like AlexNet, m = 1000 (outputs), and n = 224x224x3 = 150528 (inputs). The Jacobian in the latter case is a 1000x150528 matrix.

There are 2 modes for automatic differentiation for computing MG and DG:

1. Reverse mode accumulation: During each iteration of SGD, evaluate MG first. Compute the loss and then compute the DG on the way back to compute the gradient using the loss. Then the weights are adjusted and the next iteration is performed. Computing DG on the way back is exactly the backpropagation.

2. Forward mode accumulation: Evaluate DG along with MG in each iteration. At the end of the forward step, DG is also available. The gradient can then be applied and the next iteration can be started.
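Forward mode accumulation can be illustrated with dual numbers, where every value carries its derivative along with it. This is a minimal sketch, not how any particular framework implements it; the `Dual` class, the `neuron` function, and the numbers are hypothetical.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sigmoid(d):
    """Sigmoid of a Dual, propagating the derivative with the chain rule."""
    s = 1.0 / (1.0 + math.exp(-d.value))
    return Dual(s, s * (1.0 - s) * d.deriv)

def neuron(w, x, b):
    return sigmoid(w * x + b)

w, x, b = 0.5, 2.0, -1.0
# Seeding w with derivative 1.0 means "differentiate with respect to w".
out = neuron(Dual(w, 1.0), x, b)
print(out.value, out.deriv)  # 0.5 0.5
```

The value and its derivative come out of the same forward pass, but one pass is needed per input being differentiated, which is the cost that matters when inputs outnumber outputs.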

Do note that with either reverse mode or forward mode, the end result is the same: the value of the gradient computed is the same. But the difference is the cost of computation. Why so? Matrix-matrix operations in the forward accumulation mode are more expensive than matrix-vector operations in the reverse accumulation mode (O(n³) vs O(n²) respectively).

As a rule of thumb, if n > m (i.e., there are more inputs than outputs), then backpropagation is more efficient. Otherwise, forward mode accumulation would be better. Therefore, while designing your neural net, you must carefully choose the gradient computation algorithm that is most efficient for your application. In many popular neural nets today, the outputs are far fewer than the inputs, and hence backpropagation is more popular.
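The cost difference can be made concrete by treating the network as a chain of layer Jacobians and multiplying them in the two possible orders. The sizes and the random Jacobians below are hypothetical, and the multiplication counts use the textbook cost of a dense matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 1000, 200, 1              # hypothetical sizes: many inputs, one output
J1 = rng.standard_normal((h, n))    # Jacobian of the first layer
J2 = rng.standard_normal((h, h))    # Jacobian of the second layer
J3 = rng.standard_normal((m, h))    # Jacobian of the output layer

# Forward mode: start at the input side, carrying an (h x n) matrix along.
forward = J3 @ (J2 @ J1)
forward_mults = h * h * n + m * h * n

# Reverse mode: start at the output side, carrying only an (m x h) row along.
reverse = (J3 @ J2) @ J1
reverse_mults = m * h * h + m * h * n

print(np.allclose(forward, reverse))   # True
print(forward_mults, reverse_mults)    # 40200000 240000
```

Both orders produce the same Jacobian; with one output and a thousand inputs, the reverse order needs far fewer multiplications, and swapping the roles of n and m would favor the forward order instead.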

TensorFlow supports both modes of automatic differentiation: the TensorFlow ForwardAccumulator (forward mode) and the TensorFlow GradientTape (backpropagation). Both are implemented in software. But this is gradually getting pushed to lower levels of computation (see Intel). There is active research on automatic differentiation (see reference #3 below) and we can expect further advances in the future.

Hope this article gave you insight into why a specific gradient computation is chosen. For those who want to dive in further, please look at the references below.

References:


Is there logic behind probability?

Kiran Achyutuni — Mon, 09 Mar 2020 16:35:38 GMT

Probability theory plays a fundamental role in machine learning. Why? This is because probability theory concerns reasoning when there is uncertainty. In machine learning, we are given some data to learn from, and we need to reason if the machine learning model that we come up with will function well with uncertain and unseen future data. This seems like a silly question to ask, but can we trust probability theory? Are its foundations solid? Believe it or not, for over 300 years there has been a raging debate on this!

In 1713, Bernoulli first came up with some basic rules to figure out the odds in a game where dice are rolled repeatedly. This was the start of what is known as “frequentist probability theory”. However, it had a limitation: what if you cannot repeat the experiment? For example, what is the probability of the Earth’s temperature rising by 5 degrees by 2030? We cannot apply the “frequentist probability theory” since we cannot repeat the Earth warming experiment again and again. It’s a one-time event. This limitation was noted by Bayes and Laplace very early on. In 1812, Laplace proposed a set of rules for what is now known as “Bayesian probability theory”. In this approach, one takes into account prior knowledge (Bayes’ law) and can “subjectively” assign a probability to an event based on the information available at that point in time.

Assigning probability “subjectively” and taking prior information into account was the bone of contention between the two camps. The frequentists would not accept the “subjectivity” of the Bayesian approach, and the Bayesians would find the frequentist theory too limiting to be applied elsewhere. Yet for some strange reason, both camps were working off of theorems that were eerily identical but derived in different ways, namely the product and sum rules of probability.

There were 4 important advances in the 20th century that finally put an end to this raging 300-year controversy:

Kolmogorov, 1933: Foundations of the Theory of Probability. He proposed a set of axioms to derive the rules of frequentist probability theory. This is the classical probability theory that we still learn in schools and colleges today. It is easy to understand and hence used to introduce probability to students. Bernoulli’s law is easy to deduce from these axioms. The set of axioms he chose was arbitrary.
Cox, 1946: Probability, Frequency, and Reasonable Expectation, published in the American Journal of Physics. Cox showed that the rules of probability can be derived by applying Boolean logic and calculus. This was the first time it was established that probability can be thought of as an extension of Boolean logic. Cox had 2 arbitrary axioms as well.
Polya, 1954: Mathematics and Plausible Reasoning. In this treatise, Polya goes to extensive lengths to show how mathematicians and scientists reason when they are in the process of scientific discovery. He came up with a set of simple rules (syllogisms) to explain the thinking process. This treatise has influenced scientists for many generations since.
Jaynes, 2003: Probability Theory, The Logic of Science. One issue with both Kolmogorov and Cox is that they are based on axioms that are different but arbitrary. So what is to say that there won’t be a dozen more theories of probability tomorrow? Jaynes therefore does not start with a set of axioms. Instead, he asks what common-sense logic humans follow while reasoning with uncertainty. He calls these the desiderata. These are not axioms but only requirements that any set of axioms must satisfy if it is to make sense for human reasoning. He is inspired by Cox and Polya in this endeavour and then goes on to derive the fundamental rules of probability from the desiderata using Boolean algebra and calculus. And lo and behold, there are only 2 fundamental rules of probability: the product rule and the sum rule. On top of this, he shows that the axioms which Cox and Kolmogorov arbitrarily assumed can also be derived from the product and sum rules! He also shows that there can be no other set of axioms. If another set exists, either it can be derived from these two rules or it must be inconsistent.

Jaynes and Cox thus showed that probability is an extension of logic, since it is derivable from Boolean algebra. Also, more importantly, the frequentist and Bayesian probability theories have the same roots (logic). This explains why the product and sum rules are exactly the same in both frequentist and Bayesian theories. This diagram shows how everything is interlinked.

Jaynes and Cox showed that it doesn’t matter which camp you are in, both the camps have the same logical origin. However, the Bayesian way of thinking allows probability rules to be applied to a much larger set of problems where experiments cannot be repeated. Hence it is advantageous to use the Bayesian way of thinking.
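The two rules are compact enough to check numerically. The joint probability table below is hypothetical; the sketch applies the sum rule to obtain the marginals and the product rule in both orders to obtain Bayes' theorem.

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over two binary propositions.
# Rows: A false / A true. Columns: B false / B true.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Sum rule: marginalize out the other proposition.
p_a = joint.sum(axis=1)    # P(A)
p_b = joint.sum(axis=0)    # P(B)

# Product rule: P(A, B) = P(B | A) P(A), rearranged to get the conditional.
p_b_given_a = joint / p_a[:, None]

# Applying the product rule in both orders gives Bayes' theorem:
# P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a[:, None] / p_b[None, :]
p_a_given_b_direct = joint / p_b[None, :]

print(np.allclose(p_a_given_b_bayes, p_a_given_b_direct))  # True
print(p_a_given_b_bayes.sum(axis=0))  # each column sums to 1
```

Nothing in the calculation depends on whether the numbers came from repeated experiments or from a degree of belief, which is the point of the shared logical origin.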

So where exactly are we applying Bayesian thinking in machine learning? Let’s take the simple example of linear regression.

where w is the weight vector, x is the input example vector, and y is the computed output. Now, we don’t know the exact data process/function f(x) that generated the given data example (y, x). In machine learning, we are trying to figure out f(x). Given the weights w, how certain can we be that the computed value y is close to its true value if we use this data point x? This is where the Bayesian way of thinking about probability comes in. We can impose any probability distribution we like on this computed value. We normally impose the Gaussian distribution because, among all distributions with a given mean and variance, it has the maximum entropy, i.e., it captures the most uncertainty and makes the fewest prior assumptions about how the data was generated. This is written as

where N is the Gaussian distribution. Do note that we cannot apply the frequentist view here because we are only computing y once for each example x in the training set for a given w. There is no way we can do repeated experiments here. So the only choice we have here is the Bayesian view, i.e., our belief in the amount of uncertainty of y from its true value, which we are capturing by a Gaussian distribution.

Now, over the entire training set, we multiply the uncertainty of each example by applying the product rule of probability theory to give us the following:

Once we formulate this, we can compute the Maximum Likelihood Estimate to find the best w. I would refer you to any machine learning textbook to see how this is done to solve linear regression.
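The link between the Gaussian likelihood and ordinary least squares can be sketched numerically. The sketch assumes the standard model in which each y is Gaussian with mean wᵀx and a fixed standard deviation sigma, with independent examples; the data, the weights, and sigma are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 200
X = np.column_stack([np.ones(n_examples),
                     rng.uniform(-1, 1, size=n_examples)])  # bias + one feature
w_true = np.array([0.5, 2.0])
sigma = 0.3
y = X @ w_true + rng.normal(scale=sigma, size=n_examples)

def neg_log_likelihood(w):
    """Negative log of the product of Gaussian densities over the training set."""
    resid = y - X @ w
    return (0.5 * np.sum(resid**2) / sigma**2
            + n_examples * np.log(sigma * np.sqrt(2 * np.pi)))

# Maximizing the likelihood means minimizing the sum of squared residuals,
# which is exactly the least-squares problem.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_mle, 2))
print(neg_log_likelihood(w_mle) <= neg_log_likelihood(w_true))  # True
```

Taking the logarithm turns the product over examples into a sum, and the only term that depends on w is the sum of squared residuals.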

To summarize, we have shown that the foundations of probability theory are solidly rooted in logic. Both frequentist and Bayesian approaches have the same logical foundations. However, the Bayesian approach is more general, powerful, and simpler. The Bayesian view is widely used in machine learning.
