Probability and Machine Learning — Part 1: Probabilistic vs Non-Probabilistic Machine Learning Models

Gathika Ratnayaka · Published in Nerd For Tech · 7 min read · Apr 29, 2020

Mathematics is the foundation of Machine Learning, and branches such as Linear Algebra, Probability, and Statistics are integral parts of ML. As a Computer Science and Engineering student, one of the questions I had during my undergraduate days was how the knowledge acquired through math courses can be applied to ML, and which areas of mathematics play a fundamental role in it. I believe this is a common question among most people who are interested in Machine Learning. Therefore, I decided to write a blog series on some of the basic concepts related to “Mathematics for Machine Learning”. In this series, my intention is to point out which areas to look at and to explain how those concepts relate to ML. I will not go deep into the concepts themselves, as there are already plenty of resources that explain each of them in detail with good examples.

As the first step, I would like to write about the relationship between probability and machine learning. In machine learning, there are probabilistic models as well as non-probabilistic models. To understand probabilistic models better, knowledge of basic probability concepts such as random variables and probability distributions will be beneficial; I will write about those concepts in my next blog. In this blog, the focus is on what probabilistic models are and how to distinguish whether a model is probabilistic or not.

What are Probabilistic Machine Learning Models?

In order to understand what a probabilistic machine learning model is, let’s consider a classification problem with N classes. If the classification model (classifier) is probabilistic, for a given input it will provide probabilities for each of the N classes as the output. In other words, a probabilistic classifier provides a probability distribution over the N classes. Usually, the class with the highest probability is then selected as the class to which the input data instance belongs.

However, logistic regression (a probabilistic binary classification technique based on the sigmoid function) can be considered an exception, as it provides a probability for one class only (usually class 1). When several such sigmoid outputs are used side by side, one per label, the resulting probabilities are independent of each other and do not need to satisfy a “1 − probability of class 1 = probability of class 0” relationship across labels. Because of this property, logistic regression is also useful in multi-label classification problems, where a single data point can have multiple class labels.
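To illustrate this, here is a minimal sketch (with made-up logits) of how independent sigmoid outputs behave in a multi-label setting; note that the per-label probabilities need not sum to 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores (logits) from three independent one-vs-rest
# logistic regression models, one per label.
logits = np.array([2.1, -0.3, 0.8])

# Each label gets its own probability; the values need NOT sum to 1.
probs = sigmoid(logits)
print(probs)          # ~[0.891 0.426 0.690]

# A common decision rule: assign every label whose probability exceeds 0.5.
print(probs > 0.5)    # [ True False  True]
```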

Some examples of probabilistic models are Logistic Regression, Bayesian classifiers, Hidden Markov Models, and neural networks with a softmax output layer.

If the model is non-probabilistic (deterministic), it will usually output only the most likely class for the input data instance. The vanilla Support Vector Machine (SVM) is a popular non-probabilistic classifier.

Let’s discuss an example to better understand probabilistic classifiers. Take the task of classifying an image of an animal into five classes, {Dog, Cat, Deer, Lion, Rabbit}, as the problem. As input, we have an image (of a dog). For this example, let’s assume that the classifier works well and provides correct/acceptable results for the particular input we are discussing. When the image is given to the probabilistic classifier, it will provide an output such as (Dog (0.6), Cat (0.2), Deer (0.1), Lion (0.04), Rabbit (0.06)). A non-probabilistic classifier, in contrast, will only output “Dog”.
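The difference is easy to see in code. Below is a small illustrative sketch using scikit-learn; the toy dataset and the choice of models are my own, just to show the two kinds of output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# A tiny made-up dataset: 2-D features, labels 0 ("Cat") and 1 ("Dog").
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])

# Probabilistic classifier: returns a probability distribution over classes.
prob_clf = LogisticRegression().fit(X, y)
print(prob_clf.predict_proba([[0.7, 0.7]]))  # e.g. [[0.33 0.67]]

# Non-probabilistic classifier: a vanilla SVM only returns the winning class.
svm_clf = SVC().fit(X, y)                    # probability=False by default
print(svm_clf.predict([[0.7, 0.7]]))         # e.g. [1]
```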

Why probabilistic ML models?

One of the major advantages of probabilistic models is that they provide an idea of the uncertainty associated with predictions. In other words, we can get an idea of how confident a machine learning model is in its prediction. In the example above, if the probabilistic classifier assigns a probability of 0.9 to the ‘Dog’ class instead of 0.6, it means the classifier is more confident that the animal in the image is a dog. These notions of uncertainty and confidence are extremely useful in critical machine learning applications such as disease diagnosis and autonomous driving. Probabilistic outputs are also useful for numerous machine learning techniques, such as Active Learning.

Objective Functions

In order to identify whether a particular model is probabilistic or not, we can look at its objective function. In machine learning, we aim to optimize a model to excel at a particular task. The purpose of an objective function is to provide a value based on the model’s outputs, so that optimization can be done by maximizing or minimizing that value. In machine learning, the goal is usually to minimize prediction error. So, we define what is called a loss function as the objective function, and we try to minimize it during the training phase of an ML model.
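As a toy illustration of “minimizing an objective function”, here is a hand-rolled gradient descent loop that fits a one-parameter model by minimizing a squared-error loss (the data, learning rate, and step count are made up for the example):

```python
import numpy as np

# Toy data generated around y = 3x (purely illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

w = 0.0        # the single parameter of our model y_pred = w * x
lr = 0.1       # learning rate
for _ in range(200):
    y_pred = w * x
    grad = np.mean(2 * (y_pred - y) * x)  # gradient of the MSE objective w.r.t. w
    w -= lr * grad                        # step downhill on the objective
print(w)       # converges close to 3.0
```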

If we take a basic machine learning model such as Linear Regression, the objective function is based on the squared error. The objective of training is to minimize the Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) (Eq. 1). The intuition behind the Mean Squared Error is that the loss/error incurred by a prediction for a particular data point is based on the difference between the actual value and the predicted value (note that Linear Regression addresses a regression problem, not a classification problem).

The loss incurred by a particular data point will be higher if the prediction given by the model is significantly higher or lower than the actual value, and lower when the predicted value is very close to the actual value. As you can see, the objective function here is not based on probabilities, but on the (squared) difference between the actual value and the predicted value.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{true},i} - y_{\text{predict},i}\right)^{2}, \qquad \text{RMSE} = \sqrt{\text{MSE}} \tag{Eq. 1}$$

Here, n indicates the number of data instances in the data set, y_true is the correct/true value, and y_predict is the value predicted by the linear regression model.
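Eq. 1 takes only a few lines of NumPy; the y_true and y_predict values below are made up for illustration:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])     # illustrative actual values
y_predict = np.array([2.8, 5.4, 2.0, 7.5])  # illustrative predictions

mse = np.mean((y_true - y_predict) ** 2)    # Eq. 1
rmse = np.sqrt(mse)
print(mse, rmse)                            # 0.175  ~0.418
```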

When it comes to Support Vector Machines, the objective is to maximize the margin, i.e., the distance between the separating hyperplane and the nearest data points of each class (the support vectors). This concept is also known as the ‘large margin intuition’. As you can see, in both Linear Regression and Support Vector Machines, the objective functions are not based on probabilities, so these can be considered non-probabilistic models.
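For contrast, here is a sketch of the soft-margin SVM objective (hinge loss plus an L2 regularization term, one common formulation); notice that no probability appears anywhere in it. The data, weights, and regularization constant C are illustrative:

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    # Hinge loss: zero for points on the correct side of the margin,
    # growing linearly with the size of the margin violation otherwise.
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    # Maximizing the margin is equivalent to minimizing ||w||^2 / 2.
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])   # SVM convention: labels in {-1, +1}
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))  # 0.25 -- no probabilities involved
```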

On the other hand, if we consider a neural network with a softmax output layer, the loss function is usually defined using the Cross-Entropy loss (CE loss) (Eq. 2). Note that we are considering a training dataset with n data points, so we finally take the average of the losses over all data points as the CE loss of the dataset. Here, y_i denotes the true label of data point i, and p(y_i) denotes the predicted probability for the class y_i (the probability, assigned by the model, that data point i belongs to class y_i).

The intuition behind the Cross-Entropy loss is that if the probabilistic model is able to predict the correct class of a data point with high confidence, the loss will be small. In the image classification example we discussed, if the model assigns a probability of 1.0 to the class ‘Dog’ (the correct class), the loss due to that prediction is −log(P(‘Dog’)) = −log(1.0) = 0. If the predicted probability for the ‘Dog’ class is 0.8 instead, the loss is −log(0.8) ≈ 0.223 (using the natural logarithm). However, if the model assigns a low probability to the correct class, such as 0.3, the loss is −log(0.3) ≈ 1.204, which can be considered a significant loss.

$$\text{CE} = -\frac{1}{n}\sum_{i=1}^{n}\log\left(p(y_i)\right) \tag{Eq. 2}$$
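The numbers from the dog example can be reproduced in a couple of lines (a small sketch using natural logarithms; the probabilities are the made-up ones from above):

```python
import numpy as np

# The predicted distribution over {Dog, Cat, Deer, Lion, Rabbit} from the example.
probs = {"Dog": 0.6, "Cat": 0.2, "Deer": 0.1, "Lion": 0.04, "Rabbit": 0.06}

# Cross-Entropy loss for one data point: minus the log of the probability
# assigned to the true class.
print(-np.log(probs["Dog"]))   # -ln(0.6) ~ 0.511

for p in (1.0, 0.8, 0.3):      # the confidence levels discussed above
    print(p, -np.log(p))       # 0.0, ~0.223, ~1.204
```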

In a binary classification model based on Logistic Regression, the loss function is usually defined using the Binary Cross Entropy loss (BCE loss).

$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\log\left(p(s_i)\right) + (1-y_i)\log\left(1-p(s_i)\right)\Big] \tag{Eq. 3}$$

Here, y_i is the true class label of data point i (1 if it belongs to class 1, 0 otherwise), p(s_i) is the predicted probability of point i belonging to class 1, and N is the number of data points. Note that since this is a binary classification problem, there are only two classes, class 1 and class 0.
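Eq. 3 in NumPy, with an illustrative batch of labels and predicted probabilities:

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # true binary labels (illustrative)
p = np.array([0.9, 0.2, 0.7, 0.6])   # predicted P(class 1) for each point

# Binary Cross-Entropy (Eq. 3), averaged over the N data points.
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)                            # ~0.299
```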

As you can observe, these loss functions are based on probabilities, and hence the models that use them can be identified as probabilistic models. Therefore, if you want to quickly determine whether a model is probabilistic or not, one of the easiest ways is to analyze its loss function.

So, that’s all for this article. I hope you were able to get a clear understanding of what is meant by a probabilistic model. In the next blog, I will explain some probability concepts such as probability distributions and random variables, which will be useful in understanding probabilistic models. If you find anything written here that you think is wrong, please feel free to comment; it would not only make this post more reliable, but also give me the opportunity to expand my knowledge. Thanks, and happy reading.
