A Guide to Cost Functions

Sarah Abdelazim
7 min read · Jan 23, 2023


A summary of the most commonly used loss functions in regression and classification models.

If you are new to the data science field, you will probably come across these terms quite often. The correct use and combination of these functions is what makes a model a good one, so let’s explore what they mean and how they can improve your model!

Introduction

Before we dive into cost functions, let us introduce the two most common types of models: regression and classification. Regression is a supervised machine learning technique used to approximate a mapping function from one (simple linear regression) or more (multiple linear regression) independent variables to a dependent variable. If a non-linear relationship exists between the independent and the dependent variables, polynomial regression can be used instead. While continuous, numeric variables are the most common type of dependent variable, regression models can also be used to predict counts (Poisson, quasi-Poisson and negative binomial regression).

If the dependent variable is a discrete binary label (“0” and “1”, or “spam” and “non-spam”), a set of multiple labels (“house”, “townhouse”, “apartment” and “penthouse”) or ordinal labels (labels that follow a specific order, e.g. “least likely”, “likely” and “most likely”), the prediction becomes a classification problem and can be solved using binary logistic regression, multinomial logistic regression or ordinal logistic regression, respectively. These classification methods are still regression models because they are probability-based, which means they provide the probability of each class or label occurring and not just the predicted output. “Real” or actual classification models are not probability-based and mainly provide only the predicted labels, such as K-Nearest Neighbours (KNN), Support Vector Machine (SVM), and Decision Trees.

Cost Functions

Loss functions calculate the distance between the predicted output and the observed output, and are a measure of how well the model fits the dataset. The best-fitting models are those whose predictions are very close to the observed output and therefore have a low “loss”, and vice versa. While loss functions and cost functions are often used interchangeably, the term loss function refers to a single training example/input, while cost function refers to the average loss over the entire training dataset. In machine learning models, we want to minimize the cost function, which can only be achieved by reducing the loss associated with each training example, so you can see now how they are all connected. The type of loss function depends on the model you are using (regression or classification) and the prediction problem in question. Let’s explore the different loss functions!

Regression Functions:

  • Loss function #1: Square/L2
Figure 1 — L2 loss function (https://amitshekhar.me/blog/l1-and-l2-loss-functions)
  • Cost function #1: Mean Squared Error (MSE) / Quadratic Cost
Figure 2 — MSE cost function (https://en.wikipedia.org/wiki/Mean_squared_error)
Figure 3 — MSE graph (https://heartbeat.comet.ml/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)

L2 loss is the squared difference between the observed and the predicted output for a single training example, while MSE is the average of the squared differences between the observed and the predicted outputs over all training examples. This is a common regression metric, and the line of best fit for our dataset is the one with the least MSE. MSE is sensitive to outliers since it penalizes predicted outputs that are far from the expected outputs quadratically, as can be seen in Figure 3. This makes MSE a less robust cost function.
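To make the loss-versus-cost distinction concrete, here is a minimal NumPy sketch (with made-up numbers) that computes the per-example L2 losses and then averages them into the MSE cost:

```python
import numpy as np

# Hypothetical observed and predicted outputs
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# L2 loss: squared error for each individual training example
l2_loss = (y_true - y_pred) ** 2

# MSE cost: average of the per-example L2 losses
mse = l2_loss.mean()

print(l2_loss)  # loss per training example
print(mse)      # single cost value for the whole dataset
```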

  • Loss function #2: Absolute Error/L1
Figure 4 — L1 loss function (https://amitshekhar.me/blog/l1-and-l2-loss-functions)
  • Cost function #2: Mean Absolute Error (MAE)
Figure 5 — MAE cost function (https://en.wikipedia.org/wiki/Mean_squared_error)
Figure 6 — MAE graph (https://heartbeat.comet.ml/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)

L1 loss is the absolute difference between the observed and the predicted output for a single training example, while MAE is the average of the absolute differences between the observed and the predicted outputs over all training examples. MAE is a more robust cost function because it penalizes all errors evenly on a linear scale. Hence, MAE is not as sensitive to outliers, since it does not put as much weight on large outliers as MSE does.
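The same sketch, swapping the squared error for the absolute error (values are again made up), gives the per-example L1 losses and the MAE cost:

```python
import numpy as np

# Hypothetical observed and predicted outputs
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# L1 loss: absolute error for each individual training example
l1_loss = np.abs(y_true - y_pred)

# MAE cost: average of the per-example L1 losses
mae = l1_loss.mean()
print(mae)
```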

  • Loss function #3: Huber Loss/Smooth Mean Absolute Error
Figure 7 — Huber loss function
  • Cost Function #3: MAE when δ ~ 0 and MSE when δ ~ ∞
Figure 8 — Huber cost function: (https://www.analyticsvidhya.com/blog/2022/06/understanding-loss-function-in-deep-learning/)
Figure 9 — Huber loss graph for varying δ (https://medium.com/analytics-vidhya/a-comprehensive-guide-to-loss-functions-part-1-regression-ff8b847675d6)

Since L2 loss is more stable when the difference between the predicted and expected output is small, and L1 loss is more stable when the difference is big (as is the case with outliers), Huber loss gives the best of both worlds: it penalizes small errors quadratically and big errors linearly, so the overall model lies somewhere between MSE and MAE. The cutoff that determines how small an error must be for the quadratic (L2) penalty to be applied is controlled by the hyperparameter δ, whose value is critical. Hence, we have to tune the hyperparameter δ iteratively, which requires additional model training and increases model complexity.
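A rough NumPy sketch of this piecewise definition (the δ value and the data points are purely illustrative) shows how the penalty switches from quadratic to linear at the cutoff:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Per-example Huber loss: quadratic below delta, linear above it."""
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                        # L2-like region
    linear = delta * (np.abs(error) - 0.5 * delta)    # L1-like region
    return np.where(is_small, squared, linear)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 15.0])   # the last point behaves like an outlier

# The cost is the average Huber loss; try different delta values to see the effect
print(huber_loss(y_true, y_pred, delta=1.0).mean())
```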

Classification Functions:

Now that you have the gist of the difference between cost and loss functions, I will focus mainly on the cost functions from now on.

  • Cost function #1: Binary Cross Entropy/Log Loss
Figure 10 — Log loss cost function (https://androidkt.com/choose-cross-entropy-loss-function-in-keras/)
Figure 11 — Log loss graph (https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a)

This is the cost function used in binary logistic regression: the negative average of the log of the corrected predicted probabilities, where y is the label (“1” for the positive class and “0” for the negative class) and p(y) is the predicted probability of the positive class for each of the N points. Logistic regression fits a sigmoid curve to the data and obtains the predicted probability of a point being in the positive class. Binary cross entropy then works with the corrected predicted probability, which is the probability of the point belonging to its original (true) class: p(y) for positive points and 1 − p(y) for negative points. Since a probability is always between 0 and 1, its log is negative, so we take the negative log of the corrected probabilities. If the corrected predicted probability associated with the true class is 1.0, the corresponding loss is zero, and as that probability approaches zero, the loss grows without bound.
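A small NumPy sketch of this calculation (labels and probabilities are made up) makes the “corrected probability” step explicit:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip to avoid taking log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    # Corrected predicted probability: p for positive points, 1 - p for negative points
    corrected = np.where(y_true == 1, p, 1 - p)
    # Cost: negative average of the log of the corrected probabilities
    return -np.mean(np.log(corrected))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.95])  # predicted probability of the positive class
print(binary_cross_entropy(y_true, p_pred))
```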

  • Cost function #2: Multinomial Cross Entropy/Logistic Loss
Figure 12 — Multinomial Logistic cost function

This is a cost function used in multinomial logistic regression when we have a label with multiple classes, e.g. “red”, “green” and “blue”. The target values are still binary but represented as a vector y (y = 1 if the label is “red” and y = 0 otherwise). Just like binary cross entropy, we calculate a separate loss for each class label per observation and sum the result.
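With one-hot label vectors, the same idea can be sketched as follows (three classes and hypothetical probabilities, purely for illustration):

```python
import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true: one-hot label vectors; p_pred: predicted probability per class
    p = np.clip(p_pred, eps, 1.0)
    # Separate loss per class label, summed per observation, then averaged
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

# Three observations, three classes ("red", "green", "blue")
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])
print(categorical_cross_entropy(y_true, p_pred))
```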

  • Cost function #3: Hinge Loss
Figure 13 — Hinge loss cost function (https://nehajirafe.medium.com/hinge-loss-machine-learning-bite-size-series-957eade62bcb)
Figure 14 — Hinge loss graph (https://nehajirafe.medium.com/hinge-loss-machine-learning-bite-size-series-957eade62bcb)

Hinge loss is associated with “maximum-margin” classification, most notably the SVM classification model, where you want to separate one group of data points from another. When the sample’s prediction has the same sign as the true label and lies beyond the margin, the loss is zero. That means that correct and confident classifications beyond the margin (> 1 in the above graph) do not incur any loss. However, correct but unconfident predictions (between 0 and 1 in the above graph) are still penalized, to discourage the model from making uncertain predictions. If a computed value gives an incorrect prediction, there is always a hinge loss, and this loss grows linearly, as shown in the graph.
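A minimal sketch of the hinge cost, assuming labels encoded as -1/+1 and raw classifier scores (all values illustrative):

```python
import numpy as np

def hinge_cost(y_true, scores):
    # y_true in {-1, +1}; scores are the raw signed outputs of the classifier.
    # Confident correct predictions (margin >= 1) incur zero loss;
    # everything else is penalized linearly.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([2.3, -0.8, 0.4, 1.1])  # last two: unconfident and wrong predictions
print(hinge_cost(y_true, scores))
```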

Final Words:

Even though the points above are not an exhaustive list of cost functions, the knowledge you can gain from this blog is enough for you to train your very first machine learning model. Good luck, new data scientists, you can do it!
