Common Loss Functions in Machine Learning for a Classification Model

Sushant Kumar
Published in Analytics Vidhya
Sep 21, 2020

Finding the correct loss function for any algorithm is critical: an inaccurate selection can lead to a wrong solution and become a troublemaker in the optimization of a machine learning model.

Machine learning is a pioneering subset of Artificial Intelligence, in which machines learn on their own from the available dataset. To optimize any machine learning model, an acceptable loss function must be selected. A loss function characterizes how well the model performs over the training dataset: it expresses the discrepancy between the predictions of the model being trained and the actual problem instances. If the deviation between the predicted and actual results is too large, the loss function takes a very high value. Gradually, with the help of an optimization function, the model learns to reduce the error in its predictions. In this article, we will go through several loss functions and their applications in the domain of machine/deep learning.

There is no universal loss function that is suitable for every machine learning model. Depending on the type of problem statement and model, a suitable loss function needs to be selected from the available set. Different factors, such as the type of machine learning algorithm, the percentage of outliers in the provided dataset, and the ease of calculating derivatives, play a role in choosing the loss function.

Loss functions are mainly divided into two major categories: regression losses and classification losses. In this article, only classification losses will be discussed. To know more about regression losses, see https://medium.com/analytics-vidhya/common-loss-functions-in-machine-learning-for-a-regression-model-27d2bbda9c93.

(Note: regression models generally predict a value/quantity, whereas classification models predict a label/class.)

Classification losses:

1. Binary Classification Loss Functions

In binary classification, the end result is one of two available options. It is the task of classifying elements into two groups on the basis of a classification rule.

Generally, the problem is to predict a value of 0 or 1 for the first or second class. It is usually implemented as predicting the probability that an element belongs to the first or the second class. In many practical binary classification problems, the two groups are not symmetric, and rather than overall accuracy, the relative proportion of the different types of errors is of interest.

Following are a few examples of binary classification:

  • Medical testing to determine whether a patient has a certain disease or not;
  • Quality control in industry, deciding whether a specification has been met;
  • Information retrieval, deciding whether a page should be in the result set of a search or not;
  • Email spam detection (spam or not);
  • Churn prediction (churn or not);
  • Conversion prediction (buy or not).

1.1 Binary Cross-Entropy

Binary cross-entropy is a commonly used loss function for binary classification problems. It is intended for use where there are only two categories, either 0 or 1, or class 1 or class 2; that is, tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right). Formally, this loss is equal to the average of the categorical cross-entropy loss over many two-category tasks.

It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So, if the predicted probability for an element is 0.02 while the actual observation label is 1, the prediction is bad and results in a high loss value. A theoretically perfect model has a binary cross-entropy loss of 0.
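
As a rough illustration, here is a minimal NumPy sketch of binary cross-entropy, -[y·log(p) + (1-y)·log(1-p)] averaged over the batch; the function name and the clipping epsilon are illustrative choices, not from the original article:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over a batch.

    y_true: actual labels (0 or 1), y_pred: predicted probabilities in (0, 1).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.02, 0.8])  # 0.02 for a true label of 1 is penalised heavily
print(binary_cross_entropy(y_true, y_pred))
```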

1.2 Hinge Loss

The hinge loss function is an alternative to cross-entropy for binary classification problems. It was mainly developed for use with Support Vector Machine (SVM) models in machine learning. The hinge loss is meant for binary classification where the target values are in the set {-1, 1}. So, to use the hinge loss function, the target variable must be modified to take values in the set {-1, 1}, rather than {0, 1} as in the case of binary cross-entropy.

Hinge Loss function

This function encourages examples to have the correct sign, assigning more error when there is a difference in sign between the actual and predicted class values.
Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.
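
A minimal NumPy sketch of the hinge loss, max(0, 1 - y·ŷ) averaged over the batch, assuming labels have already been mapped to {-1, +1} (function and variable names are illustrative):

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    """Average hinge loss max(0, 1 - y * y_hat), with labels in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

# Labels must first be remapped from {0, 1} to {-1, +1}.
y_true = np.array([1, -1, 1, -1])
y_pred = np.array([0.8, -0.5, -0.3, 0.1])  # raw model scores, not probabilities
print(hinge_loss(y_true, y_pred))
```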

1.3 Squared Hinge Loss

The hinge loss function has many extensions, often the subject of investigation with SVM models. The squared hinge loss is a loss function used for "maximum margin" binary classification problems. Mathematically, it is defined as:

Squared Hinge Loss

This popular extension, the squared hinge loss, simply calculates the square of the hinge loss score. It has the effect of smoothing the surface of the error function and making it numerically easier to work with.

If using the hinge loss results in better performance on a given binary classification problem, it is likely that the squared hinge loss will also be appropriate. As with the hinge loss function, the target variable must be modified to have values in the set {-1, 1}.

Use the squared hinge loss function on problems involving yes/no (binary) decisions, when you are not interested in knowing how certain the classifier is about the classification (i.e., when you don't care about the classification probabilities). Use it in combination with the tanh() activation function in the last layer.
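
A minimal NumPy sketch of the squared hinge loss, where each per-example hinge term is simply squared (names are illustrative):

```python
import numpy as np

def squared_hinge_loss(y_true, y_pred):
    """Average squared hinge loss: each per-example hinge term is squared."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)

y_true = np.array([1, -1, 1, -1])        # targets already mapped to {-1, +1}
y_pred = np.array([0.9, -0.7, 0.2, 0.4])
print(squared_hinge_loss(y_true, y_pred))
```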

2. Multi-class Classification Loss Functions

Multi-class classification covers those predictive modelling problems where examples are assigned to one of more than two classes.

The problem is often framed as predicting an integer value, where each class is assigned a unique integer value from 0 to n (depending upon the number of classes). It is often implemented as predicting the probability of the example belonging to each known class.

Following are a few examples of multi-class classification problems:

  • Face classification.
  • Plant species classification.
  • Optical character recognition.

2.1 Multi-class Cross-Entropy Loss

As in the binary classification problem, cross-entropy is also a commonly used loss function here. In this case, it is used for multi-class classification where the target element is in the set {0, 1, ..., n}, and each class is assigned a unique integer value.

Multiclass Cross-Entropy loss

Cross-entropy calculates a score that signifies how different the predicted value is from the actual label, over all classes in the problem. Theoretically, a perfect cross-entropy value is 0.

It is the loss function to be evaluated first and only changed if you have a good reason.
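
A minimal NumPy sketch of multi-class (categorical) cross-entropy with one-hot targets, -Σ y·log(p) averaged over the batch (the function name, epsilon, and example numbers are illustrative):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy with one-hot targets and per-class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])           # one-hot labels for a 3-class problem
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])     # softmax outputs from the model
print(categorical_cross_entropy(y_true, y_pred))
```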

2.2 Kullback Leibler Divergence Loss

In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.

Basically, it is a distance-like measure between two probability distributions: in the case of ML, between the probability scores over class labels returned by the model and the ground-truth probability distribution.
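
A minimal NumPy sketch of KL divergence, KL(P ‖ Q) = Σ p·log(p/q); note that with a one-hot ground-truth distribution it reduces to the cross-entropy (the function name and epsilon clipping are illustrative):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum p * log(p / q): extra information needed to encode P using Q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([1.0, 0.0, 0.0])   # ground-truth (one-hot) distribution
q = np.array([0.7, 0.2, 0.1])   # model's predicted distribution
print(kl_divergence(p, q))      # equals the cross-entropy here, since H(P) = 0
```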

2.3 Sparse Multiclass Cross-Entropy Loss

People are often confused about the difference between categorical_crossentropy and sparse_categorical_crossentropy.

Both categorical_crossentropy and sparse_categorical_crossentropy are used in multi-class classification.

The main difference is that the former expects the targets in the form of one-hot encoded vectors, whereas the latter expects them as integers. The sparse version can also help when you encounter memory constraints: it performs the same cross-entropy calculation of error, without requiring the target variable to be one-hot encoded before training.
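
A minimal NumPy sketch of the sparse variant, which takes integer class indices directly instead of one-hot vectors (names are illustrative); in Keras these two behaviours correspond to categorical_crossentropy and sparse_categorical_crossentropy:

```python
import numpy as np

def sparse_categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Same cross-entropy as above, but targets are integer class indices,
    so no one-hot encoding (and no large one-hot matrix in memory) is needed."""
    y_pred = np.clip(y_pred, eps, 1.0)
    # Pick the predicted probability of the true class for every example.
    return -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 2])                    # integer labels instead of one-hot vectors
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(sparse_categorical_cross_entropy(y_true, y_pred))
```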


Note

Also, if you are a beginner in machine learning and enthusiastic to learn more, you can look up the GitHub account sushantkumar-estech or use the link https://github.com/sushantkumar-estech for interesting projects.

Select any project you wish for practice, and in case of any questions, you can write to me. I would be happy to help.

Enjoy reading and happy learning!!

