The Loss Function Diaries: Ch 3
In the previous chapter I covered two loss functions: a) Mean Squared Error (MSE) and b) Binary Cross Entropy (BCE). I went over the derivation of both functions using the maximum likelihood estimation process. This helped shed light on the origin of the functions and the reason they are preferred in specific scenarios.
In this chapter I will cover some more loss functions used for both regression and classification problems. I will also go over another method of designing loss functions, especially for classification problems. So, let’s dive in :)
Regression Loss Functions
So far we have seen MSE, which is the most commonly used loss function when it comes to training models for regression problems. Apart from it, there are a couple more that I came across. They are as follows:
Mean Absolute Error (MAE)
The absolute error is the L1-norm of the error, just like the squared error corresponds to the (squared) L2-norm of the error. The notable difference between the two is that the squared error squares the difference between the prediction and the ground truth. The squaring serves multiple purposes. It weights larger errors much more heavily than smaller ones, and its gradient is proportional to the error itself, so the size of the update made during back propagation scales with how wrong each prediction is. This proves helpful as the weights responsible for whatever error remains keep being optimized, with finer and finer corrections as the predictions approach the targets.

Also, the square function is a smooth convex function, which makes it easy to compute the gradient. However, the squaring can prove to be troublesome as it amplifies the errors produced by outliers. The absolute error of an outlier is already large, and squaring that large quantity only increases its value, so it ends up dominating the overall loss. This makes the model very sensitive to outliers.

On the other hand, the mean absolute error refrains from squaring the error. This makes it robust to outliers. The formula shows that there is a linear correspondence between the penalty assigned to an error and the error itself, unlike the squared error function: a prediction twice as far from the target incurs twice the penalty.
To summarise:
- MSE is sensitive to outliers, MAE is robust to outliers (the sketch after this list makes this concrete)
- MSE is a smooth convex function; MAE is convex but not smooth (it is not differentiable at zero)
- It is easier to compute the gradient for MSE than for MAE
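To make the outlier point concrete, here is a minimal sketch in NumPy with made-up numbers of my own (the function names are just for illustration, not from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Small, well-behaved errors
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(mse(y_true, y_pred), mae(y_true, y_pred))          # ~0.025 and ~0.15

# Same predictions, but the last one is an outlier
y_pred_bad = np.array([1.1, 1.9, 3.2, 14.0])
print(mse(y_true, y_pred_bad), mae(y_true, y_pred_bad))  # ~25.0 and ~2.6
```

A single bad prediction blows the MSE up by a factor of about a thousand, while the MAE grows only in proportion to the size of the error.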
Smooth Absolute Error
This function is an amalgamation of the best parts of MSE and MAE: it switches between them depending on the size of the error. For errors smaller than 1 it uses the squared error, which is smooth around zero and gives a gradient that shrinks along with the error, so training settles down instead of bouncing around the minimum. For all other cases it uses the absolute error, which keeps the penalty for large errors linear. This makes it less sensitive to outliers and prevents the problem of exploding gradients.
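As a sketch of the idea, here is the common Smooth L1 formulation with the switch-over point at 1 (the exact constants are my assumption, since the post does not pin them down):

```python
import numpy as np

def smooth_absolute_error(y_true, y_pred):
    """Squared error for small residuals, absolute error for large ones."""
    diff = np.abs(y_true - y_pred)
    # 0.5 * diff^2 when diff < 1, diff - 0.5 otherwise;
    # the two branches join with matching value and slope at diff = 1.
    return np.mean(np.where(diff < 1, 0.5 * diff ** 2, diff - 0.5))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 14.0])      # last prediction is an outlier
print(smooth_absolute_error(y_true, y_pred))  # the outlier's term is 9.5, not 50
```

The outlier still contributes the most to the loss, but only linearly (9.5 rather than the 50 that the squared branch would have charged).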

Classification Loss Functions
So far we have seen the derivation of the binary cross entropy function and the concept used behind the scenes. The same concept and method can be applied to cases with multiple classes: through maximum likelihood estimation we can obtain the multinomial (categorical) cross entropy loss function. The activation used is softmax, which is a generalised version of the sigmoid function.
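As a minimal sketch of what that looks like in code (assuming raw, un-normalised scores as input and an integer class index as the ground truth; the helper names are mine):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable
    # without changing the resulting probabilities.
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)

def categorical_cross_entropy(scores, true_class):
    """Negative log-probability that softmax assigns to the correct class."""
    return -np.log(softmax(scores)[true_class])

scores = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
print(categorical_cross_entropy(scores, true_class=0))   # ~0.42
```

With two classes and scores (s, 0) this reduces to the binary cross entropy with a sigmoid, which is why softmax is described as its generalisation.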
In this chapter we go through a loss function called Hinge Loss, which is widely used in Support Vector Machines (SVMs). This function is based on a different concept called margin classification. Note that I won’t be going through the details of the working of SVMs; I will only briefly go over the points needed to understand the loss function.
Hinge Loss Function
Classification problems are about creating boundaries to partition data into different class labels. Classification models which give an associated distance from the decision boundary for each example are called margin classifiers. For instance, if a linear classifier is used, the distance (typically the Euclidean distance, though others may be used) of an example from the separating hyperplane is the margin of that example [2].
SVMs are classifiers that represent the data examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on [3]. Mostly, SVMs are used to generate linear separations. For example, in 2D space the separating boundary is a line, in 3D space it is a plane, and in higher dimensions it is a hyperplane. SVMs are not restricted to linearly separable data: there are kernel tricks that can be used to map data that is not linearly separable to higher dimensions in order to obtain a linear separation, but that’s for another post ;)
The image below depicts how a linear SVM functions. The solid line is the separating boundary between the two classes. The dotted lines pass through the closest data examples of each class; these examples are called support vectors. The distance between the two dotted lines is called the margin. This method of classification aims at maximizing the margin, that is, creating the largest possible gap between the two classes around the boundary.

The margin is considered a no-man’s land: no data point of any class can lie in that region. This type of margin is called a hard margin due to the strict restriction it places on the data examples. Hard margins only work for linearly separable data, because by definition they do not allow a single misclassification or margin violation. Also, in the presence of outliers the margin shrinks, or the method may fail to find a boundary at all. Thus, it is very sensitive to noise.
Given all the restrictions of hard margins, the concept of soft margins was developed. A soft margin allows some examples to sit inside the margin, or even be misclassified, at the cost of a penalty for each violation. Because of this, a soft margin can choose a decision boundary with non-zero training error even if the dataset is linearly separable, which leads to less overfitting. The hinge loss function is the penalty used for soft margins. The image below shows the function.
For an example xi, the hinge loss is max(0, 1 − yi·hθ(xi)), where hθ(xi) is the prediction (raw score) of the model and yi ∈ {−1, 1} is the ground truth label for binary classification.

Hinge loss in soft margin SVMs penalizes misclassifications. It also penalizes predictions that are correct but made with low confidence. The loss is 0 only when the signs match and yi·hθ(xi) is greater than or equal to 1, which happens only for correct predictions made with enough confidence. The examples below demonstrate this:
Example 1
original label = −1 and prediction score = 0.4 (this means the model predicted class 1)
penalty = max(0, 1 − (−1)(0.4)) = max(0, 1.4) = 1.4, which is a high penalty since the prediction was inaccurate
Example 2
original label = 1 and prediction score = −0.9 (this means the model predicted class −1)
penalty = max(0, 1 − 1(−0.9)) = max(0, 1.9) = 1.9, which is a high penalty since the prediction was inaccurate
Example 3
original label = 1 and prediction score = 0.7 (this means the model predicted class 1)
penalty = max(0, 1 − 1(0.7)) = 0.3 (the loss is small but not 0, since the prediction is correct but its confidence falls just short of the margin of 1)
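A tiny sketch that encodes the same max(0, 1 − y·score) rule reproduces the three penalties above (the helper name is mine):

```python
def hinge_loss(y_true, score):
    """Binary hinge loss; y_true must be -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(-1, 0.4))   # 1.4 -- wrong side of the boundary
print(hinge_loss(1, -0.9))   # 1.9 -- wrong side of the boundary
print(hinge_loss(1, 0.7))    # 0.3 -- correct, but inside the margin
print(hinge_loss(1, 1.5))    # 0.0 -- correct and confidently outside the margin
```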
So, hinge loss tries to maximize the margin between the decision boundary and the data points and pushes every point to be classified correctly with high confidence. However, one issue with this loss function is that once an example satisfies the margin constraint, it contributes zero loss and drives no further optimization. This is unlike the cross entropy loss functions, which keep pushing the scores apart even for examples that are already classified correctly.
For example, the cross-entropy loss would give a much higher loss than the hinge loss if our (un-normalized) scores were [10, 8, 8] versus [10, −10, −10], where the first class is correct. The (multi-class) hinge loss would recognize that the correct class score already exceeds the other scores by more than the margin, so it will give zero loss for both score vectors. Once the margins are satisfied, the SVM will no longer optimize the weights in an attempt to “do better” than it is already [1].
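Here is a quick sketch of that comparison, using the standard multi-class hinge (SVM) loss with a margin of 1 and a softmax cross-entropy, which is my reading of the formulation in [1]:

```python
import numpy as np

def multiclass_hinge(scores, correct):
    """Sum of max(0, s_j - s_correct + 1) over the incorrect classes."""
    margins = np.maximum(0.0, scores - scores[correct] + 1.0)
    margins[correct] = 0.0   # the correct class contributes nothing
    return np.sum(margins)

def cross_entropy(scores, correct):
    probs = np.exp(scores - np.max(scores))
    probs /= np.sum(probs)
    return -np.log(probs[correct])

for scores in (np.array([10.0, 8.0, 8.0]), np.array([10.0, -10.0, -10.0])):
    print(multiclass_hinge(scores, 0), cross_entropy(scores, 0))
# hinge: 0.0 for both; cross-entropy: ~0.24 vs ~0.0
```

The hinge loss is already satisfied in both cases, while the cross-entropy still sees room for improvement in [10, 8, 8] and keeps pushing the correct score further away from the others.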
I will conclude this part here. In this chapter we saw some more loss functions used for regression and classification problems. We also saw another method of arriving at loss functions for classification problems, namely margin classification.
Stay tuned for more :)
If you like this post or found it useful please leave a clap!
If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.
