Comprehensive Guide to Machine Learning Loss Functions and Evaluation Metrics Explained: PART 1 (Regression).

Rakeshbobbati

Loss functions in machine learning are mathematical functions used to quantify the difference between the predicted values produced by a model and the actual target values. They play a crucial role in the training process of machine learning models by guiding the optimisation algorithm in adjusting the model parameters to minimise this difference, thereby improving the model’s performance.

Here are some commonly used loss functions across different types of machine learning tasks:

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is a widely used loss function for regression tasks that measures the average magnitude of the errors between predicted values and actual values without considering their direction. It is an intuitive and straightforward metric that quantifies how close predictions are to the actual outcomes.

Formula

The formula for Mean Absolute Error is:

MAE = (1/n) Σ |y_i − ŷ_i|, with the sum taken over i = 1, …, n

where y_i represents the actual value for the i-th observation, ŷ_i represents the predicted value, and n is the number of observations.

Gradient of MAE

For optimisation purposes, such as gradient descent, it is important to understand the gradient of the MAE with respect to the predictions ŷ_i. The gradient (or derivative) of the MAE with respect to ŷ_i is given by:

∂MAE/∂ŷ_i = (1/n) · sign(ŷ_i − y_i), which equals +1/n when ŷ_i > y_i, −1/n when ŷ_i < y_i, and is undefined when ŷ_i = y_i.

This gradient indicates the direction and magnitude of the change needed to minimise the error for each prediction.
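
To make this concrete, here is a minimal NumPy sketch (the array names y_true and y_pred are illustrative) of MAE and its sub-gradient with respect to the predictions:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference between actual and predicted values."""
    return np.mean(np.abs(y_true - y_pred))

def mae_subgradient(y_true, y_pred):
    """Sub-gradient of MAE with respect to the predictions: sign(y_pred - y_true) / n.

    Where the error is exactly zero the derivative is undefined; np.sign
    returns 0 there, which is one valid sub-gradient choice.
    """
    n = len(y_true)
    return np.sign(y_pred - y_true) / n

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y_true, y_pred))              # 0.5
print(mae_subgradient(y_true, y_pred))  # [-0.25  0.25  0.    0.25]
```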

Characteristics of MAE

  1. Absolute Differences: MAE calculates the average of the absolute differences between actual and predicted values, which means it treats all errors equally, without squaring them.
  2. Scale Dependency: The value of MAE depends on the scale of the data. If the data values are large, the MAE will also be large, and vice versa. This makes it crucial to consider the context of the data when interpreting MAE.
  3. Linear Error Response: MAE provides a linear score that does not amplify errors, unlike Mean Squared Error (MSE), which squares the errors. This linearity makes MAE less sensitive to outliers compared to MSE.
  4. Interpretable: MAE is easy to understand and interpret as it directly represents the average error in the same units as the original data.

Advantages

  • Robustness to Outliers: Since MAE does not square the error terms, it is less sensitive to outliers compared to MSE. This makes it more robust in datasets with anomalies or extreme values.
  • Interpretability: The result of MAE is in the same unit as the target variable, making it easier to interpret in practical terms.

Disadvantages

  • Non-differentiability at Zero: MAE is not differentiable where the error is exactly zero (that is, where y_i = ŷ_i), which can pose challenges for some optimisation algorithms that rely on gradient descent. This issue can be mitigated using techniques like sub-gradient methods.
  • Equal Weight to All Errors: MAE assigns equal weight to all errors, which might not always be desirable, especially in cases where larger errors should be penalised more severely.

MAE is a simple yet effective loss function for regression tasks. Its robustness to outliers and interpretability make it a popular choice in many applications. However, its limitations, such as non-differentiability at zero and equal weighting of all errors, should be considered when selecting it for specific use cases.

Mean Squared Error (MSE)

Mean Squared Error (MSE) is one of the most commonly used loss functions in regression tasks within machine learning. It measures the average of the squares of the errors — that is, the average squared difference between the predicted values and the actual values. MSE is particularly useful because it heavily penalises larger errors, making it a sensitive measure for model performance.

The formula for Mean Squared Error is:

MSE = (1/n) Σ (y_i − ŷ_i)², with the sum taken over i = 1, …, n

where:

  • n is the number of observations.
  • y_i is the actual value for the i-th observation.
  • ŷ_i is the predicted value for the i-th observation.

Characteristics of MSE

  1. Squared Differences: MSE calculates the average of the squared differences between actual and predicted values, which means it amplifies larger errors more than smaller ones.
  2. Scale Dependency: Similar to MAE, the value of MSE depends on the scale of the data. Larger values in the data will result in a larger MSE.
  3. Differentiability: MSE is differentiable, making it suitable for optimisation algorithms that rely on gradient descent, as the derivative of the loss function can be easily calculated.
  4. Emphasis on Larger Errors: Since the errors are squared, MSE places more emphasis on larger errors, making it more sensitive to outliers compared to MAE.

Advantages

  • Differentiability: The MSE function is differentiable, which is advantageous for gradient-based optimisation algorithms.
  • Penalises Larger Errors: By squaring the errors, MSE penalises larger errors more than smaller ones, which can be beneficial when large errors are particularly undesirable.

Disadvantages

  • Sensitivity to Outliers: The squaring of errors makes MSE very sensitive to outliers. A single large error can disproportionately affect the MSE, leading to skewed performance metrics.
  • Interpretation: Unlike MAE, the result of MSE is not in the same unit as the original data (it’s in squared units), which can make interpretation less intuitive.

In machine learning model training, particularly for regression tasks, MSE is often used as the loss function to optimise. During training, the model parameters are adjusted iteratively to minimise the MSE, thereby improving the model’s predictive accuracy. The gradient of the MSE with respect to the model parameters is calculated and used to update the parameters in the direction that reduces the MSE.
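
As a small illustration (a plain NumPy sketch, not tied to any particular framework), MSE and its gradient with respect to the predictions can be computed as follows:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """Gradient of MSE with respect to the predictions: 2 * (y_pred - y_true) / n."""
    n = len(y_true)
    return 2.0 * (y_pred - y_true) / n

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))           # 0.375
print(mse_gradient(y_true, y_pred))  # [-0.25  0.25  0.    0.5 ]
```

Note that the gradient grows linearly with the error, so large errors push the parameters harder than small ones.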

Mean Squared Error is a fundamental loss function in machine learning, especially for regression tasks. Its sensitivity to larger errors can be both an advantage and a disadvantage, depending on the context. While it is easy to implement and differentiable, making it suitable for gradient descent optimization, its interpretation can be less intuitive due to the squaring of errors. Overall, MSE is a powerful tool for model evaluation and training, but care must be taken to handle outliers appropriately.

Huber Loss

Huber Loss is a loss function used in regression that combines the best properties of Mean Squared Error (MSE) and Mean Absolute Error (MAE). It is less sensitive to outliers in data than the squared error loss, making it more robust for regression tasks with noisy data. The Huber Loss is quadratic for small errors and linear for large errors, controlled by a parameter δ.

Formula

The Huber Loss is defined as:

L_δ(a) = (1/2) a²          if |a| ≤ δ
L_δ(a) = δ (|a| − δ/2)     if |a| > δ

where:

  • a = y_i − ŷ_i is the residual (difference between the actual value and the predicted value).
  • δ is a threshold parameter that determines the point at which the loss function transitions from quadratic to linear.

For a dataset with n observations, the Huber Loss can be calculated as:

Huber Loss = (1/n) Σ L_δ(y_i − ŷ_i), with the sum taken over i = 1, …, n
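
A minimal NumPy sketch of this piecewise definition (delta is the tunable threshold; the data arrays are illustrative) could look like:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber Loss: quadratic for |residual| <= delta, linear beyond it."""
    residual = y_true - y_pred
    abs_residual = np.abs(residual)
    quadratic = 0.5 * residual ** 2                 # used where |a| <= delta
    linear = delta * (abs_residual - 0.5 * delta)   # used where |a| >  delta
    return np.mean(np.where(abs_residual <= delta, quadratic, linear))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 10.0])      # the last point behaves like an outlier
print(huber_loss(y_true, y_pred, delta=1.0))  # the outlier is penalised only linearly
```

Setting a larger delta makes the loss behave more like MSE; a smaller delta makes it behave more like MAE.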

Characteristics of Huber Loss

  1. Robustness to Outliers: The Huber Loss is less sensitive to outliers compared to MSE because it transitions to a linear loss for large errors, reducing the impact of these large deviations.
  2. Quadratic for Small Errors: For errors smaller than δ, the Huber Loss behaves like MSE, providing a smooth and continuous gradient which is advantageous for optimisation.
  3. Linear for Large Errors: For errors larger than δ, the Huber Loss behaves like MAE, mitigating the influence of outliers by not squaring the large errors.

Advantages Over MSE and MAE

  1. Combines the Strengths of MSE and MAE: MSE penalises large errors more than MAE, making it sensitive to outliers. MAE treats all errors equally, which can be less informative for optimisation. Huber Loss balances these by being quadratic for small errors (like MSE) and linear for large errors (like MAE).
  2. Differentiability: Huber Loss is differentiable everywhere, including at the transition point, unlike MAE which is not differentiable at zero. This property makes it suitable for gradient-based optimisation algorithms.
  3. Robustness to Outliers: The linear behaviour for large errors reduces the impact of outliers, making Huber Loss more robust than MSE in the presence of noisy data.
  4. Parameter Tuning (δ): The parameter δ allows flexibility to control the sensitivity of the loss function to outliers. By adjusting δ, one can tune the loss function to behave more like MSE or MAE depending on the specific requirements of the problem.

Huber Loss is a versatile and robust loss function that effectively combines the strengths of MSE and MAE. Its ability to handle outliers while providing a smooth gradient makes it an excellent choice for many regression tasks. By tuning the δ parameter, Huber Loss can be adapted to different types of data and requirements, offering flexibility and robustness in model training and evaluation.

Log-Cosh Loss Function

Log-Cosh Loss is a loss function used in regression tasks, providing a balance between Mean Squared Error (MSE) and Mean Absolute Error (MAE). It is based on the hyperbolic cosine function and is designed to be both robust to outliers and smooth for optimisation purposes. The function is particularly useful when the distribution of errors includes a mix of small and large errors.

Formula

The formula for Log-Cosh Loss is:

Log-Cosh Loss = (1/n) Σ log(cosh(ŷ_i − y_i)), with the sum taken over i = 1, …, n

where:

  • cosh(x) = (e^x + e^(−x)) / 2 is the hyperbolic cosine function.
  • y_i is the actual value.
  • ŷ_i is the predicted value.
  • n is the number of observations.

The Log-Cosh Loss is defined as the logarithm of the hyperbolic cosine of the prediction error. The function log(cosh(x)) is approximately x²/2 for small values of x and approximately |x| − log(2) for large values of x. This means the Log-Cosh Loss behaves similarly to the squared error for small errors and to the absolute error for large errors.
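
Here is a small NumPy sketch. Computing np.log(np.cosh(x)) directly can overflow for large errors, so this version uses the equivalent, numerically safer identity log(cosh(x)) = logaddexp(x, −x) − log(2):

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Log-Cosh Loss: mean of log(cosh(prediction error))."""
    error = y_pred - y_true
    # np.logaddexp(error, -error) = log(e^x + e^-x), computed stably;
    # subtracting log(2) yields log(cosh(x)).
    return np.mean(np.logaddexp(error, -error) - np.log(2.0))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 10.0])
print(log_cosh_loss(y_true, y_pred))
```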

Characteristics of Log-Cosh Loss

  1. Smooth and Differentiable: Log-Cosh Loss is smooth and differentiable, making it suitable for optimisation with gradient descent algorithms.
  2. Robustness to Outliers: It is less sensitive to outliers compared to MSE, as it does not penalise large errors as heavily.
  3. Balanced Behaviour: Combines the quadratic nature of MSE for small errors and the linear nature of MAE for large errors, providing a balanced approach.

Advantages

  1. Smooth Gradient: The Log-Cosh Loss provides a smooth gradient, making it more stable and reliable for optimization algorithms. This can lead to faster convergence and better model performance.
  2. Robustness to Outliers: By approximating MAE for large errors, Log-Cosh Loss reduces the influence of outliers on the model, leading to more robust regression results.
  3. Balanced Error Treatment: The loss function treats small and large errors differently, effectively balancing the advantages of both MSE and MAE. This helps in handling data with a mix of error magnitudes.

Disadvantages

  1. Computational Complexity: The computation of the hyperbolic cosine function and its logarithm is more complex compared to the simple arithmetic operations in MSE and MAE. This can lead to slightly increased computational overhead.
  2. Less Intuitive Interpretation: The Log-Cosh Loss values are less intuitive to interpret compared to MSE and MAE, as they do not directly correspond to average error magnitudes in the original data units.

Log-Cosh Loss is applicable in various regression tasks, particularly when:

  • The data contains outliers that need to be handled robustly.
  • A smooth and stable loss function is required for gradient-based optimisation.
  • A balance between penalising small and large errors is desired.

Log-Cosh Loss is a versatile and robust loss function that effectively balances the properties of MSE and MAE. Its smooth gradient and reduced sensitivity to outliers make it an excellent choice for many regression tasks. While it may have slightly higher computational complexity and less intuitive interpretation, its advantages often outweigh these drawbacks, especially in scenarios with a mix of small and large errors.

Evaluation Metrics in Regression

Regression models predict continuous outcomes, and evaluating their performance requires specific metrics that measure how well the predictions match the actual values. The choice of metric can influence model selection and tuning, and understanding each metric’s characteristics is crucial for making informed decisions. Here’s a detailed look at common evaluation metrics used in regression:

Mean Squared Error (MSE)

MSE can also be used to evaluate a model. Although it is not robust to outliers, it can be a reasonable choice depending on the product requirements.

Explanation:

  • MSE measures the average of the squares of the errors between predicted and actual values.
  • It penalizes larger errors more heavily due to squaring the differences.

Advantages:

  • Differentiable, making it suitable for gradient-based optimization.
  • Sensitive to large errors, which can highlight significant model inaccuracies.

Disadvantages:

  • Highly sensitive to outliers.
  • The value is not in the same unit as the original data, making interpretation less intuitive.

Root Mean Squared Error (RMSE)

RMSE is one of the most widely used evaluation metrics for assessing a regression fit. It is the square root of MSE, bringing the error metric back to the same unit as the original data.

Advantages:

  • Easier to interpret than MSE because it’s in the same unit as the target variable.
  • Penalises large errors but less sensitive than MSE due to the square root.

Disadvantages:

  • Still sensitive to outliers.
  • Can be less intuitive if the data has large variations.

Mean Absolute Error (MAE)

MAE measures the average of the absolute errors between predicted and actual values. It treats all errors equally without squaring them.

Advantages:

  • Robust to outliers since it does not square the errors.
  • The value is in the same unit as the original data, making it easy to interpret.

Disadvantages:

  • Not differentiable at zero, which can complicate optimisation.
  • Less sensitive to larger errors compared to MSE and RMSE.

Mean Absolute Percentage Error (MAPE)

MAPE measures the average absolute percentage error between predicted and actual values. It provides an intuitive percentage error metric.
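
As a rough sketch (assuming none of the actual values are zero, since the percentage is undefined there, as noted in the disadvantages below):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, expressed as a percentage.

    Assumes y_true contains no zeros: a zero actual value makes the
    percentage error undefined for that observation.
    """
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 380.0])
print(mape(y_true, y_pred))  # errors of 10%, 5%, 5% -> about 6.67%
```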

Advantages:

  • Easy to interpret as it gives a percentage error.
  • Useful for comparing model performance across different datasets.

Disadvantages:

  • Undefined when actual values are zero.
  • Can be misleading if actual values are very small, leading to very high percentage errors.

R-squared (Coefficient of Determination)

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. It compares the model’s performance to that of a simple mean-based model.

Advantages:

  • Provides a normalised measure (between 0 and 1) of model fit.
  • Easy to interpret: a higher R-squared indicates a better fit.

Disadvantages:

  • Can be misleading for small sample sizes or models with many predictors.
  • Does not indicate whether the model predictions are biased or whether the model is the best among different models.

Adjusted R-squared

Adjusted R-squared adjusts the R-squared value based on the number of predictors in the model. It penalises the addition of non-significant predictors.
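
Assuming the standard correction, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A small sketch of that formula (the function name and example numbers are illustrative):

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Example: R^2 = 0.85 from a model with 3 predictors fitted on 50 samples.
print(adjusted_r2(0.85, n_samples=50, n_predictors=3))  # roughly 0.840
```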

Advantages:

  • More accurate than R-squared for models with multiple predictors.
  • Prevents overestimation of model performance.

Disadvantages:

  • Can still be misleading if the model includes irrelevant predictors.
  • Not always easy to interpret if the sample size is very small.

Explained Variance Score

Measures the proportion of variance explained by the model. Similar to R-squared but does not penalise the model for the number of predictors.

Advantages:

  • Provides a normalised measure (between 0 and 1) of model fit.
  • Useful for comparing models.

Disadvantages:

  • Does not penalise for overfitting.
  • Can be less intuitive for small datasets.
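
To tie these metrics together, here is a short scikit-learn sketch (assuming scikit-learn is installed; mean_absolute_percentage_error requires a reasonably recent release) computing them on one set of predictions:

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score,
)

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5, 4.2])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                    # back in the target's units
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction, not a percent
r2 = r2_score(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
print(f"MAPE: {mape * 100:.2f}%")
print(f"R^2:  {r2:.4f}")
print(f"Explained variance: {evs:.4f}")
```

Adjusted R-squared is not provided as a ready-made scikit-learn metric, so the formula shown earlier would be applied to r2 using the model’s sample and predictor counts.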

Conclusion

Choosing the right evaluation metric for a regression model depends on the specific context and goals of the analysis. While MSE and RMSE are sensitive to outliers and useful for highlighting large errors, MAE offers robustness. R-squared and Adjusted R-squared provide insight into model fit, and metrics like MAPE offer interpretability in percentage terms. Understanding the strengths and limitations of each metric is essential for making informed decisions in model evaluation and selection.

In the next part, I will discuss loss functions for classification and clustering models, along with their evaluation metrics.
