ML Concepts Explained

Basic Metrics to Understand Regression Models in Plain English

Data Science Interviews expect an intuitive understanding of these metrics

Manoj Kumar Dobbali
Towards Data Science


It is easy to remember rules of thumb such as: RMSE and MAE should be low, while R-Squared and its other flavors should be high. But data science interviews expect a little more from candidates. They won't ask you whether an R-Squared value of 0.6 or 0.7 is better. Instead, you can expect questions such as: which metric would you use to evaluate a regression model, and why that metric? Also, if your role is something like an Analytical Translator in your company, you might have to explain complex concepts to the business in a simple way. So, this post is about explaining these metrics intuitively instead of providing code. It is easy to find code chunks in the scikit-learn docs or on Stack Overflow showing how to calculate these scores.

Let us consider a simple linear regression model created with 11 observations (n), which is an unusually low number of examples but should suffice to demonstrate the point. These observations are represented by orange dots, and the linear regression equation, or best-fit line, is shown in green.

Fig 1. Simple Linear Regression Example

From Fig. 1 we can see that the linear regression model is not perfect. Four points lie on the line, and the other points are away from it in either direction. While each orange dot is an actual value of Y, the point on the regression line from which the blue arrow originates is the prediction Ŷ.

Mean Absolute Error (MAE)

If we consider each orange dot and calculate by how much the prediction misses the actual value, we get an error value for that point: the difference between Y and Ŷ. To calculate MAE:

  1. Take the absolute difference between Y and Ŷ for each of the 11 available observations: |Yᵢ-Ŷᵢ| where i ∈ [1, n] and n is the total number of points in the dataset.
  2. Sum the absolute differences to get a total error: Σ|Yᵢ-Ŷᵢ|
  3. Divide the sum by the total number of observations to get the mean error value: Σ|Yᵢ-Ŷᵢ| / n

MAE = Σ|Yᵢ-Ŷᵢ| / n
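Although this post is about intuition rather than code, here is a minimal sketch of the MAE calculation in Python; the y_true and y_pred arrays are made-up toy values, not the data behind Fig. 1:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values (not the data behind Fig. 1)
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0, 11.5])

# MAE by hand: mean of the absolute differences
mae = np.mean(np.abs(y_true - y_pred))
print(mae)                                  # 0.5

# Same result from scikit-learn
print(mean_absolute_error(y_true, y_pred))  # 0.5
```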

Each observation produces an error value, which could be zero, negative, or positive. If we simply add these error values together to see the total error, we might end up with a number that doesn't reflect the true performance.

A few positive values might push the total up while a few negative values pull it down, cancelling each other out and resulting in a statistic that is not indicative of model performance. So, we consider only the magnitude of the difference between actual and predicted values.

Note: There is also the Mean Bias Error, which sums all the error values without taking absolutes. I have personally never used it, so I am skipping it here.

Mean Squared Error (MSE)

How to calculate MSE?

  1. Take the difference between Y and Ŷ for each of the 11 available observations: Yᵢ-Ŷᵢ
  2. Square each of the difference values: (Yᵢ-Ŷᵢ)²
  3. Sum the squared values: Σ (Yᵢ-Ŷᵢ)² where i ∈ [1, n]
  4. Divide the sum by the total number of observations: Σ (Yᵢ-Ŷᵢ)² / n

MSE = Σ (Yᵢ-Ŷᵢ)² / n

These four steps should give us MSE for that model. But, why are we squaring the difference?

Let's say you have two models built on the same 1,000 examples. You calculate MAE for both and find it to be exactly the same. But there is a difference between the models that is not instantly observable: one model has a small error on every observation, while the other has extreme errors, some very high and some very low. Which model is better now?

If you are a real estate broker providing estimates for houses, you might prefer estimates that are consistently off by a little rather than sometimes spot-on and sometimes wildly wrong. In that case, penalizing the model for larger-magnitude errors will help us choose the appropriate model. We can do that by calculating MSE.

By squaring the difference between the actual and predicted values, we remove the sign of the errors and penalize larger errors more heavily. Let's say two regression models produce error values -1, -2, 3, 2 (Model A) and 1, -5, 1.5, 0.5 (Model B) respectively. MAE for both models is 2. But MSE is 4.5 for Model A and 7.125 for Model B. Because Model B has one high-magnitude error (-5), it gets penalized significantly by MSE.
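As a quick sanity check on those numbers, here is a small numpy sketch of the comparison:

```python
import numpy as np

errors_a = np.array([-1.0, -2.0, 3.0, 2.0])  # Model A error values
errors_b = np.array([1.0, -5.0, 1.5, 0.5])   # Model B error values

# MAE is identical for both models ...
print(np.mean(np.abs(errors_a)), np.mean(np.abs(errors_b)))  # 2.0 2.0

# ... but MSE penalizes Model B's single large error (-5) much more
print(np.mean(errors_a ** 2), np.mean(errors_b ** 2))        # 4.5 7.125
```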

Another way to interpret MSE is as the variance of the error values (how widely dispersed the errors are), which holds when the errors average out to roughly zero.

Root Mean Squared Error

This is simply the square root of MSE. Continuing with the same example, MSE values of 4.5 and 7.125 correspond to RMSE values of about 2.12 and 2.67. The practical difference is that RMSE has the same units as the target variable, while MSE has squared units. Just as MSE is the variance of the errors, RMSE is their standard deviation.
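Continuing the toy comparison above, a one-line check shows RMSE is just the square root of MSE:

```python
import numpy as np

mse_a, mse_b = 4.5, 7.125
print(np.sqrt(mse_a), np.sqrt(mse_b))  # ~2.12 ~2.67
```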

Root Mean Squared Logarithmic Error

I had not used this metric until I participated in a Kaggle competition. When both the actual and predicted values are huge in magnitude, the error for that pair tends to be large compared to the errors on smaller-magnitude observations. For instance, you might come across a real estate dataset with a good mix of expensive mansions, average houses, and ultra-cheap falling-apart houses. If a model predicts a small condo worth $100,000 as $50,000, it is off by a lot, but if the same model predicts a mansion's price as $900,000 instead of $850,000, we can consider it close. The same error value of $50K is both massive and insignificant within the same dataset. In such cases, to stop the large absolute differences on high-magnitude observations from dominating the error, we use RMSLE.

Logarithms are a convenient way to express large numbers on a much smaller scale. Check this: the base-10 log of 10,000 is 4, while the base-10 log of 5,000 is 3.6989. When a regression model's Y and Ŷ values vary widely in magnitude, the higher-magnitude observations inflate RMSE, MSE, and MAE significantly; the log transform compresses that effect.

To calculate RMSLE:

  1. Take the log of (Actual + 1) and (Predicted + 1) and take the difference between them, which is the same as the log of their ratio: log(Yᵢ+1) - log(Ŷᵢ+1), or equivalently log((Yᵢ+1)/(Ŷᵢ+1)). (Note: +1 is added to both actual and predicted values so the log is defined even when a value is zero.)
  2. Square each of these values and sum them up: Σ (log(Yᵢ+1) - log(Ŷᵢ+1))²
  3. Divide the sum by the total number of observations and take the square root: √( Σ (log(Yᵢ+1) - log(Ŷᵢ+1))² / n )

RMSLE = √( Σ (log(Yᵢ+1) - log(Ŷᵢ+1))² / n )
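Here is a minimal sketch of RMSLE following the steps above; np.log1p(x) computes log(x + 1), and the house prices reuse the condo/mansion example from earlier:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE computed on log(value + 1)."""
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    return np.sqrt(np.mean(log_diff ** 2))

# The $50K miss on the cheap condo dominates absolute-error metrics, but
# RMSLE weights each error by its proportion of the actual value.
y_true = np.array([100_000.0, 850_000.0])
y_pred = np.array([50_000.0, 900_000.0])
print(rmsle(y_true, y_pred))  # ~0.49, driven almost entirely by the condo
```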

It can also be thought of as a metric that considers the proportion between prediction and actual instead of their difference. If Pred₁ = $50,000, Actual₁ = $80,000 and Pred₂ = $500,000, Actual₂ = $800,000, then in both cases log((P+1)/(A+1)) is going to be essentially the same.

To simplify the calculation, I am not going to use the real estate example here. Let's say we have Y (actual values) and Ŷ (predicted values) for two different regression models on the same dataset as follows:

Model A:

Y: 10, 14, 18, 120, 140, 1, 2

Ŷ: 10, 13, 18, 100, 130, 1, 2

Model B:

Y: 10, 14, 18, 120, 140, 1, 2

Ŷ: 6, 9, 7, 119, 130, 1.1, 1

For these values of Y and Ŷ, Model A has an RMSE of about 8.46 and an RMSLE of about 0.078, while Model B has an RMSE of about 6.14 and an RMSLE of about 0.429. If we only look at RMSE, Model B looks better. But if you glance over the predictions, it is evident that Model A is performing better; its RMSE is higher only because one prediction is off by a lot, and that prediction also happens to be for a high-magnitude observation.
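Here is a sketch that reproduces these numbers, reusing the rmsle helper defined above:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmsle(y_true, y_pred):
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    return np.sqrt(np.mean(log_diff ** 2))

y_true = np.array([10, 14, 18, 120, 140, 1, 2], dtype=float)
pred_a = np.array([10, 13, 18, 100, 130, 1, 2], dtype=float)
pred_b = np.array([6, 9, 7, 119, 130, 1.1, 1], dtype=float)

print(rmse(y_true, pred_a), rmsle(y_true, pred_a))  # ~8.46  ~0.078
print(rmse(y_true, pred_b), rmsle(y_true, pred_b))  # ~6.14  ~0.429
```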

Another way to think about RMSLE: it works well when you want to penalize underestimates more than overestimates. For instance, say Model A predicts a house worth $800K as $600K, and Model B predicts the same house's price as $1M. Even though both predictions are off by $200K, the RMSLE contribution is higher for Model A (0.2876) than for Model B (0.2231), while the RMSE contribution is the same.
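A quick check of those two per-observation log errors:

```python
import numpy as np

actual = 800_000.0
under, over = 600_000.0, 1_000_000.0  # Model A and Model B predictions

# Both predictions are off by $200K, but the underestimate produces
# a larger log error than the overestimate.
print(abs(np.log1p(actual) - np.log1p(under)))  # ~0.2877
print(abs(np.log1p(actual) - np.log1p(over)))   # ~0.2231
```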

The Coefficient of Determination or R Squared

We have metrics like RMSE, MSE, and MAE. Comparing these values across a few models, or a few versions of the same model, lets us pick the best one. But what happens after we finalize a model? Is the selected model a good fit for the data? Is there scope for improvement? We can answer that using the R-Squared value.

Going back to the real estate example, let's say you have 1,000 rows of data with different features determining house prices in an area, and you have just 10 seconds to estimate the value of a new house in that area. What would be the best option? Just take the average price of those 1,000 houses and report that as the estimate for the new house. Even though this is not a great prediction, chances are high that it will be less wrong than a random guess. This is called a baseline model, mean model, or no-relationship line, and it would be a line parallel to the x-axis. We can compare our fancier linear regression model against this baseline to see how much better it is. This is what the R-Squared value gives us.

So, if we take the difference between the Sum of Squared Errors of the Mean line (SSEM) and the Sum of Squared Errors of the Regression line (SSER), we get the amount of error reduced by using the regression line instead of the mean line. That difference, divided by the Sum of Squared Errors of the Mean line, gives the proportion of error reduced by the regression line compared to the mean line, which is the R-Squared value!

R² = (SSEM - SSER) / SSEM = 1 - (SSER / SSEM)
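A minimal sketch of this calculation, comparing a regression model's predictions against the mean (baseline) model; y_true and y_pred are hypothetical values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])  # hypothetical actuals
y_pred = np.array([2.5, 5.0, 8.0, 11.0, 11.5])  # hypothetical regression predictions

sse_mean = np.sum((y_true - y_true.mean()) ** 2)  # SSEM: errors of the mean model
sse_reg = np.sum((y_true - y_pred) ** 2)          # SSER: errors of the regression model

r_squared = 1 - sse_reg / sse_mean
print(r_squared)                 # ~0.967
print(r2_score(y_true, y_pred))  # matches scikit-learn's implementation
```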

For a linear regression model evaluated on the data it was fit to, this value lies between 0 and 1. It is also interpreted as the proportion of variance explained by the model: SSEM is essentially the variance of the target around its mean, and by using a regression model instead of the mean model, that error is reduced to SSER. This reduction in error is what is "explained" by the model.

In the next post, I will talk about Adjusted R-Squared, Predicted R-Squared, residual plots, p-values for variables, and regression coefficients. Stay tuned!

Note:

You might see some equations with the denominator as n-p instead of n, where p is the number of independent variables used to create the model. In my experience working in the online retail industry, it doesn't really matter, as usually n >>> p and hence n-p tends to n. But in classical statistics, where small samples are common, n vs n-p makes a significant difference. But why n-p, or degrees of freedom, in classical statistics? That could be a potential blog post!

I love explaining complex concepts in a simple way. If you have any questions or just want to connect, you can find me on Linkedin or email me at manojraj.dobbali@gmail.com.

Also, here is another blog post by neptune.ai that I recommend going through: https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide
