Stories by narcis teodoroiu on Medium

Regression: Goodness-of-Fit Measures

narcis teodoroiu — Wed, 29 Sep 2021 14:36:21 GMT

We technically can inspect all of the residuals to judge the model’s accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model.

Source: CHUTTERSNAP from Unsplash

None of the following measurements is completely sufficient on its own, so to measure our model we will need several measurements. Although there are many more, we will focus on the following:

Index

Standard deviation
SSE
SSR
SST
R²
Adjusted-R²
MAE
MSE
RMSE
MAPE
MPE
WMAPE

Standard Deviation (σ)

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

High standard deviation → data is widely spread (less reliable)

Low standard deviation → data are clustered closely around the mean (more reliable)

Source: https://vitalflux.com/standard-deviation-sample-population-python-code/

SSE (Sum of squared error-residual)

The error (residual) is the difference between the observed value (yi) and the predicted value (ŷi). SSE is the sum of the squares of these errors.

It is the unexplained variation: the portion of the total variation that is not explained by the regression line.

Source: Image uploaded by the Author

Formula comparison:

Source: https://towardsdatascience.com/explain-linear-regression-with-manual-calculation-1622affdce6b

SSR (Sum of squares regression)

It is the sum of the squared differences between the predicted value and the mean of the dependent variable.

It is the explained variation: the portion of the total variation that is explained by the regression line.

Think of it as a measure that describes how well our line fits the data.

Source: Image uploaded by the Author

SST (Sum of squares total)

It is the sum of the squared differences between the observed dependent variable and its mean.

Total variation in the data:

Total Variation (SST) = Explained Variation (SSR) + Unexplained Variation (SSE)

Source: Image uploaded by the Author
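To make the three sums concrete, here is a minimal sketch that fits a least-squares line to a tiny made-up dataset (the x and y values are hypothetical, chosen only for illustration) and checks that SST = SSR + SSE. The identity is exact for a least-squares line that includes an intercept, which is what this sketch assumes.

```python
# Toy data: hypothetical values, used only to illustrate the formulas.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least-squares fit of y = intercept + slope * x.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean
y_pred = [intercept + slope * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained variation
ssr = sum((pi - y_mean) ** 2 for pi in y_pred)          # explained variation
sst = sum((yi - y_mean) ** 2 for yi in y)               # total variation

print(f"SSE={sse:.2f}  SSR={ssr:.2f}  SST={sst:.2f}")   # SSE=2.40  SSR=3.60  SST=6.00
```

For a model that is not fitted by least squares, or has no intercept, the three quantities can still be computed but they need not add up.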

R² or Coefficient of Determination

It shows how well terms (data points) fit a curve or line.

It is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model.

Cons:

R² never decreases as terms are added, even when the model is not actually improving.
It cannot tell us whether the coefficient estimates and predictions are biased, which is why the residual plots must be examined.

Range: between -∞ and 1.

If R² is negative, the model is worse than predicting the mean.

Example: if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.
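A small sketch of the calculation, using the common definition R² = 1 - SSE/SST. The function name and the toy numbers are only illustrative; the three calls show a perfect model, a model that always predicts the mean, and a model that does worse than the mean.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SSE / SST."""
    y_mean = sum(y_true) / len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1 - sse / sst


y_true = [1, 2, 3]
perfect = r_squared(y_true, [1, 2, 3])     # every prediction is exact
mean_only = r_squared(y_true, [2, 2, 2])   # always predicting the mean
worse = r_squared(y_true, [3, 2, 1])       # worse than predicting the mean

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```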

Adjusted R²

It also indicates how well terms fit a curve or line, but adjusts for the number of terms in the model.

Pro:

It increases if we add useful terms and decreases if we add less useful predictors.

Range: between -∞ and 1.

If adjusted R² is negative, the model is no better than predicting the mean once the number of terms is taken into account.

It tells us how good or bad a model is.

Example: if a model has an adjusted R² of 0.05, it explains almost none of the variation in the target, so it is a poor fit.
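The adjustment can be sketched with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers below are assumptions for illustration: the same R² of 0.60 is penalized more when it was obtained with more predictors.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


r2 = 0.60  # hypothetical R² of a fitted model
with_one = adjusted_r_squared(r2, n_samples=20, n_predictors=1)
with_five = adjusted_r_squared(r2, n_samples=20, n_predictors=5)

print(round(with_one, 3), round(with_five, 3))  # 0.578 0.457
```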

MAE (Mean Absolute Error — L1 Loss)

The average of the absolute differences between the original values and the predicted values.

If we want a metric just to compare two models from an interpretation point of view, then MAE may be the better choice.

The smaller the MAE, the better the model.

Range: [0, +∞)

Minimizing the absolute error (𝐿1) over a set of numbers results in finding its median.

We take the absolute value of each residual so that negative and positive residuals do not cancel out.

Pros:

More robust to outliers than MSE, because the residuals are averaged in absolute value rather than squared.
MAE has the same units as target values.
Easily interpretable.

Cons:

Because we use the absolute value of the residual, the MAE does not indicate whether the model underpredicts or overpredicts.

Source: https://www.dataquest.io/blog/understanding-regression-error-metrics/
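Here is a minimal sketch of MAE in plain Python (the helper name and the toy values are hypothetical). The second half illustrates the point about the median: for a set of numbers with one extreme value, using the median as a constant prediction gives a lower MAE than using the mean.

```python
from statistics import mean, median


def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
model_mae = mae(y_true, y_pred)
print(model_mae)  # 0.5

# A constant prediction: the median beats the mean under absolute error.
values = [1, 2, 3, 4, 100]
mae_with_median = mae(values, [median(values)] * len(values))
mae_with_mean = mae(values, [mean(values)] * len(values))
print(mae_with_median, mae_with_mean)  # 20.2 31.2
```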

MSE (Mean Square Error — L2 Loss)

The most commonly used regression loss function.

MSE is often larger than MAE, because residuals larger than 1 grow when squared. Note that the two are in different units, so the comparison is only a rough one.

It is the average of the squared differences between the original values and the predicted values.

While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to a much higher total error in the MSE than they would in the MAE.

Minimizing the squared error (𝐿2) over a set of numbers results in finding its mean.

Use it when large errors matter and should be penalized heavily.

Range: [0, +∞)

Cons:

MSE does not have the same units as target values.
Not robust to outliers.
Before applying MSE, we must eliminate all nulls/infinites from the input.

Source: https://www.i2tutorials.com/differences-between-mse-and-rmse/
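To see the quadratic effect, the sketch below compares MAE and MSE on two hypothetical sets of residuals that differ only in one outlier. The helper names are made up for this example and both take residuals directly, just to keep it short.

```python
def mean_abs(residuals):
    """MAE computed directly from residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def mean_sq(residuals):
    """MSE computed directly from residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)


small = [1, -1, 1, -1]          # all residuals are small
with_outlier = [1, -1, 1, -10]  # same, but one residual is an outlier

print(mean_abs(small), mean_abs(with_outlier))  # 1.0 3.25
print(mean_sq(small), mean_sq(with_outlier))    # 1.0 25.75
```

One outlier roughly triples the MAE here, while the MSE becomes more than 25 times larger.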

RMSE (Root Mean Square Error)

The RMSE is analogous to the standard deviation and is a measure of how spread out the residuals are.

RMSE has the same units as the target values.

RMSE is always greater than or equal to MAE.

Range: [0, +∞)

Pros:

RMSE has the same units as the target values.
It tells us how good or bad the prediction accuracy is, not how good or bad the model itself is (unlike adjusted R²).

Cons:

Since the MSE and RMSE both square the residual, they are similarly affected by outliers. RMSE gives a relatively high weight to large errors due to the fact that the residual is squared before averaging.

Source: https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e
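The relationship between RMSE and MAE can be checked with a short sketch. The function names and values are hypothetical: both sets of predictions have the same MAE, but RMSE is larger when the error is concentrated in a single point, and the two are equal only when every error has the same magnitude.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared error (same units as the target)."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [10, 12, 14, 16]
even_errors = [12, 10, 16, 14]  # every prediction is off by 2
one_big_error = [10, 12, 14, 24]  # three exact predictions, one off by 8

print(mae(y_true, even_errors), rmse(y_true, even_errors))      # 2.0 2.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 2.0 4.0
```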

MAPE (Mean Absolute Percentage Error)

MAPE is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.

It is a method of forecast error calculation that removes negatives from the equation.

MAPE tells us how far, on average, the model’s predictions are off from their corresponding actual values, as a percentage of those values.

Pros:

Has a clear interpretation since percentages are easier for people to conceptualize.
Like MAE, it is robust to the effects of outliers thanks to the use of the absolute value.

Cons:

We are more limited in using MAPE than we are MAE.
Many of MAPE’s weaknesses actually stem from its use of the division operation.
Because every error is scaled by the actual value, MAPE is undefined for data points where the actual value is 0.

Example:

Source: https://www.aindhae.com/2019/12/cara-menghitung-mean-absolute.html
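As a rough sketch of the calculation (the function name and numbers are made up for illustration), MAPE can be written as below. Raising an error when an actual value is 0 is just one possible way to handle the undefined case.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(yt == 0 for yt in y_true):
        # Division by the actual value is impossible for these points.
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs((yt - yp) / yt) for yt, yp in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


y_true = [100, 200, 400]
y_pred = [110, 180, 400]  # off by 10%, 10% and 0%
result = mape(y_true, y_pred)
print(round(result, 2))   # 6.67
```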

MPE (Mean Percentage Error)

The mean percentage error (MPE) equation is exactly like that of MAPE. The only difference is that it lacks the absolute value operation.

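It tells us whether there are more positive errors than negative ones, or vice versa.

You can’t use MPE in the same way as MAPE: because positive and negative errors cancel out, it does not measure the overall size of the error.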

Pros:

If there are more negative or positive errors, this bias will show up in the MPE.
Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).

Source: https://www.dataquest.io/blog/understanding-regression-error-metrics/
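A minimal sketch of MPE follows. Note the sign convention, which is an assumption of this example: the error is taken as (predicted - actual) / actual, so that a positive MPE means overestimation and a negative one underestimation, matching the wording above. Some references use the opposite sign.

```python
def mpe(y_true, y_pred):
    """Mean percentage error, in percent.

    Assumed convention: (predicted - actual) / actual,
    so positive = overestimation, negative = underestimation.
    """
    ratios = [(yp - yt) / yt for yt, yp in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


y_true = [100, 200, 400]
always_high = [110, 220, 440]  # every prediction is 10% too high
mixed = [110, 180, 400]        # +10%, -10%, 0%: the errors cancel out

print(round(mpe(y_true, always_high), 2))  # 10.0
print(round(mpe(y_true, mixed), 2))        # 0.0
```

The second case is why MPE is a bias indicator rather than an accuracy measure: the predictions are not perfect, yet the MPE is zero.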

WMAPE (Weighted Mean Absolute Percentage Error)

It is a measure of prediction accuracy of a forecasting method.

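This metric is very popular in forecasting. In its common form, it divides the total absolute error by the total of the actual values, which is the same as weighting each percentage error by its actual value.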

Pro:

The advantage of this metric over MAPE is that it overcomes the ‘infinite error’ issue that appears when individual actual values are 0.
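The sketch below uses one common definition of WMAPE, the sum of absolute errors divided by the sum of absolute actual values; other weighting schemes exist, so treat the function as an assumption rather than the only form. The toy data includes an actual value of 0, which would make plain MAPE undefined.

```python
def wmape(y_true, y_pred):
    """Weighted MAPE, in percent: sum of |errors| / sum of |actuals|."""
    total_actual = sum(abs(yt) for yt in y_true)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when all actual values are 0")
    total_error = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred))
    return 100 * total_error / total_actual


y_true = [0, 100, 200]   # the 0 would break plain MAPE
y_pred = [10, 90, 200]
result = wmape(y_true, y_pred)
print(round(result, 2))  # 6.67
```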

Thanks for reading this far!

I hope you found this insightful and that it helps you in your data science career :) If you enjoyed the content, be sure to follow me on Medium. As always, I wish you the best in your learning endeavors!

Not sure what to read next? I’ve picked another article for you:

Data Science: Statistical Basics

Narcis Teodoroiu

Did you find the article interesting? FOLLOW me on Medium.
If you are interested in networking, let’s CONNECT on LinkedIn.

Data Science: Statistical Basics

narcis teodoroiu — Sun, 22 Aug 2021 09:43:15 GMT

According to Wikipedia: “Data science is a ‘concept to unify statistics, data analysis, informatics, and their related methods’ in order to ‘understand and analyze actual phenomena’ with data.”

Source: pexels.com

If you have ever heard of Data Science, I am sure you already know that statistics are an important foundation of this beautiful field. Therefore I have decided to write this blog to present a series of basic concepts.

My mathematician mind makes me think in a structured way, and I want my blogs to follow a similar pattern in which you can find a lot of images and examples and understand the concepts without having to read too much verbiage. That said, let’s start…

Index:

1. Population and sample

2. Mean, Median, Mode and Range

3. Distributions

Normal Distribution
Standardized Normal Distribution

4. Central Limit Theorem

5. Variability measures

Variance
Standard Deviation
Covariance
Coefficient of correlation

6. Outliers measures

Skewness
Kurtosis
IQR Method

Population and Sample

A population is the entire group that you want to draw conclusions about, while a sample is the specific group that you will collect data from. It is a subset of the population.

Source: Omniconvert.com

Mean, Median, Mode and Range

The mean, median and mode are measures of central tendency: in different ways, they each tell us what value in a data set is typical or representative of it. The range, by contrast, describes how spread out the data is.

The mean is the same as the average value of a dataset.

The median is the central number of the dataset.

The mode is the number that occurs most frequently in a dataset.

The range is the difference between the lowest value and the highest value.

Example: 7, 3, 4, 1, 7, 6

Mean: (7+3+4+1+7+6)/6 → 4.67
Median: 1, 3, 4, 6, 7, 7 → (4+6)/2=5
Mode: 7, 3, 4, 1, 7, 6 → 7
Range: 7–1 → 6
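The same example can be reproduced with Python’s standard `statistics` module. This is only a quick check of the numbers above.

```python
from statistics import mean, median, mode

data = [7, 3, 4, 1, 7, 6]

data_mean = mean(data)
data_median = median(data)          # average of the two middle values
data_mode = mode(data)              # most frequent value
data_range = max(data) - min(data)

print(round(data_mean, 2))  # 4.67
print(data_median)          # 5.0
print(data_mode)            # 7
print(data_range)           # 6
```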

Distributions

Normal/Gaussian Distribution

It is a type of continuous probability distribution for a real random variable.

It can be described with just two parameters: the mean and the standard deviation.

Source: Michael Galarnyk

Properties:

The mean, mode and median are all equal.
The curve is symmetric about its center (i.e. around the mean).
Exactly half of the values are to the left of center and exactly half of the values are to the right.
The total area under the curve is 1.
Its skewness is 0 and its kurtosis is 3.

Application in Machine Learning:

Data that follows a normal distribution is beneficial for model building, because it makes the math easier.
Some algorithms are derived under normality assumptions (for example, linear regression assumes normally distributed errors when making inferences about its coefficients). So it can help to transform the data toward a normal shape before applying some machine learning algorithms.

Why is it important?

Found in natural phenomena: it is the most important probability distribution in statistics because it fits many natural phenomena like age, height, test scores, IQ scores, the sum of the rolls of two dice and so on.
Mathematical reason: Central Limit Theorem.
Simplicity in mathematics. Namely, its mean, median and mode are all the same, and the entire distribution can be specified using just two parameters: mean and standard deviation.
Unlike many other distributions that change their nature on transformation, a Gaussian tends to remain a Gaussian (the product of two Gaussian densities is proportional to a Gaussian, and the convolution of a Gaussian with another Gaussian is a Gaussian).

Normal distribution in real life:

Height. Most of the people in a specific population are of average height. The number of people taller and shorter than the average height people is almost equal, and a very small number of people are either extremely tall or extremely short.
Rolling a die. In an experiment, it has been found that when a die is rolled 100 times, the chance of getting ‘1’ is 15–18%, and if we roll the die 1000 times, the chance of getting ‘1’ is, again, the same.
IQ. The intelligence quotient of a majority of the people in the population lies in the normal range whereas the IQ of the rest of the population lies in the deviated range.
Technical stock market. The changes in the log values of Forex rates, price indices and stock prices often form a bell-shaped curve. For stock returns, the standard deviation is often called volatility. If returns are normally distributed, more than 99 percent of the returns are expected to fall within three standard deviations of the mean value.
And many more (shoe size, birth weight, income distribution in the economy, etc.)

Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution where the mean is 0 and the standard deviation is 1. Converting a normal distribution to this form is called standardization.

The normal distribution can take on any value as its mean and standard deviation. In the standard normal distribution, the mean and standard deviation are always fixed.

Every normal distribution can be converted to the standard normal distribution by turning the individual values into z-scores: z = (x − μ) / σ.

N(μ, σ) → Standard Normal Z ∼ N(0, 1)

Source: mathisfun.com
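A short sketch of standardization, using `NormalDist` from Python’s standard library (Python 3.8+). The height distribution (mean 170, standard deviation 10) is a made-up example. It shows that the probability of being below a value is the same whether we ask the original distribution or the standard normal with the corresponding z-score.

```python
from statistics import NormalDist

# Hypothetical population: heights with mean 170 cm and std dev 10 cm.
heights = NormalDist(mu=170, sigma=10)
standard = NormalDist(mu=0, sigma=1)

x = 185
z = (x - heights.mean) / heights.stdev  # z-score of x

p_original = heights.cdf(x)   # P(X <= 185)
p_standard = standard.cdf(z)  # P(Z <= z)

print(z)                     # 1.5
print(round(p_original, 4))  # 0.9332
print(round(p_standard, 4))  # 0.9332
```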

Empirical rule: 68/95/99.7

68% of observations are within ±1 standard deviation of the mean.
95% of observations are within ±2 standard deviations of the mean.
99.7% of observations are within ±3 standard deviations of the mean.
Values outside of ±3 standard deviations account for about 0.3% of observations and, depending on the situation, could be considered outliers or signal noise.
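The three percentages can be verified from the standard normal cumulative distribution function. This sketch uses `statistics.NormalDist`; the variable names are only illustrative.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
coverage = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within ±{k} standard deviations: {p:.4f}")
# within ±1 standard deviations: 0.6827
# within ±2 standard deviations: 0.9545
# within ±3 standard deviations: 0.9973
```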

We convert normal distributions into the standard normal distribution for several reasons:

To find the probability of observations in a distribution falling above or below a given value.
To find the probability that a sample mean significantly differs from a known population mean.
To compare scores on different distributions with different means and standard deviations.

Central Limit Theorem

Introduction in context: “Suppose we want to study the average age of the whole population of China. As the population of China is very high, it will be a tedious job to get everyone’s age data and will take a lot of time for the survey. So instead of doing that we can collect samples from different parts of China and try to make an inference. To work with samples we need an approximation theory which can simplify the process of calculating mean age of the whole population. Here the Central Limit Theorem comes into the picture.”

Definition: If you sample batches of data from any distribution with finite variance and take the mean of each batch, then the distribution of those means will resemble a Gaussian distribution as the batch size grows, no matter what the shape of the population distribution.

Source: Wikipedia
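A small simulation can make the theorem tangible. The sketch below draws batches from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and looks at the batch means. The batch size, number of batches and seed are arbitrary choices for this example. According to the theorem, the means should be centred near 1 with a spread close to 1/√50 ≈ 0.141.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

BATCH_SIZE = 50
N_BATCHES = 2000

# Each batch is drawn from an exponential distribution with rate 1.
batch_means = [
    mean([random.expovariate(1.0) for _ in range(BATCH_SIZE)])
    for _ in range(N_BATCHES)
]

center = mean(batch_means)
spread = stdev(batch_means)
expected_spread = 1 / BATCH_SIZE ** 0.5

print(f"mean of batch means:    {center:.3f} (population mean is 1)")
print(f"std dev of batch means: {spread:.3f} (theory: {expected_spread:.3f})")
```

Plotting a histogram of `batch_means` would show the familiar bell shape, even though the individual draws are far from normal.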

Variability measures

Variance (σ²)

Definition: The average of the squared differences from the mean.

Disadvantage: It is expressed in squared units (e.g., meters squared), which makes it harder to interpret.

Standard Deviation (σ)

Definition: A measure of how spread out numbers are. It is the square root of the variance and indicates how much the values deviate from the mean of the sample.

Advantage: It is expressed in the same units as the original values (e.g., meters)
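Both measures are available in Python’s `statistics` module. In this sketch the data is a made-up set of numbers; `pvariance` and `pstdev` treat the data as a whole population (divide by n), while `variance` treats it as a sample (divide by n − 1).

```python
from statistics import pstdev, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean = 5

pop_var = pvariance(data)   # average squared distance from the mean
pop_std = pstdev(data)      # square root of the population variance
sample_var = variance(data)  # sample variance, divides by n - 1

print(pop_var)               # 4
print(pop_std)               # 2.0
print(round(sample_var, 3))  # 4.571
```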

Covariance

Definition: Measures the directional relationship between two variables.

Covariance is zero for independent variables, because they do not tend to move together.

Disadvantages:

Range: from -∞ to +∞
It is affected by a change in scale.

Coefficient of correlation

Definition: Measures the strength of the linear relationship between two variables. It is the covariance normalized by the standard deviations of the two variables.

Independent movements do not contribute to the total correlation. Completely independent variables have a zero correlation.

Advantages:

Range: from -1 to +1
Is not influenced by scaling.
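The difference in how the two measures react to scaling can be shown with a few lines of plain Python. The helper functions and the height/weight values are hypothetical; the same heights are expressed once in meters and once in centimeters.

```python
from math import isclose, sqrt


def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55, 68, 77, 90]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(f"covariance:  {cov_m:.3f} (m) vs {cov_cm:.3f} (cm)")
print(f"correlation: {corr_m:.4f} (m) vs {corr_cm:.4f} (cm)")
```

Changing the unit multiplies the covariance by 100, while the correlation stays the same.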

Outliers measures

Skewness

Skewness measures the lack of symmetry in a data distribution, that is, how much the probability distribution of a random variable deviates from the symmetric shape of the normal distribution. It is useful for checking for outliers.

There are two types of skewness:

Positive skewness. The tail on the right side of the distribution is longer or fatter. Mode < Median < Mean.
Negative skewness. The tail on the left side of the distribution is longer or fatter. Mean < Median < Mode.

Image: Sigmamagic.com

Why is it important?

The tail region may act as an outlier for the statistical model, and we know that outliers adversely affect the model’s performance, especially for regression-based models. So it is often necessary to transform the skewed data so that it is close enough to a Gaussian distribution.

Kurtosis

Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, it identifies whether the tails contain extreme values in a given distribution.

There are three types of kurtosis:

Normal Kurtosis. A normal distribution has a kurtosis of 3.
High Kurtosis (> 3). The distribution is longer and its tails are fatter. This is an indicator that the data has outliers. If there is high kurtosis, we need to investigate why we have so many outliers.
Low Kurtosis (< 3). The distribution is shorter and its tails are thinner than those of the normal distribution. This is an indicator that the data lacks outliers. If we get low kurtosis (too good to be true), we also need to investigate and trim the dataset of unwanted results.

Source: Analystprep.com
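Both measures can be computed from central moments. The sketch below uses the simple population (moment-based) versions, which is an assumption of this example; statistical libraries often apply small-sample corrections or report excess kurtosis (kurtosis minus 3) instead. The two toy datasets are made up.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)


def skewness(data):
    """Third standardized moment; 0 for symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Fourth standardized moment (not excess); 3 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # one extreme value on the right

print(skewness(symmetric), kurtosis(symmetric))  # 0.0 1.7
print(round(skewness(right_tailed), 2), round(kurtosis(right_tailed), 2))
```

The right-tailed data has positive skewness and a higher kurtosis than the symmetric data, as the descriptions above suggest.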

IQR Method

Interquartile range is the difference between Q3 and Q1.

Properties:

The median is the center point of the ordered data, also called the second quartile (Q2).
Q1 is the first quartile of the data, i.e., 25% of the data lies between the minimum and Q1.
Q3 is the third quartile of the data, i.e., 75% of the data lies between the minimum and Q3.

Source: Wikipedia

To detect outliers using this method, we define a new range, and any data point lying outside this range is considered an outlier and is dealt with accordingly. The range is as given below:

Lower Bound: Q1 − 1.5 × IQR
Upper Bound: Q3 + 1.5 × IQR
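The bounds above can be applied with `statistics.quantiles` (Python 3.8+). The data is a made-up example with one obvious outlier, and the "inclusive" method is just one of several quartile conventions, so other tools may give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical data, 100 is suspicious

# Quartiles; "inclusive" treats the data as the whole population.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(q1, q3, iqr)               # 3.25 7.75 4.5
print(lower_bound, upper_bound)  # -3.5 14.5
print(outliers)                  # [100]
```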

Why ‘1.5’?

For a normal distribution, the remaining 0.3% or so of the data lies more than three standard deviations (>3σ) from the mean (μ), and this part of the data is considered outliers. The first and the third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively, so the IQR is 1.35σ. To reach exactly 3σ we would need a scale of about 1.7, whereas a scale of 1.5 puts the bounds at about ±2.7σ. But 1.5 is more “symmetrical” than 1.7, and we’ve always been a little more inclined towards symmetry.


Data Science: Statistical Basics was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.