Evaluating the Quality of a Model, Algorithm, or Prediction

Shweta Dixit
7 min read · Jul 13, 2023


Evaluation metrics (also called performance or assessment metrics) play a crucial role in statistics, as they provide quantitative measures of the performance and quality of statistical models, algorithms, or predictions across many fields of study. Here are a few things you should know about these metrics before using them.

Sum of Squared Residuals (SSR): The SSR measures the total discrepancy between the observed values and the values predicted by a statistical model. It is the sum of the squared differences between the observed values of the dependent variable and the predicted values, so a lower SSR indicates a better fit of the model to the data. SSR is useful for judging the overall goodness of fit of a model. In R, you can calculate SSR from the residuals of a linear regression model; here is the general pattern:

# Assuming you have a linear regression model named "model" and the dependent variable is "y"
residuals <- residuals(model)
ssr <- sum(residuals^2)

Mean Squared Error (MSE): The MSE is calculated by taking the average of the squared differences between the observed and predicted values. It provides an estimate of the average prediction error of the model. Like SSR, a lower value of MSE indicates a better fit of the model. MSE is widely used in assessing the accuracy and precision of statistical models.

In R, you can calculate MSE using the mean() function. Here's an example:

mse <- mean(residuals^2)

R-squared (R²): R-squared measures the proportion of variance in the dependent variable that is explained by the independent variables in a statistical model. It quantifies the goodness of fit and represents the percentage of variability in the response variable that can be accounted for by the predictors. R-squared ranges from 0 to 1, with a higher value indicating a better fit of the model to the data. R-squared is a commonly used metric to evaluate the explanatory power of a regression model.

In R, you can obtain R-squared using the summary() function on a linear regression model. Here's an example:

# Assuming you have a linear regression model named "model"
summary(model)$r.squared

Adjusted R-squared (Adjusted R²): Adjusted R-squared is an extension of R-squared that adjusts for the number of predictors in the model. It penalizes the inclusion of unnecessary predictors and provides a more reliable measure of model fit when comparing models with different numbers of predictors. Adjusted R-squared takes into account the degrees of freedom and provides a more conservative estimate of the model’s explanatory power.

In R, you can obtain Adjusted R-squared using the summary() function on a linear regression model. Here's an example:

# Assuming you have a linear regression model named "model"
summary(model)$adj.r.squared
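
For intuition, Adjusted R² can also be reproduced by hand from R², the number of observations, and the number of predictors. The sketch below is only a check, and assumes the same fitted lm object named "model" as above:

# Manual check of Adjusted R² (a sketch; "model" is assumed to be a fitted lm object)
r2 <- summary(model)$r.squared
n <- length(residuals(model))   # number of observations
p <- length(coef(model)) - 1    # number of estimated coefficients, excluding the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2                          # should match summary(model)$adj.r.squared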

These evaluation metrics are widely used in statistical analysis to assess the accuracy, fit, and explanatory power of models. By utilizing these metrics, statisticians can make informed decisions about model selection, interpret the results, and compare different models based on their performance and predictive abilities.

By now, the importance of these evaluation metrics for judging the performance and accuracy of statistical models should be clear. Next, we will go through them in detail with working examples on R's built-in datasets. These metrics are most commonly used in regression analysis, but they apply to other types of analysis as well.

These measures can be applied to any dataset you have in hand, but for a concrete example we'll use the built-in iris dataset in R, which contains measurements of iris flowers:

# Load the "iris" dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
trainIndex <- sample(1:nrow(iris), 0.7 * nrow(iris))
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Perform a linear regression analysis
model <- lm(Sepal.Length ~ ., data = trainData)
# Obtain the model summary
summary(model)

The lm() function is used to fit a linear regression model. In this case, we are predicting Sepal.Length using the remaining variables as predictors.

After running the regression analysis, you can obtain the evaluation metrics using the summary of the model.

# Calculate evaluation metrics
residuals <- residuals(model) # Residuals
ssr <- sum(residuals^2) # SSR
mse <- mean(residuals^2) # MSE
r_squared <- summary(model)$r.squared # R-squared
adj_r_squared <- summary(model)$adj.r.squared # Adjusted R-squared
# Print the evaluation metrics
cat("Sum of Squared Residuals (SSR):", ssr, "\n")
cat("Mean Squared Error (MSE):", mse, "\n")
cat("R-squared (R²):", r_squared, "\n")
cat("Adjusted R-squared (Adjusted R²):", adj_r_squared, "\n")
This prints:

Sum of Squared Residuals (SSR): 10.10657
Mean Squared Error (MSE): 0.09625307
R-squared (R²): 0.8727428
Adjusted R-squared (Adjusted R²): 0.8663157

In this example, we fit a linear regression model, calculate SSR and MSE from the model's residuals, and extract R-squared and Adjusted R-squared from the model summary.

The metrics above are enough to get a solid grip on your model's performance, but there are several other evaluation metrics commonly employed in different domains, some of which require more background than others. Here are a few, with brief descriptions:

Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted and observed values. It provides a measure of the average magnitude of the errors, regardless of their direction. MAE is useful when the focus is on the magnitude of the errors rather than their direction.

Root Mean Squared Error (RMSE): RMSE is the square root of the mean squared error. It provides an estimate of the average prediction error, similar to MSE, but on the original scale of the data. RMSE is widely used in various fields, including regression analysis, time series analysis, and machine learning.

Accuracy: Accuracy is a metric commonly used in classification problems to measure the proportion of correctly classified instances out of the total number of instances. It is particularly useful when the class distribution is balanced. However, accuracy may not be reliable when dealing with imbalanced datasets.

Precision and Recall: Precision and recall are metrics often used in binary classification problems. Precision measures the proportion of true positive predictions out of all positive predictions, while recall (also known as sensitivity or true positive rate) measures the proportion of true positive predictions out of all actual positive instances.

F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure of a model’s performance in binary classification problems. It considers both precision and recall simultaneously and is particularly useful when there is an imbalance between the classes.

Area Under the ROC Curve (AUC-ROC): AUC-ROC is a metric commonly used to evaluate the performance of binary classification models. It measures the ability of a model to distinguish between the positive and negative classes by calculating the area under the receiver operating characteristic (ROC) curve. AUC-ROC ranges from 0 to 1, with a higher value indicating better discrimination.
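
AUC-ROC is not part of the worked example below, because it needs predicted class probabilities rather than class labels and is most naturally defined for a binary outcome. As a minimal sketch, assuming the "pROC" package is installed, you could reduce iris to two classes and compute it like this:

# A sketch of AUC-ROC for a binary problem, assuming the "pROC" package is installed.
# iris is reduced to two classes (versicolor vs. virginica) purely for illustration.
library(pROC)

binary <- subset(iris, Species != "setosa")
binary$Species <- droplevels(binary$Species)

set.seed(123)
idx <- sample(1:nrow(binary), 0.7 * nrow(binary))
trainBin <- binary[idx, ]
testBin <- binary[-idx, ]

# Logistic regression giving predicted probabilities of the second class
# (this small example may warn about near-perfect separation; it still runs)
fit <- glm(Species ~ ., data = trainBin, family = binomial)
probs <- predict(fit, newdata = testBin, type = "response")

# roc() takes the true labels and the predicted probabilities;
# auc() then returns the area under the ROC curve (0.5 = chance, 1 = perfect)
roc_obj <- roc(testBin$Species, probs)
auc(roc_obj)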

The code chunk below shows how to evaluate the additional metrics mentioned above, again using the iris dataset:

# Load the "iris" dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
trainIndex <- sample(1:nrow(iris), 0.7 * nrow(iris))
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

# Build a linear regression model
model <- lm(Sepal.Length ~ ., data = trainData)

# Make predictions on the test set
predictions <- predict(model, newdata = testData)

# Calculate evaluation metrics
MAE <- mean(abs(predictions - testData$Sepal.Length))
RMSE <- sqrt(mean((predictions - testData$Sepal.Length)^2))

# Print the evaluation metrics
cat("Mean Absolute Error (MAE):", MAE, "\n")
cat("Root Mean Squared Error (RMSE):", RMSE, "\n")
This prints:

Mean Absolute Error (MAE): 0.2266138
Root Mean Squared Error (RMSE): 0.2797267

In this example, we perform a regression analysis on the iris dataset. We split the data into training and testing sets, build a linear regression model, make predictions on the test set, and calculate the evaluation metrics.

# Load the "iris" dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
trainIndex <- sample(1:nrow(iris), 0.7 * nrow(iris))
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]


# Build a classification model (e.g., logistic regression, decision tree, etc.)
# Since Species has three classes, we use multinomial logistic regression
# from the "nnet" package
library(nnet)
model <- multinom(Species ~ ., data = trainData)

# Make predictions on the test set (predicted class labels)
predictedLabels <- predict(model, newdata = testData, type = "class")

# Calculate evaluation metrics
accuracy <- sum(predictedLabels == testData$Species) / nrow(testData)

# Calculate the confusion matrix (rows = predicted class, columns = actual class)
confusionMatrix <- table(predictedLabels, testData$Species)

# Calculate precision: correctly predicted per class / total predicted as that class (row sums)
precision <- diag(confusionMatrix) / rowSums(confusionMatrix)

# Calculate recall: correctly predicted per class / total actually in that class (column sums)
recall <- diag(confusionMatrix) / colSums(confusionMatrix)

# Calculate F1 score
f1_score <- 2 * precision * recall / (precision + recall)

# Print the evaluation metrics

cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1 Score:", f1_score, "\n")
This prints:

Accuracy: 0.9777778
Precision: 1 1 0.9285714
Recall: 1 0.9444444 1
F1 Score: 1 0.9714286 0.962963

Here, Accuracy, Precision, Recall, and F1 Score are computed directly from the confusion matrix using base R. AUC-ROC is not included, since it requires predicted class probabilities and is most naturally defined for a binary outcome (see the sketch earlier).
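
If you prefer a packaged solution, the "caret" package provides confusionMatrix(), which reports these per-class statistics in one call. A minimal sketch, assuming caret is installed (column names can vary slightly between caret versions):

# A sketch using the "caret" package (assumed to be installed); the results
# should match the manual calculations above
library(caret)

cm <- confusionMatrix(predictedLabels, testData$Species)
cm$overall["Accuracy"]                        # overall accuracy
cm$byClass[, c("Precision", "Recall", "F1")]  # per-class precision, recall, and F1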

By running this code, you will obtain the evaluation metrics. Together, these metrics provide insight into a model's fit, its prediction accuracy, and (for regression) the amount of variance explained by the predictors.

This is just a glimpse of how evaluation metrics are used; in the broader modelling paradigm, this concept can and should be applied across many fields of study. Of course, the choice of evaluation metrics depends on the specific problem, the type of analysis, and the nature of the data. It's important to select metrics that align with the research goals and provide meaningful insights into the performance of the model or algorithm being evaluated.

Feel free to skip ahead if you get stuck; it's completely okay not to understand everything on the first go. Keep practising what you have learned so far and connecting the dots, and you will get there in time.

Happy learning!

Keep pushing your limit!


Shweta Dixit

||LearneR||Academician|| Researcher|| Biostatistician||