Evaluating a Machine Learning Model

Skyl.ai
5 min read · Sep 10, 2019

So, you have trained your machine learning model. Maybe you’ve built a project that can detect pneumonia in a lung or filter through text. From here, you have to ask yourself:

How do I know this model will succeed? How will it perform in production?

To answer this important question, we need to understand how to evaluate a machine learning model. This is one of the core tasks in a machine learning workflow, and predicting and planning for a model’s success in production can be a daunting task.

What is Model Evaluation?

Model evaluation is the process through which we quantify the quality of a system’s predictions. To do this, we measure the newly trained model’s performance on a new and independent dataset, comparing the model’s own predictions against the labeled (ground-truth) data.

Model evaluation performance metrics answer questions such as:

  • How well is our model performing?
  • Is our model accurate enough to put into production?
  • Will a larger training set improve the model’s performance?
  • Is the model under-fitting or over-fitting?

There are four different outcomes that can occur when your model performs classification predictions:

  • True positives occur when your system predicts that an observation belongs to a class and it actually does belong to that class.
  • True negatives occur when your system predicts that an observation does not belong to a class and it does not belong to that class.
  • False positives occur when you predict an observation belongs to a class when in reality it does not. Also known as a type I error.
  • False negatives occur when you predict an observation does not belong to a class when in fact it does. Also known as a type II error.
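
As a quick sketch, these four outcomes can be counted directly from a model’s binary predictions. The labels and predictions below are made up purely for illustration:

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

true_positives  = np.sum((y_pred == 1) & (y_true == 1))
true_negatives  = np.sum((y_pred == 0) & (y_true == 0))
false_positives = np.sum((y_pred == 1) & (y_true == 0))  # type I errors
false_negatives = np.sum((y_pred == 0) & (y_true == 1))  # type II errors

print(true_positives, true_negatives, false_positives, false_negatives)  # 3 3 1 1
```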

From the outcomes listed above, we can evaluate a model using various performance metrics.

Metrics for classification models

The following metrics are reported when evaluating classification models:

  • Accuracy measures the proportion of true results to total cases. Aim for a high accuracy rate.
    accuracy = # correct predictions / # total data points
  • Log loss is a single score that represents the advantage of the classifier over a random prediction. It measures the uncertainty of your model by comparing the probabilities of its outputs to the known values (ground truth). You want to minimize log loss for the model as a whole.
  • Precision is the proportion of true positives among all of the model’s positive predictions: TP / (TP + FP).
  • Recall is the fraction of actual positives that the model correctly identifies: TP / (TP + FN).
  • F1-score is the harmonic mean of precision and recall, ranging between 0 and 1, where the ideal F1-score value is 1.
  • AUC measures the area under the ROC curve, plotted with the true positive rate on the y-axis and the false positive rate on the x-axis. This metric is useful because it provides a single number that lets you compare models of different types.
  • Confusion matrix tabulates the model’s classifications against the actual labels. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes; in a binary classification problem, N = 2.

Suppose the test dataset contains 100 examples in the positive class and 200 examples in the negative class; then the confusion matrix might look something like this:

                      Predicted positive    Predicted negative
    Actual positive            80                    20
    Actual negative             5                   195

Looking at the matrix, one can clearly tell that the positive class has lower accuracy (80/(20 + 80) = 80%) than the negative class (195/ (5 + 195) = 97.5%). This information is lost if one only looks at the overall accuracy, which in this case would be (80 + 195)/(100 + 200) = 91.7%.
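
As a minimal sketch, here is how these metrics can be computed with scikit-learn on the example counts above (log loss and AUC additionally require predicted probabilities rather than hard labels, so they are omitted here):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Reconstruct the example: 100 positive examples (80 predicted correctly)
# and 200 negative examples (195 predicted correctly).
y_true = np.array([1] * 100 + [0] * 200)
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 5 + [0] * 195)

print(confusion_matrix(y_true, y_pred))               # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))   # (80 + 195) / 300 ≈ 0.917
print("precision:", precision_score(y_true, y_pred))  # 80 / (80 + 5)  ≈ 0.941
print("recall   :", recall_score(y_true, y_pred))     # 80 / (80 + 20) = 0.800
print("f1-score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```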

Learn with an example:

To understand model evaluation, let us consider a featured Skyl.ai AI model, Q&A Topic Tags, which classifies and tags questions with frame_type, is_scripting_language, and language.

This model is trained using Skyl’s state-of-the-art deep learning algorithm on a feature set of 6,550 records, split 90:10 into training and test sets.

This newly trained model has a training accuracy of 80.39% on the 5,895-record training set and an evaluation score of 80.2% on the 655-record test set.
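
The split itself is handled inside the Skyl.ai workflow, but conceptually it corresponds to something like the following sketch, which uses stand-in random data sized like the feature set above (6,550 records, 90:10 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data purely for illustration; the real feature set lives on the platform.
X = np.random.rand(6550, 20)       # hypothetical feature vectors
y = np.random.randint(0, 2, 6550)  # hypothetical binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

print(len(X_train), len(X_test))   # 5895 training records, 655 test records
```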

How do I evaluate models using new data?

To evaluate a model on unseen production data, start by manually labeling the data with its actual class:

Once you have labeled this data, start predicting on a small scale using your model. Upload the CSV file you have prepared, and then select the “Start Evaluation” button.

Once the model evaluation process of predicting the output is complete, Skyl computes all the model evaluation performance metrics based on the predicted and actual values.
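
Conceptually, the metric computation behind the “Start Evaluation” step boils down to comparing the predicted and actual values, roughly as in this sketch. The CSV name, the column names, and the trained_model object are hypothetical placeholders, not Skyl.ai API calls:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical file and column names; `trained_model` stands in for your own model.
eval_df = pd.read_csv("production_sample.csv")        # manually labeled production data
y_actual = eval_df["actual_class"]                    # labels you assigned by hand
y_predicted = trained_model.predict(eval_df["text"])  # small-scale predictions from your model

print(accuracy_score(y_actual, y_predicted))
print(classification_report(y_actual, y_predicted))   # precision, recall, F1 per class
```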

When do I start Model Evaluation?

  1. After you have trained a model, using the held-out test and validation set.
  2. Periodically, to check your model for drift, by evaluating new production data that has been labeled with its actual class (a minimal sketch follows this list).
  3. After adding new features to increase the scope of your model.
  4. During hyper-parameter tuning, which is guided by the evaluation metrics.
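
For item 2, a drift check can be as simple as comparing the score on a freshly labeled batch of production data against the score recorded at the last evaluation. This is a hypothetical sketch, and the 0.05 tolerance is an assumption rather than a platform default:

```python
from sklearn.metrics import f1_score

def check_for_drift(y_actual, y_predicted, baseline_f1, tolerance=0.05):
    """Flag possible drift when F1 on newly labeled data drops below the baseline."""
    current_f1 = f1_score(y_actual, y_predicted)
    return current_f1, current_f1 < baseline_f1 - tolerance

# Example with a small hand-labeled batch of production data
current_f1, drifted = check_for_drift([1, 0, 1, 1, 0], [1, 0, 0, 1, 1], baseline_f1=0.90)
if drifted:
    print(f"Possible drift: F1 dropped to {current_f1:.3f}; consider re-evaluating or re-training.")
```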

Conclusion

Model evaluation is a core task in the machine learning workflow, and it is not something you do just once. Knowing the ins and outs of model evaluation will help you avoid many unhappy incidents on the way to machine learning. Happy modeling!

About Skyl.ai: Our aim at Skyl.ai is to lower the barrier of entry to make AI available to the largest possible community of developers, researchers, and businesses.

Using Skyl.ai, businesses with limited ML expertise can start building their own high-quality custom models using advanced techniques such as deep learning and transfer learning. To get started, sign up with us at Skyl.ai.

Originally Published: https://blog.skyl.ai/evaluating-a-machine-learning-model/
