A brief guide to Model Evaluation Techniques: Machine Learning

Great Learning

Image courtesy: Great Learning

In machine learning, we often use classification models to predict labels for population data. Classification is one of the two branches of supervised learning, and it deals with data belonging to different categories. A training data set teaches the model to predict the unknown labels of the population data. There are multiple algorithms: logistic regression, K-nearest neighbours, decision trees, Naive Bayes etc. Each of these algorithms has its own style of execution and its own prediction technique. To find the most suitable algorithm for a particular business problem, there are a few model evaluation techniques. In this article, those model evaluation techniques will be discussed.

Confusion Matrix

It probably got its name from the state of confusion it deals with. If you remember hypothesis testing, you may recall the two errors we defined as type-I and type-II. As depicted in Fig.1, a type-I error occurs when you reject a null hypothesis that is actually true. A type-II error occurs when the null hypothesis is actually false, but you fail to reject it.

Fig.1: Type-I and Type-II errors

Figure 1 depicts clearly that the choice of significance level affects the probabilities of these errors. And if you try to reduce either of these errors, the other one will increase.

So, what is confusion matrix?

Fig.2: Confusion Matrix

The confusion matrix is the image given above: a matrix representation of the results of any binary testing. For example, let us take the case of predicting a disease. You have done some medical tests, and with the help of their results, you are going to predict whether the person has the disease. In effect, you are going to validate whether the hypothesis of declaring that person as having the disease is acceptable or not. Say, among 100 people, you predict that 20 have the disease. In reality, only 15 people have the disease, and among those 15 you have diagnosed 12 correctly. If we put the result in a confusion matrix, it will look like the following:

Fig.3: Confusion matrix for predicting a disease

So, if we compare fig.3 with fig.2, we will find:

  1. True Positive: 12 (You have predicted the positive case correctly!)
  2. True Negative: 77 (You have predicted negative case correctly!)
  3. False Positive: 8 (You have predicted these people as having the disease, which they actually don't. But do not worry: this can be rectified in further medical analysis, so it is a low-risk error. With "the person has the disease" as our hypothesis, this is the type-II error.)
  4. False Negative: 3 (Oh no! You have predicted these three poor fellows as fit, but they actually have the disease. This is dangerous! Be careful! In this case, this is the type-I error.)

Now, if I ask what the accuracy of the prediction model behind these results is, the answer is the ratio of the number of accurate predictions to the total number of people, which is (12+77)/100 = 0.89. If you study the confusion matrix thoroughly, you will find the following:

  1. The top row depicts the total number of people you predicted to have the disease. Among these predictions, 12 people actually have the disease. So the ratio 12/(12+8) = 0.6 measures how accurate your model is when it declares that a person has the disease. This is called the Precision of the model.
  2. Now, take the first column. This column represents the total number of people who actually have the disease, and you have predicted 12 of them correctly. So the ratio 12/(12+3) = 0.8 measures how many of the people who actually have the disease your model detects. This is termed the Recall.
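The calculations above can be sketched in a few lines of Python, using the counts from Fig.3:

```python
# Metrics from the disease-prediction confusion matrix in Fig.3.
TP, TN, FP, FN = 12, 77, 8, 3

total = TP + TN + FP + FN          # 100 people in all
accuracy = (TP + TN) / total       # correct predictions / all predictions
precision = TP / (TP + FP)         # correct "disease" calls / all "disease" calls
recall = TP / (TP + FN)            # correct "disease" calls / all actual disease cases

print(accuracy)   # → 0.89
print(precision)  # → 0.6
print(recall)     # → 0.8
```

Note how precision and recall share the same numerator (the true positives) but divide by different totals: the top row of the matrix for precision, the first column for recall.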

Now, you may ask: why do we need to measure precision or recall to evaluate the model, rather than accuracy alone?

The answer is that these measures are highly recommended when a particular outcome is very sensitive. For example, suppose you are building a model for a bank to predict fraudulent transactions. Fraudulent transactions are not very common: in 1000 transactions, there may be only 1 fraud. So, undoubtedly, your model will predict non-fraudulent transactions very accurately. In this case, the overall accuracy does not matter, as it will always be very high irrespective of how well the fraudulent transactions are predicted, simply because they form a very low percentage of the whole population. But predicting a fraudulent transaction as non-fraudulent is not desirable. So, in this case, the measurement of recall plays a vital role in evaluating the model: it tells you, out of all the actual fraudulent transactions, how many are being caught. If recall is low, then even if the overall accuracy is high, the model is not acceptable.

Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds (see fig.2). In our disease-detection example, the TPR is the ratio between the number of correct predictions of people having the disease and the total number of people who actually have it (the same as recall). The FPR is the ratio between the number of people incorrectly predicted to have the disease and the total number of people who do not actually have it. So, if we plot the curve, it comes out like this:

Fig.4: ROC curve (source: https://www.medcalc.org/manual/roc-curves.php)

The blue line denotes how the TPR changes with the FPR for a model. The larger the ratio of the area under the curve to the total area (100 x 100 in this case), the more accurate the model. If the area under the curve (AUC) reaches 1, the model separates the classes perfectly on this data, which in practice often signals overfitting; if it is at or below 0.5 (i.e. when the curve lies along the dotted diagonal line), the model is no better than random guessing and too inaccurate to use.
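To make the mechanics concrete, here is a minimal sketch of how an ROC curve and its AUC can be computed by hand. The labels and scores below are made-up illustrative data, not from the article's example: each threshold on the model's scores yields one (FPR, TPR) point, and the AUC is the area under the resulting curve.

```python
y_true  = [1, 1, 1, 0, 1, 0, 0, 0]                    # 1 = has the disease
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # model's predicted probabilities

def roc_points(y_true, y_score):
    """One (FPR, TPR) point per threshold, sweeping from strict to lenient."""
    P = sum(y_true)              # actual positives
    N = len(y_true) - P          # actual negatives
    points = [(0.0, 0.0)]
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    points.append((1.0, 1.0))
    return points

def auc(points):
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points(y_true, y_score)
print(auc(pts))  # → 0.9375
```

An AUC of about 0.94 here means the model ranks a randomly chosen positive above a randomly chosen negative roughly 94% of the time; in production code a library routine such as scikit-learn's `roc_auc_score` would be used instead.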

For classification models, there are many other evaluation methods, like gain and lift charts, the Gini coefficient etc. But in-depth knowledge of the confusion matrix can help you evaluate any classification model very effectively. So, in this article, I have tried to demystify the confusion around the confusion matrix to help the readers.

Example: Machine Learning Models Spotify uses to recommend music you’ll like

In the early 2000s, Songza implemented a manual music recommendation system for its listeners, where a team of music experts and curators would create playlists. But these recommendations were not objective, as they were dependent on the personal taste of the curators.

It was an average experience for listeners, with a fair share of hits and misses, because it was impossible to make a playlist which catered to the varied tastes of a diverse set of people. The technology and the data did not exist back then to build a playlist that would be personalized to the taste of each individual listener.

Along came Spotify a few years later, offering a highly personalized weekly playlist called Discover Weekly that quickly became one of their flagship offerings.

Every Monday, millions of listeners receive a fresh playlist of new song recommendations, customized to their personal tastes based on their listening history and the songs they’ve engaged with. Spotify uses a combination of different data aggregation and sorting methods to create their unique and powerful recommendation model that’s powered by machine learning.

"One of our flagship features is called Discover Weekly. Every Monday, we give you a list of 50 tracks that you haven't heard before that we think you're going to like. The ML engine that's the main basis of it, and it's advanced some since, had actually been around at Spotify a bit before Discover Weekly was there, just powering our Discover page" — David Murgatroyd, Machine Learning Leader at Spotify.

Spotify uses three forms of recommendation models to power Discover Weekly.

1. Collaborative Filtering

Collaborative Filtering is a popular technique used by recommender systems to make automated predictions about the preferences of users, based on the preferences of other, similar users.

On Spotify, the collaborative filtering algorithm compares the many user-created playlists containing the songs that users have listened to. The algorithm then combs those playlists for other songs that appear alongside them and recommends those songs.

This framework is executed with matrix math in Python libraries. The algorithm first creates a matrix of all the active users and songs. The Python library then runs a series of matrix factorization formulas on it. The end result is two separate sets of vectors: each user vector X represents the taste of an individual user, and each song vector Y represents the profile of a single song. To find users with similar taste, collaborative filtering compares a given user vector with every other user vector and outputs the most similar ones. The same procedure is applied to the song vectors.
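The factorization idea described above can be sketched on a toy example. This is purely illustrative (Spotify's actual pipeline is far larger and not public): we factor a small user-by-song play-count matrix with SVD into user vectors and song vectors, then compare users by cosine similarity.

```python
import numpy as np

# Toy play-count matrix: rows are users, columns are songs.
plays = np.array([
    [5, 3, 0, 1],   # user 0
    [4, 2, 0, 1],   # user 1 (tastes like user 0)
    [0, 0, 4, 5],   # user 2 (very different tastes)
], dtype=float)

# Truncated SVD: keep k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(plays, full_matrices=False)
X = U[:, :k] * s[:k]      # one row per user -> user taste vectors
Y = Vt[:k, :].T           # one row per song -> song profile vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# User 1 should be far more similar to user 0 than user 2 is.
print(cosine(X[0], X[1]) > cosine(X[0], X[2]))  # → True
```

At Spotify's scale, an exact SVD is impractical; large recommenders typically use approximate methods such as alternating least squares on implicit feedback, but the resulting user and song vectors play the same role as X and Y here.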

Spotify does not only rely on collaborative filtering. The second recommendation model used is NLP.

2. Natural Language Processing

NLP is the ability of an algorithm to understand speech and text in real-time. Spotify’s NLP constantly trawls the web to find articles, blog posts, or any other text about music, to come up with a profile for each song.

With all this scraped data, the NLP algorithm can classify songs based on the kind of language used to describe them, and can match them with other songs that are discussed in the same vein. Artists and songs are assigned classifying keywords based on the data, and each term has a certain weight assigned to it. Similar to collaborative filtering, a vector representation of each song is created and used to suggest similar songs.
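A minimal sketch of this keyword-weighting idea follows. The song descriptions are made up, and the weighting shown (TF-IDF) is one standard choice, not necessarily what Spotify uses: each song becomes a weighted term vector, and songs described in similar language end up close under cosine similarity.

```python
import math

# Hypothetical scraped text about three songs (illustrative data only).
docs = {
    "song_a": "dreamy ambient synth mellow chill",
    "song_b": "ambient chill mellow electronic synth",
    "song_c": "aggressive fast metal guitar riff",
}

def tfidf(docs):
    """Weight each term by its frequency in the doc and its rarity across docs."""
    N = len(docs)
    df = {}                                    # document frequency per term
    for text in docs.values():
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    vectors = {}
    for name, text in docs.items():
        terms = text.split()
        vectors[name] = {t: (terms.count(t) / len(terms)) * math.log(N / df[t])
                         for t in set(terms)}
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf(docs)
# Songs described with similar language should score as more alike.
print(cosine(vecs["song_a"], vecs["song_b"]) > cosine(vecs["song_a"], vecs["song_c"]))  # → True
```

Production NLP systems would go well beyond bag-of-words weights (embeddings, entity recognition and so on), but the output is the same in spirit: a vector per song that supports similarity lookups.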

3. Convolutional Neural Networks

Convolutional Neural Networks are used to hone the recommendation system and to increase accuracy because less-popular songs might be neglected by the other models. The CNN model ensures that obscure and new songs are considered.

The CNN model is most popularly used for facial recognition, and Spotify has configured the same kind of model for audio files. Each song is converted into a raw audio waveform. These waveforms are processed by the CNN, which assigns key parameters such as beats per minute, loudness, major/minor key and so on. Spotify then tries to match songs whose parameters are similar to those of the songs its listeners like.
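The core operation inside such a network can be illustrated in a few lines. This is a deliberately tiny sketch of one convolution-plus-pooling stage over a synthetic waveform, not Spotify's actual model, which stacks many learned filters and layers:

```python
import numpy as np

# A synthetic "waveform": a sine wave with a little noise.
rng = np.random.default_rng(0)
waveform = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.1 * rng.standard_normal(4000)

def conv1d(signal, kernel):
    """Slide a small filter along the signal, one dot product per position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

kernel = np.array([-1.0, 0.0, 1.0])                       # a simple slope detector
feature_map = np.maximum(conv1d(waveform, kernel), 0.0)   # ReLU activation

# Max-pool the feature map into a compact, fixed-length descriptor.
pooled = feature_map[: len(feature_map) // 100 * 100].reshape(-1, 100).max(axis=1)
print(pooled.shape)  # → (39,)
```

A real audio CNN learns its kernels from data (and usually operates on spectrograms rather than raw samples), but the pipeline is the same shape: convolve, apply a nonlinearity, pool, and feed the resulting feature vector onward.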

With these key machine learning models, Spotify is able to tailor a unique playlist of music that surprises its listeners every week with songs they would have never found otherwise.

A key problem in many machine learning models is the lack of access to clean, structured data that can be processed. Spotify has been able to circumvent that problem due to their access to massive amounts of data that they collect from their users. They’ve been able to shine as a great example of effective use of Machine Learning models to give their users an unrivaled personalized experience.

Saikat Bhattacharya is a Senior Software engineer at Freshworks, and is pursuing the PGP-Machine Learning program from Great Learning. This article originally appeared on Towards Data Science and has been syndicated with permission from the author.

Happy modelling!


Great Learning is an ed-tech company for professional and higher education that offers comprehensive, industry-relevant programs.