Guide into Data Science Interview Questions & Answers (3/5)

“Science is a way of life… Science is the process that takes us from confusion to understanding…” Brian Greene

Jose Del Rio
4 min read · Feb 13, 2022

Contents: 1, 2, (3)

Photo by Possessed Photography on Unsplash

We saw a little SQL in the first part of the series… then some basic probability problems in the second… now it is time to explore more interesting areas, starting with some general questions in the field of Data Science.

Classifier

A very common problem in real applications is the classification of whatever is relevant for your company or users. So we will want to automate the classification of… your inventory, images (cars, people, fruits, toys…), articles (politics, breaking news, social, entertainment…), videos, music, users, and so on.

In this scenario you typically already have something to start from: an initial, relatively small dataset in which the items are already classified into the labels or categories you want to predict. From it, you want to learn how to classify your larger or future dataset into those same labels.

Once we have built a model that has learnt from the categories we had, we will want to understand how good it is. How do we do that? What metrics are we going to use? How do we compare models with different architectures?

To answer those questions, in the following interview question we are going to explore the meaning of the confusion matrix:

We have a classifier that predicts whether an image contains
a cat, a dog, or a mouse.
What is the accuracy of the model, as a percentage?

The confusion matrix of the model (rows are the predicted classes, columns the actual classes) is as follows:

             actual dog   actual cat   actual mouse
pred dog         22            4             1
pred cat          2           16             3
pred mouse       7            2            19

And you are given the following options:

  • 45%
  • 60%
  • 75%
Photo by Chewy on Unsplash

The first step is to understand what the confusion matrix is telling us. If we follow the relation between what we predict and the true or actual values, we have:

  • True Positives (TP): Everything in the diagonal; we predict a class and the item actually belongs to that class.
  • False Positives (FP): What we predict as the class but is not. In our example, from the point of view of dog predictions, the 4 and 1 of the first row are false positives. In a perfect model those values would have been zero and the items would have ended up in the following rows.
  • False Negatives (FN): What we predict as not the class but actually is. Again in our example, the 2 and 7 of the first column should be dogs but the model predicted them as something else. In a perfect model those values would also have been zero and, instead of 22, the diagonal value would have been 31.
  • True Negatives (TN): What we predict as not the class and that is indeed not the class. In a binary (true/false) problem these are simply the negatives classified correctly; for a single class such as ‘dog’ in our table, they are the cells that touch neither the dog row nor the dog column (basically, cats and mice correctly classified as no-dogs). A short sketch after this list shows how to read these four counts from the matrix.
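
To make the bookkeeping concrete, here is a minimal sketch in Python (assuming numpy is available, and using the matrix above with rows as predictions and columns as actual classes) that extracts the four counts for the ‘dog’ class:

import numpy as np

# Confusion matrix from the question: rows = predicted, columns = actual.
# Class order: dog, cat, mouse.
cm = np.array([[22,  4,  1],
               [ 2, 16,  3],
               [ 7,  2, 19]])

c = 0                          # index of the class we analyse ('dog')
tp = cm[c, c]                  # predicted dog, actually dog
fp = cm[c, :].sum() - tp       # predicted dog, actually cat or mouse (4 + 1)
fn = cm[:, c].sum() - tp       # actual dogs predicted as something else (2 + 7)
tn = cm.sum() - tp - fp - fn   # cells touching neither the dog row nor the dog column

print(tp, fp, fn, tn)          # 22 5 9 40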

Now, if we recall the definition of accuracy, we want the correct predictions divided by all cases, correct and incorrect. For a binary problem this is written as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For a multi-class matrix like ours, this reduces to the sum of the diagonal divided by the total number of samples:

Total = 22 + 4 + 1 + 2 + 16 + 3 + 7 + 2 + 19 = 76
Accuracy = (22 + 16 + 19) / 76 = 0.75
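
As a quick check with numpy (reusing the cm array from the sketch above), the multi-class accuracy is just the trace of the matrix divided by its total:

accuracy = np.trace(cm) / cm.sum()   # (22 + 16 + 19) / 76
print(round(accuracy, 2))            # 0.75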

Other typical questions related to classification may ask for the precision or the recall of the model.

The idea of precision is to understand how many of the cases we predict as positive are actually positive (our predictions include some false positives):

Precision = TP / (TP + FP)

Recall, on the other hand, tells us how many of the actual positives we manage to detect, giving us a view into how many true positives the model is missing:

Recall = TP / (TP + FN)

In our case, the results for the class ‘dog’ are:

Precision (dog) = 22 / (22 + 4 + 1) = 0.815
Recall (dog) = 22 / (22 + 2 + 7) = 0.71
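
Reusing the same cm array, both metrics can be computed for every class at once: with rows as predictions, precision divides the diagonal by the row totals and recall divides it by the column totals (a sketch under that layout assumption, not the only possible orientation):

tp = np.diag(cm)                 # [22, 16, 19]
precision = tp / cm.sum(axis=1)  # row totals: everything we predicted as each class
recall    = tp / cm.sum(axis=0)  # column totals: everything that actually is each class
print(precision[0], recall[0])   # ~0.815 and ~0.710 for 'dog'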

And if we want the metrics for the model across all classes, we calculate them per class and then take the mean (the macro average). So we just repeat the process for each class, simple, right?
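
Continuing the sketch above, the macro-averaged metrics are simply the mean of the per-class values:

macro_precision = precision.mean()   # mean of [0.815, 0.762, 0.679]
macro_recall    = recall.mean()      # mean of [0.710, 0.727, 0.826]
print(round(macro_precision, 3), round(macro_recall, 3))   # 0.752 0.754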

It is important to notice that all the metrics can be derived from the Confusion Matrix, so it is fundamental to understand what information it represents and how to read it.

Thanks for reading

I hope you have enjoyed this article. Please leave your thoughts and ideas if you are interested in the topic.

