Udacity AI Product Manager Program Review (Part II)

A Project Centric Review Approach

Erkan Hatipoğlu
Geek Culture
11 min read · Nov 25, 2021


A desk in front of a window with potted plants next to it. There is a laptop, pen, notebook, computer screen, mouse, and keyboard on the desk.
Photo by Grovemade on Unsplash

Overview

This article is the second episode of a three-part series reviewing the AI Product Manager Program offered by Udacity. Readers interested in Part I can follow the link below.

As stated in the first part of this series, I will review the program through its project assignments and publish each project in a separate episode. Hence, readers can read about the first project in Part I, the second in Part II, and the final one in Part III.

For each project, I will try to answer the following questions:

  • What is the project?
  • What is the outcome of the project?
  • What are the sources and tools to complete the project?
  • What is the solution?
  • What are the problems faced during the project implementation?
  • How can those problems be solved?
  • How can the project be improved?
  • What are the references for the project?

So let's start the second project.

Project 2 — Build a Model with Google AutoML

This project is a sequel to Project 1, in which we created directives for a labeling job that differentiates healthy and pneumonia chest x-ray images.

In a typical machine learning (ML) workflow, the step after data gathering and preparation is model building. Although this Nanodegree proposes a slightly different ML workflow that starts with the Business Goal and ends with Active Learning and Tuning, it is no different at this stage: we need to train and evaluate our model.

The Target

So, in this second project of the Nanodegree, we will build four different classification models that distinguish healthy from pneumonia chest x-ray images. There will be no coding in this assignment; instead, we will use Google AutoML. AutoML stands for Automated Machine Learning: a tool that automates the ML workflow and finds the best-performing, deployable model for a given dataset from a set of candidate algorithms. With Google AutoML, we can build and deploy our models in the cloud.

Our project aims to build four models using Google's AutoML Vision platform and understand how data characteristics affect model performance. We will not deploy our models, but we will submit an AutoML Modeling Report for review at the end.

We will use the Kaggle chest x-ray dataset, as in the previous project. We will select four different subsets of the dataset and build one ML model from each subset. We will not label the images on Appen's platform as in the previous project; instead, we will use the original labels given in the dataset.

Google AutoML

To use Google's AutoML Vision platform, we go to Google AutoML, click the TRY AUTOML button, and select AutoML Vision. After signing in (or signing up), we can create our project with any title. Please note that Google may charge you for this project depending on your account status; however, new sign-ups receive a free $300 credit.

The Data

We will use four subsets of the Kaggle chest x-ray dataset, and I suggest preparing all of them before modeling so that the models are built under comparable conditions.

Create a binary classifier to detect pneumonia using chest x-rays

The first subset will contain 200 chest x-ray images, 100 from the 'normal' class and 100 from the 'pneumonia' class.

I will not go into the details of the training process except for one point; if needed, help is readily available from Google and the Udacity project page. Instead, I will focus on the results and the modeling report we will submit for review.

The point I do want to discuss is the train/test split, a concept that is very important in ML.

We can break the model-creation process in ML into three consecutive steps: training, hyperparameter tuning, and evaluation. We need data at each stage to proceed to the next step. However, the data used in each phase must differ, and we have to isolate these datasets from each other at the start; otherwise, we will end up with a poor-performing model, which we do not want. So before creating a model, the first thing to do is split the dataset into three parts: the training set used for training, the validation set used for hyperparameter tuning, and the test set used for evaluation.

So before training, we will divide the dataset into 80%, 10%, and 10% portions, as in the image below. Generally, more training data means better model performance, so we allocate as much data as possible to training (80%).

Screenshot of Google’s AutoML Vision platform showing class distribution for Normal and Pneumonia classes. 80% of the data is for training, 10% is for testing, and 10% is for validation.
Train/Test Split for Dataset-1 — Image by Author
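Google AutoML performs this split for us, but for readers who like to see the idea in code, here is a minimal sketch of an 80/10/10 split over a list of hypothetical file names (not part of the project workflow):

```python
import random

def split_dataset(image_paths, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a list of image paths and split it into train/validation/test sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * train_frac)
    n_val = int(len(paths) * val_frac)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # the remaining ~10%
    return train, val, test

# Example: 200 hypothetical file names -> 160 train, 20 validation, 20 test
train, val, test = split_dataset([f"img_{i:03d}.jpeg" for i in range(200)])
print(len(train), len(val), len(test))  # 160 20 20
```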

After we finish the training, we can evaluate our model by looking at the confusion matrix.

A confusion matrix is a grid that shows the predicted labels against the actual labels. In our case, the confusion matrix below has four cells.

Confusion matrix with cell_11 = 10, cell_12 = 0, cell_21 = 2, and cell_22  = 8.
Confusion Matrix for Dataset-1 — Image by Author

Each cell of the confusion matrix is explained below.

  • True Positives (Row 1 Column 1, TP in short): TP is the number of positive labels predicted as positive by the model.
  • False Negatives (Row 1 Column 2, FN in short): FN is the number of positive labels predicted as negative by the model.
  • False Positives (Row 2 Column 1, FP in short): FP is the number of negative labels predicted as positive by the model.
  • True Negatives (Row 2 Column 2, TN in short): TN is the number of negative labels predicted as negative by the model.

As stated earlier, our test set has 10 'normal' and 10 'pneumonia' images. The confusion matrix tells us that our model has predicted 10 out of 10 'normal' class images as 'normal' and 8 out of 10 'pneumonia' class images as 'pneumonia.'
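AutoML builds this matrix for us, but as a sketch of how the counts arise, the snippet below reproduces the same matrix with scikit-learn from hypothetical label lists that match the numbers above (scikit-learn is not part of the project):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test-set results matching the counts above:
# 10 'normal' images, all predicted 'normal'; 10 'pneumonia' images, 8 predicted correctly.
y_true = ["normal"] * 10 + ["pneumonia"] * 10
y_pred = ["normal"] * 10 + ["pneumonia"] * 8 + ["normal"] * 2
print(confusion_matrix(y_true, y_pred, labels=["normal", "pneumonia"]))
# [[10  0]
#  [ 2  8]]
```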

In machine learning, we use various metrics to evaluate our model. One of the most popular metrics is model accuracy.

Model accuracy measures the percentage of correct predictions out of the total number of predictions across all classes. In mathematical terms:

The formula for accuracy is (TP + TN) divided by the total number of predictions, which is (TP + TN) divided by (TP + FP + TN + FN).

However, relying on accuracy can be misleading, especially for unbalanced datasets. In this project, we will not use accuracy as our evaluation metric; still, we will calculate it twice to see how it can mislead us. The accuracy for the first model is as follows.

The accuracy is (10 + 8) divided by (10 + 2 + 8 + 0), which equals 0.9.
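As a quick check in code, using the counts from the confusion matrix above:

```python
# Model-1 confusion-matrix counts, with 'normal' as the positive class
tp, fn, fp, tn = 10, 0, 2, 8
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)  # 0.9
```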

To evaluate our model, we will use two metrics calculated from the confusion matrix: precision and recall.

These metrics help us understand how the model performs for an individual class as well as across classes.

Model precision measures the percentage of correctly identified instances of a class out of the total number of predictions for that class. In mathematical terms:

The formula of precision is TP divided by (TP + FP).

Model recall measures the percentage of correctly identified instances of a class out of the total number of actual samples of that class. In mathematical terms:

The formula of recall is TP divided by (TP + FN).

So for model-1, which is trained with dataset-1, the precision and recall of the 'normal' class are as follows.

Precision is 10 divided by (10 + 2), which is approximately 0.83, and recall is 10 divided by (10 + 0), which is 1.
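A minimal helper that computes both metrics from the confusion-matrix counts (my own sketch, not part of the project) confirms the numbers for model-1's 'normal' class:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for a single class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Model-1, 'normal' class: TP = 10, FP = 2, FN = 0
print(precision_recall(tp=10, fp=2, fn=0))  # (0.833..., 1.0)
```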

Create an unbalanced binary classifier

The second subset will contain 400 chest x-ray images, 100 from the 'normal' class and 300 from the 'pneumonia' class.

We will start by splitting our dataset into 80%, 10%, and 10% portions.

Screenshot of Google’s AutoML Vision platform showing class distribution for Normal and Pneumonia classes. 80% of the data is for training, 10% is for testing, and 10% is for validation.
Train/Test Split for Dataset-2 — Image by Author

After the training, we will evaluate our model using the confusion matrix.

Confusion matrix with cell_11 = 6, cell_12 = 4, cell_21 = 0, and cell_22 = 30.
Confusion Matrix for Dataset-2 — Image by Author

Our test set has 10 'normal' and 30 'pneumonia' images. The confusion matrix tells us that our model has predicted 6 out of 10 'normal' class images as 'normal' and 30 out of 30 'pneumonia' class images as 'pneumonia.'

Let's calculate precision and recall for the 'normal' class.

Precision is 6 divided by (6 + 0), which is 1, and recall is 6 divided by (6 + 4), which is 0.6.

For the 'normal' class, precision increases from 0.83 to 1, and recall decreases from 1 to 0.6.

Model-2 mainly predicts the majority class. In other words, model-2 tends to assign inputs to the majority class (pneumonia). As a result, recall decreases for the minority class.

If we calculate the accuracy for this model, we can see why accuracy is not a good choice for unbalanced datasets. The accuracy calculation is as follows.

The accuracy is (6 + 30) divided by (6 + 0 + 30 + 4), which equals 0.9.
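Running the same hypothetical helper on model-2's counts makes the problem visible: accuracy stays at 0.9 while recall for the minority 'normal' class drops to 0.6.

```python
# Model-2, 'normal' class counts: TP = 6, FN = 4, FP = 0, TN = 30
tp, fn, fp, tn = 6, 4, 0, 30
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision, recall = precision_recall(tp, fp, fn)  # helper from the earlier sketch
print(accuracy, precision, recall)  # 0.9 1.0 0.6
```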

As we can easily see, although the accuracy of this model is the same as in the clean/balanced case, the model performs much worse. Therefore, as stated earlier, using accuracy for unbalanced datasets can mislead us.

Create a binary classifier with dirty data

For the third dataset, we will begin with the same 100 'normal' and 100 'pneumonia' images used in the first dataset. Then we will flip 30 labels in each class, mislabeling 30% of the data.
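This is done while preparing the data for AutoML, but as a rough sketch of how such a dirty dataset could be generated, assuming the labels are held in a Python dictionary mapping file names to class names (a hypothetical structure, not the project's actual format):

```python
import random

def flip_labels(labels, n_per_class=30, seed=0):
    """Flip n_per_class labels in each class to simulate mislabeled ('dirty') data."""
    rng = random.Random(seed)
    flipped = dict(labels)
    for cls, other in [("normal", "pneumonia"), ("pneumonia", "normal")]:
        members = [name for name, lab in labels.items() if lab == cls]
        for name in rng.sample(members, n_per_class):
            flipped[name] = other
    return flipped
```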

Again we will start by splitting our dataset into 80%, 10%, and 10% portions.

Screenshot of Google’s AutoML Vision platform showing class distribution for Normal and Pneumonia classes. 80% of the data is for training, 10% is for testing, and 10% is for validation.
Train/Test Split for Dataset-3 — Image by Author

After the training, we will evaluate our model using the confusion matrix.

Confusion matrix with cell_11 = 8, cell_12 = 2, cell_21 = 1, and cell_22 = 9.
Confusion Matrix for Dataset-3 — Image by Author

Our test set has 10 'normal' and 10 'pneumonia' images. The confusion matrix shown above tells us that our model has predicted 8 out of 10 'normal' class images as 'normal' and 9 out of 10 'pneumonia' class images as 'pneumonia.'

Let's calculate precision and recall for the 'normal' class.

Precision is 8 divided by (8 + 1), which is approximately 0.89, and recall is 8 divided by (8 + 2), which is 0.8.

For the 'normal' class, precision increases from 0.83 to 0.89, and recall decreases from 1 to 0.8.

We can easily see that the overall number of errors (3) is larger than in the clean/balanced case (2).

Moreover, there may be some dirty data in the test set as well, which makes the evaluation unreliable and implies that real-life performance may be much worse than the evaluation suggests.

Create a three-class model with 'normal,' 'bacterial pneumonia,' and 'viral pneumonia'

If we study the Kaggle chest x-ray dataset in detail, we'll see that pneumonia can be caused by either viruses or bacteria. We can see this by looking at the image names in the dataset: if the image name contains 'virus,' it is viral pneumonia, and if it contains 'bacteria,' it is bacterial pneumonia.
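A small sketch of how the three-class labels can be derived from the file names (the helper is mine; the example file names follow the naming pattern used in the Kaggle dataset):

```python
def label_from_filename(filename):
    """Derive the three-class label from a chest x-ray file name."""
    name = filename.lower()
    if "virus" in name:
        return "viral pneumonia"
    if "bacteria" in name:
        return "bacterial pneumonia"
    return "normal"

print(label_from_filename("person1_virus_6.jpeg"))      # viral pneumonia
print(label_from_filename("person3_bacteria_10.jpeg"))  # bacterial pneumonia
print(label_from_filename("IM-0001-0001.jpeg"))         # normal
```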

For the final model, we will create a three-class dataset with 100 images from the 'normal' class, 100 images from the 'bacterial pneumonia' class, and 100 images from the 'viral pneumonia' class.

We will start by splitting our dataset into 80%, 10%, and 10% portions.

Screenshot of Google’s AutoML Vision platform showing class distribution for Viral, Normal, and Bacterial Pneumonia classes. 80% of the data is for training, 10% is for testing, and 10% is for validation.
Train/Test Split for Dataset-4 — Image by Author

After the training, we will evaluate our model using the confusion matrix.

Confusion matrix with cell_11 = 10, cell_12 = 0, cell_13 = 0, cell_21 = 0, cell_22 = 10, cell_23 = 0, cell_31 = 4, cell_32 = 0, and cell_33 = 6.
Confusion Matrix for Dataset-4 — Image by Author

This time we have ten 'normal,' ten 'viral pneumonia,' and ten 'bacterial pneumonia' class images in our test set. The confusion matrix tells us that our model has predicted 10 out of 10 'viral pneumonia' class images as 'viral pneumonia,' 10 out of 10 'normal' class images as 'normal,' and only 6 out of 10 'bacterial pneumonia' class images as 'bacterial pneumonia.'

We can calculate precision and recall for the 'normal,' 'viral pneumonia,' and 'bacterial pneumonia' classes.

For the 'normal' class:

Precision for the 'normal' class is 10 divided by (10 + 0), which is 1, and recall is 10 divided by (10 + 0), which is 1.

For the 'viral pneumonia' class:

Precision for the 'viral pneumonia' class is 10 divided by (10 + 4), which is approximately 0.71, and recall is 10 divided by (10 + 0), which is 1.

And for the 'bacterial pneumonia' class:

Precision for the 'bacterial pneumonia' class is 6 divided by (6 + 0), which is 1, and recall is 6 divided by (6 + 4), which is 0.6.
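For completeness, the per-class numbers can also be reproduced with scikit-learn's classification_report from hypothetical label lists matching the confusion matrix above (again, not part of the project workflow):

```python
from sklearn.metrics import classification_report

# Test set: 10 images per class; 4 'bacterial' images are predicted as 'viral'.
y_true = ["normal"] * 10 + ["viral"] * 10 + ["bacterial"] * 10
y_pred = ["normal"] * 10 + ["viral"] * 10 + ["bacterial"] * 6 + ["viral"] * 4
print(classification_report(y_true, y_pred))
# precision for 'viral' ~ 0.71, recall for 'bacterial' = 0.60, everything else = 1.00
```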

It seems that the model confuses bacterial pneumonia with viral pneumonia. One way to reduce this confusion is to add more training data. So let's add 20 images to each class to create a new, still balanced dataset and retrain our model.

We now have 120 images from the 'normal' class, 120 from the 'bacterial pneumonia' class, and 120 from the 'viral pneumonia' class. The new confusion matrix is as follows.

Confusion matrix with cell_11 = 11, cell_12 = 0, cell_13 = 1, cell_21 = 0, cell_22 = 12, cell_23 = 0, cell_31 =1, cell_32 = 0, and cell_33 = 11.
Confusion Matrix for Dataset-4 b — Image by Author

We've got a better result now. Let's calculate precision and recall for the last time.

For the 'normal' class:

Precision for the 'normal' class is 12 divided by (12 + 0), which is 1, and recall is 12 divided by (12 + 0), which is 1.

For the 'viral pneumonia' class:

Precision for the 'viral pneumonia' class is 11 divided by (11 + 1), which is approximately 0.92, and recall is 11 divided by (11 + 1), which is approximately 0.92.

And for the 'bacterial pneumonia' class:

Precision for the 'bacterial pneumonia' class is 11 divided by (11 + 1), which is approximately 0.92, and recall is 11 divided by (11 + 1), which is approximately 0.92.

We can see that while the precision score for the 'viral pneumonia' class and the recall score for the 'bacterial pneumonia' class increase considerably, the recall score for the 'viral pneumonia' class and the precision score for the 'bacterial pneumonia' class decrease slightly. On balance, this indicates better performance.

As a final step, we will introduce a new metric to quantify this performance increase: the F1 score. The F1 score is the harmonic mean of precision and recall and can be calculated from those two values. In mathematical terms:

The F1 score equals two times precision times recall divided by the sum of precision and recall.
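In code, a minimal sketch of this formula looks like this:

```python
def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(1.0, 0.6), 2))  # 0.75
```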

We will now calculate the F1 score of each class for the two versions of the dataset (dataset-4 and dataset-4b).

For the 'normal' class:

For both datasets, the F1 score for the 'normal' class equals 2 times 1 times 1 divided by (1 + 1), which is 1.

For the 'viral pneumonia' class:

For dataset-4, the F1 score for the 'viral pneumonia' class equals 2 times 0.71 times 1 divided by (0.71 + 1), which is approximately 0.83. For dataset-4b, the F1 score for the 'viral pneumonia' class equals 2 times 0.92 times 0.92 divided by (0.92 + 0.92), which is 0.92.

And for the 'bacterial pneumonia' class:

For dataset-4, the F1 score for the 'bacterial pneumonia' class equals 2 times 1 times 0.6 divided by (1 + 0.6), which is 0.75. For dataset-4b, the F1 score for the 'bacterial pneumonia' class equals 2 times 0.92 times 0.92 divided by (0.92 + 0.92), which is 0.92.

The F1 score improves for both viral and bacterial pneumonia classes, suggesting a performance increase.

Troubleshooting

The most challenging part of this project was creating an AutoML Vision account, since Google sometimes rejects debit cards; trying a different card generally solves the issue.

Another problem can arise while uploading the datasets: sometimes Google does not upload all the images and complains about duplicates. Manually adding replacement images solves the issue.

Finally, the unbalanced model sometimes gives unexpectedly good results; perhaps Google AutoML rebalances the dataset behind the scenes. Using a new dataset or retraining the model solves the issue.

Conclusion

In this project, we have learned how data properties affect model performance. We have seen that dirty and unbalanced data significantly decrease model performance.

In addition, we have seen that using accuracy may mislead us; precision, recall, and F1 scores give a more reliable picture, especially for unbalanced datasets.

We have also learned how to calculate precision, recall, and F1 score using the confusion matrix.

As a final note, some of you may have noticed that although I promised at the beginning of this article to discuss possible improvements to the project, I haven't mentioned any; I will cover those improvements in the next episode. Stay tuned for the final episode. If you liked this subject or are interested in it, refer to my GitHub repository or my first episode on the Udacity AI Product Manager Program below.

Thank you for reading!
