How to build a machine learning model without coding knowledge

Asli Solmaz-Kaiser
Published in Analytics Vidhya · 8 min read · Jan 5, 2020

Manager’s Guide to Demystify A.I. — Part 2

Little Anton was on summer holiday with his parents in Italy. Every time he wanted to buy an ice cream, he had to ask his parents for help, as he did not speak Italian. So Anton decided to learn Italian. The following year, when he was again on holiday in Italy, he was able to buy ice cream by himself, whenever he wanted. Anton was free thanks to his language skills. Learning makes you free… (Story from the children’s philosophy book series “Les Goûters philo” by Brigitte Labbé.)

Photo by Sharon McCutcheon on Unsplash

This is exactly how I felt when I built my first machine learning (ML) model. I was “free” thanks to my newly learned skills, able to understand and evaluate AI business cases more easily, more quickly and in greater depth than before.

Building an ML model is also what this blogpost is about. It is Part 2 of the series “Manager’s Guide to Demystify A.I.”.

The series is made up of 3 parts:

  1. Understand the steps involved in developing an AI model (Part 1)
  2. Build your own AI prototype using Google AutoML (Part 2)
  3. Build the right AI model, avoiding bias (Part 3)

Build your own AI prototype using Google AutoML

In order to start building your ML model, you need to prepare:

  1. A correctly labeled dataset — in this example we will use a publicly available dataset
  2. Registration on an AutoML platform — in this example we will use Google AutoML

Please note: As Google AutoML is still in beta, you get $300 of credits for free, which is an excellent opportunity to try out the platform. Just be careful, as the credits run out pretty fast — though you will not be billed unless you explicitly upgrade your account.

The business case we will be looking at is as follows:

There are chest X-Ray images of different patients. The images show the lungs of the patients, some healthy (normal) and some showing symptoms of pneumonia (an infection of the lungs). We want the machine to predict whether a given X-Ray image shows pneumonia symptoms or not.

Here are the steps to be followed:

  1. Train the model
  2. Evaluate
  3. Predict or “Test & Use”
1. Train the model

Preparation of the dataset, data labelling

Training the ML model is the process of teaching the model what the different images show so that it can make the necessary predictions. In order to train your ML model, you need to upload the source datasets to the platform.

Kaggle Chest X-Ray Dataset, categorised as “normal” and “pneumonia”.

Once uploaded, these datasets are used to teach the machine what an X-Ray with pneumonia looks like compared to a normal X-Ray. In this example we will be using images available on Kaggle. (Kaggle is an online community of data scientists and machine learning practitioners where you can find datasets for your ML trials.)

In machine learning, labeled (or annotated) data means data that is able to teach the model what it needs to learn. As a result of this learning, your machine learning model can predict the answer to your question — e.g. “Does this X-Ray show symptoms of pneumonia?”. In this example, the labeled data is the data categorised as “normal” or “pneumonia”. The process of creating such a dataset is called “data annotation” or “data labelling”. In a labelled dataset, as you see above, not every individual image needs to be named “pneumonia” or “normal”. It is enough if the images are categorised accordingly — placed in the respective folders.
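To make the idea of “labelling by folder” concrete, here is a minimal Python sketch that assumes a local copy of the Kaggle dataset with one subfolder per label (the directory path is hypothetical):

    import os

    # Hypothetical local copy of the Kaggle dataset: one subfolder per label
    DATA_DIR = "chest_xray/train"

    # Build (image path, label) pairs purely from the folder structure --
    # this is all the "annotation" the model needs in this setup
    labeled_data = []
    for label in ["normal", "pneumonia"]:
        folder = os.path.join(DATA_DIR, label)
        for filename in os.listdir(folder):
            labeled_data.append((os.path.join(folder, filename), label))

    print(len(labeled_data), "labeled images found")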

Below is an example image from the source dataset used to train the ML model. This is the image of a “normal” chest X-Ray. As you can see, it is clear, without cloudiness or opaque areas.

The chest X-Ray of a “normal” patient, included in the folder “normal”

On the other hand, the chest X-Ray of a pneumonia patient shows unclear, opaque and cloudy areas. Below is an example image from the “pneumonia” source dataset.

The chest X-Ray of a pneumonia patient, included in the folder “pneumonia”

By examining these datasets, the model learns what a normal chest X-Ray looks like vs. a pneumonia chest X-Ray. The more images the model is provided with, the better it can learn.

We upload both folders (normal / pneumonia) to the Google AutoML platform with the correct labelling. The folders need to be zipped and uploaded together.
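If you prefer scripting the upload over clicking through the web UI, a minimal sketch with the google-cloud-storage Python client could look like the following (the bucket and file names are hypothetical, and it assumes your Google Cloud credentials are already configured):

    from google.cloud import storage

    # Connect using the credentials configured for your Google Cloud project
    client = storage.Client()
    # Hypothetical bucket belonging to the AutoML project
    bucket = client.bucket("my-automl-xray-bucket")

    # Upload the zipped, labeled folders prepared earlier
    blob = bucket.blob("datasets/chest_xray.zip")
    blob.upload_from_filename("chest_xray.zip")
    print("Uploaded to", blob.name)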

This is how it will look once the data is uploaded

Tips about preparation of the dataset:

When you upload the dataset, it is important that:

  • the data is balanced: the data should include the same number of X-Rays of normal and pneumonia lungs (the different categories). An unbalanced dataset may lead to incorrect predictions by the ML model (a quick balance check is sketched after the note below).
  • the data is correctly labeled: only the healthy lungs are in the healthy folder and only the pneumonia lungs are in the pneumonia folder. Mixing these up may lead to incorrect predictions by the ML model.

Please note: A clean, balanced dataset, even if it is not a big one, will lead to more precise predictions and will cost less than a larger but unclean or unbalanced one.
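A quick way to check the balance point from the tips above is to simply count the files per label folder. A minimal sketch, assuming the hypothetical local folder layout used earlier:

    import os

    # Hypothetical local paths to the two label folders
    folders = {"normal": "chest_xray/train/normal",
               "pneumonia": "chest_xray/train/pneumonia"}

    counts = {label: len(os.listdir(path)) for label, path in folders.items()}
    print(counts)

    # Flag a large imbalance; the 50% threshold is an arbitrary choice
    if min(counts.values()) < 0.5 * max(counts.values()):
        print("Warning: dataset looks unbalanced, consider rebalancing")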

It may not always be possible to find enough data that is already labelled for your specific use case. In that case you can look for a data annotation company that can support you with this task. This type of work, despite being crowdsourced to lower-cost countries like India or the Philippines, still makes up one of the largest cost items in machine learning. I will go into the details of data annotation in a separate blogpost. For now, let’s assume you have the data in hand, ready to upload.

Please note: It is crucial to do the necessary checks to ensure that the ML model does not have bias. This will ensure an objective and precise prediction. We will look into this topic in detail in the next blogpost.

Uploading the labeled data

When you upload the data, the platform will use part of the data for validation and test purposes, while the majority is used for training. The validation and test data will help you evaluate the robustness of the model in the evaluation phase (we will come back to this later).
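The platform does this split for you automatically, but the underlying idea is simple. Here is a minimal sketch of such a split (the 80/10/10 proportions are a common convention, used here as an assumption rather than a documented AutoML setting):

    import random

    def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
        # Shuffle a copy so the split is random but reproducible
        items = list(items)
        random.Random(seed).shuffle(items)
        n_train = int(len(items) * train_frac)
        n_val = int(len(items) * val_frac)
        return (items[:n_train],                 # used to train the model
                items[n_train:n_train + n_val],  # used to tune it
                items[n_train + n_val:])         # held back for the final check

    # Toy example with 100 dummy image names
    images = ["xray_%03d.jpg" % i for i in range(100)]
    train_set, val_set, test_set = split_dataset(images)
    print(len(train_set), len(val_set), len(test_set))  # 80 10 10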

Once you upload the data, press “Start Training”. You can now sit back and relax. It will take several hours for the model to be trained. When this process is complete, you will be notified by email.

Please note: Your model name should start with a letter. If it starts with a number, the “Start Training” button gets disabled, due to a current bug in the beta platform.

2. Evaluate

Evaluation is the part where you understand how probable it is that the model will make the right prediction once it is deployed. You can see this in the precision and recall rates.

Precision shows what proportion of positive predictions were actually correct. Meaning, if the model predicted 10 images as pneumonia, of which 8 actually were pneumonia, then the precision rate is 0.8 or 80% (8/10). While precision measures the accuracy of the positive predictions, it does not provide the complete picture: it does not tell you how many of the actual pneumonia cases were correctly identified.

Recall completes the picture by relating the predictions to the complete actual dataset. Recall answers the question “What proportion of actual positives were predicted correctly?” Meaning, if the model made 8 correct pneumonia predictions, but there were actually 12 pneumonia X-Rays, the recall rate would be 0.67 or 67% (8/12).
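The arithmetic behind these two rates is simple enough to express in a few lines. A minimal sketch using the numbers from the example above:

    def precision_recall(true_positives, predicted_positives, actual_positives):
        # Precision: correct positive predictions / all positive predictions
        precision = true_positives / predicted_positives
        # Recall: correct positive predictions / all actual positives
        recall = true_positives / actual_positives
        return precision, recall

    # From the example: 8 correct out of 10 predicted as pneumonia,
    # with 12 actual pneumonia X-Rays in the dataset
    p, r = precision_recall(8, 10, 12)
    print("precision=%.2f recall=%.2f" % (p, r))  # precision=0.80 recall=0.67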

Please note: The closer the precision and recall rates are to 1, the higher the probability that the predictions of the ML model will be correct.

If you are not satisfied with the precision and recall rates of your trained model (e.g. <90%), then you should go back and retrain the model with a revised dataset.

In order to increase your precision and recall rates, some example questions you can ask are:

  • Is the data for the different labels balanced?
  • Is the data correctly labelled?
  • Is the amount of data provided for training sufficient?

You should repeat this process until you reach acceptable precision and recall rates. What counts as acceptable will depend very much on your use case: the impact of the decision, the level of autonomy the model has, whether the results will be screened by a human before implementation, the complexity of the dataset, the difficulty of the predictions, etc.

Once you are satisfied with the evaluation results of your model, you can move further on to prediction or “Test & Use”.

3. Predict or “Test & Use”

Now that you have trained your model, it is time to let it work for you. But before doing that, you can make a final test: upload a dataset where you know the labels, let the ML model predict, and check the accuracy of the predictions.

Attention: The dataset you upload at this stage should be different from the previously uploaded training data.
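Checking the correctness of the predictions on such a labeled hold-out set boils down to a simple comparison. A minimal sketch (the labels and predictions below are made up; in practice the predictions would come from your trained AutoML model):

    # Hypothetical hold-out set: known labels vs. the model's predictions
    true_labels = ["pneumonia", "normal", "pneumonia", "normal", "pneumonia"]
    predictions = ["pneumonia", "normal", "normal",    "normal", "pneumonia"]

    correct = sum(t == p for t, p in zip(true_labels, predictions))
    accuracy = correct / len(true_labels)
    print("accuracy on hold-out set: %.0f%%" % (100 * accuracy))  # 80%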

If you are happy with the correctness of the predictions, then you can upload the data that needs to be categorised and get the results.

When to use AutoML

AutoML works in cases like this one, where the data is clearly labelled (normal / pneumonia). This method of machine learning is called “supervised learning”.

In supervised learning, you train the model on the different categories by showing it many examples of each label. The model is not provided with explicit descriptions of the different categories to learn from. It learns only by identifying patterns in the images of the given categories.
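To make “learning from labeled examples” concrete, here is a toy supervised-learning sketch using scikit-learn. It is only a stand-in for what AutoML does internally at far greater scale, and the two numeric features are made up for illustration:

    from sklearn.linear_model import LogisticRegression

    # Made-up features per X-Ray: [cloudiness score, opacity score]
    X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
    y = ["normal", "pneumonia", "normal", "pneumonia"]  # the labels

    # The model sees examples plus labels: that is supervised learning
    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict([[0.85, 0.7]]))  # expected: ['pneumonia']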

In the chest X-Ray example we used computer vision (CV), working with images, but AutoML can also be applied to natural language processing (NLP), working with text.

Conclusion

In this blogpost I have tried to help you gain a new skill: building your own ML model using AutoML. If you plan to use supervised learning for a common use case, it may be the right approach for you as well.

In the case of the X-Rays, the AutoML model worked pretty well. With a small dataset of 100 images per label I was able to get precision and recall rates above 90%. According to Google, this is the bare minimum; the recommended amount is above 1,000 examples per label.

In addition to Google, there are also other AutoML providers like Microsoft Azure Automated ML or Amazon SageMaker Autopilot.

This blogpost is Part 2 of the series “Manager’s Guide to Demystify A.I.”. Part 1 can be found here. Part 3, A Primer on AI Safety, can be read here.
