Understanding How Machine Learning Is Just Like the Human Learning Process

Shaayan Hussain · Published in Analytics Vidhya · Apr 22, 2020 · 13 min read

Day 1 of machine learning? Follow along and you'll soon be building machine learning models in Python all by yourself, with enough understanding of each part of the implementation.

“Predicting the future isn’t magic, it’s artificial intelligence.”

You have been hearing a lot about Machine learning and the way it is transforming the existing practices of almost every field.

Terms like Machine Learning, Data Science, and Artificial Intelligence overlap with each other in various ways, but one point remains the same: blending the processing power of machines with human intelligence.

To give an idea of why there is so much machine learning buzz, have a look at a few problems where ML has proven its worth across different industries.

  • Medical: Detecting tumors from images, diagnosing disease from symptoms, clinical research, disease outbreak prediction, etc.
  • Retail: Detecting anomalies in sales or inventory, predicting demand for goods, etc.
  • E-commerce: Product recommendation, customer churn, customer retention, etc.
  • Finance: Fraud detection, process automation, predicting loan defaults, plan recommendation, etc.
  • Manufacturing: Predictive maintenance of machine parts by predicting future failures, supply chain management, etc.

Beyond these, today Machine learning has ground-breaking achievements in the research field and is helping curious minds to create magic every day.

Why is ML for everyone and not just for professionals?

We see problems around us every day where we want to make an educated guess. Let's see an example.

You are getting late for the office. Before leaving, what is the one question that will hit you: is there going to be heavy traffic on my way to the office today? How do you solve this daily problem? You consider everything that contributes to traffic patterns: what day it is, whether this is the peak hour, whether it rained today, and so on. Based on the combination of answers you get, you recall similar instances from the past week or month to understand the traffic trends, and then you conclude whether or not to expect high traffic.

In machine learning terms, you applied a Logistic Regression model. The factors you considered were your parameters; your knowledge of past days' traffic was the training data; and the answers you gave (it is a Monday, it is peak hour, no, it didn't rain today) were the test values passed to the algorithm called Logistic Regression, which gave you a binary output, two categories to choose from: YES or NO. Your brain is the running model.
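To make the analogy concrete, here is a minimal sketch in Python using scikit-learn's LogisticRegression; the feature values below are invented purely for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Each past day: [is_weekday, is_peak_hour, rained_today]
X_train = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
y_train = [1, 0, 1, 0, 1, 0]  # 1 = high traffic, 0 = low traffic

model = LogisticRegression().fit(X_train, y_train)  # your "brain" learning the trends
print(model.predict([[1, 1, 0]]))  # Monday, peak hour, no rain -> [1] or [0], i.e. YES or NO
```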

Python, R, and other languages have rich libraries that make ML quite doable even without a lot of background knowledge. People in different fields like healthcare and finance leverage machine learning to solve the day-to-day problems related to their work.

Let's get started…

We will now see how simple and intuitive the machine learning process flow is. Let's find some parallels with the human learning process.

Let’s take the example of being able to solve any question on a new chapter of Mathematics.

  1. Collect learning material
  2. See the questions and understand the explained solutions.
  3. Try questions yourself without seeing the answer
  4. Validate your answers with the given correct answers
  5. Repeat steps 1–4 till you grasp the underlying concept and are able to apply it to all the scenarios related to this chapter

This is the exact process of machine learning. Let’s understand the fundamental terminologies and how you can relate to them intuitively.

  1. Data collection — Collect learning material

You plan to use a machine learning model to solve a problem. Consider a model before its training as a student on the first day of class: he doesn't know anything yet, and he needs to learn before he can solve problems. First, you need to collect the data for training the model. This can be official data like weather data or government organization data, or something as simple as your scores on different tests, depending on the problem you want the ML model to solve.
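As a minimal sketch (assuming a hypothetical file scores.csv holding your test scores), collecting data in Python can be as simple as reading it into a table:

```python
import pandas as pd

# "scores.csv" is a hypothetical file; swap in weather data,
# government data, etc., depending on your problem.
data = pd.read_csv("scores.csv")
print(data.head())   # peek at the first few rows of learning material
print(data.shape)    # how much data did we manage to collect?
```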

2. Data Preparation- Once we have the data, we need to run a few checks and processes to rid the data of its shortcomings and utilize it to the maximum. Good data is extremely vital and difficult to gather. The data might come from different sources that we need to combine, or it might arrive in a particular order that we need to reshuffle to remove any bias caused by the ordering. Just the way you wouldn't want all the questions from one sub-topic to appear together, or want repeated questions, because either would bias your practice. There are many data preparation steps, depending on the quality of the data we gathered; a couple of common ones are sketched below.
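A sketch of two common preparation steps with pandas, continuing with the hypothetical data table from the previous step:

```python
# Remove repeated questions and reshuffle to kill any ordering bias.
data = data.drop_duplicates()
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# A quick check for gaps in the learning material.
print(data.isnull().sum())
```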

3. Choosing the model- Depending on the kind of problem we want to solve using machine learning, we choose a suitable model. If we want a categorical answer to a question like "what is the color of this box: red, blue, or green?", we use a Classification model; if it is a continuous-value question like "what will be the sales in a particular store next month?", a Regression model is used. Model selection also depends on many other factors: the quality of the data we have, whether the data size is huge or small, and whether the data is labeled or not (whether we have the right answers, as with the math problems we discussed). We will talk about all of these in detail later. A tiny sketch of the choice is below.
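In scikit-learn terms, the choice can be as simple as picking the model family that matches the question; a minimal sketch:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Categorical answer (red/blue/green, yes/no) -> a classification model.
classifier = LogisticRegression()

# Continuous answer (next month's sales figure) -> a regression model.
regressor = LinearRegression()
```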

4. Training- See the questions and understand the explained solutions.

Once we have the prepared data and have selected the model, we train the model using that data. The data is split into two parts: train data and test data. The convention is usually a ratio like 70:30 or 80:20 for train:test, just the way you would study the solutions of 8 out of 10 examples in a Mathematics chapter and leave 2 of them to attempt yourself, to estimate how well you have learned. We take the test data set and keep it aside; it is not used anywhere until the training is completed. The split can be done manually, but it is best to use the libraries provided by languages like Python or R: the pre-defined functions help maintain the randomness of the split, which is important for ruling out any scope of bias.

Let's try to understand why this is important through the 'Math problems' example. If you were to learn only a few particular kinds of problems (a bias, because you are not covering all kinds of problems), would you be able to answer the other kinds? No, right! That's why it is important to train using randomly selected data that can represent the whole underlying pattern or concept. Only then will the model be able to answer all kinds of questions based on the data.
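Here is a minimal split sketch using scikit-learn's train_test_split, with toy values standing in for real features and labels:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # toy features (10 rows)
y = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]    # toy labels

# 80:20 split; shuffling is on by default, ruling out ordering bias,
# and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```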

5. Evaluation- Try questions yourself without seeing the answer and validate with correct answers

Once the training is completed, it is time to check how well the model has learned. For this, the test data is passed to the model, and the answers given by the model are matched against the correct answers already present in the data. This method of training and testing is called Supervised Learning, where the data already has the correct labels (answers). There are other methods for when the data doesn't have labels and the model needs to find patterns on its own.
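Continuing the sketch from the split above, training and evaluation take just a few lines (the analogy is in the comments):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test from the split sketched above.
model = LogisticRegression()
model.fit(X_train, y_train)                 # study the solved examples
predictions = model.predict(X_test)         # attempt the unseen questions
print(accuracy_score(y_test, predictions))  # validate against the answer key
```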

Now this whole process is iterated till the model starts scoring well. Study more or study well till you perform well enough.

[Image: iterating the train-test process till the model starts scoring well]

We will try to analyze the potential reasons for the poor performance of the model.

What are all the reasons a student might perform poorly?

Let’s first roughly see through examples and then we will talk about the importance and implications of each of them in Machine learning terms.

  1. Underfitting- The student didn't read the learning material well enough to see all the different kinds of examples and how the solutions vary across cases. He tried applying the same few concepts everywhere. Result: can't solve the examples well, can't solve the test questions (poor train accuracy, poor test accuracy).
  2. Overfitting- The student over-learned the examples, down to the specific values and cases given. He memorized the variables and other unnecessary information as if they were the concept, and tried to find the same combinations of variables and cases from the learning material in the test questions too, without understanding that only the formulas and important concepts remain constant. Result: can solve almost every example, can't solve the test questions (good train accuracy, poor test accuracy). A quick code check for both failure modes follows after this list.
  3. Outliers- The learning material had exceptional cases scattered among the examples. He mistook them for the underlying concept and tried to generalize from them.
  4. Target class imbalance- The student hadn't seen many instances of the kind of questions that were asked in the test; there were very few examples of that category in the learning material. Assume a student spent 3–5 minutes on each question and only 2 out of 100 examples were of that kind: those examples would carry no significant weight in his overall learning, so in the test he didn't know enough about them to answer the questions.
  5. Wrong metric for evaluation- Not all test patterns are the same. Every subject has different criteria on which it is best to judge understanding: in a language subject, grammar and articulation matter; in Mathematics, formulas and diagrams; in History, dates and names; in Coding, naming conventions, indentation, optimization, and so on. Imagine shuffling these criteria between subjects. So you get the point: every model is built with a particular kind of data for a particular problem, and the evaluation metric must be chosen accordingly.
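Here is the promised rough heuristic for spotting the first two failure modes in practice; the thresholds are arbitrary, chosen purely for illustration, and it assumes the fitted scikit-learn model and train/test split from the earlier sketches:

```python
# Compare how the "student" does on solved examples vs unseen questions.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

if train_acc < 0.7 and test_acc < 0.7:
    print("Likely underfitting: poor on examples AND test questions")
elif train_acc - test_acc > 0.2:
    print("Likely overfitting: aced the examples, failed the test")
else:
    print("Looks reasonable: comparable train and test performance")
```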

Okay, so now we have seen the potential reasons, we should know that these are concepts and checks that are very important in the Machine learning process. So, let’s dive deeper to understand these concepts properly with the help of examples.

A quick question: ever got confused while recognizing a water bottle? Like mistaking a water can, a grain container, or a moisturizer bottle for a water bottle. Hopefully not!

Well, what you call common sense is actually an understanding of things so deep that it has become obvious. Don't expect that from a machine. A machine knows exactly as much as it has been taught.

Let’s try to understand the concepts with this problem.

Problem: Water bottle recognition

Data: Images of various types of boxes, containers and bottles.

[Image: training data]

Normally your amount of data can range anywhere from hundreds to millions of records. For now, assume this to be your train data set. Let's visualize each of the potential reasons we discussed and see why they will make the model fail with test data.

  1. Model with Underfitting:
  • Learning: Water Bottle should have a body and a cover

Test image:

[Image: test data]

Result: Yes. This is a water bottle (it has a body and a cover)

Now you see the issue: too little understanding of the patterns in the train data, applied to the test data, led to the wrong result.

2. Model with Overfitting:

  • Learning: Water Bottle should have a body, a cover, a light blue body, and a darker blue cover. A bottle can be any of the given 6 patterns.

Test image:

[Image: test image]

Result: No. This is not a water bottle (it is not blue).

Too much learning from the data, including unnecessary patterns that are specific to the provided train data and do not hold in general.

3. Model with Outliers:

Assume the data has some fancy images that don't quite look like bottles. These are called Outliers. An outlier can occur due to a manual mistake or a genuinely weird value.

[Image: genuine weird values]
[Image: mistake while gathering data]

Learning: The above 2 images are also bottles. (they are part of train data)

Test data:

[Image: test data]

Result: Yes, it is a water bottle (based on combined learning from above 2 images)

  • Predictions are made by the model using formulas created from trends in the data. Outliers distort those trends, leading the model in a different direction. A simple way to flag such values is sketched below.
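For numeric data, one common way to flag such values is the interquartile-range (IQR) rule; a minimal sketch with a made-up series, where 95 plays the role of the weird image:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is the 'weird' value

# Flag anything far outside the middle 50% of the data.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # flags 95 for a manual look: mistake or genuine rarity?
```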

4. Target class imbalance: To explain this, we have a perfect real-life application. Machine learning models are used for cancer prediction these days.

[Image: snapshot of the first 5 rows of a real cancer prediction dataset]

Note- The 'target' variable is the column that we are trying to predict; here it is the column 'diagnosis', marked in red. The columns marked in black are the parameters.

Train data: Data on various parameters that are used to determine the presence of cancer.

Learning: The issue here is that the majority of the data rows are cases of not having cancer (diagnosis: 0), with very few rows where cancer is present (diagnosis: 1). So the majority of the learning is about cases where cancer is not present.

Test data: A case of the presence of cancer

Result: The model might not be able to predict it and declare it otherwise. It is similar to that tiny topic which you read but can’t answer in the exam.

Issue: Here is something you need to pay attention to. ML models are not to be consumed in purely technical terms. Imagine that out of 100 test cases, only 2 are cancer cases, but the model declares every case as 'no cancer'. The model will still technically have an accuracy of 98%, which looks pretty good as a number, but in practice this model is of zero use: the whole purpose of building it was to detect the cancer cases, which it cannot do. The tiny sketch below makes this concrete.
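A tiny self-contained sketch of this trap, with made-up labels:

```python
# 100 test cases, only 2 of them cancer; the "model" says 'no cancer' every time.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.98 -> looks great, yet both cancer cases were missed
```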

5. Wrong metric for evaluation: Let's take the same cancer prediction example. We saw an accuracy of 98%, but this is not reflective of the model's actual performance. You see, the cost of making a mistake here is very high: the life of a human who could have been saved. We need a metric that can capture the real performance of the model, its utility in practical terms.

There are many evaluation metrics possible. We will get introduced to some commonly used ones segregated in categories of Classification and Regression models.

Classification:

For all of these, look at this chart made for categorical results, the confusion matrix:

CONFUSION MATRIX

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

‘Positive’ is the category you are building the model for. In cancer detection, positives will be cases having cancer.

People often get confused about the meanings of TP, TN, FP, and FN. Always interpret them this way:

  • True Positive (TP): It is True that it is Positive (correct prediction: actually positive, predicted as positive)
  • False Positive (FP): It is False that it is Positive (wrong prediction: actually negative, predicted as positive)
  • True Negative (TN): It is True that it is Negative (correct prediction: actually negative, predicted as negative)
  • False Negative (FN): It is False that it is Negative (wrong prediction: actually positive, predicted as negative)
  1. Accuracy- What proportion of predictions was correct out of all the predictions made? (a basic check of overall model performance)

Accuracy = (TP+TN)/(TP+TN+FP+FN)

When to use: Accuracy can be used when there is no class imbalance in the data; otherwise, a model can simply classify everything as the majority class and still score a great accuracy (the issue faced in the cancer detection problem). Read the definition again.

2. Precision- What proportion of predicted positives was actually positive? (model reliability check, concerned about only the positive ones)

Precision = TP /(TP+FP)

When to use: If resources are limited and you need to select the positive ones, you use Precision. Consider finding which of 100 companies to invest in. With a low-precision model you might get a list of 45 companies of which only 5 are promising, and you can't invest in 45 companies just to cover those 5. You need a high-precision model that can give you, say, 8 companies including those 5. Read the definition again.

3. Recall- What proportion of actual positives could be predicted correctly? (model productivity/robustness check, concerned about only the positive ones)

Recall = TP /(TP+FN)

When to use: If the cost of missing the positives is high, you use Recall. Consider cancer prediction: a low recall would mean declaring a cancer patient healthy, and that can cost a life. Read the definition again.

4. F1 score- How balanced is the model in terms of both Precision and Recall? (to be used when both precision and recall are important)

F1 Score = (2*Precision*Recall) / (Precision + Recall)

When to use: Consider deciding which customers to target. You have millions of customers. You can't target too many of them because each contact incurs a cost (a low-precision scenario). At the same time, you don't want to miss the customers who need to be targeted, because that would mean a business loss (a low-recall scenario). You need a balance of both. Read the definition again. A combined sketch of all four metrics follows.
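All four metrics are one-liners in scikit-learn. A sketch with made-up labels (1 = positive, e.g. cancer present); the expected values are noted in the comments:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]  # made-up actual labels
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]  # made-up model predictions

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN) -> 0.8
print(precision_score(y_true, y_pred))   # TP/(TP+FP)            -> 0.75
print(recall_score(y_true, y_pred))      # TP/(TP+FN)            -> 0.75
print(f1_score(y_true, y_pred))          # 2PR/(P+R)             -> 0.75
```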

Regression:

  • RMSE: Root Mean Square Error
  • MAPE: Mean Absolute Percentage Error

We don't want to get into the details here because it would make the contents painfully long. Just remember: go from right to left in the name of the metric and apply those operations in order; that gives you the formula. For example:
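Applying that right-to-left reading gives the standard definitions:

RMSE = sqrt( mean( (actual − predicted)² ) ) — the Errors, Squared, then their Mean, then the Root

MAPE = mean( |actual − predicted| / actual ) × 100 — the Errors as Percentages, their Absolute values, then the Mean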

  • R-squared
  • Adjusted-R squared

and many others…

So now you have learned the basics, it is time to implement some models. This is the link to the Jupyter Notebook (IDE) where we will apply our understanding and build them using Python.

Let us know your feedback. Thanks for reading!
