Machine Learning Basics

Ishara Madhavi · Published in Analytics Vidhya · Aug 31, 2020
Photo by Andy Kelly on Unsplash

The advances and opportunities that Machine Learning has opened up are enormously valuable. Before answering the question, ‘What is Machine Learning?’, let me highlight its importance from the perspective of ‘Why Machine Learning?’.

Suppose you work as a programmer on the Grammarly team and you are assigned to write a program that corrects grammar and spelling mistakes made by online writers. Are you an expert in the English vocabulary? Even if you are, how many grammar rules and words would you have to encode to achieve good accuracy? One advantage of Machine Learning is that it does this hard part of analyzing patterns and building rules for you, greatly reducing programming time. If you have a rich example set from which the machine can learn, you are already equipped with what you need.

Now that you have successfully built an English spelling/grammar corrector using Machine Learning, suppose you are tasked with creating a spelling/grammar corrector for the Chinese language. If you were not aware of the true capabilities of Machine Learning, you would be stressed upon hearing this. All you have to do is find a rich Chinese dataset and provide a customized version of Grammarly for Chinese writers.

Upon launching the new product, you need to merge the two Grammarly versions (English and Chinese) to give writers more flexibility. Although you can easily tell the English alphabet from Chinese characters at first glance, how can you teach a computer to do so? Again, Machine Learning can help solve this seemingly unprogrammable task.

Now that you are amazed by what Machine Learning can do, let’s dive into the basic concepts. Machine Learning addresses the problem, ‘How can we build a computer program that automatically improves at a task through experience?’. If you carefully analyze a problem that Machine Learning is capable of solving, you will find three major components:

  1. Task
  2. Experience
  3. Performance

The task is the skill that is expected to be mastered. The experience is the training obtained. The performance is the metric that evaluates how well the task is done. For example, a spam filter has the task of classifying an arbitrary email as spam or non-spam. The experience is obtained from the training examples/instances, which are labeled emails. The spam filters of Gmail and Yahoo not only check emails against pre-existing rules but also generate new rules as they continue classifying [1]. This training data can be gathered as users read their mail and manually mark messages as spam or not according to their content. The performance can be measured in various ways; one is classification accuracy (the percentage of correctly classified emails).
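To make the three components concrete, here is a minimal sketch in Python using scikit-learn; the handful of example emails and the choice of a naive Bayes classifier are purely illustrative, not what Gmail or Yahoo actually use.

```python
# Illustrative only: a tiny spam filter showing task, experience and performance.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Experience: labeled training examples (1 = spam, 0 = non-spam)
emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "claim your free reward today", "project report attached"]
labels = [1, 0, 1, 0]

# Task: classify an arbitrary email as spam or non-spam
vectorizer = CountVectorizer().fit(emails)
model = MultinomialNB().fit(vectorizer.transform(emails), labels)

# Performance: classification accuracy (here measured on the same tiny set)
predictions = model.predict(vectorizer.transform(emails))
print(accuracy_score(labels, predictions))
```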

A model in Machine Learning is an algorithm trained using data to replicate the decisions that an expert would make given the same information [2]. Training simply means finding the best parameters (weights and biases) of the algorithm by examining many examples from the training space. Good training should result in fewer errors in the decisions the model makes. The error, or loss, is a number that tells us how badly the model is making decisions.
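As a concrete example, here is one common loss, the mean squared error, computed with NumPy for a toy linear model with a single weight and bias; the numbers are made up for illustration.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean squared error: a single number telling how bad the predictions are.
    return np.mean((y_true - y_pred) ** 2)

# A toy model: y = w * x + b, with guessed parameters (illustrative values)
x = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])
w, b = 1.5, 0.0
y_pred = w * x + b

print(mse_loss(y_true, y_pred))   # smaller is better; 0 means a perfect fit
```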

The next question you might be wondering about is, ‘How do we determine a function for assessing this loss?’. Even before answering this, how do we know that the model is actually learning and improving? Let’s assume we are given a mathematical function for calculating the loss. To drive the model towards better performance, we follow an iterative approach named Gradient Descent. It is based on the observation that if a function is differentiable in a neighborhood of a point, then it decreases fastest in the direction of the negative gradient [3]. In other words, to minimize the loss, we move along the negative gradient, taking steps proportional to its magnitude. Since the model-algorithm is a function of unknown parameters (weights and biases), the loss and its derivative are also functions of those parameters. By repeatedly taking steps proportional to the negative gradient (with a suitably small step size), these parameters are adjusted so that the loss decreases, which means the model is learning.

Figure 1. Adjustments to model weights using Gradient Descent
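Below is a minimal sketch of gradient descent for the toy linear model above, assuming mean squared error as the loss; the data, learning rate, and number of steps are illustrative choices, not universal settings.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # underlying relation: y = 2x

w, b = 0.0, 0.0                      # initial parameters (weight and bias)
learning_rate = 0.01                 # step size (illustrative)

for step in range(1000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Take a step proportional to the negative gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)                          # should approach 2.0 and 0.0
```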

The choice of a loss function depends on the model-algorithm chosen, the amount of outlier data in the training space, and the ease of calculating the derivative.
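For example, the mean squared error is far more sensitive to a single outlying residual than the mean absolute error, which is one reason this choice matters; the numbers below are made up.

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.2, 10.0])   # the last value is an outlier
print(np.mean(residuals ** 2))                  # MSE ~ 25.1, dominated by the outlier
print(np.mean(np.abs(residuals)))               # MAE = 2.75, far less affected
```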

Models learn patterns found in data, and if the identified patterns are overly simple or overly complex, the models make incorrect predictions for unseen data. If the model fits the training data too closely (overfitting), capturing every training example including its noise, it has low bias but high variance: its predictions track the training set almost perfectly yet deviate strongly on unseen test data. If the model fits the training data too loosely (underfitting), it has high bias but low variance: its predictions change little from one training set to another, but they are systematically off on both training and test data. The task is to find the sweet spot of model complexity that yields the optimal parameters for the model-algorithm.

Figure 2. Bias-Variance Trade-off
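One way to see this trade-off in practice is to vary model complexity, for instance the degree of a polynomial fit; the following sketch uses scikit-learn on a noisy sine curve, with the data and degrees chosen purely for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)   # noisy target
x_train, y_train = x[:20].reshape(-1, 1), y[:20]
x_test, y_test = x[20:].reshape(-1, 1), y[20:]

for degree in (1, 4, 15):            # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(degree, round(train_err, 3), round(test_err, 3))
```

The low-degree fit errs on both sets (high bias), while the high-degree fit nails the training set but does poorly on the test set (high variance).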

Finding the right set of parameters is often challenging: although the model performs well on the training data, it can make poor decisions/predictions on unseen test data. The reason is that the model has over-learned the data; it represents the training space perfectly but does not generalize beyond it. This can result from noisy data present in the training samples. How do we ensure that the model has absorbed the major patterns in the data while ignoring the noise? In other words, ‘How do we mitigate the overfitting problem?’. We can use Cross-Validation or Regularization techniques for this.

Cross-Validation gives a sense of model performance on unseen test data. One simple technique is Holdout Cross-Validation, where we set aside a portion of the data for validation, train the model on the remaining data, and then get predictions for the held-out set. By calculating the loss on the validation set, we get a notion of the generalization power of the model. However, there is no guarantee that the samples most important for recognizing patterns end up in the training set, and the model might learn very little due to the reduced amount of training data. To address these gaps we can use K-Fold Cross-Validation: we split the training set into K partitions and, at each of K iterations, hold out one partition for validation, train on the rest, and calculate the loss on the held-out partition. By averaging the K losses, we can see how well the model generalizes to unseen data, and every partition is used for training at some point in the process. To guarantee a fair distribution of target classes in the train/validation splits we can use Stratified K-Fold Cross-Validation, which makes sure each fold contains a balanced sample of the target classes.
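Here is a brief sketch of (Stratified) K-Fold Cross-Validation using scikit-learn; the breast-cancer dataset and the logistic regression model are just convenient stand-ins for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)     # illustrative binary dataset
model = LogisticRegression(max_iter=5000)

# Stratified K-Fold: each fold preserves the class proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one score per held-out fold
print(scores, scores.mean())                   # the average estimates generalization
```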

Another generalization technique is Regularization. This method shrinks model complexity by pushing the model weights (coefficients) towards zero; the excess complexity often arises from noise present in the training samples. Depending on whether the weights are driven exactly to zero or merely diminished, we have two regularization techniques:

  1. Lasso Regression (L1)
  2. Ridge Regression (L2)

A comparison of these two regularization techniques is given in the table below.

Figure 3. Comparison between Ridge and Lasso Regression
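As a quick sketch of the difference, the snippet below fits both on synthetic data where only a few features are informative; the dataset and the alpha values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative data: 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) tends to drive uninformative coefficients exactly to zero,
# while Ridge (L2) shrinks them towards zero but rarely all the way.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```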

As this is a basic introduction to Machine Learning, I recommend an intuitive article [4] for further details on Regularization.

I hope you enjoyed learning the basics of Machine Learning with intuitions!

Credits:

[1] https://www.sciencedirect.com/science/article/pii/S2405844018353404

[2] https://www.ospreydata.com/2020/02/24/ai-ml-models-101-what-is-a-model/

[3] https://en.wikipedia.org/wiki/Gradient_descent

[4] For details on Regularization: https://towardsdatascience.com/regularization-in-machine-learning-connecting-the-dots-c6e030bfaddd
