Machine Learning Basics
This blog covers the basics of Machine Learning and related terms.
What is Machine Learning?
Let us have a look at the first ever definition of Machine Learning:
“Field of study that gives computers the ability to learn without being explicitly programmed” — Arthur Samuel, 1959
In contrast to rule-based programming, Machine Learning is when a machine learns to solve a problem in a way that generalizes to unseen data.
By “Machine Learning”, we actually mean training an algorithm to solve a specific task for us. Training the algorithm can also be referred to as training a model. You can relate training a model to the training we underwent during our school days. We used to sit through boring classes and then write exams. Once out of school, we used that learning to find solutions to unseen challenges in the real world. Similarly, we first train our model (classroom learning), validate it (written exams) and then test it (real-world challenges).
Types of Machine Learning
1. Supervised Learning:
In Supervised Learning, the algorithm is given training data in which every sample has a label (target value). The label is also known as the Ground Truth. It means we have:
N training samples of the form {(x1, y1), …, (xn, yn)},
where xi is the feature vector of the i-th example and yi is its label or class.
For example, in an ‘Email Classification’ task, every email in the training data is labeled as either ‘SPAM’ or ‘NOT SPAM’.
Some examples of Supervised Learning algorithms are KNN, Logistic Regression, Linear Regression, Random Forest, etc.
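To make this concrete, below is a minimal supervised-learning sketch (assuming scikit-learn is installed); the tiny feature vectors and labels are invented purely for illustration.

```python
# Minimal supervised learning: every training sample comes with a label.
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors xi and their ground-truth labels yi (invented).
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = [0, 0, 1, 1]  # e.g. 0 = 'NOT SPAM', 1 = 'SPAM'

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)          # learn from labeled samples
print(model.predict([[1.2, 1.9]]))   # predict the label of an unseen sample
```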
2. Unsupervised Learning:
In Unsupervised Learning, the algorithm is given training data without labels for its samples. It means we have:
N training samples of the form {x1, …, xn}
It is used for clustering samples into different groups based on their distinctive features.
Some examples of Unsupervised Learning algorithms are the Apriori algorithm and K-means.
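Along the same lines, here is a minimal unsupervised sketch using scikit-learn's K-means; note that the invented samples carry no labels, and the algorithm groups them on its own.

```python
# Minimal unsupervised learning: no labels, only feature vectors.
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [1.1, 1.9], [8.0, 8.0], [8.2, 7.9]]  # unlabeled samples

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)  # cluster assignment discovered for each sample
```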
3. Reinforcement Learning:
In this kind of learning, the algorithm has to make decisions based on the reward it earns during training. The reward can be -1 for negative progress, 0 for no progress, and 1 for positive progress. For example, consider learning to play the game “Virtua Cop”. If you (the agent) kill an enemy, you observe that you score positive points, and if you kill a hostage, you lose points. To play efficiently, you refrain from killing innocent hostages and kill only enemies. You learnt this from the score you earned. Similarly, a machine learning algorithm (the agent) learns to solve the problem at hand by looking at the reward it earns after taking an action. It prefers actions that earn it positive rewards and avoids actions that earn negative rewards.
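As a toy illustration (much simpler than Virtua Cop), the sketch below gives an agent two hypothetical actions with invented rewards; by keeping a running average of the reward each action earns, it learns to prefer the one that pays off.

```python
# A toy reward-driven agent: 'shoot_enemy' pays +1, 'shoot_hostage' pays -1.
import random

rewards = {"shoot_enemy": 1, "shoot_hostage": -1}    # invented reward signal
values = {"shoot_enemy": 0.0, "shoot_hostage": 0.0}  # running value estimates
counts = {"shoot_enemy": 0, "shoot_hostage": 0}

for step in range(1000):
    if random.random() < 0.1:                 # sometimes explore at random
        action = random.choice(list(values))
    else:                                     # otherwise take the best-known action
        action = max(values, key=values.get)
    reward = rewards[action]
    counts[action] += 1
    # Update the running average reward for the chosen action.
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the agent ends up valuing 'shoot_enemy' far higher
```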
Types of Machine Learning Problems
1. Classification problem:
It is a type of Supervised Learning in which the task is to find the correct label for the data. The number of labels, out of which the model has to predict one, must be finite. For example, by looking at an image, the model has to decide whether it contains a dog or a cat. In this case, the labels or classes are “dog” and “cat”.
2. Regression problem:
It is also a type of Supervised Learning in which the task is to predict continuous real values. For example, predicting the stock market value or predicting the price of houses based upon their descriptions. In these examples, the target values do not come from a finite set; they are continuous, as the sketch below shows.
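A minimal regression sketch, with invented house sizes and prices, might look like this.

```python
# Minimal regression: predict a continuous value (a house price).
from sklearn.linear_model import LinearRegression

X_train = [[600], [800], [1000], [1200]]    # feature: house size (sq. ft), invented
y_train = [150000, 200000, 250000, 300000]  # target: continuous price, invented

model = LinearRegression()
model.fit(X_train, y_train)
print(model.predict([[900]]))  # predicted price for an unseen house
```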
3. Clustering problem:
It is a type of Unsupervised Learning in which the model learns to group data samples based upon the common features, thus forming clusters. For example, grouping customers based on similar buying behavior.
Dataset:
A dataset is a collection of samples which is similar to the data found in the actual test environment of the problem we are solving.
In Machine Learning, the data is given utmost importance. Even before you start training the model, you have to preprocess the data so that it is easy for the model to learn quickly and efficiently. The preprocessing methods employed differ from problem to problem. For example, if your dataset includes images, you first have to normalize the pixel values. You can also apply various kinds of transformations to the images to get more images; transformations may include rotation, scaling, shearing, reflection, etc. Similarly, for a text dataset, you might have to clean the data by removing stop words first, followed by lemmatization and stemming of the words, etc.
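For instance, here is a minimal sketch of pixel normalization with NumPy; the random array merely stands in for a real grayscale image.

```python
# Normalize image pixel values from the [0, 255] range into [0, 1].
import numpy as np

image = np.random.randint(0, 256, size=(28, 28))  # stand-in 28x28 "image"
normalized = image.astype(np.float32) / 255.0     # scale pixels to [0, 1]
print(normalized.min(), normalized.max())
```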
Dividing the dataset into Train, Validation, Test:
Once we are done with the preprocessing phase, we have to divide the whole set into three parts: Train, Validation and Test. The ratio can be kept around 6:3:1. The idea is to use the Train dataset for training the model, the Validation dataset for validating the model and tuning the hyper-parameters if needed, and the Test dataset only for testing the model. The Test dataset is kept aside and should not be used at all during training. Don’t worry about the term ‘hyper-parameters’; it will be explained later.
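A minimal sketch of such a 6:3:1 split, done in two stages with scikit-learn's train_test_split on toy arrays, could look like this.

```python
# Split 100 toy samples into Train/Validation/Test in a 6:3:1 ratio.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # toy features
y = np.arange(100)                  # toy targets

# First carve out the 10% test set, then split the remainder 60/30.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 60 30 10
```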
Cross Validation
If the size of the dataset is small, instead of partitioning it in a 6:3:1 ratio, we can divide it into only two parts, Train and Test, in roughly an 8:2 ratio. The model is then validated by following the steps below (a code sketch follows the list):
1. Split the Train dataset into k parts (folds), where k is again an integer hyper-parameter that you can select. For example, k = 10.
2. Choose a set of hyper-parameters for your model.
3. Train your model on the first k − 1 (= 9) parts.
4. Check the performance (by measuring the accuracy/loss) on the k-th part.
5. Repeat steps 3 and 4 k (= 10) times, each time using a different part for checking the performance.
6. Average the performance across the k folds. This is the performance metric for the set of hyper-parameters you used for training.
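Here is the promised sketch of these steps, using scikit-learn's KFold on the built-in Iris dataset; the choices of k = 10 and of KNN as the model are arbitrary.

```python
# k-fold cross-validation: train on k-1 folds, check on the remaining fold.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=5)         # chosen hyper-parameters
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # check on the k-th fold

print(np.mean(scores))  # average performance across the k folds
```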
Training the model:
After dividing the dataset, we have to decide which algorithm to train for our use case. Let’s look at a few definitions related to training a model.
Parameter/Weights:
Parameters or Weights are the values which are learned by the model during training. Learning the weights means updating the randomly initialized values in such a way that the model can make correct predictions. After training is over, these weights represent what the model has inferred from the whole training data. For further testing, only these parameters are needed for predictions and the whole training data can be discarded.
Hyper-parameter:
A hyper-parameter is a value which cannot be estimated from the data. It must be set manually and tuned. The value of a hyper-parameter depends on the algorithm used and the dataset. For example, in the KNN algorithm, the value K (the number of neighbors) is a hyper-parameter.
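For illustration, a minimal sketch of tuning K on a validation split might look like this; the candidate values are arbitrary.

```python
# Try several values of the hyper-parameter K and validate each one.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 7):  # candidate hyper-parameter values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, model.score(X_val, y_val))  # keep the K that validates best
```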
Epoch:
It is a hyper-parameter of integer type. An epoch is one pass of the entire dataset through the algorithm to update the parameters. Many algorithms need multiple passes over the dataset to train completely; for such algorithms, the number of epochs will be higher.
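As a sketch of how epochs show up in code, scikit-learn's SGDClassifier can be updated one pass over the data at a time via partial_fit; the number of epochs here is an arbitrary choice.

```python
# One call to partial_fit below = one pass over the dataset = one epoch.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
model = SGDClassifier(random_state=0)

num_epochs = 5  # hyper-parameter: how many passes over the data
for epoch in range(num_epochs):
    model.partial_fit(X, y, classes=np.unique(y))
    print(f"epoch {epoch + 1}: training accuracy {model.score(X, y):.3f}")
```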
Loss function:
In most Machine Learning algorithms, the loss is calculated from the difference between the predicted output (ŷ) and the actual output (y). In its simplest form:
J(w) = y − ŷ
J(.) is the loss function. Different loss functions lead to different predictions, so it is important to choose one carefully. Mean Squared Error (MSE) is a widely used loss function. It is calculated as the mean of the squared differences between the predicted values and the target values:

J(w) = (1/n) Σ (yi − ŷi)²
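A minimal MSE implementation matching the definition above could be:

```python
# Mean Squared Error: average of the squared prediction errors.
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0], [2.5, 5.5]))  # 0.25
```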
The graph of a simple MSE loss is a U-shaped (parabolic) curve, where the Y axis represents the cost and the X axis represents a parameter (weight) value.
Different problems (classification or regression) use different loss functions.
Gradient Descent:
Gradient Descent is an optimization method in which the weights of the model are updated so that the model reaches the minimum loss (the minima of the loss graph).
The gradient is the slope of the curve at any given point; it tells us the rate at which the curve is rising or falling.
Enough of magic words! Let’s get to a simple explanation:
Assume you are standing on top of a hill and your goal is to reach the valley, with the catch that you are blindfolded. The first thing you will do is move a step in one direction and check whether you feel you are going down or up. If you feel you are going down the hill, you continue doing so until you reach the bottom. Similarly, consider standing on top of the hill as standing on top of the loss curve. You find the gradient (the derivative of the loss function), which represents the slope of the curve, and then you update the values of the parameters so that you descend down the slope.
Consider the above example again. When you are descending the hill, you decide whether to take big steps or small steps to reach the valley. If you take big steps of equal size, you might reach the valley sooner, but chances are you will overshoot the lowest point. If you take small steps, you will reach the lowest point for sure, but slowly. This rate of descent is called the learning rate. Normally the learning rate takes small fractional values, like 0.0001.
Let’s look at the mathematical formulation of the above analogy.
Consider a scatter plot of data points through which we want to draw a line.
The goal of the learning model is to find the best fit (the best line that can describe the data). For this purpose, we will use the straight line equation:

ŷ = mx + c … (1)

where ŷ is the predicted value, m is the slope of the line and c is the constant (the simple equation of a line from our school days).
The above line is the best fit only when we have the correct values of m and c. Let’s define the error E based upon equation 1, using MSE:

E = (1/n) Σ (yi − ŷi)² … (2)
As we know, we have to find the gradient of this loss function with respect to its parameters m and c.
Equation 1 can be substituted in equation 2 as follows:

E = (1/n) Σ (yi − (mxi + c))² … (3)
Now we can partially differentiate equation 3 w.r.t. m and c:

∂E/∂m = −(2/n) Σ xi (yi − (mxi + c)) … (4)

∂E/∂c = −(2/n) Σ (yi − (mxi + c)) … (5)
Just hold on for one more step and I will spare you all the maths! The next step is simply to substitute equation 1 back into equations 4 and 5. Phew!

∂E/∂m = −(2/n) Σ xi (yi − ŷi)

∂E/∂c = −(2/n) Σ (yi − ŷi)
Now that you have skillfully found the derivatives (gradients) w.r.t. the parameters, we will use them to update those parameters respectively:

m = m − lr × ∂E/∂m

c = c − lr × ∂E/∂c
Now, these updated values of the parameters can be used to predict the target value in the next iteration, and the whole process is repeated until the loss reaches a minimum. Oh wait! What is that “lr” in the equations? Nah! Don’t worry. It is the learning rate which I described earlier.
We can now relate our earlier picture of the loss curve to the way we descend down its slope toward the minima at every epoch.
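Finally, here is a from-scratch sketch of the whole derivation: fitting ŷ = mx + c by gradient descent on the MSE loss. The synthetic data and the learning rate are arbitrary choices.

```python
# Fit a line to noisy synthetic data by gradient descent on the MSE loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=100)  # noisy line, true m=2.5, c=1.0

m, c = 0.0, 0.0  # initial parameter values
lr = 0.01        # learning rate
n = len(x)

for epoch in range(1000):
    y_hat = m * x + c                        # prediction, equation (1)
    dm = -(2 / n) * np.sum(x * (y - y_hat))  # gradient w.r.t. m, equation (4)
    dc = -(2 / n) * np.sum(y - y_hat)        # gradient w.r.t. c, equation (5)
    m -= lr * dm                             # descend the slope
    c -= lr * dc

print(m, c)  # should approach the true values 2.5 and 1.0
```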
Footnotes:
Co-author: Abhash Sinha
If you have any questions or suggestions, please feel free to reach out to us.
If you want to explore more about Machine Learning Algorithms, please feel free to check our other articles!