Towards Machine Learning - Supervised Algorithms
Machine learning is the study and development of algorithms that can learn from data and generate predictions from it. Because such algorithms can adjust their decisions as more data arrives, they improve with experience. An algorithm, in turn, is a collection of explicitly programmed instructions that a computer applies to analyze data or solve a problem.
This article explains some popular supervised learning algorithms: K-NN (K nearest neighbor), Naive Bayes, linear regression, and logistic regression.
This is the second article of the “Towards Machine Learning” series. You can check the previously published article on the following link:
Introduction
Before we talk about algorithms, let’s see how to calculate distances. Distance measures are used in both supervised and unsupervised learning to quantify the similarity between data points. Effective distance measures can improve the performance of our machine learning models for both classification and clustering tasks.
Minkowski Metric
Minkowski distance is the generalized form of the Euclidean and Manhattan distances:

D(a, b) = (Σi |ai − bi|^p)^(1/p)

- When p=1: Manhattan distance
- When p=2: Euclidean distance
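As a quick sketch, the Minkowski distance and its two special cases can be computed in a few lines of Python (the function name and sample points are illustrative):

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points a and b."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (1, 2), (4, 6)
print(minkowski_distance(a, b, 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski_distance(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```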
What is supervised learning?
Supervised learning algorithms form a subcategory of machine learning. They use labeled datasets to train models that classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it fits the training data appropriately; this fitting is usually validated as part of a cross-validation process.
Various real-world problems can be solved at scale with supervised learning, such as filtering spam into a separate folder.
The following sections briefly explain some popular supervised learning algorithms:
- K-NN (K nearest neighbor)
- Naive Bayes
- Linear Regression
- Logistic Regression
K-NN (K nearest neighbor)
K-NN is a simple supervised learning algorithm. It is nonparametric: it makes no assumptions about the distribution of the underlying data. Instead, K-NN assumes that similar things exist close to each other, so a new point is classified by the points nearest to it.
Advantages
- Can be used for classification and regression.
- Simple algorithm
Disadvantages
- High memory requirement: all of the training data must be present in memory to calculate the closest K.
Well, Okay. But, What is K?!
K is the number of nearest neighbors that vote when classifying a new point.
Things to consider when choosing K
- We should not use an even value for K in binary classification, because an even number of voters can produce a tie.
- In multi-class classification, an even value of K can be acceptable, although ties are still possible.
Steps of K-nearest neighbor
- Load the data.
- Calculate the distance from the new point to every training point.
- Find the K nearest neighbors.
- Predict the most frequent class among those neighbors.
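The steps above can be sketched in plain Python (the distance function, toy data, and choice of k are illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points closest to query."""
    # Steps 1-2: load the data and sort it by distance to the query point
    ranked = sorted(train, key=lambda item: euclidean(item[0], query))
    # Step 3: take the k nearest neighbors
    neighbors = [label for _, label in ranked[:k]]
    # Step 4: predict the most frequent class among them
    return Counter(neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 9), "B")]
print(knn_predict(train, (1.5, 1.2), k=3))  # "A"
```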
Do all the neighbors affect our test instance equally? If not, how do we account for that?
Weighted k-NN
In weighted K-NN, nearby points are given more weight than points farther away: the smaller the distance, the higher the weight. The new point is then assigned to the class with the largest total weight.
A common choice is to weight each neighbor by the inverse of its distance:

wi = 1 / d(xq, xi)

where xq is the query point and xi is the i-th neighbor.
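A weighted K-NN sketch using inverse-distance weights (the data is hypothetical, and a small epsilon guards against a zero distance):

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3):
    """Weighted K-NN: each of the k nearest neighbors votes
    for its class with weight 1/distance."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = defaultdict(float)
    for point, label in ranked[:k]:
        d = math.dist(point, query)
        votes[label] += 1 / (d + 1e-9)  # closer neighbor -> bigger vote
    return max(votes, key=votes.get)

train = [((1, 1), "A"), ((6, 6), "B"), ((6.5, 6.5), "B")]
# One close "A" outweighs two distant "B"s, even though "B" has more votes
print(weighted_knn_predict(train, (2, 2), k=3))  # "A"
```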
Naïve Bayes Classifier
It is a supervised learning algorithm that takes a probabilistic approach to classification, based on Bayes’ theorem combined with a naïve independence assumption between features.
Advantages
- It works well with high dimensions
Disadvantages
- If the independence assumption does not hold, performance can be very poor.
For text classification, naïve Bayes typically relies on a “bag of words” representation: word order is ignored and only word counts matter.
How does it work?
Suppose we have tweets as input text, and we want to know whether each tweet is positive or negative. We could have a bag of positive words (such as good, nice, happy, and adorable) and a bag of negative words (such as hate, not, and bad).
We then count how many times each word appears in the document to classify it as positive or negative.
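That counting idea can be sketched in a few lines of Python (the word lists and example tweets are invented for illustration):

```python
# Hand-picked sentiment word lists (assumptions for this sketch)
POSITIVE = {"good", "nice", "happy", "adorable"}
NEGATIVE = {"hate", "not", "bad"}

def classify_tweet(text):
    """Count positive vs. negative words and pick the larger side."""
    words = text.lower().split()
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return "positive" if pos >= neg else "negative"

print(classify_tweet("such a nice and happy day"))  # positive
print(classify_tweet("i hate this bad weather"))    # negative
```

Real naïve Bayes weighs these counts probabilistically rather than just comparing totals, as the following sections show.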
The classifier chooses the maximum a posteriori (MAP) class:

cMAP = argmax over c in C of P(c | d) = argmax over c in C of P(x1, …, xn | c) P(c)

In this formula:
- Maximum a posteriori = the most likely class.
- Document d is represented as features x1, …, xn.
- argmax returns the class to which the document is assigned.
Assume the feature probabilities P(xi | cj) are independent given the class c. So, we get:

cNB = argmax over c in C of P(c) ∏i P(xi | c)
Learning the Naïve Bayes Model
Maximum likelihood estimation:

P̂(wi | cj) = count(wi, cj) / Σ over w in V of count(w, cj)

That is, the fraction of times word wi appears among all words in documents of topic cj, where V is the vocabulary.
What if a word never appears in the training documents of some class?
Its estimated probability will be zero, and zero probabilities cannot be conditioned away: a single zero factor wipes out the whole product. To solve this problem, use Laplace (add-1) smoothing for naïve Bayes:

P̂(wi | cj) = (count(wi, cj) + 1) / (Σ over w in V of count(w, cj) + |V|)

Separately, to avoid numerical underflow when multiplying many small probabilities, sum log probabilities instead of multiplying the raw values.
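A minimal multinomial naïve Bayes with add-1 smoothing and log probabilities might look like this (function names and the tiny corpus are my own):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, class). Returns log priors,
    per-class word counts, per-class totals, and the vocabulary."""
    priors, counts, totals, vocab = Counter(), {}, Counter(), set()
    for words, c in docs:
        priors[c] += 1
        counts.setdefault(c, Counter()).update(words)
        totals[c] += len(words)
        vocab.update(words)
    n = len(docs)
    log_priors = {c: math.log(priors[c] / n) for c in priors}
    return log_priors, counts, totals, vocab

def predict_nb(model, words):
    log_priors, counts, totals, vocab = model
    best, best_score = None, float("-inf")
    for c in log_priors:
        # Summing logs avoids underflow; add-1 smoothing avoids zeros
        score = log_priors[c] + sum(
            math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
            for w in words if w in vocab
        )
        if score > best_score:
            best, best_score = c, score
    return best

docs = [(["good", "happy"], "pos"), (["nice", "good"], "pos"),
        (["bad", "hate"], "neg")]
model = train_nb(docs)
print(predict_nb(model, ["good", "nice"]))  # pos
```

Note that “nice” never occurs in a “neg” document; without smoothing its probability there would be zero and the comparison would break down.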
Univariate Linear regression
It is a supervised learning algorithm that aims to represent the data with the best-fitting line. Linear regression analysis is used to predict the value of one variable based on the value of another.
How to find the best approximated linear model?
The line with least error is the best fit.
First, we take the observations and try to find a straight line that fits them as closely as possible; this is called the regression line, and it is based on the least squares method. In the end, we want to minimize the difference between the estimated values and the actual values; in other words, we want to minimize the errors.
Let’s suppose the independent variable is study time and dependent variable is GPA.
So, the estimated line is ŷ = B0 + B1x, where ŷ is the estimated GPA, B0 is the y-intercept, and B1 is the slope.
If the relation between dependent and independent is positive, the slope will be positive, while if the relation is negative, the slope will be negative.
Suppose the estimated value for example i is hθ(xi) = θ0 + θ1·xi and the actual observation is yi, where θ0 is the intercept and θ1 is the slope.
So, the error for one example is hθ(xi) − yi.
To minimize the error over all m examples, linear regression uses the mean squared error as its cost function:

J(θ0, θ1) = (1/(2m)) Σ from i=1 to m of (hθ(xi) − yi)²

Training finds the values of θ0 and θ1 that minimize this cost, i.e., that make the squared differences between the predicted and actual values as small as possible.
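A plain gradient-descent fit of this cost can be sketched as follows (the study-time/GPA numbers are made up and lie exactly on a line, so the fit should recover it):

```python
def fit_linear(xs, ys, lr=0.02, epochs=5000):
    """Fit y ~= theta0 + theta1 * x by gradient descent on the MSE cost."""
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(epochs):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = (1/(2m)) * sum(error^2)
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1

# Hypothetical study-time (hours) vs. GPA data: exactly y = 1.5 + 0.5x
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
theta0, theta1 = fit_linear(hours, gpa)
print(round(theta0, 2), round(theta1, 2))  # ~1.5 and ~0.5
```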
Logistic regression
It is a supervised learning algorithm that aims to find the boundary separating the classes so that new points can be classified correctly.
In logistic regression, the estimated curve is the sigmoid function, whose output lies between 0 and 1:

g(z) = 1 / (1 + e^(−z))

Based on the sigmoid output, logistic regression makes its decision.
How to take decision?
If hθ(x) >= 0.5, predict y = 1; if hθ(x) < 0.5, predict y = 0.
So, we need the θ that gives the best decision boundary.
Example:
hθ(x) = g(θ0 + θ1·x1 + θ2·x2), with θ = [−3, 1, 1]
Predict y = 1 when −3 + x1 + x2 >= 0,
i.e., when x1 + x2 >= 3. This is the equation of a line, so the decision boundary is linear.
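The example can be checked directly in Python (the function names are my own):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(theta, x):
    """h_theta(x) = g(theta0 + theta1*x1 + theta2*x2); predict 1 when h >= 0.5."""
    z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
    return 1 if sigmoid(z) >= 0.5 else 0

theta = [-3, 1, 1]             # the boundary x1 + x2 = 3 from the example
print(predict(theta, (2, 2)))  # 1, because 2 + 2 >= 3
print(predict(theta, (1, 1)))  # 0, because 1 + 1 < 3
```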
What if the cost function is not convex?
That is, the cost function may contain multiple local minima; plugging the sigmoid into the squared-error cost makes it non-convex. To solve this issue, we change the cost function.
The new cost captures the intuition that if hθ(x) = 0 (that is, we predict P(y = 1 | x; θ) = 0) but in fact y = 1, we penalize the learning algorithm with a very large cost.
The cost when y = 1 is −log(hθ(x)) (the left curve), and the cost when y = 0 is −log(1 − hθ(x)) (the right curve).
Taken separately, each of the two cost functions is convex.
General formula
The general formula combines the two separate curves into one function:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))

Apply gradient descent to this convex cost to approach the global minimum.
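Putting the sigmoid, the convex cost, and gradient descent together in one sketch (one feature, invented data, and my own function names):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cost(theta, xs, ys):
    """Average cross-entropy cost: -y*log(h) - (1-y)*log(1-h)."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(theta[0] + theta[1] * x)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / len(xs)

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    theta = [0.0, 0.0]
    m = len(xs)
    for _ in range(epochs):
        # The gradient takes the same (h - y) * x form as linear regression
        errs = [sigmoid(theta[0] + theta[1] * x) - y for x, y in zip(xs, ys)]
        theta[0] -= lr * sum(errs) / m
        theta[1] -= lr * sum(e * x for e, x in zip(errs, xs)) / m
    return theta

xs, ys = [0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1]
theta = fit_logistic(xs, ys)
print(cost(theta, xs, ys) < 0.3)  # True: the boundary settles near x = 2.5
```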
We can conclude that logistic regression is best with binary classification.
What if we have more than two classes?
Apply the one-vs-all method. It is a way of using binary classification algorithms for multi-class classification.
For each class, treat that class as positive and group all the other classes into a single negative class, which gives one binary classifier per class. To classify a new point, evaluate hθ(x) for every classifier and pick the class with the maximum value.
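A one-vs-all sketch built on a tiny two-feature logistic regression (all function names and the toy data are invented):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_binary(points, labels, lr=0.3, epochs=4000):
    """Tiny 2-feature logistic regression; returns theta = [b, w1, w2]."""
    theta = [0.0, 0.0, 0.0]
    m = len(points)
    for _ in range(epochs):
        errs = [sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2) - y
                for (x1, x2), y in zip(points, labels)]
        theta[0] -= lr * sum(errs) / m
        theta[1] -= lr * sum(e * p[0] for e, p in zip(errs, points)) / m
        theta[2] -= lr * sum(e * p[1] for e, p in zip(errs, points)) / m
    return theta

def one_vs_all(points, classes):
    # One binary classifier per class: that class is 1, all others are 0
    return {c: fit_binary(points, [1 if y == c else 0 for y in classes])
            for c in set(classes)}

def predict(models, x):
    # Classify to the class whose classifier outputs the highest h_theta(x)
    return max(models, key=lambda c: sigmoid(
        models[c][0] + models[c][1] * x[0] + models[c][2] * x[1]))

points = [(0, 0), (0, 1), (2, 2), (2, 3), (4, 0), (5, 1)]
classes = ["a", "a", "b", "b", "c", "c"]
models = one_vs_all(points, classes)
print(predict(models, (2, 2)))  # "b"
```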
Summary
This article discussed some of the most basic supervised learning algorithms used to solve problems and make predictions.
After we’ve covered the supervised algorithms, we’ll look at the unsupervised learning algorithms in the following article.