Towards Machine Learning - Supervised Algorithms

Fatima Mubarak · Published in Tech Blog · 8 min read · Dec 25, 2022

Machine learning is the study and development of algorithms that can learn from data and generate predictions based on it. A machine learning model can adjust its decisions as more data arrives, which makes it more effective over time. Algorithms, in contrast, are collections of explicitly programmed instructions that computers apply to analyze or solve problems.

Reference: edsurge.com

This article explains some of the popular supervised learning algorithms: K-NN (K-nearest neighbors), Naive Bayes, linear regression, and logistic regression.

This is the second article in the “Towards Machine Learning” series. You can check the previously published article at the following link:

Introduction

Before we talk about algorithms, let’s see how to calculate distances. Distances are used in both supervised and unsupervised learning to measure the similarity between data points. An effective distance measure can improve the performance of our machine learning models in both classification and clustering tasks.

Minkowski Metric

Minkowski distance is the generalized form of the Euclidean and Manhattan distances:

$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

  • When p = 1: Manhattan distance
  • When p = 2: Euclidean distance
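As a quick illustration, here is a minimal sketch of the Minkowski distance in plain Python (the function name and sample vectors are my own, for illustration):

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length vectors x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski_distance(x, y, p=1))  # Manhattan distance: 5.0
print(minkowski_distance(x, y, p=2))  # Euclidean distance: ~3.61
```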

What is supervised learning?

Supervised learning algorithms are a subcategory of machine learning. They use labeled datasets to train models to classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it fits the data appropriately; this fitting is typically checked as part of a cross-validation process.

Supervised learning (Reference: medium.com)

Various real-world problems can be solved at scale with supervised learning, such as classifying spam into separate folders.

The following sections briefly explain some of the popular supervised learning algorithms:

  1. K-NN (K nearest neighbor)
  2. Naive Bayes
  3. Linear Regression
  4. Logistic Regression

K-NN (K nearest neighbor)

K-NN is a simple supervised learning algorithm. It is nonparametric, meaning it makes no assumptions about the underlying data distribution. Instead, K-NN assumes that similar things are close to each other, so a new point is classified by looking at the points nearest to it.

Advantages

  • Can be used for classification and regression.
  • Simple algorithm

Disadvantages

  • High memory requirement: all of the training data must be kept in memory to find the K closest neighbors.

Well, Okay. But, What is K?!


K is the number of nearest neighbors considered when classifying a new point.

Things to consider when choosing K

  • We should not use an even value for K in binary classification, since an even number of neighbors can produce a tied vote.
  • In multi-class classification, however, an even value can be used.

Steps of K-nearest neighbor

  1. Load the data
  2. Calculate the distance between the test instance and every training point
  3. Get the K nearest neighbors
  4. Make a prediction (for classification, the most frequent class among the neighbors), as in the sketch below
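Here is a minimal sketch of these four steps in plain Python (the toy dataset and function name are hypothetical, for illustration):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by a majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    # Step 2: compute the Euclidean distance to every training point.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5, label)
        for point, label in train
    ]
    # Step 3: keep the k closest neighbors.
    neighbors = sorted(distances)[:k]
    # Step 4: predict the most frequent class among them.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Step 1: load a (toy) dataset.
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, query=(2, 1), k=3))  # -> "A"
```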

Do all the neighbors affect our test instance in the same way? If not, how do we account for that?

Selecting K neighbors

Weighted k-NN

In weighted K-NN, nearby points are given more weight than points farther away: the smaller the distance, the higher the weight. The new point is then assigned to the class with the greatest total weight.

Weights formula:

Weights formula

Distance weights formula:

Distance weight formula
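Several weighting schemes exist; a common choice (an assumption here, since the formulas above are not reproduced) is inverse-distance weighting, where each neighbor votes with weight w = 1/d. A minimal sketch:

```python
from collections import defaultdict

def weighted_knn_predict(train, query, k):
    """Weighted K-NN: closer neighbors get a larger vote (w = 1 / d)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5, label)
        for point, label in train
    )
    votes = defaultdict(float)
    for dist, label in distances[:k]:
        if dist == 0:  # the query coincides with a training point
            return label
        votes[label] += 1.0 / dist  # smaller distance -> higher weight
    return max(votes, key=votes.get)

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(weighted_knn_predict(train, query=(2, 1), k=3))  # -> "A"
```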

Naïve Bayes Classifier

It is a supervised learning algorithm that takes a probabilistic approach to classification, based on Bayes’ theorem together with a naïve independence assumption.

Bayes’ formula, where d is a document and c is a class:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$

Advantages

  • It works well with high-dimensional data.

Disadvantages

  • If the independence assumption does not hold, performance can be very poor.

For text, naïve Bayes typically relies on a “bag of words” representation.

How does it work?

Suppose we have texts representing tweets as input, and we want to know whether each tweet is positive or negative. So, we could have a bag of positive words (such as good, nice, happy, and adorable) and a bag of negative words (such as hate, not, and bad).

We then count the number of times each word appears in the document to classify the document as positive or negative.

Bag of words
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d) = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, \ldots, x_n \mid c)\, P(c)$$

In the above formula:

  • MAP stands for “maximum a posteriori,” i.e., the most likely class.
  • The document d is represented as features x1, …, xn.
  • argmax returns the class to which the document is assigned.

Assume the feature probabilities P (xi |cj) are independent given the class c.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$$

So, we will get:

$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i} P(x_i \mid c_j)$$

Learning the Naïve Bayes Model

Maximum likelihood estimation:

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

where V is the vocabulary. This is the fraction of times the word wi appears among all words in documents of topic cj.

The prior probability is the fraction of training documents that belong to each class:

$$\hat{P}(c_j) = \frac{N_{c_j}}{N_{doc}}$$

What if a word never appears in the training documents of a given class?

Its estimated probability will be zero, and zero probabilities cannot be conditioned away: a single zero turns the whole product into zero. To solve this problem, we use Laplace (add-1) smoothing for naïve Bayes:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$

In practice, we also sum log probabilities rather than multiplying raw probabilities, which keeps the product from underflowing to zero in floating point:

$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}} \left[ \log P(c_j) + \sum_{i} \log P(x_i \mid c_j) \right]$$
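Putting the pieces together, here is a minimal sketch of a multinomial naïve Bayes classifier with add-1 smoothing and log probabilities (the toy documents and function names are hypothetical):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (words, class) pairs. Returns log priors, log likelihoods, vocab."""
    class_counts = Counter(c for _, c in docs)
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    word_counts = {c: Counter() for c in class_counts}
    for words, c in docs:
        word_counts[c].update(words)
    vocab = {w for words, _ in docs for w in words}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        for w in vocab:
            # Laplace (add-1) smoothing avoids zero probabilities.
            log_lik[w, c] = math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return log_priors, log_lik, vocab

def classify_nb(words, log_priors, log_lik, vocab):
    """argmax over classes of log P(c) + sum of log P(w | c), skipping unseen words."""
    scores = {
        c: lp + sum(log_lik[w, c] for w in words if w in vocab)
        for c, lp in log_priors.items()
    }
    return max(scores, key=scores.get)

docs = [(["good", "nice", "happy"], "pos"), (["bad", "hate", "not"], "neg")]
log_priors, log_lik, vocab = train_nb(docs)
print(classify_nb(["happy", "nice", "day"], log_priors, log_lik, vocab))  # -> "pos"
```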

Univariate Linear regression

It is a supervised learning algorithm that aims to represent the data with the best-fitting linear model. Linear regression analysis is used to predict the value of one variable based on the value of another.

How do we find the best approximating linear model?

The line with the least error is the best fit.

First, we take the observations and try to find a straight line that passes as close as possible to all of them; this is called the regression line, and it is based on the least squares method. In the end, we want to minimize the difference between the estimated values and the actual values; that is, we want to minimize the errors.

Regression line (Reference: StatisticsFun)

Let’s suppose the independent variable is study time and the dependent variable is GPA.

So, the estimated line ŷ gives the estimated grade:

$$\hat{y} = B_0 + B_1 x$$

Estimated line (Reference: StatisticsFun)

B0 is the y-intercept, and B1 is the slope.

(Reference: StatisticsFun)

If the relation between the dependent and independent variables is positive, the slope will be positive, while if the relation is negative, the slope will be negative.

Suppose the estimated value of yi is hθ(xi) and the actual observation is yi. The estimated line is:

$$h_\theta(x_i) = \theta_0 + \theta_1 x_i$$

where θ0 is the intercept and θ1 is the slope.

So, the error for a single example will be:

$$e_i = h_\theta(x_i) - y_i$$

We want to minimize the error across all m examples:

$$\min_{\theta_0, \theta_1} \; \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$

To do this, we define a cost function. For the linear regression model, the cost function is the mean squared error between the predicted values and the actual values, and training means finding the parameters that minimize it:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$
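A minimal sketch of univariate linear regression trained by gradient descent on this cost function (the learning rate, iteration count, and toy data are assumptions):

```python
def fit_linear(xs, ys, alpha=0.01, epochs=5000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J(theta0, theta1) = (1/2m) * sum(errors^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data: GPA roughly 1.0 + 0.5 * study hours.
hours = [1, 2, 3, 4, 5]
gpa = [1.5, 2.0, 2.5, 3.0, 3.5]
print(fit_linear(hours, gpa))  # approaches (1.0, 0.5)
```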

Logistic regression

It is a supervised learning algorithm that aims to find the boundary that separates the classes from each other in order to classify correctly.

In logistic regression, the estimated curve is a sigmoid function, which lies between 0 and 1:

$$h_\theta(x) = g(\theta^T x)$$

where the sigmoid function is:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Based on the sigmoid function, logistic regression makes its decision:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} = P(y = 1 \mid x;\ \theta)$$

How do we take a decision?

If $h_\theta(x) \geq 0.5$, predict y = 1; if $h_\theta(x) < 0.5$, predict y = 0.

So, we observe that we need the θ that gives the best decision boundary.

Example:

Distributed data

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2), \quad \text{with } \theta = [-3,\ 1,\ 1]^T$$

Predict y = 1 when $-3 + x_1 + x_2 \geq 0$, that is, when $x_1 + x_2 \geq 3$. This is the equation of a line, so the decision boundary is a straight line.

Decision boundary

What if the cost function is not convex?

That is, what if the cost function contains multiple local minima? To solve this issue, we change the cost function:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

This captures the intuition that if $h_\theta(x) = 0$ (i.e., we predict $P(y = 1 \mid x;\ \theta) = 0$) but in fact y = 1, we penalize the learning algorithm with a very large cost.

The cost function if y = 1 is the left curve, and the right curve is the cost function if y = 0.

Changing cost function

Taken separately, each of the two cost functions is convex.

General formula

The general formula combines the two separate curves into a single function:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

so the overall cost over m examples is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$$

Apply gradient descent to approach the global minimum:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{i,j}$$
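A minimal sketch of logistic regression trained with this gradient descent update (the toy data, learning rate, and iteration count are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=2000):
    """Fit h(x) = sigmoid(theta . x) by gradient descent on the logistic cost.

    Each row of X starts with a constant 1 for the intercept term theta0.
    """
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(epochs):
        preds = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) for row in X]
        for j in range(len(theta)):
            grad_j = sum((p - yi) * row[j] for p, yi, row in zip(preds, y, X)) / m
            theta[j] -= alpha * grad_j
    return theta

# Toy data: y = 1 when x1 + x2 is large, echoing the example above.
X = [[1, 1, 1], [1, 2, 1], [1, 4, 5], [1, 5, 5]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
preds = [1 if sigmoid(sum(t * xj for t, xj in zip(theta, row))) >= 0.5 else 0 for row in X]
print(preds)  # -> [0, 0, 1, 1]
```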

We can conclude that logistic regression is best suited to binary classification.

What if we have more than two classes?

Apply the one-vs-all method. It is a technique for using binary classification algorithms on multi-class classification problems.

One vs all method

Each time, choose one class and group all the other classes into a single class, so that each classifier is binary. Then, to classify a new input, compute $h_\theta(x)$ for every classifier and pick the class with the maximum value, as in the sketch below.
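A minimal sketch of one-vs-all, reusing fit_logistic and sigmoid from the previous block (the three-class toy data is an assumption):

```python
def fit_one_vs_all(X, y, classes):
    """Train one binary logistic classifier per class (that class vs. the rest)."""
    return {c: fit_logistic(X, [1 if yi == c else 0 for yi in y]) for c in classes}

def predict_one_vs_all(models, row):
    """Pick the class whose classifier reports the highest h_theta(x)."""
    return max(
        models,
        key=lambda c: sigmoid(sum(t * xj for t, xj in zip(models[c], row))),
    )

X = [[1, 0, 0], [1, 0, 1], [1, 5, 0], [1, 5, 1], [1, 2, 6], [1, 3, 6]]
y = ["a", "a", "b", "b", "c", "c"]
models = fit_one_vs_all(X, y, classes={"a", "b", "c"})
print(predict_one_vs_all(models, [1, 4, 0]))  # -> "b"
```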

Summary

This article discussed the most basic supervised learning algorithms used to solve problems and make decisions.

Now that we’ve covered the supervised algorithms, we’ll look at unsupervised learning algorithms in the following article.
