Towards Machine Learning - Supervised Algorithms

Fatima Mubarak · Published in Tech Blog · 8 min read · Dec 25, 2022

Machine learning is the study and development of algorithms that can learn from data and generate predictions based on it. A machine learning model can adjust its decisions as more data arrives, which makes it more effective over time. Algorithms, in contrast, are collections of explicitly programmed instructions that computers apply to analyze or solve problems.

Reference: edsurge.com

This article explains some of the popular supervised learning algorithms: K-NN (K-nearest neighbors), Naive Bayes, linear regression, and logistic regression.

This is the second article in the “Towards Machine Learning” series. You can check the previously published article at the following link:

Introduction

Before we talk about algorithms, let’s see how to calculate distances. Distances are used in both supervised and unsupervised learning to measure the similarity between data points. An effective distance measure can improve the performance of our machine learning models in both classification and clustering tasks.

Minkowski Metric

Minkowski distance is the generalized form of the Euclidean and Manhattan distances:

$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

  • When p = 1: Manhattan distance
  • When p = 2: Euclidean distance
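As a quick illustration, here is a minimal sketch of the Minkowski distance in plain Python (the function name and sample vectors are my own, for illustration):

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length vectors x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski_distance(x, y, p=1))  # Manhattan distance: 5.0
print(minkowski_distance(x, y, p=2))  # Euclidean distance: ~3.61
```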

What is supervised learning?

Supervised learning algorithms are a subcategory of machine learning. They use labeled datasets to train models to classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it fits the data appropriately; this fitting is typically checked as part of a cross-validation process.

Supervised learning (Reference: medium.com)

Various real-world problems can be solved at scale with supervised learning, such as classifying spam into separate folders.

The following sections briefly explain some of the popular supervised learning algorithms:

  1. K-NN (K nearest neighbor)
  2. Naive Bayes
  3. Linear Regression
  4. Logistic Regression

K-NN (K nearest neighbor)

K-NN is a simple supervised learning algorithm. It is nonparametric, meaning it makes no assumptions about the underlying data distribution. Instead, K-NN assumes that similar things are close to each other, so a new point is classified by looking at the points nearest to it.

Advantages

  • Can be used for classification and regression.
  • Simple algorithm

Disadvantages

  • High memory requirement: all of the training data must be kept in memory to find the K closest neighbors.

Well, Okay. But, What is K?!


K is the number of nearest neighbors considered when classifying a new point.

Things to consider when choosing K

  • We should not use an even value for K in binary classification, since an even number of neighbors can produce a tied vote.
  • In multi-class classification, however, an even value can be used.

Steps of K-nearest neighbor

  1. Load the data
  2. Calculate the distance between the test instance and every training point
  3. Get the K nearest neighbors
  4. Make a prediction (for classification, the most frequent class among the neighbors), as in the sketch below
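Here is a minimal sketch of these four steps in plain Python (the toy dataset and function name are hypothetical, for illustration):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by a majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    # Step 2: compute the Euclidean distance to every training point.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5, label)
        for point, label in train
    ]
    # Step 3: keep the k closest neighbors.
    neighbors = sorted(distances)[:k]
    # Step 4: predict the most frequent class among them.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Step 1: load a (toy) dataset.
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, query=(2, 1), k=3))  # -> "A"
```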

Do all the neighbors affect our test instance in the same way? If not, how do we account for that?

Selecting K neighbors

Weighted k-NN

In weighted K-NN, nearby points are given more weight than points farther away: the smaller the distance, the higher the weight. The new point is then assigned to the class with the greatest total weight.

Weights formula:

Weights formula

Distance weights formula:

Distance weight formula
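Several weighting schemes exist; a common choice (an assumption here, since the formulas above are not reproduced) is inverse-distance weighting, where each neighbor votes with weight w = 1/d. A minimal sketch:

```python
from collections import defaultdict

def weighted_knn_predict(train, query, k):
    """Weighted K-NN: closer neighbors get a larger vote (w = 1 / d)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5, label)
        for point, label in train
    )
    votes = defaultdict(float)
    for dist, label in distances[:k]:
        if dist == 0:  # the query coincides with a training point
            return label
        votes[label] += 1.0 / dist  # smaller distance -> higher weight
    return max(votes, key=votes.get)

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(weighted_knn_predict(train, query=(2, 1), k=3))  # -> "A"
```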

Naïve Bayes Classifier

It is a supervised learning algorithm that takes a probabilistic approach to classification, based on Bayes’ theorem together with a naïve independence assumption.

Bayes’ formula, where d is a document and c is a class:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$

Advantages

  • It works well with high-dimensional data.

Disadvantages

  • If the independence assumption does not hold, performance can be very poor.

For text, naïve Bayes typically relies on a “bag of words” representation.

How does it work?

Suppose we have texts representing tweets as input, and we want to know whether each tweet is positive or negative. So, we could have a bag of positive words (such as good, nice, happy, and adorable) and a bag of negative words (such as hate, not, and bad).

We then count the number of times each word appears in the document to classify the document as positive or negative.

Bag of words
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d) = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, \ldots, x_n \mid c)\, P(c)$$

In the above formula:

  • MAP stands for “maximum a posteriori,” i.e., the most likely class.
  • The document d is represented as features x1, …, xn.
  • argmax returns the class to which the document is assigned.

Assume the feature probabilities P (xi |cj) are independent given the class c.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)$$

So, we will get:

$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i} P(x_i \mid c_j)$$

Learning the Naïve Bayes Model

Maximum likelihood estimation:

$$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$$

where V is the vocabulary. This is the fraction of times the word wi appears among all words in documents of topic cj.

The prior probability is the fraction of training documents that belong to each class:

$$\hat{P}(c_j) = \frac{N_{c_j}}{N_{doc}}$$

What if a word never appears in the training documents of a given class?

Its estimated probability will be zero, and zero probabilities cannot be conditioned away: a single zero turns the whole product into zero. To solve this problem, we use Laplace (add-1) smoothing for naïve Bayes:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\left( \sum_{w \in V} count(w, c) \right) + |V|}$$

In practice, we also sum log probabilities rather than multiplying raw probabilities, which keeps the product from underflowing to zero in floating point:

$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}} \left[ \log P(c_j) + \sum_{i} \log P(x_i \mid c_j) \right]$$
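Putting the pieces together, here is a minimal sketch of a multinomial naïve Bayes classifier with add-1 smoothing and log probabilities (the toy documents and function names are hypothetical):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (words, class) pairs. Returns log priors, log likelihoods, vocab."""
    class_counts = Counter(c for _, c in docs)
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    word_counts = {c: Counter() for c in class_counts}
    for words, c in docs:
        word_counts[c].update(words)
    vocab = {w for words, _ in docs for w in words}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        for w in vocab:
            # Laplace (add-1) smoothing avoids zero probabilities.
            log_lik[w, c] = math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return log_priors, log_lik, vocab

def classify_nb(words, log_priors, log_lik, vocab):
    """argmax over classes of log P(c) + sum of log P(w | c), skipping unseen words."""
    scores = {
        c: lp + sum(log_lik[w, c] for w in words if w in vocab)
        for c, lp in log_priors.items()
    }
    return max(scores, key=scores.get)

docs = [(["good", "nice", "happy"], "pos"), (["bad", "hate", "not"], "neg")]
log_priors, log_lik, vocab = train_nb(docs)
print(classify_nb(["happy", "nice", "day"], log_priors, log_lik, vocab))  # -> "pos"
```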

Univariate Linear regression

It is a supervised learning algorithm that aims to represent the data with the best-fitting linear model. Linear regression analysis is used to predict the value of one variable based on the value of another.

How do we find the best approximating linear model?

The line with the least error is the best fit.

First, we take the observations and try to find a straight line that passes as close as possible to all of them; this is called the regression line, and it is based on the least squares method. In the end, we want to minimize the difference between the estimated values and the actual values; that is, we want to minimize the errors.

Regression line (Reference: StatisticsFun)

Let’s suppose the independent variable is study time and the dependent variable is GPA.

So, the estimated line ŷ gives the estimated grade:

$$\hat{y} = B_0 + B_1 x$$

Estimated line (Reference: StatisticsFun)

B0 is the y-intercept, and B1 is the slope.

(Reference: StatisticsFun)

If the relation between the dependent and independent variables is positive, the slope will be positive, while if the relation is negative, the slope will be negative.

Suppose the estimated value of yi is hθ(xi) and the actual observation is yi. The estimated line is:

$$h_\theta(x_i) = \theta_0 + \theta_1 x_i$$

where θ0 is the intercept and θ1 is the slope.

So, the error for a single example will be:

$$e_i = h_\theta(x_i) - y_i$$

We want to minimize the error across all m examples:

$$\min_{\theta_0, \theta_1} \; \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$

To do this, we define a cost function. For the linear regression model, the cost function is the mean squared error between the predicted values and the actual values, and training means finding the parameters that minimize it:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$
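A minimal sketch of univariate linear regression trained by gradient descent on this cost function (the learning rate, iteration count, and toy data are assumptions):

```python
def fit_linear(xs, ys, alpha=0.01, epochs=5000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the MSE cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J(theta0, theta1) = (1/2m) * sum(errors^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data: GPA roughly 1.0 + 0.5 * study hours.
hours = [1, 2, 3, 4, 5]
gpa = [1.5, 2.0, 2.5, 3.0, 3.5]
print(fit_linear(hours, gpa))  # approaches (1.0, 0.5)
```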

Logistic regression

It is a supervised learning algorithm that aims to find the boundary that separates the classes from each other in order to classify correctly.

In logistic regression, the estimated curve is a sigmoid function, which lies between 0 and 1:

$$h_\theta(x) = g(\theta^T x)$$

where the sigmoid function is:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Based on the sigmoid function, logistic regression makes its decision:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} = P(y = 1 \mid x;\ \theta)$$

How do we take a decision?

If $h_\theta(x) \geq 0.5$, predict y = 1; if $h_\theta(x) < 0.5$, predict y = 0.

So, we observe that we need the θ that gives the best decision boundary.

Example:

Distributed data

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2), \quad \text{with } \theta = [-3,\ 1,\ 1]^T$$

Predict y = 1 when $-3 + x_1 + x_2 \geq 0$, that is, when $x_1 + x_2 \geq 3$. This is the equation of a line, so the decision boundary is a straight line.

Decision boundary

What if the cost function is not convex?

That is, what if the cost function contains multiple local minima? To solve this issue, we change the cost function:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

This captures the intuition that if $h_\theta(x) = 0$ (i.e., we predict $P(y = 1 \mid x;\ \theta) = 0$) but in fact y = 1, we penalize the learning algorithm with a very large cost.

The cost function if y = 1 is the left curve, and the right curve is the cost function if y = 0.

Changing cost function

Taken separately, each of the two cost functions is convex.

General formula

The general formula combines the two separate curves into a single function:

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

so the overall cost over m examples is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$$

Apply gradient descent to approach the global minimum:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{i,j}$$
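A minimal sketch of logistic regression trained with this gradient descent update (the toy data, learning rate, and iteration count are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=2000):
    """Fit h(x) = sigmoid(theta . x) by gradient descent on the logistic cost.

    Each row of X starts with a constant 1 for the intercept term theta0.
    """
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(epochs):
        preds = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) for row in X]
        for j in range(len(theta)):
            grad_j = sum((p - yi) * row[j] for p, yi, row in zip(preds, y, X)) / m
            theta[j] -= alpha * grad_j
    return theta

# Toy data: y = 1 when x1 + x2 is large, echoing the example above.
X = [[1, 1, 1], [1, 2, 1], [1, 4, 5], [1, 5, 5]]
y = [0, 0, 1, 1]
theta = fit_logistic(X, y)
preds = [1 if sigmoid(sum(t * xj for t, xj in zip(theta, row))) >= 0.5 else 0 for row in X]
print(preds)  # -> [0, 0, 1, 1]
```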

We can conclude that logistic regression is best suited to binary classification.

What if we have more than two classes?

Apply the one-vs-all method. It is a technique for using binary classification algorithms on multi-class classification problems.

One vs all method

Each time, choose one class and group all the other classes into a single class, so that each classifier is binary. Then, to classify a new input, compute $h_\theta(x)$ for every classifier and pick the class with the maximum value, as in the sketch below.
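A minimal sketch of one-vs-all, reusing fit_logistic and sigmoid from the previous block (the three-class toy data is an assumption):

```python
def fit_one_vs_all(X, y, classes):
    """Train one binary logistic classifier per class (that class vs. the rest)."""
    return {c: fit_logistic(X, [1 if yi == c else 0 for yi in y]) for c in classes}

def predict_one_vs_all(models, row):
    """Pick the class whose classifier reports the highest h_theta(x)."""
    return max(
        models,
        key=lambda c: sigmoid(sum(t * xj for t, xj in zip(models[c], row))),
    )

X = [[1, 0, 0], [1, 0, 1], [1, 5, 0], [1, 5, 1], [1, 2, 6], [1, 3, 6]]
y = ["a", "a", "b", "b", "c", "c"]
models = fit_one_vs_all(X, y, classes={"a", "b", "c"})
print(predict_one_vs_all(models, [1, 4, 0]))  # -> "b"
```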

Summary

This article discussed the most basic supervised learning algorithms used to solve problems and make decisions.

Now that we’ve covered the supervised algorithms, we’ll look at unsupervised learning algorithms in the following article.
