Machine Learning I: Supervised Learning Explained in Detail

Çağatay Tüylü
Aug 3, 2021

What’s Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (non-spam, also called “ham”) emails. The examples that the system uses to learn are called the training set, and each training example is called a training instance. A natural performance measure is the ratio of correctly classified emails; this particular performance measure is called accuracy, and it is often used in classification tasks.

Supervised Machine Learning

Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately. This occurs as part of the cross-validation process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations solve a variety of real-world problems at scale, such as filtering spam into a separate folder away from your inbox. Some methods used in supervised learning include neural networks, naive Bayes, linear regression, logistic regression, random forests, support vector machines (SVM), and more.
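As a minimal sketch of this workflow (using scikit-learn purely for illustration; the library and the synthetic dataset are my choices, not something the text prescribes), a labeled dataset is split, a model is fitted, and accuracy is measured on held-out data:

```python
# A minimal supervised-learning sketch with scikit-learn (assumed installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic labeled dataset: X holds the features, y holds the labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set so we estimate generalization, not just training fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # the model adjusts its weights to fit the labels

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```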

Type of prediction

The two main types of prediction are summed up below:

  • Regression: the outcome is continuous (example: linear regression).
  • Classification: the outcome is a class (examples: logistic regression, SVM, naive Bayes).

Type of model

The two main families of models are summed up below:

  • Discriminative models: directly estimate P(y|x) and learn the decision boundary (examples: regressions, SVMs).
  • Generative models: estimate P(x|y) and then deduce P(y|x), learning the probability distribution of the data (examples: GDA, naive Bayes).

Hypothesis

The hypothesis is noted h_θ and is the model that we choose. For a given input x^(i), the model prediction output is h_θ(x^(i)).

A hypothesis is a proposed explanation for a phenomenon. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories. Even though the words “hypothesis” and “theory” are often used synonymously, a scientific hypothesis is not the same as a scientific theory. A working hypothesis is a provisionally accepted hypothesis proposed for further research, in a process beginning with an educated guess or thought.

Loss function

A loss function is a function L : (z, y) ∈ ℝ × Y ↦ L(z, y) ∈ ℝ that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up below:

  • Least squared error: ½ (y − z)², used in linear regression.
  • Logistic loss: log(1 + exp(−yz)), used in logistic regression.
  • Hinge loss: max(0, 1 − yz), used in SVMs.
  • Cross-entropy: −[y log(z) + (1 − y) log(1 − z)], used in neural networks.

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.
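To make the definition concrete, here is a small Python sketch of the losses listed above (the function names are illustrative, and y is assumed to be encoded as −1/+1 for the margin-based losses):

```python
import numpy as np

# Common loss functions, written for a single (prediction z, true value y) pair.
def squared_loss(z, y):
    # Least squared error, used in linear regression.
    return 0.5 * (y - z) ** 2

def logistic_loss(z, y):
    # Logistic loss, used in logistic regression; y is expected in {-1, +1}.
    return np.log(1 + np.exp(-y * z))

def hinge_loss(z, y):
    # Hinge loss, used in SVMs; y is expected in {-1, +1}.
    return max(0.0, 1 - y * z)
```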

Cost function

The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

J(θ) = ∑_{i=1}^{m} L(h_θ(x^(i)), y^(i))

Gradient descent

By noting α ∈ ℝ the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

θ ← θ − α ∇J(θ)
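A minimal NumPy sketch of this update rule, applied to the linear-regression cost defined above (the toy data and hyperparameters are my own illustrative choices):

```python
import numpy as np

# Batch gradient descent for linear regression with squared loss:
# J(theta) = (1/2m) * sum_i (theta^T x_i - y_i)^2, update theta <- theta - alpha * grad J.
def gradient_descent(X, y, alpha=0.5, n_iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of the cost J
        theta -= alpha * grad             # update rule with learning rate alpha
    return theta

# Hypothetical toy data: y = 1 + 2x plus noise, with a bias column of ones.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones(100), x])
y = 1 + 2 * x + rng.normal(0, 0.05, 100)
print(gradient_descent(X, y))  # approximately [1.0, 2.0]
```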

Linear regression

Linear regression is a linear model, i.e., a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).
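For comparison with the iterative approach above, linear regression also admits a direct least-squares solution. This NumPy sketch (with the same kind of hypothetical toy data) uses a least-squares solver rather than inverting XᵀX explicitly:

```python
import numpy as np

# Closed-form (normal equation) view of linear regression: theta = (X^T X)^{-1} X^T y,
# computed here with NumPy's numerically stabler least-squares solver.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones(100), x])    # bias column + one feature
y = 1 + 2 * x + rng.normal(0, 0.05, 100)  # hypothetical data: y ≈ 1 + 2x

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [1.0, 2.0]
```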

Classification and logistic regression

Logistic regression

Logistic regression is used when the dependent variable (target) is categorical. It is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

We assume here that y | x; θ ∼ Bernoulli(φ). The model then has the following form:

φ = h_θ(x) = 1 / (1 + exp(−θᵀx))

i.e., the sigmoid function applied to the linear score θᵀx.

It is the go-to method for binary classification problems (problems with two class values).
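A minimal sketch of the hypothesis above in plain NumPy, assuming θ has already been fitted (e.g., by gradient descent on the logistic loss):

```python
import numpy as np

# The logistic (sigmoid) function squashes theta^T x into a probability in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # h_theta(x) = sigmoid(theta^T x): the estimated probability that y = 1.
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    # Binary classification: predict class 1 when the probability exceeds the threshold.
    return (predict_proba(theta, X) >= threshold).astype(int)
```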

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

CART

Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

Classification and Regression Trees or CART for short is a term introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for classification or regression predictive modeling problems.
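A minimal sketch using scikit-learn's CART implementation (the iris dataset and the depth limit are illustrative choices, not prescribed by the text):

```python
# A small decision tree on the iris dataset, printed as human-readable rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted binary tree can be read directly, which is why CART is so interpretable.
print(export_text(tree))
```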

Random Forest

It is a tree-based technique that uses a large number of decision trees built out of randomly selected sets of features. Contrary to a simple decision tree, it is much less interpretable, but its generally good performance makes it a popular algorithm.

A random forest is a machine learning technique that’s used to solve regression and classification problems. It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems. A random forest algorithm consists of many decision trees.

How random forest algorithm works

Understanding decision trees

Decision trees are the building blocks of a random forest algorithm. A decision tree is a decision support technique that forms a tree-like structure. An overview of decision trees will help us understand how random forest algorithms work.

A decision tree consists of three components: a root node, decision nodes, and leaf nodes. A decision tree algorithm divides a training dataset into branches, which further segregate it into other branches. This sequence continues until a leaf node is attained; a leaf node cannot be segregated further.

The nodes in the decision tree represent attributes that are used for predicting the outcome, and decision nodes provide the links down to the leaves.
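Putting the pieces together, here is a minimal random-forest sketch with scikit-learn (the dataset and hyperparameters are illustrative): many decision trees are fitted on bootstrapped samples with random feature subsets, and their predictions are combined.

```python
# A random forest: an ensemble of decision trees with randomized features.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())  # mean cross-validated accuracy
```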

Boosting

Adaptive boosting

  • High weights are put on errors to improve at the next boosting step
  • Known as Adaboost

AdaBoost was one of the first boosting algorithms to be adopted in practice. It combines multiple “weak classifiers” into a single “strong classifier”.
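A minimal AdaBoost sketch with scikit-learn, whose default weak learner is a depth-1 decision tree (a “stump”); the synthetic dataset is an illustrative choice:

```python
# AdaBoost: each round reweights misclassified points so the next weak
# classifier focuses on the previous round's errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(n_estimators=50, random_state=0)  # default weak learner: depth-1 tree
print(ada.fit(X, y).score(X, y))  # training accuracy of the combined strong classifier
```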

Gradient boosting

  • Weak learners are trained on residuals
  • Examples include XGBoost

Gradient boosting is a machine learning technique for regression, classification, and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

How is XGBoost different from gradient boosting?

While regular gradient boosting uses the loss function of our base model (e.g., a decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses a second-order derivative as an approximation of the loss, along with built-in regularization.
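A minimal gradient-boosting sketch with scikit-learn (XGBoost itself would be a drop-in alternative; the dataset and hyperparameters here are illustrative):

```python
# Gradient boosting for regression: each new shallow tree is fitted to the
# residual errors of the current ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
gbr.fit(X_train, y_train)
print("R^2:", gbr.score(X_test, y_test))  # held-out coefficient of determination
```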

Other non-parametric approaches

k-nearest neighbors

The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k nearest neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.
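A minimal k-NN sketch with scikit-learn that varies k to illustrate the bias/variance remark above (the dataset choice is illustrative):

```python
# k-NN classification: larger k averages over more neighbors (higher bias),
# smaller k follows the training data more closely (higher variance).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    print(k, cross_val_score(knn, X, y, cv=5).mean())  # accuracy per choice of k
```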


Source:

What is Machine Learning? | IBM

CS 229 — Supervised Learning Cheatsheet (stanford.edu)

Gradient descent — Wikipedia

Logistic Regression for Machine Learning (machinelearningmastery.com)

NotesHoeffding.pdf (ubc.ca)
