The Ultimate Guide to Machine Learning: Machine Learning Algorithms — Part-4

Simranjeet Singh
30 min read · Mar 4, 2023


Introduction

Machine learning is a branch of artificial intelligence that focuses on the development of algorithms and statistical models that allow computers to learn and improve on their own. Machine learning algorithms are used to automatically identify patterns in data and extract useful insights that can then be used to make predictions or inform decision-making processes.

Each algorithm takes a different approach to learning from data and is suited to different kinds of problems. Check out the previous parts of this series to learn about data collection, data cleaning and preprocessing, feature selection and engineering, and statistical modelling.

Part-1 : Exploratory Data Analysis

Part-2 : Feature Engineering

Part-3 : Statistics and Statistical modelling

Model selection and training, model evaluation, and model deployment are typical stages in the machine learning process that are discussed in this blog.

Fig.1 — Machine Learning Algorithms

In this blog post, we will go over the various types of machine learning algorithms and their applications, as well as the machine learning process. We’ll also show you some popular machine learning algorithms and how they’re used in real-world scenarios.

👉 Before Starting the Blog, Please Subscribe to my YouTube Channel and Follow Me on Instagram 👇
📷 YouTube — https://bit.ly/38gLfTo
📃 Instagram — https://bit.ly/3VbKHWh

👉 Do Donate 💰 or Give me Tip 💵 If you really like my blogs, Because I am from India and not able to get into Medium Partner Program. Click Here to Donate or Tip 💰 — https://bit.ly/3oTHiz3

Table of Contents

  1. Types of Machine Learning Algorithms
  2. Overview of Machine Learning Process
  3. Regression Algorithms
  • Linear Regression
  • Polynomial Regression
  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression
  • Decision Tree Regression
  • Random Forest Regression
  • Gradient Boosting Regression
  • Support Vector Regression

4. Classification Algorithms

  • Logistic Regression
  • k-Nearest Neighbors (k-NN)
  • Naive Bayes
  • Decision Tree Classification
  • Random Forest Classification
  • Gradient Boosting Classification
  • Support Vector Machine (SVM)
  • Artificial Neural Networks (ANNs)

5. Clustering Algorithms

  • k-Means Clustering
  • Hierarchical Clustering
  • DBSCAN
  • Gaussian Mixture Models (GMMs)

6. Tips on how to choose the right algorithm

7. Conclusion

Types of Machine Learning Algorithms

There are four main types of machine learning algorithms: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Fig.2 — Types of Machine Learning Algorithms
  1. Supervised Learning: Supervised learning is a subset of machine learning in which the algorithm is trained on labelled data with known input and output. Based on the input data, the algorithm learns to predict the output. A supervised learning algorithm, for example, can be trained to predict the price of a house based on factors such as location, size, and number of rooms. Linear Regression, Logistic Regression, Decision Trees, Random Forests, and Support Vector Machines are examples of supervised learning algorithms.
  2. Unsupervised Learning: Unsupervised learning is a type of machine learning in which the algorithm is trained on unlabeled data, meaning that the input data is known but the output data is unknown. Without any prior knowledge of what it is looking for, the algorithm learns to find patterns and relationships in the data. An unsupervised learning algorithm, for example, can be used to segment customers based on their purchasing behaviour. Clustering, Principal Component Analysis, and Association Rule Mining are examples of unsupervised learning algorithms.
  3. Semi-Supervised Learning: Semi-supervised learning is a subset of machine learning in which the algorithm is trained using both labelled and unlabeled data. The algorithm learns to predict and find patterns in unlabeled data by using labelled data. A semi-supervised learning algorithm, for example, can be used to classify emails as spam or not spam based on a small number of labelled emails and a large number of unlabeled emails. Label Propagation and Expectation-Maximization are two semi-supervised learning algorithms.
  4. Reinforcement Learning: Reinforcement learning is a type of machine learning in which the algorithm learns through interaction with the environment and feedback in the form of rewards or punishments. The algorithm learns to take actions that maximise rewards while minimising penalties. A reinforcement learning algorithm, for example, can be used to teach a robot to navigate a maze by rewarding it when it finds the correct path and punishing it when it makes a wrong turn. Q-Learning and Policy Gradient are two examples of reinforcement learning algorithms.

Overview of Machine Learning Process

The Machine Learning process is a comprehensive and systematic approach to developing and deploying machine learning models. It entails several steps, each of which is critical to the overall success of the process. Here is a step-by-step breakdown of the Machine Learning process:

1. Data Collection: Collecting relevant data is the first step in any Machine Learning project. This information can come from a variety of sources, including databases, APIs, and web scraping tools. The data should be clean, structured, and in an easily processed format.
Let’s say you want to forecast house prices in a specific neighbourhood. You could gather information on recent home sales, neighbourhood demographics, local school ratings, and crime statistics.

2. Data Preparation: After gathering the data, you must prepare it for analysis. This may entail cleaning the data, removing duplicates, and transforming it into a format suitable for machine learning algorithms. This step is critical for ensuring that the data is accurate and reliable.

For example, you may need to remove missing values, convert categorical variables to numerical ones, and scale the data to ensure that all features are on a similar scale.

Fig.3 — Machine Learning Process Overview

3. Model Selection: The next step is to choose a machine learning model that is appropriate for the problem at hand. This entails comprehending the advantages and disadvantages of various algorithms and selecting one that is best suited to the data and the problem at hand.

You could, for example, use a regression model to forecast house prices, a classification model to detect spam emails, or a clustering model to segment customers based on their purchasing habits.

4. Model Training: After you’ve decided on a model, the next step is to train it on the data. This involves feeding the model examples of input data and their corresponding outputs so that it can learn to predict new, unknown data.

For example, you could use a portion of your data to train a regression model that predicts house prices based on various home and neighbourhood features.

5. Model Evaluation: Following the training of the model, the next step is to evaluate its performance on a new set of data. This is done to ensure that the model generalises well to new, previously unseen data and is not overfitting to the training data.

For example, you could evaluate the performance of your regression model using data that was not used in the training process. To evaluate the model’s accuracy, you could use metrics such as mean squared error or R-squared.

6. Model Tuning: If the model’s performance isn’t up to par, you may need to tweak its parameters or make changes to its architecture. This entails adjusting various hyperparameters, such as learning rate, regularisation strength, and layer count, to improve the model’s performance.

For example, to improve your regression model’s performance on the evaluation data, you could try adjusting the regularisation strength.

7. Model Deployment: The final step is to put the model into production so that it can make predictions on new, previously unseen data. This entails integrating the model into a larger software system and ensuring that it can handle real-time requests in an efficient and reliable manner.

You could, for example, use your regression model as part of a web application that allows users to enter information about a home and receive a real-time predicted price.
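Before moving on, here is a minimal sketch of steps 4 to 6 (training, evaluation, and tuning) with scikit-learn. It uses a synthetic dataset as a stand-in for real house-price data, so the numbers are only illustrative:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for house-price data
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Split the data so the model can be evaluated on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model training
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Model evaluation on the held-out test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))

# Model tuning: search over the regularisation strength
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_)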

I. Regression Algorithms

Regression algorithms are a type of machine learning algorithm that uses input data to predict continuous numerical values. In this section we will look at several popular regression algorithms, including linear regression, polynomial regression, ridge regression, lasso regression, and more.

1. Linear Regression

Linear regression is a simple and widely used regression algorithm based on the assumption of a linear relationship between the input variables (predictors) and the output variable (response). It attempts to fit a straight line through the data points that best represents the data’s trend.

Fig.3 — Linear Regression

The mathematical equation of linear regression is:

y = β0 + β1*x + ε

where:

  • y is the dependent variable (also called response or target)
  • x is the independent variable (also called predictor or feature)
  • β0 is the y-intercept (the value of y when x = 0)
  • β1 is the slope of the regression line (the change in y for a one-unit change in x)
  • ε is the error term (the difference between the actual y value and the predicted y value)

Consider the following example of predicting house prices based on house size. We can estimate the price of the house based on its size using linear regression. Here’s a Python linear regression example using the scikit-learn library:

from sklearn.linear_model import LinearRegression
import numpy as np

# Define the input and output data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1.2, 2.2, 2.8, 4.0, 5.1])

# Train the linear regression model
reg = LinearRegression().fit(X, y)

# Predict the output for a new input
new_X = np.array([6]).reshape(-1, 1)
print(reg.predict(new_X))
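The fitted intercept and slope correspond to β0 and β1 in the equation above, so you can read the learned line directly off the model (a continuation of the snippet above):

# β0 (intercept) and β1 (slope) learned from the data
print("Intercept (β0):", reg.intercept_)
print("Slope (β1):", reg.coef_[0])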

When to use:

  • When the relationship between the independent and dependent variable is linear
  • When you need to make predictions based on continuous data
  • When there is a need for simple and interpretable model

Advantages:

  • Linear regression is simple and easy to interpret.
  • It can be trained quickly and is computationally efficient.

Disadvantages:

  • It assumes a linear relationship between the input variables and the output variable, which may not always be the case.
  • It can be sensitive to outliers.

2. Polynomial Regression

Polynomial regression is a linear regression extension that allows for nonlinear relationships between input variables and output variables. It attempts to fit a polynomial curve through the data points that best represents the data’s trend.

Fig.4 — Polynomial Regression
Fig.2.1 — Equation of Polynomial Regression

Consider the same example of predicting house prices based on house size. We can estimate the price of the house based on its size using polynomial regression. Here is a Python example of polynomial regression using the scikit-learn library:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Define the input and output data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1.2, 2.2, 2.8, 4.0, 5.1])

# Transform the input data into polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Train the polynomial regression model
reg = LinearRegression().fit(X_poly, y)

# Predict the output for a new input
new_X = np.array([6]).reshape(-1, 1)
new_X_poly = poly.transform(new_X)
print(reg.predict(new_X_poly))

When to use:

  • When the relationship between the independent and dependent variable is nonlinear
  • When the data distribution is curved or parabolic
  • When the model needs to capture more complex relationships between the variables

Advantages:

  • Polynomial regression can capture nonlinear relationships between the input variables and the output variable.
  • It can be more accurate than linear regression for certain types of data.

Disadvantages:

  • It can be sensitive to outliers.
  • It may be prone to overfitting if the degree of the polynomial is too high.

3. Ridge Regression

Ridge regression is a regularised version of linear regression that adds an L2 penalty term (the squared magnitude of the coefficients) to the loss function to prevent overfitting. The strength of the penalty is controlled by a hyperparameter that must be tuned.

Fig.5 — Ridge Regression

The mathematical equation of Ridge Regression is:

Fig.3.1 — Ridge Regression Equation

Consider the following example of predicting house prices based on house size. We can estimate the price of the house based on its size using ridge regression. Here’s a Python example of ridge regression using the scikit-learn library:

import numpy as np
from sklearn.linear_model import Ridge

# Generate random data for house sizes and prices
X = np.random.rand(100, 1) * 10
y = 1 + 2*X + np.random.rand(100, 1)

# Fit the ridge regression model
ridge = Ridge(alpha=1)
ridge.fit(X, y)

# Predict the price of a house with size 5
size = np.array([[5]])
price = ridge.predict(size)

print("Predicted price:", price)

When to use:

  • When there is a multicollinearity among the predictors
  • When overfitting is a concern
  • When there is a need for a model that performs well on new data

Advantages of Ridge Regression:

  • Helps to prevent overfitting by shrinking the coefficients
  • Works well even when the number of features is greater than the number of observations
  • Handles multicollinearity among the predictors by spreading weight across correlated features

Disadvantages of Ridge Regression:

  • Requires careful selection of the regularization parameter (alpha) to balance bias and variance
  • Assumes that all predictors are relevant and contribute to the outcome
  • Does not perform feature selection; coefficients are shrunk but never set exactly to zero
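Since choosing alpha is the main practical difficulty noted above, a common approach is to let cross-validation pick it. Here is a minimal sketch using scikit-learn's RidgeCV on synthetic data similar to the earlier example:

import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic house-size vs. price data, as in the earlier example
X = np.random.rand(100, 1) * 10
y = 1 + 2 * X.ravel() + np.random.rand(100)

# Try several regularisation strengths and keep the one with the best cross-validation score
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
ridge_cv.fit(X, y)
print("Selected alpha:", ridge_cv.alpha_)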

4. Lasso Regression

Lasso regression, also known as Least Absolute Shrinkage and Selection Operator, is similar to ridge regression except that it adds an L1 penalty term (the absolute magnitude of the coefficients) to the loss function. This penalty reduces the model’s complexity by shrinking some of the coefficients all the way to zero, effectively performing feature selection. This can be useful when the dataset has many features and only a subset of them are relevant to the outcome variable.

Fig.6- Lasso Regression

Here’s an example of using Lasso regression in Python with scikit-learn:

from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset (load_boston has been removed from recent scikit-learn versions)
housing = fetch_california_housing()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

# Fit a Lasso regression model
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lasso.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

The mathematical equation of Lasso Regression is:

Fig. 4.1 — Lasso Regression

When to use:

  • When there is a need to select only the most important features
  • When there is a multicollinearity among the predictors
  • When overfitting is a concern

Advantages:

  • Can perform feature selection by shrinking coefficients to zero.
  • Works well when there are many irrelevant features in the dataset.

Disadvantages:

  • Can have high variance and be sensitive to outliers.
  • May not work well when there are many relevant features in the dataset.

5. Elastic Net Regression

Elastic Net regression is a Lasso-Ridge regression hybrid that combines the L1 and L2 penalty terms. This is useful when the dataset contains many features, some of which are correlated with one another.

Fig.7 — Elastic Net Regression

Here’s an example of using Elastic Net regression in Python with scikit-learn:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset (load_boston has been removed from recent scikit-learn versions)
housing = fetch_california_housing()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

# Fit an Elastic Net regression model
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)

# Make predictions on the test set
y_pred = enet.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

The objective function of Elastic Net Regression is given as:

Fig.5.1 — Elastic Net Regression

When to use:

  • When there is a need to balance between Lasso and Ridge regression
  • When there are many features but only a few are important
  • When there is a need for a model that performs well on new data

Advantages:

  • Can perform feature selection and handle correlated features.
  • Works well when there are many irrelevant features in the dataset.

Disadvantages:

  • Can have high variance and be sensitive to outliers.
  • May not work well when there are many relevant features in the dataset.

6. Decision Tree Regression

Decision tree regression models make predictions using a tree-like structure built from a series of simple, feature-based decisions. Each internal node of the tree represents a decision on a feature, while each leaf node represents a prediction.

Fig.8- Decision Tree Regression

Here’s an example of using decision tree regression in Python with scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Generate random data
np.random.seed(0)
n_samples = 100
X = np.sort(np.random.rand(n_samples, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel()
y += 0.1 * np.random.randn(n_samples)

# Fit the model
regr = DecisionTreeRegressor(max_depth=2)
regr.fit(X, y)

# Predict
X_test = np.arange(0.0, 1.0, 0.01)[:, np.newaxis]
y_test = regr.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_test, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

The mathematical equation for a decision tree regression model is:

Fig.6.1 — Decision Tree Regression

When to use:

  • When the relationship between the independent and dependent variable is nonlinear
  • When the model needs to capture complex interactions between the predictors
  • When the data has many features, and some are more important than others

Advantages:

  • Easy to understand and interpret.
  • Can handle both categorical and numerical data.
  • Handle nonlinear relationships between features and the target variable

Disadvantages:

  • Can have high variance and be sensitive to small changes in the data.
  • Can easily overfit to the training data.
  • It may not be the best choice when the data contains a large number of features or when the data is noisy or contains missing values.

Watch the full video below on Linear Regression and Decision Trees, with a Python implementation and a complete end-to-end project.

Complete Tutorial on Regression and Trees with Project

7. Random Forest Regression

Random Forest is a popular ensemble learning method for making more accurate predictions by combining multiple decision trees. Multiple decision trees are built on different subsets of the training data in Random Forest Regression, and the final prediction is made by averaging the predictions of all the trees.

Fig.9 — Random Forest Regressor

Here’s an example of using Random Forest Regression in Python with scikit-learn:

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Create a random dataset
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

# Initialize the model
reg = RandomForestRegressor(max_depth=2, random_state=0)

# Train the model
reg.fit(X, y)

# Make a prediction
X_test = [[0, 0, 0, 0]]
print(reg.predict(X_test))

The mathematical equation for the prediction of the random forest regression model can be expressed as:

Fig.7.1 — Random Forest Regression

When to use:

  • When there is a need to handle missing data
  • When there is a need to handle outliers and noise
  • When there is a need for a model that performs well on new data

Advantages:

  • It can handle large datasets with high dimensionality
  • It can handle missing values and maintain accuracy
  • It reduces overfitting compared to decision trees

Disadvantages:

  • It may take longer to train than a single decision tree
  • It can be difficult to interpret the model due to the complexity of multiple trees

8. Gradient Boosting Regression

Gradient Boosting is another ensemble learning method that combines multiple weak models to make a more accurate prediction. Gradient Boosting Regression builds decision trees in a sequential manner, with each tree attempting to correct the errors made by the previous tree.

Fig.10 — Gradient Boosting Regression

Here’s an example of using Gradient Boosting Regression in Python with scikit-learn:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression

# Create a random dataset
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

# Initialize the model
reg = GradientBoostingRegressor(random_state=0)

# Train the model
reg.fit(X, y)

# Make a prediction
X_test = [[0, 0, 0, 0]]
print(reg.predict(X_test))

The mathematical equation can be represented as follows:

Fig.8.1 — Gradient Boosting Regression

When to use:

  • When there is a need to improve the performance of decision tree-based models
  • When the data has many features, and some are more important than others
  • When there is a need for a model that performs well on new data

Advantages:

  • It can handle both regression and classification problems
  • It can handle missing values and maintain accuracy
  • It can be more accurate than Random Forest on some datasets

Disadvantages:

  • It can be sensitive to overfitting
  • It may take longer to train than Random Forest

9. Support Vector Regression

Support Vector Regression (SVR) adapts the Support Vector Machine, a supervised learning algorithm, to regression problems. Instead of maximising the margin between classes, SVR tries to fit a function such that as many data points as possible lie within a margin (the epsilon-tube) around it, while penalising points that fall outside.

Fig.11 — Support Vector Regression

Here’s an example of using Support Vector Regression in Python with scikit-learn:

from sklearn.svm import SVR
from sklearn.datasets import make_regression

# Create a random dataset
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

# Initialize the model
reg = SVR()

# Train the model
reg.fit(X, y)

# Make a prediction
X_test = [[0, 0, 0, 0]]
print(reg.predict(X_test))

The primal optimization problem for linear Support Vector Regression can be written as:

Fig.9.1 — SVM Regression

The kernel trick can be used to map the feature space into a higher-dimensional space where the data may be separable by a hyperplane for non-linear SVR.
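In scikit-learn, the kernel is chosen via the kernel argument (SVR defaults to an RBF kernel). For illustration, here is a small variation of the snippet above comparing a linear and an RBF kernel:

# Linear kernel: fits a hyperplane in the original feature space
reg_linear = SVR(kernel='linear').fit(X, y)

# RBF kernel: implicitly maps the data into a higher-dimensional space
reg_rbf = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)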

The dual problem of the linear SVR is:

Fig.9.2 — Dual Problem of Linear SVR

When to use:

  • When the data has a clear boundary between the classes
  • When there is a need for a model that generalizes well to new data
  • When the data has many features, and some are more important than others

Advantages:

  • It can handle non-linear regression problems
  • It can handle high-dimensional datasets
  • It is less prone to overfitting compared to other algorithms

Disadvantages:

  • It can be sensitive to the choice of kernel function
  • It can be computationally expensive for large datasets

II. Classification Algorithms

Classification algorithms are a type of machine learning algorithm that uses features or attributes to predict the class or category of a given observation. There are various classification algorithms, each with its own set of advantages and disadvantages. In this section, we will go over some of the most commonly used classification algorithms, as well as their benefits, drawbacks, and Python examples.

1. Logistic Regression

Logistic regression is a statistical method that models a binary dependent variable using a logistic function. It is commonly used in classification problems, especially binary classification tasks, and predicts the probability that an input belongs to a particular class.

Fig.12 — Logistic Regression

Here’s an example of using logistic regression in Python with scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

The mathematical equation for Logistic Regression:

Fig.1.1 — Logistic Regression

When to use:

  • When the target variable is binary or categorical
  • When the relationship between the independent and dependent variable is expected to be linear or can be linearized through transformations
  • When the emphasis is on interpreting the coefficients

Advantages:

  • It is simple and easy to implement.
  • It can handle binary and multiclass classification problems.
  • With appropriate feature transformations, it can also capture nonlinear relationships.

Disadvantages:

  • It is prone to overfitting, especially when the feature space is large.
  • It may not perform well when the classes are not well-separated.

2. k-Nearest Neighbors (k-NN)

The k-NN algorithm is a non-parametric algorithm for classification and regression problems. It works by locating the k closest training samples in the feature space and classifying the test sample using their labels.

Fig.13 — K-nearest Neighbors

Here’s an example of using k-NN in Python with scikit-learn:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

There is no single mathematical equation that describes how the k-Nearest Neighbors (k-NN) algorithm works. The k-NN algorithm’s general concept can be summarised as follows:

Fig.2.1 — KNN Algorithm

When to use:

  • When the dataset has a small number of features
  • When the dataset has a large number of training examples
  • When the classes in the dataset are well separated

Advantages:

  • It is simple and easy to implement.
  • It can handle both binary and multiclass classification problems.
  • It can be used for both linear and nonlinear problems.

Disadvantages:

  • It can be sensitive to the choice of k.
  • It can be computationally expensive, especially when the feature space is large.
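One practical way to handle the sensitivity to k is to compare several candidate values with cross-validation. A minimal sketch continuing the iris example above:

from sklearn.model_selection import cross_val_score

# Compare a few values of k using 5-fold cross-validation on the training data
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print("k =", k, "accuracy =", scores.mean())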

3. Naive Bayes Classification

Naive Bayes is a probabilistic algorithm based on Bayes’ theorem. The term “naive” refers to the assumption that the features are independent of one another given the class. It is popular for text classification and spam filtering.

Fig.14- Naive Bayes Classification

Here’s an example of using Naive Bayes in Python with scikit-learn:

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

The mathematical equation of the Naive Bayes classification algorithm can be written as follows:

Fig. 3.1 — Naive Bayes Classification

When to use:

  • When the dataset has a large number of features
  • When the independence assumption holds for the features
  • When there is limited training data

Advantages:

  • It is simple and easy to implement.
  • It can handle both binary and multiclass classification problems.
  • It can be used for both linear and nonlinear problems.

Disadvantages:

  • It assumes that the features are independent of each other, which may not be true in practice.
  • It can be sensitive to the choice of prior probabilities.

4. Decision Tree Classification

Decision trees are a straightforward and interpretable technique for classification and regression problems. They operate by recursively splitting the feature space into smaller regions according to the values of the features.

Fig.15- Decision Tree Classification

Here’s an example of using decision tree classification in Python with scikit-learn:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create the decision tree classifier
clf = DecisionTreeClassifier()

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Predict the classes of the testing data
y_pred = clf.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Because decision trees are a non-parametric technique that does not involve fitting coefficients, there is no single closed-form equation for decision tree classification. Instead, the tree is built by recursively splitting the data according to feature values, with the aim of producing subgroups that are as homogeneous as possible with respect to the target variable.

However, as illustrated below, the prediction function of a decision tree classifier can be described as a series of if-then rules:

Fig.4.1 — Decision Tree Classification

When to use:

  • When the relationship between features and target is non-linear
  • When the decision boundaries are complex and not easily separable by a linear classifier
  • When interpretability is important, as decision trees can be easily visualized and understood by humans.

Advantages:

  • Simple and intuitive
  • Can handle both categorical and numerical data
  • Can capture non-linear relationships between features and target

Disadvantages:

  • Can overfit the data, especially if the tree is deep
  • Can be sensitive to small variations in the training data
  • Can be biased towards features with many categories

5. Random Forest Classification

Random forest is an ensemble learning approach that combines multiple decision trees to increase the robustness and accuracy of the classification model. It works by building a large number of decision trees during training and then outputting the class chosen by the majority of the individual trees (classification) or the mean of their predictions (regression).

Fig.16- Random Forest Classification

Here’s an example of using random forest classification in Python with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[0, 0, 0, 0]]))
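A useful by-product of a fitted random forest is its per-feature importance scores, which indicate how much each feature contributed to the trees' splits. A continuation of the snippet above:

# Importance scores sum to 1; higher values indicate more influential features
print(clf.feature_importances_)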

The mathematical formulation of Random Forest Classification is more involved than that of algorithms such as logistic or linear regression, but the fundamental idea can be expressed as follows:

Fig.5.1 — Random Forest Classification

When to use:

  • When the dataset has a large number of features that are irrelevant or redundant.
  • When the dataset is imbalanced or has missing values.
  • When the decision boundary is complex and nonlinear.

Advantages:

  • Can handle large datasets with high dimensionality.
  • Provides feature importance scores.
  • Reduces overfitting compared to decision trees.

Disadvantages:

  • Computationally expensive.
  • Hard to interpret the model due to the ensemble of decision trees.
  • Can overfit if the number of trees is too large.

6. Gradient Boosting Classification

Gradient boosting is another ensemble learning approach that combines a number of weak learners, typically shallow decision trees, into a strong learner. Unlike random forests, which build trees independently, gradient boosting adds trees sequentially, fitting each new tree to the errors (the gradient of the loss) of the ensemble built so far, thereby iteratively improving model performance.

Fig.17- Gradient Boosting Classification

Here’s an example of using gradient boosting classification in Python with XGBoost:

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

dtrain = xgb.DMatrix(X, label=y)
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2

bst = xgb.train(param, dtrain, num_round)

# DMatrix expects an array-like such as a NumPy array rather than a plain Python list
print(bst.predict(xgb.DMatrix(np.array([[0, 0, 0, 0]]))))

The mathematical equation for Gradient Boosting Classification:

Fig.6.1- Gradient Boosting Classification

When to use:

  • When the dataset is imbalanced or has missing values.
  • When the decision boundary is complex and nonlinear.
  • When the model needs to be optimized for a specific evaluation metric (e.g. AUC, F1 score).

Advantages:

  • Handles heterogeneous features and provides good accuracy.
  • Can be used with different loss functions.
  • Provides feature importance scores.

Disadvantages:

  • Computationally expensive.
  • Can overfit if the number of trees or the depth of the trees is too large.
  • Sensitive to hyperparameters.

7. Support Vector Machine (SVM)

SVM is a popular classification technique that finds the optimal hyperplane to separate the data points into different classes. By using kernel functions, it can handle both linear and nonlinear decision boundaries.

Fig.18 — SVM Classification

Here’s an example of using SVM classification in Python with scikit-learn:

from sklearn import svm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

clf = svm.SVC(kernel='linear', C=1, gamma='auto')
clf.fit(X, y)

print(clf.predict([[0, 0, 0, 0]]))

Here’s the equation for Support Vector Machine (SVM):

Fig.7.1 — SVM Classification

When to use:

  • When the dataset has a small number of features and a large number of samples.
  • When the decision boundary is complex and nonlinear.
  • When the model needs to be optimized for a specific evaluation metric (e.g. AUC, F1 score).

Advantages:

  • Effective for high-dimensional datasets.
  • Provides good generalization ability and robustness against overfitting.
  • Can handle non-linearly separable data using kernel functions.

Disadvantages:

  • Can be sensitive to the choice of kernel function and its parameters.
  • Computationally expensive.
  • Requires careful preprocessing of the data.

8. Artificial Neural Networks (ANNs)

Artificial neural networks (ANNs) are a family of machine learning models inspired by the structure and operation of the human brain. They consist of multiple layers of connected neurons that apply nonlinear transformations to the input data in order to learn complex patterns and relationships.

Fig.19- Artificial Neural Network Classification

Here’s an example of using ANNs for image classification using the popular TensorFlow library:

import tensorflow as tf

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the input data
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the neural network model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5)

# Evaluate the model on the test data
model.evaluate(x_test, y_test)

The mathematical equations for Artificial Neural Networks (ANNs) can be quite complex and can vary depending on the specific architecture and type of neural network being used. However, the basic equation for a feedforward neural network with one hidden layer can be expressed as follows:

Fig.8.1 — ANN for Classification

When to use:

  1. Complex problems with large volumes of data: ANNs are well suited to problems with many features or input variables because they can learn complicated patterns and relationships in large datasets.
  2. Non-linear problems: ANNs are especially helpful for nonlinear problems such as image or speech recognition, where the relationships between inputs and outputs are intricate and hard for conventional algorithms to capture.
  3. Time-series analysis: ANNs can recognise and forecast patterns over time, so they can be used to predict future values or spot irregularities in time-series data.

Advantages:

  • ANNs can learn complex nonlinear relationships between input and output variables, making them powerful tools for a wide range of applications.
  • They can handle large and high-dimensional datasets with many features and can perform well even when the input data is noisy or incomplete.
  • ANNs are also able to perform feature selection and extraction automatically, reducing the need for manual feature engineering.

Disadvantages:

  • ANNs can be computationally intensive, requiring significant computing resources and time to train and optimize the models.
  • They are also prone to overfitting, especially when the models are too complex or when the dataset is small.
  • The lack of interpretability and transparency of ANNs can also be a limitation, making it difficult to understand how the model arrives at its predictions.

III. Clustering Algorithms

Clustering is an unsupervised machine learning method used to group related data points together. It finds patterns and structures in the data without labels or predefined categories. Applications of clustering algorithms include customer segmentation, image segmentation, anomaly detection, and many more.

1. k-Means Clustering

k-Means is a common unsupervised clustering algorithm that divides a dataset into k clusters according to how similar the data points are to the cluster centres. The algorithm’s objective is to minimise the sum of squared distances between every data point and its closest cluster centre.

Fig.20-K-Means Clustering

Example: Let’s consider a dataset of customer transactions, and we want to group similar customers based on their purchasing behavior. We can use k-means clustering to segment the customers into k clusters based on their purchase history.

Python Code:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer-transaction features
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
labels = kmeans.predict(X)
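The quantity being minimised, the sum of squared distances from each sample to its nearest cluster centre, is exposed as the inertia_ attribute, which is also commonly used to choose k with the elbow method:

# Total within-cluster sum of squared distances for the fitted model
print(kmeans.inertia_)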

The mathematical equation for k-means clustering can be written as follows:

Fig.1.1 — K-Means Clustering

The goal is to minimise the sum of squared distances between each observation and its assigned cluster centre. The algorithm iteratively assigns each observation to the nearest cluster centre and then updates each centre to the mean of the observations assigned to it. This process repeats until the cluster assignments stop changing or a maximum number of iterations is reached.

When to Use:

  • k-Means is useful when the number of clusters k is known beforehand, and the data is evenly distributed.

Advantages:

  • Simple and easy to implement.
  • Works well for large datasets.

Disadvantages:

  • Sensitive to initial cluster centers.
  • Cannot handle outliers effectively.

2. Hierarchical Clustering

Hierarchical clustering is another popular unsupervised clustering algorithm that builds a hierarchy of clusters by iteratively merging or splitting clusters based on the similarity of data points.

Fig.21 — Hierarchical clustering

Example: Let’s consider a dataset of animal features, and we want to group similar animals based on their features. We can use hierarchical clustering to create a dendrogram that shows the clusters at different levels of the hierarchy.

Python code:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)  # stand-in for the animal-features data
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix, truncate_mode='level', p=3)
plt.show()

Hierarchical clustering is a family of algorithms rather than a single equation: clusters are built by repeatedly merging (agglomerative) or splitting (divisive) groups of points according to a linkage criterion. The resulting hierarchy is usually visualized as a dendrogram, a branching diagram showing the relationships between the clusters.

When to Use:

  • Hierarchical clustering is useful when the number of clusters k is not known beforehand, and the data has a hierarchical structure.

Advantages:

  • Can handle any type of data.
  • Provides a visualization of the hierarchy of clusters.

Disadvantages:

  • Computationally expensive for large datasets.
  • Not suitable for datasets with varying densities.

3. DBSCAN

DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other in the feature space and separates outliers.

Fig.22 — DBSCAN Algorithm

There is no direct mathematical equation for DBSCAN. Instead, the algorithm is defined by two parameters:

  1. Epsilon (eps) — the radius within which the algorithm searches for nearby points, also known as the “neighborhood” of a data point.
  2. Minimum Points (minPts) — the minimum number of points that must lie within the eps radius for a data point to be considered a core point.

Based on these two parameters, every point is classified as one of the following:

  • Core points: points that have at least minPts points within a radius of eps (i.e., within their eps-neighborhood). These points form the heart of a cluster.
  • Border points: points that have fewer than minPts points within a radius of eps but lie in the eps-neighborhood of a core point. They belong to a cluster but are less significant than core points.
  • Noise points: points that are neither core points nor within the eps-neighborhood of any core point. They do not belong to any cluster and are treated as outliers.

Example: Let’s consider a dataset of customer transactions, and we want to group customers based on their purchase history. We can use DBSCAN to identify high-value customers who make frequent purchases.

Python Code:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer-transaction features
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
labels = dbscan.labels_
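DBSCAN marks noise points with the label -1, so clusters and outliers can be counted directly from the fitted labels. A continuation of the snippet above:

import numpy as np

# Number of clusters found (excluding the noise label) and number of noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("Clusters:", n_clusters, "Noise points:", n_noise)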

When to Use:

  • DBSCAN is useful when the data has varying densities, and the number of clusters is not known beforehand.

Advantages:

  • Can handle datasets with varying densities.
  • Can identify outliers effectively.

Disadvantages:

  • Sensitive to the choice of hyperparameters.
  • Cannot handle datasets with large feature spaces.

4. Gaussian Mixture Models (GMMs)

GMMs are probabilistic clustering algorithms that model the data distribution as a mixture of Gaussian distributions and estimate the parameters of the Gaussians using the Expectation-Maximization (EM) algorithm.

Fig.23-Gaussians Mixture Model

Example: Let’s consider a dataset of customer transactions, and we want to group customers based on their purchase history. We can use GMMs to model the purchase behavior as a mixture of Gaussians and estimate the parameters of the Gaussians.

Python Code:

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer-transaction features
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)
labels = gmm.predict(X)
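Because GMMs are probabilistic, you can also obtain soft assignments, i.e. the probability of each data point belonging to each of the three components, rather than only the hard labels above:

# Each row gives one data point's membership probabilities across the components
probs = gmm.predict_proba(X)
print(probs[:5])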

The mathematical equation of Gaussian Mixture Models can be expressed as:

Fig.4.1 — GMM Model Equation

When to use:

  • GMMs are suitable for clustering data that is not linearly separable, has overlapping clusters, and follows a mixture distribution.
  • They are commonly used in image and speech recognition, as well as in gene expression analysis.

Advantages:

  • Can model complex cluster shapes and accommodate overlapping clusters.
  • Provides probabilities of each data point belonging to each cluster.
  • Can handle missing data points.

Disadvantages:

  • Computationally expensive, especially when the number of dimensions and data points are large.
  • May converge to a local minimum instead of the global minimum, leading to suboptimal results.
  • Requires careful selection of the number of clusters and initialization of parameters.

Tips on How to choose the right algorithm

Choosing the right algorithm for a given problem can be challenging. Here are some tips to help you:

Fig.24- Best Algorithm
  1. Understand the problem: Before selecting an algorithm, you need to understand the problem you are trying to solve. You need to know the type of problem, whether it’s a classification, regression or clustering problem.
  2. Evaluate the data: You need to evaluate the data you have, the size of the data, the number of features and the distribution of the data. Some algorithms work better with large datasets, while others work well with smaller datasets.
  3. Consider the assumptions: Some algorithms have assumptions about the data, such as linearity or normality. You need to consider these assumptions before selecting an algorithm.
  4. Consider the complexity: Some algorithms are more complex than others, and they may take longer to train and test. You need to consider the complexity of the algorithm and the resources you have available.
  5. Experiment: You need to experiment with different algorithms and evaluate their performance. You can use cross-validation techniques to evaluate the performance of different algorithms.
  6. Choose the best algorithm: After evaluating the performance of different algorithms, you need to choose the best algorithm that works well for your problem.

Conclusion

In conclusion, machine learning algorithms are a crucial tool for solving challenging problems across many industries. Because every algorithm has its own strengths and limitations, it is essential to choose the right one for a given task. Here are some key takeaways:

  1. Understanding the problem domain is essential for choosing the right algorithm. Think about the kind of data you are working with and the kind of output you want to produce.
  2. Try out various algorithms to see which one solves your problem best. This usually involves tuning each algorithm’s hyperparameters and comparing the results.
  3. When choosing an algorithm, weigh the trade-offs between accuracy, speed, interpretability, and scalability. Depending on the problem, some of these may matter more than others.
  4. Follow the latest advances in machine learning research and engineering. Keeping up with new algorithms and methodologies can help you select the best approach for your problem.

For Coding and Examples, Checkout my GitHub Profile.

If you like the article and would like to support me make sure to:

👏 Clap for the story (100 Claps) and follow me 👉🏻Simranjeet Singh

📑 View more content on my Medium Profile

🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter | Telegram

🚀 Help me in reaching to a wider audience by sharing my content with your friends and colleagues.

🎓 If you want to start a career in Data Science and Artificial Intelligence and you do not know how? I offer data science and AI mentoring sessions and long-term career guidance.

📅 Consultation or Career Guidance

📅 1:1 Mentorship — About Python, Data Science, and Machine Learning

Book your Appointment
