Machine Learning Overview

Jaolekarjahnavi
6 min read · Feb 9, 2023


Machine Learning is a subset of artificial intelligence that focuses on enabling machines to learn from experience and make predictions based on that experience. The idea of machine learning is that a machine can learn from examples and experience without being explicitly programmed. Therefore, rather than writing the code, you simply feed data to a generic algorithm, and the algorithm or machine creates the logic based on the given data.

Tasks in Machine Learning:

1. Supervised Learning — Supervised learning is the machine learning strategy distinguished by its use of labelled datasets. These datasets are intended to “supervise,” or train, algorithms to classify data or forecast outcomes correctly. Because both inputs and outputs are labelled, the model can measure its accuracy and improve over time.

Under supervised learning we have:

i. Regression — Regression is a supervised learning technique that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are useful for predicting numerical values from several data points, such as sales revenue forecasts for a given company.

Example — Real Estate Price Prediction, Weather Forecasting

Algorithms used for regression —

a. Linear Regression — Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

b. Decision Tree — The general motive of using a Decision Tree is to create a training model that can be used to predict the class or value of target variables by learning decision rules inferred from prior data (training data).

c. Random Forest — The random forest algorithm establishes its outcome based on the predictions of many decision trees. It predicts by taking the average (mean) of the outputs from the individual trees. Increasing the number of trees generally improves the accuracy and stability of the outcome.

d. XGBoost/Gradient Boosting — Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. XGBoost is an optimised implementation of the gradient boosting machine (GBM) algorithm; its working procedure is essentially the same as GBM's.
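To make these concrete, here is a minimal sketch fitting all four regressors above on the same synthetic data (scikit-learn is assumed; XGBoost itself lives in the separate xgboost package, so scikit-learn's GradientBoostingRegressor stands in for it, and the dataset and hyperparameters are illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic regression data: 500 samples, 5 features
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)          # learn from the labelled training data
    score = model.score(X_test, y_test)  # R^2 on unseen test data
    print(f"{name}: R2 = {score:.3f}")
```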

Evaluation Metrics used — MSE, RMSE, MAE, MAPE, R2 Score, Adjusted R2 Score
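A sketch of how these metrics can be computed (scikit-learn ≥ 0.24 is assumed for mean_absolute_percentage_error; adjusted R² has no built-in helper, so it is derived from R²):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = RandomForestRegressor(random_state=42).fit(X_train, y_train).predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Adjusted R^2 penalizes R^2 for the number of predictors (n samples, p features)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.1f}  RMSE={rmse:.1f}  MAE={mae:.1f}  "
      f"MAPE={mape:.3f}  R2={r2:.3f}  adjusted R2={adj_r2:.3f}")
```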

ii. Classification — Classification problems use an algorithm to assign test data to distinct categories, such as separating apples from oranges. In the real world, supervised classification algorithms are used, for example, to file spam into a separate folder from your inbox.

Example — Spam Detection, Credit Card Fraud Detection

Algorithms used for classification —

a. Logistic Regression — A logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression models the probability that each input belongs to a particular category.

b. K-Nearest Neighbors — The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm that can be used for both classification and regression problems. It predicts the label of a new point from the labels of its k closest points in the training data.

c. Support Vector Machines — SVM is based on the idea of finding a hyperplane that best separates the features into different domains. Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes.

d. Naïve Bayes Classifier — A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It is based on Bayes’ Theorem with an assumption of independence among features. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A short sketch comparing several of these classifiers follows.
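As a rough illustration, the classifiers above can be compared on a toy dataset like this (scikit-learn is assumed; the dataset and the choice of k are arbitrary, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")

# Logistic regression outputs class probabilities between 0 and 1
print(classifiers["Logistic Regression"].predict_proba(X_test[:3]))
```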

Note — Decision Tree, Random Forest, and other ensemble techniques can be used for both regression and classification problems (CART: Classification and Regression Trees).

Evaluation Metrics used — Accuracy, Precision, Recall, F1 Score, Confusion Matrix, AUC/ROC Curve
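These metrics might be computed along these lines (a self-contained sketch; the ROC-AUC line assumes a binary problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
# The ROC curve and its AUC use the predicted probability of the positive class
print("ROC AUC  :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```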

2. Unsupervised Learning — In unsupervised learning, machine learning algorithms are used to examine and group unlabeled datasets. These algorithms find hidden patterns in data without human intervention.

Under unsupervised learning we have:

i. Clustering — Clustering is a data mining approach that groups unlabeled data based on how similar or dissimilar the data points are.

Algorithms used for Clustering:

a. K-Means Clustering — It assigns data points to clusters such that the sum of the squared distances between the data points and the cluster’s centroid is minimized. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

b. Hierarchical Clustering — There are two approaches namely Agglomerative and Divisive Hierarchical Clustering.

Agglomerative — A bottom-up approach: each data point starts as its own cluster, and the most similar clusters are merged step by step, so that data points within the same cluster are more similar and those in separate clusters are less similar.

Divisive — The opposite of agglomerative: a top-down approach in which all data points start in one cluster, which is repeatedly split into smaller clusters. A short sketch of both clustering algorithms follows.
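A minimal sketch of both clustering approaches (scikit-learn is assumed; the blob data and the choice of three clusters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic data with 3 natural groups (the generator's labels are kept for later)
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means: minimizes the within-cluster sum of squared distances (inertia)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("K-Means inertia:", round(kmeans.inertia_, 1))

# Agglomerative: bottom-up merging of the closest clusters
agglo = AgglomerativeClustering(n_clusters=3).fit(X)
print("Agglomerative labels (first 10):", agglo.labels_[:10])
```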

ii. Association — Another form of unsupervised learning technique is association, which employs various criteria to discover connections between variables in a given dataset. The “Customers Who Bought This Item Also Bought” recommendation engine and market basket analysis both regularly employ these techniques.
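The article names no specific association algorithm; Apriori is a common choice. A sketch using the third-party mlxtend library (an assumption, not something the original mentions) on a tiny one-hot basket table:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Each row is a shopping basket; columns mark whether an item was bought
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]],
    columns=["bread", "butter", "jam"],
).astype(bool)

# Frequent itemsets appearing in at least 50% of baskets
frequent = apriori(baskets, min_support=0.5, use_colnames=True)
print(frequent)

# Confidence of the rule {bread} -> {butter}, computed directly:
# P(butter | bread) = support(bread & butter) / support(bread)
conf = (baskets["bread"] & baskets["butter"]).mean() / baskets["bread"].mean()
print("confidence(bread -> butter) =", round(conf, 2))
```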


Evaluation Metrics used — Silhouette score, Rand Index, Adjusted Rand Index
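A self-contained sketch of these metrics (scikit-learn ≥ 0.24 is assumed for rand_score; the blob generator's labels serve as the reference clustering):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, rand_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Silhouette: cohesion within a cluster vs. separation from the nearest other cluster
print("Silhouette score  :", silhouette_score(X, labels))

# Rand indices compare the clustering against known reference labels
print("Rand index        :", rand_score(y_true, labels))
print("Adjusted Rand idx :", adjusted_rand_score(y_true, labels))
```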

How do we choose an optimal model?

For choosing an optimal model, we need to understand the concept of the bias-variance trade-off.

Bias refers to the error due to the model’s simplistic assumptions in fitting the data. A high bias means that the model is unable to capture the patterns in the data and this results in under-fitting.

Variance refers to the error due to an overly complex model trying to fit the data too closely. High variance means the model passes through most of the training points, which results in over-fitting the data.
As the model complexity increases, the bias decreases and the variance increases, and vice versa.

Ideally, a machine learning model should have low variance and low bias, but in practice it is rarely possible to have both. Therefore, to achieve a good model that performs well on both the training data and unseen data, a trade-off is made, as the sketch below illustrates.
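One way to see the trade-off is to fit polynomials of increasing degree and compare training and test error (a sketch; the sine data and the chosen degrees are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # under-fit, balanced, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```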

Ways to improve the accuracy of the model and prevent it from overfitting:

i. Hyperparameter tuning — Hyperparameter tuning is finding a set of ideal hyperparameter values for a learning algorithm and applying that tuned algorithm to the dataset. Using that set of hyperparameters maximises the model's performance, minimising a predefined loss function and producing better results with fewer errors.

ii. Cross validation — Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets of the data. Example — GridSearchCV, RandomizedSearchCV, and Bayesian optimization, which combine cross-validation with the hyperparameter search described above (see the sketch below).
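A minimal sketch of 5-fold cross-validation and a grid search (scikit-learn is assumed; the model and parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Plain 5-fold cross-validation: five train/validate splits, five scores
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("CV accuracy per fold:", scores.round(3))

# Grid search: tries every combination in the grid, scoring each with CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,
)
grid.fit(X, y)
print("Best params:", grid.best_params_, "best CV score:", round(grid.best_score_, 3))
```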

iii. Regularization Techniques — In regularization, we reduce the magnitude of the coefficients of the independent variables while keeping the same number of variables. It maintains accuracy as well as the generalization of the model. Example — Lasso, Ridge (a short sketch follows).
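A sketch contrasting Ridge (L2) and Lasso (L1) on the same data (scikit-learn is assumed; the alpha values are arbitrary). Lasso tends to drive some coefficients exactly to zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 10 features, but only 3 actually carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can zero some out entirely

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Lasso zeroed out:", int((lasso.coef_ == 0).sum()), "of 10 coefficients")
```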
