Machine Learning

Sparsh Sihotiya
Electronics Club IITK
Oct 4, 2020

What is Machine Learning?

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. It is one of the most exciting technologies around: as the name suggests, the computer learns from data rather than following hand-written rules.

For example, a machine learning algorithm built to recognize handwritten digits could also be used to classify emails as spam or not-spam with only minor changes to the code.

Some Machine Learning Methods

Machine learning algorithms are often categorized as supervised or unsupervised.

Supervised Learning

In supervised learning, the model is trained on a labelled dataset: one that contains both the input parameters and the corresponding output values.

Types of Supervised Learning:

1. Classification: a supervised learning task where the output takes defined, discrete labels. For example, in Figure A above, the output “Purchased” has two labels, 0 and 1; 1 means the customer will purchase, and 0 means the customer won’t.

Classification can be binary or multi-class. In binary classification, the model predicts one of two outcomes (0 or 1, yes or no), while in multi-class classification it chooses among more than two classes.

Example: Gmail classifies mail into different classes such as social, promotions, updates, forums, and spam.

2. Regression: a supervised learning task where the output is a continuous value.
For example, in Figure B above, the output “Wind Speed” does not take discrete values but varies continuously over a range. The goal is to predict a value as close to the actual output as our model can, and evaluation is then done by calculating an error value. The smaller the error, the greater the accuracy of our regression model.
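For instance, a common error measure is the mean squared error. A tiny illustrative sketch with made-up numbers:

```python
import numpy as np

# Hypothetical actual and predicted wind speeds
y_actual = np.array([10.0, 12.5, 9.8])
y_predicted = np.array([9.5, 13.0, 10.1])

# Mean squared error: average of the squared residuals (smaller is better)
mse = np.mean((y_actual - y_predicted) ** 2)
print("Mean squared error:", mse)
```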

Examples of Supervised Learning Algorithms:

1. Linear Regression

2. Nearest Neighbor

3. Gaussian Naive Bayes

4. Decision Trees

5. Support Vector Machine (SVM)

6. Random Forest

Unsupervised Learning

In unsupervised learning we do not give the model a target; in other words, we feed it unlabelled data during training, so the training set contains only input parameter values. The model has to discover structure in the data on its own. The dataset in Figure A is mall data containing information about clients who hold a membership card, so the mall has complete information about each customer and every purchase they make. Using this data and unsupervised learning techniques, the mall can group clients based on the parameters we ask the algorithm to consider.

The training data we feed in is:

  1. Unstructured data: may contain noise, missing values, or unknown values.
  2. Unlabelled data: contains values only for the input parameters; there is no target (output) value. Such data is easy to collect.

Types of Unsupervised Learning:

  1. Clustering: this technique groups data based on patterns that the model finds on its own. For example, in the figure we are not given an output value, so clustering is used to group clients based on the input parameters in our data.
  2. Association: a rule-based technique that finds useful relations between the parameters of a large dataset. For example, shopping stores use association algorithms to find relationships between the sale of one product and the sales of others, based on customer behaviour. Once trained well, such models can be used to increase sales through targeted offers, as in the simplified sketch below.
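To make the idea concrete, here is a minimal, simplified sketch (not a full Apriori implementation) that counts how often hypothetical product pairs are bought together and reports frequently co-occurring pairs:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cookies"},
    {"bread", "butter", "cookies"},
]

# Count how often each pair of products appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs present in at least half of the baskets suggest an association rule
for pair, count in pair_counts.most_common():
    support = count / len(baskets)
    if support >= 0.5:
        print(pair, "support:", support)
```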

Some algorithms which come under Unsupervised Learning:
1. K-Means Clustering
2. DBSCAN — Density-Based Spatial Clustering of Applications with Noise
3. BIRCH — Balanced Iterative Reducing and Clustering using Hierarchies
4. Hierarchical Clustering
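To make clustering concrete, here is a minimal sketch using scikit-learn's KMeans on hypothetical mall-customer data (the two features, annual income and spending score, are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual income (k$), spending score (1-100)]
X = np.array([
    [15, 39], [16, 81], [17, 6], [18, 77],
    [60, 50], [62, 42], [70, 29], [85, 75],
    [88, 17], [90, 80],
])

# Group customers into 3 clusters using only the input features (no labels)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster assignments:", labels)
print("Cluster centres:\n", kmeans.cluster_centers_)
```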

Semi-supervised Learning

As the name suggests, this approach sits between the supervised and unsupervised techniques. We use it when only a small portion of the data is labelled and the rest, often a large portion, is unlabelled. We can use unsupervised techniques to predict labels for the unlabelled portion and then feed those labels into supervised techniques. For example, this technique is commonly applied to image datasets, where usually not all images are labelled.

Reinforcement Learning

In this technique, the model improves its performance using a reward-feedback mechanism to learn a behaviour or pattern. These algorithms are specific to a particular problem and are effective when the model needs to update itself continuously. Examples include Google's self-driving car and AlphaGo, where a bot competes against humans and even against itself to become a better and better Go player.

Now, let’s get started with our first machine learning model.

Linear Regression

Linear Regression is a machine learning algorithm that performs a regression task: it predicts a dependent variable (y) from the independent variable(s) (x). In other words, it finds a linear relationship between x (input) and y (output).

[Figure: Linear Regression]

In the figure above, X (the input) is work experience and Y (the output) is the person’s salary. The regression line (the black line here) is the best-fit line for our model.

Hypothesis function for Linear Regression:

h(x) = θ₁ + θ₂·x

When training the model, we are given:
x: input training data
y: labels for the data

During training, the model fits the best line for predicting the value of y for a given value of x. It finds the best regression line by choosing the most appropriate θ₁ and θ₂ values:
θ₁: the intercept
θ₂: the coefficient of x

Once we find the most appropriate θ₁ and θ₂ values, we get the best-fit line. When we then use the model for prediction, it predicts the value of y for a new input x.

How do we update the θ₁ and θ₂ values to get the best-fit line?

The line for which the error (residual) between the observed values and the predicted values is minimum is called the regression line. These errors (residuals) can be visualized as vertical lines from each observed data point to the regression line.

To define and measure the error of our model, we define the cost function as the sum of the squared residuals (averaged and halved for convenience):

J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

where h(x) is the hypothesis and m is the total number of training examples.

Our main objective is to find the parameters θᵢ for which the cost function is minimum. We will use Gradient Descent to do this.

Gradient Descent is an optimization algorithm that is used in many machine learning algorithms. It iteratively modifies the parameters of the model in order to minimize the cost function.
Below are the steps to implement Gradient Descent:

1. We first initialize the model parameters with some random values. This step is called random initialization.

2. Next, we measure how the cost function changes as its parameters change, by computing the partial derivatives of the cost function w.r.t. the parameters θ₀, θ₁, …, θₙ. The partial derivative of the cost function w.r.t. any parameter θⱼ is

∂J(θ)/∂θⱼ = (1/m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

We can compute the partial derivatives for all parameters at once, in vectorized form, as

∇J(θ) = (1/m) · Xᵀ(Xθ − y)

where h(x) (the hypothesis) is

h(x) = θᵀx = θ₀ + θ₁x₁ + … + θₙxₙ

3. After computing the derivatives, we update each parameter as

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ

where α is the learning rate. All the parameters can be updated at once using

θ := θ − α · ∇J(θ)

Steps 2 and 3 are repeated until the cost function converges to its minimum value. If α is too small, the cost function takes a long time to converge; if α is too large, gradient descent may overshoot the minimum and fail to converge altogether. So we need to choose the value of α carefully.

Implementing Linear Regression from Scratch
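The original post embedded this code as a gist; here is a minimal sketch of the same ideas (random initialization, vectorized gradient, repeated updates) using NumPy on synthetic data:

```python
import numpy as np

# Synthetic data: y = 4 + 3x + noise
rng = np.random.default_rng(0)
x = 2 * rng.random((100, 1))
y = 4 + 3 * x + rng.normal(0, 0.5, (100, 1))

X = np.hstack([np.ones((100, 1)), x])  # prepend a column of 1s for the intercept
theta = rng.random((2, 1))             # step 1: random initialization
alpha, m = 0.1, len(y)                 # learning rate and number of examples

for _ in range(1000):
    gradients = (1 / m) * X.T @ (X @ theta - y)  # step 2: partial derivatives of J(theta)
    theta = theta - alpha * gradients            # step 3: simultaneous update

print("Learned parameters (intercept, slope):", theta.ravel())  # should be near (4, 3)
```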

Linear Regression Using Scikit-Learn:
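A corresponding sketch with scikit-learn, on small made-up experience-vs-salary data (the numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: years of work experience vs. salary
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30000, 35000, 42000, 48000, 53000, 60000])

model = LinearRegression()
model.fit(X, y)  # fits intercept and coefficient internally

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
print("Predicted salary for 7 years:", model.predict([[7]]))
```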

We have now covered the concepts of linear regression and gradient descent, and implemented the model both from scratch and with the scikit-learn library.

Next, we will be implementing Logistic Regression.

Logistic Regression

Logistic regression is a supervised classification algorithm. In a classification problem, the target variable y can take only discrete values for a given set of features X.
The model maps its output through the sigmoid function.

On the basis of the number of categories, logistic regression can be classified as:
1. Binomial: as the name suggests, the target variable y can have only two possible values, “0” or “1”, which may represent “pass” vs “fail”, “dead” vs “alive”, etc.
2. Multinomial: the target variable can have three or more possible types that are not ordered, like “disease A” vs “disease B” vs “disease C”.
3. Ordinal: this type deals with target variables whose categories are ordered. For example, a test score can be categorized as “very poor”, “poor”, “good”, or “very good”, and each category can be given a score like 0, 1, 2, 3.

Hypothesis and Cost Function
By now we should have a basic understanding of how logistic regression can be used for classification. Here we will define the hypothesis and the cost function.

This model can be represented by the equation

z = θᵀx

We then apply the sigmoid function to this linear output, where the sigmoid function is

g(z) = 1 / (1 + e⁻ᶻ)

The hypothesis for logistic regression then becomes

h(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

If the weighted sum of inputs is greater than zero, the predicted class is 1 and vice-versa. So the decision boundary separating both the classes can be found by setting the weighted sum of inputs to 0.

Cost Function

Like Linear Regression, we will define a cost function for our model and the objective will be to minimize the cost.

The cost function for a single training example can be given by:

Cost(h(x), y) = −log(h(x))      if y = 1
Cost(h(x), y) = −log(1 − h(x))  if y = 0

Cost function intuition

If the actual class is 1 and the model predicts 0, the model should be heavily penalized, and vice-versa. For the plot of −log(h(x)), as h(x) approaches 1 the cost goes to 0, and as h(x) nears 0 the cost goes to infinity (that is, we penalize the model heavily). Similarly, for the plot of −log(1 − h(x)), when the actual value is 0 and the model predicts 0 the cost is 0, and the cost becomes infinite as h(x) approaches 1.

On combining both equations:

Cost(h(x), y) = −y · log(h(x)) − (1 − y) · log(1 − h(x))

The cost over all the training examples, denoted J(θ), is computed by taking the average of the per-example costs:

J(θ) = −(1/m) · Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ · log(h(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) · log(1 − h(x⁽ⁱ⁾)) ]

where m is the number of training samples.

We will use gradient descent to minimize the cost function. The gradient w.r.t. any parameter θⱼ is

∂J(θ)/∂θⱼ = (1/m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

The equation has the same form as in linear regression; only h(x) differs between the two cases.

Implementing Logistic Regression

To implement logistic regression we need a dataset; the dataset can be downloaded from here.

Let’s build the logistic regression model to predict whether a user will purchase the product or not.
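A minimal sketch with scikit-learn is below. The file name and the column names ("Age", "EstimatedSalary", "Purchased") are assumptions for illustration; adjust them to whatever dataset you download:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed file and column names, for illustration only
data = pd.read_csv("Social_Network_Ads.csv")
X = data[["Age", "EstimatedSalary"]].values
y = data["Purchased"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling the features helps the optimizer converge
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```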

Support Vector Machines

SVM stands for Support Vector Machines. It is a classification algorithm that finds the hyperplane separating the classes while maximizing the margin, i.e. the distance between the hyperplane and the nearest data points of each class. A hyperplane is an (n−1)-dimensional space inside an n-dimensional space: for a 3-D space the hyperplane is a plane, while in 2-D space it is a line.
In order to use SVM for classification into more than 2 classes, the one-vs-all scheme can be used. One advantage of SVM over KNN is that it finds the maximum-margin boundary between the classes and is therefore often highly accurate; however, it can be slow on large datasets.
SVM is of two types, linear and non-linear; the categories are based on the type of decision boundary.

Linear SVM
Non-Linear SVM

Why do we use SVMs in Machine Learning?

The main reason for using SVM is that it can perform both regression and classification tasks, on linear as well as non-linear data.
It can also be used for handwriting recognition, face detection, email classification, and many other applications.
Here are some of the pros and cons of using SVMs.

Pros

1. Effective on datasets with multiple features, like financial or medical data.

2. Effective in cases where the number of features is greater than the number of data points.

3. Uses a subset of training points in the decision function (called support vectors), which makes it memory efficient.

4. Different kernel functions can be specified for the decision function. You can use common kernels, but it’s also possible to specify custom kernels.

Cons

1. If the number of features is much larger than the number of data points, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.

2. SVMs don’t directly provide probability estimates. Those are calculated using an expensive five-fold cross-validation.

3. SVMs work best on small sample sets because of their high training time.

Implementation of SVM

Here are the steps regularly found in machine learning projects when implementing SVM:
1. First of all, import the dataset.
2. Then, explore the data to figure out what it looks like.
3. After that, pre-process the data.
4. Then, split the data into attributes and labels.
5. Next, divide the data into training and testing sets.
6. After that, train the SVM algorithm.
7. Then, use the model to make some predictions.
8. Finally, evaluate the results of the algorithm.

Code for implementation of Linear SVM:
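The original gist is not reproduced here; a minimal sketch with scikit-learn's SVC and a linear kernel, using the bundled iris dataset for illustration:

```python
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small built-in dataset for illustration
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Linear kernel: the decision boundaries are hyperplanes
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```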

Code for implementation of Non-Linear SVM:
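And a corresponding sketch for a non-linear decision boundary, swapping in the RBF (Gaussian) kernel on data that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a straight line
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# RBF kernel lets the SVM learn a curved decision boundary
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```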

K-Nearest Neighbours (KNN)

KNN is a supervised classification algorithm: it predicts the class to which a particular data point belongs by looking at the labelled data points nearest to it. The value of K is chosen by the user; K is the number of nearest data points taken into consideration when deciding the class of a new point.

K-Nearest Neighbours works in the following way:

1. First of all, load the data

2. Then, initialize the value of k

3. After that, in order to get the predicted class, iterate from 1 to the total number of training data points.

(a) First of all, calculate the distance between test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other metrics that can be used are Chebyshev, cosine, etc.

(b) After that, sort the calculated distances in ascending order based on distance values.

(c) Then, get the top k rows from the sorted array

(d) After that, get the most frequent class of these rows

(e) Finally, return the predicted class.
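A minimal from-scratch sketch of these steps (Euclidean distances, sort, take the top k, majority vote), on toy data for illustration:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # (a) Euclidean distance from the test point to every training point
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # (b), (c) sort by distance and keep the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # (d), (e) return the most frequent class among them
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2, 1])))  # expected: 0
print(knn_predict(X_train, y_train, np.array([8, 7])))  # expected: 1
```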

The above algorithm can be implemented easily using Scikit-Learn, which provides sklearn.neighbors.KNeighborsClassifier for exactly this purpose.

Few pros and cons of kNN

Pros

1. It is simple and easy to interpret.

2. It makes no assumptions about the underlying data distribution, so it can also be used for non-linear tasks.

3. This algorithm works well on classification with multiple classes.

4. This algorithm works well on both the classification as well as regression tasks.

Cons

1. This algorithm becomes very slow as the number of data points increases because the model needs to store all data points.

2. It is also not memory efficient.

3. It is also sensitive to outliers.

Implementation of kNN
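With scikit-learn the whole algorithm is a few lines; a sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# k = 5 nearest neighbours; Euclidean distance is the default metric
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
```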

Decision Trees and Random Forests

Decision Trees can be used for both classification and regression tasks. They are easy to visualize and work by taking a decision at each node of the tree until a leaf is reached. One disadvantage of decision trees is that they are prone to overfitting and hence have to be used wisely. To overcome the bias error and the overfitting problem, Random Forests can be used.

Random Forests are essentially a bunch of decision trees used together. They help avoid overfitting because the result is decided by a majority of the ‘votes’ of the individual decision trees. Random Forests keep their decision trees uncorrelated through bootstrapping and random feature selection.

Random Forests can run their decision trees in parallel, so training time is not badly affected by the number of trees. They can also handle different feature types together and do not require normalization or scaling. However, for high-dimensional data, faster algorithms like Naive Bayes may be preferable.
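A minimal sketch of a random forest classifier with scikit-learn, again on the iris data for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 100 bootstrapped trees; each split considers a random subset of features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```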

Pros and Cons of Decision Trees

Pros

1. This algorithm usually does not require features to be normalized or scaled.

2. It is also suitable to work on a mixture of feature data types (continuous, categorical, binary).

3. It is easy to interpret.

Cons

1. This algorithm is very prone to overfitting and needs to be ensembled in order to generalize well.

Pros and Cons of Random Forests

Pros

1. It is a highly powerful, highly accurate model for many different real problems.

2. Like decision trees, it does not require normalization or scaling.

3. Like decision trees, it can handle different feature types together.

4. It runs its trees in parallel, so performance does not suffer as trees are added.

Cons

1. This model is not a good choice for high-dimensional datasets (e.g. text classification) compared to fast linear models (e.g. Naive Bayes).

That’s all for this article.

Happy Learning :)
