Introduction to Machine Learning and Deep Dive into Linear Regression

Vardaan Bajaj
Published in Analytics Vidhya · 10 min read · Jun 15, 2020

Contents

In this post, we’ll be going through:

1. Introduction to Machine Learning

2. Different types of Machine Learning problems

3. Cost function for Linear Regression

4. Gradient Descent Algorithm

5. Hands-on Multivariate Linear Regression on a Kaggle dataset

What is Machine Learning?

Remember the first code you wrote on a computer? Most of us start with a ‘Hello World’ program, but think of the first time you wrote something that made you feel proud of yourself. For most computer science enthusiasts, that moment was probably solving an entire programming assignment on your own or cracking a competitive programming question. For the majority of us, these were the initial steps that pulled us into the world of computer science.

Since computers perform deterministic operations, all these programs have the same template. We supply the logic (rules) for the program and the input (data) and the program gives us the desired output. But for machine learning, things are different. Machine learning is a field of computer science where we supply the input (data) and the output and leave it to the machine learning algorithms to generate the rules for the input-output mapping. Once these rules are generated, we can use them to make predictions for any given input data related to this context.

In this series of blogs, we’ll be exploring Machine Learning in depth by looking at all the concepts, some of the mathematics behind it to gain insights into working of machine learning algorithms and at the end of each post, we’ll be solving a real-life problem (preferably from Kaggle) using the concepts mentioned.

There are 2 ways we can make a computer learn the things we want it to: either spoon-feed it (supervised learning) or leave it to explore on its own (unsupervised learning).

Supervised Learning

Supervised learning is analogous to how humans learn through examples. In supervised learning, we provide our machine a labelled data set, i.e. we provide both the input and the output, and by applying one of the various supervised machine learning algorithms (models), we expect our model to learn the mapping between input and output. Supervised learning algorithms are divided into 2 broad categories:

(i) Classification Algorithms

(ii) Regression Algorithms

In classification problems, the output is divided into a fixed number of classes, and the job of the classification algorithm is to predict, for the data fed into it, which class it most likely belongs to (along with the probability of each class).

In regression problems, the output is a continuous numerical value, and the job of the regression algorithm is to predict the numerical value closest to the true one for the data fed to it.

Image source: https://www.javatpoint.com/regression-vs-classification-in-machine-learning

Supervised learning has found applications in healthcare, natural language processing and stock price prediction systems, to name a few.
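To make the classification vs regression distinction concrete, here is a minimal scikit-learn sketch on made-up toy data (not from this post): a classifier predicts one of a fixed set of classes, while a regressor predicts a continuous number.

from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]                         # a single input feature

# Classification: the output belongs to a fixed set of classes (0 or 1 here)
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(clf.predict([[2.5]]))                      # a class label, e.g. 0 or 1

# Regression: the output is a continuous numerical value
reg = LinearRegression().fit(X, [1.1, 1.9, 3.2, 3.9])
print(reg.predict([[2.5]]))                      # a real value, roughly 2.5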

Unsupervised Learning

In unsupervised learning, we feed our algorithm with unlabelled data and leave it to the algorithm to find patterns in the data. Unsupervised learning problems are quite overwhelming to work on since there is no simple way to evaluate their performance. Unsupervised learning algorithms are divided into 2 broad categories:

(i) Clustering Algorithms

(ii) Association Algorithms

Clustering deals with finding a structure or pattern in a collection of unlabelled data. Clustering algorithms process the data and find natural clusters (groups) if they exist in it. We can also tell the algorithm how many clusters to look for.

Image source: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
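As a hedged illustration of clustering (using made-up toy points, not from this post), scikit-learn’s KMeans lets us specify the number of clusters to look for:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one natural group
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])   # another natural group

kmeans = KMeans(n_clusters=2, random_state=0).fit(points)  # ask for 2 clusters
print(kmeans.labels_)                                       # cluster assigned to each point
print(kmeans.cluster_centers_)                              # centres of the two groups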

Association is concerned with establishing associations among data objects in large databases. One example is Amazon’s recommendation system, which, after you buy a certain item, suggests other products you might want to buy.

In this post and the upcoming posts, we’ll dive into supervised machine learning algorithms and problems. After getting a good look into supervised learning problems, we’ll have a look into unsupervised learning problems as well.

One of the simplest supervised machine learning techniques is Linear Regression. It can be easily implemented in Python using the scikit-learn library with the assistance of pandas and numpy, but first let us understand how the algorithm works.

Linear Regression

Consider a straight line y=Wx+b. In coordinate geometry terms, x and y are variables, W is the slope of this line and b is the intercept on the y axis. We supply various x values to this equation and, for given values of W and b, the corresponding y values trace out a straight line on the graph. In machine learning terms, x is the ‘input data’, y is the ‘output’, and W and b are the parameters that we want the linear regression algorithm to learn as the ‘rules’, so that it can generalize to new input data.
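As a tiny sketch (with made-up parameter values, not learned ones), this is all the model does at prediction time:

import numpy as np

W, b = 2.0, 0.5                   # pretend these values were already learned
x = np.array([1.0, 2.0, 3.0])     # input data
y_hat = W * x + b                 # the model's predictions
print(y_hat)                      # [2.5 4.5 6.5]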

Division of Training and Test Sets

We need a good amount of data for our learning algorithm to learn the x -> y mapping (rules) and arrive at optimum values of W and b. For datasets with fewer than 10,000 records, an 80–20 train-test split is preferred, i.e. 80% of the data is used to train the model and the remaining 20% is used to evaluate its performance. Additionally, we can use a development (dev) set to tune the model’s hyperparameters, but since the problem we’ll be solving later in this post uses a small dataset, we’ll skip the dev set for now.

How does learning take place?

For a good linear regression model, we want W and b to take optimum values, i.e. values learned from the input data that allow the model to generalize to new data. So the main question is: how do we obtain these optimum values? For that, let’s have a look at the cost function.

Cost function for Linear Regression

Let’s denote the predicted values of the linear regression model by y_hat. The main idea behind the cost function is to obtain W and b such that y_hat is close to y for our training examples, represented by (x, y) pairs. In other words, we are going to solve a minimization problem over W and b, and our cost function will enable us to do so. Let’s denote our cost function by J. There are tons of cost functions out there, but the one we’ll use for linear regression is very intuitive.

Image source: en.wikipedia.org

The above image is an example of Linear Regression (with one variable, i.e. y=Wx+b) in action, where the blue dots are the training examples and the red line is the output of a linear regression model. This line passes roughly through the middle of the blue points. In other words, of all the possible lines in this 2D plane, it has the minimum overall distance from the points, which leads us to use the (squared) Euclidean distance as the loss function. For a training example (x_i, y_i) and the corresponding prediction (x_i, y_hat_i), it is defined as:

L = (y_hat_i - y_i)²

Note that we have dropped the square root without loss of generality: the W and b that minimize the squared distance also minimize the distance itself. Now, we need to find the values of W and b that minimize this loss function. Here, we’ll make use of derivatives.

Considering that we have ‘m’ training examples, the cost function is defined as:

J = sum_over_all_training_examples((y_hat_i - y_i)²)
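As a minimal sketch on made-up toy data, the cost for a given W and b can be computed like this; note that it is just the sum of the per-example losses defined above:

import numpy as np

def cost(W, b, x, y):
    y_hat = W * x + b                  # predictions for all m examples
    return np.sum((y_hat - y) ** 2)    # sum of squared errors

x = np.array([1.0, 2.0, 3.0])          # toy inputs
y = np.array([2.0, 4.0, 6.0])          # toy outputs (y = 2x)
print(cost(2.0, 0.0, x, y))            # 0.0  -> a perfect fit
print(cost(1.0, 0.0, x, y))            # 14.0 -> a much worse fit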

Remember that we initialize W and b randomly. Since the cost is a quadratic function of the parameters, these random values land us at some point on this parabolic graph, and our goal is to reach its minimum. Taking the derivative of the cost function J with respect to W and b gives us the slope of the graph, and we keep updating the parameters until the slope becomes ideally zero (in practice, very close to zero).

Let’s think for a moment about what we need to achieve this objective. We want to crawl towards the minimum: not too slowly, not too quickly. Crawling slowly means taking small steps; moving rapidly means taking large steps, and sometimes those large strides overshoot the minimum to a point of no return. Clearly, we need a parameter that controls the step size. In machine learning, we call this parameter the ‘learning rate’, denoted by ‘alpha’. The algorithm that performs this shift towards the minimum is called Gradient Descent.

Gradient Descent

Gradient Descent is a self-explanatory term, and the whole algorithm lies in the meaning of the term. Gradient is just another name for the slope/derivative we talked about previously, and descent just means to drop or fall. So, the job of gradient descent is to start at some point on the cost function curve, drop a little at every step until it reaches the minimum, and report the optimum values of the parameters (here W and b). Let us define two thresholds, ‘delta_W’ and ‘delta_b’, that help us terminate the algorithm at convergence: if the change between two consecutive updates of W is less than delta_W and the change in b is less than delta_b, we can stop. Pseudo code for gradient descent is as follows:

# initialize the parameters randomly
W, b = random_value(), random_value()
while True:
    # take one gradient descent step on each parameter
    new_W = W - alpha * partial_derivative_wrt_W(J)
    new_b = b - alpha * partial_derivative_wrt_b(J)
    # stop once both parameters change by less than delta_W and delta_b
    if abs(new_W - W) < delta_W and abs(new_b - b) < delta_b:
        W, b = new_W, new_b
        break
    W, b = new_W, new_b

Gradient descent is implemented with a lot of optimizations in Python’s machine learning libraries. Our main aim here was just to get an understanding of how it works.

The formula for partial derivatives for the cost function (here Euclidean distance) can be found here.
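For completeness, here is a hedged NumPy sketch of the whole loop for y = Wx + b on made-up data. The partial derivatives of J = sum((Wx_i + b - y_i)²) are dJ/dW = 2 * sum((Wx_i + b - y_i) * x_i) and dJ/db = 2 * sum(Wx_i + b - y_i); the helper below is an illustration, not a library function.

import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-8, max_iters=100000):
    W, b = np.random.randn(), np.random.randn()          # random initialization
    for _ in range(max_iters):
        error = W * x + b - y                            # (y_hat_i - y_i) for every example
        dW = 2 * np.sum(error * x)                       # partial derivative of J wrt W
        db = 2 * np.sum(error)                           # partial derivative of J wrt b
        new_W, new_b = W - alpha * dW, b - alpha * db
        if abs(new_W - W) < tol and abs(new_b - b) < tol:  # convergence check
            return new_W, new_b
        W, b = new_W, new_b
    return W, b

x = np.array([1.0, 2.0, 3.0, 4.0])                       # toy data generated from y = 3x + 1
y = 3.0 * x + 1.0
print(gradient_descent(x, y))                            # approximately (3.0, 1.0)

With the derivative formulas plugged in, the loop converges to values close to the true W = 3 and b = 1 on this toy data.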

With this, we have reached the end of the section where we discussed Linear Regression with one variable (x). In the case of more than one variable, the equation of the line is slightly modified as follows:

Y = W1x1 + W2x2 + W3x3 + … + Wnxn + b    (1)

where there are n input features in total. Representing the predicted values by y_hat, our cost function for Linear Regression with multiple variables becomes:

J = sum_over_all_training_examples((y_hat_i - y_i)²)

Here, y_hat_i is given by equation (1) and y_i is the original (true) value for training example i.

In the gradient descent algorithm, we’ll compute partial derivatives of the cost function with respect to all the parameters (W1, W2, …, Wn, b) and then update them until convergence, as sketched below.
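A hedged, vectorized sketch of the same idea (on made-up synthetic data, with all weights stored in one NumPy vector) might look like this; X has one column per feature, so X @ W + b computes equation (1) for every example at once.

import numpy as np

np.random.seed(0)
m, n = 100, 3                                  # 100 examples, 3 features (toy sizes)
X = np.random.randn(m, n)
true_W, true_b = np.array([1.0, -2.0, 0.5]), 3.0
y = X @ true_W + true_b                        # synthetic targets

W, b, alpha = np.zeros(n), 0.0, 0.001
for _ in range(5000):
    error = X @ W + b - y                      # (y_hat_i - y_i) for all examples
    W -= alpha * 2 * (X.T @ error)             # all n weight derivatives at once
    b -= alpha * 2 * np.sum(error)
print(W, b)                                    # close to [1, -2, 0.5] and 3.0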

Now that we’ve developed an understanding of how Linear Regression works, let’s dive into a problem from Kaggle to put it into practice.

Problem Statement

Click here for the dataset. This dataset has 8 features and one output variable. Since the output values are real numbers, this is a regression problem. It can be stated as a multivariate linear regression problem:

Y = W1x1 + W2x2 + W3x3 + W4x4 + W5x5 + W6x6 + W7x7 + W8x8 + b

Let’s see how python packages help us in solving this linear regression problem.

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

We have imported numpy for linear algebra and other mathematical operations, pandas for data pre-processing and sklearn for applying linear regression and evaluating the model’s performance.

df = pd.read_csv('/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv')
df.head()

We read the input csv file through pandas and display the first 5 rows to study the properties of the data. Here, all the values are either integers or floating point numbers. If we had strings or other non-numeric data types, we would have to parse them and convert them to numerical form before feeding them to our linear regression model. The given data has 500 records, which is pretty small.
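This dataset needs no such conversion, but as a hedged illustration with a hypothetical ‘city’ column, one-hot encoding with pandas is a common way to turn strings into numbers:

import pandas as pd

toy = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi'],   # hypothetical string column
                    'score': [8.1, 7.4, 9.0]})
print(pd.get_dummies(toy, columns=['city']))                # 'city' becomes 0/1 indicator columns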

X = df.iloc[:, 1:-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

First of all, we separate our input and output values. Observe that we have ignored the serial no. column as serial no. has no effect on admission chances. Further, we split our data into corresponding training and test sets. We use 85% of the data for training and 15% of data for testing.

regressor = linear_model.LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

Now, we initialize the LinearRegression object from the scikit-learn library and feed it our training data so that the linear regression algorithm can fit the best possible line. LinearRegression minimizes the sum of squared errors (the same cost function we discussed above), so we don’t need to specify a cost function here.

Let’s have a look at the first 5 predicted output values.

print(y_pred[:5])

To conclude, let’s have a look at the weight parameters and the r² score.

print('Coefficients: \n', regressor.coef_)
print('R2 Score: %.2f' % r2_score(y_test, y_pred))

An r² score of 0.83 tells us that the model explains about 83% of the variance in the output variable; r² is a metric commonly used to evaluate regression models. More on the r² score can be found here. These are good results for a small dataset like this one. We could apply feature selection techniques to try to make the model more accurate, but since this is a very small dataset, that could lead to over-fitting. We’ll discuss these problems and techniques in later posts.
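For reference, here is a small sketch of what r2_score computes: one minus the ratio of the model’s residual sum of squares to the total variance around the mean.

import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # error left unexplained by the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance around the mean
    return 1 - ss_res / ss_tot

# r2(y_test, y_pred) would reproduce the score printed above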

The code for this post can be found here.

That’s it for this post. In the next post, we’ll have a detailed look at logistic regression. I’d love to hear your feedback for this post so please leave your invaluable suggestions down below.
