First steps into AI and Linear Regression

Aiswarya M
6 min read · Aug 28, 2020


AI Workshop — Part I

Doesn’t the thought of “Artificial Intelligence” (AI) sound exciting? It’s all the rage now. As technology and human skills progress, people from different sectors are coming together to build decision-making into machines. So, what is Artificial Intelligence? It is the development of machines that can think, perceive the world and act like a human being. It involves translating certain human traits into machines, such as visual and audio perception, decision-making, thought process and cognition.

I am Aiswarya. I completed my B.Tech in Mechatronics this year and have started working as a Design Engineer. A month back, I registered for a workshop on AI conducted by Lema Labs, IITM. We began with the basics of AI: its definition, classification and applications. First, we familiarised ourselves with Python basics and the NumPy and Pandas packages. The coding was done in Python using Google Colab Notebook, a user-friendly, cloud-based IDE.

Linear Regression

Linear Regression comes under Supervised Machine Learning: it predicts a dependent variable (Output, y) based on independent variable(s), also called ‘features’ (Input, x). If there is one input feature (n=1), it is called univariate linear regression. Otherwise (n>1), it is multivariate linear regression.

The task was to predict the price (Output — y) based on the square feet size of the house (Input — x).

1. Reading the data

A CSV file held x and y in its first two columns. The file was read into memory. Let’s say there were m data samples.
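A minimal sketch of this step in Python; the file name house_prices.csv and the column order are assumptions for illustration, not the workshop’s actual file:

```python
import pandas as pd

# Read the CSV; its first two columns are taken as x (house size) and y (price).
data = pd.read_csv('house_prices.csv')
x_raw = data.iloc[:, 0].values   # square-feet size of each house
y_raw = data.iloc[:, 1].values   # price of each house
m = len(x_raw)                   # number of data samples
```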

2. Linearity check

The house size (x) was plotted against the house price (y) to verify if the relation between x and y was linear.

Square-feet size of the house (x) vs. the house price (y), read from the CSV file (raw data)
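A sketch of how this check might be done with matplotlib, continuing from the snippet above:

```python
import matplotlib.pyplot as plt

# Scatter plot of raw house size against raw price to eyeball linearity.
plt.scatter(x_raw, y_raw, s=10)
plt.xlabel('House size (sq. ft.)')
plt.ylabel('House price')
plt.title('Raw data: size vs. price')
plt.show()
```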

3. Feature Scaling

Data pre-processing is an important step that brings x and y into the same range, generally between -1 and 1. This makes the mathematical computations easier to handle. The scaling method used here is called “Mean Normalisation”, in which the ‘normalised’ x is computed from the raw values of x:

Eq(1): x = (raw_x - Mean(raw_x)) / (Max(raw_x) - Min(raw_x))

y was also normalised. Hereafter all the computations were carried out on the normalised values.
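A small sketch of Eq(1) applied to both columns, reusing x_raw and y_raw from above:

```python
import numpy as np

def mean_normalise(v):
    """Mean normalisation as in Eq(1): centre on the mean, scale by the range."""
    return (v - np.mean(v)) / (np.max(v) - np.min(v))

x = mean_normalise(x_raw)
y = mean_normalise(y_raw)
```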

4. Linear Form

Since the output (price) varies linearly with the input, house size (x), the mathematical equation will be

Eq(2): f(x) = θx + k

If θ and k are replaced with θ1 and θ0 respectively, the equation becomes

Eq(3): f(x) = θ0 + xθ1

θ0 is called the bias term. x and y are column arrays of size (m,1) representing house size and price respectively. They were converted to matrix form, and a column of 1s was inserted as the first column of x so that the bias term is multiplied by 1. Now the size of x is (m,2).

The θ matrix (of size (1,2)) was formulated as

Eq(4): θ = [θ0, θ1]

θ0 and θ1 are the “parameters”. They are initialised to 1.
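A sketch of this setup in NumPy, continuing from the normalised x and y above:

```python
# Prepend a column of 1s for the bias term and shape everything as matrices.
X = np.column_stack((np.ones(m), x))   # shape (m, 2): [1, house size]
y = y.reshape(m, 1)                    # shape (m, 1)
theta = np.ones((1, 2))                # [θ0, θ1], both initialised to 1
```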

In the case of multivariate regression, where there are n input features, Eq(3) can be generalised as

Eq(5): f(X) =X*Transpose(θ)

Eq(6): f(X) = θ0 + x1*θ1 + x2*θ2 + … + xn*θn

where

Eq(7): X = [1, x1, x2, … xn], Size(X) = 1 row, n+1 columns

Eq(8): θ = [θ0, θ1, θ2, … θn], Size(θ) = 1 row, n+1 columns
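In NumPy, the generalised form of Eq(5) is a single matrix product. A minimal sketch, reusing the X and θ built above (the same call works for the univariate case, where X has two columns):

```python
def predict(X, theta):
    """Vectorised form of Eq(5): one prediction per row of X."""
    return X @ theta.T   # shape (m, 1)
```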

5. Error computation

The parameters should be found such that the error between f(x) and y is minimum. To avoid positive and negative errors cancelling out, the squared error was considered. The error was computed over all m data samples. The overall error cost function, J, between f(x) and y is:

Eq(9): J = (1/(2m)) * Σ (f(x_i) - y_i)², summed over the m samples

This is the initial error. J is a function of θ.
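A sketch of Eq(9), using the predict helper defined above:

```python
def cost(X, y, theta):
    """Squared-error cost of Eq(9), averaged over all m samples."""
    m = len(y)
    errors = predict(X, theta) - y
    return float(np.sum(errors ** 2) / (2 * m))
```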

6. Training using Gradient Descent Algorithm

The dataset can be split into training and testing sets, and the subsequent steps can be carried out on these sets individually.

Gradient Descent is an algorithm used to find the parameters that account for minimum error cost function (J), which ranges from 0 to +∞.

The following steps of the algorithm are repeated for a large number of iterations until the cost function, J, reaches a minimum.

  • A ‘gradient’ is added to the initial/old parameters to get the updated/new parameters.

Eq(10): Gradient = -c*∂J(θ)/∂θ

The gradient carries a negative sign because the cost function, J, must move towards its minimum value. In Eq(10), ‘c’ is a hyperparameter called the ‘learning rate’. If the learning rate is too low, more iterations may be required to reach the minimum error cost, which increases the computation time. If the learning rate is too high, the error cost, J, might overshoot the minimum and fail to converge. The learning rate is found by trying different values and adopting the one that yields the best result.

Eq(11): Δθ(old) = Gradient

Eq(12): θ(updated) = θ(old) + Δθ(old)

  • The updated parameters are substituted in Eq(3) to compute the new error.

The steps are carried out for all the parameters. For univariate linear regression there is only one feature (n=1), so there are 2 parameters, θ0 and θ1: one for the feature and one for the bias term. In this case, the single feature is the house size. In the case of multivariate regression (n>1), there will be (n+1) parameters.

After finding θ(updated) for all the n+1 parameters, the θ for the next iteration are updated simultaneously. The error cost function, J, is recorded at every iteration.
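A sketch of the training loop under these conventions, continuing from the snippets above; the learning rate c = 0.1 and the iteration count of 1000 are illustrative values, not the ones used in the workshop. The recorded J values are what the plot below shows.

```python
def gradient_descent(X, y, theta, c=0.1, iterations=1000):
    """Repeat the parameter update of Eq(10)-(12), recording J at every step."""
    m = len(y)
    J_history = []
    for _ in range(iterations):
        errors = predict(X, theta) - y         # f(x) - y, shape (m, 1)
        gradient = -c * (errors.T @ X) / m     # Eq(10): -c * ∂J/∂θ, shape (1, n+1)
        theta = theta + gradient               # Eq(12): simultaneous update of all θ
        J_history.append(cost(X, y, theta))    # record J at every iteration
    return theta, J_history

theta, J_history = gradient_descent(X, y, theta)
```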

Error cost function (J) vs. the number of iterations

It is the nature of the algorithm to yield a lower error cost as the number of iterations increases, i.e., increasing the number of iterations reduces the error function at the expense of more computation and a longer run time.

This is how the model is trained with a dataset of predetermined y values.

7. Accuracy of the model

Now that the parameters are found, the prediction model is ready.

Eq(13): y_predicted = f(x) = θ0 + xθ1

where θ0 and θ1 are the latest values found using the Gradient Descent Algorithm.

The accuracy of this model is found from the mean absolute error between y_predicted (the predicted output) and y (the original output) as:

Eq(14): Accuracy = 1 - Mean_Absolute_Error(y_predicted, y)

Eq(15): Accuracy_in_percentage = Accuracy*100 %
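A sketch of Eq(14) and Eq(15) on the normalised training data, using the trained θ from above:

```python
# Accuracy as defined in Eq(14)/Eq(15): one minus the mean absolute error
# between the predictions and the normalised targets.
y_predicted = predict(X, theta)
mae = float(np.mean(np.abs(y_predicted - y)))
accuracy = 1 - mae
print(f'Accuracy: {accuracy * 100:.2f} %')
```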

8. Predicting the output in real-time

For predicting an output using this model in real time, the input features are taken from the user, normalised, passed through the prediction model Eq(13) and then denormalised. The output lies within +/- offset_output of y_predicted, where

Eq(17): offset_output = y_predicted*((1/Accuracy) — 1)

Thus the output of the model is y_predicted +/- offset_output.
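A sketch of this real-time step, reusing the raw-data statistics and trained θ from the snippets above; the prompt text is illustrative:

```python
# Take a house size from the user, normalise it with the training statistics,
# apply Eq(13), then denormalise the prediction back to a price.
size_input = float(input('House size in sq. ft.: '))

size_norm = (size_input - np.mean(x_raw)) / (np.max(x_raw) - np.min(x_raw))
price_norm = float(theta[0, 0] + theta[0, 1] * size_norm)               # Eq(13)
price_pred = price_norm * (np.max(y_raw) - np.min(y_raw)) + np.mean(y_raw)

offset = price_pred * ((1 / accuracy) - 1)                              # Eq(17)
print(f'Predicted price: {price_pred:.0f} +/- {offset:.0f}')
```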

Challenges faced

  • It was difficult to determine an optimum learning rate. It was varied arbitrarily and the one that gave maximum accuracy was chosen.
  • If the input features involve strings, then label encoders should be used to find their equivalent float value.
  • In the case of multivariate linear regression, the n input features were first framed into a matrix and then normalised. This did not work out well: the product of the matrix and the transpose of the new parameters resulted in a null matrix, probably because the normalised input matrix contained null values due to a shape mismatch before the bias term was added. The problem was solved by normalising each input feature individually and appending it to the input matrix x after the column of 1s (the bias term).
