Linear Regression

Ayana K Sunil
4 min read · Aug 17, 2021


Things to be taken care of

Regression is a method for predicting a target variable y from one or more input variables. Linear regression does this by finding the line that best describes the relationship between the independent variable x and the dependent variable y. The goal is to study how the input feature x relates to the target value y, and then to output a continuous value for a new, unseen input.

Photo by Green Chameleon on Unsplash

Simple linear regression uses a single independent variable to predict a dependent variable by fitting the best straight line through the data.

The equation for simple linear regression is: y = b0 + b1*x

Here, y is the dependent variable, b0 is the constant (the intercept), b1 is the coefficient of x (the slope), and x is the independent variable.
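To make the equation concrete, here is a minimal sketch of how the constant and the coefficient (called b0 and b1 in the code) can be estimated by ordinary least squares. The function name and the five data points are made up for illustration; they are not from the dataset used with this article.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2*x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_simple_linear_regression(x, y)
print(f"y = {b0:.2f} + {b1:.2f}*x")
```

On real data the points will not sit exactly on a line, and the same formulas then give the line with the smallest total squared error.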

Before building the model, we have to analyze the data. That means checking whether any outliers are present and looking at the correlation between the dependent and independent variables.

Examining the data, checking for outliers, and looking at the correlation between the input values and the target values all come under exploratory data analysis (EDA). To spot outliers, we can use a boxplot, a histogram, or both. These plots show how the data is distributed and whether any values lie far away from the rest. If outliers are present, they have to be handled. Common options are to delete the observation, transform the values, impute a replacement value, or treat the outliers separately in the statistical model.
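The outlier check can also be done numerically. The sketch below uses the 1.5 × IQR rule, which is the convention most boxplots use to draw their whiskers; the rule, the function name, and the sample values are assumptions added for illustration, not something taken from this article's dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up measurements with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]
outliers = iqr_outliers(data)
print("Outliers:", outliers)
```

Whether a flagged value should be deleted, transformed, or imputed still depends on why it is extreme, so the rule is only a starting point.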

Once the outliers have been dealt with, we look at the correlation in the data, mainly with the help of a scatterplot. It helps to identify three things about the relationship between the dependent and independent variables: linearity, direction, and strength. If the points are widely scattered (low linearity), follow no particular direction, and are loosely packed, the correlation is weak. If the points rise together along a line and are tightly packed, there is a strong positive correlation, and a straight line can describe the data points well. The same holds for a strong negative correlation, where the points fall along a line instead. As a rule of thumb, the correlation is considered strong if the absolute value of the correlation coefficient (r) is greater than 0.85.
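The correlation coefficient r can be computed directly to back up what the scatterplot suggests. This is a minimal sketch of the Pearson correlation coefficient; the function name and the data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y rises almost linearly with x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
print("strong correlation" if abs(r) > 0.85 else "weak or moderate correlation")
```

The sign of r gives the direction and its absolute value gives the strength, so r always lies between -1 and 1.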

After this, it is time to split the data into a training set and a test set. We build the model on the training set so that it explains that data well, and then feed the test data to the same model to get predictions. There are a few things to keep an eye on while building the model: the R² value (coefficient of determination), the p-value (probability), and the RMSE value (root mean squared error). R² is the proportion of the variation in the output that can be explained by the input variables. The higher the R² value, the better the model fits the data. The p-value of the coefficient should be small; a common threshold is 0.05 or less. The lower the p-value, the stronger the evidence that x really is related to y. The RMSE should be low. It is the square root of the average squared difference between the actual values and the predicted values, so it measures the typical size of the error. It is better to have a model with the smallest error.
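The split, fit, and evaluate steps can be sketched as follows. The data is synthetic (generated around the made-up line y = 2 + 3x with random noise), the 80/20 split ratio is an assumption, and all function names are hypothetical. The p-value is left out because it needs a t-distribution, which the Python standard library does not provide.

```python
import math
import random


def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data: y = 2 + 3x plus Gaussian noise (seeded so the run is repeatable)
rng = random.Random(0)
xs = [i * 0.2 for i in range(50)]
ys = [2 + 3 * xv + rng.gauss(0, 1) for xv in xs]

# Shuffle the row indices, then keep 80% for training and 20% for testing
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit on the training rows only
b0, b1 = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Evaluate on the held-out test rows
x_test = [xs[i] for i in test_idx]
y_test = [ys[i] for i in test_idx]
y_pred = [b0 + b1 * xv for xv in x_test]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"R^2 on test set  = {r_squared(y_test, y_pred):.3f}")
print(f"RMSE on test set = {rmse(y_test, y_pred):.3f}")
```

Measuring R² and RMSE on the test set rather than the training set shows how the model behaves on data it has not seen.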

Prediction intervals and confidence intervals are the next important terms to discuss. These are two types of intervals used with predictions in regression and other linear models. A prediction interval is a range in which a single new observation is likely to fall for the specified settings of the predictors. A confidence interval for the mean response is a range in which the mean response is likely to fall for the same settings of the predictors. Neither interval is a guarantee: each has an upper and a lower limit and comes with a confidence level, such as 95%. The prediction interval is always wider than the confidence interval at the same level. This is because predicting a single response involves added uncertainty: on top of the uncertainty in the estimated mean, an individual observation also scatters around that mean.
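The difference between the two intervals shows up in a single extra term in the standard formulas for simple linear regression. The sketch below assumes ten made-up data points, so the 95% critical value of the t-distribution with 8 degrees of freedom (2.306) is hard-coded; with a different sample size that number has to change. The function name is hypothetical.

```python
import math

# Two-sided 95% critical value of the t-distribution with n - 2 = 8 degrees of freedom
T_CRIT_95_DF8 = 2.306


def intervals_at(x, y, x0, t_crit):
    """Return (y_hat, confidence_interval, prediction_interval) at x0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Residual standard error, using n - 2 degrees of freedom
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x0
    # Uncertainty in the estimated mean response at x0
    mean_term = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = t_crit * s * math.sqrt(mean_term)
    # The extra "1 +" accounts for the scatter of a single new observation
    pi_half = t_crit * s * math.sqrt(1 + mean_term)
    return y_hat, (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)


# Made-up data: ten points scattered around a rising line
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.3, 4.1, 6.2, 7.9, 10.4, 11.8, 14.1, 16.3, 17.9, 20.2]

y_hat, ci, pi = intervals_at(x, y, x0=5.5, t_crit=T_CRIT_95_DF8)
print(f"prediction at x0      : {y_hat:.2f}")
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
```

Both intervals are centered on the same predicted value, and the prediction interval always contains the confidence interval.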

With this, we have covered the main things involved in a simple linear regression. The code for the implementation of a simple linear regression along with the explanation and dataset is given here. Have a look at it and see how it works. Happy Learning!


Ayana K Sunil

Machine Learning Enthusiast. Writer. Dancer. Speaker. Currently pursuing my master’s in Computational Linguistics.