Exploring Linear Regression

Single and Multivariate

Mulbah Kallen · Published in Analytics Vidhya · Sep 25, 2019


Linear regression is a statistical model that examines the relationship between two or more variables. The variables of interest are our dependent variable (y), our target, and our independent variable(s) (x), also known as our feature(s). It is important to note that linear regression is used for predicting numerical (continuous) data.

To start with, we will look at a single-variable (simple) linear regression model.

A linear relationship can be one of two things:

positive — spending more on advertising increases sales
negative — increased surveillance decreases instances of theft

The relationship between y and x is represented by the equation

y = mx + b

where b is sometimes written as alpha and m as beta.

Here y, our dependent variable, is the variable we are trying to predict, and x, our independent variable, is the variable we use to make our predictions. m is the slope of our regression line (positive or negative) and represents the effect that x has on y. b is the intercept, a constant estimated from our data.

To begin with, let’s import the necessary libraries and the advertising.csv dataset we will be using for our example, and get a quick look at what our dataset looks like.
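
A minimal sketch of that setup, assuming advertising.csv sits in the working directory and its columns are named TV, radio, newspaper, and sales:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load the advertising dataset
data = pd.read_csv('advertising.csv')

# Quick look at the first few rows
print(data.head())
```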

We have three independent variables that we can use to predict sales, but as previously stated we will start with one variable: TV.

Next, we are going to initialize and fit our linear regression model using the statsmodels library we imported. Our first model will predict sales based on how much was spent on TV advertisements.
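
A minimal sketch of that step, assuming the DataFrame and column names from above:

```python
# Ordinary Least Squares: model sales as a linear function of TV spend
model = smf.ols(formula='sales ~ TV', data=data).fit()
```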

I will take a moment here to talk about the specific type of linear regression model we are running, as you may have noticed smf.ols, where smf is our stats library and ols stands for Ordinary Least Squares. This method “estimates the relationship by minimizing the sum of the squares in the difference between the observed and predicted values of the dependent variable configured as a straight line.”
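
Concretely, with one predictor, OLS picks the m and b that minimize the sum of squared residuals:

sum over all data points of (y_i − (m·x_i + b))²

where y_i is an observed value and m·x_i + b is the model’s prediction for it.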

Visually OLS looks like this:

Ordinary Least Squares Regression Model

Note that the distance between each colored dot (real data point) and the red line (regression line) is measured and then squared. OLS finds the line of best fit by minimizing the sum of these squared distances. The smaller that sum, the better your model is at predicting the target.

“There are many other prediction techniques much more complicated than OLS, like logistic regression, weighted least-squares regression, robust regression and the growing family of non-parametric methods.”
— Victor Powell and Lewis Lehe

Immediately after we will run model.params in order to get our b and m values conveniently labeled.
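
Something like the following, with the rounded values this article reports shown as comments:

```python
# b (Intercept) and m (TV) estimated from the data
print(model.params)
# Intercept    7.032
# TV           0.047
```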

Recalling y = mx + b, we now have sales = 0.047*TV + 7.032.

To get a more in-depth summary of our model we will run print(model.summary()).
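
In Python 3 that call looks like this:

```python
# Full results: R-squared, coefficients, p-values, confidence intervals
print(model.summary())
```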

There is a lot of information in the results, but the key takeaways for now are the R-squared (top right) and the confidence intervals located at the end of the Intercept and TV rows. R-squared is 0.612, which is not horrible but can be improved. We still have two more variables we can add (radio, newspaper). Our confidence intervals are narrow but could also use improvement.

R-squared is a statistical measure of how close the data are to the fitted regression line.

At this point our regression model has been fit (trained), so we can predict the value of sales using the previously stated equation, or we can set a new variable equal to model.predict() and graph our results.
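
A sketch of the second approach; the statsmodels formula API accepts a DataFrame whose column name matches the predictor:

```python
# Predict sales for new TV ad spends
X_new = pd.DataFrame({'TV': [50, 400]})
print(model.predict(X_new))
```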

With this model, we can now predict what sales will be from any amount spent on TV.

Based on our model, if 400 is spent on TV advertisements, sales will be roughly 26 units (0.047 × 400 + 7.032 ≈ 25.8).

Continuing on to multivariate linear regression is as simple as adding more variables to our model. We will use the same data, but this time we will use a different library, scikit-learn.

Next, we will build our linear regression model using TV, radio, and newspaper as predictors. We will split our data into predictors X and output y, then initialize and fit (train) our model.
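
A minimal sketch of that step using scikit-learn’s LinearRegression (the variable names X, y, and lm are my own):

```python
from sklearn.linear_model import LinearRegression

# Split the data into predictors (X) and output (y)
X = data[['TV', 'radio', 'newspaper']]
y = data['sales']

# Initialize and fit (train) the model
lm = LinearRegression()
lm.fit(X, y)
```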

Once we’ve trained our model we can go ahead and view our alpha and beta values.
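
With scikit-learn, the fitted alpha (intercept) and betas (one coefficient per predictor) are attributes of the model:

```python
# alpha (intercept) and betas (coefficients), in the column order of X
print(lm.intercept_)
print(lm.coef_)  # [TV, radio, newspaper]
```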

Finally, we can go ahead and create our predictions and see if the additional variables have strengthened our model.
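
Generating the predictions is a single call (here we predict on the same data we trained on):

```python
# Predicted sales for every row in X
predictions = lm.predict(X)
```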

Because we used a different library this time, we will import r2_score from sklearn.metrics in order to see how well we trained our model.
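
A sketch of that comparison; r2_score measures the actual sales against our model’s predictions:

```python
from sklearn.metrics import r2_score

# R-squared of the multivariate model
print(r2_score(y, predictions))  # ~0.897 in this example
```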

Our final score was 0.897, a significant increase from 0.612. Adding more variables (features) to train our model improved our model’s fit.

In our next section, we will discuss how to deal with predictors that are categorical variables.
