Simple Linear Regression Using Python

Vijay Gadre · Published in Geek Culture · Oct 23, 2021

Prerequisites

  • Familiarity with the Anaconda Environment
  • Familiarity with Python

To get hands-on experience, I'd suggest following along with the article:

First of all, open Anaconda Navigator and click the “Launch” button under Jupyter Notebook. This opens the Jupyter file browser on localhost in your web browser.

Once Jupyter opens, navigate to the directory where you have saved the dataset (Link to the Dataset).

And finally, click the New button in the top-right corner and select Python; this creates a new Python Jupyter Notebook.

1. Import the Relevant Libraries

We can import a library using Python's import statement.

To code a simple linear regression model using statsmodels, we will need NumPy, pandas, matplotlib, and statsmodels itself (see the import cell after the overview below).

Here is a quick overview of the following libraries:

  • NumPy — used to perform mathematical operations mainly using multi-dimensional arrays.
  • pandas — used for data manipulation and analysis.
  • matplotlib — a plotting library built on top of NumPy.
  • statsmodels — it is used to explore data, estimate statistical models and perform statistical tests.
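
A minimal import cell for this walkthrough could look like the following (np, pd, plt, and sm are the conventional aliases; statsmodels.api is the submodule that provides add_constant() and OLS used later on):

import numpy as np                # numerical operations on arrays
import pandas as pd               # data loading and manipulation
import matplotlib.pyplot as plt   # plotting
import statsmodels.api as sm      # statistical models, including OLS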

2. Import the Dataset

After importing the libraries, you can load the data into the notebook using the pandas method read_csv() (for CSV files) or read_excel() (for Excel files).

This imports the data, and you can verify it by inspecting the variable where you stored it; in our case, it is called data.
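
As a sketch, assuming the CSV sits in the same folder as the notebook and using a hypothetical file name (replace it with the actual name of the downloaded file):

data = pd.read_csv('simple_linear_regression.csv')   # hypothetical file name
data   # display the DataFrame to verify the import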

3. Descriptive Statistics

It is good practice to look at the descriptive statistics first, as they help us understand the dataset (e.g., whether any outliers are present).

We can perform descriptive statistics using the following command:

dataframe.describe()

If you want to see the descriptive statistics of all columns, regardless of their data types, use dataframe.describe(include='all') instead.
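
Applied to our DataFrame (stored in the variable data), the two variants would be:

data.describe()                # numeric columns only
data.describe(include='all')   # all columns, regardless of data type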

Luckily, in our case, we don’t have any outliers in the data. We can check for outliers using the following steps:

  • Look at the mean of the column variable.
  • Look at the min, 25%, 50%, 75%, and max rows.

The 25%, 50%, and 75% rows are percentiles: 25% of the SAT scores fall below 1772.00, 50% fall below 1864, and 75% fall below 1934.

  • If the max value of a variable is far away from the mean, we can say that an outlier is present in that particular column.

The SAT and GPA variables don’t have any outliers, as their max observations sit close to the mean. If the max observation of SAT were, say, 40000.00, we could say that an outlier is present and we would have to remove it.

4. Create Your First Linear Regression

Declare the Dependent and Independent Variables

To create a linear regression, you’ll have to define the dependent (targets) and the independent variable(s) (inputs/features).

We have to predict GPA based on SAT scores, so our dependent variable would be GPA and the independent variable would be SAT.

To declare them, we can use standard pandas column indexing:

dataframe['column_name']
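
Assuming the dataset's columns are named SAT and GPA, the declarations would be:

y = data['GPA']    # dependent variable (target)
x1 = data['SAT']   # independent variable (input/feature)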

Explore the Data

We can visualize the data with a scatter plot:

plt.scatter(independent_variable, dependent_variable)

(The first argument is the data plotted on the x-axis, and the second argument is the data plotted on the y-axis.)
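
A complete plotting cell, using the y and x1 variables declared above, might look like this:

plt.scatter(x1, y)    # SAT on the x-axis, GPA on the y-axis
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()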

We can see that, as the SAT scores increase, so do the GPA scores. Hence, we can say that there is a linear trend between the two variables.

Linear Regression

Now let’s play with our best friend, statsmodels.

To perform a linear regression, we should always add the bias term, or intercept (b0). We can do this using the add_constant() method from statsmodels.api (imported as sm):

sm.add_constant(independent_variable)

This creates a new constant column, equal in length to the independent variable, which consists only of 1s.
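
Applied to our independent variable x1 and stored for the next step, a sketch would be:

x = sm.add_constant(x1)   # adds a 'const' column of 1s alongside SAT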

Let’s fit the linear regression model using Ordinary Least Squares (OLS), passing the dependent variable and the independent variable (with the constant added) as the arguments.

And finally, let's print the summary table:
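
Putting the fitting and the summary together, the cell could look like this:

results = sm.OLS(y, x).fit()   # ordinary least squares: regress y on [const, SAT]
results.summary()              # summary tables: coefficients, p-values, R-squared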

It creates three tables, and we are mostly interested in the second one, which contains the coefficients, p-values, etc.

From the summary table, we can see that the intercept (const) coefficient is 0.275 and the independent variable coefficient is 0.0017 (meaning that if SAT increases by 1 unit, GPA increases by 0.0017 units).

If the p-value of an independent variable is greater than 0.050, we say that the variable is not significant and we may drop it. In our case, the p-value of SAT is 0.000, and as it is less than 0.050, we can say that this variable is significant. [Also, thinking logically, SAT is a sensible variable for predicting GPA.]

And finally, R-squared (R²) measures the goodness of fit of the model, in other words, how well our model fits the observations. R² ranges between 0 and 1. There is no universal threshold for a good R²; it depends on the particular case.

Adjusted R², on the other hand, is a modified version of R² that accounts for the number of predictors (inputs) in the model. It increases when a new predictor improves the model more than would be expected by chance, and decreases when a predictor improves the model less than expected.

5. Plot the Regression Line

To plot the regression line on the graph, simply define the linear regression equation, i.e., y_hat = b0 + (b1*x1)

b0 = coefficient of the constant (intercept)

b1 = coefficient of the input variable

and finally, plot the regression line using matplotlib's plt.plot(), as shown below:
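
A sketch of that cell, plugging in the b0 and b1 values reported in the summary table above:

plt.scatter(x1, y)                         # the original observations
yhat = 0.275 + 0.0017 * x1                 # regression equation: y_hat = b0 + b1*x1
plt.plot(x1, yhat, color='orange', label='regression line')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.legend()
plt.show()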

With that, we reach the end of our article. Thanks for reading.

