[DS0001] Linear Regression and Confidence Interval: A Hands-On Tutorial

Published in The Startup

4 min read · Nov 25, 2020

Motivation

This tutorial walks you through building a linear regression model and a confidence interval around its predictions, using common data science libraries such as scikit-learn and pandas. In our example, linear regression is used to estimate how many charging cycles a battery can sustain before it dies. Don’t worry if you know nothing about batteries: all the data is available for download, and the only prerequisite is some knowledge of Python.

Import what we need

To use tools that are already implemented, we first need to import the libraries and components. The next block of code imports pandas, NumPy, Matplotlib, SciPy, and scikit-learn's LinearRegression, which let us read the data, build the linear regression model, plot the results, and compute the confidence interval.

#!/usr/bin/env python3
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import pearsonr from the public scipy.stats namespace
from scipy.stats import pearsonr
from scipy import stats

Loading the data

In this tutorial, I will be using data from my research on battery state-of-life estimation. Don’t worry about the meaning of the data right now; it will not affect our results. Download the .csv file from here and place it in the same folder as your main Python file.

After that, load the file with pandas and read the column named “voltage_integral”, as in the code below:

my_pandas_file = pd.read_csv('cs_24.csv')
y_data = my_pandas_file.get('voltage_integral')
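One detail of the column lookup is worth knowing: `DataFrame.get` returns `None` instead of raising an error when the column name is missing, so a typo may only surface several lines later. The sketch below illustrates the difference with a tiny in-memory CSV. The values and the misspelled column name are invented for illustration and do not come from `cs_24.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for cs_24.csv; the numbers are invented.
csv_text = "cycle,voltage_integral\n0,7000.0\n1,6990.5\n2,6981.2\n"
df = pd.read_csv(io.StringIO(csv_text))

y_data = df.get('voltage_integral')    # a pandas Series with three values
missing = df.get('voltage_integrall')  # misspelled name: returns None, no error

try:
    df['voltage_integrall']            # bracket indexing raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(len(y_data), missing, raised)
```

If you prefer to fail early on a wrong column name, bracket indexing (`my_pandas_file['voltage_integral']`) is one option.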

To fit a linear regression we need a second axis. Here, the x-axis is simply the index of each sample in y_data. Note that scikit-learn expects the inputs as a 2D array of shape (n_samples, n_features), so we use the reshape method to turn the 1D index vector into a single-column array.

x_data = np.arange(0,len(y_data), 1)
x_data_composed = x_data.reshape(-1,1)
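To make the effect of the reshape concrete, here is a minimal sketch with a hypothetical length of 5 samples instead of the real data. The `-1` asks NumPy to infer the number of rows, and the `1` fixes a single column.

```python
import numpy as np

n_samples = 5  # hypothetical length, standing in for len(y_data)

x_data = np.arange(0, n_samples, 1)      # 1D vector: [0 1 2 3 4]
x_data_composed = x_data.reshape(-1, 1)  # 2D column: one row per sample

print(x_data.shape)           # (5,)
print(x_data_composed.shape)  # (5, 1)
```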

Creating our model

Before fitting a linear regression, it’s usual to check whether there is a strong linear correlation between the variables. To do so, calculate the Pearson correlation coefficient. A value near 1 means the variables have a strong positive correlation; a value near -1 means a strong negative correlation; and a value near 0 means there is little or no linear correlation, so a linear regression will not help much. SciPy provides a function to calculate it easily:

correlation = pearsonr(x_data, y_data)
>> (-0.9057040954006549, 0.0)

The output is a pair: the correlation coefficient and the p-value. The coefficient of -0.91 tells us that our data has a strong negative correlation: as you can see in the plot below, when our x value increases, our y value decreases.
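If you are curious about what the coefficient measures, it can be computed by hand as the covariance of the two variables divided by the product of their spreads. The sketch below does this with NumPy on invented, perfectly linear data (not the battery data); `pearson_r` is a hypothetical helper name, not a library function.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()  # center both variables
    y_c = y - y.mean()
    return np.sum(x_c * y_c) / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2))

x = np.arange(10)
y_down = 100.0 - 3.0 * x  # invented data: y falls as x grows
y_up = 2.0 * x + 1.0      # invented data: y grows with x

print(pearson_r(x, y_down))  # close to -1
print(pearson_r(x, y_up))    # close to +1

# Cross-check against NumPy's own correlation matrix.
print(np.isclose(pearson_r(x, y_down), np.corrcoef(x, y_down)[0, 1]))
```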

To create our linear model, we just instantiate the imported LinearRegression component and fit it to the data loaded from the file. Then, to see how the model compares with the data, we plot the model's predictions over the source data:

# fit the model on the arrays defined above
lin_regression = LinearRegression().fit(x_data_composed, y_data)
model_line = lin_regression.predict(x_data_composed)
plt.plot(y_data)
plt.plot(model_line)
plt.xlabel('Cycles')
plt.ylabel('Volts x seconds')
plt.title('Voltage integral CCCT charge during battery life')
plt.ylim(0,8000)
plt.show()

After running the code, the plot below will show up on your screen:

Our model is done and we have our plot. Now it’s time to add more confidence to our prediction model by drawing a confidence interval on the plot.
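Before moving on, it can help to sanity-check the fitted line. With a single feature, ordinary least squares has a closed-form solution, so the slope and intercept can be reproduced with plain NumPy. This is a minimal sketch on invented data (a decreasing line plus noise), not the battery data. With scikit-learn, the corresponding values should be `lin_regression.coef_[0]` and `lin_regression.intercept_`, up to floating-point error.

```python
import numpy as np

# Invented data: a decreasing line plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 7000.0 - 20.0 * x + rng.normal(0.0, 5.0, size=x.size)

# Closed-form ordinary least squares for one feature:
# slope = covariance(x, y) / variance(x); the line passes through the means.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predictions on the training inputs, analogous to model_line.
model_line = intercept + slope * x

print(slope, intercept)
```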

Calculate and plot our confidence interval

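A 95% interval is a range around each prediction within which we expect about 95% of the observed values to fall. (Strictly speaking, an interval for individual observations is called a prediction interval, which is the name the function below uses.) It is calculated from the standard deviation of the residuals, that is, the differences between the real data and the model's predictions, and from the Gaussian (normal) distribution. We will create a function that calculates the interval for a single prediction and then run it for all predictions.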
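In symbols, the computation that this function performs can be written as follows. The notation is my own shorthand: n is the number of samples, y_i the real values, ŷ_i the model's predictions, p the interval level (0.95 by default), and Φ⁻¹ the inverse of the standard normal cumulative distribution function.

```latex
s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
z = \Phi^{-1}\!\left(1 - \frac{1-p}{2}\right),
\qquad
\text{interval} = \left[\hat{y} - z\,s,\ \hat{y} + z\,s\right]
```

For p = 0.95, z is approximately 1.96. Note that this is a simplified band with the same width everywhere; it does not account for the uncertainty in the fitted slope and intercept.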

def get_prediction_interval(prediction, y_test, test_predictions, pi=.95):
    '''
    Get a prediction interval for a linear regression.

    INPUTS:
    - Single prediction,
    - y_test
    - All test set predictions,
    - Prediction interval threshold (default = .95)

    OUTPUT:
    - Prediction interval for single prediction
    '''
    # get standard deviation of the residuals (y_test - test_predictions)
    sum_errs = np.sum((y_test - test_predictions)**2)
    stdev = np.sqrt(1 / (len(y_test) - 2) * sum_errs)

    # get interval from standard deviation
    one_minus_pi = 1 - pi
    ppf_lookup = 1 - (one_minus_pi / 2)
    z_score = stats.norm.ppf(ppf_lookup)
    interval = z_score * stdev

    # generate prediction interval lower and upper bound (cs_24)
    lower, upper = prediction - interval, prediction + interval
    return lower, prediction, upper

# Plot the 95% interval around the linear regression (cs_24)
lower_vet = []
upper_vet = []
for i in model_line:
    lower, prediction, upper = get_prediction_interval(i, y_data, model_line)
    lower_vet.append(lower)
    upper_vet.append(upper)

plt.fill_between(np.arange(0,len(y_data),1), upper_vet, lower_vet, color='b', label='Confidence Interval')
plt.plot(np.arange(0,len(y_data),1), y_data, color='orange', label='Real data')
plt.plot(model_line, 'k', label='Linear regression')
plt.xlabel('Cycles')
plt.ylabel('Volts x seconds')
plt.title('95% confidence interval')
plt.legend()
plt.ylim(-1000,8000)
plt.show()
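Because the interval width is the same for every prediction, the loop above recomputes the same standard deviation once per point. As an optional alternative, the sketch below computes the whole band in one vectorized call. It runs on invented data, `prediction_band` is a hypothetical name, and it uses the standard library's `statistics.NormalDist` in place of `stats.norm.ppf` so that it depends only on NumPy; both should give the same z-score.

```python
from statistics import NormalDist
import numpy as np

def prediction_band(y_true, y_pred, pi=0.95):
    """Lower and upper bounds for all predictions at once (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # residual standard deviation with n - 2 degrees of freedom
    sum_errs = np.sum((y_true - y_pred) ** 2)
    stdev = np.sqrt(sum_errs / (len(y_true) - 2))
    # two-sided z-score, about 1.96 for pi = 0.95
    z_score = NormalDist().inv_cdf(1 - (1 - pi) / 2)
    interval = z_score * stdev
    return y_pred - interval, y_pred + interval

# Invented data: a known line plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.arange(200, dtype=float)
y_pred = 7000.0 - 20.0 * x
y_true = y_pred + rng.normal(0.0, 50.0, size=x.size)

lower, upper = prediction_band(y_true, y_pred)

# Fraction of points inside the band; expected to be roughly 0.95.
coverage = np.mean((y_true >= lower) & (y_true <= upper))
print(coverage)
```

The resulting `lower` and `upper` arrays can be passed to `plt.fill_between` in place of `lower_vet` and `upper_vet`; the rest of the plotting code stays the same.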

After running the code, the result will show up like this:

So, this is how to create a linear regression and calculate a confidence interval for it. The .csv data file and the full code can be found here.

If you like this story and would like to see more content like this in the future, please follow me!

Thanks for your time, folks!