Linear Regression in Python from scratch with scipy, statsmodels, sklearn

Andrea Castiglioni
Published in Analytics Vidhya
Aug 3, 2020 · 5 min read

In this article we will implement the code needed for a linear regression with numpy, and then compare the results with those from scipy, statsmodels and sklearn.

Data points, linear best fit regression line, interval lines.

1. Import libraries

As always, we start by importing our libraries: the bare minimum to plot and to store data in a dataframe.
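A minimal sketch of those imports (np, pd, plt and sns are the aliases assumed in the rest of the article):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns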

2. Generate Data

We can either generate data or import it, but generating data has the advantage that we know what the relationship is between our variables. In this example we will be generating an x array of 50 points, linearly spaced between 0 and 20.
Then we build a matrix X whose columns are different functions of x, and define a vector of weights, beta. Finally we calculate y_true by taking the dot product of X with beta, and add Gaussian noise to obtain y. Note that because of the random.normal() call, the results may change from run to run.

nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
# columns: x, sin(x), (x-5)^2 and a constant for the intercept
X = np.column_stack((x, np.sin(x), (x-5)**2, np.ones(nsample)))
beta = [0.5, 0.5, -0.02, 5.]
y_true = np.dot(X, beta)
# observed y: the true signal plus Gaussian noise with scale sig
y = y_true + sig * np.random.normal(size=nsample)

Here is X:
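The printout of X is not reproduced here; a quick way to inspect it yourself, with purely illustrative column names, is:

# illustrative only: wrap the design matrix in a dataframe to look at it
print(pd.DataFrame(X, columns=['x', 'sin(x)', '(x-5)^2', 'const']).head())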

3. Plot the data

With seaborn we can plot the regression and the residuals. Seaborn is a Python data visualization library based on matplotlib.

Code and graphs of regplot and residplot from the library seaborn.
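The original code is shown as an image; a minimal sketch of those two plots, assuming the arrays defined above, could be:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# scatter of the data with seaborn's own linear fit and confidence band
sns.regplot(x=x, y=y, ax=axes[0])
# residuals of that simple linear fit against x
sns.residplot(x=x, y=y, ax=axes[1])
plt.show()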

However, we can’t just let seaborn do all the dirty work for us: we would also like the slope, the intercept and the other statistical measurements the library computed under the hood to draw those nice-looking graphs.
We can implement our own function and then… check whether something already exists out there!

4. Python/Pandas/Numpy

Following the theory of simple linear regression, we can implement our own function. We explicitly calculate all the needed quantities in a pandas dataframe.

df = pd.DataFrame()
df['x']=x
df['y']=y
df['xy'] = df.x * df.y
df['x2'] = df.x**2
df['y2'] = df.y**2
df['x-xavg**2'] = (df.x - df.x.mean())**2
df['y-yavg**2'] = (df.y - df.y.mean())**2

We then define a function that does the job: it must accept the two vectors as inputs and return some interesting parameters. The parameters are:

a: intercept
b: slope
corr_coeff: correlation coefficient (Pearson)
sb: standard error of the slope

def lin_regr_func(x, y):
    n = len(x)
    # denominator
    d = n * (x**2).sum() - x.sum()**2
    # intercept
    a = (y.sum() * (x**2).sum() - x.sum() * (x * y).sum()) / d
    # slope
    b = (n * (x * y).sum() - x.sum() * y.sum()) / d

    y_pred = x * b + a
    yyi = (y - y_pred)**2
    xxi = (x.mean() - x)**2
    # standard error of the slope
    sb = np.sqrt(yyi.sum() / (n - 2)) / np.sqrt(xxi.sum())

    # correlation coefficient (Pearson)
    corr_coeff = (n * (x * y).sum() - x.sum() * y.sum()) / (
        np.sqrt(n * (x**2).sum() - x.sum()**2) * np.sqrt(n * (y**2).sum() - y.sum()**2)
    )
    return a, b, corr_coeff, sb

We immediately put the function to good use and calculate the predicted y along with the errors.

Prediction of y from our function and head of the resulting dataframe.
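A sketch of that step, reusing the function and dataframe defined above (the new column names are illustrative):

a, b, corr_coeff, sb = lin_regr_func(df.x, df.y)
# predicted y from the fitted line and the squared residuals
df['y_pred'] = a + b * df.x
df['sq_error'] = (df.y - df.y_pred)**2
print(df.head())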

Now, using scatterplot and lineplot from seaborn, we can plot our data and our linear regression. It looks like we get the same results!
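A minimal version of that comparison plot, assuming the columns created above:

# raw data as points, our fitted line on top
sns.scatterplot(x=df.x, y=df.y, label='data')
sns.lineplot(x=df.x, y=df.y_pred, color='red', label='our fit')
plt.legend()
plt.show()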

The resulting coefficients are the intercept, the slope, the Pearson correlation coefficient and the standard error of the slope returned by our function.

5. Scipy

We can also do the same job with a handy library for mathematical and scientific computing: scipy! What we are looking for is a function for linear regression, and we can find it in the stats module.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
Linear regression results using scipy.stats.linregress function
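A minimal call, following the linked documentation:

from scipy import stats

# linregress returns the slope, intercept, correlation coefficient,
# p-value and standard error of the slope, matching our own function
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(slope, intercept, r_value, p_value, std_err)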

The results are the same as those we calculated on our own, and we also get the p-value. We will not discuss it here, but the lower it is, the stronger the evidence that the slope is different from zero.

6. Statsmodels

Finally we reached the Statsmodels section!

importing statsmodels library
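The import shown in the screenshot is the standard one:

import statsmodels.api as sm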

To use this library we basically just need to add a constant column to our x so that the intercept is fitted as well. We then call OLS (Ordinary Least Squares) and pass in our variables: note that, unlike scipy, here we need to pass first the y and then the X.

X = sm.add_constant(x)
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

The summary of statsmodels is very comprehensive.

Here the eye falls immediately on R-squared, to check whether we have a good or bad fit. Then we see the coefficients for the intercept (const) and for x1 (the slope), which are basically the same as those calculated previously. We also get the standard error for both coefficients, whereas earlier we calculated it only for the slope.

If we want to draw a really nice plot with interval bands, we can import one more helper.

library for interval value lines
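The screenshot is not reproduced here; a common choice for these interval lines, and likely the one used in the article, is wls_prediction_std:

from statsmodels.sandbox.regression.predstd import wls_prediction_std

# standard deviation of the prediction and lower/upper interval bounds
prstd, iv_l, iv_u = wls_prediction_std(res)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label='data')
ax.plot(x, y_true, 'b-', label='true')
ax.plot(x, res.fittedvalues, 'r--', label='OLS fit')
ax.plot(x, iv_u, 'g--', label='upper interval')
ax.plot(x, iv_l, 'g--', label='lower interval')
ax.legend(loc='best')
plt.show()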

Thanks to the interval values we can now plot our data with two nice-looking bands, together with the fitted values and the true formula (blue line).

From statsmodels we can also import two functions to calculate the mean squared error (MSE) and the root mean squared error (RMSE):
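The two helpers referenced are presumably those in statsmodels.tools.eval_measures (the import itself is only shown as an image):

from statsmodels.tools.eval_measures import mse, rmse

print(mse(y, res.fittedvalues))   # mean squared error
print(rmse(y, res.fittedvalues))  # root mean squared error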

We can also calculate the MSE on our own, since we know its definition.
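For example, straight from the definition (the average of the squared residuals):

mse_manual = np.mean((y - res.fittedvalues)**2)
print(mse_manual)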

7. Sklearn

With its handy fit and predict methods, sklearn is the perfect tool to simplify everything.

With a simple method call we can fit our data and immediately obtain the R² score.

As for the coefficients, we just need to access coef_ (and intercept_) on the fitted object. This is standard sklearn syntax.
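A minimal sketch with scikit-learn; note that it expects a 2-D feature array, hence the reshape:

from sklearn.linear_model import LinearRegression

X_sk = x.reshape(-1, 1)          # sklearn wants features as a 2-D array
model = LinearRegression().fit(X_sk, y)

print(model.score(X_sk, y))      # R² score of the fit
print(model.coef_)               # slope
print(model.intercept_)          # intercept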
