Jupyter: Create Auto Interpretation for Multiple Regression Analysis

Adrian Hartanto
Bina Nusantara IT Division
4 min read · Dec 29, 2021
Photo by Carlos Muza from Unsplash

I use RStudio a lot for statistical analysis: finding, processing, and presenting data. In this article, I will explain how to present data with Python in Jupyter, which can make presenting your results easier.

What is Jupyter

Jupyter Logo from WIKIMEDIA COMMONS

Jupyter is a project that provides open-source software, open standards, and services for interactive computing across dozens of programming languages. It includes JupyterLab, where you can work in your preferred programming language, and Jupyter Notebook, which lets you share your code easily and interactively. Installation instructions and documentation are available at https://jupyter.org.

Jupyter has many advantages, but in this article I will only show how to use Jupyter with the Python programming language to create an interpreted multiple regression analysis in a Jupyter notebook.

Definition of Multiple Regression

Multiple regression is a statistical technique for analyzing the relationship between several variables. It builds a model from one dependent variable and two or more independent variables. The purpose of multiple regression is to assess how significant the correlation of each variable is and to create a model from the data.
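As a sketch of what such a model looks like for the two independent variables used later in this article, the relationship can be written as follows. The Greek symbols are conventional notation and do not appear in the original notebook.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon
```

Here Y is the dependent variable, X_1 and X_2 are the independent variables, \beta_0 is the intercept, \beta_1 and \beta_2 are the coefficients, and \varepsilon is the part of Y that the model does not explain.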

From the resulting model, you can judge how much to trust it via a metric such as R-squared. You can also predict the dependent value from the model (provided the metric indicates that the model is a good predictor).

Analysis in Jupyter Python

For the analysis, I'm using dummy data about a group of students. The variables are the age, height (named tall in the data), and shoe size of each student.

To create the data frame, you need to import pandas. Here is an example:

import pandas as pd

data = {
"age": [ 21,18,20,19,19,20,21,22,18,22 ],
"tall": [ 174,168,170,171,175,180,173,170,165,184 ],
"shoeSize": [ 44,41,42,42,43.5,45,43,44,41,46 ],
}
df = pd.DataFrame(data)
print(df)
[Image: the printed data frame as shown in Jupyter]

I chose the tall variable as my dependent variable. The independent variables are the age and shoe size of the students.

To define the dependent and independent variables in Jupyter Python:

y = df['tall'].values
print(y)
x = df.drop(['tall'],axis = 1).values
print(x)

Using the Pearson correlation method in Jupyter Python:

You need the scipy library to use the Pearson method in Python.

from scipy import stats
corr = stats.pearsonr(df['tall'], df['age'])
corr2 = stats.pearsonr(df['tall'], df['shoeSize'])
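Each call to pearsonr returns two numbers: the correlation coefficient and the two-sided p-value. A minimal sketch of unpacking them is shown here with plain lists instead of the data frame. The names r_age, p_age, r_shoe, and p_shoe are illustrative and not from the original notebook.

```python
from scipy import stats

# Same dummy data as in the article, as plain lists.
tall = [174, 168, 170, 171, 175, 180, 173, 170, 165, 184]
age = [21, 18, 20, 19, 19, 20, 21, 22, 18, 22]
shoe_size = [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46]

# pearsonr returns the coefficient and the two-sided p-value, in that order.
r_age, p_age = stats.pearsonr(tall, age)
r_shoe, p_shoe = stats.pearsonr(tall, shoe_size)

print("tall vs age:", r_age, p_age)
print("tall vs shoeSize:", r_shoe, p_shoe)
```

Unpacking the result this way makes it easy to pass the coefficient and the p-value to whatever code writes the interpretation text.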

Multiple regression modeling in Python:

You need the numpy and sklearn.linear_model libraries to do the modeling in Python.

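import numpy as np
from sklearn.linear_model import LinearRegression

# Fit tall (y) on age and shoeSize (x)
model = LinearRegression().fit(x, y)
# One coefficient per independent variable, in column order: age, shoeSize
coefIndep = model.coef_
# The intercept: the predicted value when every independent variable is 0
coefDep = model.intercept_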
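Once the model is fitted, it can also estimate the height of a student who is not in the data. The sketch below rebuilds the same arrays so it runs on its own. The new student (age 20, shoe size 43) is a made-up example, and new_student and predicted_tall are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: age, shoeSize (same dummy data as above).
x = np.array([[21, 44], [18, 41], [20, 42], [19, 42], [19, 43.5],
              [20, 45], [21, 43], [22, 44], [18, 41], [22, 46]])
y = np.array([174, 168, 170, 171, 175, 180, 173, 170, 165, 184])

model = LinearRegression().fit(x, y)

# Hypothetical new student: 20 years old, shoe size 43.
new_student = np.array([[20, 43]])
predicted_tall = model.predict(new_student)[0]
print(predicted_tall)
```

The prediction is the intercept plus each coefficient multiplied by the matching input, which for this student comes to about 172.4.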

R-Squared in Python:

You need the sklearn.metrics library.

from sklearn.metrics import r2_score
r2_score(y, model.predict(x))
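To see what this score measures, here is a small pure-Python sketch of the same formula: one minus the residual sum of squares divided by the total sum of squares. The function name r_squared is my own and is not part of sklearn.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared spread of the actual values around their own mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

Perfect predictions give 1.0, and a model that only ever predicts the mean gives 0.0.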

Interpret The Analysis

The plot of each independent variable against the dependent variable:
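The original plotting code is not shown in this article, so the following is only one possible way to draw such plots. It assumes matplotlib is installed, and the helper name plot_against_target is hypothetical.

```python
import matplotlib.pyplot as plt

# Same dummy data as in the article.
data = {
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
}

def plot_against_target(data, target, predictors):
    """Draw one scatter plot per predictor, with the target on the y-axis."""
    fig, axes = plt.subplots(1, len(predictors),
                             figsize=(5 * len(predictors), 4), squeeze=False)
    for ax, name in zip(axes[0], predictors):
        ax.scatter(data[name], data[target])
        ax.set_xlabel(name)
        ax.set_ylabel(target)
        ax.set_title(f"{target} vs {name}")
    fig.tight_layout()
    return fig

fig = plot_against_target(data, "tall", ["age", "shoeSize"])
```

In a notebook, the figure is displayed below the cell. Placing both plots side by side makes it easy to compare how strongly each variable follows tall.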

Correlation Value Between Tall and Age:

The Correlation Coefficient is 0.5685523267490409

There is no significant correlation between the two variables because p-value = 0.08634960320803789

Correlation Value Between Tall and Shoe Size:

The Correlation Coefficient is 0.9101661204768641

There is a significant correlation between the two variables because p-value = 0.0002553496093370704
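The sentences above can be generated automatically from the coefficient and the p-value. The following is a minimal sketch and not the original notebook code. It assumes a significance level of 0.05, which the article does not state, and the function name interpret_correlation is hypothetical.

```python
def interpret_correlation(name_a, name_b, r, p_value, alpha=0.05):
    """Build the interpretation text for one Pearson correlation result."""
    # alpha = 0.05 is an assumed threshold, not one given in the article.
    if p_value < alpha:
        verdict = "There is a significant correlation"
    else:
        verdict = "There is no significant correlation"
    return (
        f"Correlation Value Between {name_a} and {name_b}:\n"
        f"The Correlation Coefficient is {r}\n"
        f"{verdict} between the two variables because p-value = {p_value}"
    )

print(interpret_correlation("Tall", "Age",
                            0.5685523267490409, 0.08634960320803789))
print(interpret_correlation("Tall", "Shoe Size",
                            0.9101661204768641, 0.0002553496093370704))
```

Keeping the threshold as a parameter means the wording follows whatever significance level you choose.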

The model is:

Y =  26.013513513513487 + ( -1.3972972972972975 ) X1 + ( 4.054054054054054 ) X2
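A line like this can be assembled from the intercept and the coefficients of the fitted model. This is a sketch under the assumption that the coefficients come in the same order as the columns of x. The helper name format_model is hypothetical.

```python
def format_model(intercept, coefficients):
    """Return the fitted equation as text: 'Y = b0 + ( b1 ) X1 + ( b2 ) X2 ...'."""
    terms = [f"( {coef} ) X{i}" for i, coef in enumerate(coefficients, start=1)]
    return " + ".join([f"Y = {intercept}"] + terms)

# Values copied from the model shown above.
print(format_model(26.013513513513487,
                   [-1.3972972972972975, 4.054054054054054]))
```

With a fitted model, the call would be format_model(model.intercept_, model.coef_), where X1 is age and X2 is shoe size.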

R-Squared:

Model accuracy to describe the data is 88.13929313929319 %
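The percentage is the R-squared value multiplied by 100. It is the share of the variation in tall that the model explains. Below is a small sketch of turning the score into this sentence. The function name interpret_r_squared is hypothetical.

```python
def interpret_r_squared(r2):
    """Express an R-squared value (typically between 0 and 1) as a sentence."""
    return f"Model accuracy to describe the data is {r2 * 100} %"

print(interpret_r_squared(0.5))
```

In the notebook, the argument would be the result of r2_score(y, model.predict(x)).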

Table Actual Value & Predicted Value:
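One way to build such a table is to place the actual values next to the model's predictions in a new data frame. This is a sketch and not the original notebook code. The column names and the variable table are my own choices.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
})
y = df["tall"].values
x = df.drop(["tall"], axis=1).values

model = LinearRegression().fit(x, y)

# One row per student: observed height next to the model's estimate.
table = pd.DataFrame({
    "Actual Value": y,
    "Predicted Value": model.predict(x),
})
print(table)
```

A further column with the difference between the two values would show at a glance where the model is furthest off.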

Summary

To recap, in this article we have learned about:

  • What is Jupyter
  • Definition of Multiple Regression
  • Analysis in Jupyter Python
  • Interpret The Analysis

To download the Jupyter code and interpretation, you can click here.
