Jupyter: Create Auto Interpretation for Multiple Regression Analysis
I use RStudio a lot for statistical analysis: finding, processing, and presenting data. In this article, I will explain how to present data in a Jupyter notebook using Python, which can make presenting your results easier.
What is Jupyter
Project Jupyter provides open-source software, open standards, and services for interactive computing across dozens of programming languages. It includes JupyterLab, a web-based environment where you can work in your preferred programming language, and Jupyter Notebook, which lets you share your code easily and interactively. For installation instructions and documentation, see https://jupyter.org.
There are many advantages to using Jupyter, but in this article I will only show how to use it with the Python programming language to produce an automatically interpreted multiple regression analysis in a Jupyter notebook.
Definition of Multiple Regression
Multiple regression is a statistical technique for analyzing the relationship between one dependent variable and two or more independent variables. Basically, the technique builds a model from the dependent and independent variables. Its purpose is to find out how strongly each independent variable is related to the dependent variable and to create a model from the data.
From the fitted model, you can judge how much to trust it using a metric such as R-squared. You can also predict the value of the dependent variable from the model (provided the metric indicates that the model is a good predictor).
Analysis in Jupyter Python
For the analysis, I'm using dummy data about a group of students. The variables are the age, height (stored as tall), and shoe size of each student.
To create the data frame, you need to import pandas. Here is an example:
import pandas as pd

data = {
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
}
df = pd.DataFrame(data)
print(df)
I chose tall as the dependent variable, and the age and shoe size of the students as the independent variables.
To define the dependent and independent variables in Jupyter Python:
y = df['tall'].values
print(y)
x = df.drop(['tall'], axis=1).values
print(x)
Using the Pearson correlation method in Jupyter Python:
You need the scipy library to use the Pearson method in Python. Each call returns the correlation coefficient together with its p-value.
from scipy import stats
corr = stats.pearsonr(df['tall'], df['age'])
corr2 = stats.pearsonr(df['tall'], df['shoeSize'])
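Since stats.pearsonr returns two values, it can help to unpack them before building any interpretation text. The following is a minimal sketch of that step; the names r_age, p_age, r_shoe, and p_shoe are chosen here for illustration and do not come from the original notebook.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
})

# pearsonr returns (correlation coefficient, two-sided p-value)
r_age, p_age = stats.pearsonr(df["tall"], df["age"])
r_shoe, p_shoe = stats.pearsonr(df["tall"], df["shoeSize"])

print(f"tall vs age:      r = {r_age:.4f}, p = {p_age:.4f}")
print(f"tall vs shoeSize: r = {r_shoe:.4f}, p = {p_shoe:.4f}")
```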
Multiple regression modeling in Python:
You need the numpy and sklearn.linear_model libraries to do the modeling in Python.
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
coefIndep = model.coef_  # slopes of the independent variables (age, shoeSize)
coefDep = model.intercept_  # intercept of the model
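To see what the fit is doing, the same coefficients can be recovered with ordinary least squares in plain numpy. This is only a cross-check sketch, not the method used in the article; the variable names (design, beta, intercept, b_age, b_shoe) are made up for this example.

```python
import numpy as np

age = np.array([21, 18, 20, 19, 19, 20, 21, 22, 18, 22], dtype=float)
shoe = np.array([44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46], dtype=float)
tall = np.array([174, 168, 170, 171, 175, 180, 173, 170, 165, 184], dtype=float)

# Design matrix: a column of ones for the intercept, then the two predictors
design = np.column_stack([np.ones_like(age), age, shoe])

# Least-squares solution of design @ beta = tall
beta, *_ = np.linalg.lstsq(design, tall, rcond=None)
intercept, b_age, b_shoe = beta

print("intercept:", intercept)
print("age coefficient:", b_age)
print("shoeSize coefficient:", b_shoe)
```

The three numbers should agree with model.intercept_ and model.coef_ up to floating-point rounding.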
R-Squared in Python:
You need the sklearn.metrics library.
from sklearn.metrics import r2_score
r2_score(y, model.predict(x))
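R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares the result with r2_score; the names ss_res, ss_tot, and r_squared_manual are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

age = [21, 18, 20, 19, 19, 20, 21, 22, 18, 22]
shoe = [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46]
x = np.column_stack([age, shoe])
y = np.array([174, 168, 170, 171, 175, 180, 173, 170, 165, 184], dtype=float)

model = LinearRegression().fit(x, y)
predicted = model.predict(x)

# Residual sum of squares: variation the model does not explain
ss_res = np.sum((y - predicted) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r_squared_manual = 1 - ss_res / ss_tot
r_squared_sklearn = r2_score(y, predicted)

print(r_squared_manual, r_squared_sklearn)
```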
Interpret The Analysis
Plot of each independent variable against the dependent variable:
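One way to draw such plots is sketched below. It assumes matplotlib, since the article does not say which plotting library was used, and the figure layout (two side-by-side scatter plots) is a choice made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
})

# One scatter plot per independent variable, sharing the tall axis
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, column in zip(axes, ["age", "shoeSize"]):
    ax.scatter(df[column], df["tall"])
    ax.set_xlabel(column)
    ax.set_title(f"tall vs {column}")
axes[0].set_ylabel("tall")
fig.tight_layout()
plt.show()
```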
Correlation Value Between Tall and Age:
The Correlation Coefficient is 0.5685523267490409
There is no significant correlation between the two variables because p-value = 0.08634960320803789
Correlation Value Between Tall and Shoe Size:
The Correlation Coefficient is 0.9101661204768641
There is a significant correlation between the two variables because p-value = 0.0002553496093370704
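Sentences like the ones above can be generated from the correlation results with a small helper. The sketch below is one possible way to do it, not the code from the original notebook: the function name interpret_correlation is hypothetical, and the significance level alpha = 0.05 is an assumption, since the article does not state the threshold it used.

```python
def interpret_correlation(name_a, name_b, r, p_value, alpha=0.05):
    """Return interpretation lines for one Pearson correlation result."""
    # alpha = 0.05 is an assumed significance level, not stated in the article
    if p_value < alpha:
        verdict = "There is a significant correlation"
    else:
        verdict = "There is no significant correlation"
    return (
        f"Correlation Value Between {name_a} and {name_b}:\n"
        f"The Correlation Coefficient is {r}\n"
        f"{verdict} between the two variables because p-value = {p_value}"
    )

# Coefficients and p-values as reported in this article
age_text = interpret_correlation(
    "Tall", "Age", 0.5685523267490409, 0.08634960320803789
)
shoe_text = interpret_correlation(
    "Tall", "Shoe Size", 0.9101661204768641, 0.0002553496093370704
)
print(age_text)
print(shoe_text)
```

In a notebook, the returned string could also be passed to IPython.display.Markdown so that it renders as formatted text rather than plain output.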
The model is (Y = tall, X1 = age, X2 = shoeSize):
Y = 26.013513513513487 + ( -1.3972972972972975 ) X1 + ( 4.054054054054054 ) X2
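The equation line above can be assembled from the fitted coefficients with an f-string, and the same coefficients can be used to predict a new value. This is a minimal sketch using the coefficients reported in the article; the function predict_tall and the example student (age 20, shoe size 43) are made up for illustration.

```python
# Coefficients as reported in this article
intercept = 26.013513513513487
coef_age = -1.3972972972972975
coef_shoe = 4.054054054054054

# Build the equation text from the numbers
equation = f"Y = {intercept} + ( {coef_age} ) X1 + ( {coef_shoe} ) X2"
print(equation)


def predict_tall(age, shoe_size):
    """Predict tall from age (X1) and shoe size (X2) with the fitted model."""
    return intercept + coef_age * age + coef_shoe * shoe_size


# Hypothetical student, not part of the dummy data
print(predict_tall(20, 43))
```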
R-Squared:
The model explains 88.13929313929319 % of the variance in the dependent variable
Table Actual Value & Predicted Value:
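A table like this can be built by putting the actual and predicted values side by side in a data frame. The sketch below assumes the coefficients reported in the article and rounds the predictions to two decimals for readability; the column names actual and predicted are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [21, 18, 20, 19, 19, 20, 21, 22, 18, 22],
    "tall": [174, 168, 170, 171, 175, 180, 173, 170, 165, 184],
    "shoeSize": [44, 41, 42, 42, 43.5, 45, 43, 44, 41, 46],
})

# Predictions from the model equation reported in this article
predicted = (
    26.013513513513487
    - 1.3972972972972975 * df["age"]
    + 4.054054054054054 * df["shoeSize"]
)

table = pd.DataFrame({
    "actual": df["tall"],
    "predicted": predicted.round(2),
})
print(table)
```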
Summary
To recap, in this article we have learned about:
- What is Jupyter
- Definition of Multiple Regression
- Analysis in Jupyter Python
- Interpret The Analysis
To download the Jupyter code and interpretation, you can click here.