Model Development for Data Analysis

Piyush Kumar

Published in

Analytics Vidhya

5 min read · Apr 19, 2020

“Let’s train, build and visualize our data”

What is Model Development
Simple Linear Regression
Model Evaluation using Visualization
Pipeline
Measures for In-Sample Evaluation
Prediction and Decision Making

1. What is Model Development?

Model development is the process of building a mathematical equation that we use to predict results. We relate one or more independent variables to a dependent variable.
In other words, we estimate the relationship between the variables and then use that relationship to predict values.
Including more relevant data generally gives a more accurate model.

2. Simple Linear Regression

Simple linear regression models the relationship between two variables, in which one independent variable is used to predict one dependent variable.

Independent (predictor) Variable → X
Dependent (target) Variable → Y
Linear relationship → Y = b_0 + b_1 X
where b_0 → intercept and b_1 → slope
If we assume that there is a linear relationship between these variables, then we can build a model to make predictions.

# Import LinearRegression from scikit-learn
from sklearn.linear_model import LinearRegression

# Create a linear regression object using the constructor
lm = LinearRegression()
lm

# Define the independent and dependent variables
X = df[["Active"]]
Y = df[["Confirmed"]]

# Fit the model, i.e. find the parameters b_0 and b_1
lm.fit(X, Y)

# Obtain the predictions (the output is an array)
yhat = lm.predict(X)
yhat[0:10]

# Obtain the intercept
lm.intercept_

# Obtain the slope
lm.coef_
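To make the fitting step less of a black box, here is a minimal sketch of the closed-form least-squares estimates for the intercept and slope, written in plain Python. The five data points are made up for illustration and are not taken from the COVID-19 dataset, and the helper name fit_simple_linear_regression is hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (b_0, b_1) minimising the squared error of y ≈ b_0 + b_1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b_1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    b_0 = y_mean - b_1 * x_mean
    return b_0, b_1


# Illustrative data only (not from the article's dataset)
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b_0, b_1 = fit_simple_linear_regression(x, y)
print(f"intercept b_0 = {b_0:.2f}, slope b_1 = {b_1:.2f}")
```

For these points the sketch gives roughly b_0 ≈ 0.22 and b_1 ≈ 1.96. An ordinary least-squares fit with scikit-learn on the same data should report matching values in lm.intercept_ and lm.coef_.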

3. Model Evaluation using Visualization

Regression plot: It combines a scatter plot with the fitted regression line. It gives a good estimate of:

Relationship between two Variables
Strength of Correlation
The direction of Relationship (either positive or negative)

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

width = 15
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="Active", y="Confirmed", data=df)
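The strength and direction that the regression plot shows visually can also be summarised numerically with the Pearson correlation coefficient. The sketch below uses NumPy on made-up arrays that stand in for the Active and Confirmed columns; the numbers are assumptions for illustration, not real data.

```python
import numpy as np

# Illustrative stand-ins for two numeric columns (not real data)
active = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
confirmed = np.array([14.0, 27.0, 41.0, 52.0, 68.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(active, confirmed)[0, 1]
print(f"Pearson r = {r:.3f}")

# Reversing one series flips the direction of the relationship
r_negative = np.corrcoef(active, confirmed[::-1])[0, 1]
print(f"Pearson r (reversed) = {r_negative:.3f}")
```

A value close to +1 or -1 indicates a strong linear relationship, and the sign gives its direction.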

Residual plot: It shows the error between the actual values and the predicted values. Each residual is obtained by subtracting the predicted value from the actual target value.

width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['Active'], y=df['Confirmed'])
plt.show()
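What a residual plot draws can also be computed directly. The following sketch fits a straight line to deliberately curved, synthetic data (y = x^2, chosen only for illustration) and prints the residuals, so the pattern that a residual plot would reveal is visible in the numbers.

```python
import numpy as np

# Illustrative data with a deliberately curved relationship: y = x^2
x = np.arange(0, 11, dtype=float)
y = x ** 2

# Fit a straight line; np.polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(x, y, 1)
y_pred = intercept + slope * x

# Residual = actual value minus predicted value
residuals = y - y_pred
print(np.round(residuals, 2))
```

Here the residuals are positive at both ends and negative in the middle. A systematic pattern like this suggests that a straight line is not the right model, whereas residuals scattered randomly around zero support a linear fit.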

Distribution plot: It compares the distribution of the predicted values with the distribution of the actual values. It is extremely useful when the model has more than one independent variable. Both the independent and dependent variables are continuous.

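# Compare the distribution of the actual target with the fitted values
# (distplot is deprecated in recent seaborn releases; kdeplot is the suggested replacement)
ax1 = sns.distplot(df["Confirmed"], hist=False, color="r", label="Actual Value")
sns.distplot(yhat, hist=False, color="b", label="Fitted Value", ax=ax1)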

We can see that the fitted values are close to the actual values since the two distributions overlap.

4. Pipeline

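Data pipelines simplify the steps of processing the data. Here the pipeline chains scaling, polynomial feature generation and linear regression into a single estimator.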

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pr = PolynomialFeatures(degree=2)
pr
# Note: Z also contains the target column "Confirmed", which leaks the target into the features
Z = df[["Active", "Confirmed"]]
Z_pr = pr.fit_transform(Z)
Z.shape

Z_pr.shape

Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe

pipe.fit(Z, Y)

ypipe=pipe.predict(Z)
ypipe[0:4]
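To show what the pipeline does under the hood, the sketch below fits the same three steps once through a Pipeline and once by hand, then checks that both give the same predictions. The data is synthetic (random features with a quadratic target) and the variable names such as X_demo and pipe_demo are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: two features and a target with a quadratic term (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] ** 2

# Pipeline: scale -> polynomial features -> linear regression
pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe_demo.fit(X_demo, y_demo)
pipe_pred = pipe_demo.predict(X_demo)

# The same three steps applied by hand
X_scaled = StandardScaler().fit_transform(X_demo)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)
manual_pred = LinearRegression().fit(X_poly, y_demo).predict(X_poly)

print(X_demo.shape, "->", X_poly.shape)  # 2 columns expand to 5 (a, b, a^2, ab, b^2)
print(np.allclose(pipe_pred, manual_pred))
```

The benefit of the pipeline is that one fit call and one predict call replace the chain of intermediate transforms, which makes it harder to forget a step when predicting on new data.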

5. Measures for In-Sample Evaluation

In-sample evaluation measures numerically how well the model fits the dataset it was trained on. Two common measures are R-squared and the mean squared error (MSE).

lm.fit(X, Y)
print('The R-square is: ', lm.score(X, Y))

We can say that about 97.09% of the variation in Confirmed cases is explained by this simple linear model.

Yhat=lm.predict(X)
print('The first four predicted values are: ', Yhat[0:4])

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(df['Confirmed'], Yhat)
print('The mean square error of Confirmed cases and predicted value is: ', mse)
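Both measures are simple to compute by hand, which helps in reading the numbers that scikit-learn reports. The sketch below uses a handful of made-up actual and predicted values (assumptions for illustration, not the article's data) and applies the standard definitions of MSE and R-squared.

```python
import numpy as np

# Illustrative actual values and the predictions of a fitted line (made-up numbers)
y_true = np.array([2.2, 4.1, 6.2, 7.9, 10.1])
y_pred = np.array([2.18, 4.14, 6.10, 8.06, 10.02])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"MSE = {mse:.4f}, R-squared = {r_squared:.4f}")
```

Note that MSE is expressed in squared units of the target, so a large value such as the one for Confirmed cases is not alarming on its own; it has to be read against the scale of the data.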

6. Prediction and Decision Making

Prediction

To determine the final best fit, we generate a sequence of values in a specified range and predict the target for each of them.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

new_input = np.arange(1, 100, 1).reshape(-1, 1)

lm.fit(X, Y)
lm

yhat=lm.predict(new_input)
yhat[0:5]

plt.plot(new_input, yhat)
plt.show()
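The reshape call is the step that most often trips people up, so here is a small self-contained sketch of it. The training data is synthetic and lies exactly on y = 3 + 2x (an assumption chosen so the predictions are easy to check), and the names X_train, y_train and model are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data lying exactly on y = 3 + 2x (not the article's dataset)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([5.0, 7.0, 9.0, 11.0])
model = LinearRegression().fit(X_train, y_train)

# np.arange gives a 1-D array; reshape(-1, 1) turns it into one column,
# which is the 2-D (n_samples, n_features) shape that predict() expects
new_input = np.arange(1, 100, 1).reshape(-1, 1)
predictions = model.predict(new_input)

print(new_input.shape, predictions.shape)
print(predictions[:3])
```

Since np.arange excludes the stop value, the sequence runs from 1 to 99, giving 99 predictions.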

Decision Making: Determining a Good Model Fit

Now that we have visualized the models, and generated the R-squared and MSE values for the fit, how do we determine a good model fit?

What is a good R-squared value?

When comparing models, the one with the higher R-squared value is a better fit for the data.

What is a good MSE?

When comparing models, the one with the smallest MSE value is a better fit for the data.

Let’s take a look at the values.

Simple Linear Regression: Using Active as a predictor variable of Confirmed.
R-squared: 0.9709046123701057
MSE: 19184480.15637814
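The two rules above only become useful when there is more than one candidate model. As a sketch of such a comparison, the code below fits a straight line and a second-degree polynomial to the same synthetic, mildly curved data and reports both metrics for each. The data, the random seed and the helper r2_and_mse are assumptions for illustration.

```python
import numpy as np


def r2_and_mse(y_true, y_pred):
    """Return (R-squared, MSE) for a set of predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot, ss_res / len(y_true)


# Synthetic, mildly curved data (illustration only)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 40)
y = 2.0 + 1.5 * x + 0.8 * x ** 2 + rng.normal(scale=2.0, size=x.size)

# Candidate 1: straight line; candidate 2: second-degree polynomial
linear_pred = np.polyval(np.polyfit(x, y, 1), x)
quadratic_pred = np.polyval(np.polyfit(x, y, 2), x)

for name, pred in [("linear", linear_pred), ("quadratic", quadratic_pred)]:
    r2, mse = r2_and_mse(y, pred)
    print(f"{name:9s}  R-squared = {r2:.4f}  MSE = {mse:.2f}")
```

Because these are in-sample measures, the more flexible model will not score worse on the data it was fitted to, so a small improvement alone is weak evidence that the extra complexity is justified.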

Conclusion

In this article, we learned how to train a model and visualize our data, and how to build simple pipelines for processing data. We calculated values that show how well our model fits the dataset. Finally, we performed prediction and decision making to see how to determine a good model fit.

Reference

You can find the code above on my GitHub:

kpiyush04/covid-19-Data-Analysis-beginner-level

All the best for your Journey.
Do leave a clap if this article was helpful.
If you have some doubts then the comment section is all yours. I’ll try my level best to answer your questions.
Thank you❤