Learning Python Regression Analysis — Part 2: Gearing up with simple linear regression

In part 1, we started with the basic setup. Regression analysis can be defined as the process of finding a best-fit model that also satisfies our assumptions. It is exploratory in nature: we try out different parameter combinations, and we follow a sequence of analysis steps.

We generally employ the following major steps in a regression analysis.

1. Start with a research hypothesis. Identify the response variables and the predictor variables, then collect and clean the data.

2. When we have more than one predictor variable, we may start by checking the correlation among these variables. When some pairs of variables are highly correlated, regression analysis may not be able to separate the effect of either variable; in such cases we should consider removing one variable from each highly correlated pair. This step does not apply to simple linear regression, which has only one predictor variable.

3. We should examine the trends between the response variable and the predictor variables, to check whether data transformations are needed and to decide which regression method to choose. We may also remove outlier points from the data at this stage.

4. Then we fit a regression model to this dataset and assess the adequacy of the fitted model.

5. We should handle any influential data points that may be adversely affecting the fit of the model. We can also use model validation methods to assess the goodness of fit. If we choose to modify some parameters or remove some observations, we should repeat the analysis by rebuilding the model.
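As a quick illustration of step 2, a correlation check between two candidate predictors might look like this. The irrigation and fertilizer arrays below are made up purely for illustration; they are not part of this chapter's dataset.

```python
import numpy as np

# Hypothetical predictor values measured on the same ten farms
# (invented for illustration only).
irrigation = np.array([1.0, 1.2, 2.1, 2.0, 2.5, 3.1, 2.9, 3.6, 4.1, 4.2])
fertilizer = np.array([0.9, 1.1, 2.0, 2.2, 2.4, 3.0, 3.1, 3.4, 4.0, 4.4])

# Pearson correlation between the two predictors.
r = np.corrcoef(irrigation, fertilizer)[0, 1]
print(r)  # close to 1, so we would consider dropping one of the two
```

A correlation this close to 1 suggests the two predictors carry nearly the same information, which is exactly the situation step 2 warns about.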

In the next sections we will see the usage and utility of each of these steps for different use cases.

Simple Linear Regression

We start by learning to use simple linear regression. We use simple linear regression to analyze how one variable depends on the level of another variable. Scenarios with one predictor variable and one response variable that have a roughly linear relationship are modeled with simple linear regression. For example, we may predict the price of a pizza from its size, the rise in groundwater level from the amount of rainfall, or the price of wine from its age.

Let us consider a common problem from the field of agriculture. We are given data about farm sizes and their respective wheat crop outputs for a specific season. We consider farm size as our predictor variable and yield as our response variable, and we will experiment with simple linear regression on this data, given in table 1 below.

```
+--------------------------+----------------------+
| Farm Sizes (in Hectares) | Crop Yield (in Tons) |
+--------------------------+----------------------+
| 1                        | 6.9                  |
| 1                        | 6.7                  |
| 2                        | 13.8                 |
| 2                        | 14.7                 |
| 2.3                      | 16.5                 |
| 3                        | 18.7                 |
| 3                        | 17.4                 |
| 3.5                      | 22                   |
| 4                        | 29.4                 |
| 4.3                      | 34.5                 |
+--------------------------+----------------------+
```

We can start by visualizing this data in Python using matplotlib's scatter plot.

```python
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> X = [1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3]
>>> Y = [6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5]
>>> plt.scatter(X, Y)
>>> plt.show()
```

The above code snippet generates a scatter plot of the sample data, as shown in figure 2, with crop yield on the Y axis and farm sizes on the X axis.

In this example we will try to establish a linear relationship between farm sizes and crop yield. Our relationship can be described in the form of an equation of a straight line.

y = a (x) + b

There could be multiple such lines, with different values of the parameters a and b, that could be described as fitting the data. We need a statistical criterion to find the best-fitting straight line.

One of the major methods to measure the best fit in simple linear regression is Ordinary Least Squares (OLS). It minimizes the sum of squared vertical distances between the observed data points and the model line. The fitted line is then used as a function to predict values for new observations. The OLS method is used heavily in various industrial data analysis applications. The details of Ordinary Least Squares and its implementation are provided in the next section.
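Before turning to statsmodels, here is a minimal sketch of what OLS computes for our farm data, using the well-known closed-form formulas for the slope and intercept of a simple linear regression:

```python
import numpy as np

X = np.array([1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3])
Y = np.array([6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5])

# Closed-form OLS estimates for the line y = a*x + b:
#   a = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b = y_mean - a * x_mean
x_mean, y_mean = X.mean(), Y.mean()
a = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - a * x_mean
print(a, b)  # roughly 7.4258 and -1.3214
```

These are exactly the values statsmodels will report below, so the library is doing this computation (plus much more diagnostics) for us.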

We can use the statsmodels package for our task of simple linear regression, as it provides several different options for linear regression, and performing OLS with it is quite easy.

```python
>>> import numpy as np
>>> import statsmodels.api as sm
>>> X = [1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3]
>>> Y = [6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5]
```

In the code snippet above, we loaded the farm areas as list X and the yields as list Y. We can use other methods of data loading and other data structures, which we will describe in detail in the next chapter. Also, in this chapter the examples come from one single Python console session, so some imports or variable declarations used in a section may not be repeated in later sections of this chapter.

By default, the OLS implementation in statsmodels does not include an intercept in the model unless we are using formulas. We need to explicitly add an intercept to the OLS model by adding a constant term.

```python
>>> import statsmodels.api as sm
>>> X_1 = sm.add_constant(X)
>>> print(X_1)
```
```
[[ 1.   1. ]
 [ 1.   1. ]
 [ 1.   2. ]
 [ 1.   2. ]
 [ 1.   2.3]
 [ 1.   3. ]
 [ 1.   3. ]
 [ 1.   3.5]
 [ 1.   4. ]
 [ 1.   4.3]]
```
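As an aside, add_constant here simply prepends a column of ones to the predictor array. A plain numpy equivalent (our own sketch, not part of statsmodels) would be:

```python
import numpy as np

X = [1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3]

# Build the same design matrix by hand: a column of ones,
# then the predictor values.
X_1 = np.column_stack([np.ones(len(X)), X])
print(X_1[:3])
```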

This way we have added a column of constant value 1 to the predictor array. Now our model takes the form

y = b(1) + a(x)

where a and b are called the parameters, or coefficients, of the model, and are estimated by the OLS method below.

```python
>>> model = sm.OLS(Y, X_1)
>>> results = model.fit()
>>> print(results.params)
```
```
[-1.32137039  7.42581241]
```

We have obtained the regression coefficients a = 7.42581241 and b = -1.32137039. We can also visualize the regression model by plotting the regression line, using the predicted Y values for our in-sample data.

```python
>>> Y_predicted = results.predict(X_1)
>>> plt.scatter(X, Y)
>>> plt.xlabel("Farm size in hectares")
>>> plt.ylabel("Crop yield in tons")
>>> plt.plot(X, Y_predicted, "r")
>>> plt.show()
```

In the previous example we added a constant term to model the intercept. We will now explore the statsmodels formula API, which uses a formula instead of an added constant term to define the intercept. In the formula API, the class name is ols instead of OLS, and the input parameters are a dataframe and a formula parameter of the form:

Response variable ~ Predictor variables combination
```python
>>> import statsmodels.formula.api as sm1
>>> import pandas as pd
>>> df1 = pd.DataFrame(X, columns=['X'])
>>> df1['Y'] = Y
>>> results_formula = sm1.ols(formula='Y ~ X', data=df1).fit()
>>> print(results_formula.params)
```
```
Intercept   -1.321370
X            7.425812
```

Since the formula API automatically adds an intercept, there may be cases where we do not wish to include an intercept in our model. In such cases, we can explicitly tell the method not to use the intercept, using the minus operator, as demonstrated in the code snippet below.

```python
>>> results_without_intercept = sm1.ols(formula='Y ~ X - 1', data=df1).fit()
>>> results_without_intercept.params
X    6.994877
```
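As a quick cross-check of our own, when the intercept is removed, OLS minimizes the sum of squared residuals sum((y - a*x)^2), and the slope reduces to sum(x*y) / sum(x*x):

```python
import numpy as np

X = np.array([1, 1, 2, 2, 2.3, 3, 3, 3.5, 4, 4.3])
Y = np.array([6.9, 6.7, 13.8, 14.7, 16.5, 18.7, 17.4, 22, 29.4, 34.5])

# No-intercept OLS slope: a = sum(x*y) / sum(x^2)
a_no_intercept = np.sum(X * Y) / np.sum(X ** 2)
print(a_no_intercept)  # roughly 6.9949, matching the formula-API result above
```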

Prediction

Tests of adequacy and fitness of models will be covered in the next chapter. For simplicity, let us assume for now that the model is adequate; we can then use the learned model to predict the Y variable for out-of-sample X values. Using our learned model for farms, we can predict the expected yield for new farm sizes.

If we are not using the formula API, prediction requires creating an array of new X values and then repeating the same step of adding a constant for the intercept.

```python
>>> new_X = [5, 5.5, 6, 7]
>>> new_X_1 = sm.add_constant(new_X)
>>> results.predict(new_X_1)
array([ 35.80769166,  39.52059787,  43.23350407,  50.65931648])
```

But if we are using the formula API, prediction requires creating a DataFrame for the new X values with exactly the same column names as the DataFrame used to fit the model.

```python
>>> df_new = pd.DataFrame([5, 5.5, 6, 7], columns=['X'])
>>> df_new
>>> results_formula.predict(df_new)
array([ 35.80769166,  39.52059787,  43.23350407,  50.65931648])
```

Table 2: predicted crop yields for new farm sizes.

```
+--------------------------+-----------------------+
| Out of sample farm sizes | Predicted crop yield  |
+--------------------------+-----------------------+
| 5                        | 35.80769166           |
| 5.5                      | 39.52059787           |
| 6                        | 43.23350407           |
| 7                        | 50.65931648           |
+--------------------------+-----------------------+
```
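These predictions are simply the fitted line evaluated at the new farm sizes; we can verify them by hand with the coefficients estimated earlier:

```python
# The predictions in table 2 are just a*x + b for each new farm size,
# using the coefficients estimated by OLS earlier in the chapter.
a, b = 7.42581241, -1.32137039
for size in [5, 5.5, 6, 7]:
    print(size, a * size + b)
```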

In part 3 we will look at the implementation of Ordinary Least Squares (OLS).
