Learning Python Regression Analysis, Part 1: Setup
Have you ever wondered about the relationship between rainfall and air pressure, between cold-drink sales and temperature, or between patient health and physical activity? To support any such hypothesized relationship with evidence, we need to estimate its parameters, or coefficients, using regression analysis. Regression is a supervised learning problem: in regression analysis we follow a set of steps to learn the mapping from the inputs to the output.
We can use regression analysis to project sales from historical data, to predict expected clicks on search results from web logs, or to estimate real estate prices from several factors. That’s not all; regression analysis can also help in various other tasks such as system monitoring and variable selection for data pruning.
One of the major goals of regression analysis is to investigate and establish the relationships between two or more variables. This serves two typical purposes: predicting the value of the dependent variable, or estimating the effect of other variables on the dependent variable. In any real-world problem, different factors affect the outcome of an event. Measurements of these factors are called independent variables or predictors, and measurements of the outcome are called the dependent variable or response variable.
Most modeling techniques make some assumptions about the data. Based on these assumptions, we try to fit a model to our dataset. To understand regression better, we can visualize the problem in 2, 3 or more dimensions, where our model is a geometric shape such as a line, a plane or a more general surface, and we try to place that shape so that it passes as close as possible to the data points.
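To make the geometric picture concrete, here is a minimal sketch of fitting a straight line to points in two dimensions. The data is synthetic and made up purely for illustration (roughly y = 2x + 1 with added noise), and `np.polyfit` is used here as one convenient least-squares fitter, not as the method the later parts are built on.

```python
import numpy as np

# Synthetic data, invented for illustration: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals measure the vertical distance of each point from the fitted line.
residuals = y - (slope * x + intercept)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"mean squared residual = {np.mean(residuals ** 2):.3f}")
```

The fitted slope and intercept should land close to the values used to generate the data, and the residuals show how far the points remain from the line.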
Let us now briefly discuss some important tools to get started with regression analysis in Python.
Introducing tools for Regression Analysis in Python
Regression analysis is often iterative: we gain insights from one run and tweak the parameters for the next. Python provides an interactive interpreter, which allows for easy step-wise processing, so it is well suited to our task. There are other benefits of using Python, e.g. easy syntax, quick prototyping and performance, but the most significant is the vast and continuously growing set of statistics and machine learning libraries available for it.
There are numerous data analysis, statistics and machine learning libraries in Python, each with its own advantages over the others. Many alternatives exist, but for the sake of consistency this series will focus on the following libraries. In the following sections we will quickly walk through these tools without going into the details of their installation and setup.
1. NumPy & SciPy
These are fundamental packages for scientific computing. They provide a convenient and fast array data structure along with fast operations for basic statistics, linear algebra, random simulation and many other mathematical and logical operations. SciPy contains more fully featured versions of the linear algebra and numerical modules. For scientific computing, we should install both NumPy and SciPy. Older SciPy releases also exposed many of NumPy’s functions in the scipy namespace, but it is safer to call NumPy functions through NumPy directly; note too that some method implementations differ between the two libraries.
Before starting with the regression code, we should verify that NumPy and SciPy are installed properly in the Python environment by running the following snippet in the Python console:
>>> import numpy as np
>>> import scipy as sp
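If both imports succeed, a quick sanity check is to run a few of the operations mentioned above. The following is a minimal sketch with made-up numbers: it computes basic statistics with NumPy, a correlation coefficient with `scipy.stats.pearsonr`, and solves a small linear system with `scipy.linalg.solve`.

```python
import numpy as np
import scipy
from scipy import linalg, stats

print("NumPy version:", np.__version__)
print("SciPy version:", scipy.__version__)

# Made-up sample data: y grows roughly linearly with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Basic statistics with NumPy.
print("mean of x:", x.mean())
print("sample standard deviation of y:", y.std(ddof=1))

# Pearson correlation coefficient and its p-value with SciPy.
r, p_value = stats.pearsonr(x, y)
print("correlation:", r)

# Solve the linear system A @ solution = b with SciPy's linear algebra module.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
solution = linalg.solve(A, b)
print("solution:", solution)
```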
2. Pandas
Pandas is a data manipulation and analysis library. It introduces R-like data structures, Series and DataFrame, both of which are built on top of NumPy. We will use Pandas mainly to generate descriptive statistics from the data and to inform some of the design decisions for our model. Let us play around with Pandas using a short and simple example.
>>> import pandas as pd
>>> import numpy as np
>>> a=pd.Series([1.1, 1.2, 1.3, 1.4, np.nan, 1.2, 1.3])
Data read from different sources may contain empty or corrupt values for various reasons. We may wish to fill these NaN values with the median of the remaining data.
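One way to do this, sketched here with the same Series as above, is to combine `Series.median()` (which skips NaN values by default) with `Series.fillna()`. The variable names `median_value` and `filled` are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# The same Series as above, with one missing value at position 4.
a = pd.Series([1.1, 1.2, 1.3, 1.4, np.nan, 1.2, 1.3])

# median() ignores NaN by default, so it is computed from the six valid values.
median_value = a.median()

# fillna() returns a new Series; the original `a` is left unchanged.
filled = a.fillna(median_value)

print("median of valid values:", median_value)
print(filled)

# describe() gives the descriptive statistics mentioned earlier.
print(filled.describe())
```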
3. Matplotlib
Matplotlib is a plotting library for Python, used to create plots, histograms and other 2D charts. We will use it to visualize our datasets and learned models. Let’s consider an example of drawing a scatter plot of 50 random points.
>>> import matplotlib.pyplot as plt
>>> import numpy as np
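The rest of the snippet is not shown above, so what follows is only one possible way to complete it, not necessarily the original code. It is a minimal sketch that draws 50 random points with `Axes.scatter`; the seed, axis labels and title are assumptions made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# 50 random points in the unit square; the seed is arbitrary and only
# makes the picture reproducible.
rng = np.random.default_rng(42)
x = rng.random(50)
y = rng.random(50)

# Draw the scatter plot on a fresh figure.
fig, ax = plt.subplots()
points = ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot of 50 random points")

# In an interactive session, call plt.show() to display the figure,
# or fig.savefig("scatter.png") to write it to a file.
```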
4. StatsModels and Scikit-Learn
StatsModels is a statistical computing and data analysis library that provides methods for descriptive statistics, statistical tests and plotting for different types of data. Similarly, the Scikit-Learn library provides a collection of machine learning algorithms. We should check that these libraries are properly installed in our environment.
>>> import sklearn
>>> import statsmodels.api as sm
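Rather than importing each library by hand, the whole toolset can be checked in one pass. The following is a minimal sketch, assuming the libraries are importable under the module names listed; the `versions` dictionary is just a helper for this illustration.

```python
import importlib

# Import names of the libraries used in this series.
libraries = ("numpy", "scipy", "pandas", "matplotlib", "sklearn", "statsmodels")

versions = {}
for name in libraries:
    try:
        module = importlib.import_module(name)
        # Most scientific packages expose a __version__ attribute.
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        # None marks a library that is not installed.
        versions[name] = None

for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:12s} {status}")
```

Any library reported as not installed needs to be added to the environment before continuing with the next parts.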
In upcoming posts:
In the next parts we will describe the task of regression analysis and the steps involved. In part 2 we will further discuss the use cases of linear regression. There are many tests for measuring the fit and adequacy of a model; these will be described in part 3, where we will also outline the general procedure for testing various hypotheses.
We will work through an example of simple linear regression, which establishes the relationship between a single predictor variable and a response variable. In part 3 we will then describe the method of Ordinary Least Squares (OLS) for linear regression models, covering the details and implementation of OLS by computing the variance and covariance terms. Then, in part 4, we will cover multiple linear regression, which can be seen as an extension of simple linear regression. The examples used will help readers extend the linear regression procedure to other real-world use cases as well.