Application and Interpretation with OLS Statsmodels

Buse Güngör · Published in Analytics Vidhya · Apr 19, 2021

This article first introduces the linear regression model in supervised learning, and then demonstrates its application in Python using OLS from the Statsmodels library.

As you know, machine learning is a form of AI in which systems learn from data and adapt their actions and responses, becoming more efficient, adaptable, and scalable; navigation apps and recommendation engines are familiar examples.
It is the intersection of statistics and computer science: a model is built by learning the patterns of historical data, and the relationships within it, to make data-driven predictions. ML is classified into:

1. Supervised

2. Unsupervised

3. Reinforcement learning

In this article we will be talking about linear regression in supervised learning. In supervised learning, the model is trained on data whose outcomes are known; the model then predicts the outcomes for unlabeled data. For example, spam classification is a supervised learning task: the inputs are a large number of emails known to be spam or not, and given a new email, the algorithm produces a prediction as to whether it is spam. There are two major types of supervised machine learning problems, called classification and regression.

Regression: forecasting a given numerical quantity. Predicting a person’s weight or how much snow we will get this year is a regression problem, where we forecast the future value of a numerical quantity in terms of previous values and other relevant features.

Linear regression

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features.

In other words, its main purpose is to find the linear function expressing the relationship between the dependent variable (y) and one or more independent variables (x). In the equation below, Xp represents the pth predictor and βp quantifies the association between that variable and the response. We interpret βp as the average effect on Y of a one-unit increase in Xp, holding all other predictors fixed. The βp values are also called the learned coefficients.
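The multiple linear regression equation referred to here, reconstructed from the definitions above (the article's original figure is not reproduced):

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon
```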

Understanding the regression with math
x: independent variable (new data), y: dependent variable

There is no straight line that runs through all the data points, so the objective is to fit the straight line that minimizes the error between the expected and actual values. Linear regression has two main purposes: first, estimating the values of the dependent variable from the variables determined to affect it; and second, determining which of the independent variables actually affect the dependent variable, and in what way. The loss function for regression is proportional to the square of the distance between the prediction and the true value. When training a model, we are not only concerned with minimizing the loss on a single sample; we care about minimizing the loss over our entire data set.

Squared loss, a popular loss function, is the square of the difference between the label and the prediction: (observation − prediction(x))² = (y − y′)².
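A minimal sketch of this loss in Python (the array values are made up for illustration):

```python
import numpy as np

def squared_loss(y, y_pred):
    # Per-sample squared loss: (observation - prediction)^2
    return (y - y_pred) ** 2

# We care about the loss over the whole data set, not one sample,
# so we average: this is the mean squared error (MSE).
y = np.array([3.0, 5.0, 7.0])       # observed labels (illustrative)
y_pred = np.array([2.5, 5.5, 6.0])  # model predictions (illustrative)
print(squared_loss(y, y_pred).mean())
```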

So, where does OLS from Statsmodels fit into the linear regression model?

OLS (Ordinary Least Squares) is an estimator available in statsmodels that helps us identify the features with a significant influence on the output. In OLS, the values of β0 and βp (from the equation above) are chosen so as to minimize the sum of the squares of the differences between the observed and predicted values of the dependent variable. That is why it is named ordinary least squares.

How is OLS from statsmodels applied?

First we need to know the data set we will use. It comes from the book An Introduction to Statistical Learning with Applications in R and reflects advertising expenditures: spending on TV, radio, and newspaper advertising, together with the resulting sales. First we define the variables x and y. In the example below, the variables are read from a CSV file using pandas (the usecols parameter can be used to avoid taking the index column as a variable).

Import the dataset with pandas
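A minimal sketch of this step, assuming the file is saved as Advertising.csv with the row index stored in its first column (which usecols skips):

```python
import pandas as pd

# Read the Advertising data; usecols keeps only the four data
# columns, skipping the index column at position 0.
df = pd.read_csv("Advertising.csv", usecols=[1, 2, 3, 4])
print(df.head())
```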

df.info() shows the data set’s structure: this data set has 200 observations, all variables are continuous, and there are no missing values.

Dataset’s structure
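The call itself, continuing with the df loaded above, is simply:

```python
# Prints the number of observations, each column's dtype,
# and the count of non-null values per column.
df.info()
```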

Its descriptive statistics can be examined with df.describe().T

The mean of the TV variable is about 147, while its minimum value is 0.7; the fairly large standard deviation appears to arise from that low minimum. The median does not stray far from the mean. Moreover, a standard deviation this large is not a problem in itself; it simply shows that the variable’s distribution is more heterogeneous. Examining the other independent variables, radio and newspaper, and the dependent variable sales in the same way (judging skewness and kurtosis from the median, mean, standard deviation, and the differences between quartiles), the distributions look fairly even, with no pronounced skewness or kurtosis.

Descriptive statistics
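Continuing with the same df:

```python
# Transposed so each variable is a row, showing count, mean, std,
# min, quartiles, and max.
print(df.describe().T)
```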

df.corr() shows the correlations between the variables.

If we examine the correlations between the variables, the strong positive correlation between the TV variable and the Sales variable tells us that as TV advertising increases, Sales increase as well. There is a moderate positive correlation between the radio variable and the Sales variable.

The correlation between the variables
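Again with the df from above:

```python
# Pairwise Pearson correlations between all numeric columns.
print(df.corr())
```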

We can also observe these correlation relationships with the pairplot in the “Seaborn” library.

Pairplot for variables and correlation between the variables

When we examine the distributions of the variables, we notice the skew in the distribution of newspaper (most values are small, with a long tail of larger ones). If we examine TV against Sales, we observe a strong positive linear relationship, where the slope indicates the strength of the relationship.
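A sketch of the plot call (seaborn and matplotlib assumed installed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plots for every pair of variables, with each variable's
# distribution on the diagonal.
sns.pairplot(df)
plt.show()
```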

The general structure is now understood through exploratory data analysis. Next we will build our model with the Statsmodels library. We import both the statsmodels and sklearn libraries, for the OLS and split operations respectively.

Import of required libraries
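These imports might look like the following (the sm alias is a common convention, assumed here):

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
```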

We separate the data set into X, the independent variables, and the dependent variable y, namely Sales, using df.drop(). We then divide the data set at a certain ratio to train and test the model; here the test set is 20% of the entire data set.

We must set a random_state value so that the split does not produce different results each time the model is run.

Model building
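A sketch of the split and fit, continuing with the df and imports above (the Sales column name and the random_state value are assumptions):

```python
# Separate predictors and target.
X = df.drop("Sales", axis=1)
y = df["Sales"]

# Hold out 20% of the data for testing; a fixed random_state keeps
# the split reproducible (the value used here is arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Add a constant column for the intercept (the summary below reports
# a "const" coefficient) and fit OLS on the training data.
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())
```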

The model is fit with the dependent variable y_train and the independent variables X_train. After the model is fit, we can inspect its output with the summary() function. With a model set up in sklearn we cannot obtain detailed statistical information; after setting up the model with the OLS function, however, we can see and interpret the significance of the model, the coefficients, the p-values and t-values, the confidence intervals, and more.

OLS model results

To interpret these results: the R-squared value, one of the most important values in the output, measures how well the independent variables explain the variability in the dependent variable. That is, it reflects the combined effect of the independent variables TV, radio, and newspaper on the dependent variable Sales.

R-squared tends to increase with every variable added to the model, whether or not that variable is useful. Adjusted R-squared penalizes this sensitivity and therefore gives us a more accurate value; the difference between the two grows as unhelpful variables are added.

The F-statistic tells us the significance of the model as a whole once it is established, and Prob (F-statistic) is its p-value. The model appears to be significant because this p-value is less than 0.05.

The coef values show us the βp values, and const is the constant coefficient β0, 2.9791. TV is 0.0447, radio is 0.1892, and newspaper is 0.0028. Looking at the P>|t| value for each coefficient, TV and radio are significant because their p-values are less than 0.05. The newspaper coefficient, however, has a P>|t| value greater than 0.05, so it is not significant and could be left out of the model. [0.025, 0.975] is the confidence interval: we can say that the significant coefficients found by the model hold at the 95% confidence level.

Note: We cannot examine the prediction success of the model here.

Regression function with OLS statsmodels

As you can see, we can simply write a regression function from the fitted model. If we have particular expenditures in mind, we can plug their values into the function and see the predicted value of the Sales dependent variable.
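A sketch of such a function, hard-coding the coefficients reported in the summary above purely for illustration (in practice, model.predict would be used on new data):

```python
def predict_sales(tv, radio, newspaper):
    """Predicted Sales from the fitted OLS coefficients above."""
    return 2.9791 + 0.0447 * tv + 0.1892 * radio + 0.0028 * newspaper

# Example: predicted sales for a given advertising budget
# (the budget values here are made up).
print(predict_sales(tv=100, radio=25, newspaper=20))
```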
