Linear Regression: Pythonic Approaches to Simple Linear Regression

Indrani Banerjee · Published in CodeX · Oct 29, 2022 · 7 min read

As I’ve mentioned before, I’m a big fan of DataCamp courses for learning, keeping up, and testing my knowledge. However, one thing that really started to overwhelm me was the variety of packages and approaches the different courses use, sometimes to achieve the same goals. So, as a continuation of my last post, I wanted to take this chance to summarise two packages I’ve started using for linear regression projects: Statsmodels and Scikit Learn. I’m going to use the USA Housing dataset to talk through how to use both; you can find the notebook here.

The Data

The sample dataset has 5000 entries and 7 columns: Average Area Income, Average Area House Age, Average Number of Rooms, Average Number of Bedrooms, Area Population, Price, and the Address. It’s available for public use, so definitely feel free to check it out and have a play with it yourself. I’m going to explore whether there is a linear relationship between the features and the price of the houses. So, how do we know which features to look at?

I find pair plots are a great way of quickly visualising the data before doing anything else because it gives me a quick idea of what obvious trends exist in the dataset. Check out this article on pair plots if you want to learn more.

I used Seaborn’s pair plot to visualise the data as it automatically recognises which features are numerical and creates a grid showing the pairwise relationship between the numerical features. Another interesting feature is that across the diagonal of the grid, it plots the univariate distribution (how the data for each feature look on their own).
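Something like the following sketch reproduces that step (the file name USA_Housing.csv and the variable names are assumptions; any local copy of the dataset works):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (file name assumed)
housing = pd.read_csv("USA_Housing.csv")

# pairplot() picks out the numerical columns automatically and puts each
# feature's univariate distribution along the diagonal of the grid
sns.pairplot(housing)
plt.show()
```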

From here we can see that there are some features that seem to show a somewhat positive correlation: Average Area Income and Price, Average House Age and Price, Average Area Number of Rooms and Price, Area population and Price.

Now, if you look at these four plots closely (the first, second, third and fourth plots of the last row), we can see that whilst they do have similarities in their trends, the data points for the Average Area Income and Price plot are much more tightly packed together compared to the other three. This is where the question of ‘how correlated’ the features are arises; we’ll explore this once we have a look at how to do the simple linear regression!

Cleaning the Data

Like many statistical tools, linear regression models in most Python packages struggle, or simply don’t work, if our data is not complete. I won’t go into a huge amount of detail here about data cleaning, but a few things to check are that you don’t have missing data, and that the columns you’d like to perform linear regression on are numerical and stored as numeric types. I used pandas to quickly check that the data was suitable for linear regression.

I’d recommend pandas’ .info() method; it’s a really easy way of checking data types. We can see that all but one of the columns are numerical (figure below), and, as expected, the Address column is not. I went ahead and dropped the Address column at this point.

Also, chaining the two methods .isnull() and .sum() together gives a quick summary of which columns have missing data.
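Roughly, those checks look like this (continuing from the sketch above; the Address column name is taken from the dataset):

```python
# Check column data types and non-null counts
housing.info()

# Drop the only non-numerical column before modelling
housing = housing.drop(columns=["Address"])

# Count missing values per column; all zeros means nothing to impute
print(housing.isnull().sum())
```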

Great! There’s no missing data so we can now move onto our regression analysis.

Splitting the Data

It’s common practice to split the data into two sets: training and test sets. The idea here is to build a model with the training set first. Imagine we put all the ordered datapoints (each average area income and its corresponding house price) into a bag. We randomly select 80% of them and plot a scatter plot, and now try to draw our line of best fit, a regression line, using only these points. This is basically called training the ‘model’. For linear regression, say for the house prices and the average area income, we expect to get an equation:

price = b₀ + b₁ × (average area income). Here the first term, b₀, is the y-intercept of the line and b₁ is the coefficient (the slope).

How is the line obtained? There are quite a few different ways, but the most common method, and the method I used, is least squares regression. Once we have our equation for our regression line, we take the remaining 20% of our dataset and plug the Average Area Income values (the x-values) into our equation, recording the house prices our regression line gives. Finally, we can compare the prices predicted by the regression line and the actual prices to gauge the quality of our linear regression model: the predictive power of our line!

The easiest way to split up the data into these two sets is to use train_test_split from Scikit Learn:
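A minimal sketch of that split (an 80/20 split with a fixed random_state so the shuffle is reproducible; the variable names are my own):

```python
from sklearn.model_selection import train_test_split

# Features (everything except the target) and target (house prices)
X = housing.drop(columns=["Price"])
y = housing["Price"]

# Hold back 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```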

Now, we Linear Regress!

As I mentioned earlier, I want to use this post to get an idea of when to use Statsmodels and when Scikit Learn would be more appropriate. As we are going to look at simple linear regression, which uses only one feature, I’ll just use the Average Area Income as the feature and look at how it influences house prices.

Scikit Learn

Using Scikit Learn, there are a couple of things to set up. I’ve gone ahead and taken just the Average Area Income column and set it as the feature we want to explore. We would have used the whole X_train set if we were looking at multiple linear regression, but for simplicity let’s stick to one feature for now.
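In code, that setup might look like this (the column name "Avg. Area Income" is an assumption; adjust it to whatever your copy of the dataset uses):

```python
# Keep only the single feature; selecting with double brackets keeps a
# DataFrame, which has the (n_samples, n_features) shape scikit-learn expects
X_train_income = X_train[["Avg. Area Income"]]
X_test_income = X_test[["Avg. Area Income"]]
```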

The model is pretty easy to fit and only a few lines of code give us the metrics of our model.
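A sketch of the fit and a couple of basic metrics (the choice of metrics here is mine; the notebook may report different ones):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Fit the simple linear regression on the training data
lm = LinearRegression()
lm.fit(X_train_income, y_train)

# Score the model on the held-out test set
y_pred = lm.predict(X_test_income)
print("R-squared:", r2_score(y_test, y_pred))
print("Mean absolute error:", mean_absolute_error(y_test, y_pred))
```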

The coefficient and the intercept for the linear model are pretty easy to obtain using Scikit Learn:
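Continuing the sketch above:

```python
print("Coefficient:", lm.coef_[0])   # slope: change in price per unit increase in income
print("Intercept:", lm.intercept_)   # predicted price when the feature is zero
```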

Now, let’s have a look at a regression plot where we can compare the actual house prices to the house prices predicted by our model. These are pretty easy to do; feel free to check out my notebook for the code.
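A rough version of that plot (the red reference line marks where predicted equals actual):

```python
import matplotlib.pyplot as plt

# Predicted vs. actual prices on the test set
plt.scatter(y_test, y_pred, alpha=0.5)

# Red reference line: points on it are perfect predictions
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="red")
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()
```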

We do see that near the centre most of the points are scattered quite close to the red line; this indicates that most of the predicted house prices were fairly close to the actual house prices in the dataset. We do see more of a deviation near the ends, so the model doesn’t work as well across the entire range of data we have.

Let’s have a look at the residual plot to see if a linear model is appropriate or not.
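One simple way to draw it (residual = actual minus predicted, plotted against the predictions):

```python
# Residuals scattered evenly around zero, with no obvious pattern,
# suggest that a linear model is a reasonable choice
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.show()
```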

We don’t see any obvious patterns here, which indicates that the linear model is an appropriate model for the dataset.

Statsmodels

Let’s see how we would do the same analysis as before using Statsmodels. I’m a big fan of Statsmodels because I find it much more convenient to use when you just want to learn about the dataset.

It’s important to note that there are subtle differences in the code. For Statsmodels’ Ordinary Least Squares (OLS) model we need to use the “dependent variable ~ independent variable” formula format. The .summary() method is great, as it takes only seconds to deliver a very detailed insight into the data. I want to spend some time discussing what all these outputs mean and their relevance, but I’ll leave that for another post. For now, I want to highlight the R-squared value, the coefficient, and the intercept of the linear regression model. If we want to now have a look at residual plots, Statsmodels’ graphics module makes this incredibly convenient.
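A sketch of the same analysis using the formula interface (the renamed columns income and price are my own shorthand to keep the formula free of spaces; plot_regress_exog is one of several diagnostic plots the graphics module offers):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Rebuild a training DataFrame and rename the columns used in the formula
train = X_train.join(y_train).rename(
    columns={"Avg. Area Income": "income", "Price": "price"}
)

# Formula interface: "dependent variable ~ independent variable"
results = smf.ols("price ~ income", data=train).fit()
print(results.summary())  # R-squared, coefficient, intercept, p-values, and more

# Residual and fit diagnostics for the chosen regressor
fig = sm.graphics.plot_regress_exog(results, "income")
```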

There is a lot to unpack here in terms of statistical analysis, which is outside the scope of this article. I am hoping the real takeaway from this post is to see how Scikit Learn and Statsmodels have very clear but different focuses when it comes to statistical analysis: Scikit Learn is much more efficient and convenient to use for prediction, whilst Statsmodels is fantastic when trying to gain deeper insights into the data. Stay tuned as I focus on interpreting the various metrics these two packages throw at us when we look at interpreting linear regression next week!
