Linear Regression: Advanced Modeling Techniques & Pipeline [Pt.1]

Kaggle Competition: Predicting House Sale Prices

Vinicius Nala
11 min read · Feb 19, 2023

Linear Regression occupies a large part of Data Science, and learning how to build models based on linear regression techniques is essential. Although linear regression models are relatively simple, they can solve many problems with great accuracy.

It is used not only in Data Science but also in many other fields, from business and finance to biology and the natural sciences. In short, it can be used for a myriad of things.

In this series of articles, I will explain the concept of linear regression through the Kaggle competition “House Prices — Advanced Regression Techniques”, starting with the basic linear regression equation and moving toward more complex analysis, and showing, in the second part of this article, the most common regression models. Along the way, I will also explain ways to deal with categorical variables and show how to use a pipeline to preprocess them.

To make it easier to follow along, the notebook is available on my GitHub and Kaggle.

Source Code

Notebook on Kaggle:

Notebook on GitHub:

Classification Models vs. Regression Models

First of all, I would like to distinguish Classification Models from Regression Models. They have a key difference: classification models predict discrete classes, while regression models predict continuous quantities.

For instance, in my last article, I built a classification model to predict whether a person would survive the Titanic disaster. In that case, we try to predict one of two values, 0 or 1 (Not Survived or Survived), so it's a classification algorithm. In this article, I will build a model that predicts the sale price of a house from the data provided by Kaggle. That makes it a regression algorithm, which predicts a quantity; here it's the sale price of a house, but it could just as well be a salary, an age, etc.

Linear Regression Equation

The Linear Regression Equation describes the linear relationship between the independent variables and the dependent variable.
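In its general form, for n independent variables, the equation can be written as:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

where y is the dependent variable, x₁, …, xₙ are the independent variables, β₀ is the intercept, β₁, …, βₙ are the coefficients estimated from the data, and ε is the error term.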

Least squares regression is what produces the linear regression equation. The procedure is a bit involved, and there are plenty of videos and articles explaining how it is calculated. Today, we can simply use software to fit the data to this equation; however, you wouldn't be a good analyst if you only did this without knowing what is happening behind the scenes.

For this series of articles, I will show the general process when we arrive at the part of building the model.

General Characteristics of the Dataset

Let's do it this way: we start coding and, during the analysis, whenever it seems convenient, we stop to explain the concepts needed to understand what we are doing.

First, we collect the data. The training data will be used for our analysis and to train the models; the test data will only be used at the end, to predict the target variable and submit the results to Kaggle.

Now, let’s catch a glimpse of the dataset:
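A minimal sketch of this step, assuming the competition's standard train.csv and test.csv files sit next to the notebook:

```python
import pandas as pd

# Load the competition data (file names assume the standard Kaggle download)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Quick overview: dimensions, column dtypes and non-null counts
print(train.shape)  # (1460, 81)
train.info()
```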

Looking at this overview, we can see that there are some missing values, 1460 rows, and 81 columns, which is a lot of columns. Excluding the Id column and the target variable, the “SalePrice” column, we have at least 79 columns available to predict the target. Throughout the article, we will learn how to identify the relevant variables, since many of these columns will turn out to be useless.

Target Variable Behavior

Let's first look at some central tendency measures and then at the distribution of the target variable.
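A sketch of how this can be inspected; the column name follows the competition data, while the plotting choices are my own:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Central tendency measures of the target variable
print(train["SalePrice"].describe())

# Distribution of the target variable
sns.histplot(train["SalePrice"], kde=True)
plt.title("SalePrice distribution")
plt.show()
```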

Very interesting: the “SalePrice” variable has a right-skewed distribution. A good way to deal with this is to apply a transformation that brings it closer to a normal distribution before starting to model. There are many such methods, and the best one is whichever brings the distribution closest to normal.

In real life, most of the data you will deal with comes with an asymmetric distribution, so learning how to convert it to a normal distribution is an important skill.

Why transform a skewed variable into a more normal one? When modeling variables whose relationships are not quite linear, a transformation can make these relationships more linear, improving the fit of the model to the data. Furthermore, the shape of the distribution plays a role in determining the appropriate method of statistical analysis; if we're performing a statistical analysis that assumes normality, a transformation might help us meet this assumption.

Measures like skewness and kurtosis are used to describe the shape of the distribution: skewness measures the lack of symmetry, while kurtosis measures how heavy the tails are compared to a normal distribution.
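Both measures can be computed directly with pandas; a quick sketch, reusing the train DataFrame from before:

```python
# Skewness: asymmetry of the distribution (0 for a perfectly symmetric one)
print(f"Skewness: {train['SalePrice'].skew():.2f}")

# Kurtosis: pandas reports excess kurtosis, so a normal distribution scores ~0
print(f"Kurtosis: {train['SalePrice'].kurt():.2f}")
```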

Another way of looking at the distribution of the data is using the boxplot graph, where we can clearly see that there are many outliers on the right side of the plot.
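A minimal sketch of that plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of the target; points beyond the right whisker are potential outliers
sns.boxplot(x=train["SalePrice"])
plt.title("SalePrice boxplot")
plt.show()
```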

Missing Values Overview

After seeing the overview of the dataset and the target variable behavior, it’s good to get a notion of how many missing values the dataset has.
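One way to get this overview (a sketch, not necessarily the exact code from the notebook):

```python
# Percentage of missing values per column, from most to least missing
missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0])
```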

Sometimes these values are missing not because they weren't collected, but because the thing they describe doesn't exist. For example, reading the data description of the dataset, we can see that the variable “Alley” has NaN values because some houses don't have alley access. The same happens with the columns “BsmtQual” and “BsmtCond”: their values are missing because some houses don't have a basement, and the fact that these two variables have the same percentage of missing values reinforces this idea.

This is why it's important to understand the dataset before making the hasty decision of simply deleting the null columns or filling them with arbitrary values.

Therefore, after analyzing the dataset, I concluded that the variables with more than 80% of missing values will be excluded, because they aren't relevant for predicting the target variable.
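A sketch of that filter, reusing the missing_pct series computed above:

```python
# Drop every column with more than 80% of missing values
cols_to_drop = missing_pct[missing_pct > 80].index
train = train.drop(columns=cols_to_drop)

# In this dataset these are typically PoolQC, MiscFeature, Alley and Fence
print(list(cols_to_drop))
```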

Divide Qualitative and Quantitative Variables

An efficient approach to analyzing the variables is to divide the dataset into categorical variables and numerical variables.

This split is necessary because the approach used to decide whether a variable is relevant to the problem differs between categorical and numerical variables.

I decided to first analyze the quantitative variables.
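A sketch of this split, based on pandas dtypes:

```python
# Numerical (quantitative) columns
quantitative = train.select_dtypes(include="number").columns.tolist()

# Categorical (qualitative) columns, stored as object/string dtype
qualitative = train.select_dtypes(include="object").columns.tolist()

print(len(quantitative), len(qualitative))
```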

Quantitative Analysis

Since we are exploring the data for a linear regression model, we should start by looking at the correlation between the variables.
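A sketch of this step, restricted to the numerical columns:

```python
# Correlation of every numerical variable with the target, strongest first
corr_with_target = train[quantitative].corr()["SalePrice"].sort_values(ascending=False)
print(corr_with_target.head(10))
```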

Above, we can see the correlation between the dependent variable and the independent variables. Let's use scatterplots to check whether there is a linear relationship between “SalePrice” and the variables that are highly correlated with it.
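A sketch of those plots; the exact variables depend on the correlation results above, and the four below are typical high-correlation columns in this dataset:

```python
import matplotlib.pyplot as plt

# Variables that usually show the strongest correlation with SalePrice
high_corr_cols = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), high_corr_cols):
    ax.scatter(train[col], train["SalePrice"], alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("SalePrice")
plt.tight_layout()
plt.show()
```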

The linear relationships are clearly visible. Nevertheless, there are some subtle outliers that might reduce the effectiveness of the model, so it's better to simply remove them.
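As a sketch, assuming we drop the handful of very large houses that sold for unusually low prices (the exact criterion is a judgment call):

```python
# Remove the visual outliers: huge living area combined with a low sale price
train = train[~((train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000))]
```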

Well, now let’s just save the name of the columns that we found relevant.

Qualitative Analysis

To do this analysis, I will use a different approach, since dealing with string values is more complicated than dealing with numerical ones.

First, I would like to introduce two common ways of dealing with categorical values.

Ordinal Encoding

Ordinal encoding assigns each unique value to a different integer.

This approach assumes an ordering of the categories: “Never” (0) < “Rarely” (1) < “Most days” (2) < “Every day” (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal variables. For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables.
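A minimal sketch of this encoding with scikit-learn, using the example categories above (the column name is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal variable with a clear ranking
df = pd.DataFrame({"Frequency": ["Never", "Rarely", "Most days", "Every day", "Rarely"]})

# Pass the order explicitly: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
encoder = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
df["Frequency_encoded"] = encoder.fit_transform(df[["Frequency"]]).ravel()
print(df)
```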

One Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data. To understand this, we’ll work through an example.
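As a minimal sketch with scikit-learn (the color column is just an illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})

# One new indicator column per category; handle_unknown="ignore" keeps the
# encoder from failing on categories it has not seen during fit
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["Color"]]).toarray()

encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))
print(encoded_df)
```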

In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., “Red” is neither more nor less than “Yellow”). We refer to categorical variables without an intrinsic ranking as nominal variables.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (you generally won’t use it for variables taking more than 15 different values).

Once these two approaches are clear to you, it will be easier to understand what we will do next.

First, let’s divide the categorical variables into two groups:

The next step is to exclude the variables with high cardinality. The cardinality of a categorical variable is simply the number of unique entries it takes. For instance, the “Street” variable has cardinality 2.
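A sketch of how cardinality can be checked, reusing the qualitative list from before:

```python
# Number of unique values (cardinality) of each categorical column
cardinality = train[qualitative].nunique().sort_values(ascending=False)
print(cardinality)
```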

Besides, variables with high cardinality can make the model take too long to train if we use One Hot Encoding, and they may not even be significant to the model. Because of this, I decided that, for the variables classified as One Hot Encoding variables, I will only keep the relevant columns with cardinality less than or equal to six.

After this filter, the remaining variables to which we will apply One Hot Encoding are: [‘Street’, ‘LotConfig’, ‘RoofStyle’, ‘Foundation’, ‘Heating’].

And I will assign the names of the variables that will receive Ordinal Encoding to the variable ordinal_cols.

The “CentralAir” column is the only categorical variable that is an exception: we won't apply Ordinal Encoding or One Hot Encoding to it. According to the data description of the competition, this variable only takes two values.

So later on we will simply replace these string values with boolean values (Y: 1, N: 0).
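A sketch of that replacement:

```python
# "CentralAir" only takes the values "Y" and "N", so a simple mapping is enough
train["CentralAir"] = train["CentralAir"].map({"Y": 1, "N": 0})
```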

Prepare the Dataset

Transformation

Do you remember that the target variable has a right-skewed distribution? We have to take into account that the numerical variables we chose might also have asymmetric distributions, so let's see what we can find.

It seems that the distributions of the other variables are similar to that of the target variable. Consequently, we will apply a transformation method to all of them.

I ran some tests and concluded that the “TotalBsmtSF” variable becomes closer to normally distributed with a square root transformation than with the more common log transformation.
After this, let’s see how our data changed.
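A sketch of these transformations; the exact set of log-transformed columns follows the earlier quantitative analysis (GrLivArea is shown here only as an illustration), and np.log1p is used so that zero values don't cause problems:

```python
import numpy as np

# Log transformation for the target and the other right-skewed variables
train["SalePrice"] = np.log1p(train["SalePrice"])
train["GrLivArea"] = np.log1p(train["GrLivArea"])

# Square root transformation for "TotalBsmtSF", which worked better in my tests
train["TotalBsmtSF"] = np.sqrt(train["TotalBsmtSF"])
```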

The “SalePrice” skew before the transformation was 1.88, now it’s 0.12; the kurtosis was 6.54, and now it’s 0.81.

And the independent variables look much more like a normal distribution than before.

Now let’s compare the linear relationship of the variables through a scatterplot before and after the transformation.

Before the transformation

After the transformation

Looking at the transformed variables, we can see that the linear relationships are much more visible. Why? Because the transformation made the relationships between these variables and the target more linear.

Conclusion

In this article, we understood the difference between a Classification and a Regression model; met the linear regression equation; and saw the general characteristics of the dataset and the behavior of the target variable.

Through this analysis, we concluded that not only the target variable but also some quantitative variables would need a transformation to become approximately normally distributed, and we learned why this is done.

We also saw the reason why some values are missing, divided the variables between quantitative and qualitative variables, and learned two common approaches to deal with the qualitative variables: Ordinal Encoder and One Hot Encoder.

In the next part of this article, we will start dealing with the missing values; learn how to build and use a Pipeline to preprocess the data; understand how a Linear Regression Model works behind the scenes; learn how to statistically analyze how well the model fits the data; and see the most popular regression models.
