Data Preparation and Preprocessing is just as important as creating the actual Model in Data Sciences- Part 2

Srishti Saha
8 min read · Oct 9, 2018


This article is the second in a series covering my approach to solving the House Price prediction problem on Kaggle.

Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In the last article, we spoke about data exploration in terms of outlier and correlation analysis. We also investigated the target variable ‘SalePrice’ and transformed it to best suit the requirements of the regression model. This article covers two other cardinal aspects: Missing Value Treatment and Feature Engineering.

Before we get to those operations, let us quickly revisit how we combined the training and test datasets into a single dataset called ‘total_df’.
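For reference, here is a minimal sketch of that combination step, assuming the standard Kaggle train.csv and test.csv files; the file paths and intermediate names are illustrative rather than the original code.

```python
import pandas as pd

# Load the competition files (paths are placeholders)
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# 'SalePrice' exists only in the training data, so it stays NaN for test rows
total_df = pd.concat([train_df, test_df], axis=0, ignore_index=True, sort=False)
print(total_df.shape)
```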

The shape or dimensions of total_df come out to be 2912 rows and 81 columns. All rows from the test dataset have a null value in the ‘SalePrice’ column i.e. 1459 rows with nulls under ‘SalePrice’.

Missing Value Treatment

Missing values can wreak havoc in any kind of Machine Learning model. Missing values in the dataset, especially in the training data, can bias the model’s performance. There are several ways to remove missing values or impute them with viable substitutes. We shall look at a few of these options here. I first checked for missing values in all the variables in both the training and test data.
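A quick check of this kind can be done with pandas; the snippet below is a sketch assuming the combined ‘total_df’ from the previous step, and the same call can be run on the train and test frames separately.

```python
# Count nulls per column and keep only the columns that actually have any
missing_counts = total_df.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))
```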

Distribution of missing values in train and test data

The above table lists the variables that have missing values in the dataset. We can also express the missing values as a percentage of the total number of records. This can be done using the snippet below, which also sorts the records in decreasing order of the percentage of missing data.
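A sketch of how such a percentage table could be produced, again assuming pandas; the output column name is mine.

```python
# Percentage of missing values per column, sorted in decreasing order
missing_pct = (total_df.isnull().sum() / len(total_df)) * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.to_frame(name="percent_missing"))
```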

Percentage of missing data

The above table will help us prioritize the order in which we want to tackle each of these variables. However, before we do that, we will remove two variables which we had selected on the basis of the correlation matrix in the previous article:

GarageYrBlt and TotRmsAbvGrd

We also had two other variables which were to be removed:

1stFlrSF and GarageArea

We will treat these variables later, after the feature engineering exercise.
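For reference, dropping the first two is a one-liner, assuming ‘total_df’ is a pandas DataFrame; ‘1stFlrSF’ and ‘GarageArea’ are left in place for now, as noted above.

```python
# Remove the two correlation-based candidates identified in the previous article
total_df = total_df.drop(columns=["GarageYrBlt", "TotRmsAbvGrd"])
```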

Let us now move on to the techniques used to treat the missing values in each of the variables. Ironic as it may seem, the lack of data (or nulls, as we call them) in a few columns carries a significant amount of information.

Information derived from missing values

Variables like ‘Alley’, ‘Fence’, and ‘PoolQC’, among many others, indicate the presence or absence of certain features in the house. A missing value here simply means that the feature is not present in the house in question. We have thus imputed the missing values in these columns accordingly:
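A sketch of that imputation is shown below; the exact column list and the ‘None’ label are illustrative choices on my part, not necessarily the original code.

```python
# Columns where a null means the house simply lacks that feature
none_cols = ["Alley", "Fence", "PoolQC", "FireplaceQu", "MiscFeature",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]

for col in none_cols:
    total_df[col] = total_df[col].fillna("None")
```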

We now have a few other features that need to be treated for nulls and missing values.

Imputing missing values with numerical alternatives

One common method of treating missing values in numerical columns is to impute them with group means, medians or even a hard-coded value, based on what makes the most sense. I have done a similar exercise for the next set of columns, as summarized below:
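As a hedged illustration of this kind of imputation (the fill rules below are plausible choices, not the exact values used in the original exercise):

```python
# Numeric columns where a null effectively means "zero of this feature"
zero_cols = ["GarageCars", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
             "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath"]
for col in zero_cols:
    total_df[col] = total_df[col].fillna(0)

# Categorical columns with only a handful of nulls can take the most frequent value
for col in ["MSZoning", "Electrical", "KitchenQual", "SaleType", "Functional"]:
    total_df[col] = total_df[col].fillna(total_df[col].mode()[0])
```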

Remaining missing values after the last step

Barring the 1459 rows with missing values in ‘SalePrice’, we now have only one variable with nulls: the column ‘LotFrontage’ needs to be treated for 485 missing values. Imputing 485 rows with a vaguely computed value does not seem right. How else can we do it?

Using techniques like regression to treat missing values

‘LotFrontage’, as per the variable definition, is the linear feet of street connected to the property. It should thus depend on the ‘Neighborhood’ the house is located in. The value of ‘LotFrontage’ also seems to be derived from, or correlated with, variables like ‘LotArea’, ‘LotShape’ and ‘LotConfig’, among others. We can thus derive the missing values of ‘LotFrontage’ as a function of these variables, by creating a simple regression model using the other predictors.
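A hedged sketch of such a regression-based imputation follows; the feature set, the encoding and the choice of a plain linear model are my assumptions, not the original code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Lot- and location-related predictors, one-hot encoded where categorical
lot_features = ["LotArea", "LotShape", "LotConfig", "Neighborhood"]
lot_data = pd.get_dummies(total_df[lot_features])

# Fit on the rows where 'LotFrontage' is known, predict where it is missing
known = total_df["LotFrontage"].notnull()
model = LinearRegression()
model.fit(lot_data[known], total_df.loc[known, "LotFrontage"])
total_df.loc[~known, "LotFrontage"] = model.predict(lot_data[~known])
```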

After this step, we have successfully treated all missing values in our independent variables. We can now move on to the next step: Feature Engineering

Feature Engineering (and Manipulation)

This step is often touted as the most important one before building any kind of Machine Learning model. Features, or predictor variables, provide the information that helps train a model better. Feature engineering also helps in reducing the dimensions of a dataset: if you can create a derived feature from 10 independent features and then replace those 10 features with the single derived variable, you help define the model better without cluttering the input data.

Before we move on to the actual exercise, let us go over a few theoretical aspects of feature engineering:

What kind of problems does feature engineering solve?

Feature engineering, in the simplest terms, helps improve model performance, whichever metric you use to evaluate it. It uses domain knowledge and some common sense to create variables and predictors that might help predict the target variable better. It is important to note that any machine learning model ultimately expresses the ‘y’ (target) variable as a function of the ‘x’ (predictor) variables. Hence, the better the quality of the ‘x’ variables, the closer you are to predicting the ‘y’ variable accurately.

One common notion is that the model performance can be improved using different combinations of models and hyperparameters. That is true! However, feature engineering provides us with the flexibility to use simpler and fewer models, fewer constraints and hyperparameters and still get good quality predictions.

The objective of the feature engineering exercise is to describe the structures inherent in the dataset. Let us see what kind of features can be created to do so.

What kind of features can be created?

One can create simple derived features from single columns that already exist in the dataset. For instance, one can apply a logarithmic transformation to a numerical column to create a new feature. Other mathematical operations, like squaring the values or taking their roots, can also be used. Now, how does that contribute to model performance? While the full rationale behind any of these operations combines statistical and domain knowledge, a straightforward explanation is this: the target variable might not be directly related to the base variable as is, but might instead be a function of the transformed variable. Creating the feature directly puts that predictor into the regression function and thus helps in getting better predictions.
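A toy illustration of such single-column transforms, assuming pandas and NumPy; the column here is a made-up stand-in for any numeric predictor.

```python
import numpy as np
import pandas as pd

# A toy numeric column; in practice this would be a column of the dataset
area = pd.Series([1200.0, 1500.0, 2100.0, 950.0], name="area")

derived = pd.DataFrame({
    "area_log":  np.log1p(area),   # logarithmic transform
    "area_sq":   area ** 2,        # square
    "area_sqrt": np.sqrt(area),    # square root
})
```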

For textual columns, one can create features like flags indicating the presence of particular keywords or special characters. I did a similar exercise in one of my projects, where I used project names to derive features indicating their attractiveness or impact on a viewer. You can go through the detailed process here. You can also derive features like the month of the year or the quarter of sale from date-time variables. You can innovate and create as many derived features as you like, depending on the problem statement and the data available.
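For example, keyword flags and date parts can be derived along these lines; the column names and the keyword below are hypothetical, not from the house-price dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "project_name": ["Deluxe lakeview home", "Starter bungalow"],
    "sale_date": pd.to_datetime(["2008-06-15", "2009-11-02"]),
})

# Flag indicating the presence of a keyword in a text column
df["has_deluxe_keyword"] = df["project_name"].str.contains("deluxe", case=False).astype(int)

# Date parts derived from a datetime column
df["sale_month"] = df["sale_date"].dt.month
df["sale_quarter"] = df["sale_date"].dt.quarter
```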

The next kind of feature is the cross-column feature, which involves slightly more complicated operations using more than one variable. Products of two or three columns are one example. Finding the age of a house at the time of sale, by subtracting the date of establishment from the date of sale, is an excellent example of such a feature in our use-case.
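In this dataset, that age feature could look roughly like the one-liner below (the column names are from the Kaggle data, the feature name is mine).

```python
# Age of the house at the time of sale
total_df["HouseAgeAtSale"] = total_df["YrSold"] - total_df["YearBuilt"]
```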

Another popular kind of derived feature is buckets or percentiles of column values. As an example, you can use the numerical values in a column like ‘price’ to bucket houses into high, moderate and low priced groups.
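Pandas offers quantile-based bucketing for exactly this; a small, self-contained sketch:

```python
import pandas as pd

# Split a toy price column into three quantile-based bands
prices = pd.Series([90000, 100000, 150000, 220000, 310000, 450000])
price_band = pd.qcut(prices, q=3, labels=["low", "moderate", "high"])
print(price_band)
```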

Please note that the above examples are not exhaustive. Feature engineering is a creative and time-consuming exercise that requires an iterative approach. There are methods of automating feature engineering, but they require a lot of resources and some advanced techniques. Now that we have covered a part of the theory, let us go over the approach I adopted to create features for predicting the prices of the properties.

Adopted Approach

To create features, I browsed the set of predictors showing high correlation with the target variable. I started with the following 4 variables:

OverallQual, GrLivArea, GarageCars and TotalBsmtSF

Let us look at the first variable: ‘OverallQual’. Since regression equations often comprise quadratic and polynomial components, it would be good to have such predictor features. I thus created three features from the base variable: its square, its cube and its square root. For ‘GrLivArea’, we created a log-transformed variable, for the same reason that led us to transform the target variable. I also created some bucket variables on the basis of the percentile distribution of the values under ‘GrLivArea’. Similar operations were done with ‘GarageCars’ and ‘TotalBsmtSF’. With a few additional variables created, the entire exercise can be found in the snippet below:
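A hedged reconstruction of the kind of code described above; the feature names and bucket labels are illustrative, not the exact original snippet.

```python
import numpy as np
import pandas as pd

# Polynomial variants of the overall quality rating
total_df["OverallQual_sq"] = total_df["OverallQual"] ** 2
total_df["OverallQual_cu"] = total_df["OverallQual"] ** 3
total_df["OverallQual_sqrt"] = np.sqrt(total_df["OverallQual"])

# Log-transform of the above-ground living area, mirroring the target transform
total_df["GrLivArea_log"] = np.log1p(total_df["GrLivArea"])

# Percentile-based buckets of the living area (quartiles here)
total_df["GrLivArea_band"] = pd.qcut(total_df["GrLivArea"], q=4,
                                     labels=["small", "medium", "large", "xlarge"])

# Similar treatment for garage capacity and basement area
total_df["GarageCars_sq"] = total_df["GarageCars"] ** 2
total_df["TotalBsmtSF_log"] = np.log1p(total_df["TotalBsmtSF"])
```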

New features will now be created using combinations of variables. Let us go through a few of them; a rough code sketch follows the list.

  1. I used ‘OverallQual’ and ‘OverallCond’ to create a feature ‘TotalHomeQual’. This is basically just the average of the overall quality and condition ratings.
  2. The next class of features was based on the area of the property. One significant variable created here was ‘LivingAreaSF’, which is the sum of areas like ‘1stFlrSF’ and ‘2ndFlrSF’ along with the basement areas.
  3. I also created timeline-based features using the date columns. Some of the features so created include ‘age_at_selling_point’, ‘time_since_remodel’, ‘DecadeBuilt’ and ‘DecadeSold’.
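A hedged sketch of how these combined features might be built; the exact arithmetic is reconstructed from the descriptions above, not copied from the original code.

```python
# Averaged quality rating from overall quality and condition
total_df["TotalHomeQual"] = (total_df["OverallQual"] + total_df["OverallCond"]) / 2

# Total living area across floors and basement
total_df["LivingAreaSF"] = (total_df["1stFlrSF"] + total_df["2ndFlrSF"]
                            + total_df["TotalBsmtSF"])

# Timeline-based features derived from the date columns
total_df["age_at_selling_point"] = total_df["YrSold"] - total_df["YearBuilt"]
total_df["time_since_remodel"] = total_df["YrSold"] - total_df["YearRemodAdd"]
total_df["DecadeBuilt"] = (total_df["YearBuilt"] // 10) * 10
total_df["DecadeSold"] = (total_df["YrSold"] // 10) * 10

# Base predictors used above are dropped afterwards to limit multicollinearity
total_df = total_df.drop(columns=["OverallQual", "OverallCond", "1stFlrSF",
                                  "2ndFlrSF", "YearBuilt", "YearRemodAdd"])
```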

This concludes the feature engineering exercise. It should be noted that I kept dropping the predictors used to create new features from the base data. This is an important step to avoid multicollinearity in the predictor variables.

Final Steps in Data Preparation

Treatment of categorical and ordinal variables

After all the features were created, I used Label Encoding and One-Hot encoding to treat the ordinal and categorical features respectively. You can get a glimpse of the snippet used here:
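A sketch of that encoding step, assuming scikit-learn and pandas; the split of columns into ordinal versus nominal below is illustrative, not the exact list used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label-encode a few ordinal (quality/condition style) columns
ordinal_cols = ["ExterQual", "ExterCond", "KitchenQual", "HeatingQC"]
for col in ordinal_cols:
    total_df[col] = LabelEncoder().fit_transform(total_df[col].astype(str))

# One-hot encode the remaining categorical (object or category dtype) columns
nominal_cols = total_df.select_dtypes(include=["object", "category"]).columns
total_df = pd.get_dummies(total_df, columns=list(nominal_cols))
```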

After the above step, ‘SalePrice’ has been dropped from the dataset. This is because it is the target variable, and any further steps applied to all the variables in the dataset should not touch the target variable.

Normalize dataset

The next step was to simply scale the dataset. All variables (now numeric) have been normalized to a range of 0 to 1 using the following snippet:
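A sketch of this scaling step, assuming scikit-learn’s MinMaxScaler; ‘SalePrice’ has already been set aside at this point.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every (now numeric) column to the [0, 1] range
scaler = MinMaxScaler()
scaled = scaler.fit_transform(total_df)
total_df = pd.DataFrame(scaled, columns=total_df.columns, index=total_df.index)
```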

The above two steps made our dataset ready for model creation and training.

Next Steps

The next article in the series will take you on a brief journey through model creation and training. All the above steps, like those in the previous article, have been performed to best suit the data and the problem statement at hand. It is up to you to either adopt these methods or leave them out. However, I reiterate that all these aspects must be investigated before deciding on any of these data preprocessing steps.

For the link to my next and final article in this series, keep following this space.

Link to the first article in the series: Data Preparation and Preprocessing is just as important creating the actual Model in Data Sciences- Part 1
