Solving my first regression problem

Ujjwal · Published in Analytics Vidhya · 9 min read · Dec 16, 2020


How I began my journey into data science and what I learned from solving my first regression problem

Photo by Maximillian Conacher on Unsplash

It’s the middle of the pandemic. My friend casually mentions Data Science in one of our conversations. Being a computer science graduate, I have heard the terms ML, AI, DS etc. a million times. For some reason, I had always abstained from trying this field out, even though it had always fascinated me. The world of predicting things and computers taking decisions themselves always felt sort of magical. Bored due to the lockdown and travel restrictions, I decided to give it a try. I started out by googling how to get going in the world of Data Science. I got my dose of motivation from the source every millennial prefers, random Instagram videos, and started Andrew Ng’s introductory Machine Learning course on Coursera. And thus began my journey towards data science (pun intended).

So there I was, done with the first step of becoming a Data Scientist. The important question was, what next? My friend suggested that I pick an area within data science and try to solve a few elementary problems in it. The various areas within Data Science include regression, classification, natural language processing, time series analysis etc. Since I find the act of a computer predicting things by itself quite fascinating, I decided to go with regression.

Overview

In this article I will take you through my journey and explain the steps of solving your first regression problem. We will be implementing the following steps:

  1. Picking a data set
  2. Pre-processing of data
  3. Implementing regression algorithms like linear regression (Ridge and Lasso) and XGBoost regression
  4. Picking the right model

Note: For this implementation, we will be using the Python programming language along with a few libraries like pandas, matplotlib, scikit-learn etc.
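
The code snippets below rely on a handful of standard libraries. For convenience, here is a consolidated set of imports that should cover everything used in this article (assuming the usual package names; the snippets themselves mostly leave the imports implicit):

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.linear_model import Ridge, LassoCV
from sklearn.model_selection import cross_val_score
import xgboost as xgb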

Picking the Right Data Set

Being a beginner, I did not want to get heavily involved in the data engineering and preprocessing parts just yet. I started looking around for cleaned datasets for regression problems on Kaggle. I found a dataset about housing prices in the state of Iowa in the USA which looked fascinating, and decided to try the problem out. The dataset is available on Kaggle. Some of the features are the ones you would expect, like the area of the house, the neighbourhood, the utilities present etc. But the dataset also contains features which go into a lot of depth about each house, like fireplace quality, garage type, basement condition etc.

There were a few reasons for picking this dataset:

  1. This is in the ‘Getting Started’ section of Kaggle, hence I knew that it would be beginner friendly.
  2. I checked out the data set and found that it was largely complete, hence it required minimal preprocessing.
  3. The description of the problem stated that it was ideal for learning advanced regression techniques.

If any of the above mentioned factors differ for you, I would suggest you find a data set of your choice. For example, if you have already solved a few regression problems and are aware of the different libraries, you could pick an uncleaned data set to learn more about preprocessing techniques.

For this particular data set, we will use the read_csv function of the pandas library to import the dataset:

import pandas as pd
trainData = pd.read_csv('<path_to_file>')
testData = pd.read_csv('<path_to_file>')

Preprocessing the Data

The data set contains features whose values are alphanumeric strings, 1-digit numbers, 5-digit numbers, NAs etc. Hence, it is very important to get the data into a format where none of the variables are skewed, so that the algorithms perform to the best of their abilities. Initially, as a beginner, all I had heard about was normalization. But as I started reading guides about normalization, I found out that log transforming values is also a method to reduce the skew of a variable. Log transforming means using the logarithm of a variable’s value instead of the raw value. For this problem, when I compared the results of log transformation against normalization, the results with log transformation were marginally better. Hence we will go with the log transform here:

# Joining train and test data to preprocess; will separate out later
all_data = pd.concat((trainData.loc[:, 'MSSubClass':'SaleCondition'],
                      testData.loc[:, 'MSSubClass':'SaleCondition']))

First, we take the log of the housing prices in the training set:

#log transform the target values
trainData["SalePrice"] = np.log1p(trainData["SalePrice"])

Then, we find the numeric features whose values in the training set have a skew > 0.75. These will be log transformed as well:

#extract features which are purely numeric
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
#calculate skew of each numeric feature in the training data
skewed_feats = trainData[numeric_feats].apply(lambda x: skew(x.dropna()))
#keep the indexes of features which have skew > 0.75
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
#log transform those features across the combined data
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
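
As a quick sanity check (an extra step, not part of the original walkthrough), you can compare the skew of one of these features before and after the transform, assuming the variables defined above:

# Rough check: skew of one feature in the raw training data vs. after log1p
example_feat = skewed_feats[0]
print("raw skew:  ", skew(trainData[example_feat].dropna()))
print("log1p skew:", skew(np.log1p(trainData[example_feat].dropna())))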

The present dataset contains both non-numeric and numeric values. To use the various Python libraries available, we need to convert the non-numeric values to numeric ones. Here, we will use something called one-hot encoding. One-hot encoding breaks one variable down into several indicator variables. For an explanation of one-hot encoding along with a few examples, check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html.
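
To make the idea concrete, here is a tiny, made-up example (the column and category names are purely illustrative) of what one-hot encoding does to a single categorical column:

# Hypothetical toy frame, purely to illustrate the shape of one-hot encoding
toy = pd.DataFrame({"GarageType": ["Attchd", "Detchd", "Attchd"]})
print(pd.get_dummies(toy))
# Each category becomes its own indicator column, e.g. GarageType_Attchd and
# GarageType_Detchd (filled with 0/1 or True/False depending on the pandas version)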

Now we will convert our non-numeric values to numeric values. This is easily done using the pandas function get_dummies:

#Dummy Data
all_data = pd.get_dummies(all_data)

Next, we will fill in all the NA values with the mean value of their column:

#Fill NAs with mean of column
all_data = all_data.fillna(all_data.mean())

The final step is to split the preprocessed data back into training and test samples:

#creating matrices for sklearn
X_train = all_data[:trainData.shape[0]]
X_test = all_data[trainData.shape[0]:]
y = trainData.SalePrice

As an additional preprocessing step, you can check the feature descriptions file and decide whether to remove a few features from the training data if you think they might not be very useful. I would recommend doing this after implementing lasso regression, as it will help you decide whether a feature is useful or not (more on that later).

Done! The data is now in a usable format. Now we move on to the very interesting part, the regression algorithms.

Regression Algorithms

In the following methods, we will use the root mean square error (rmse) as the metric to measure the performance of a model. The rmse should be as close to 0 as possible.
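
For reference, rmse is simply the square root of the average squared difference between predicted and true values. A minimal sketch with made-up numbers:

# Minimal illustration of rmse on made-up values
y_true = np.array([12.0, 11.5, 12.3])
y_pred = np.array([11.8, 11.7, 12.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)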

  1. Linear Regression — Ridge Regression (L2 Regularization)

Ridge regression is a variation of linear regression. It is usually applied when the variables in the data set suffer from multicollinearity. In ridge regression, we deliberately add a degree of bias, which prevents overfitting of the data set. The reason ridge regression is known as L2 regularization is that the penalty term uses the squared magnitude of the coefficients. To know more about collinearity, read: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf and to know more about the inner workings of ridge regression watch: https://www.youtube.com/watch?v=Q81RR3yKn30&ab_channel=StatQuestwithJoshStarmer
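
In other words, ridge minimises the usual squared error plus alpha times the sum of squared coefficients. Here is a rough sketch of that penalised objective, purely for illustration (this is not how sklearn computes it internally):

# Illustrative ridge objective: squared error + alpha * L2 penalty on the coefficients
def ridge_objective(y_true, y_pred, coefficients, alpha):
    squared_error = np.sum((y_true - y_pred) ** 2)
    l2_penalty = alpha * np.sum(coefficients ** 2)
    return squared_error + l2_penalty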

Coming to the Python code for ridge regression, we first implement a method which gives us the cross-validated rmse of a particular linear model:

def rootMeanSquareError_CrossValidation(model):
    # cross_val_score returns the negative MSE, so negate it before taking the square root
    rmse_negative = cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 3)
    rmse = np.sqrt(-rmse_negative)
    # The returned rmse is an array of 3 numbers, as the cross validation is done 3 times,
    # as signified by the cv=3 parameter
    return rmse

Now, we implement the learning model. The Ridge regression model takes a parameter alpha, which controls the degree of bias (the regularization strength). We loop through an array of alpha values and plot a graph of root mean square error versus alpha:

alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 40]
ridgeReg_cv = []
for alpha in alphas:
    temp = rootMeanSquareError_CrossValidation(Ridge(alpha)).mean()
    ridgeReg_cv.append(temp)
ridgeReg_cv = pd.Series(ridgeReg_cv, index = alphas)
ridgeReg_cv.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")
alpha vs rmse. Image by Author

Now, we find out the minimum value of root mean square error from this graph:

ridgeReg_cv.min()
Output: 0.12834043288009075

We see that the rmse value is 0.12834043288009075. We will later use this value to compare the performance against other models.
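
As a small aside, the pandas Series we built also tells us which alpha produced that minimum:

# The alpha value at which the cross-validated rmse is lowest
best_alpha = ridgeReg_cv.idxmin()
print(best_alpha)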

Hurray! We have our first regression model implemented. Now, onto the next one.

2. Linear Regression — Lasso Regression (L1 Regularization)

Like ridge regression, lasso regression also adds a penalty term to introduce bias into the learning process. The main difference between the two methods is that in lasso regression the penalty term uses the modulus (absolute value) of the coefficients. Since the penalty is based on the first power of the coefficients, lasso regression is called L1 regularization. Due to the inner workings of lasso regression, there is a possibility of the coefficient of a variable becoming exactly ZERO. This means that in a lasso regression model, a feature might be eliminated from the learning process entirely. This is not the case in ridge regression. To know more, watch https://www.youtube.com/watch?v=NGf0voTMlcs&ab_channel=StatQuestwithJoshStarmer

In the code, we will use the LassoCV class with a list of alphas. We will also use the method defined above to find the root mean square error:

model_lasso = LassoCV(alphas = [5, 1, 0.1, 0.01, 0.001, 0.0005]).fit(X_train, y)
rootMeanSquareError_CrossValidation(model_lasso).mean()

Output rmse: 0.1242464172089154
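
Since LassoCV performs its own cross-validation over the supplied alphas, you can also inspect which alpha it settled on:

# The regularization strength chosen by LassoCV's internal cross-validation
print(model_lasso.alpha_)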

In this case, the Lasso model performs slightly better than the ridge regression model.

As mentioned above, lasso regression can help us determine which features in the dataset are actually being used. We will pick the 5 features to which the lasso model assigns the largest coefficients and the 5 to which it assigns the smallest, and plot them:

coef = pd.Series(model_lasso.coef_, index = X_train.columns)
importance = pd.concat([coef.sort_values().head(5), coef.sort_values().tail(5)])
matplotlib.rcParams['figure.figsize'] = (7.0, 8.0)
importance.plot(kind = "barh")
plt.title("Coefficients of features in the Lasso Model")

Here is the output:

Image by author

As you can see from the graph above, the feature given the most importance is ‘GrLivArea’, which is the above-ground living area in square feet, and the least used feature is ‘RoofMatl’, which is the roof material. If you think about it, this is kind of intuitive, because when buying a house a person would first focus on the living area of the property along with the neighbourhood, and would treat the material of the roof as a lower priority.
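
One extra check worth doing (an addition on top of the article’s code): since lasso can drive coefficients to exactly zero, you can count how many features it effectively eliminated:

# Features the lasso model kept vs. eliminated (coefficient exactly zero)
kept = (coef != 0).sum()
dropped = (coef == 0).sum()
print("kept:", kept, "eliminated:", dropped)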

Yay! We implemented our second linear regression model!

3. XGBoost

This is the third and final regression model that I implemented. Before moving on, I would highly encourage you to check out decision trees and random forests, as the XGBoost model is closely related to these two topics.

XGBoost is a very powerful model. It uses the concept of boosting. Rather than training each model in isolation from the others, the algorithm iteratively adds new decision trees of a certain predefined depth to correct the errors made by the previous ensemble. XGBoost is extremely fascinating! To learn the inner workings of the model, watch: https://www.youtube.com/watch?v=OtD8wVaFm6E

Coming to the code, we first convert our train and test data into a format accepted by XGBoost. The learning process of this algorithm is different from ridge or lasso regression. In XGBoost, we first use the xgb.cv method to tune the various hyperparameters and get the best possible model. Here I tuned the parameters max_depth, eta and min_child_weight by trial and error. To do a grid search for the best parameters, check out https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f and https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/

import xgboost as xgb

xg_train = xgb.DMatrix(X_train, label = y)
xg_test = xgb.DMatrix(X_test)
params = {"max_depth":2, "eta":0.1, "min_child_weight":1}
model = xgb.cv(params, xg_train, num_boost_round=1000, early_stopping_rounds=50)
print(model['test-rmse-mean'].min())

Output rmse: 0.123212
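
As an aside, when early stopping kicks in, the DataFrame returned by xgb.cv is truncated at the best boosting round, so its length tells you how many trees to use in the final model. Presumably this is where the n_estimators value in the next snippet comes from:

# Number of boosting rounds kept after early stopping; reuse as n_estimators below
best_rounds = model.shape[0]
print(best_rounds)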

Now that we have found the best parameters, we will use XGBRegressor with them:

final_model_xgb = xgb.XGBRegressor(min_child_weight=1, n_estimators=360, max_depth=2, learning_rate=0.1)
final_model_xgb.fit(X_train, y)

Congratulations! We have now successfully built our third regression model!

Picking the right model

To pick the right model, we can generally go with the model giving the lowest root mean square error, as this will be the model that predicts results closest to the real values. Just to summarize the rmse values of our 3 models:

  1. Ridge Regression : rmse = 0.12834043288009075
  2. Lasso Regression : rmse = 0.1242464172089154
  3. XGBoost : rmse = 0.123212

As per the performance metrics, I chose the XGBoost model as it has given me the best performance.

After finalizing my solution, I found out there is another way of picking a final model: taking a weighted average of two or more models. To code that, we first make predictions with the models we want to combine and then take their weighted average:

# expm1 inverts the log1p transform we applied to SalePrice earlier
lasso_predictions = np.expm1(model_lasso.predict(X_test))
xgb_predictions = np.expm1(final_model_xgb.predict(X_test))
# Weighted average of the two models' predictions
preds = 0.35*xgb_predictions + 0.65*lasso_predictions

That’s it! Just to recap, we implemented 3 different regression models and learnt a lot about the inner workings of these algorithms. I would highly recommend this exercise to any Data Scientist beginning their journey. All the best!
