Data Science Learners
Nov 5 · 5 min read

Expanding skills as a data science beginner: House Prices - Advanced Regression Techniques (Kaggle competition)

This blog is especially for those who, like us, have just started their journey as data scientists and want to expand their skills beyond the basics. Many of us start by learning a few basics of statistics, mathematics, algebra, and coding in Python or R, but the most important skill for dealing with any dataset that comes your way is understanding the story behind it.

The ‘House Prices - Advanced Regression Techniques’ competition on Kaggle covers all the necessary aspects of a well-rounded project for taking the data science quest to the next level.

This blog explains simple, basic steps for doing any project based on a regression problem. This is neither the best code for achieving the lowest error nor the best algorithm for topping the leaderboard on Kaggle, but it can serve as a simple reference for any beginner who wants to understand the flow of the code.

About the Data:

The train dataset has shape (1460, 80), including the target variable. There are 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and data scientists are challenged to predict the sale price of each house.

Step 1: Know your data

A) train.info(), train.head(), and train.describe() are the basic golden commands that introduce us to the variables, the shape and data types of the columns, and the basic statistics of the dataset.
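These commands can be run on any DataFrame; here is a minimal sketch on a toy frame (hypothetical values) standing in for the real train.csv:

```python
import pandas as pd

# Toy frame standing in for the Kaggle train.csv (hypothetical values)
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

train.info()             # column names, non-null counts, dtypes
print(train.head())      # first rows of the data
print(train.describe())  # count, mean, std, min/max, quartiles for numeric columns
```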

B) Checking missing values

Columns with more than 80% null values can be treated differently: they can either be removed outright or imputed with a suitable method. In this project we tried both, found that these columns (PoolQC, Fence, MiscFeature) had little impact on the result, and removed them.
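A sketch of the missing-value check and the 80% drop rule, on a toy frame where PoolQC is entirely missing (the column names match the Kaggle data; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: PoolQC is entirely missing here, mimicking its very high
# missing rate in the real data (values are hypothetical)
train = pd.DataFrame({
    "PoolQC":    [np.nan, np.nan, np.nan, np.nan],
    "LotArea":   [8450, 9600, 11250, 9550],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Fraction of missing values per column, worst first
missing_pct = train.isnull().mean().sort_values(ascending=False)
print(missing_pct)

# Drop columns whose missing fraction exceeds the 80% threshold from the blog
to_drop = missing_pct[missing_pct > 0.8].index
train = train.drop(columns=to_drop)
```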

For numeric columns, missing values are generally replaced with the mean; for object data types, common choices are the mode, ‘ffill’, or ‘bfill’. But the analyst should understand the nature of each column before selecting a suitable replacement method.
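A minimal sketch of mean and mode imputation (column names from the Kaggle data, values hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],            # numeric -> mean
    "MasVnrType":  ["BrkFace", None, "None", "BrkFace"],  # object  -> mode
})

# Replace numeric gaps with the column mean
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
# Replace categorical gaps with the most frequent level
df["MasVnrType"] = df["MasVnrType"].fillna(df["MasVnrType"].mode()[0])
```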

Step 2: Analysis of the target/dependent variable

Part of the problem definition is defining the target variable. This is as important as other processes like data preparation, missing value imputation, and the algorithm that is used to build models.

Checking basic stats with target.describe() only gives you a few numbers, which may not reveal the shape of the target variable’s distribution. Further digging is mandatory.

A graphical presentation of the variable gives a clearer idea of its distribution, so checking the skewness is an important step.

We can clearly see that the variable is positively skewed, so a log transformation may give a better picture of the data.
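The skewness check and log transform can be sketched as follows, on synthetic, roughly log-normal prices rather than the real SalePrice column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical, roughly log-normal prices mimicking the skewed SalePrice column
prices = pd.Series(np.expm1(rng.normal(12.0, 0.4, 500)), name="SalePrice")

print("skew before:", prices.skew())      # clearly positive: long right tail
log_prices = np.log1p(prices)             # log(1 + x), a common target transform
print("skew after :", log_prices.skew())  # much closer to zero
```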

Step 3: Dealing with categorical variables

You can’t fit categorical variables into a regression equation in their raw form; they must be encoded first. This treatment varies with the dataset and with our understanding of it, and the data description file comes to the rescue in these cases. The most common methods are mapping the levels to numbers (label encoding), combining levels, and dummy encoding (one-hot encoding). In this project we mapped some of the categorical variables that were ordinal in nature (e.g. poor, good, excellent).
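A sketch of the ordinal mapping for one quality column (the Po/Fa/TA/Gd/Ex levels come from the competition’s data description file; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({"KitchenQual": ["Ex", "Gd", "TA", "Fa", "Gd"]})

# Ordered levels from the Kaggle data description: Po < Fa < TA < Gd < Ex
qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual"] = df["KitchenQual"].map(qual_map)
print(df)
```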

Though this looks a little time-consuming, the effort is worth it when we are in the initial phase of learning. The remaining categorical variables can be treated with the most popular method, one-hot encoding, but not before repeating all the above steps for the test dataset.

A ‘dummy’, as the name suggests, is a derived variable that represents one level of a categorical variable: presence of the level is represented by 1 and absence by 0, and one dummy variable is created for every level present. It is very important to combine the train and test data before this step, to ensure the same number of columns is created and the shapes of ‘Train’ and ‘Test’ match.
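A minimal sketch of combining train and test before one-hot encoding, using pandas get_dummies on toy frames (hypothetical values):

```python
import pandas as pd

train = pd.DataFrame({"MSZoning": ["RL", "RM"], "LotArea": [8450, 9600]})
test  = pd.DataFrame({"MSZoning": ["RL", "FV"], "LotArea": [9000, 7500]})

# Concatenate before encoding so both frames get the same dummy columns,
# even though "FV" only appears in the test data
combined = pd.concat([train, test], ignore_index=True)
combined = pd.get_dummies(combined, columns=["MSZoning"])

# Split back apart using the original row counts
train_final = combined.iloc[:len(train), :]
test_final = combined.iloc[len(train):, :]
```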

Step 4: All set to build the model

All the basic data cleaning steps are done, and we are all set to train our model. The train data can now be separated from the combined data using

train_final = New_data.iloc[:1460, :]; test_final = New_data.iloc[1460:, :]

(Link for the complete code is given at the end of the blog)

Selection of the model is a trial-and-error exercise for newbies like us. We went from Linear Regression and a Random Forest regressor to XGBoost, and achieved the minimum error with RandomizedSearchCV over an XGBoost regressor. The code is self-explanatory, assuming the reader has a fair idea of the basic concepts behind these algorithms. Hyperparameter tuning is a crucial skill that develops with experience.

RandomizedSearchCV helps tune hyperparameters by evaluating randomly sampled parameter settings and keeping the combination with the best cross-validation score; the optimal values can then be read off the best estimator.
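A sketch of such a search. The notebook uses XGBRegressor, which may not be installed everywhere, so scikit-learn’s GradientBoostingRegressor stands in here with a similar interface; the data and parameter ranges are illustrative, not the blog’s actual ones:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the prepared train set
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Distributions to sample hyperparameter values from (illustrative ranges)
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.2),
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,                               # number of random settings to try
    scoring="neg_root_mean_squared_error",  # the competition's error metric
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)           # the sampled setting with the best CV score
best_model = search.best_estimator_  # refit on the full data by default
preds = best_model.predict(X)
```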

Once the model is trained, it is used to predict the target values on the test set.
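Since the target was log-transformed, the predictions come out on the log scale and must be mapped back before writing the submission file; a sketch with hypothetical prediction values and Ids:

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale predictions from the fitted model
log_preds = np.array([11.8, 12.1, 12.5])

# Undo the log1p applied to SalePrice earlier
final_preds = np.expm1(log_preds)

# Hypothetical test Ids; the real ones come from the test.csv Id column
submission = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": final_preds})
# submission.to_csv("submission.csv", index=False)
```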

The RMSE for this model was 0.13310 on submission. The model can be tuned further with different regression techniques and data wrangling methods. This blog was created to share knowledge and improve the learning process with input from fellow learners.

The complete notebook is at https://github.com/vinit-rege/House-Prices-Predictions-Advanced-Regression-Techniques/blob/master/House%20Prediction%20.ipynb

Happy learning.

Dr. Deepali K & Vinit Rege

datascience.learners@gmail.com

References

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

https://medium.com/better-programming/comparing-grid-and-randomized-search-methods-in-python-cd9fe9c3572d
