Gain key machine learning skills with a beginner-friendly project

Raffaele Nolli
Published in Analytics Vidhya
8 min read · May 5, 2020
Heat map of the property prices in Ames, Iowa.

As an aspiring data scientist, one of the most important things is to find the right kinds of personal projects: those that allow you to draw meaningful information from data while boosting your learning.

In this article I share my experience of building a project around a commercial dataset, which I used to create a model that can predict house prices with a root mean square error of around $39k. This project helped me build my data cleaning, feature engineering and predictive modelling skills.

A good place to start is Kaggle, which has thousands of datasets to choose from. I picked a dataset about houses sold in Ames, Iowa, and decided I would aim to build a simple and effective predictor for house prices in Ames, based on the houses’ descriptive features, as presented in the dataset.

Why this dataset? Because data cleaning and feature engineering looked like a real challenge, which made it a good playground. To achieve my goal, I had to go through all the major steps of the data science production process, as explained in this Medium article, apart from data collection and commercial deployment.

By using a modular and simple project structure, I intended to make my work easy to understand and follow, so that anyone could provide feedback or suggestions for improvement, or use it as a reference for their own ideas. I hope it will inspire you to pursue your own ML goals.

For more information, please check out the project repo.

Data Cleaning and Exploration

If you take a look at the original data, you might think someone played a bad joke on Kaggle users: 79 descriptive variables, 43 of which are non-numerical categorical, many without any clear order or hierarchy of values. The attached data description file is a bit messy, with typos here and there.

Some non-numerical variables have clearly ordered values (e.g. good, bad, excellent), so integer encoding and handling of missing values is easy. For unordered, non-numerical categorical variables, a good general strategy is one-hot encoding; in this case, however, it would produce a dataset with hundreds of columns, most of which would be of little consequence.
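As a minimal sketch of the two strategies (assuming the Kaggle train.csv file, the usual Po/Fa/TA/Gd/Ex quality scale from the data description, and the KitchenQual and Neighborhood columns):

```python
import pandas as pd

# Kaggle training data for the Ames house prices competition
df = pd.read_csv("train.csv")

# Ordered categorical: map the quality scale to integers explicitly
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual_enc"] = df["KitchenQual"].map(quality_scale)

# Unordered categorical: one-hot encoding, at the cost of many extra columns
neighbourhood_dummies = pd.get_dummies(df["Neighborhood"], prefix="Nbhd")
```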

See the dataset documentation for information on the variable values.

I decided to perform a thorough analysis of all the variables, together with some data exploration, in order to find an ordering of the variable values where one is present but not explicit. As an example, building materials have different costs and therefore an impact on the sale price, so it makes sense to encode them in a price-dependent order. Note that this is an arbitrary choice, and other strategies could be used instead.
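As a rough illustration of that idea (not the exact encoding used in the project), a categorical column such as Exterior1st can be ranked by the mean sale price of its categories and encoded with that rank:

```python
# df: the training data loaded in the previous sketch
# Rank the categories of Exterior1st by their mean SalePrice and use the rank as the code
price_order = (
    df.groupby("Exterior1st")["SalePrice"]
      .mean()
      .rank(method="dense")
      .astype(int)
)
df["Exterior1st_enc"] = df["Exterior1st"].map(price_order)
```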

I also wanted to see what the target variable, SalePrice, looked like: it fits a skewed normal distribution nicely, which is consistent with the house price distribution at national level. This suggests that the data engineering package and the predictive model developed within this project are easily exportable, rather than specific to the town in this dataset.
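A quick way to check this, assuming df holds the training data as above, is to fit a skew-normal distribution with scipy and overlay it on the price histogram:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import skewnorm

prices = df["SalePrice"].values  # df: the training data loaded earlier

# Fit a skew-normal distribution to the sale prices
a, loc, scale = skewnorm.fit(prices)

# Overlay the fitted density on the price histogram
x = np.linspace(prices.min(), prices.max(), 200)
plt.hist(prices, bins=50, density=True, alpha=0.5, label="SalePrice")
plt.plot(x, skewnorm.pdf(x, a, loc, scale), label="skew-normal fit")
plt.legend()
plt.show()
```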

Feature Engineering

Location, location, location

The location information for each house is provided as a code referring to a neighbourhood (or a geographic reference). To make some sense of it, I filled a lookup table with the actual neighbourhood names and calculated the distance to the town centre, obtaining the neighbourhood coordinates with the geopy package.
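A minimal sketch of that lookup, assuming a hypothetical neighbourhood_names dictionary mapping the dataset's neighbourhood codes to real place names, could use geopy's Nominatim geocoder and geodesic distances:

```python
from geopy.distance import geodesic
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="ames-house-prices")

# The town centre as a reference point
centre = geolocator.geocode("Ames, Iowa, USA")
centre_coords = (centre.latitude, centre.longitude)

# Hypothetical lookup from the dataset's neighbourhood codes to place names
neighbourhood_names = {"NAmes": "North Ames", "CollgCr": "College Creek"}  # etc.

coords, dist_to_centre = {}, {}
for code, name in neighbourhood_names.items():
    loc = geolocator.geocode(f"{name}, Ames, Iowa, USA")
    if loc is None:
        continue  # skip names Nominatim cannot resolve
    coords[code] = (loc.latitude, loc.longitude)
    dist_to_centre[code] = geodesic(centre_coords, coords[code]).km
```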

By using Google Maps tools, such as the gmaps package, I was able to generate a house price heat map of Ames, which you can find at the top of this post.
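For reference, a weighted heat map with the jupyter-gmaps package looks roughly like this (assuming a valid Google Maps API key and lists of per-house coordinates and prices):

```python
import gmaps

gmaps.configure(api_key="YOUR_GOOGLE_MAPS_API_KEY")  # assumption: a valid API key

# locations: list of (lat, lon) tuples per house; prices: the matching sale prices
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations, weights=prices))
fig  # renders inline in a Jupyter notebook
```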

In this way, the method is easily replicable for any location of interest that, in a town of choice, may have an impact on property prices.

Dimensionality reduction

If you take a close look at the dataset, you will see that a few of the 79 descriptive variables are redundant, or not relevant in most cases. As an example, there are several quality scores for different parts of the house, and it is reasonable to assume that if a house's overall quality score is high, all or most of the other scores will be high too; likewise, the floor area of a second floor is only relevant if there is a second floor, which is a minority of cases.

Another reason to perform feature reduction is to get a first indication of the important features in the dataset, i.e. those that carry useful information (in terms of variance) and that can help with model explainability.

The main goal of this part of the project was to reduce the dimensionality of the data and, as a learning exercise, I decided to try a few routinely used methods (sketched in code after this list):

  • selection of the features highly correlated (i.e. corr. coefficient >0.5) with the target;
  • identification of highly correlated variables (i.e. corr. coefficient >0.8), and discarding of redundancies;
  • principal component analysis, limiting the number of components to retain 95% of the sample variance;
  • feature selection through a wrapper function, by backward elimination (fit the data with a regressor function, and iteratively remove the least significant feature) and recursive feature elimination;
  • feature selection through LassoCV.
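Here is a rough sketch of how these methods look in pandas and scikit-learn, assuming a fully numeric, encoded frame df; the correlation thresholds match the list above, while the number of features kept by RFE is just an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV, LinearRegression

X = df.drop(columns=["SalePrice"])  # assumes a fully numeric, encoded frame
y = df["SalePrice"]

# 1. Keep features strongly correlated with the target
corr_with_target = X.corrwith(y).abs()
strong = corr_with_target[corr_with_target > 0.5].index.tolist()

# 2. Drop one of each pair of highly inter-correlated features
corr_matrix = X[strong].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
redundant = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_corr = X[strong].drop(columns=redundant)

# 3. PCA retaining 95% of the sample variance
X_pca = PCA(n_components=0.95).fit_transform(X)

# 4. Recursive feature elimination with a linear regressor (20 features is illustrative)
rfe = RFE(LinearRegression(), n_features_to_select=20).fit(X, y)
X_rfe = X.loc[:, rfe.support_]

# 5. LassoCV: keep the features with non-zero coefficients
lasso = LassoCV(cv=5).fit(X, y)
X_lasso = X.loc[:, lasso.coef_ != 0]
```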

I packaged the feature engineering methods neatly in a module, available in the project repository, so that anyone can apply them to a dataset structured like this one, should they happen to find one.

A Baseline Model

I used a simple linear regression model to assess the dimensionality-reduced datasets and to provide a baseline for the predictive model. The Kaggle competition ranks submissions for this project according to the root mean square error of the logarithm of the prices, so that the evaluation metric is not unbalanced towards higher prices.
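A sketch of that baseline and metric, on a hypothetical train/test split of the encoded data (X and y as in the earlier sketch), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y: encoded feature matrix and SalePrice target from the earlier sketch
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = LinearRegression().fit(X_train, y_train)
pred = np.clip(baseline.predict(X_test), 1, None)  # guard against non-positive predictions

# Kaggle-style metric: RMSE of the logarithm of the prices
rmse_log = np.sqrt(mean_squared_error(np.log(y_test), np.log(pred)))
# RMSE in dollars, for a more tangible figure
rmse_dollars = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE(log price): {rmse_log:.3f}, RMSE($): {rmse_dollars:,.0f}")
```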

The attached table shows how the baseline model did when tried with the complete dataset and with the dimensionality-reduced ones. As you can see, a few datasets perform better than the others, and those are the ones I used for my predictive model.

Again, this is a subjective choice: you could use just one of the dimensionality-reduced datasets, or all of them. Here I am merely showing my evidence-based decision process.

My Predictive Model

For my predictive model I chose the Stochastic Gradient Descent (SGD) regressor, following the advice of the scikit-learn algorithm map. Compared to other algorithms, SGDRegressor is relatively simple to implement and tune, well suited to this kind of dataset, and it provides the feature coefficients of the fitted function, which can be useful for analysis.
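In scikit-learn this amounts to something like the following sketch (not the tuned model from the project; SGD is sensitive to feature scale, so a scaling step is included):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train: from the train/test split in the baseline sketch
sgd_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=10000, tol=1e-4, random_state=42),
)
sgd_model.fit(X_train, y_train)

# Coefficients of the fitted linear function, useful for feature analysis
coefs = sgd_model.named_steps["sgdregressor"].coef_
```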

There are many valid alternative algorithms for this kind of problem, such as LightGBM, CatBoost and XGBoost, and if anyone out there would like to try them, please let me know how it goes. I tried building a simple neural network predictor, but it consistently underperformed with respect to SGD. I am a big fan of Random Forests (RF), so I might add an RF-based predictive model to the project at some point.

After a round of parameter optimisation with GridSearchCV, I generated train/test splits of the datasets to test my model, and I chose the two best-performing versions of the dataset (with dimensionality reduced through backward elimination and principal component analysis, respectively) to make my submission to the Kaggle competition.
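The parameter search might look roughly like this; the grid below is an illustrative guess rather than the one actually used in the project:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the step name "sgdregressor" comes from make_pipeline
param_grid = {
    "sgdregressor__alpha": [1e-4, 1e-3, 1e-2],
    "sgdregressor__penalty": ["l2", "l1", "elasticnet"],
    "sgdregressor__learning_rate": ["invscaling", "adaptive"],
}

search = GridSearchCV(sgd_model, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)
best_sgd = search.best_estimator_
```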

From the residuals plot, it can be seen that the model scores might be thrown off by outliers, which may correspond to houses sold in particular circumstances not captured by the descriptive features, or best captured by some of the discarded features.

The drawback of PCA is that the new components cannot easily be related to any feature in the initial dataset; while this may work well enough for a predictive model, it makes it impossible to gain any insight from the feature coefficients. Luckily, I can use the dataset reduced through backward elimination and obtain information on feature importance.

Please refer to the data description file for more details about the features.

Unsurprisingly, the main features affecting the house price are quality scores (overall quality, kitchen quality, etc.) and size (total rooms above ground, ground living area, etc.), whereas the age of the house has a negative impact on the price, although a smaller one than quality and surface area.

Project Output and Key Takeaways

In this project I have taken a commercial dataset, manipulated it, and used it to build a model that can predict house prices with a root mean square error of roughly $39k, as returned by the Kaggle competition submission evaluation page; that is approximately 24% of the median house price within the dataset.

The SGD predictive model does not seem to perform very differently from the baseline model, which could mean that I am not making the most of the algorithm's potential. The baseline model, however, predicts house prices with a root mean square error of $69k, which probably means it is overfitting. As a next step, it would be interesting to push the model optimisation a bit further, or to try new algorithms.

While working on this dataset, I used quite a few data science techniques and key tools for data cleaning, feature engineering, modelling and data analysis. At the end of the project, I saved the functions and algorithms used in a test package, which I uploaded to TestPyPI.

About me

I am a Data Scientist currently looking for new opportunities. In the past few years I have been working on applied quantum technologies for space applications. Feel free to send me a message if you'd like to talk about this or other projects on my GitHub page.

GitHub: https://github.com/RaffaToSpace

LinkedIn: https://www.linkedin.com/in/raffaele-nolli-581a4351/
