Mastering Regression Techniques: Paris Housing Price Prediction
Predicting house or item prices with linear regression is often considered the 'hello world' of machine learning: it introduces beginners to many of the fundamental concepts and terminologies used in the field. A couple of days ago, I participated in my first-ever competition on Kaggle, the Playground Series, Season 3 Episode 6. Though the competition had already closed, my goal was never the leaderboard; it was an opportunity to dive deep into regression, cultivate practical skills, and understand the nuances of predictive modeling. In this article, I'll take you through my experience with the Paris Housing Price Prediction Kaggle competition, sharing insights into my approach, the dataset, a notebook walk-through, and the models I explored.
Overview
The competition is part of the Kaggle Playground Series. The goal of the series is to give the Kaggle community a variety of fairly lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. You can find my project notebook here.
Dataset
The dataset was synthetically generated by training a deep learning model on the original Paris Housing Price Prediction data. Although synthetic, it closely resembles the real-world data without publicly revealing its test labels.
Notebook Walk-through
Below is an outline of the steps I carried out to complete the project:
- I started by downloading the datasets directly from Kaggle using the `opendatasets` library. `opendatasets` is a Python library for downloading datasets from online sources like Kaggle and Google Drive with a single Python command, which saves the time of manually downloading and extracting CSV files (see the first sketch after this list).
- After downloading the data, I performed some data assessments, checking for missing values, invalid column datatypes, duplicated rows, and so on. In addition, I looked at summary statistics of the data, including the standard deviation and percentiles (second sketch below).
- Next, I performed exploratory data analysis (EDA) using plotly, seaborn, and matplotlib to uncover insights and relationships in the data. During the EDA, I visualized how the different columns correlate with the target column; in the correlation heatmap, the color side bar on the far right shows each column's correlation coefficient. The size of a house in square meters seems to be the major factor determining its price (see the EDA sketch below).
- Part of my EDA also included checking for potential outliers in the data (covered in the same EDA sketch).
- Moving on from EDA, I preprocessed the data for modeling: first removing outliers using quartiles and percentiles, then scaling the numeric columns with the MinMaxScaler from sklearn, encoding the categorical columns, and finally splitting the dataset into training and test sets (see the preprocessing sketch below).
- Finally, I trained and evaluated five different models on the data using appropriate evaluation metrics; the Model section below includes a sketch of this comparison.
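For readers who want to reproduce these steps, here are a few minimal sketches. First, the download. The competition slug `playground-series-s3e6` is my assumption based on the series naming, so adjust it to the actual competition URL:

```python
# Sketch: download the competition files with opendatasets.
# The slug "playground-series-s3e6" is assumed, not confirmed.
import opendatasets as od

# Prompts for your Kaggle username and API key on first run,
# then downloads and extracts the files into ./playground-series-s3e6
od.download("https://www.kaggle.com/competitions/playground-series-s3e6")
```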
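Next, the data assessment with pandas. `train.csv` is the usual file name in these competitions, though I'm assuming the exact path here:

```python
import pandas as pd

train = pd.read_csv("playground-series-s3e6/train.csv")  # assumed path

# Structural checks: datatypes, missing values, duplicates
train.info()                     # column dtypes and non-null counts
print(train.isna().sum())        # missing values per column
print(train.duplicated().sum())  # number of duplicated rows

# Summary statistics: mean, standard deviation, percentiles, etc.
print(train.describe())
```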
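For the EDA, a correlation heatmap plus a couple of box plots for outlier spotting, roughly along these lines. `squareMeters` and `price` are assumed column names from the Paris housing data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap: the color bar encodes each pair's coefficient
corr = train.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between features and price")
plt.show()

# Box plots to spot potential outliers in key columns
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=train["squareMeters"], ax=axes[0])  # assumed column name
sns.boxplot(x=train["price"], ax=axes[1])         # assumed column name
plt.tight_layout()
plt.show()
```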
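The preprocessing might look like the following. The notebook outline doesn't state the exact percentile cutoffs, so the 1st/99th percentiles here are an assumption; note also that I fit the scaler on the training split only, to avoid leakage:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Remove outliers by keeping prices within the 1st-99th percentile
# (the actual cutoffs used in the notebook may differ)
low, high = train["price"].quantile([0.01, 0.99])
train = train[train["price"].between(low, high)]

# Separate features and target ("id" is an assumed index column)
X = train.drop(columns=["id", "price"])
y = train["price"]

# One-hot encode any categorical columns
# (the Paris data is mostly numeric, so this may be a no-op)
X = pd.get_dummies(X, drop_first=True)

# Split first, then scale numeric features to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform only: no leakage from test set
```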
Model
I trained five different regression models on the dataset: linear regression, ridge regression, random forest, LightGBM, and XGBoost. An R2 score close to 1 was not surprising, given how the data was generated.
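Here is a minimal sketch of the comparison, assuming each library's default hyperparameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "LightGBM": LGBMRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

# Fit each model and report RMSE and R2 on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE = {rmse:,.2f}, R2 = {r2_score(y_test, preds):.4f}")
```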
After comparing the models, I performed hyperparameter tuning on the XGBoost model. At the end of the model-building phase, the tuned XGBoost performed best, achieving the lowest root mean squared error of 135,002.53.
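A grid search is one way to do this tuning; the parameter grid below is purely illustrative, since the notebook's actual search space isn't stated here:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical search space; the actual grid in the notebook may differ
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best model
```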
Conclusion
Completing this project was a gentle introduction to regression techniques. Going forward, I aim to learn more techniques for building optimal, better-performing models and, hopefully, ship one to production 😫