House Prices Analysis: Advanced Regression Techniques

Akhil Sharma
10 min read · Jul 16, 2023

Table of Contents:

  1. HOUSE PROBLEM
  2. EXPLORATORY DATA ANALYSIS
  3. DATA PREPARATION
  4. BUILDING MODELS
  5. MODEL PERFORMANCE SUMMARY
  6. SIMULATION OF FINAL SUBMISSION

1. HOUSE PROBLEM

Description

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

In this competition, participants are presented with a comprehensive dataset related to house prices. The dataset includes various features and attributes of residential properties, such as the number of bedrooms, square footage, location, and other relevant factors.

The dataset was carefully curated to provide a diverse representation of houses from different regions. It aims to capture real-world scenarios and challenges faced in predicting house prices accurately. The dataset also contains additional information such as sales dates and prices, allowing participants to analyze trends and patterns over time.

Participants are encouraged to leverage advanced regression techniques to develop predictive algorithms that can accurately estimate house prices based on the given features. The objective is to create models that can generalize well and effectively capture the complex relationships between the independent variables and the target variable, which is the sale price of the house.

The competition aims to showcase the ability to predict house prices reliably using the provided dataset. Successful models would shed light on the factors that significantly influence house prices, providing valuable insights for real estate professionals, home buyers, and sellers alike. Additionally, the competition fosters the exchange of innovative approaches and techniques in the field of regression analysis and housing market research.

2. EXPLORATORY DATA ANALYSIS

Data Description

As you can see in the image, the House Prices: Advanced Regression Techniques dataset is used, which contains 4 files:

  1. train.csv (your training data).
  2. test.csv (your testing data).
  3. data_description.txt (descriptions of the data attributes, e.g. which categories a particular attribute can take).
  4. sample_submission.csv (a sample submission file showing the format your predictions should follow).

After extracting the files, here's the training data.
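A minimal loading sketch (the file paths are assumptions; point them at wherever you extracted the archive):

import pandas as pd

# assumed file paths; change them to match your extracted folder
df = pd.read_csv('train.csv')    # training data
test = pd.read_csv('test.csv')   # test data used later for the final submission

df.head()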

Analysis on Data

df.isnull().sum().sort_values(ascending=False)

Using the isnull() function, we can see that most entries of the PoolQC column are empty, so we can drop this column from the main dataset.

  1. MiscFeature: miscellaneous features not covered in other categories. Here, too, most entries are null.
  2. The Electrical column does not improve the accuracy of the model, so we can remove it during feature selection.

Note: Bedroom and Kitchen are not separate parameters in the dataset; they are just abbreviations used in the data.

The Alley column is also mostly empty and likewise does not improve the accuracy of the model.
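As a quick way to see this at a glance, here is a small sketch (my addition, assuming df is the training DataFrame loaded from train.csv) that lists the fraction of missing values per column and flags the ones that are almost entirely empty:

# fraction of missing values per column, highest first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(10))

# columns that are more than ~90% empty are natural candidates for dropping
print(missing_ratio[missing_ratio > 0.9].index.tolist())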

Next, I tried to find:

1. Categorical columns having only one value, which will not be helpful

droped_columns_2 = []
for col, num in zip(df.astype('object').nunique().index, df.astype('object').nunique().values):
    if num == 1:
        droped_columns_2.append(col)

droped_columns_2

Output [‘Alley’, ‘PoolQC’, ‘Fence’, ‘MiscFeature’]

Relation of Alley with Other Features

2. Checking for duplicate rows

Luckily, no duplicates were found 😎

df.duplicated().sum()

Let's try exploring the numerical features.

Here we focused on multicollinearity, since highly correlated features add redundant information that the model does not need.

Used a heatmap for feature selection.
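Here is a short sketch of the kind of heatmap used for this step (my addition, assuming matplotlib and seaborn are available); the pairs it flags are the ones listed below:

import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix of the numerical features only
corr = df.select_dtypes('number').corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation heatmap of numerical features')
plt.show()

# list feature pairs whose absolute correlation is above ~0.8
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and abs(corr.loc[a, b]) > 0.8]
print(high_pairs)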

High-correlation feature pairs:

['YearBuilt', 'GarageYrBlt'], ['1stFlrSF', 'TotalBsmtSF'], ['GarageCars', 'GarageArea'], ['GrLivArea', 'TotRmsAbvGrd']

df.drop(columns=['GarageYrBlt','1stFlrSF','GarageArea','TotRmsAbvGrd'],inplace=True)

Catching the Outlier Phase 🧐

Outliers are data points that significantly deviate from the overall pattern of a dataset. They are observations that are distant from other observations and can have a disproportionate impact on statistical analyses.

These can affect the accuracy of any model because they inflate the variance and pull the mean away from the typical value, leading to biased results.

Features such as 'TotalBsmtSF' and 'GrLivArea' have a lot of outliers 🤧🤧

Let's catch them 😎

df.drop(columns='Id', inplace=True)
mask1 = df['TotalBsmtSF'] < 2050
mask2 = df['TotalBsmtSF'] > 100
mask3 = df['GrLivArea'] < 2800
mask4 = df['GarageCars'] < 3.8
mask5 = df['OverallQual'] > 1.8
DF = df[mask1 & mask2 & mask3 & mask4 & mask5]
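The cut-offs above were read off the plots by hand. A less manual alternative, sketched here for illustration only (the notebook itself does not use it), is the standard IQR rule:

# the standard IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def iqr_bounds(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = iqr_bounds(df['GrLivArea'])
print(f"GrLivArea kept roughly in ({low:.0f}, {high:.0f})")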

Lets Examine the Skewness of Data

Skewness is a statistical measure that quantifies the asymmetry of a probability distribution over a dataset. It provides information about the shape of the distribution and the extent to which it deviates from a symmetric, bell-shaped curve.

Skewed data affects the performance of a model, so let's look at the distribution of the features and fix the skewed ones.

Note: 
Data is symmetrical: skewness is between -0.5 and 0.5
Data is slightly skewed: skewness is between -1 and -0.5 or 0.5 and 1
Data is highly skewed: skewness is less than -1 or greater than 1.
# select columns with |skew()| > 1 (highly skewed)

sk = []
for i in df.drop(columns=['YearBuilt', 'YearRemodAdd']).select_dtypes('number').columns:
    if abs(df[i].skew()) > 1:
        sk.append(i)

# update values of skewed columns from x to log(x); zeros are left as 0
np.seterr(divide='ignore')
sk_ = pd.DataFrame(np.select([DF[sk] == 0, DF[sk] > 0, DF[sk] < 0],
                             [0, np.log(DF[sk]), np.log(DF[sk])]),
                   columns=sk).set_index(DF.index)
df_skew = DF.drop(columns=sk).set_index(DF.index)

df_skew = pd.concat([df_skew, sk_], axis=1)
X_train_skew = df_skew.drop(columns='SalePrice')


# select columns with |skew()| > 1 (highly skewed) in the test set

sk_t = []
for i in test.drop(columns=['YearBuilt', 'YearRemodAdd']).select_dtypes('number').columns:
    if abs(test[i].skew()) > 1:
        sk_t.append(i)

sk_t_ = pd.DataFrame(np.select([test[sk_t] == 0, test[sk_t] > 0, test[sk_t] < 0],
                               [0, np.log(test[sk_t]), np.log(test[sk_t])]),
                     columns=sk_t).set_index(test.index)
df_skew_t = test.drop(columns=sk_t).set_index(test.index)

df_skew_t = pd.concat([df_skew_t, sk_t_], axis=1)
X_test_skew = df_skew_t.reindex(columns=X_train_skew.columns)
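To sanity-check the transformation, a quick before/after comparison of skewness can be done like this (my addition; the three column names are only examples):

# quick sanity check: skewness before (DF) and after (df_skew) the log transform
for col in ['GrLivArea', 'LotArea', 'TotalBsmtSF']:
    if col in sk:
        print(col,
              'before:', round(DF[col].skew(), 2),
              'after:', round(df_skew[col].skew(), 2))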

However, I did not replace the original data with the skew-corrected version right away. I wanted to train the models on both the transformed data and the original data and compare (in my own runs, the transformed data gave slightly better accuracy). If it does better, great; if not, we simply continue with the original data 😋.

3. DATA PREPARATION

Description

Imputation of Missing Values

Missing values in the categorical features were handled separately.

# start with Categorical Features
from sklearn.impute import SimpleImputer
imp_cat = SimpleImputer(strategy="most_frequent")

X_Cat= df_final.select_dtypes('object')
X_data_cat = imp_cat.fit_transform(X_Cat)

Missing values in the numerical features were handled separately.

# Numerical Features
from sklearn.impute import SimpleImputer
imp_num = SimpleImputer()

X_num=df_final.select_dtypes('number')
X_data_num = imp_num.fit_transform(X_num)

One-Hot Encoding

“Hot cats, cool dogs! One-hot encoding: making categorical data ML-friendly. No bias, just distinct columns. Meow, bark, woof! Let’s convert categories into numerical paw-someness. No more cat-astrophes in algorithms! Embrace the power of one-hot encoding and unleash purr-fect predictions!”

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first',sparse_output=False)
df_ohe = ohe.fit_transform(X_Cat)

# At the end, concatenate all the data using numpy.concatenate()
X_Data=np.concatenate([df_ohe,X_data_num],axis=1)
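The same fitted imputer and encoder should be applied to the Kaggle test features with transform() rather than fit_transform(), so the test matrix ends up with exactly the columns learned from training. A rough sketch, assuming df_final_test is the test frame prepared the same way as df_final (the name is mine):

import numpy as np

# mirror the training pipeline on the test features: transform() only, no re-fitting
X_Cat_test = df_final_test.select_dtypes('object')
X_num_test = df_final_test.select_dtypes('number')

df_ohe_test = ohe.transform(X_Cat_test)         # unseen categories may require handle_unknown='ignore' on the encoder
X_num_test_imp = imp_num.transform(X_num_test)  # impute with the statistics learned on training data

X_Data_test = np.concatenate([df_ohe_test, X_num_test_imp], axis=1)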

4. BUILDING MODELS

A heads-up: I implemented all the ML models I knew on this data, analyzed how well each fits, and at the end picked the model giving the best Root Mean Squared Error (RMSE) possible for this dataset. 🙂🙂

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_Data, df['SalePrice'],
                                                    test_size=0.2)

List of Implemented Models

  1. KNN Algorithm
  2. Naive Bayes Model
  3. Support Vector Machine (SVM)
  4. Decision Tree
  5. Random Forest
  6. Gradient Boosting

1. KNN Algorithm

“KNN, the friendly neighbor of machine learning. It’s a simple yet powerful algorithm. KNN stands for k-nearest neighbors. It classifies data based on its closest neighbors in feature space. Whether predicting genres or identifying anomalies, KNN’s got your back. Just count the neighbors, find consensus, and make accurate predictions. It’s like having a helpful neighbor next door in the world of ML!”

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

Hyper-Parameter Tuning

error_rate = []

# Will take some time
for i in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # RMSE of the predictions for this value of K
    error_rate.append(np.sqrt(np.mean((pred_i - y_test) ** 2)))

# Plotting the error rate to infer a value for K
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

Plotted error rate vs. K value

K = 2 gives the minimum error rate.
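An equivalent, more standard way to pick K would be a cross-validated grid search over the same range (a sketch, not what the notebook does):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# cross-validated search over the same K range, scored by (negative) RMSE
param_grid = {'n_neighbors': list(range(1, 40))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid,
                    scoring='neg_root_mean_squared_error', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best K found by cross-validation
print(-grid.best_score_)   # corresponding cross-validated RMSE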

Fitting the model with K = 2

knn3 = KNeighborsRegressor(n_neighbors=2)

knn3.fit(X_train, y_train)
knn_pred = knn3.predict(X_test)

print("R² score of KNN (%) --> ", knn3.score(X_test, y_test)*100)

Output: R² score of KNN Model (%) → 63.436252589277544

Analysis on the Model : “KNN can struggle in regression due to its reliance on nearby neighbors. It lacks flexibility in capturing complex relationships, leading to suboptimal predictions and higher RMSE values.”

2. Naive Bayes Model

“Naive Bayes, the ‘naive’ rockstar of ML. With its simple assumption of independence between features, it classifies data swiftly and efficiently. It’s like a magician pulling rabbits out of a hat, predicting with speed and ease. From spam filtering to sentiment analysis, Naive Bayes weaves its magic, making it a popular choice in many ML applications.”

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB

Mb = MultinomialNB()
Bb = BernoulliNB()
Gb = GaussianNB()

Fitting the Model on the DataSet

Mb.fit(X_train, y_train)
y_pred = Mb.predict(X_test)
print("Accuracy of MultinomialNB (%) --> ", Mb.score(X_test, y_test)*100)

Output: Accuracy of MultinomialNB (%) → 0.3424657534246575

Bb.fit(X_train, y_train)
bb_pred2 = Bb.predict(X_test)
print("Accuracy of BernoulliNB (%) --> ", Bb.score(X_test, y_test)*100)

Output: Accuracy of BernoulliNB (%) → 1.36986301369863

Gb.fit(X_train, y_train)
y_pred3 = Gb.predict(X_test)
print("Accuracy of GaussianNB (%) --> ", Gb.score(X_test, y_test)*100)

Output: Accuracy of GaussianNB (%) → 1.36986301369863

Analysis on the Model : “Naive Bayes is ill-suited for regression, as it assumes independence between features and has limited capability to capture complex relationships, resulting in very poor scores when forced into a regression setting. It is better suited to tasks such as sentiment analysis.”

3. Support Vector Machine (SVM)

“Support Vector Machine (SVM) is a powerful machine learning model that maximizes the margin between classes, making it effective for both classification and regression tasks. It finds an optimal hyperplane to separate data points and can handle complex decision boundaries. SVM achieves remarkable performance by utilizing support vectors and kernel functions for efficient and accurate predictions.”

from sklearn import svm
clf = svm.SVC(decision_function_shape='ovo')

Fitting the Model

clf.fit(X_train, y_train)
svm_pred = clf.predict(X_test)

print("Accuracy of SVM (%) --> ", clf.score(X_test, y_test)*100)

Output: Accuracy of SVM (%) → 2.054794520547945

Analysis on the Model : “SVM: strong in classification, weak here as a regressor. Its focus on maximizing the margin between classes makes it less effective for continuous prediction, and it struggles to capture complex relationships, leading to very poor scores on this task.”
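For completeness, scikit-learn also ships a regression variant, svm.SVR, which is the more natural choice for a continuous target; a minimal sketch (not part of the original notebook, and the C value is illustrative, not tuned):

from sklearn.svm import SVR

# SVR is the support-vector regressor; C controls regularisation strength
svr = SVR(kernel='rbf', C=100.0)
svr.fit(X_train, y_train)
print("R² score of SVR (%) --> ", svr.score(X_test, y_test) * 100)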

4. Decision Tree

“Decision tree: Your AI guide through data’s wilderness. This model maps features to outcomes, making decisions branch by branch. It learns from data and splits its way to clarity, uncovering insights in a tree-like structure. With its intuitive and interpretable nature, the decision tree empowers us to navigate the complexities of data and make informed choices.”

from sklearn import tree
clf_tree = tree.DecisionTreeRegressor()

Fitting the Model

clf_tree.fit(X_train, y_train)
decision_pred = clf_tree.predict(X_test)

print("R² score of Decision Tree (%) --> ", clf_tree.score(X_test, y_test)*100)

Output: R² score of Decision Tree (%) → 73.91510370312822

Analysis on the Model : “Branching out to low RMSE! Decision trees excel in regression with their ability to capture nonlinear relationships and handle both continuous and categorical features. By recursively partitioning data, decision trees create precise predictions, making them a powerful tool for minimizing RMSE and achieving accurate regression results.”

5. Random Forest

“Nature’s own ensemble of decision trees. This powerful machine learning model combines the wisdom of many trees to make accurate predictions. It handles high-dimensional data, avoids overfitting, and provides feature importance. With randomness and teamwork, the Random Forest model is a forest full of predictive power, conquering the wilderness of data”

from sklearn.ensemble import RandomForestRegressor
clf_random_tree = RandomForestRegressor(max_depth=25, random_state=0,n_estimators=1000)

During hyper-parameter tuning, max_depth=25 and n_estimators=1000 were estimated to be good values for this model.
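One way such values can be estimated is a small randomized search over the two parameters (a sketch for illustration; the notebook fixed them by hand):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# small randomized search over tree depth and number of trees
param_dist = {'max_depth': [10, 15, 20, 25, 30],
              'n_estimators': [200, 500, 1000]}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_dist,
                            n_iter=10, cv=3,
                            scoring='neg_root_mean_squared_error',
                            random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)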

Fitting the Model on DataSet

clf_random_tree.fit(X_train, y_train)

random_pred = clf_random_tree.predict(X_test)

print("R² score of Random Forest Model (%) --> ", clf_random_tree.score(X_test, y_test)*100)

Output: R² score of Random Forest Model (%) → 86.71510582796937

Analysis on the Model : “Unleash the power of Random Forest! This versatile ensemble model excels in regression tasks, delivering impressive RMSE scores. By combining multiple decision trees and aggregating their predictions, Random Forest captures complex relationships and reduces overfitting. It handles high-dimensional data, handles missing values, and provides feature importance insights. Experience the magic of Random Forest and achieve remarkable regression results with minimal hassle.”

6. Gradient Boosting

“Boost your predictions with Gradient Boosting! This powerful machine learning model combines weak learners into a strong one. It iteratively improves predictions by focusing on previous mistakes. Gradient Boosting handles various data types, handles complex interactions, and excels in regression and classification tasks. Get ready to elevate your models to new heights with Gradient Boosting!”

from sklearn.ensemble import GradientBoostingRegressor
clf_boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

During hyper-parameter tuning, max_depth=1, n_estimators=100, and learning_rate=1.0 were estimated to be good values for this model.

Fitting on the DataSet

clf_boosting.fit(X_train, y_train)
boosting_pred = clf_boosting.predict(X_test)

print("R² score of Gradient Boosting (%) --> ", clf_boosting.score(X_test, y_test)*100)

Output: R² score of Gradient Boosting Model (%) → 86.71510582796937

Analysis on the Model : “Experience exceptional regression with Gradient Boosting! This versatile model leverages the power of boosting to iteratively refine predictions and minimize the root mean squared error (RMSE). By effectively capturing complex relationships in data, Gradient Boosting excels in regression tasks, delivering accurate and reliable results. Say goodbye to high RMSE and embrace the power of Gradient Boosting for exceptional regression performance!”

This also turns out to be the best fit.
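Since .score() returns R² rather than RMSE, the RMSE in the target's own units can be computed explicitly from the predictions above (a small addition for clarity):

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the Gradient Boosting predictions in SalePrice units
rmse = np.sqrt(mean_squared_error(y_test, boosting_pred))
print("RMSE of Gradient Boosting --> ", rmse)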

5. MODEL PERFORMANCE SUMMARY

import numpy as np

# scores collected for the six models above (one value per model)
scores = [0.8640396610218306, 0.8705751147837917, 0.8118592767468824,
          -0.18955898672630567, 0.2003347520496762, 0.6470277623092952]

best_score = -np.inf
for s in scores:
    if s > best_score:
        best_score = s

print(f"Best score: {best_score:.3f}")

The best model is Gradient Boosting, with a score of 86.059%.

6. SIMULATION OF FINAL SUBMISSION

Predicting a simulated final submission with the Gradient Boosting algorithm, here on the hold-out split X_test (the sketch after the code shows the flow for the real Kaggle test set):

testing = clf_boosting.predict(X_test)
output = pd.DataFrame(testing).astype(int)

# Saving the Output in the form of .csv
output.columns = ['SalePrice']
Id_list = np.arange(1461,2920).astype(int)
Id = pd.DataFrame(Id_list)
Id.columns = ['Id']
submission = pd.concat([Id, output], axis = 1)
submission.to_csv('/content/drive/MyDrive/house-prices-advanced-regression-techniques/submission.csv', index = False, header = True)
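For the actual Kaggle submission, the model should predict on the processed test.csv features rather than the hold-out split, and the Ids should come from the test file itself. A sketch of that flow, assuming X_Data_test is the test matrix built with the same imputation and encoding as X_Data (the name is mine) and that test still carries its Id column:

import pandas as pd

# predict on the processed Kaggle test matrix and align the Ids from test.csv
test_pred = clf_boosting.predict(X_Data_test)

submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': test_pred})
submission.to_csv('submission.csv', index=False)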

About me

Thank you so much for reading my article! Hi, I’m Akhil Sharma, an IT&MI student at the Cluster Innovation Center, University of Delhi. If you have any questions, please don’t hesitate to contact me!

Email me at akhilsharma.off@gmail.com and feel free to connect with me on LinkedIn!

Follow me on GitHub: https://github.com/Akhil-Sharma30

Follow me on Twitter: DevelopAkhil.twitter

Colab NoteBook: https://github.com/Akhil-Sharma30/House-Pricing-Analysis/blob/main/house_prices_Analysis.ipynb
