Advanced House Price Prediction Kaggle Competition

Kalp Panwala
Published in Analytics Vidhya · Jun 1, 2020

Everyone reading this post has probably heard about Kaggle, with its wide range of datasets and competitions with great prizes. I was new to Kaggle competitions and didn't have any experience with them, so I thought I'd give it a try and headed to Kaggle for https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Firstly, I want to let you know this is my first Medium post, so forgive me for any mistakes. If you are unable to understand some steps or have any issues regarding the code, my mail id is at the bottom of the post and you can surely contact me there.

Here’s the GitHub link to the code.

I have used Google Colab; you can use any IDE you like, such as Jupyter Notebook.

Dataset

The House Prices: Advanced Regression Techniques dataset is used in this post. It contains 4 files:

  1. train.csv (which has your training data).
  2. test.csv (which has your testing data).
  3. data_description.txt (which describes the attributes of the data, e.g. which categories a particular attribute can take).
  4. sample_submission.csv (a sample submission file to show the format your predictions should follow).

After unzipping the files, we can load and inspect the train data.

There are 80 columns in the train data and 79 columns in the test data. We need to predict SalePrice using regression techniques, write the predicted values in the format of sample_submission.csv and upload it to Kaggle.
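Below is a minimal sketch of loading the files with pandas, assuming they sit in the working directory and using the same variable names (train_data, test_data) as the snippets that follow.

import pandas as pd

## file names as listed above
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
## train has the extra SalePrice column that we need to predict for test
print(train_data.shape, test_data.shape)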

I broke the competition down into 3 stages:

  1. Data Preprocessing.
  2. Feature Selection (Feature Selection actually comes under the Data Preprocessing part, but to keep the steps in order I am mentioning it as a separate step).
  3. Algorithm Selection for Regression.

Data Preprocessing

So let’s start with Data Preprocessing. For that we need to know about the data and its types; we can get info about our data through the .info() function.
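For example, on the training dataframe:

## column names, non-null counts and dtypes of the training data
train_data.info()
## how many columns are numeric vs. object (categorical)
print(train_data.dtypes.value_counts())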

We also need to know how many null values are present in our dataset, so I used the code below and found that 19 columns in the train data and 33 columns in the test data have null values.

## Count of top 20 Fields of train data having null values
train_data.isnull().sum().sort_values(ascending=False).iloc[:20]
## Count of top 35 Fields of test data having null values
test_data.isnull().sum().sort_values(ascending=False).iloc[:35]

Now, for null values, I follow some rules based on my experience:

  1. If you have a lot of data you can simply delete the rows having null values, but note that if some class to predict occurs quite rarely, then instead of deleting those rows you should replace the null values as described in step 4. This step is rarely used, as it can remove important data.
  2. Looking at null values attribute (column) wise, if some attributes have a larger proportion of null values than actual data, you can decide on a ratio and remove the attributes whose null-to-total ratio exceeds it (for example, if you pick 0.7 as the ratio and you have 1000 records of which 800 are null, you can drop the attribute). (refer to 3.ipynb below; a rough sketch also follows the snippets)
  3. You can also build a classifier to predict the null values; KNN is a widely used technique, but it is a very tiresome process.
  4. You might be wondering how to replace null values. For that, we look at the data types of the attributes: if the attribute is int or float, we can replace the null values with the mean of the attribute.
dataset[col].fillna(dataset[col].mean(),inplace=True)

If it is of object type, we can replace the null values with the most frequent value (the mode). This can be done in several ways; below I list two. Both give the same output.

i) using CategoricalImputer from sklearn_pandas

ii) by replacing with the mode (refer to 4.ipynb below)

dataset[col].fillna(dataset[col].mode()[0],inplace=True)
(snippet for point 2: 3.ipynb)
(snippet for point 4: 4.ipynb)
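The embedded notebooks are not reproduced here, so here is a rough sketch of points 2 and 4; the 0.7 cutoff is the example value from above, and the loop is illustrative rather than the exact notebook code.

## point 2: drop attributes whose null-to-total ratio exceeds a chosen cutoff
ratio = 0.7
null_ratio = train_data.isnull().sum() / len(train_data)
cols_to_drop = null_ratio[null_ratio > ratio].index
train_data.drop(columns=cols_to_drop, inplace=True)
test_data.drop(columns=cols_to_drop, inplace=True, errors='ignore')

## point 4: fill the remaining nulls column by column based on dtype
for dataset in [train_data, test_data]:
    for col in dataset.columns[dataset.isnull().any()]:
        if dataset[col].dtype == 'object':
            dataset[col].fillna(dataset[col].mode()[0], inplace=True)
        else:
            dataset[col].fillna(dataset[col].mean(), inplace=True)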

We can check the remaining count of null values using the code below.

## Top 5 Fields of test data having null values
test_data.isnull().sum().sort_values(ascending=False).iloc[:5]
## Top 5 Fields of train data having null values
train_data.isnull().sum().sort_values(ascending=False).iloc[:5]

Then I found that many attributes have, for example, 3 classes in the training data but more or fewer than 3 classes in the testing data. This would lead to mismatched columns once we convert the classes into one-hot encoded form, so the train and test data are concatenated into one dataframe first. The category_onehot_multcols function converts all the categorical variables of the dataframe passed to it into one-hot encoded form; the Dummy Variable Trap is also taken care of (drop_first=True).

## concatenate train (without the target) and test so both get the same dummy columns
total = pd.concat([train_data.drop(['SalePrice'], axis=1), test_data], axis=0)

## func to convert categorical columns into one hot encoder form
def category_onehot_multcols(multcolumns):
    df_final = total
    i = 0
    for fields in multcolumns:
        print(fields)
        df1 = pd.get_dummies(total[fields], drop_first=True)  # drop_first avoids the dummy variable trap
        total.drop([fields], axis=1, inplace=True)
        if i == 0:
            df_final = df1.copy()
        else:
            df_final = pd.concat([df_final, df1], axis=1)
        i = i + 1

    df_final = pd.concat([total, df_final], axis=1)
    return df_final

## passing the list of categorical column names (column) for one hot encoding
total = category_onehot_multcols(column)
save_cols = total.columns
Now that all categorical values are in numeric form, we can perform the most important step, which you should always remember: scale the data using StandardScaler from sklearn.preprocessing. It standardizes each feature to zero mean and unit variance. I tried predicting with and without StandardScaler and found quite a difference in the results, so I suggest you use it in regression problems.

from sklearn.preprocessing import StandardScaler

## scale the combined data
x = total.values
x = StandardScaler().fit_transform(x)
x = pd.DataFrame(x, columns=save_cols)

## renaming the 233 encoded columns (col0, col1, ...) to ease later operations
cols = []
for i in range(0, 233):
    name = "col" + str(i)
    cols.append(str(name))
x.columns = cols

## splitting the scaled total data back into train and test data
train = x.iloc[:1460]
test = x.iloc[1460:]

## getting the values to predict, i.e. SalePrice, into "y" and scaling it separately
y = train_data['SalePrice'].values
sc = StandardScaler()
y = pd.DataFrame(sc.fit_transform(y.reshape(-1, 1)))
y.columns = ['SalePrice']
train = pd.concat([train, y], axis=1)

Feature Selection

I used a correlation matrix to see how strongly each attribute is correlated with SalePrice (which is to be predicted).

Correlation coefficients are used in statistics to measure how strong the relationship between two variables is. There are several types of correlation coefficient; Pearson’s correlation (also called Pearson’s r) is the one commonly used in linear regression. A limitation of correlation coefficients is that they only work on numeric values and not on classes, which is exactly why we created the one-hot encoded form: it converts the categories into numeric columns so that we can compute correlation coefficients on them.

Correlation coefficient formulas are used to find how strong the relationship between data is. The formulas return a value between -1 and 1, where:

  • 1 indicates a perfect positive linear relationship.
  • -1 indicates a perfect negative linear relationship.
  • 0 indicates no linear relationship at all.

You can get the correlation coefficients with the following code:

train.corr().reset_index()[['index','SalePrice']]

I kept any attribute whose coefficient with SalePrice is greater than +0.15 or smaller than -0.15. I experimented with cutoffs from 0.6 down to 0.15, and 0.15 showed the best results. As you will see, I stored the columns that are most correlated with SalePrice in a variable called columns, which I use further on; a sketch of the filter is shown below.
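A minimal sketch of that filter, assuming the scaled train dataframe from above (with SalePrice attached) and the 0.15 cutoff:

## correlation of every encoded feature with SalePrice
corr = train.corr()['SalePrice']
## keep features whose absolute correlation with SalePrice exceeds 0.15
columns = corr[abs(corr) > 0.15].index.drop('SalePrice')
print(len(columns), 'features selected')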

Now comes the part of selecting a regression algorithm for our problem statement, and by this step you are almost done: data preprocessing is a very tiresome process, and according to some surveys data scientists spend around 90% of their time preprocessing data.

The list of algorithms I used:

1. Artificial Neural Networks (got a 0.21339 error score, around rank 4800, using the correlation-selected features, basic data preprocessing and 1000 epochs). After playing with the parameters, i.e. 760 epochs, I ended up with an ANN score of 0.18528, which was quite good. I noticed that adding dropouts to every layer did not work, as the error increased, so I added only one Dropout layer, which was the first layer.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU, PReLU, ELU
from tensorflow.keras.layers import Dropout
from tensorflow.keras import backend as K

# custom RMSE loss used when compiling the model
def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

# Initialising the ANN
classifier = Sequential()
# Adding the dropout layer (the only one, as the first layer)
classifier.add(Dropout(0.2))
# Adding the input layer and the first hidden layer (146 correlation-selected features)
classifier.add(Dense(50, kernel_initializer='he_uniform', activation='relu', input_dim=146))
# Adding the second hidden layer
classifier.add(Dense(25, kernel_initializer='he_uniform', activation='relu'))
# Adding the third hidden layer
classifier.add(Dense(50, kernel_initializer='he_uniform', activation='relu'))
# Adding the output layer
classifier.add(Dense(1, kernel_initializer='he_uniform', use_bias=True))
# Compiling the ANN
classifier.compile(loss=root_mean_squared_error, optimizer='Adamax')
# Fitting the ANN to the Training set (the selected features only)
model_history = classifier.fit(train[columns].values, y.values, validation_split=0.20, batch_size=10, epochs=760)

2. Next I used the Gradient Boosting algorithm and got a score of 0.16394, around rank 3400, but after changing parameters the score remained almost the same.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# X_train, y_train, X_test are assumed to come from a train/validation split of the training data
regressor = GradientBoostingRegressor(
    max_depth=10,
    n_estimators=500,
    learning_rate=1.0
)
regressor.fit(X_train, y_train)

# RMSE after each boosting stage, used to pick the best number of estimators
errors = [np.sqrt(mean_squared_error(y_train, y_pred))
          for y_pred in regressor.staged_predict(X_train)]
best_n_estimators = np.argmin(errors) + 1

best_regressor = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=best_n_estimators,
    learning_rate=1.0
)
best_regressor.fit(X_train, y_train)
y_pred = best_regressor.predict(X_test)

3. Next I used the Random Forest Regressor and achieved a new score of 0.16282, which was a slight improvement.

from sklearn.ensemble import RandomForestRegressor 

# create regressor object
regressor = RandomForestRegressor(n_estimators = 500, random_state = 0)

# fit the regressor with x and y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

4. Next I tried SVR (Support Vector Regression) and PCA with 2, 5, 10 and 20 components, but in vain; they were the worst performing models.

from sklearn.decomposition import PCA

pca = PCA(n_components=20)
principalComponents = pca.fit_transform(train[columns])
Df = pd.DataFrame(data=principalComponents,
                  columns=['pc' + str(i) for i in range(1, 21)])
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
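The SVR code is not shown in the post; a minimal sketch of fitting an SVR on the principal components above (with an RBF kernel and illustrative hyperparameters) could look like this.

from sklearn.svm import SVR

## fit a support vector regressor on the 20 principal components
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(Df, y.values.ravel())
## project the test rows into the same PCA space and predict
test_pc = pca.transform(test[columns])
y_pred_svr = svr.predict(test_pc)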

5. Next I tried XGBoost regression and achieved a score of 0.14847 with 500 estimators, which was a great leap from the Random Forest Regressor. I reached a rank of around 2800, also a huge leap from the previous one.

import xgboost as xgb

# 'reg:linear' is the old name for the squared-error objective
# (newer XGBoost versions call it 'reg:squarederror')
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=1000)

xg_reg.fit(train[columns], y)

y_pred = xg_reg.predict(test[columns])

After around 20 submissions I got a score of 0.13933 and a rank of 2465 with 500 estimators and a correlation cutoff of ±0.15. I tried various numbers of estimators, and you can try different ones too; a sketch of one way to compare them follows below. You can also try many other regression techniques like Polynomial Regression, Linear Regression, etc.
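If you want to compare estimator counts more systematically than by submitting each one, here is a sketch using 5-fold cross-validation; the candidate values and CV setup are illustrative, not what I actually submitted.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

## compare a few estimator counts with 5-fold CV on the training data
for n in [300, 500, 1000]:
    model = xgb.XGBRegressor(n_estimators=n, learning_rate=0.1, max_depth=5)
    scores = cross_val_score(model, train[columns], y.values.ravel(),
                             scoring='neg_root_mean_squared_error', cv=5)
    print(n, -scores.mean())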

Lastly, we need to create a submission with the same format as sample_submission.csv; here’s the code for it.

## as we scaled SalePrice with StandardScaler, we need to inverse-
## transform the predictions back to the original price scale
pred = pd.DataFrame(sc.inverse_transform(y_pred.reshape(-1, 1)))
sub_df = pd.read_csv('sample_submission.csv')
datasets = pd.concat([sub_df['Id'], pred], axis=1)
datasets.columns = ['Id', 'SalePrice']

## quick check for missing predictions, then write the submission file
datasets.isnull().sum()
datasets.to_csv('sample_submission.csv', index=False)

datasets.head()

Thanks so much for reading the post, and if you appreciated the effort behind it, please give it a clap.

In case of queries or suggestions for improvement, here is my mail id: kpanwala33@gmail.com. You can mail me and I will reply ASAP.

follow me on Twitter: https://twitter.com/PanwalaKalp

connect with me on LinkedIn: https://www.linkedin.com/in/kalp-panwala-72284018a

follow me on GitHub: https://github.com/kpanwala
