# House Prices Analysis: Advanced Regression Techniques

# Table of Contents:

- HOUSE PROBLEM
- EXPLORATORY DATA ANALYSIS
- DATA PREPARATION
- BUILDING MODELS
- MODEL PERFORMANCE SUMMARY
- SIMULATION OF FINAL SUBMISSION

## 1. HOUSE PROBLEM

*Description*

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

In this competition, participants are presented with a comprehensive dataset related to house prices. The dataset includes various features and attributes of residential properties, such as the number of bedrooms, square footage, location, and other relevant factors.

The dataset was carefully curated to provide a diverse representation of houses from different regions. It aims to capture real-world scenarios and challenges faced in predicting house prices accurately. The dataset also contains additional information such as sales dates and prices, allowing participants to analyze trends and patterns over time.

Participants are encouraged to leverage advanced regression techniques to develop predictive algorithms that can accurately estimate house prices based on the given features. The objective is to create models that can generalize well and effectively capture the complex relationships between the independent variables and the target variable, which is the sale price of the house.

The competition aims to showcase the ability to predict house prices reliably using the provided dataset. Successful models would shed light on the factors that significantly influence house prices, providing valuable insights for real estate professionals, home buyers, and sellers alike. Additionally, the competition fosters the exchange of innovative approaches and techniques in the field of regression analysis and housing market research.

## 2. EXPLORATORY DATA ANALYSIS

*Data Description*

The **House Prices: Advanced Regression Techniques** dataset is used, which contains 4 files:

- train.csv (which has your training data).
- test.csv (which has your testing data).
- data_description.txt (which describes the attributes of the data, e.g. the different categories a particular attribute can take).
- sample_submission.csv (a sample submission file showing the format your predictions should follow).

After extracting the files, here's the train data:
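A minimal sketch of loading the files, assuming they sit in the working directory (the paths are illustrative):

```python
import pandas as pd

df = pd.read_csv('train.csv')    # training data, used as `df` below
test = pd.read_csv('test.csv')   # competition test data, used as `test` below
df.head()
```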

## Analysis on Data

`df.isnull().sum().sort_values(ascending=False)`

Using the **isnull()** function we can see that most entries of the **PoolQC** column are empty, so we can drop this column from the main dataset.
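A quick way to see this (a minimal sketch) is the fraction of missing values per column:

```python
# fraction of missing values per column, highest first
df.isnull().mean().sort_values(ascending=False).head(10)
```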

*MiscFeature*: miscellaneous feature not covered in other categories. Here too most entries are null, and the column does not improve the accuracy of the model, so we can remove it during feature selection. *Electrical* also has some missing entries.

Note: **Bedroom** and **Kitchen** are not parameters in the dataset; they are just abbreviations used in the data.

The **Alley** column is also mostly empty and also does not improve the accuracy of the model.

**Next, I tried to find** →

## 1. Categorical values having only one value, which will not be helpful

```python
droped_columns_2 = []
for col, num in zip(df.astype('object').nunique().index, df.astype('object').nunique().values):
    if num == 1:
        droped_columns_2.append(col)

droped_columns_2
```

Output: `['Alley', 'PoolQC', 'Fence', 'MiscFeature']`
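These near-empty columns can then be dropped in one step (a minimal sketch, assuming the list above):

```python
# drop the columns identified above
df.drop(columns=droped_columns_2, inplace=True)
```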

## 2. Checking for Duplicate Rows

No duplicates found, luckily 😎

`df.duplicated().sum()`

# Let's try exploring numerical features

Here we focused on **multicollinearity**, since training on redundant, highly correlated features adds little that the model actually needs.

*Used a **heatmap** for feature selection.*
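A minimal sketch of such a heatmap, assuming seaborn and matplotlib are available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix of the numeric features
corr = df.select_dtypes('number').corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap='coolwarm')
plt.title('Correlation between numeric features')
plt.show()
```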

Highly correlated feature pairs:

- ['YearBuilt', 'GarageYrBlt']
- ['1stFlrSF', 'TotalBsmtSF']
- ['GarageCars', 'GarageArea']
- ['GrLivArea', 'TotRmsAbvGrd']

`df.drop(columns=['GarageYrBlt','1stFlrSF','GarageArea','TotRmsAbvGrd'],inplace=True)`

## The Outlier-Catching Phase 🧐

Outliers are data points that significantly deviate from the overall pattern of a dataset. They are observations that are distant from other observations and can have a disproportionate impact on statistical analyses.

These can affect the accuracy of any model, as they affect the variance and the mean, leading to biased results.

The features 'TotalBsmtSF' and 'GrLivArea' have a lot of outliers 🤧🤧
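A minimal sketch of how these could be spotted, assuming scatter plots against SalePrice (the thresholds used below were presumably read off plots like these):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df['TotalBsmtSF'], df['SalePrice'], alpha=0.4)
axes[0].set_xlabel('TotalBsmtSF')
axes[0].set_ylabel('SalePrice')
axes[1].scatter(df['GrLivArea'], df['SalePrice'], alpha=0.4)
axes[1].set_xlabel('GrLivArea')
axes[1].set_ylabel('SalePrice')
plt.show()
```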

## Let's Catch Them 😎

```python
df.drop(columns='Id', inplace=True)

mask1 = df['TotalBsmtSF'] < 2050
mask2 = df['TotalBsmtSF'] > 100
mask3 = df['GrLivArea'] < 2800
mask4 = df['GarageCars'] < 3.8
mask5 = df['OverallQual'] > 1.8

DF = df[mask1 & mask2 & mask3 & mask4 & mask5]
```

# Let's Examine the Skewness of the Data

Skewness is a statistical measure that quantifies the asymmetry of a probability distribution of a dataset. It provides information about the shape of the distribution and the extent to which it deviates from a symmetric, bell-shaped curve.

Skewed data affects the performance of any model, so let's look at the distribution of the features and fix the skewed data.

Note:

- Data is symmetrical: skewness is between -0.5 and 0.5
- Data is slightly skewed: skewness is between -1 and -0.5, or between 0.5 and 1
- Data is highly skewed: skewness is less than -1 or greater than 1
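A quick way to inspect per-column skewness before fixing it (a minimal sketch):

```python
# skewness of each numeric feature, most skewed first
df.select_dtypes('number').skew().sort_values(ascending=False)
```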

```python
# select columns with skew() > 1 or < -1
sk = []
for i in df.drop(columns=['YearBuilt', 'YearRemodAdd']).select_dtypes('number').columns:
    if (df[i].skew() > 1) or (df[i].skew() < -1):
        sk.append(i)

# update values of the skewed columns from x to log(x)
np.seterr(divide='ignore')
sk_ = pd.DataFrame(
    np.select([DF[sk] == 0, DF[sk] > 0, DF[sk] < 0], [0, np.log(DF[sk]), np.log(DF[sk])]),
    columns=sk).set_index(DF.index)

df_skew = DF.drop(columns=sk).set_index(DF.index)
df_skew = pd.concat([df_skew, sk_], axis=1)
X_train_skew = df_skew.drop(columns='SalePrice')
```

```python
# same transformation for the test data: columns with skew() > 1 or < -1
sk_t = []
for i in test.drop(columns=['YearBuilt', 'YearRemodAdd']).select_dtypes('number').columns:
    if (test[i].skew() > 1) or (test[i].skew() < -1):
        sk_t.append(i)

# keep the column list (sk_t) and the transformed frame (sk_t_) in separate variables
sk_t_ = pd.DataFrame(
    np.select([test[sk_t] == 0, test[sk_t] > 0, test[sk_t] < 0], [0, np.log(test[sk_t]), np.log(test[sk_t])]),
    columns=sk_t).set_index(test.index)

df_skew_t = test.drop(columns=sk_t).set_index(test.index)
df_skew_t = pd.concat([df_skew_t, sk_t_], axis=1)
X_test_skew = df_skew_t.reindex(columns=X_train_skew.columns)
```

However, I did not replace the real data with the skew-corrected data right away; I wanted to train the models on both this transformed **data** and the **real data** (I tried it myself and it gave more accuracy with the updates) to see whether it does better than the real data. If not, we continue as before, since nothing changes 😋.

# 3. DATA PREPARATION

Description

*Imputation of Missing Values*

Dealt with the categorical features' missing values separately.

```python
# start with Categorical Features
from sklearn.impute import SimpleImputer

imp_cat = SimpleImputer(strategy="most_frequent")
X_Cat = df_final.select_dtypes('object')
X_data_cat = imp_cat.fit_transform(X_Cat)
```

Dealt with the numerical data separately.

```python
# Numerical Features
from sklearn.impute import SimpleImputer

imp_num = SimpleImputer()  # default strategy imputes with the mean
X_num = df_final.select_dtypes('number')
X_data_num = imp_num.fit_transform(X_num)
```

## One-Hot Encoding

“Hot cats, cool dogs! One-hot encoding: making categorical data ML-friendly. No bias, just distinct columns. Meow, bark, woof! Let’s convert categories into numerical paw-someness. No more cat-astrophes in algorithms! Embrace the power of one-hot encoding and unleash purr-fect predictions!”

```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first', sparse_output=False)
df_ohe = ohe.fit_transform(X_data_cat)  # encode the imputed categorical data

# At the end, concatenate all the data using numpy.concatenate()
X_Data = np.concatenate([df_ohe, X_data_num], axis=1)
```

# 4. BUILDING MODELS

A heads-up: I implemented all the ML models (that I knew) on this data, analysed how well each fits it, and at the end picked the best model in terms of the Root Mean Squared Error (RMSE) possible for this dataset. 🙂🙂

```python
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_Data, df['SalePrice'],
                                                     test_size=0.2)
```
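The snippets below report `model.score()`, which for scikit-learn regressors is the R² score (shown here as a percentage). If the actual RMSE is wanted, a minimal helper sketch using the `mean_squared_error` import above would be:

```python
import numpy as np

def rmse(model, X, y):
    """Root mean squared error of a fitted regressor on (X, y)."""
    preds = model.predict(X)
    return np.sqrt(mean_squared_error(y, preds))

# example usage once a model is fitted: rmse(clf_boosting, X_test, y_test)
```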

List of implemented models:

1. **KNN Algorithm**
2. **Naive Bayes Model**
3. **Support Vector Machine (SVM)**
4. **Decision Tree**
5. **Random Forest**
6. **Gradient Boosting**

## 1. KNN Algorithm

“KNN, the friendly neighbor of machine learning. It’s a simple yet powerful algorithm. KNN stands for k-nearest neighbors. It classifies data based on its closest neighbors in feature space. Whether predicting genres or identifying anomalies, KNN’s got your back. Just count the neighbors, find consensus, and make accurate predictions. It’s like having a helpful neighbor next door in the world of ML!”

```python
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor()
```

## Hyper-Parameter Tuning

```python
error_rate = []

# Will take some time
for i in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # for a regressor, use RMSE as the error for this value of k
    error_rate.append(np.sqrt(np.mean((pred_i - y_test) ** 2)))

# Plotting the error to infer a value for the K-neighbour parameter
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error vs. K Value')
plt.xlabel('K')
plt.ylabel('Error (RMSE)')
```

K = 2 has the minimum error.

## Fitting the Model with K=2

```python
knn3 = KNeighborsRegressor(n_neighbors=2)
knn3.fit(X_train, y_train)
knn_pred = knn3.predict(X_test)

# .score() is the R2 score for regressors, shown as a percentage
print("R2 score of KNN --> ", knn3.score(X_test, y_test) * 100)
```

Output → R² score of the KNN model: 63.436252589277544

**Analysis on the Model:** “KNN can struggle in regression due to its reliance on nearby neighbors. It lacks flexibility in capturing complex relationships, leading to suboptimal predictions and higher RMSE values.”

## 2. **Naive Bayes Model**

“Naive Bayes, the ‘naive’ rockstar of ML. With its simple assumption of independence between features, it classifies data swiftly and efficiently. It’s like a magician pulling rabbits out of a hat, predicting with speed and ease. From spam filtering to sentiment analysis, Naive Bayes weaves its magic, making it a popular choice in many ML applications.”

```python
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

Mb = MultinomialNB()
Bb = BernoulliNB()
Gb = GaussianNB()
```

## Fitting the Model on the DataSet

```python
Mb.fit(X_train, y_train)
y_pred = Mb.predict(X_test)

# Naive Bayes is a classifier, so .score() here is classification accuracy (as a percentage)
print("Score of MultinomialNB --> ", Mb.score(X_test, y_test) * 100)
```

Output → Score of MultinomialNB: 0.3424657534246575

```python
Bb.fit(X_train, y_train)
bb_pred2 = Bb.predict(X_test)

print("Score of BernoulliNB --> ", Bb.score(X_test, y_test) * 100)
```

Output → Score of BernoulliNB: 1.36986301369863

```python
Gb.fit(X_train, y_train)
y_pred3 = Gb.predict(X_test)

print("Score of GaussianNB --> ", Gb.score(X_test, y_test) * 100)
```

Output → Score of GaussianNB: 1.36986301369863

**Analysis on the Model:** “Naive Bayes is ill-suited for regression, as it assumes independence between features, treats the target as discrete classes, and has limited capability to capture complex relationships, resulting in very poor performance here. It is typically used for tasks like *Sentiment Analysis*.”

## 3. **Support Vector Machine (SVM)**

“Support Vector Machine (SVM) is a powerful machine learning model that maximizes the margin between classes, making it effective for both classification and regression tasks. It finds an optimal hyperplane to separate data points and can handle complex decision boundaries. SVM achieves remarkable performance by utilizing support vectors and kernel functions for efficient and accurate predictions.”

```python
from sklearn import svm

# note: SVC is the classification variant; svm.SVR is the regression counterpart
clf = svm.SVC(decision_function_shape='ovo')
```

## Fitting the Model

```python
clf.fit(X_train, y_train)
svm_pred = clf.predict(X_test)

# SVC is a classifier, so .score() here is classification accuracy (as a percentage)
print("Score of SVM --> ", clf.score(X_test, y_test) * 100)
```

Output → Score of SVM: 2.054794520547945

**Analysis on the Model:** “SVM: strong in classification, weak in regression as used here. Its focus on maximizing the margin between classes makes it less effective for continuous prediction, and it may struggle to capture complex relationships, leading to very poor performance on this regression task.”

## 4. **Decision Tree**

“Decision tree: Your AI guide through data’s wilderness. This model maps features to outcomes, making decisions branch by branch. It learns from data and splits its way to clarity, uncovering insights in a tree-like structure. With its intuitive and interpretable nature, the decision tree empowers us to navigate the complexities of data and make informed choices.”

```python
from sklearn import tree

clf_tree = tree.DecisionTreeRegressor()
```

## Fitting the Model

```python
clf_tree.fit(X_train, y_train)
decision_pred = clf_tree.predict(X_test)

# R2 score, shown as a percentage
print("R2 score of Decision Tree --> ", clf_tree.score(X_test, y_test) * 100)
```

Output → R² score of Decision Tree: 73.91510370312822

**Analysis on the Model:** “Branching out to low RMSE! Decision trees excel in regression with their ability to capture nonlinear relationships and handle both continuous and categorical features. By recursively partitioning data, decision trees create precise predictions, making them a powerful tool for minimizing RMSE and achieving accurate regression results.”

## 5. **Random Forest**

“Nature’s own ensemble of decision trees. This powerful machine learning model combines the wisdom of many trees to make accurate predictions. It handles high-dimensional data, avoids overfitting, and provides feature importance. With randomness and teamwork, the Random Forest model is a forest full of predictive power, conquering the wilderness of data”

```python
from sklearn.ensemble import RandomForestRegressor

clf_random_tree = RandomForestRegressor(max_depth=25, random_state=0, n_estimators=1000)
```

For the hyper-parameters, `max_depth=25` and `n_estimators=1000` were estimated to be good for the model.
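One way such values can be estimated is a small grid search; a minimal sketch, assuming `GridSearchCV` over an illustrative grid (not necessarily the search that was actually run here):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# illustrative search space, not the one used in the article
param_grid = {
    'max_depth': [10, 25, None],
    'n_estimators': [200, 500, 1000],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',  # optimise RMSE directly
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```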

## Fitting the Model on DataSet

```python
clf_random_tree.fit(X_train, y_train)
random_pred = clf_random_tree.predict(X_test)

# R2 score, shown as a percentage
print("R2 score of Random Forest Model --> ", clf_random_tree.score(X_test, y_test) * 100)
```

Output → R² score of Random Forest Model: 86.71510582796937

**Analysis on the Model:** “Unleash the power of Random Forest! This versatile ensemble model excels in regression tasks, delivering impressive scores. By combining multiple decision trees and aggregating their predictions, Random Forest captures complex relationships and reduces overfitting. It handles high-dimensional data and missing values, and provides feature importance insights. Experience the magic of Random Forest and achieve remarkable regression results with minimal hassle.”

## 6. **Gradient Boosting**

“Boost your predictions with Gradient Boosting! This powerful machine learning model combines weak learners into a strong one. It iteratively improves predictions by focusing on previous mistakes. Gradient Boosting handles various data types, handles complex interactions, and excels in regression and classification tasks. Get ready to elevate your models to new heights with Gradient Boosting!”

```python
from sklearn.ensemble import GradientBoostingRegressor

clf_boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=1.0,
                                         max_depth=1, random_state=0)
```

For the hyper-parameters, `max_depth=1`, `n_estimators=100`, and `learning_rate=1.0` were estimated to be good for the model.

## Fitting on the DataSet

```python
clf_boosting.fit(X_train, y_train)
boosting_pred = clf_boosting.predict(X_test)

# R2 score, shown as a percentage
print("R2 score of Gradient Boosting --> ", clf_boosting.score(X_test, y_test) * 100)
```

Output → R² score of Gradient Boosting Model: 86.71510582796937

**Analysis on the Model:** “Experience exceptional regression with Gradient Boosting! This versatile model leverages the power of boosting to iteratively refine predictions and minimize the root mean squared error (RMSE). By effectively capturing complex relationships in data, Gradient Boosting excels in regression tasks, delivering accurate and reliable results. Say goodbye to high RMSE and embrace the power of Gradient Boosting for exceptional regression performance!”

It also turns out to be the best fit.
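Since the competition description highlights understanding which factors influence the sale price, the fitted booster's `feature_importances_` can surface them. A minimal sketch, assuming the fitted `clf_boosting` and that the encoded feature names can be rebuilt from the `ohe` and `X_num` objects above (the exact contents of those variables are an assumption):

```python
# column order matches X_Data: one-hot encoded columns first, then numeric columns
feature_names = list(ohe.get_feature_names_out()) + list(X_num.columns)

importances = pd.Series(clf_boosting.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```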

# 5. MODEL PERFORMANCE SUMMARY

```python
# per-model .score() values collected above
model_scores = [0.8640396610218306, 0.8705751147837917, 0.8118592767468824,
                -0.18955898672630567, 0.2003347520496762, 0.6470277623092952]

# the best model is the one with the highest score
best_score = max(model_scores)
print(f"Best score: {best_score:.3f}")
```

Best model: Gradient Boosting, with score → 86.059%

# 6. SIMULATION OF FINAL SUBMISSION

Predicting the Final Submission using Gradient Boosting Algorithm

```python
# NOTE: for the real Kaggle submission, the processed features of test.csv
# (not the held-out X_test split) should be passed to predict()
testing = clf_boosting.predict(X_test)
output = pd.DataFrame(testing).astype(int)

# Saving the output in the form of a .csv
output.columns = ['SalePrice']
Id_list = np.arange(1461, 2920).astype(int)  # Ids of the competition test set
Id = pd.DataFrame(Id_list)
Id.columns = ['Id']

submission = pd.concat([Id, output], axis=1)
submission.to_csv('/content/drive/MyDrive/house-prices-advanced-regression-techniques/submission.csv',
                  index=False, header=True)
```

# About me

Thank you so much for reading my article! Hi, I’m Akhil Sharma, an IT&MI student from Cluster Innovation Center, University Of Delhi. If you have any questions, please don’t hesitate to contact me!

Email me at akhilsharma.off@gmail.com, and feel free to connect with me:

Follow me on **GitHub**: https://github.com/Akhil-Sharma30

Follow me on **Twitter**: DevelopAkhil.twitter

**Colab NoteBook:** https://github.com/Akhil-Sharma30/House-Pricing-Analysis/blob/main/house_prices_Analysis.ipynb