[WEEK 7–ARTIFICIAL REAL ESTATE AGENT]

Jan 14, 2019

Theme: Image Classification and House Price Estimation with Visual and Textual Features

Team Members: Gökay Atay, Ilkin Sevgi Isler, Mürüvet Gökçen, Zafer Cem Özcan

Our goal is to predict house prices from visual and textual features. So far, we have followed these steps:

1- First, we classified the images in our house data set, which contains pictures of bedrooms, bathrooms, kitchens and frontal views.

2- After classification, we extracted features from the categorized pictures. We call these visual features.

3- Lastly, we added these visual features (luxury levels) as new features alongside our existing textual ones (number of bedrooms, number of bathrooms, area and zip code) to predict house prices.

As we discussed earlier, we do not feed the location information (area and zip code) to the model directly. Instead, we replace each value with the average price of the houses that share it, which gives the model these features in a form it can use. This encoding has a very large effect on our results.
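The averaging step can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual code: the toy data and the column names `zipcode` and `price` are assumptions, and only the `avg_price_by_zipcode` name is taken from the feature list used later in this post. The sketch computes the averages on the training rows only, which is one common way to keep the test prices from leaking into the feature.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
train = pd.DataFrame({
    "zipcode": [91901, 91901, 92276, 92276, 92276],
    "price": [300000, 500000, 100000, 200000, 300000],
})
test = pd.DataFrame({"zipcode": [91901, 92276, 93510]})

# Average price per zip code, computed on the training rows only
zip_means = train.groupby("zipcode")["price"].mean()
global_mean = train["price"].mean()

# Replace the raw zip code by the average price observed for it
train["avg_price_by_zipcode"] = train["zipcode"].map(zip_means)

# Zip codes never seen in training fall back to the overall training average
test["avg_price_by_zipcode"] = test["zipcode"].map(zip_means).fillna(global_mean)
```

The same pattern would apply to the area, bedroom and bathroom columns by swapping the grouping column.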

Our motivation is the assumption that the visual features we extracted would help predict a house's price. Let's look at the results to see whether this assumption holds.

First, we built an equation that combines the values from the previous part (bath_lux, bed_lux, frontal_lux, kitchen_lux). Using linear regression, we found out which room's luxury level affects the price most.

This gave us the following equation:

df['predicted_lux'] = (1354.7927*df['bath_lux']) + (6454.5789*df['bed_lux']) + (4017.0430*df['frontal_lux']) + (3874.6429*df['kitchen_lux'])
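For readers who want to see how weights like these can be obtained, here is a hedged sketch using scikit-learn's `LinearRegression`. The data is synthetic, the 1 to 5 luxury scale and the `true_weights` values are made up for the demonstration, and fitting without an intercept is an assumption chosen only because the equation above has no constant term.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

lux_cols = ["bath_lux", "bed_lux", "frontal_lux", "kitchen_lux"]

# Synthetic stand-in data: luxury levels 1..5 and a price built from
# made-up weights, so the fit has something to recover.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 6, size=(200, 4)), columns=lux_cols)
true_weights = np.array([1000.0, 6000.0, 4000.0, 3000.0])  # hypothetical
df["price"] = df[lux_cols].to_numpy() @ true_weights

# No intercept, to match the form of the equation above (an assumption)
reg = LinearRegression(fit_intercept=False).fit(df[lux_cols], df["price"])
weights = pd.Series(reg.coef_, index=lux_cols)

# Weighted sum of the four luxury levels, one value per house
df["predicted_lux"] = (df[lux_cols] * weights).sum(axis=1)

# The largest weight marks the room type with the strongest effect on price
most_influential = weights.idxmax()
```

Comparing weights this way is only meaningful when all four luxury columns share the same scale, as they do in this toy setup.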

When predicting prices with Random Forest, we tried all combinations of columns to find the one with the highest accuracy.
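An exhaustive search over column combinations could look roughly like the sketch below. It is an illustration under several assumptions: the data is synthetic, the column names `area`, `bedrooms` and `noise` are placeholders, "accuracy" is taken to mean the R² score, and 5-fold cross-validation is one possible way to measure it, not necessarily the one used in the project.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; real columns would come from the house data set.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.uniform(50, 300, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(size=n),  # deliberately unrelated to price
})
df["price"] = 1000 * df["area"] + 20000 * df["bedrooms"] + rng.normal(0, 5000, n)

candidates = ["area", "bedrooms", "noise"]
scores = {}
for k in range(1, len(candidates) + 1):
    for cols in combinations(candidates, k):
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        # Mean R^2 over 5 folds, always measured on held-out rows
        r2 = cross_val_score(model, df[list(cols)], df["price"], cv=5, scoring="r2")
        scores[cols] = r2.mean()

# Column subset with the highest cross-validated score
best_cols = max(scores, key=scores.get)
```

With n candidate columns this trains 2^n - 1 models per fold, so it is only practical for a small number of features like the handful used here.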

If we use only the two location features, the model is so simple that it underfits, even though these two features have the highest correlation with price.

When we add the number of rooms, the accuracy increases, as we expected, because it is an important feature for predicting house prices.

If we add the average prices instead of the number of rooms (as we did with the location features), the accuracy increases by 1%, as we expected.

Finally, adding the predicted_lux column overfits our model, and the accuracy drops by 1%.

When we add the new visual features as predictors, the accuracy drops even though each added feature increases the R-squared value. This is because predicted_lux does not correlate well with the features we had already provided. Since the model overfits, it is better not to use these visual features in our model.
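One way to check for this kind of overfitting is to compare the score on the training rows with the score on held-out rows, with and without the extra column. The sketch below does this on synthetic data; the values, the 70/30 split and the helper `fit_and_score` are assumptions for illustration, and `predicted_lux` is generated here as pure noise, which may not reflect the real feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; values are illustrative assumptions.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "avg_price_by_zipcode": rng.uniform(100000, 500000, n),
    "predicted_lux": rng.uniform(10000, 80000, n),  # unrelated to price here
})
df["price"] = df["avg_price_by_zipcode"] + rng.normal(0, 50000, n)

train, test = train_test_split(df, test_size=0.3, random_state=0)

def fit_and_score(cols):
    """Hypothetical helper: return (train R^2, test R^2) for a column subset."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train[cols], train["price"])
    return (model.score(train[cols], train["price"]),
            model.score(test[cols], test["price"]))

results = {
    "location only": fit_and_score(["avg_price_by_zipcode"]),
    "location + lux": fit_and_score(["avg_price_by_zipcode", "predicted_lux"]),
}
for name, (train_r2, test_r2) in results.items():
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the training and test scores suggests the model is fitting noise; if the gap widens after adding a column while the test score does not improve, that column is a candidate for removal.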

All things considered, adding every feature to our model did not increase the accuracy, so we eliminated the features that did not help. Using only the 'avg_price_by_bed', 'avg_price_by_bath', 'avg_price_by_area' and 'avg_price_by_zipcode' features gives us the model that fits our data best.

To summarize, we experienced how a research project is conducted and how it progresses, and we learned that more features do not necessarily mean better accuracy.

See you later in other projects. Take care!


Written by Ilkin Sevgi Isler