[WEEK 3–ARTIFICIAL REAL ESTATE AGENT]

Mürüvet Gökçen
Published in bbm406f18
3 min read · Dec 16, 2018

Theme: Image Classification and House Price Estimation with Visual and Textual Features

Team Members: Gökay Atay, Ilkin Sevgi Isler, Mürüvet Gökçen, Zafer Cem Özcan

Over the previous two weeks we introduced some of our work, but we realized that we have not yet described our dataset explicitly.

This is a house price dataset that contains both images and textual features of the houses. Each house is represented by four images: bedroom, bathroom, kitchen, and frontal view of the house. The dataset contains 2140 photos of 535 houses. It also includes a text file with the textual information for each house; the rows are in order of house id, so each row describes one house. The textual features are the number of bedrooms, the number of bathrooms, the area of the house, the zip code of the house, and the price.
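To make the layout concrete, here is a minimal sketch of reading such a text file. It assumes whitespace-separated rows in the order bedrooms, bathrooms, area, zip code, price; the `House` class, `parse_houses`, and the sample rows are hypothetical and made up for illustration, not taken from the real file.

```python
from dataclasses import dataclass
from typing import List

# The four views every house is photographed from (as described above).
IMAGE_VIEWS = ("bedroom", "bathroom", "kitchen", "frontal")


@dataclass
class House:
    house_id: int     # 1-based position of the row in the text file
    bedrooms: int
    bathrooms: float  # assumption: half bathrooms (e.g. 2.5) may appear
    area: float
    zipcode: str      # kept as a string: it is a label, not a quantity
    price: float


def parse_houses(text: str) -> List[House]:
    """Parse whitespace-separated rows; the row order gives the house id."""
    rows = [line for line in text.splitlines() if line.strip()]
    houses = []
    for house_id, line in enumerate(rows, start=1):
        bedrooms, bathrooms, area, zipcode, price = line.split()
        houses.append(
            House(house_id, int(bedrooms), float(bathrooms),
                  float(area), zipcode, float(price))
        )
    return houses


# Made-up rows, for illustration only.
SAMPLE = """\
3 2 1500 90001 450000
4 2.5 2300 85001 320000
2 1 900 10001 700000
"""

houses = parse_houses(SAMPLE)
expected_images = len(houses) * len(IMAGE_VIEWS)
print(len(houses), "houses,", expected_images, "images expected")
```

The same check on the full dataset gives 535 × 4 = 2140 images, which matches the photo count above.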

The distribution of the dataset by Zip Code

Looking at the zip codes of the houses, we noticed that they are spread across different parts of the USA, with a large majority in the California area. This will allow us to see how the price range of a region affects our results when we predict prices.

Because we will incorporate the luxury level of the houses into our project, the dataset must contain houses at every level of furnishing. Here are some examples of poorly and richly furnished houses.

Examples of poorly furnished houses
Examples of richly furnished houses

Here you can see the distribution of the data by number of rooms. As you can see, most of the houses have around 3 bedrooms and 2 bathrooms.

To see how each feature affects the price, we computed the correlation between each feature and the price. Among our columns, ‘number of bedrooms’ and ‘number of bathrooms’ have the highest correlations. But when we think about pricing a house, even a small one can be very expensive depending on its location.
We realized that we cannot compute a meaningful correlation directly for a feature like zip code or area, because such a feature can also be thought of as a string (a location). So we need to group the data by zip code and by area, then find the average price for each zip code and each area. We can then calculate the correlation of the ‘avg_price_by_zipcode’ and ‘avg_price_by_area’ columns with the ‘price’ column.
Later on, we are going to use regression models to predict the price. We are going to create a linear regression model for each feature (‘number of bedrooms’, ‘number of bathrooms’, ‘avg_price_by_zipcode’, ‘avg_price_by_area’) and a multiple linear regression that uses all of these features. Then we are going to try other models, such as decision tree and random forest, to see whether we can get higher accuracy.

We have shared how our data relates to the project, along with the details of the dataset. See you next week with a bunch of progress!

Special thanks for the dataset to:
