[WEEK IV] Prediction Of Real Estate Price

Batuhan Ündar
Published in bbm406f18
6 min read · Dec 23, 2018

Team Members: Ali Batuhan ÜNDAR, Enes Koçak, Muhammed İkbal Arslan

Sup, lads and gents. I am back to pop the bubble that is real estate.

Less of a “pop” more of a “boom”. Source: Lethal Weapon

JK. That is not within our power. Sadly… But let’s talk about prices. And how we went about valuing houses.

Yeah, what’s up with that?

Fine, at least we have numbers now. If you read the previous blog (he he, read…) you know that we tried shallow models last time. I will be talking about more of them this week (not because we did not do anything concrete this week, I swear). So this week’s post will be short. First, let’s talk about how we achieved such high scores.

But I hate numbers…

Shush.

The method we followed was trial and error. Initially we dumped the whole dataset into the model and stirred. Remember that?

How original. source: https://xkcd.com/1838/

Of course, the model slapped back. Scores were awful. The best was KNN’s 20% accuracy. The worst was, how to put it, hard to interpret (the raw score was -89.0, so -890%?).

So how do we improve that… thing? First, we looked at the data (see the second week’s blog). We had already selected our features (it wasn’t a vast list anyway), so this time we looked at the houses themselves. We already knew there were some houses with “impressive qualities” (11 bathrooms? What is this, a bathhouse?), so secondly, we filtered some of them out.
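For illustration, that kind of filtering is just a couple of pandas masks. The column names below are made up for the sketch, not necessarily our actual ones:

```python
import pandas as pd

# Hypothetical column names; the real dataset has its own.
df = pd.read_csv("houses.csv")

# Drop the "bathhouses" and the obvious price outliers.
mask = (
    (df["bathrooms"] <= 4)
    & (df["rooms"] <= 8)
    & df["price"].between(df["price"].quantile(0.01), df["price"].quantile(0.99))
)
df = df[mask].reset_index(drop=True)
```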

Third, there was the issue of place data. Our data has the format of:

City/Neighborhood/Street

How do you even teach this to a neural network? You enumerate it. Thanks to the magic of SQL (the database is your friend) we get the unique addresses, then assign a number to each. But of course there is an issue. If we enumerate the whole thing, including the street, that gives us about 1.2K unique combinations. Out of 5K samples. That would be meaningless since there are so few duplicates. So, as a trial, we removed the street during enumeration. That toned the unique combinations down to 200-ish places.
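For those without a database at hand, the pandas version of “get the unique places and number them” looks roughly like this (continuing the sketch above, column names still assumed):

```python
# Keep only city + neighborhood, dropping the too-sparse street level.
df["place"] = df["city"] + "/" + df["neighborhood"]

# Alphabetical enumeration of the ~200 unique places
# (this is the version that backfired, as you'll see below).
place_to_id = {p: i for i, p in enumerate(sorted(df["place"].unique()))}
df["place_id"] = df["place"].map(place_to_id)
```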

OK, we added another feature to our model. How are the results?

KNN accuracy: 10%

What in Oblivion is that? Location is an important feature, how did it decrease our score? Ooh, I see. We did enumerate them, but the addresses were sorted alphabetically. Perhaps that’s the issue. Let’s try sorting the addresses by average price, then enumerating them.
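Same sketch as before, but with the ids following average price instead of the alphabet:

```python
# Rank places by their average listing price and use the rank as the id,
# so nearby ids now mean "similarly priced places".
avg_price = df.groupby("place")["price"].mean().sort_values()
place_to_id = {p: i for i, p in enumerate(avg_price.index)}
df["place_id"] = df["place"].map(place_to_id)
```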

KNN accuracy: ~40%

There we go. But it’s still low. How did we even achieve that 93% from the previous blog?

Well that’s… That’s an interesting story. Before each training run, we split the dataset into test and training sets. Before that, we always shuffle our data. Each shuffle makes the scores vary greatly. How greatly? Well, 40-to-90% kind of greatly. How did this even happen? How are the RNG gods interfering with our lives?

Here is the theory. Our dataset is pretty unbalanced. Some places have far higher sample counts than others. For example, Ankara has about 2K samples while Kars has, well, 1. So what happens when a sample from Kars wanders into the test split? Scores go down. What happens when all the test data is from Ankara? If the test set is sufficiently small (like 500-samples small), scores go through the roof.
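If you want to watch the RNG gods at work yourself, repeating the split with a few different seeds makes the swing obvious. A rough sketch (feature columns are placeholders, and the “accuracy” here is the regressor’s R² score):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X = df[["place_id", "rooms", "bathrooms", "area"]].values  # placeholder features
y = df["price"].values

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=seed)
    score = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"seed {seed}: R^2 = {score:.2f}")  # varies wildly when the split is unlucky
```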

Me, who scraped data for 4 days. Source: Best cry ever

To test this theory we used only samples from Ankara as the dataset. That lowered the overall score a bit, but the results were still acceptable.

KNN accuracy: 60–70%

Linear regression accuracy: ~70%

SVM accuracy: GOD KNOWS WHAT…

Yes, linear regression did not change no matter how many times we shuffled the data. That’s the kind of stability we seek. At this point there are three ways forward for us:

  • Improve the dataset using a single place.
  • Improve the dataset using a set of places, while making sure the set does not get unbalanced (impressively difficult).
  • Accept the fact that no matter what we do the dataset will be unbalanced and give in. Find a replacement dataset.

If you ask me, the first option (considering how much work needs to be done) makes more sense. But if we want to use image data too, that’s where things get complicated.

Spiderman, Spiderman, does whatever a spider can…

Data scraping takes too much time. Especially if we want to scrape lots of images with it. Websites that sell their data, like real estate agencies, do not like web crawlers on their pages. They sometimes put up blockages to prevent data scraping. Even if they don’t, aggressive scraping will trigger their DDOS protection. That’s why one needs to be “delicate” when doing the dirty deed, and why there needs to be a lot of delay between each content download (like 5 to 60 seconds) so you can continuously gather data without getting HTTP 429 all over the responses.

<insert witty joke> Source: http://xkcdsw.com/2274
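The “delicate” part boils down to sleeping between requests and backing off when the server starts complaining. A minimal sketch (the actual URLs and parsing are omitted):

```python
import random
import time

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (polite course-project crawler)"

def fetch(url):
    """Download one listing page, then wait 5-60 seconds before the next one."""
    resp = session.get(url, timeout=30)
    if resp.status_code == 429:          # "too many requests": back off, retry once
        time.sleep(300)
        resp = session.get(url, timeout=30)
    resp.raise_for_status()
    time.sleep(random.uniform(5, 60))
    return resp.text
```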

While we were debating that, there was another issue that lowered our scores. Do you remember that we enumerated our location data? Well, that’s one way to do it. But here’s the catch: we enumerated them linearly. Each place is ordered, then assigned a number between 1 and the number of unique places. But of course the data isn’t that simple. The distribution of average prices across places isn’t linear. It is nonlinear and probably very complex. Our solution can fairly be called naive. One more thing to work on.
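One standard alternative we haven’t tried yet is to drop the single ordered id and one-hot encode the places instead, so the model stops assuming that place 57 sits “between” 56 and 58. Reusing the dataframe from the earlier sketches:

```python
# One binary column per place instead of a single linearly ordered id.
df = pd.get_dummies(df, columns=["place"], prefix="place")
```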

OK, that was quite long. What was the image thing you were talking about?

This Q&A thing is getting ridiculous. Source: bbdish.blogspot.com

Each house, along with the information about it, includes photos taken inside and outside (along with a huge watermark right in the middle). There is an average of 10 images per house, and we intend to use these as another feature for price estimation.

How, you might ask? Well, we are not sure either. This week is all about discussions and reports, so we haven’t tried anything yet (yes, I admit it. Hope you are happy). But the idea is to combine a CNN with the rest of the network. We want to use the CNN for regression somehow, but we were advised to look into another thing called “relational reasoning”. I haven’t checked it out yet (thank you, Assignment 3, thank you) but I can see that it is a very complicated topic. We have some ideas about the CNN though. We will be busy with it next week.
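To make that slightly less hand-wavy, one way such a combination could look is a CNN feature extractor bolted onto the tabular features, feeding a small regression head. This is a pure sketch in PyTorch, not something we have built or tested:

```python
import torch
import torch.nn as nn
from torchvision import models

class HousePriceNet(nn.Module):
    """Sketch: CNN image embedding + tabular features -> single price."""

    def __init__(self, n_tabular_features):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        cnn.fc = nn.Identity()                  # keep the 512-d image embedding
        self.cnn = cnn
        self.head = nn.Sequential(
            nn.Linear(512 + n_tabular_features, 128),
            nn.ReLU(),
            nn.Linear(128, 1),                  # regressed price
        )

    def forward(self, image, tabular):
        img_feat = self.cnn(image)              # (batch, 512)
        return self.head(torch.cat([img_feat, tabular], dim=1))
```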

Also, there is the problem of creating a training dataset from the raw images. The images weren’t tagged, so we do not know which image is from which room; to fix that we used a pre-trained model called “Places365”. So far we have classified 2000-ish images into “kitchen, bathroom, living room, bedroom”. But of course the accuracy is not perfect, and not all houses have images of all the rooms I mentioned above.
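The Places365 part is mostly plumbing: load the pre-trained ResNet that the CSAIL folks publish, run every photo through it, and keep only the room labels we care about. Roughly (file names follow the official places365 repo; treat the details as a sketch):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Weights and category list from https://github.com/CSAILVision/places365
checkpoint = torch.load("resnet18_places365.pth.tar", map_location="cpu")
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}

model = models.resnet18(num_classes=365)
model.load_state_dict(state_dict)
model.eval()

with open("categories_places365.txt") as f:
    categories = [line.split(" ")[0][3:] for line in f]  # strip the "/x/" prefix

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

ROOMS = {"kitchen", "bathroom", "living_room", "bedroom"}

def classify_room(path):
    """Return the predicted room type, or None if the photo isn't a room we use."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        label = categories[model(img).argmax(dim=1).item()]
    return label if label in ROOMS else None
```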

OK, this is getting very long, so I will be closing here. There are a lot of things I want to talk about, but then this would be a novel. Thanks for reading and sorry for the long post. Here is your potato:

Darn it, now I wanna play Skyrim. Source: Skyrim
