Buying a house in the Netherlands (in a nerdy way)

Cintia Bruscantini
Towards Data Science
12 min read · Oct 7, 2019

If you are like me, you might get overwhelmed when making big decisions such as buying a house. In such cases, I like to take a data-driven approach that helps me find an optimal solution. This involves two steps. First, we need to gather as much data as we can. Second, we need to define a metric for success.

Gathering data on housing prices in the Netherlands requires some effort. I got the housing prices from funda.nl. A caveat is that the asking prices on the website are not the prices at which the houses were actually sold. The asking price is also not obviously biased above or below the selling price, so I'm not able to correct for it. This is what generally happens when dealing with real data: it is always far from perfect.
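
For completeness, here is a minimal sketch of the kind of scraper that can gather such listings. The URL pattern and CSS selectors below are illustrative assumptions, not funda.nl's actual markup, and the site rate-limits automated requests, so treat this as a starting point only:

import requests
from bs4 import BeautifulSoup

# Assumed URL pattern -- the real funda.nl structure differs
BASE_URL = "https://www.funda.nl/koop/amsterdam/p{page}/"

def scrape_listings(n_pages=3):
    rows = []
    for page in range(1, n_pages + 1):
        resp = requests.get(BASE_URL.format(page=page),
                            headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select("div.search-result"):  # assumed selector
            rows.append({
                "address": card.select_one("h2").get_text(strip=True),
                "Vraagprijs": card.select_one(".price").get_text(strip=True),
                "href": card.select_one("a")["href"],
            })
    return rows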

Defining a metric for success is personal and subjective. I consider a house to be a good option if: 1. the asking price is cheap; and/or 2. the potential rent I can get from it is high compared to the investment.

  1. To know whether the asking price is cheap, I used the asking prices of the houses listed on the website to build a machine learning model that predicts the asking price of a house given its characteristics. I can then use that model to evaluate what the asking price of a house should be; if the actual asking price is below the price predicted by the model, I'll consider it a good deal. (Note: a good deal here is defined within the context of the dataset itself, not in absolute market terms; if, for example, every price on the website were twice its true market value, I would still find good deals by comparing houses against each other.)
  2. To know what the rent of a house being sold would be, I used the data of houses offered for rent on the website and built a machine learning model that predicts rent price given the house characteristics. I can then use that model to estimate what the rent of a house being sold should be. If the ratio of asking price over predicted rent is low, it means I can rent the house out and the payback period will be short (see the sketch after this list).
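
In code, the two criteria boil down to two numbers per house. A minimal sketch (predict_price and predict_rent stand in for the two models built later in this post):

# Sketch only: predict_price and predict_rent stand in for the two
# ML models trained later in the post.
def deal_signals(house, predict_price, predict_rent):
    predicted_price = predict_price(house)  # model 1: expected asking price
    predicted_rent = predict_rent(house)    # model 2: expected monthly rent
    return {
        # below 1 means the asking price is lower than the model expects
        'price_ratio': house['asking_price'] / predicted_price,
        # years of rent needed to earn back the asking price
        'payback_years': house['asking_price'] / (predicted_rent * 12),
    }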

A final caveat: all the analysis here is done on static data; there is no price trend forecasting involved.

The Data

From funda.nl, I have data of 2,150 houses and apartments being sold in Amsterdam, Zaandam and Diemen, in July 2019; and 1,046 houses and apartments offered for rent.

For each house, I have the following data (yes, the website is in Dutch; you can use Google Translate):

['Aangeboden sinds', 'Aantal badkamers', 'Aantal kamers', 'Aantal woonlagen', 'Aanvaarding', 'Achtertuin', 'Badkamervoorzieningen', 'Balkon/dakterras', 'Bijdrage VvE', 'Bouwjaar', 'Capaciteit', 'Cv-ketel', 'Eigendomssituatie', 'Energielabel', 'Externe bergruimte', 'Gebouwgebonden buitenruimte', 'Gelegen op', 'Inhoud', 'Inschrijving KvK', 'Isolatie', 'Jaarlijkse vergadering', 'Lasten', 'Ligging', 'Ligging tuin', 'Onderhoudsplan', 'Oppervlakte', 'Opstalverzekering', 'Overige inpandige ruimte', 'Periodieke bijdrage', 'Reservefonds aanwezig', 'Schuur/berging', 'Servicekosten', 'Soort appartement', 'Soort bouw', 'Soort dak', 'Soort garage', 'Soort parkeergelegenheid', 'Soort woonhuis', 'Specifiek', 'Status', 'Tuin', 'Verwarming', 'Voorlopig energielabel', 'Voorzieningen', 'Vraagprijs', 'Warm water', 'Wonen', 'address']

I also created some extra features using the following code:

def create_cols(df):
    # Parse the Dutch zip code (e.g. "1234 AB") out of the address
    df['zip_code'] = df['address'].str.extract(pat='([0-9]{4} [A-Z]{2})')
    df['zip_code'] = df['zip_code'].str.replace(' ', '', regex=False)
    df['zip_code_number'] = df['zip_code'].str.extract(pat='([0-9]{4})[A-Z]{2}').fillna(0).astype(int)
    # Asking price: strip the thousands separators and cast to float
    df['price'] = df['Vraagprijs'].str.extract(pat='([0-9]{0,3}.?[0-9]{3}.[0-9]{3})')
    df['price'] = df['price'].str.replace('.', '', regex=False).astype(float)
    df['nr_bedrooms'] = df['Aantal kamers'].str.extract(pat='([0-9]) slaapkamer').fillna(0).astype(int)
    df['nr_rooms'] = df['Aantal kamers'].str.extract(pat='([0-9]) kamer').fillna(0).astype(int)
    df['nr_floors'] = df['Aantal woonlagen'].str.extract(pat='([0-9]) woonla').fillna(0).astype(int)
    df['nr_bathrooms'] = df['Aantal badkamers'].str.extract(pat='([0-9]+) badkamer').fillna(0).astype(int)
    df['nr_toilet'] = df['Aantal badkamers'].str.extract(pat='([0-9]+) aparte? toilet').fillna(0).astype(int)
    df['construction_year'] = df['Bouwjaar'].str.extract(pat='([0-9]{4})').astype(float)
    df['cubic_space'] = df['Inhoud'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['external_storage_space'] = df['Externe bergruimte'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['outdoor_space'] = df['Gebouwgebonden buitenruimte'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['living_space'] = df['Wonen'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['monthly_expenses'] = df['Bijdrage VvE'].str.extract(pat='([0-9]+) per maand').fillna(0).astype(float)
    df['other_indoor_space'] = df['Overige inpandige ruimte'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['dont_have_frontyard'] = df['Achtertuin'].str.extract(pat='(voortuin)').isna()
    # Flags derived from the address: street, canal or square
    df['not_straat'] = df['address'].str.extract(pat='(straat)').isna()
    df['not_gracht'] = df['address'].str.extract(pat='(gracht)').isna()
    df['not_plein'] = df['address'].str.extract(pat='(plein)').isna()
    df['price_per_living_sqm'] = df['price'] / df['living_space']
    df['is_house'] = df['Soort appartement'].isnull()
    # Drop rows without a price and filter out extreme outliers
    df = df[df['price'].notna()]
    df = df[df['living_space'] > 0]
    df = df[df['living_space'] < 600]
    df = df[df['price'] < 4500000]
    df['dont_have_backyard'] = df['Achtertuin'].isna()
    df['backyard_size'] = df['Achtertuin'].str.extract(pat='([0-9]+) m²').fillna(0).astype(float)
    df['has_garage'] = df['Soort garage'].notna()  # garage info present means there is one
    df['total_area'] = df['outdoor_space'] + df['external_storage_space'] + df['living_space'] + df['other_indoor_space']
    df['address_nozip'] = df['address'].str.extract(pat='^(.+)[0-9]{4} [A-Z]{2}')
    df['address_zip'] = df['address_nozip'] + ' ' + df['zip_code']
    df['parcela'] = df['Oppervlakte'].str.extract(pat='([0-9]+) m').fillna(0).astype(float)
    df['price_per_parcela_sqm'] = df['price'] / df['parcela']
    return df

Data Exploration

Importing packages needed:

import pandas as pd
import re
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

Loading the data:

df = pd.read_csv('data_funda.csv',sep = ',', encoding = 'utf-16')
df = create_cols(df)

Let's look at the relation between price, living space and construction year. Because of the skewed distributions of price and living space, I also created log-transformed versions of those variables, named price_log1p and living_space_log1p respectively (see the "Features distribution" section).

# Price vs. living space
p_cor = df[['living_space', 'price']].corr(method='pearson').loc['living_space', 'price']
df.plot.scatter(x='living_space', y='price', c='construction_year', colormap='viridis',
                figsize=[12, 8], vmin=1900, vmax=2000, label="corr:" + str(round(p_cor, 4)))
plt.legend()
p_cor = df[['living_space_log1p', 'price_log1p']].corr(method='pearson').loc['living_space_log1p', 'price_log1p']
df.plot.scatter(x='living_space_log1p', y='price_log1p', c='construction_year', colormap='viridis',
                figsize=[12, 8], vmin=1900, vmax=2000, label="corr:" + str(round(p_cor, 4)))
plt.legend()

You can clearly see that the bigger the house, the higher the price (duh!), but what is also a little surprising (if you know nothing about Amsterdam) is that very old houses are more expensive than newer ones for the same square meters. This is because of location! Amsterdam's downtown is very expensive and its houses are quite old. You can also see from the plot that houses built around the year 2000 tend to be bigger than those built in 1980.

And how does the price change with number of bedrooms?

# Total bedrooms vs. Price
sns.boxplot(x=df['nr_bedrooms'], y=df['price'])
plt.show()
sns.boxplot(x=df['nr_bedrooms'], y=df['price_per_living_sqm'])
plt.show()

The higher the number of bedrooms, the higher the asking price (left plot), most likely because the house is bigger. However, if we look at how the price per square meter changes with the number of bedrooms (right plot), it is rather flat, except that the median is lower for houses with 3, 4 and 5 bedrooms.

Location, location, location!

Housing price is very dependent on the location of the house. Let’s take a look at how the price distribution changes with zip code.

df[['price','zip_code_number']].boxplot(by='zip_code_number', figsize=[25,8], rot=45)
plt.ylabel('Price')

From the plot you can see there are some zip codes where houses have high median prices and standard deviation (e.g. downtown zip codes such as 1071, 1077, 1017), and some where the prices are consistently low (e.g. 1102, 1103, 1104, which are in the area of Bijlmermeer).

Actually, I find it more interesting to look at the distribution of price per square meter versus zip code.

ax=df[['price_per_living_sqm','zip_code_number']].boxplot(by='zip_code_number', figsize=[25,8], rot=45)
plt.ylabel('Price per sqm')
ax.set_ylim([2000,12500])

Even without any machine learning, one can use the previous plot to decide which neighbourhoods to explore and go house viewing in, given a fixed budget and a minimum square meter requirement. So, if we were looking for a house with at least 100 sqm and wanted to spend less than 400k euros, we should focus on zip codes such as 1024, 1060, 1067, 1069, >1102. Of course, you might get a house (outlier) in zip code 1056 that is below budget, but be prepared to spend a few extra bucks on renovation :). The same shortlist can be pulled straight from the dataframe, as shown below.
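
A quick filter along those lines, using the columns created earlier:

# Houses with at least 100 sqm of living space and under 400k euros,
# counted per zip code
budget = df[(df['living_space'] >= 100) & (df['price'] < 400000)]
print(budget['zip_code_number'].value_counts().head(10))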

Talking about renovations… how does the price per square meter vary with the construction year? Let’s make a plot.

df_filt = df[(df['construction_year'] > 1900) & (df['construction_year'] < 2019)].copy()
# Bin construction years into decades
df_filt['construction_year'] = (df_filt['construction_year'] // 10) * 10
data = pd.concat([df_filt['price_per_living_sqm'], df_filt['construction_year']], axis=1)
f, ax = plt.subplots(figsize=(20, 8))
fig = sns.boxplot(x='construction_year', y='price_per_living_sqm', data=data)
fig.axis(ymin=1000, ymax=11000)
plt.xticks(rotation=45)

As we saw before, very old houses have a higher price per square meter than newer ones, because those old houses sit in very expensive zip codes. Prices per square meter are highest for houses built around 1900, drop steadily until about 1970, and increase again from then on.

To get the full picture of how each pair of features is correlated, let's take a look at the correlation matrix.

plt.figure(figsize=(20,12))
sns.heatmap(df.drop(columns='index').corr(), annot=False, square=True)

There is a lot going on there, but it gives us an idea of which variables are positively correlated (such as living space with the number of bedrooms, rooms, floors, bathrooms, etc.) and which are negatively correlated (e.g. being a house and not having a backyard or a frontyard, or having monthly expenses).

If we focus only on the 10 variables with the highest correlation with price, the correlation matrix looks like this:

corrmat = df.corr()
cols = corrmat.nlargest(10, 'price')['price'].index
hm = sns.heatmap(np.corrcoef(df[cols].values.T), cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

We can see that price is highly correlated with all variables that directly or indirectly are related to the house size.

A final quality check is to look at the percentage of missing values for each feature.

df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)[:40]
missing_data = pd.DataFrame({'Missing Ratio' :df_na})
plt.figure(figsize=(15, 9))
sns.barplot(x=df_na.index, y=df_na)
plt.xticks(rotation=80)
plt.ylabel('Percent of missing values', fontsize=15)

Some of these variables have missing values because the data was not provided by the real estate agent/owner (e.g. energy label), while for others a missing value actually means the house lacks that feature (e.g. garage, or garden: tuin). It is worth handling the two cases differently, as sketched below.
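
A sketch of how the two cases can be treated (using columns from the dataset; the 'unknown' category is my choice, not something funda provides):

# Where NaN means "the house doesn't have it", encode absence as a flag
df['has_garden'] = df['Tuin'].notna()
# Where NaN means "not reported", keep it as an explicit 'unknown' category
df['Energielabel'] = df['Energielabel'].fillna('unknown')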

Features distribution

In order to build an ML model for predicting housing prices, let's take a closer look at the feature histograms. I'll plot just 2 examples here: monthly expenses [euro] and living space [sqm].

for col in df.describe().columns.values:
    if col != 'price_per_parcela_sqm':
        axs = df.hist(column=col, bins=50)

From the left plot we can see that some houses (mostly apartments) have monthly expenses (called Bijdrage VvE in Dutch), which tend to average around 100 euros, while others don't have them at all.

From the right plot, you can see that the mode of the living_space distribution is around 100 sqm. The distribution is not normal but highly right-skewed, and so is the price distribution. Let's take a closer look at the price distribution.

for col in df.describe().columns.values:
    try:
        sns.distplot(df[col], label="Skewness: {:.3f}".format(df[col].skew()))
        plt.title(col + ' Distribution')
        plt.ylabel('Frequency')
        plt.xticks(rotation=45)
        plt.legend()
        plt.show()
        qq = stats.probplot(df[col], plot=plt)
        plt.show()
    except Exception:
        pass

Such high skewness is undesirable for the features and labels fed to an ML model, so let's log-transform the variables with highly skewed distributions.

# Select numeric features by excluding non-numeric columns
numeric = df.dtypes[df.dtypes != 'object'].index
# Compute the skewness of each column, sorted in descending order
skewness = df[numeric].apply(lambda x: x.skew()).sort_values(ascending=False)
# Create a dataframe and show the 5 most skewed features
sk_df = pd.DataFrame(skewness, columns=['skewness'])
sk_df['skw'] = abs(sk_df)
sk_df.sort_values('skw', ascending=False).drop('skw', axis=1).head()
# As a rule of thumb, an absolute skewness below 0.5 is considered an
# acceptable range for normally distributed data
skw_feature = skewness[abs(skewness) > 0.5].index
# Transform skewed features towards normality by taking log(1 + input)
for col in skw_feature:
    df[col + "_log1p"] = np.log1p(df[col])
# Let's check the result of the transformation
sns.distplot(df['price_log1p'], label="Skewness: {:.3f}".format(df['price_log1p'].skew()))
plt.legend()
plt.title('log1p(price) Distribution')
plt.ylabel('Frequency')
plt.figure()
qq = stats.probplot(df['price_log1p'], plot=plt)
plt.show()

We can see that after log-transforming the price values, the distribution is much closer to normal (although not yet perfect).

Price prediction ML model

So much for the data exploration; let's build an ML model!

First let’s define the features that we are going to use.

# Label encoding
feat_enc = ['zip_code_number']
# Features (note: 'dont_have_frontyard' and 'monthly_expenses' are the columns created in create_cols)
feat_cols = ['nr_bedrooms','nr_rooms','nr_floors','nr_bathrooms','nr_toilet','zip_code_number_le','is_house','has_garage','dont_have_backyard','not_straat','not_gracht','not_plein','dont_have_frontyard','backyard_size_log1p','living_space_log1p','cubic_space_log1p','outdoor_space_log1p','total_area_log1p','monthly_expenses_log1p','parcela_log1p','construction_year']

We will train an XGBoost regression model, do a small grid search for hyperparameter tuning, and use cross-validation.

# Keep houses below 700k euros
df_filt = df[df['price'] < 700000].copy()
# Missing values: impute with the mode
for fr in ['construction_year']:
    df_filt[fr].fillna(df_filt[fr].mode()[0], inplace=True)
# Label encoding
for feat in feat_enc:
    le = LabelEncoder()
    le.fit(df_filt[feat])
    df_filt[feat + '_le'] = le.transform(df_filt[feat])

label = 'price_log1p'
x = df_filt[feat_cols]
y = df_filt[label]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape)

kfold = KFold(n_splits=5, random_state=0, shuffle=True)

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between
    true (y_true) and predicted (y_predict) values based on the metric chosen. """
    score = r2_score(y_true, y_predict)
    return score

XGB = XGBRegressor()
xg_param_grid = {
    'n_estimators': [900, 400, 1200],
    'learning_rate': [0.04, 0.06, 0.08],
    'max_depth': [3, 6, 8],
    'min_child_weight': [0.2],
    'gamma': [0, 1],
    'subsample': [0.8],
    'colsample_bytree': [1]
}

gsXGB = GridSearchCV(XGB,param_grid = xg_param_grid, cv=kfold, scoring="neg_mean_squared_error", n_jobs= -1, verbose = 1)
gsXGB.fit(X_train,y_train)
XGB_best = gsXGB.best_estimator_
print(gsXGB.best_params_)
y_hat_xgb = np.expm1(gsXGB.predict(X_test))

Let’s check the model’s performance.

r2_train = performance_metric(np.expm1(y_train), np.expm1(gsXGB.predict(X_train)))
r2_test = performance_metric(np.expm1(y_test), y_hat_xgb)
print("R2 train:", r2_train)
print("R2 test:", r2_test)
plt.scatter(np.expm1(y_train), np.expm1(gsXGB.predict(X_train)),label='R2:'+str(round(r2_train,4)))
plt.title('Train')
plt.xlabel('Price')
plt.ylabel('Price predicted')
plt.plot([100000,700000], [100000,700000], 'k-', alpha=0.75)
plt.legend()
plt.show()
plt.scatter(np.expm1(y_test), y_hat_xgb,label='R2:'+str(round(r2_test,4)))
plt.plot([100000,700000], [100000,700000], 'k-', alpha=0.75)
plt.title('Test')
plt.xlabel('Price')
plt.ylabel('Price predicted')
plt.legend()
plt.show()

The correlation is high, which is good, and in the plots you can see some vertical patterns in the asking prices. This is due to real estate agents/owners rounding prices.
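
A quick way to sanity-check that rounding hypothesis is to count how many asking prices land on round numbers:

# Fraction of asking prices that are exact multiples of 5,000 euros
print((df_filt['price'] % 5000 == 0).mean())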

It is also useful to take a look at feature importance, to understand which features are helpful for predicting house prices.

from xgboost import plot_importance
plot_importance(XGB_best)
plt.show()

As we have seen already in the data exploration section, the housing prices are highly correlated to the location of the house, the construction year, and its size.

Now, let’s use the model to predict housing prices, and look for houses for which the asking price is lower than the predicted price.

Xtt = pd.concat([X_train,X_test],axis=0)
ypred=pd.DataFrame([np.expm1(gsXGB.predict(Xtt)),Xtt.index]).transpose()
ypred.columns = ['pred','idx']
ypred.set_index('idx',inplace=True)
ytt = ypred.join(df_filt)
ytt['ratio'] = ytt['price']/ytt['pred']
ytt['difference'] = ytt['price'] - ytt['pred']

Now I would like to see houses that have more than 100 sqm indoors, more than 5 sqm of outdoor space, and for which the difference between asking and predicted price is strongly negative:

x = ytt[['ratio','pred','price','outdoor_space','dont_have_backyard','dont_have_frontyard','living_space','nr_floors','difference','href']].sort_values(by='ratio')
print(x[(x['outdoor_space'] > 5) & (x['living_space'] > 100)].head(10))

Let’s take a look at one example:

ratio     pred    price   outdoor_space  living_space  difference
0.814140  368487  300000  16             116           -68487
href: https://www.funda.nl/koop/zaandam/huis-40278208-lindenlaan-1/?navigateSource=resultlist

Here, the difference between the predicted price and the asking price is -68K euros. I checked huispedia.nl, and according to them 300K euros for that house also seems to be a good deal (although they predict values between 371K and 397K euros). Of course, this is the cold and heartless way of making such a big decision; you still need to click on the link, check the photos, and see if you don't mind the orange walls and the slanted ceiling.

Rent prediction

In a similar fashion, I used the rental prices obtained from funda.nl to build an ML model that predicts rent price given the house features. I won't describe the exact methodology, but I used the rent model to estimate the rent value of the 2,150 houses and apartments being sold on the website (for which we don't have rent prices). A sketch of that step is shown below.
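
As a rough sketch of that step (assuming the rental listings were parsed into a df_rent dataframe with the same feature columns and a log-transformed rent label, none of which is shown in this post):

# Sketch: df_rent and 'rent_log1p' are assumed to be built analogously
# to the sales data via create_cols and the log1p transform.
rent_model = XGBRegressor(n_estimators=400, learning_rate=0.06, max_depth=3)
rent_model.fit(df_rent[feat_cols], df_rent['rent_log1p'])
# Score the houses that are for sale with the rent model
ytt['rent_prediction'] = np.expm1(rent_model.predict(ytt[feat_cols]))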

I used the rent prediction to estimate what the payback period would be for each house if I were to rent it out after buying it (I called it ratio_sell_rent_year). To calculate it, I divided the price by the predicted monthly rent, and then by 12 to express it in years.

Finally, I plotted the ratio_sell_rent_year by zip code to see which areas are convenient in terms of return on investment.

ytt['ratio_sell_rent'] = ytt['price']/ytt['rent_prediction']
ytt['ratio_sell_rent_year'] = ytt['ratio_sell_rent'] /12
ax=ytt[['ratio_sell_rent_year','zip_code_number']].boxplot(by='zip_code_number', figsize=[25,8], rot=45)
ax.set_ylim([0,50])

Zip codes such as 1019/25, 1032/6 and 1102/8 seem to have the lowest payback periods, averaging around 15 years.
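
The same ranking can be read off directly:

# Median payback period (in years) per zip code, shortest first
print(ytt.groupby('zip_code_number')['ratio_sell_rent_year']
         .median().nsmallest(10))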

And as with the asking price model, I used the rent model to find houses that have more than 100 sqm of indoor area, more than 5 sqm of outdoor space, and a low ratio_sell_rent_year:

ratio_sell_rent_year  ratio_sell_rent  rent_prediction  price     outdoor_space  living_space
7.932343              95.188121        3571.874268      340000.0  46.0           166.0
href: https://www.funda.nl/koop/amsterdam/huis-40067181-moestuinlaan-12/?navigateSource=resultlist

I don't know what the rent price of that house is or will be, but I found a very similar house next to it here. The rent is 7,000 euros for a furnished and roughly 50 sqm bigger house. I actually went to see the house from outside, and it looked like a house I would have loved to live in, but it had already been sold by the time we contacted the makelaar (Dutch for real estate agent). Sadly, instead of fancy ML, it seems that moving fast is the best way to get a house in Amsterdam 😅.
