Preisatlas: transparent real estate prices in Germany, part 1

part one: the problem, problem decomposition, and geo

Yuri Veremeyenko
Homeday
Jul 25, 2019


This is the first part of the article that explains the problem and all things geo; continue to part two to read about prices, machine learning, API development, and the summary.

TLDR

We are Homeday and we’re in the real estate business.

We are working on digitizing this very traditional and conservative market, one piece at a time.

With our Preisatlas (English: Price Explorer), we’re bringing transparency to real estate prices all over Germany.

It’s been super fun to build, and we’re happy that lots of people use it on a daily basis.

Check it out (hint: click around on the map)

And here’s how it works in just 4 minutes: [video]

If you’re interested in how it works and how it was developed, keep reading!

The problem

Have you ever rented an apartment? I’m pretty sure you have. Have you ever wondered if the price you were paying was adequate and fair?

What about moving to a new city and looking for an apartment there?

You could spend a lot of time browsing listings, studying the city map, its districts and boroughs, its public transport routes, to finally have an idea about where you would like to live, and where you could afford to live.

It has to be way easier than that. It has to be faster than that.

After all, we’re in the 21st century with lots of computing power, data, and a bunch of tools at our disposal.

It should be a matter of a few clicks.

But wait, let’s dive a bit deeper.

What if you’re interested in how prices change over time — the trends?

You could be a homeowner looking to sell a property, an investor looking for the best return, or just someone looking to buy a home for yourself.

How do you get answers to questions like:

  • is this the right time to sell my property?
  • is this the right time to buy a property?
  • where have the prices fallen low enough to invest?
  • where have the prices risen high enough?

We didn’t find any online tool on the German market that could help answer those questions.

Decomposing the problem

Okay, so we have a problem: there is no easy way to access prices and price trends for a specific location.

Let’s establish some terms:

  • property. There are a number of different property types in real estate, so let’s keep them as simple as we can with:
      1. apartment: a typical apartment in a building block or a house;
      2. house: single family house, terrace house, bungalow, semi-detached house, etc.
  • marketing type. We will consider the obvious pair:
      1. rent;
      2. sell.
  • location. In real estate, location influences everything. Since we want to be precise, we will refer to location in geo terms:
      1. latitude (lat);
      2. longitude (lng).
  • price trends. This includes both the current price and historical trends, at the given location.

So we’d like to display price trends for the given property and marketing type at the location specified by the user.

At this point, we can already sketch a high-level overview of the solution: address → geo → coordinates → price engine → price trends.

Since our users are humans, they will want to input addresses, not coordinates. So we will have a “geo” component or service that is able to translate addresses to coordinates.

After that, we will pass the coordinates — as lat and lng — to the price engine, which is another component/service that will produce price trends.
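
The two-step flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Homeday's actual code: the names `Location`, `PriceTrend`, `Geocoder`, `PriceEngine`, and `price_trends_for_address` are all hypothetical, and the two components are passed in as plain functions so that each can be swapped or tested on its own.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Tuple

# The terms established above; the literal values mirror the lists in this article.
PropertyType = Literal["apartment", "house"]
MarketingType = Literal["rent", "sell"]


@dataclass(frozen=True)
class Location:
    lat: float
    lng: float


@dataclass(frozen=True)
class PriceTrend:
    current_price: float          # current price at the location
    history: Tuple[float, ...]    # past prices, oldest first


# Hypothetical interfaces for the two components (assumptions, not a real API).
Geocoder = Callable[[str], Location]
PriceEngine = Callable[[Location, PropertyType, MarketingType], PriceTrend]


def price_trends_for_address(
    address: str,
    property_type: PropertyType,
    marketing_type: MarketingType,
    geocode: Geocoder,
    price_engine: PriceEngine,
) -> PriceTrend:
    """Translate the address to coordinates, then ask the price engine."""
    location = geocode(address)
    return price_engine(location, property_type, marketing_type)
```

Keeping the geocoder and the price engine behind separate interfaces matches the split into two components: the geo part knows nothing about prices, and the price engine only ever sees lat and lng.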

Doesn’t look complicated, right? So let’s dive into the details a bit.

Building the geo

The geo domain proved to be somewhat complicated — primarily because a typical developer nowadays doesn’t touch geometries, projections, shapes, and geo extensions, but also because geo data changes pretty often and it’s not easy to establish the source of truth for it.

Getting deep into the geo domain is a topic for a separate article, and here we’ll just go over the findings relevant to Preisatlas.

For Preisatlas, we had the following requirements:

  • Geocoding. We needed to turn the addresses that users type into coordinates.
  • Prices for zip codes, districts, and cities. We wanted to be able to show prices for Berlin, for Ehrenfeld in Cologne, or just for some zip code.
  • Quality of life. We wanted some general measure of how nice it is to live in a place: how far away the shops are, how good the public transport connections are, how noisy it is, whether there are schools and kindergartens nearby, etc.

Geocoding

A quick Google search produces a bunch of geocoding services. We tried a few and took a detailed look at the following two:

  • Google’s geocoding API;
  • Nominatim.

One big advantage of Nominatim is that you can host it yourself and avoid paying Google. However, Nominatim is based on OpenStreetMap — a project that distributes free geographic data for the world — which unfortunately means it’s not always accurate or up-to-date, and that can lead to confusion and frustration for the user.

So we ended up using Google, which is not without its quirks, but they seem to be pretty stable and repeatable, so we could find workarounds.

The main takeaway from our geocoding experience was that the result was super sensitive to the input format:

  • 123 Langestr, Mitte, Berlin will geocode correctly;
  • 123 Langestr Berlin Mitte will not geocode and will just return coordinates for Berlin (or sometimes Berlin-Mitte);
  • 123 Langestr Berlin (Mitte) can confuse the geocoding algorithm enough to not return any coordinates at all.

So we implemented some heuristic input normalization that works for our use case.
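
To give a feel for what such a heuristic can look like, here is a minimal sketch. It is not the normalization we run in production: the function `normalize_address` and the lookup table `DISTRICTS_BY_CITY` are made up for illustration, and the sketch assumes single-word city and district names. It rewrites the three inputs above into the "street, district, city" form that geocoded correctly.

```python
import re

# Hypothetical lookup data; a real system would load this from a database.
DISTRICTS_BY_CITY = {"berlin": {"mitte", "kreuzberg"}}


def normalize_address(raw: str) -> str:
    """Reorder a free-form address into 'street, district, city'."""
    # Treat parentheses and commas as plain separators.
    tokens = re.sub(r"[(),]", " ", raw).split()

    # Find the first token that is a known city (case-insensitive).
    city = next((t for t in tokens if t.lower() in DISTRICTS_BY_CITY), None)
    if city is None:
        # Unknown city: only collapse separators and whitespace.
        return " ".join(tokens)

    # Find a token that is a known district of that city, if any.
    districts = DISTRICTS_BY_CITY[city.lower()]
    district = next((t for t in tokens if t.lower() in districts), None)

    # Everything that is neither city nor district is the street part.
    street = " ".join(t for t in tokens if t not in (city, district))
    return ", ".join(part for part in (street, district, city) if part)
```

With this table, `123 Langestr Berlin Mitte` and `123 Langestr Berlin (Mitte)` both come out as `123 Langestr, Mitte, Berlin`. Multi-word names and hyphenated forms such as Berlin-Mitte would need extra handling, which is exactly where this problem starts to get complicated.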

If you’d like to learn more, or gauge how complicated the normalization problem is, take a look at libpostal — a C library to normalize addresses.

Hint: don’t try libpostal in production before checking your memory limits, as loading it can easily consume a couple of GB of RAM.

Address components: zip codes, districts, cities

The geocoding process above usually returns latitude and longitude for the address, but it also returns address components. These are important for us, since we would like to use them to query the price engine for prices at those locations.

Now, let’s leave zip codes aside. In Germany, they consist of 5 digits, are clearly defined and assigned, and remain relatively stable.
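
One small pitfall is still worth a sketch: some German zip codes start with a zero (01067 in Dresden, for example), so they are best kept as strings rather than numbers. The helper name `is_german_zip` below is hypothetical, and the check only validates the 5-digit format, not whether the zip code is actually assigned.

```python
import re

# Exactly five ASCII digits; leading zeros are allowed.
_ZIP_RE = re.compile(r"[0-9]{5}")


def is_german_zip(value: str) -> bool:
    """Return True if the string has the 5-digit German zip code format."""
    return _ZIP_RE.fullmatch(value) is not None
```

Storing the same value as an integer would silently turn 01067 into 1067, which no longer passes this check.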

What about districts?

Well, for one, their names are not unique. For a lot of people, Kreuzberg will automatically mean Kreuzberg, Berlin, but there’s also Kreuzberg in Euskirchen and Ahrweiler.

Secondly, districts change their borders, get divided or merged into new districts, get renamed, or simply cease to exist.

Our Kreuzberg, if we talk about the one in Berlin, is now officially called Friedrichshain-Kreuzberg, as it has been merged (at least administratively) with the neighbouring Friedrichshain.

And let’s not forget that you can call it Berlin-Kreuzberg, that’s perfectly valid, too.

Cities are a bit easier but are also ambiguous: Frankfurt usually means Frankfurt am Main but can just as well mean Frankfurt (Oder).

We ended up having our own database of districts, cities and zip codes, primarily so that we could identify each of them uniquely — but also because we wanted to have their geo shapes.
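
A rough sketch of the idea, using the examples from this section: every place gets a unique identifier, and names and aliases map to a list of candidates rather than to a single place. The schema, the identifiers such as `d-1`, and the `resolve` function are assumptions made for illustration, not our actual database layout, and geo shapes are left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Place:
    place_id: str                  # hypothetical stable identifier
    kind: str                      # "district" or "city"
    name: str                      # canonical name
    parent: Optional[str]          # enclosing area, used to disambiguate
    aliases: Tuple[str, ...] = ()  # other names people use for the same place


# Sample data taken from the examples in this article; IDs are made up.
PLACES = [
    Place("d-1", "district", "Kreuzberg", "Berlin",
          ("Berlin-Kreuzberg", "Friedrichshain-Kreuzberg")),
    Place("d-2", "district", "Kreuzberg", "Euskirchen"),
    Place("d-3", "district", "Kreuzberg", "Ahrweiler"),
    Place("c-1", "city", "Frankfurt am Main", None, ("Frankfurt",)),
    Place("c-2", "city", "Frankfurt (Oder)", None, ("Frankfurt",)),
]


def build_index(places: List[Place]) -> Dict[str, List[Place]]:
    """Map every lowercased name and alias to the places that carry it."""
    index: Dict[str, List[Place]] = {}
    for place in places:
        for label in (place.name, *place.aliases):
            index.setdefault(label.lower(), []).append(place)
    return index


INDEX = build_index(PLACES)


def resolve(name: str, parent: Optional[str] = None) -> List[Place]:
    """Return all candidates for a name, optionally narrowed by parent."""
    candidates = INDEX.get(name.strip().lower(), [])
    if parent is not None:
        candidates = [
            p for p in candidates
            if p.parent is not None and p.parent.lower() == parent.lower()
        ]
    return list(candidates)
```

Here `resolve("Kreuzberg")` returns three candidates, while `resolve("Kreuzberg", parent="Berlin")` and `resolve("Berlin-Kreuzberg")` both narrow it down to the single Berlin entry. Returning a list makes the ambiguity explicit instead of silently picking the most popular match.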

Quality of life

Since we were going to show prices on the map anyway, and it was clear by this point that we would be dealing with zip codes, districts, and cities and their shapes, we also added a quality-of-life score. It ranges from 1 (lowest) to 5 (highest) and summarizes how well, on average, one could live at that location.

This is done on the “block” (German: Wohnblock, English: residential block or housing block) level, as districts and zip codes can be large and the quality of life can vary from one location inside the district to another.
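
To illustrate what a block-level lookup means in practice, here is a minimal sketch that finds the block containing a coordinate and returns its score. It is an assumption-heavy toy, not how Preisatlas does it: the function names and the data layout are hypothetical, blocks are simple polygons of (lng, lat) pairs treated as flat coordinates, and the containment test is the classic ray-casting algorithm. A real system would more likely rely on a geo extension of the database with a spatial index.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (lng, lat)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def quality_of_life_at(
    point: Point, blocks: List[Tuple[Sequence[Point], int]]
) -> Optional[int]:
    """Return the 1-5 score of the first block containing the point, if any."""
    for polygon, score in blocks:
        if point_in_polygon(point, polygon):
            return score
    return None
```

Given two neighbouring blocks with different scores, two points a few hundred metres apart can get different results even though they share a district and a zip code, which is the reason for working at this level.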

The topic itself goes quite deep into geo and other related domains, so we won’t go into details here.

End of part one

Congratulations, you’ve made it :)

We have covered the problem, a high-level solution, and the many faces of geo; read on to prices, machine learning, and APIs!


Yuri Veremeyenko
Homeday

Engineering Manager @Vinted, previously @Homeday. I like hiring and developing teams, Ruby, JS, Elixir, devops, and playing guitar.