Get the most out of the tool by learning how to use it with real-life data
At the very heart of a Location Intelligence company like Geoblink lies a tool that enables us to digest all kinds of data from different sources: the so-called Geocoder, an internal tool that converts address strings into latitude and longitude coordinates. This is critical for placing a given address on a map with a good level of precision.
We should stress that this is quite a challenging task. The given addresses will not always be provided in a clean, standardized manner; indeed, there are a number of hard cases that come up in the daily work with geospatial information. To give you a few examples:
- Addresses can be located in municipalities with names that occur several times across the country (Torrent is the name of both a city in the autonomous community of Valencia and a village in Catalonia)
- They can be located in tiny villages with barely more than 20 inhabitants or in development areas that were built only a few years ago
- They could be using different languages (“Carrer d’Aragó” and “Calle de Aragón” actually refer to the same street in Barcelona in Catalan and Spanish, respectively)
- Finally, and most importantly, almost any data we receive contains a share of unclean information. Dirty addresses can be as easy to spot and fix as a typo (“Bibao” instead of “Bilbao”) or as hard to spot and impossible to restore as truncated information (we were once passed a list of addresses where the last character had been cut off, transforming “Avinguda Diagonal 132” into “Avinguda Diagonal 13”).
And of course, as we do not only process Spanish addresses but also French and British ones, we have to take into account that each country has its own subtleties and idiosyncrasies when it comes to address specifications.
As many data sources do not use coordinates but addresses as their primary geographic information, it was fundamental to have a tool for their conversion. The solution we came up with is a complex tool that combines several geocoding methods: a few we developed ourselves based on public and private sources (e.g., Cartociudad, the project of Spain’s National Center for Geographic Information, or the Royal Mail’s information on British postal codes) and third-party geocoding methods (Google and Nominatim). Finally, we blended in our own machine learning-based NLP system for cleaning addresses. The resulting tool (which we will describe more thoroughly in a future post) allows us to geocode addresses in a fast, scalable and intelligent way.
After over a year of intense usage, we have found that the Google Geocoding service provides results of excellent quality. Nevertheless, working with real-life data, there were some things that caused us quite a headache until we figured them out, which is why we would like to share with you three things to avoid when using Google’s Geocoding API:
1) Strip your address of any information more granular than a house number. We often try to geocode postal addresses, and in Spain these often include additional information such as the flat number (for example “first floor, flat C”, or in short “1ºC”) or unit numbers (in Spanish, e.g. “Local 2”).
Interestingly enough, the Google Geocoding API often seems unable to process this kind of additional information and gives bad results if you include it. For example, if we call the service with the address (“C/” means “Calle”, or street)
"C/ Arona nº 29, Local 2, Montequinto",
it will fail to understand the address and instead return the data for
"Avda. Condes de Ybarra, s/n, 41089 Dos Hermanas".
It is not evident why the service came up with an entirely different street name here. However, if we remove the additional description (“Local 2”) and call the service with
"C/ Arona nº 29, Montequinto",
the address is correctly geolocated.
Even though we have seen some improvements in recent months concerning this kind of additional information, we learned from these results never to pass a completely untreated address to Google’s Geocoding API. When possible, you should always remove subpremise information such as flat number or door, “local” information in Spain and, e.g., unit numbers for shopping malls in Great Britain.
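The cleanup described above can be sketched in a few lines of Python. This is a minimal illustration, not our production cleaner: the function name `strip_subpremise` and the keyword list are assumptions made for this example, and the approach only works when the subpremise information sits in its own comma-separated segment.

```python
import re

# Hypothetical keyword list; extend it for the countries you work with.
# First alternative: "Local 2", "Unit 14", "flat C". Second: "1ºC", "2ª B".
SUBPREMISE = re.compile(
    r"(?:local|piso|puerta|planta|unit|flat)\s+\w+|\d+\s*[ºª]\s*\w*",
    re.IGNORECASE,
)

def strip_subpremise(address: str) -> str:
    """Drop comma-separated segments that consist only of subpremise info."""
    parts = [part.strip() for part in address.split(",")]
    kept = [part for part in parts if part and not SUBPREMISE.fullmatch(part)]
    return ", ".join(kept)

cleaned = strip_subpremise("C/ Arona nº 29, Local 2, Montequinto")
print(cleaned)  # C/ Arona nº 29, Montequinto
```

Matching whole segments rather than substrings keeps the house number in “nº 29” untouched, at the price of missing subpremise information that is not separated by commas.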
2) Take great care with low precision or approximate results. While it sounds tempting to include these results to have at least an estimate of where the searched place is, such a result often indicates that the geocoding process actually failed.
Consider for example the following address, which is a small variation of one that actually reached our Geocoder:
"C. Santo Domingo 17, 29600, Albarizas, Spain"
A human trying to geocode it would have difficulties because this address does not seem to exist in Spain (most probably the municipality of Marbella is meant, of which 29600 is the postal code). Calling the Google Geocoding API for this address and reducing the scope to Spain returns a result of the quality “approximate” with the magic coordinates (40.46366700000001, -3.74922), a sports center in the north of Madrid. This result is at least 500 km off Marbella and there is no Santo Domingo street nearby. Additionally, after monitoring all the results of Google’s geocoder for a while, we figured out that the same result had been returned more than 10,000 times, 3% of all the requests we made in that period. So how can this be explained?
Taking a deeper look at the result, we were able to figure out what was happening. The service was not able to geocode the address, and instead of telling us there were no results it fell back to geocoding the smallest part it was able to understand: the last word of the string, “Spain”. Interestingly, according to Google the geometric center of Spain lies at the aforementioned coordinates. The same was happening in the UK, by the way, where the point lies in the middle of the Scottish countryside. We noticed this bug when we geolocated a few thousand addresses of which several hundred ended up in the same spot, the infamous sports center in the north of Madrid.
We learned from these observations to always validate the type of the returned result: if it is not “street address” but “municipality” or even “country”, we check whether this matches the desired accuracy and filter the result out if not.
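The kind of monitoring that exposed this fallback point can be approximated with a simple frequency count over the returned coordinates. The following is a rough sketch under our own assumptions: the function name `suspicious_points` and the thresholds are illustrative, and the sample coordinates other than the fallback point are made up.

```python
from collections import Counter

def suspicious_points(coords, min_share=0.01, ndigits=5):
    """Return coordinates that account for an unusually large share of results.

    coords is an iterable of (lat, lng) pairs; min_share is the fraction of
    all results above which a single point is flagged.
    """
    rounded = [(round(lat, ndigits), round(lng, ndigits)) for lat, lng in coords]
    total = len(rounded)
    counts = Counter(rounded)
    return {point: n for point, n in counts.items() if total and n / total >= min_share}

# Three results collapse onto the same point, two are spread out.
sample = [(40.46366700000001, -3.74922)] * 3 + [(41.38, 2.17), (43.26, -2.93)]
flagged = suspicious_points(sample, min_share=0.5)
```

Any point that shows up in `flagged` deserves a manual look: many distinct addresses landing on exactly the same coordinates is a strong hint that the geocoder fell back to a coarser entity.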
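A check of this kind could look roughly as follows. The sketch operates on a single entry of the `results` list in the Geocoding API's JSON response, using its `types` and `geometry.location_type` fields; the function name, the set of accepted types and the two sample dictionaries are assumptions for illustration, not real API output.

```python
# Result types we treat as address-level; adjust to the accuracy you need.
ADDRESS_LEVEL_TYPES = frozenset({"street_address", "premise", "subpremise"})

def is_precise_enough(result, accepted_types=ADDRESS_LEVEL_TYPES,
                      allow_approximate=False):
    """Return True if one geocoding result matches the desired granularity."""
    location_type = result.get("geometry", {}).get("location_type")
    if location_type == "APPROXIMATE" and not allow_approximate:
        return False
    # Keep the result only if at least one of its types is acceptable.
    return bool(set(result.get("types", [])) & set(accepted_types))

# Hand-written dictionaries that mimic the response shape.
country_fallback = {
    "types": ["country", "political"],
    "geometry": {"location_type": "APPROXIMATE"},
}
street_hit = {
    "types": ["street_address"],
    "geometry": {"location_type": "ROOFTOP"},
}

print(is_precise_enough(country_fallback))  # False
print(is_precise_enough(street_hit))        # True
```

When a municipality-level answer is good enough for the use case, the same function can be called with a wider `accepted_types` set and `allow_approximate=True`.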
3) The components parameter might not work as you expect it to. Google’s Geocoding API allows you to pass not only the address string but also, separately, a number of parameters that specify the postal code, municipality, region and country of the desired address. If you have this information separately, it seems intuitive to pass it to Google separately to save the geocoder some work instead of merging it all into one string that you pass as the address parameter. In some real-life examples, however, we noticed that this assumption may be wrong. Passing Google’s geocoder the address
"Calle Laparra, 5", components="country:es|locality:Benidorm"
returns
"06176 La Parra, Badajoz, Spain"
which is a municipality in Spain but not the desired address. Passing instead
"Calle Laparra, 5, Benidorm", components="country:es"
with the locality in the address string rather than in components, however, returns the correct information for
"Calle la Parra, 5, 03501 Benidorm, Alicante".
Now, you might object that this is a difficult case as the address is misspelled (street name “Laparra” instead of “La Parra”). Nevertheless, in the real world we have to work with dirty data, and one would intuitively expect that reducing the scope to the city of Benidorm should make it easier for the algorithm to correct the typo. The reason why the result actually gets worse can be guessed after some research in the documentation:
“In the geocoder, component filtering enforces only postal_code and country restrictions”, while “[t]he [other] components may be used to influence results”.
Google Geocoding API — Developer Guide
As our example works with a city name (“locality”) and not with a postal code, the search ends up being fully determined by the address string while the locality specification does not visibly influence the results.
We learned from these examples that you should only use the components option for the two parameters “country” and “postal code”, which actually influence the results in an intuitive way. If we are only given the city name, we concatenate it to the address string to improve the results of Google’s Geocoding API.
After over a year of using Google’s Geocoder, we can say that we have gained good insight into the quality of the results it produces. While we appreciate the high rate of successful geocodings, even for dirty address data, the three pitfalls above illustrate that the geocoding of addresses is a very complex problem and that even a tech giant like Google doesn’t get it all right.
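In code, this rule boils down to how the request parameters are assembled. The helper below is a minimal sketch with a hypothetical name, `build_geocode_params`; it only builds the `address` and `components` query parameters and leaves the actual HTTP call and API key handling aside.

```python
def build_geocode_params(street, locality=None, postal_code=None, country=None):
    """Build query parameters, keeping only country and postal code as components."""
    address_parts = [street]
    if locality:
        # The locality goes into the free-text address, not into components.
        address_parts.append(locality)

    components = []
    if country:
        components.append(f"country:{country}")
    if postal_code:
        components.append(f"postal_code:{postal_code}")

    params = {"address": ", ".join(address_parts)}
    if components:
        params["components"] = "|".join(components)
    return params

params = build_geocode_params("Calle Laparra, 5", locality="Benidorm", country="es")
print(params)
# {'address': 'Calle Laparra, 5, Benidorm', 'components': 'country:es'}
```

This reproduces the second, successful request from the Benidorm example: the city name ends up in the address string and only the country is passed as a component filter.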