Using Machine Learning at Geoblink to improve the quality of our database
Or, why addresses are complicated
One of the major problems we deal with at Geoblink is spotting duplicated information (points of interest or any other locations) on our map. It turns out we do have something in common with Google Maps…
Just like Google Maps, we want to keep our map up to date with the latest information. However, because we draw on a variety of data sources, the same physical point is sometimes represented on our map by two or more markers. One reason is how creative people get when it comes to formatting an address: addresses coming from different data sources can be formatted very differently, so it sometimes becomes extremely difficult to tell whether two addresses represent the same physical point.
Let’s take a look at a real-world example from our Spanish database: “avenida santa barbara, 59” and “Centro Comercial La Ventana- 59 Av. Sta Barbara” represent the same physical store. If they are not detected as the same store, it will appear as two different stores in our app. The problem gets much harder for stores located on the corner of two streets, in a city in a bilingual region (e.g. Barcelona in Catalan and Spanish, San Sebastian in Basque, …) or on a road.
Besides working hard on standardising the addresses we obtain from our data sources, at Geoblink we have decided to use Machine Learning to detect duplicated landmarks in our database. This isn’t meant to be an in-depth tutorial on how to use Machine Learning, just a brief explanation of how it can help solve this kind of problem.
In Machine Learning jargon, the problem can be framed as a supervised classification problem, meaning that we use labelled past data to train a model that helps us answer a specific question: given a new dataset of stores and addresses we have collected, which of them already exist in the database (duplicates) and which don’t (i.e. should be added to our database)?
We proceeded as follows. In a first preprocessing stage, the algorithm extracts from the database the most probable duplicate candidate for each of the potentially new addresses we want to add, thus creating pairs (or “couples”) of an ‘old’ address with a ‘new’ address. The features used here and in the following steps to characterise each pair are variables such as the similarity between the addresses, the similarity between the cities and the distance between the two points of the pair.
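To make the feature step more concrete, here is a minimal sketch of how such variables could be computed for one ‘old’-‘new’ couple. It is not our production code: the function names, the tiny abbreviation table, the record layout and the coordinates are all assumptions made up for illustration, and the text similarity is simply the ratio from Python’s standard `difflib`.

```python
import math
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a realistic one would be far larger.
ABBREVIATIONS = {"av": "avenida", "avda": "avenida", "sta": "santa"}


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, expand a few abbreviations."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalised forms of two strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    radius_m = 6_371_000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(h))


def pair_features(old: dict, new: dict) -> dict:
    """Build the feature dictionary for one 'old'-'new' couple."""
    return {
        "address_similarity": text_similarity(old["address"], new["address"]),
        "city_similarity": text_similarity(old["city"], new["city"]),
        "distance_m": haversine_m(old["lat"], old["lon"], new["lat"], new["lon"]),
    }


# Addresses from the example above; city and coordinates are invented.
old = {"address": "avenida santa barbara, 59", "city": "Madrid",
       "lat": 40.4000, "lon": -3.7000}
new = {"address": "Centro Comercial La Ventana- 59 Av. Sta Barbara", "city": "madrid",
       "lat": 40.4001, "lon": -3.7001}
print(pair_features(old, new))
```

A plain character-level ratio like this one is sensitive to word order and to extra tokens such as the shopping centre name, which is one reason why several features are combined instead of relying on address similarity alone.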
After that, we apply our Machine Learning model to these ‘old’-‘new’ pairs in order to classify them into 3 groups:
- duplicates: points that almost certainly already exist in the database
- new: points that almost certainly don’t exist in the database
- to check: cases where it is tricky to give a clear answer based only on the variables.
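One common way to obtain three groups from a binary duplicate/non-duplicate classifier is to threshold the predicted probability of being a duplicate and send the uncertain middle band to manual review. The sketch below illustrates that idea only; the function name and the 0.1 and 0.9 thresholds are assumptions, not the values we use.

```python
def triage(duplicate_probability: float, low: float = 0.1, high: float = 0.9) -> str:
    """Map a predicted duplicate probability to one of the three groups.

    The thresholds are illustrative assumptions; in practice they would be
    tuned on labelled data to balance manual checks against mistakes.
    """
    if not 0.0 <= duplicate_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if duplicate_probability >= high:
        return "duplicate"
    if duplicate_probability <= low:
        return "new"
    return "to check"


for p in (0.97, 0.50, 0.02):
    print(p, "->", triage(p))
```

Widening the gap between the two thresholds sends more pairs to manual review but reduces the risk of a wrong automatic decision.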
The model has been trained on a large, manually labelled dataset of pairs marked as duplicates or non-duplicates, and then stored for reuse. The idea is to keep enlarging that dataset over time.
Is this working well, and what results do we obtain?
We tried several machine learning models and in the end we settled on a Random Forest, which made only 7 mistakes out of 22,000 predictions on our test set.
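To put those two numbers in perspective, the short calculation below turns them into an error rate and an accuracy. Only the 7 and the 22,000 come from our test set; the function name is just for illustration.

```python
def error_rate(mistakes: int, predictions: int) -> float:
    """Fraction of predictions that were wrong."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return mistakes / predictions


rate = error_rate(7, 22_000)
print(f"error rate: {rate:.4%}")    # error rate: 0.0318%
print(f"accuracy:   {1 - rate:.4%}")  # accuracy:   99.9682%
```

Keep in mind that a single accuracy figure does not say which kind of mistake was made: a duplicate flagged as new and a new point flagged as a duplicate have different consequences for the map.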
While the results are certainly promising, there’s still a lot of work to do before the model “learns” the more complicated duplicate cases. In any case, this research exercise will greatly help us automate more of the maintenance work and maximise the quality of our database.
By Yann Bignier & Jordi Giner Baldó