Solving Data Consistency Issues in Our Places Platform

Vinícius Cousseau
Published in Incognia Tech Blog
Jun 10, 2020 · 11 min read

Location data is one of the core drivers of the work we do at Incognia. It provides us with a way to infer contextual information in the real world, which can in turn be used to model the spreading of a virus or build engagement solutions.

However, as rich as the location data might be, many other applications require more information on what a pair of latitude and longitude coordinates might represent: is this location inside a bar? What is the street address pertaining to this location? What’s near there after all?

Figure 1 — There’s much more information beyond latitude and longitude pairs!

In order to provide this kind of information, we built a database of Place entities, also commonly referred to as Points of Interest (POIs) in scientific literature. The planning of our places infrastructure began when our products were still in their infancy, and at the time we realized that:

  • We needed an in-house places database, since paying for third-party API usage would not scale well and we would not be able to enrich and validate the data ourselves;
  • Good coverage and reliable information were mandatory. Additionally, every place had to have at least latitude and longitude coordinates and a name;
  • We needed this infrastructure to behave like a platform, and, as we expected all other teams to become internal clients, data had to be available quickly and easily.

This article first describes our platform as of the beginning of Q1 2019, then expounds on the data consistency issues we found over the years. We also present our current solutions for these issues, and how we’re planning to improve them.

For ease of understanding, we define a place entity for this article as follows:

A place entity is a direct representation of a physical location in the real world which has well-defined boundaries, a purpose, a considerable size, and is both atemporal and a source of context.

Thus, it is useful to think that, in our database, places are most often either public sites or establishments. This definition excludes, for instance, building floors, stretches of sand on a beach, and landmarks with negligible boundaries, e.g. a statue.

Foundation: an Event-sourced Architecture

With these goals in mind, we set out to develop what would become an event-based platform that ingested place records from both the Web and internal sources, exposed them through APIs and web interfaces, and processed them daily with several distributed Spark applications. All the while, it provided full visibility into what was going on.

Figure 2 — Our Places architecture circa Q1 2019

To arrive at this architecture, we went through several iterations which gave us golden learning opportunities. While these are not the focus of this blog post, we highly encourage taking a look at one of our previous presentations about the topic.

As Figure 2 depicts, all data we ingested from the web was being sent to Apache Kafka. Additionally, users were able to manipulate our places database by utilizing our web interface (Places Web). Every action in our DB thus boiled down to a set of INSERT, UPDATE, DELETE or MERGE Kafka records, sent to the places-db-commit-log and then processed by our DB Writer service.

Let’s say, for instance, that we ingested a new place entity called Shopping Plaza: if we already had a place entity representing this real-world place in our DB, we would just UPDATE its information, since its phone number, address, timetable, or other attributes could have changed. If, however, there was no equivalent place entity in the DB, we would just INSERT it. Similarly, manual user operations executed through our web interface would generate operations from the same set.
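To make this flow concrete, here is a minimal sketch of how an ingestion step could publish such an operation event to the places-db-commit-log topic using the kafka-python client. The event schema, field names, and the publish_place helper are illustrative assumptions, not our actual implementation.

```python
import json
from typing import Optional

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_place(place: dict, existing_id: Optional[str]) -> None:
    """Emit an INSERT or UPDATE event to the commit-log topic."""
    operation = "UPDATE" if existing_id else "INSERT"
    event = {"operation": operation, "payload": {**place, "id": existing_id or place["id"]}}
    producer.send("places-db-commit-log", value=event)

# A newly ingested "Shopping Plaza" that matches an existing record becomes an UPDATE.
publish_place(
    {"id": "candidate-uuid", "name": "Shopping Plaza", "lat": -8.063, "lng": -34.872},
    existing_id="existing-uuid",
)
producer.flush()
```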

By using Incognia’s Kafka structure, we also gained, with little to no effort, automatic backup of every action issued to our DB. However, we still needed a way to visualize those actions in real time for debugging and analysis. To do that, we utilized Incognia’s Kafka Elasticsearch Injector, publishing every action event to a dedicated Elasticsearch cluster with Kibana.

Another important aspect of our architecture was the usage of three different data sources. First of all, we note that our PostgreSQL DB was, and still remains, our canonical data storage. Furthermore, while it may seem odd to maintain three data stores, we found that our clients required certain query types for which there were better-suited tools.

For full-text queries, for example, Elasticsearch offers a full range of query operators and optimizations that we would otherwise have to implement by hand in PostgreSQL, generating code that is hard to maintain, hard to improve, and very error-prone.

Similarly, Redis proved to be an amazing tool for simple and efficient in-memory key-value queries, including our main use case of geospatial queries which are gracefully performed by the GEORADIUS operator.
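As an illustration of that use case, the sketch below stores a place in a Redis geo set and runs the kind of radius query we rely on. It assumes the redis-py client (version 4 or later) and a hypothetical places:geo key; our real key layout differs.

```python
import redis  # redis-py >= 4

r = redis.Redis(host="localhost", port=6379)

# GEOADD expects (longitude, latitude, member) triples for each place.
r.geoadd("places:geo", [-34.8717757, -8.0631229, "place-A"])

# GEORADIUS answers "which places are within 100 meters of this point?"
nearby = r.georadius("places:geo", -34.8717757, -8.0631229, 100, unit="m")
print(nearby)  # [b'place-A']
```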

All in all, we learned that there is no silver bullet when it comes to picking a database to deal with complex records such as places, which will be used by multifaceted applications. Each one will have pros and cons, and it is up to the developer to choose how to combine them to best suit their needs.

Finally, our architecture was also able to handle large-scale batch processing of our whole dataset with distributed algorithms and machine learning models. As a result, we were able to explore and iterate quickly over solutions for problems found in our data.

Consistency Issues

After some time using this architecture, we started noticing a few issues. One of the most important ones was that all synchronization between our three data sources was being done programmatically by us, which led to several inconsistencies between the databases.

On top of that, by ingesting a lot of data from different sources, we began to face several data quality issues. For instance, we mentioned that when receiving a new record, we needed to check whether or not it was already present in the database. As it turns out, this problem is far from easy, and is in fact studied by different scientific research areas, such as Entity Resolution/Record Linkage.

Meanwhile, as we acquired more knowledge in dealing with places data, we noticed that almost all of the issues we approached were complex enough to deserve their own blog post.

Summing them up, we highlight:

  • Places deduplication/matching;
  • Synchronization between data sources;
  • Categorization of Places (detecting if a given place is a restaurant, a supermarket, etc.);
  • Detecting chain stores;
  • Enriching and normalizing address data;
  • Detecting hierarchy between places;
  • Generating Place entities through other data, such as movement patterns.

Over the years, we have gained significant experience in dealing with all the aforementioned architectural and data issues, but especially in the DB synchronization and places deduplication tasks.

These tasks are inherently connected, since both of them affect our data consistency: the former deals with consistency across different data sources, while the latter deals with consistency within a single data source. Thus, we go deeper into these two problems and how we dealt with them.

Synchronizing Different Databases

Suppose that you are querying two different endpoints of a single API, one providing full-text search and another providing geospatial queries. Among the first query’s results, you receive a record with id A, which has latitude and longitude coordinates of <-8.0631229, -34.8717757>. Then, you query the geospatial endpoint, looking for places within a 100-meter radius of these exact coordinates, but record A is nowhere to be seen.

Frustrating, isn’t it? This was one of the synchronization issues happening in our Places API. Because we manually replicated INSERT, UPDATE, DELETE, and MERGE actions in our DB Writer, a failure in one of the DBs combined with a success in the others left our DBs in inconsistent states.

To fix that, we could attempt to roll back each successful operation whenever another DB reported a failure, but these rollback operations could fail as well. In short, it is hard to achieve transactional behavior across three different databases simultaneously.

It was at that time that we read about the Change Data Capture (CDC) pattern and Debezium. A recent post on our blog already covered Debezium and the CDC pattern in depth, but in short, Debezium is able to read every change applied to a database (in our case, PostgreSQL) and stream it to Kafka in an event-sourced format, similar to our places-db-commit-log.

By consuming these records, one is able not only to reconstruct a previous state of the database, but also to replicate it elsewhere, which, fortunately, was exactly our use case. Hence, we configured and started using Debezium to stream data from our canonical PostgreSQL data storage, and made use of Kafka sink connectors to read the data and write it to our Elasticsearch and Redis databases. Figure 3 displays our Debezium integration.

While Confluent has a public Elasticsearch sink connector available, we had to create the Redis sink ourselves, since no market solution seemed to handle both record deletions and geospatial data as well as we’d hoped. We intend to make this project public in the future.

Figure 3 — Debezium integration with Elasticsearch and Redis connector sinks.
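To give an idea of what such a sink does, here is a minimal sketch of a consumer that applies Debezium change events to a Redis geo set. The topic name, column names, and key layout are hypothetical, and the real sink runs as a Kafka Connect connector rather than a standalone consumer.

```python
import json

import redis
from kafka import KafkaConsumer

r = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "postgres.public.places",  # hypothetical Debezium topic for the places table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:  # tombstone emitted after a delete; nothing to apply
        continue
    payload = message.value["payload"]
    op, after = payload["op"], payload.get("after")
    if op in ("c", "u", "r") and after:  # create, update, or snapshot read
        r.geoadd("places:geo", [after["lng"], after["lat"], after["id"]])
    elif op == "d":  # delete: geo sets are sorted sets, so ZREM removes the member
        r.zrem("places:geo", payload["before"]["id"])
```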

By modifying this part of our architecture, we consolidated PostgreSQL as our canonical data storage, and provided eventual consistency to all our other DBs.

Having eventual consistency means that all queries for a specific record will eventually return the same value; in our case, “eventually” boiled down to latencies upwards of tens of milliseconds during normal operation. We were also able to get rid of a great chunk of code in our DB Writer, which made the service much easier to maintain and less error-prone.

Places Deduplication

Deduplicating places is probably the hardest of the issues we’ve faced, and the one whose resolution is most relevant to our company. Since we consume data from different sources, including the web (which is, to put it mildly, noisy), it is common for us to receive Place entities representing the same real-world place over and over again.

As mentioned when describing our architecture, we need to solve this problem not only in an online fashion, but also via batch processing of our data.

We define the problem of Places deduplication as such:

Given a set of place entities, how do we detect if they represent the same real-world place?

Which, in turn, can be relaxed to:

Given a pair of place entities, how do we detect if they represent the same real-world place?

Our first attempts to solve this problem were very domain-specific and manual, but resulted in an unacceptable rate of false positives and false negatives. Thus, we also started to explore machine learning approaches for this issue, backed by other research in the area of Record Linkage/Entity Resolution that tended to go this way [1, 2].

Subsequently, our current deduplication pipeline in production makes use of distributed computation to partition data and generate potential duplicate place pairs, and then sends each pair to a model which discerns between pairs indicating duplicate places (positive) and pairs indicating non-duplicate places (negative). This model was first trained as a Random Forest on top of pairwise place features, and then adapted to the LGBM framework in order to reduce memory footprint for online usage.
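As a rough sketch of that pairwise classification step, the snippet below trains a Random Forest on a few hand-made pairwise features (name similarity, distance, category match). The features and values are purely illustrative; our real feature set and training data are far richer.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative pairwise features: [name similarity, distance in meters, same category?]
X_train = [
    [0.95, 12.0, 1],   # near-identical names, 12 m apart, same category -> duplicate
    [0.91, 40.0, 1],
    [0.20, 850.0, 0],  # different names, far apart, different category -> not duplicate
    [0.35, 500.0, 0],
]
y_train = [1, 1, 0, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Probability that a new candidate pair represents the same real-world place.
print(model.predict_proba([[0.88, 35.0, 1]]))
```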

The binary output, which solves the relaxed version of the problem, can then be brought back to the original statement by representing each place as a node in a graph in which an edge connecting two places indicates that they are duplicates. Detecting the connected components of this graph gives us the end result we are looking for.
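A minimal sketch of that final step, using networkx and made-up place IDs: each positive pair becomes an edge, and each connected component becomes one group of duplicates.

```python
import networkx as nx

# Pairs our classifier labeled as duplicates (IDs are made up).
duplicate_pairs = [("a", "b"), ("b", "c"), ("x", "y")]

graph = nx.Graph()
graph.add_edges_from(duplicate_pairs)

# Each connected component groups all entities representing the same real-world place.
groups = list(nx.connected_components(graph))
print(groups)  # [{'a', 'b', 'c'}, {'x', 'y'}]
```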

As Figure 4 shows, we reached an end-to-end solution for the problem, with our classification model both generating batch results and being exposed via APIs for real-time queries.

By offering APIs for our solution, we were able to improve real-time detection of duplicate places, preventing data inconsistencies before they even happened in the database itself. At the same time, our model reached a normalized Gini coefficient of 0.92 on top of our ground truth, and ended up being published as a scientific paper [3].

Figure 4 — Place deduplication pipeline, using a Random Forest as the classification model.

We were not fully satisfied with the results, however, and following successful reports of deep learning usage to solve matching problems in other domains, we decided to explore that field.

This resulted in a new solution that is well on its way to replacing the classification method shown in Figure 4 with a deep neural network architecture, which aims to generate word-level, character-level, geographical, and category-level embedded representations of places and compare them to generate a binary output, akin to Figure 5. Although not yet in production, this new method has already been able to outperform our previous one in several metrics.

Figure 5 — A simplified view of our deep neural network for duplicate places classification.
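For intuition only, here is a toy PyTorch sketch of that comparison idea: each place is encoded from character-level name embeddings, a category embedding, and its coordinates, and the two encodings are fed to a small classifier head. The layer sizes, inputs, and overall structure are assumptions for illustration and are much simpler than the architecture in Figure 5.

```python
import torch
import torch.nn as nn

class PlaceMatcher(nn.Module):
    """Toy siamese-style matcher: encode each place, then classify the pair."""

    def __init__(self, vocab_size: int, n_categories: int, dim: int = 32):
        super().__init__()
        self.char_embedding = nn.EmbeddingBag(vocab_size, dim)     # character-level name encoding
        self.category_embedding = nn.Embedding(n_categories, dim)  # category-level encoding
        self.classifier = nn.Sequential(
            nn.Linear(2 * (2 * dim + 2), 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def encode(self, name_chars, category, coords):
        # Concatenate name, category, and geographic representations of one place.
        return torch.cat(
            [self.char_embedding(name_chars), self.category_embedding(category), coords], dim=-1
        )

    def forward(self, left, right):
        pair = torch.cat([self.encode(*left), self.encode(*right)], dim=-1)
        return torch.sigmoid(self.classifier(pair))  # probability of being duplicates
```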

Another key takeaway we had while exploring this issue was that including human feedback in the deduplication loop is crucial, since deduplication is a hard task and not even the best algorithms will get everything right. This coincided with the creation of an internal Places QA team, and a shift in our focus to give direct user-made operations the highest possible priority in our DB, even over our models.

We attempted to improve that by creating a Places Playground, where place pairs our models are unsure about are sent to undergo a voting process. Positive pairs are then consolidated in our DB and help us improve our ground-truth set. Figure 6 displays a sample pair up for voting in the Places Playground.

Figure 6 — A sample place pair up for voting in our Places Playground. Notice that this case was hard for our model due to slight differences in addresses and coordinates, which may point to different stores of a chain inside the same mall.

Our Architecture Today

Since we only touched upon a few core issues that we tackled in recent months, Figure 7 displays only part of our full architecture, omitting some new services and flows that would detract from this post’s readability.

Figure 7 — Our current architecture, with some services filtered out.

Closing Remarks

In this post, we described a past state of our places data platform at both the architectural and data levels, and touched upon some of the core issues we had to tackle, while sharing some of our key learnings. We hope this article is of use to anyone exploring places data or building a data platform.

References

[1] Santos, R., Murrieta-Flores, P., Calado, P., and Martins, B. (2018). Toponym Matching through Deep Neural Networks. International Journal of Geographical Information Science, 32(2).

[2] Yang, C., Hoang, D. H., Mikolov, T., and Han, J. (2019). Place Deduplication with Embeddings. In The World Wide Web Conference (WWW ’19). https://doi.org/10.1145/3308558.3313456

[3] Cousseau, V., and Barbosa, L. (2019). Large-scale Record Linkage of Web-based Place Entities. In Proceedings of the 34th Brazilian Symposium on Databases. https://sol.sbc.org.br/index.php/sbbd/article/view/8820
