Network Science for Entity Resolution
At the foundation of every decision at Casavo, there is data. Any decision we make has to be backed by facts and numbers, and the effect of every subsequent action must be quantitatively measured.
Think of the following use cases:
- Discovering where in Madrid houses are becoming more expensive;
- Identifying the neighborhoods of Paris where apartments sell the fastest;
- Training an Automated Valuation Model (AVM) to predict the price of a property.
In each of the use cases I just listed, it is crucial to have high-quality data that closely describes the context in which we operate. But what kind of data? The information we need has to capture how the real estate market is evolving, and the biggest source of it is listings, namely adverts for properties that are for sale.
Every day, thousands and thousands of listings are published online by real estate agents and private sellers on various listing platforms. At Casavo, we ingest a lot of this data, buying it from multiple providers. Every day, our jobs connect to their services, ask for the latest updates and analyze the fresh data. However, by doing so, we also ingest a lot of duplicates. Indeed, it’s very common for the same property to be listed more than once, across multiple listing platforms or even within the same one, to make it more visible. Speaking of numbers: we have found properties that were listed more than 100 times within a couple of months!
This large number of duplicates undermines the data quality we need: for instance, if we want to calculate whether the average asking price in Malasaña (Madrid) is going up, we need to get rid of all the duplicate listings, or else the average price would be biased. And this is where Entity Resolution comes into play.
What is Entity Resolution?
Entity Resolution, also known as record linkage or data deduplication, is the process of identifying and merging duplicate records in a dataset. This is often done to create a single, unified view of an entity across multiple systems or databases. The process typically involves comparing data elements, such as names, addresses, or other unique identifiers, to determine if two records refer to the same entity. And it goes without saying, it can be applied to listings as well.
Being able to collapse all the listings referring to the same property into a single entity enables us to perform more accurate analyses. For instance, knowing when a property was listed for the very first time tells us how long it has been on the market (and hence how long it takes to sell a similar property).
How we used to resolve entities…
Until a few months ago, Entity Resolution was performed by a simple SQL query that would merge listings referring to properties with common characteristics, such as address, floor, area, and price.
That query did the job, but there was a lot of room for improvement. In particular, it had a hard time catching cases in which one of the properties had a NULL value for one of the compared features, or in which one of the features changed slightly, such as a price fluctuation.
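To make that limitation concrete, here is a minimal Python sketch of the same exact-match idea (the real implementation was a SQL query; the column names and values below are purely illustrative):

```python
import pandas as pd

# Toy listings: all three describe the same property,
# but the third one is missing the floor.
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "address": ["Calle Gran Vía 1"] * 3,
    "floor": [3, 3, None],
    "area_sqm": [80, 80, 80],
    "price": [350_000, 350_000, 350_000],
})

# Exact-match criterion: two listings are duplicates only if
# all the compared features are identical.
match_keys = ["address", "floor", "area_sqm", "price"]
pairs = listings.merge(listings, on=match_keys, suffixes=("_a", "_b"))
pairs = pairs[pairs["listing_id_a"] < pairs["listing_id_b"]]

# Only listings 1 and 2 are matched: the missing floor (and, likewise,
# any price fluctuation) makes listing 3 fall through the comparison.
print(pairs[["listing_id_a", "listing_id_b"]])
```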
… and how we do it now!
Proceeding in this way led us to a lot of false negatives, namely listings that were indeed selling the same property but weren’t grouped together because of missing information.
We needed to find an alternative to the simple SQL-based approach.
By inspecting our data, we found out that the textual description of duplicated listings is quite often the same and rarely NULL… bingo! We had another criterion to leverage!
However, using the bare description is risky for a number of reasons: above all, because even a tiny modification would break the match. That’s why we decided not to abandon the good old SQL query, but rather to use both methods as alternative criteria for clustering.
And this is where we started to think of it as a network problem.
We framed our problem as a graph G = (V, E), where V is the set of all the listings in our database and E is the set of edges connecting them. Our idea was to find a set of edges E that connects two vertices v₁, v₂ if and only if they represent the same property.
Framing the problem this way allowed us to define two different types of edges:
- Edges that connect two listings if they have the same metadata (address, area, floor…);
- Edges that connect two listings if they have the same textual description.
Looking at it visually is way clearer than in words:
This picture says it all. The orange link saves us from ending up with two separate clusters: each of the two groups on the sides of the picture is held together by metadata edges, and they are then joined because two listings with the same textual description are found.
In a very big network, with millions of listings, we would find a number of subgraphs (a.k.a. connected components): each of them is a cluster of listings selling the same property!
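To make the idea concrete, here is a minimal networkx sketch with made-up listing IDs, reproducing the situation in the picture:

```python
import networkx as nx

G = nx.Graph()

# Metadata-edges: listings sharing address, floor, area, price...
G.add_edges_from([("A", "B"), ("B", "C")], kind="metadata")
G.add_edges_from([("D", "E")], kind="metadata")

# Description-edge: C and D share the same textual description,
# bridging the two metadata clusters into a single one.
G.add_edge("C", "D", kind="description")

# Each connected component is a cluster of listings selling the same property.
print(list(nx.connected_components(G)))  # [{'A', 'B', 'C', 'D', 'E'}]
```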
Performing Entity Resolution in this way allowed us to further refine how we merge listings together, capturing bigger groups and making the process more accurate overall. Yet we needed to go beyond a simple SQL query to realize this, as the tabular format is not well suited to graph processing.
Introducing Doduo
We decided to put in place an ad-hoc job that every day fetches all the listings we ingest and performs our Entity Resolution procedure. This is how it works:
- A query is launched on our Data Warehouse to generate the edges of the listings network. Two types of edges are retrieved: metadata-edges and description-edges.
- The edges are loaded into a graph, using the Python library networkx. The graph can then be used to find all the connected components, namely the clusters of listings selling the same property.
- The result is written back to the Data Warehouse (the whole pipeline is sketched right after this list).
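A minimal sketch of what such a job can look like (the function and column names here are illustrative, not Doduo’s actual code):

```python
import networkx as nx
import pandas as pd

def resolve_entities(edges: pd.DataFrame) -> pd.DataFrame:
    """Turn an edge list (listing_a, listing_b) into cluster assignments."""
    G = nx.from_pandas_edgelist(edges, source="listing_a", target="listing_b")
    rows = [
        {"listing_id": listing_id, "cluster_id": cluster_id}
        for cluster_id, component in enumerate(nx.connected_components(G))
        for listing_id in component
    ]
    return pd.DataFrame(rows)

# Illustrative edge list, as it could come out of the warehouse query
# (metadata-edges and description-edges concatenated together).
edges = pd.DataFrame({"listing_a": [1, 2, 4], "listing_b": [2, 3, 5]})
clusters = resolve_entities(edges)
print(clusters.sort_values("listing_id"))  # listings 1-3 and 4-5 form two clusters
# The resulting table is then written back to the Data Warehouse.
```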
The job goes by the name of Doduo, the first-generation two-headed Pokémon… I think you can guess why we chose this one!
Just like every job, it needs to be orchestrated: we do it with Argo Workflows, the orchestrator we use at Casavo and that we have already talked about here.
Some numbers
The problem we are trying to solve has no ground truth, which means we have nothing to compare our results against to measure “how good we are” at Entity Resolution. There’s no one who can say “Hey, I know for sure that listings A and B are the same”, apart from an army of humans doing it manually… but that’s just not feasible.
Nonetheless, we compute some diagnostic metrics every day that help us understand whether our algorithm is working properly. For instance, we calculate:
- How many clusters we can identify: at the moment of writing, we have 450k clusters in Madrid
- The size of the biggest cluster: at the moment of writing, the biggest cluster in Milan is composed of 262 listings
- The F1-score computed on a manually labeled dataset: at the moment of writing, our score is 0.82 (one possible pairwise formulation is sketched right after this list)
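One common way to score a clustering against manually labeled duplicates is to compare the pairs of listings that end up in the same cluster. The sketch below shows pairwise precision, recall, and F1 on toy data; it is not necessarily the exact formulation we use internally:

```python
from itertools import combinations

def same_cluster_pairs(clusters: list[set]) -> set[tuple]:
    """All unordered listing pairs that a clustering puts in the same cluster."""
    return {tuple(sorted(pair)) for c in clusters for pair in combinations(c, 2)}

predicted = same_cluster_pairs([{1, 2, 3}, {4, 5}])  # algorithm output (toy data)
labeled = same_cluster_pairs([{1, 2}, {3, 4, 5}])    # manually labeled ground truth

true_positives = len(predicted & labeled)
precision = true_positives / len(predicted)
recall = true_positives / len(labeled)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```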
You may think that some of these numbers are insane: how can a single property be published more than 200 times? Well, you’re underestimating how obsessed a property seller can be!
Further improvements & Conclusion
Although we have made significant advancements with Doduo, we know that many improvements are still possible. For instance, at the moment we are not incorporating any visual information, despite it being extremely valuable for spotting two potentially identical properties.
Additionally, we may assign a probability score to our clustering, telling you how confident Doduo is about its result.
These, and many others, are all little increments we’d like to apply soon. At Casavo, we aim to make consistent, incremental improvements to our products. Our developers stay up-to-date with technology advancements and are willing to try new methods for enhancing the value we offer to our customers. The development of these tools is a collaborative effort involving our Data Scientists, Product Managers, Software Engineers, and Platform Engineers, who work together to bring new ideas to life, build our products, and ensure their usability and availability.