Address Entity Matching with DeepMatcher

Marinelin · Reputation.com Datascience Blog · Jun 24, 2020

The traditional way to match two similar addresses is to map each address to its geographic coordinates. In scenarios where we don't have access to coordinates, or we need a faster way to decide whether two address strings match based on the text alone, the DeepMatcher package from anhaidgroup can come in handy. Here, I am going to walk you through how I tackled entity matching with DeepMatcher inside a Docker container, and share the model results.

Goal

An example of similar addresses

The motivation behind this is that U.S. addresses are tricky. We have "route 17 south" and "NJ-17" both referring to the same address. It is easy for us humans to tell that these are probably the same address, but how do we train a model to identify matching addresses without any geographic information? Given two similar addresses, the goal here is to recognize that they are the same address and then output the normalized address. That way the data in the database becomes more consistent, and it also resolves the discrepancies between the addresses in the database and the addresses returned by the Google API.

Model pipeline

Methodology

I experimented with the SIF, RNN, Attention, and Hybrid models inside the DeepMatcher package. DeepMatcher compares the two addresses by first embedding the words of each attribute and then computing an attribute similarity from those embeddings. For attribute summarization, the package provides an aggregate function model (SIF) and a sequence-aware model (RNN). For attention-based summarization, there is an Attention model and a Hybrid model (a sequence-aware model with attention) to experiment with.

Here is a diagram from the DeepMatcher paper explaining the workflow. An attribute embedding layer processes the sequences of address tokens, and the outputs are then used to compute similarity representations for the sequences. The SIF model computes a weighted average of the attribute embeddings together with an element-wise absolute difference. The RNN model differs in that it takes the order of the words into account. The Attention model uses decomposable attention together with vector concatenation. The Hybrid model combines a bidirectional RNN with attention and vector concatenation (Mudgal et al., 2018).
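
As a quick illustration of how these four variants map to code, here is a minimal sketch of constructing them with DeepMatcher's MatchingModel; this only shows model construction, not the full training pipeline:

import deepmatcher as dm

# Each attr_summarizer value selects one of the four architectures
# described above.
sif_model = dm.MatchingModel(attr_summarizer='sif')
rnn_model = dm.MatchingModel(attr_summarizer='rnn')
attention_model = dm.MatchingModel(attr_summarizer='attention')
hybrid_model = dm.MatchingModel(attr_summarizer='hybrid')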

The architecture template for DL solutions for EM

To start the modeling pipeline, I recommend working inside a Docker container because of DeepMatcher's dependency issues. Here is a Dockerfile for building a DeepMatcher image:

FROM python:3.6.10-buster

# Suppress warnings about missing front-end. As recommended at:
# http://stackoverflow.com/questions/22466255/is-it-possibe-to-answer-dialog-questions-when-installing-under-docker
ARG DEBIAN_FRONTEND=noninteractive

RUN pip install deepmatcher
RUN pip install sklearn

WORKDIR "/root"

CMD ["/bin/bash"]

For a step-by-step tutorial on running the models, the DeepMatcher GitHub repository (https://github.com/anhaidgroup/deepmatcher) provides detailed code. To run the models, you can build the Docker image above and run your Python script with the following command in the terminal:

docker run -it -v $(pwd):/root --rm deepmatcher python main.py
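
For reference, here is a minimal sketch of what a main.py could contain, loosely following the DeepMatcher tutorial; the directory layout, CSV file names, and checkpoint path are placeholders, not the exact setup used for the results below:

import deepmatcher as dm

# Process labeled CSVs of address pairs. DeepMatcher expects an id column,
# a label column, and left_/right_ prefixed attribute columns.
train, validation, test = dm.data.process(
    path='data',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

# Build one of the four variants; 'hybrid' is used here as an example.
model = dm.MatchingModel(attr_summarizer='hybrid')

# Train, keeping the checkpoint with the best validation F1.
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth')

# Evaluate on the held-out test split.
model.run_eval(test)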

Model Results

With the Docker container I was able to train on 69,386 labeled address records from across the U.S., and each model takes around 4 hours to complete. I experimented with different model parameters; for my particular training set, a batch size of 16 works best. It doesn't generalize too much, and I am not worried about overfitting because I want my models to be specific to my dataset. I am using F1 as my north-star metric here because of our imbalanced dataset: 88% of my training labels are "matches" (label 1) after I upsampled the "non-matches" (label 0).
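
As a rough sketch of the kind of upsampling described above (the file names and the 88/12 target ratio are my assumptions for illustration, not the exact preprocessing used here), the minority class can be resampled with pandas before the training CSV is written:

import pandas as pd

# Labeled address pairs with a binary label column (1 = match, 0 = non-match).
pairs = pd.read_csv('labeled_pairs.csv')
matches = pairs[pairs['label'] == 1]
non_matches = pairs[pairs['label'] == 0]

# Sample non-matches with replacement until they make up roughly 12%
# of the training set, leaving matches at about 88%.
target = int(len(matches) * 0.12 / 0.88)
upsampled = non_matches.sample(n=target, replace=True, random_state=42)

# Shuffle and write out the training file consumed by dm.data.process.
train = pd.concat([matches, upsampled]).sample(frac=1, random_state=42)
train.to_csv('train.csv', index=False)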

Model performance

The above result table demonstrates how well the models perform: all of the models have fairly high F1 scores. Now let's see what the output looks like from the Hybrid model:

Hybrid model results

DeepMatcher is able to identify rows with missing cities. We can identify matching addresses such as "Route 309" and "PA-309" fairly confidently, with a similarity score of 0.999. It also gives a low score (around 0.3) to pairs that are non-matches. A next step could be generating street addresses for the PO Boxes so the models can learn from them.
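
To obtain these similarity scores for new address pairs, the trained model's run_prediction method can be used; the sketch below assumes the trained model from the earlier main.py example, and the unlabeled file name and the 0.5 cutoff are placeholders:

import deepmatcher as dm

# Process unlabeled candidate pairs with the same schema as the training data.
unlabeled = dm.data.process_unlabeled(
    path='data/unlabeled.csv',
    trained_model=model)

# run_prediction returns a DataFrame with a match_score column in [0, 1].
predictions = model.run_prediction(unlabeled)

# Flag likely matches; the 0.5 cutoff is an arbitrary example threshold.
predictions['is_match'] = predictions['match_score'] > 0.5
print(predictions.sort_values('match_score', ascending=False).head())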

The F1 scores are similar, but if we look at the model outputs, they are quite different. For example, SIF and Hybrid perform the best out of the four models, but SIF tends to output a lower score for non-matches, while Hybrid still assigns relatively high match scores to some non-matches. If the business goal is to catch non-matches rather than matches, SIF is the more suitable model. Even though the training scores are similar, it is important to choose the model based on business needs.

SIF & Hybrid model results comparison
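
One way to make this comparison concrete is to score the same labeled pairs with both models and look at where their match scores diverge on non-matches; here is a rough sketch assuming the two models' predictions were saved to CSV (the file and column names are assumptions):

import pandas as pd

# Predictions from the SIF and Hybrid models on the same pairs.
sif = pd.read_csv('sif_predictions.csv', index_col='id')
hybrid = pd.read_csv('hybrid_predictions.csv', index_col='id')

comparison = pd.DataFrame({
    'sif_score': sif['match_score'],
    'hybrid_score': hybrid['match_score'],
    'label': sif['label'],
})

# Labeled non-matches where the Hybrid model is most over-confident.
non_matches = comparison[comparison['label'] == 0]
print(non_matches.sort_values('hybrid_score', ascending=False).head())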

Conclusion

DeepMatcher is an efficient, state-of-the-art entity matching tool. When it comes to choosing between the different models, it is important to look at the output results and pick the one that suits the specific business needs. My models are overfit to my dataset, which is why the F1 scores are particularly high; that is intentional here, since my goal is for the models to learn the database so that when the same client addresses show up again, the models can recall what they learned during training. The model parameters should be adjusted based on business needs.

Thanks to anhaidgroup for providing the tool for entity matching problems.

Reference:

Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., . . . Raghavendra, V. (2018). Deep Learning for Entity Matching. Proceedings of the 2018 International Conference on Management of Data — SIGMOD ’18. doi:10.1145/3183713.3196926 (http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf)
