Tools for entity resolution

Nanda Anzana
7 min readApr 27, 2023

--

There are (five) chapters related to entity resolution topics. This is chapter 4, you can check out our previous chapters in case you missed it.

After cleaning our data in previous chapter, in this chapter and next chapter we will get our hand dirty and solve Entity Resolution (ER) in practice. Warning! this chapter will require you to have some experience in intermediate python programming language and intermediate sql query.

An entity (person) usually has several PII (Personal Identifiable Information), for example, we can have several emails, several phone numbers and even several addresses. This information is usually formed in an ERD (Entity Relationship Diagram) table which has a many to many relationship from one entity to another. If you have several related entities, then the complexity of these tables will grow exponentially, and it is very difficult to read them. Imagine we want to do an analysis to find out on average one account has how many addresses, and one address has how many accounts.

╔═════════╦════════════════════════════════════════════╦═══════════╗
║ Account ║ Address ║ Is Active ║
╠═════════╬════════════════════════════════════════════╬═══════════╣
║ A ║ Jalan Sudirman Kav 6 ║ TRUE ║
║ A ║ Jalan Sunter Raya nomor 12 ║ TRUE ║
║ A ║ Jalan Cempaka Raya nomor 45 ║ FALSE ║
║ B ║ Jalan Sunter Raya nomor 12 ║ TRUE ║
║ B ║ Jalan Harapan Mulya nomor 14 RT 007 RW 019 ║ TRUE ║
║ C ║ Jalan Harapan Mulya nomor 14 RT 007 RW 019 ║ TRUE ║
╚═════════╩════════════════════════════════════════════╩═══════════╝

From the example table above, we can see that account A and account B have the same address, account A and account B can be owned by the same person. Furthermore, we can see that B and C also have the same address, using the same logic A and C’s accounts should also be owned by the same person. Assuming that this data is stored in an SQL database owned by the company, we can see that on average one account has how many addresses, and one address has how many accounts in two different SQL queries:

-- Average address for each account
WITH each_account_addresses AS (
SELECT account, count(distinct address) AS count FROM ref_address
GROUP BY account)
SELECT AVG(count) FROM each_account_addresses

-- Average account for each address
WITH each_address_accounts AS (
SELECT address, count(distinct account) AS count FROM ref_address
GROUP BY address)
SELECT AVG(count) FROM each_address_accounts

Then how can we make accounts A, B and C be in one row and we act as the same entity? For accounts A and B (or B and C) we can easily do this by using GROUP BY, but how can A and C be treated the same way? We can do a self join and do the aggregation from there.

We can see that the query is very complex just to extract information that accounts A and C have an indirect relationship through account B. This is a perfect example of why we need the right tools to perform entity resolution. Relational databases and table data types with rows and columns are not very suitable for performing entity resolution because it will add to the complexity of the existing analysis (not to mention the processing costs required for performing self joins). Then what is the right way to represent data to perform entity resolution?

Graph will rescue you!

Graphs can be used to model many types of information. For instance, in a social network, individual people can be represented as nodes, and the relationships between individuals could be represented as a graph. In an electronic circuit, the switches that connect wires and components could be represented as nodes and the wires connecting them would form the edges. And in an internet search engine like Google, each website is a node and hyperlinks connecting websites form the edges of graph.

Graph structure

How can we create this graph structure and do entity resolution? In this chapter we will use networkX (python implementation), and Gephi for visualization and presentation. We will translate all data that we have and transform their structure to graph. NetworkX have ability to automate detect csv file and transform them into graph, after that we will save that graph file into .graphml format to be able read from Gephi.

Tools for entity resolution

NetworkX

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. It provides an integrated set of fundamental tools for the construction, visualization, and analysis of the most common types of networks found in empirical applications. Intended for both students and researchers in mathematics, computer science, and the applied sciences, it offers modularity and extensibility to accommodate developments in the field.

Gephi

Gephi is an excellent tool for data analysts and data science enthusiasts to explore and understand graphs. Gephi comes with a pre-built layout where the user can adjust shapes, sizes, and colors to discover hidden patterns in raw edge and node data. The main goal is to enable the user to make a hypothesis, discover hidden patterns, isolate structure singularities and defects during data sourcing.

First of all, we need to determine what graph we will construct. There are 2 types of graph by their direction: directed graph, and undirected graph. We know if we talk about PII relation to entity, the direction doesnt matter (e.g. if we know the email belong some user, then it doesnt matter the direction of that relation. On the contrary, if we want to analyse the money flowing from one user to another, then the direction matters). Assume we have csv like table above and below:

╔═════════╦══════════════════╦════════╗
║ Account ║ Email ║ Phone ║
╠═════════╬══════════════════╬════════╣
║ A ║ user_A@gmail.com ║ 62887 ║
║ B ║ user_B@gmail.com ║ 62887 ║
║ C ║ user_C@gmail.com ║ 62888 ║
║ D ║ user_D@gmai.com ║ 628889 ║
╚═════════╩══════════════════╩════════╝

we can read all the csv using:

import networkx as nx
import pandas as pd
df_address = pd.read_csv("url_address") # Read csv address
df_address = df_address.loc[df_address.is_active == True] # Filter only active address
df_email = pd.read_csv("url_email") # Read csv address
edges_data_address = list(df_address.loc[:,["account", "address"]].itertuples(index=False, name=None))
edges_data_email = list(df_email.loc[:,["account", "email"]].itertuples(index=False, name=None))
edges_data_phone = list(df_email.loc[:,["account", "phone"]].itertuples(index=False, name=None))
G = nx.Graph() # Initialize undirected graph
G.add_edges_from(edges_data_address) # Create nodes and relationship simultaniously
G.add_edges_from(edges_data_email) # Create nodes and relationship simultaniously
G.add_edges_from(edges_data_phone) # Create nodes and relationship simultaniously
nx.write_graphml(G, "entity_resolution.graphml") # Store graph file

We already make our graph. But what are we doing exactly? In above example we assume we has 2 csv consist of data of addresses (many to many relationship) and data from our user database (one to one). Line 4–7 is where we read all of the data necessary to create our graph. We want to use all of the information related to account and do entity resolution in user level. In line 9–11 we transform our data into list of tuple edges which consist only of source and target. This will save us a lot of time because networkX has the ability to detect wether the nodes already created or not. In line 13 we initialize our undirected graph. After that in line 14–16, we tell our graph to consume our list of tuple edges for each entity, and in line 18 we stored our graph into .graphml format for later use. So far, we only talk about our abstract concept of graph, so why dont we have visualization of our already constructed graph?

from pyvis import network as net
g = net.Network(height='400px', width='100%',notebook=True) # Initialize Network classes
g.from_nx(G) # Read graph from networkX representation
g.show('entity.html') # Show graph
Our graph

From figure above, we can see that account A, B, and C are in the same group, and account D become isolated node. How can we treat this graph into simple representation between one account to another account? The way we construct our graph, make our graph into some special case of a graph structure which is 4-partite, or 2-partite graph point of view. What is k-partite graph is beyond scope of this article, but we only use the graph properties to construct our analysis. K-partite network in networkX has method called projected graph which can connect our entity if, their share same common neighbor of another entity (e.g. if account A share address with account B, we can create edge directly from account A to account B).

from networkx.algorithms import bipartite
accounts = set(df_email.account) # Create set of account POV
account_graph = bipartite.projected_graph(G, accounts) # Projected graph into account POV
g_account = net.Network(height='400px', width='100%',notebook=True) # Initialize Network classes
g_account.from_nx(account_graph) # Read graph from networkX representation
g_account.show('account.html') # Show graph
Account representation of graph

From figure above we can see 2 cluster of user, but how can we do this without visualization and make this automate for another set of data?

components = list(nx.connected_components(n_graph)) # Detect connected components
data = [[index, list(x)] for index, x in enumerate(components)] # Create data for pandas with respective id (index)
df_user = pd.DataFrame(data=data, columns=["id", "accounts"]) # Create pandas dataframe
╔════╦══════════╗
║ Id ║ Accounts ║
╠════╬══════════╣
║ 0 ║ [A,B,C] ║
║ 1 ║ [D] ║
╚════╩══════════╝

Congratulations! we already did our first entity resolution. From 4 data of an accounts, we can reduct their data and find common entity shared between accounts. Now we have more robust point of view of our user, and can make optimization in our business.

--

--