Tools for big data entity resolution

Nanda Anzana
8 min read · Apr 27, 2023


This is the last of five chapters on entity resolution topics. You can check out the previous chapters in case you missed them.

Entity resolution on thousands or even hundreds of thousands of records can still be done easily in Excel or another tabular analysis tool. If we want to be more advanced, we can use Python to create entity resolution rules. The problem arises when we have millions, or even hundreds of millions, of records. How do we perform entity resolution when the data we have is that large, in other words big data?

One way to process big data is to use a cloud platform. With a cloud platform we can analyze more efficiently because we only pay for the resources we use, so we don't need to spend budget on resources that would never be fully utilized. One cloud platform we can use to try entity resolution is GCP (Google Cloud Platform).

Now that we know the platform is GCP, let's discuss which software is suitable for the entity resolution task. As we have seen, entity resolution is very closely tied to the relationships between PII. We will use these PII relationships later as the foundation for the predictions we make in entity resolution. As discussed in the previous chapter, we can treat PII entities as subjects (nodes) and connect these nodes with lines, which we call edges. As a refresher, see the image below.

Graph structure

The data structure shown in the image above is a graph. A graph is a structure consisting of a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices (also called nodes or points), and each related pair of vertices is called an edge (also called a link or line). To analyze or transform ordinary datasets into a graph on GCP, we can use a component of the Spark framework, such as GraphX, or a dedicated graph database (Neo4j or TigerGraph). There are pros and cons to each method.
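As a minimal illustration of this structure, the same idea can be written down as plain vertex and edge lists. This is only a sketch; the values are made-up examples in the spirit of the previous chapters.

# Minimal illustration of a graph: two user IDs that share one email address
# form three vertices (nodes) and two edges (HAS relationships).
vertices = ["ABC12", "ABC13", "abc12@gmail.com"]
edges = [
    ("ABC12", "abc12@gmail.com"),  # user ABC12 HAS this email
    ("ABC13", "abc12@gmail.com"),  # user ABC13 HAS the same email
]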

Big data tools for graph analysis

GraphX

According to this documentation page, GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. GraphX runs together with the Spark framework, which we can run on one of the GCP services, Dataproc. By using Spark, which is already integrated with GCP, we can more easily connect to a data warehouse or a relational database and transform that data into a graph structure. But for every analysis that is run, Spark needs to read all the required data and do the transformation on the fly, because Spark is fundamentally a framework for transformations, not for storing data (see the sketch after the lists below). Overall, the benefits are:

  • No need to learn a new language
  • Already connected to the GCP system
  • Suitable for big data processing
  • Parallel computing optimization

Drawbacks:

  • Need to initialize a Dataproc cluster
  • Reads and rebuilds the data structure for every analysis/transformation
  • Only a subset of graph algorithms is implemented
  • Not yet memory-efficient at allocating the graph data structure
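Since GraphX itself only exposes a Scala API, the closest way to try this workflow from Python on Dataproc is GraphFrames, its DataFrame-based counterpart. The sketch below is illustrative only: it assumes the graphframes package is installed on the cluster, and the node values and checkpoint path are placeholders.

# Hedged sketch: build a small property graph with GraphFrames and find
# connected components (the same role WCC plays in Neo4j later on).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("entity-resolution").getOrCreate()
# connectedComponents() requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("ABC12", "USER"), ("ABC13", "USER"), ("abc12@gmail.com", "EMAIL")],
    ["id", "label"],
)
edges = spark.createDataFrame(
    [("ABC12", "abc12@gmail.com", "HAS"), ("ABC13", "abc12@gmail.com", "HAS")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)
g.connectedComponents().show()  # every row gets a "component" id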

Neo4j

One of the most popular graph databases is Neo4j. Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is available in a GPL3-licensed open-source "community edition", while the online backup and high availability extensions are licensed under closed-source commercial terms. Within the GCP ecosystem itself, Neo4j Enterprise can be found on the marketplace. There are several things to consider when choosing between the community edition and the enterprise edition:

  • Parallel processing limitations
  • Security
  • Data management
  • Scale and availability
  • Monitoring and management

If we want to use Neo4j for a proof-of-concept analysis, we can run the Neo4j community edition on a Compute Engine instance. Neo4j provides an application image that can be deployed easily using Docker (a quick connectivity check follows the lists below). Neo4j is the right choice if the analysis will later be run repeatedly. Apart from being dedicated storage for graph-shaped data, Neo4j also provides graph visualization, so the analysis carried out can be more comprehensive. Because it is a dedicated database, Neo4j, like other dedicated graph databases, uses its own query language. Even though its keywords are similar to common SQL, query construction is slightly different and may need some adjustment. Overall, the benefits are:

  • A dedicated database for graphs
  • Abstractions and transformations are handled by Neo4j
  • Almost all graph algorithms are available
  • Efficient memory allocation and caching
  • Graph visualization
  • API integration

Drawbacks:

  • Learn a new query language
  • Not yet integrated with the GCP system
  • Manage your own services
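As a quick sanity check before committing to either edition, you can verify that a freshly deployed instance is reachable from Python. The sketch below uses the py2neo package (introduced later in this chapter); the host and credentials are placeholders for your own deployment.

# Hedged sketch: verify connectivity to a Neo4j instance (e.g. the community
# edition running in Docker on a Compute Engine VM, Bolt on port 7687).
from py2neo import Graph

graph = Graph("bolt://<vm-external-ip>:7687", auth=("neo4j", "<password>"))
print(graph.run("RETURN 1 AS ok").evaluate())  # prints 1 if the DB responds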

In this chapter, we will assume the data is processed using Neo4j so that the solution can later scale. For how to deploy Neo4j on Compute Engine, you can click here. Several steps are involved, from translating the data to canonicalizing IDs that define one person. The steps taken:

  1. Define subjects (nodes) and relationships (edges)
  2. Load data from RDBMS or Data Warehouse platform into Neo4j
  3. Entity resolution in Neo4j
  4. Extract data from Neo4j

Define subjects (nodes) and relationships (edges)

Imagine we have a dataset like the table below. How do we define the subjects and the relationships between them?

╔═════════╦═════════════════╦══════════════╦═══════════════════╗
║ User ID ║ Email           ║ Phone number ║ Create time       ║
╠═════════╬═════════════════╬══════════════╬═══════════════════╣
║ ABC12   ║ abc12@gmail.com ║ 628123       ║ 2023-1-1 10:23:23 ║
║ ABC13   ║ abc12@gmail.com ║ 628234       ║ 2023-1-1 13:54:20 ║
║ ABC14   ║ abc14@gmail.com ║ 628345       ║ 2023-1-2 11:24:15 ║
╚═════════╩═════════════════╩══════════════╩═══════════════════╝

We can define the User ID, Email, and Phone number columns as subjects (nodes), while Create time can become one of the properties on those nodes. As we can see above, the columns are related in that the data is declared by the user, so we can give them a HAS relationship (the name is chosen to be self-explanatory). For more details, see the image below.

User relationship to email

The representation above is only one way of translating the data we have; there are many other ways of translating the dataset above according to our respective business needs.

Load data from RDBMS or Data Warehouse platform into Neo4j

Because there is no direct connector from an RDBMS or data warehouse to Neo4j, the most effective way is to use CSV. Data from the RDBMS is transformed (joins, filters, etc.) and then dumped into a CSV that can be accessed via a public link, or dumped directly into the Compute Engine folder used for serving Neo4j. After the CSV has been created, we can trigger the Neo4j API to read the file and use a custom Cypher query to create nodes and edges.

Transformation process

Python can serve as the connector to perform the transformation and ingestion end to end. For connectors to the sources, we can use various packages such as psycopg2, SQLAlchemy, or the Google Cloud SDK. For the connector to Neo4j, we use py2neo because of its easy syntax and methods. Assume the data looks like the table below.

╔══════════════╦═════════════════╦════════════╗
║ source_value ║ target_value    ║ id         ║
╠══════════════╬═════════════════╬════════════╣
║ ABC12        ║ abc12@gmail.com ║ abc12abc12 ║
║ ABC13        ║ abc12@gmail.com ║ abc13abc13 ║
╚══════════════╩═════════════════╩════════════╝

We can use a custom query to create nodes and relationships:

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "{{SOURCE_URL}}" AS row
MERGE (s:USER {id: row.source_value})
MERGE (t:EMAIL {id: row.target_value})
MERGE (s)-[r:HAS {id: row.id}]->(t)
ON CREATE SET r.created_time = timestamp();
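A hedged end-to-end sketch of this step from Python, assuming a PostgreSQL source, pandas for the CSV dump, and a CSV placed in Neo4j's default import folder; all connection strings, table and column names, and paths are placeholders:

# Hedged sketch: dump the transformed rows to CSV, then trigger the load
# query via py2neo.
import pandas as pd
from sqlalchemy import create_engine
from py2neo import Graph

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/app")
df = pd.read_sql(
    "SELECT user_id AS source_value, email AS target_value, "
    "       user_id || email AS id FROM users",
    engine,
)
df.to_csv("/var/lib/neo4j/import/user_email.csv", index=False)

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
# Optional addition (not in the original chapter): uniqueness constraints
# keep the MERGEs fast. Syntax shown is for Neo4j 4.x.
graph.run("CREATE CONSTRAINT user_id IF NOT EXISTS ON (u:USER) ASSERT u.id IS UNIQUE")
graph.run("CREATE CONSTRAINT email_id IF NOT EXISTS ON (e:EMAIL) ASSERT e.id IS UNIQUE")

# USING PERIODIC COMMIT is omitted in this sketch for simplicity; in Neo4j it
# must run in an auto-commit transaction.
graph.run("""
LOAD CSV WITH HEADERS FROM 'file:///user_email.csv' AS row
MERGE (s:USER {id: row.source_value})
MERGE (t:EMAIL {id: row.target_value})
MERGE (s)-[r:HAS {id: row.id}]->(t)
ON CREATE SET r.created_time = timestamp()
""")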

Entity resolution in Neo4j

In this record linkage step, we will focus on creating relationships between USER nodes that share the same PII, for example an EMAIL node. So we provide a new relationship, HAS_SHARED_ATTRIBUTES, from USER ABC12 to ABC13 as shown below.

Record linkage of user

For the Cypher query, we can use:

MATCH (s1:USER)-[:HAS]->(e:EMAIL)<-[:HAS]-(s2:USER)
WHERE id(s1) < id(s2) // deduplicate the (s1, s2) / (s2, s1) pairs
MERGE (s1)-[r:HAS_SHARED_ATTRIBUTES]->(s2)
SET r.shared = CASE
  WHEN r.shared IS NULL THEN ['email']
  WHEN NOT 'email' IN r.shared THEN r.shared + 'email'
  ELSE r.shared
END
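The query above only links USERs that share the same email. If another PII is used as a connector, the same query needs adjusting to that PII's node label and attribute name. A hedged Python sketch that templates it per PII type (the EMAIL/PHONE labels and attribute names are assumptions based on this chapter's model):

# Hedged sketch: apply the shared-attribute linkage for several PII types.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Node labels cannot be query parameters in Cypher, so they are interpolated.
LINK_TEMPLATE = """
MATCH (s1:USER)-[:HAS]->(:{label})<-[:HAS]-(s2:USER)
WHERE id(s1) < id(s2)
MERGE (s1)-[r:HAS_SHARED_ATTRIBUTES]->(s2)
SET r.shared = CASE
  WHEN r.shared IS NULL THEN ['{attr}']
  WHEN NOT '{attr}' IN r.shared THEN r.shared + '{attr}'
  ELSE r.shared
END
"""

for label, attr in [("EMAIL", "email"), ("PHONE", "phone")]:
    graph.run(LINK_TEMPLATE.format(label=label, attr=attr))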

After providing a relationship for each user ID that shares attributes, we will perform canonicalization. Canonicalization is done by first giving interconnected nodes the same ID property. As the first step for canonicalization in Neo4j, we need to create a projection of the graph:

CALL gds.graph.project('entityResolutionGraph', 'USER', 'HAS_SHARED_ATTRIBUTES');
CALL gds.wcc.write('entityResolutionGraph', { writeProperty: 'PERSON' })
YIELD nodePropertiesWritten, componentCount;
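The same two calls can also be driven from Python. A hedged sketch, with one addition that is not in the original: dropping the in-memory projection once the component ids have been written, to free memory.

# Hedged sketch: project the graph, write WCC component ids, then clean up.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
graph.run("CALL gds.graph.project('entityResolutionGraph', 'USER', 'HAS_SHARED_ATTRIBUTES')")
stats = graph.run(
    "CALL gds.wcc.write('entityResolutionGraph', { writeProperty: 'PERSON' }) "
    "YIELD nodePropertiesWritten, componentCount"
).data()
print(stats)  # how many properties were written, and how many components exist
graph.run("CALL gds.graph.drop('entityResolutionGraph')")  # free the projection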

With the projection and WCC write above, every connected node now carries the same PERSON property. After assigning the property, we can create a new PERSON node that takes over all the edges of the underlying USER nodes, switching everything to PERSON.

MATCH (n:USER) WHERE n.PERSON = $PERSON
WITH collect(n) AS toclone
CALL apoc.refactor.cloneNodes(toclone, true)
YIELD output AS cloned
WITH collect(cloned) AS clones
CALL apoc.refactor.mergeNodes(clones, { properties: "combine", mergeRels: true })
YIELD node AS merge_result
MATCH (n:USER) WHERE n.PERSON = $PERSON AND n <> merge_result
MERGE (n)-[:IDENTIFIED_AS]->(merge_result)
WITH merge_result, n, labels(merge_result) AS old_labels
SET merge_result:PERSON
WITH merge_result, n, old_labels
CALL apoc.create.removeLabels(merge_result, old_labels) YIELD node
RETURN true

Make sure the APOC plugin is enabled in your Neo4j instance.
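Since the query is parameterized by $PERSON, it has to be run once per connected component. A hedged Python sketch of that loop (connection details are placeholders, and the query string is the one shown above):

# Hedged sketch: canonicalize each component by looping over the distinct
# PERSON ids written by WCC and running the APOC query per component.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

CANONICALIZE = """
MATCH (n:USER) WHERE n.PERSON = $PERSON
WITH collect(n) AS toclone
CALL apoc.refactor.cloneNodes(toclone, true)
YIELD output AS cloned
WITH collect(cloned) AS clones
CALL apoc.refactor.mergeNodes(clones, { properties: "combine", mergeRels: true })
YIELD node AS merge_result
MATCH (n:USER) WHERE n.PERSON = $PERSON AND n <> merge_result
MERGE (n)-[:IDENTIFIED_AS]->(merge_result)
WITH merge_result, n, labels(merge_result) AS old_labels
SET merge_result:PERSON
WITH merge_result, n, old_labels
CALL apoc.create.removeLabels(merge_result, old_labels) YIELD node
RETURN true
"""

for row in graph.run("MATCH (n:USER) RETURN DISTINCT n.PERSON AS person").data():
    graph.run(CANONICALIZE, PERSON=row["person"])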

A graph that has been canonicalized will look like the image below:

Canonicalisation of user

Extract data from Neo4j

Before extracting data from Neo4j, you can verify that there are fewer PERSON nodes than USER nodes. You can check with this Cypher query:

CALL apoc.meta.stats() YIELD labels;

Here is an example of extracting how many emails each existing PERSON node owns:

MATCH (n:PERSON)
RETURN n.PERSON, size((n)-[:HAS]->(:EMAIL)) AS count_email
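To hand these results back to analysts or a warehouse, the same query can be pulled into a pandas DataFrame. A hedged sketch; the output file name is a placeholder:

# Hedged sketch: extract per-person email counts into pandas for export.
import pandas as pd
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
df = pd.DataFrame(graph.run(
    "MATCH (n:PERSON) "
    "RETURN n.PERSON AS person, size((n)-[:HAS]->(:EMAIL)) AS count_email"
).data())
df.to_csv("person_email_counts.csv", index=False)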

Using the queries above, we can find out how many emails each person has, and the query can be developed further according to our business needs. There are many insights to be gained from the entity resolution we have built so far. Going further, we might consider using an orchestrator and an end-to-end pipeline generator for everything we have done here. Data standardization is also important if the resulting insights are to be consumed regularly, so data quality and data integrity management must be in place to keep the insights robust and reliable.
