Tools for big data entity resolution

Nanda Anzana
8 min read · Apr 27, 2023


This is the last of five chapters on entity resolution topics. You can check out the previous chapters in case you missed them.

Entity resolution on thousands or even hundreds of thousands of records can still be done easily in Excel or another tabular analysis tool. If we want to be more advanced, we can use Python to create entity resolution rules. The problem arises when we have millions, or even hundreds of millions, of records. How do we perform entity resolution when the data we have is that large, in other words big data?

One way to process big data is to use a cloud platform. With a cloud platform we can analyze more efficiently because we only pay for the resources we use, so we don't need to spend budget on resources that would never be fully utilized. One cloud platform we can use to try entity resolution is GCP (Google Cloud Platform).

Now that we know the platform is GCP, let's discuss which software is suitable for the entity resolution task. As we have seen, entity resolution is very closely tied to the relationships between PII. We will use these PII relationships later as the foundation for the predictions we make in entity resolution. As discussed in the previous chapter, we can treat PII entities as subjects (nodes) and connect these nodes with lines, which we call edges. As a refresher, see the image below.

Graph structure

The data structure shown in the image above is a graph. A graph is a structure consisting of a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices (also called nodes or points), and each related pair of vertices is called an edge (also called a link or line). To analyze or transform ordinary datasets into a graph on GCP, we can use a component of the Spark framework, such as GraphX, or a dedicated graph database (Neo4j or TigerGraph). There are pros and cons to each method.
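As a minimal illustration of this structure, the same idea can be written down as plain vertex and edge lists. This is only a sketch; the values are made-up examples in the spirit of the previous chapters.

# Minimal illustration of a graph: two user IDs that share one email address
# form three vertices (nodes) and two edges (HAS relationships).
vertices = ["ABC12", "ABC13", "abc12@gmail.com"]
edges = [
    ("ABC12", "abc12@gmail.com"),  # user ABC12 HAS this email
    ("ABC13", "abc12@gmail.com"),  # user ABC13 HAS the same email
]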

Big data tools for graph analysis

GraphX

According to this documentation page, GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. GraphX runs together with the Spark framework, which we can run on one of the GCP services, Dataproc. By using Spark, which is already integrated with GCP, we can more easily connect to a data warehouse or a relational database and transform that data into a graph structure. But for every analysis that is run, Spark needs to read all the required data and do the transformation on the fly, because Spark is fundamentally a framework for transformations, not for storing data (see the sketch after the lists below). Overall, the benefits are:

  • No need to learn a new language
  • Already connected to the GCP system
  • Suitable for big data processing
  • Parallel computing optimization

Drawbacks:

  • Need to initialize a Dataproc cluster
  • Reads and rebuilds the data structure for every analysis/transformation
  • Only a subset of graph algorithms is implemented
  • Not yet memory-efficient at allocating the graph data structure
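Since GraphX itself only exposes a Scala API, the closest way to try this workflow from Python on Dataproc is GraphFrames, its DataFrame-based counterpart. The sketch below is illustrative only: it assumes the graphframes package is installed on the cluster, and the node values and checkpoint path are placeholders.

# Hedged sketch: build a small property graph with GraphFrames and find
# connected components (the same role WCC plays in Neo4j later on).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("entity-resolution").getOrCreate()
# connectedComponents() requires a checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("ABC12", "USER"), ("ABC13", "USER"), ("abc12@gmail.com", "EMAIL")],
    ["id", "label"],
)
edges = spark.createDataFrame(
    [("ABC12", "abc12@gmail.com", "HAS"), ("ABC13", "abc12@gmail.com", "HAS")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)
g.connectedComponents().show()  # every row gets a "component" id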

Neo4j

One of the most popular graph databases is Neo4j. Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is available in a GPL3-licensed open-source "community edition", while the online backup and high availability extensions are licensed under closed-source commercial terms. Within the GCP ecosystem itself, Neo4j Enterprise can be found on the marketplace. There are several things to consider when choosing between the community edition and the enterprise edition:

  • Parallel processing limitations
  • Security
  • Data management
  • Scale and availability
  • Monitoring and management

If we want to use Neo4j for a proof-of-concept analysis, we can run the Neo4j community edition on a Compute Engine instance. Neo4j provides an application image that can be deployed easily using Docker (a quick connectivity check follows the lists below). Neo4j is the right choice if the analysis will later be run repeatedly. Apart from being dedicated storage for graph-shaped data, Neo4j also provides graph visualization, so the analysis carried out can be more comprehensive. Because it is a dedicated database, Neo4j, like other dedicated graph databases, uses its own query language. Even though its keywords are similar to common SQL, query construction is slightly different and may need some adjustment. Overall, the benefits are:

  • A dedicated database for graphs
  • Abstractions and transformations are handled by Neo4j
  • Almost all graph algorithms are available
  • Efficient memory allocation and caching
  • Graph visualization
  • API integration

Drawbacks:

  • Learn a new query language
  • Not yet integrated with the GCP system
  • Manage your own services
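As a quick sanity check before committing to either edition, you can verify that a freshly deployed instance is reachable from Python. The sketch below uses the py2neo package (introduced later in this chapter); the host and credentials are placeholders for your own deployment.

# Hedged sketch: verify connectivity to a Neo4j instance (e.g. the community
# edition running in Docker on a Compute Engine VM, Bolt on port 7687).
from py2neo import Graph

graph = Graph("bolt://<vm-external-ip>:7687", auth=("neo4j", "<password>"))
print(graph.run("RETURN 1 AS ok").evaluate())  # prints 1 if the DB responds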

In this chapter, we will assume the data is processed using Neo4j so that the solution can later scale. For how to deploy Neo4j on Compute Engine, you can click here. Several steps are involved, from translating the data to canonicalizing IDs that define one person. The steps taken:

  1. Define subjects (nodes) and relationships (edges)
  2. Load data from RDBMS or Data Warehouse platform into Neo4j
  3. Entity resolution in Neo4j
  4. Extract data from Neo4j

Define subjects (nodes) and relationships (edges)

Imagine we have a dataset like the table below. How do we define the subjects and the relationships between them?

╔═════════╦═════════════════╦══════════════╦═══════════════════╗
║ User ID ║ Email           ║ Phone number ║ Create time       ║
╠═════════╬═════════════════╬══════════════╬═══════════════════╣
║ ABC12   ║ abc12@gmail.com ║ 628123       ║ 2023-1-1 10:23:23 ║
║ ABC13   ║ abc12@gmail.com ║ 628234       ║ 2023-1-1 13:54:20 ║
║ ABC14   ║ abc14@gmail.com ║ 628345       ║ 2023-1-2 11:24:15 ║
╚═════════╩═════════════════╩══════════════╩═══════════════════╝

We can define the User ID, Email, and Phone number columns as subjects (nodes), while Create time can become one of the properties on those nodes. As we can see above, the columns are related in that the data is declared by the user, so we can give them a HAS relationship (the name is chosen to be self-explanatory). For more details, see the image below.

User relationship to email

The representation above is only one way of translating the data we have; there are many other ways of translating the dataset above according to our respective business needs.

Load data from RDBMS or Data Warehouse platform into Neo4j

Because there is no direct connector from an RDBMS or data warehouse to Neo4j, the most effective way is to use CSV. Data from the RDBMS is transformed (joins, filters, etc.) and then dumped into a CSV that can be accessed via a public link, or dumped directly into the Compute Engine folder used for serving Neo4j. After the CSV has been created, we can trigger the Neo4j API to read the file and use a custom Cypher query to create nodes and edges.

Transformation process

Python can serve as the connector to perform the transformation and ingestion end to end. For connectors to the sources, we can use various packages such as psycopg2, SQLAlchemy, or the Google Cloud SDK. For the connector to Neo4j, we use py2neo because of its easy syntax and methods. Assume the data looks like the table below.

╔══════════════╦═════════════════╦════════════╗
║ source_value ║ target_value    ║ id         ║
╠══════════════╬═════════════════╬════════════╣
║ ABC12        ║ abc12@gmail.com ║ abc12abc12 ║
║ ABC13        ║ abc12@gmail.com ║ abc13abc13 ║
╚══════════════╩═════════════════╩════════════╝

We can use a custom query to create nodes and relationships:

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "{{SOURCE_URL}}" AS row
MERGE (s:USER {id: row.source_value})
MERGE (t:EMAIL {id: row.target_value})
MERGE (s)-[r:HAS {id: row.id}]->(t)
ON CREATE SET r.created_time = timestamp();
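A hedged end-to-end sketch of this step from Python, assuming a PostgreSQL source, pandas for the CSV dump, and a CSV placed in Neo4j's default import folder; all connection strings, table and column names, and paths are placeholders:

# Hedged sketch: dump the transformed rows to CSV, then trigger the load
# query via py2neo.
import pandas as pd
from sqlalchemy import create_engine
from py2neo import Graph

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/app")
df = pd.read_sql(
    "SELECT user_id AS source_value, email AS target_value, "
    "       user_id || email AS id FROM users",
    engine,
)
df.to_csv("/var/lib/neo4j/import/user_email.csv", index=False)

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
# Optional addition (not in the original chapter): uniqueness constraints
# keep the MERGEs fast. Syntax shown is for Neo4j 4.x.
graph.run("CREATE CONSTRAINT user_id IF NOT EXISTS ON (u:USER) ASSERT u.id IS UNIQUE")
graph.run("CREATE CONSTRAINT email_id IF NOT EXISTS ON (e:EMAIL) ASSERT e.id IS UNIQUE")

# USING PERIODIC COMMIT is omitted in this sketch for simplicity; in Neo4j it
# must run in an auto-commit transaction.
graph.run("""
LOAD CSV WITH HEADERS FROM 'file:///user_email.csv' AS row
MERGE (s:USER {id: row.source_value})
MERGE (t:EMAIL {id: row.target_value})
MERGE (s)-[r:HAS {id: row.id}]->(t)
ON CREATE SET r.created_time = timestamp()
""")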

Entity resolution in Neo4j

In this record linkage step, we will focus on creating relationships between USER nodes that share the same PII, for example an EMAIL node. So we provide a new relationship, HAS_SHARED_ATTRIBUTES, from USER ABC12 to ABC13 as shown below.

Record linkage of user

For the Cypher query, we can use:

MATCH (s1:USER)-[:HAS]->(e:EMAIL)<-[:HAS]-(s2:USER)
WHERE id(s1) < id(s2) // deduplicate the (s1, s2) / (s2, s1) pairs
MERGE (s1)-[r:HAS_SHARED_ATTRIBUTES]->(s2)
SET r.shared = CASE
  WHEN r.shared IS NULL THEN ['email']
  WHEN NOT 'email' IN r.shared THEN r.shared + 'email'
  ELSE r.shared
END
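The query above only links USERs that share the same email. If another PII is used as a connector, the same query needs adjusting to that PII's node label and attribute name. A hedged Python sketch that templates it per PII type (the EMAIL/PHONE labels and attribute names are assumptions based on this chapter's model):

# Hedged sketch: apply the shared-attribute linkage for several PII types.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Node labels cannot be query parameters in Cypher, so they are interpolated.
LINK_TEMPLATE = """
MATCH (s1:USER)-[:HAS]->(:{label})<-[:HAS]-(s2:USER)
WHERE id(s1) < id(s2)
MERGE (s1)-[r:HAS_SHARED_ATTRIBUTES]->(s2)
SET r.shared = CASE
  WHEN r.shared IS NULL THEN ['{attr}']
  WHEN NOT '{attr}' IN r.shared THEN r.shared + '{attr}'
  ELSE r.shared
END
"""

for label, attr in [("EMAIL", "email"), ("PHONE", "phone")]:
    graph.run(LINK_TEMPLATE.format(label=label, attr=attr))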

After providing a relationship for each user ID that shares attributes, we will perform canonicalization. Canonicalization is done by first giving interconnected nodes the same ID property. As the first step for canonicalization in Neo4j, we need to create a projection of the graph:

CALL gds.graph.project('entityResolutionGraph', 'USER', 'HAS_SHARED_ATTRIBUTES');
CALL gds.wcc.write('entityResolutionGraph', { writeProperty: 'PERSON' })
YIELD nodePropertiesWritten, componentCount;
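The same two calls can also be driven from Python. A hedged sketch, with one addition that is not in the original: dropping the in-memory projection once the component ids have been written, to free memory.

# Hedged sketch: project the graph, write WCC component ids, then clean up.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
graph.run("CALL gds.graph.project('entityResolutionGraph', 'USER', 'HAS_SHARED_ATTRIBUTES')")
stats = graph.run(
    "CALL gds.wcc.write('entityResolutionGraph', { writeProperty: 'PERSON' }) "
    "YIELD nodePropertiesWritten, componentCount"
).data()
print(stats)  # how many properties were written, and how many components exist
graph.run("CALL gds.graph.drop('entityResolutionGraph')")  # free the projection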

With the projection and WCC write above, every connected node now carries the same PERSON property. After assigning the property, we can create a new PERSON node that takes over all the edges of the underlying USER nodes, switching everything to PERSON.

MATCH (n:USER) WHERE n.PERSON = $PERSON
WITH collect(n) AS toclone
CALL apoc.refactor.cloneNodes(toclone, true)
YIELD output AS cloned
WITH collect(cloned) AS clones
CALL apoc.refactor.mergeNodes(clones, { properties: "combine", mergeRels: true })
YIELD node AS merge_result
MATCH (n:USER) WHERE n.PERSON = $PERSON AND n <> merge_result
MERGE (n)-[:IDENTIFIED_AS]->(merge_result)
WITH merge_result, n, labels(merge_result) AS old_labels
SET merge_result:PERSON
WITH merge_result, n, old_labels
CALL apoc.create.removeLabels(merge_result, old_labels) YIELD node
RETURN true

Make sure the APOC plugin is enabled in your Neo4j instance.
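Since the query is parameterized by $PERSON, it has to be run once per connected component. A hedged Python sketch of that loop (connection details are placeholders, and the query string is the one shown above):

# Hedged sketch: canonicalize each component by looping over the distinct
# PERSON ids written by WCC and running the APOC query per component.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

CANONICALIZE = """
MATCH (n:USER) WHERE n.PERSON = $PERSON
WITH collect(n) AS toclone
CALL apoc.refactor.cloneNodes(toclone, true)
YIELD output AS cloned
WITH collect(cloned) AS clones
CALL apoc.refactor.mergeNodes(clones, { properties: "combine", mergeRels: true })
YIELD node AS merge_result
MATCH (n:USER) WHERE n.PERSON = $PERSON AND n <> merge_result
MERGE (n)-[:IDENTIFIED_AS]->(merge_result)
WITH merge_result, n, labels(merge_result) AS old_labels
SET merge_result:PERSON
WITH merge_result, n, old_labels
CALL apoc.create.removeLabels(merge_result, old_labels) YIELD node
RETURN true
"""

for row in graph.run("MATCH (n:USER) RETURN DISTINCT n.PERSON AS person").data():
    graph.run(CANONICALIZE, PERSON=row["person"])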

A graph that has been canonicalized will look like the image below:

Canonicalisation of user

Extract data from Neo4j

Before extracting data from Neo4j, you can verify that there are fewer PERSON nodes than USER nodes. You can check with this Cypher query:

CALL apoc.meta.stats() YIELD labels;

Here is an example of extracting how many emails each existing PERSON node owns:

MATCH (n:PERSON)
RETURN n.PERSON, size((n)-[:HAS]->(:EMAIL)) AS count_email
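To hand these results back to analysts or a warehouse, the same query can be pulled into a pandas DataFrame. A hedged sketch; the output file name is a placeholder:

# Hedged sketch: extract per-person email counts into pandas for export.
import pandas as pd
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
df = pd.DataFrame(graph.run(
    "MATCH (n:PERSON) "
    "RETURN n.PERSON AS person, size((n)-[:HAS]->(:EMAIL)) AS count_email"
).data())
df.to_csv("person_email_counts.csv", index=False)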

Using the queries above, we can find out how many emails each person has, and the query can be developed further according to our business needs. There are many insights to be gained from the entity resolution we have built so far. Going further, we might consider using an orchestrator and an end-to-end pipeline generator for everything we have done here. Data standardization is also important if the resulting insights are to be consumed regularly, so data quality and data integrity management must be in place to keep the insights robust and reliable.
