Approach to Understanding Users Better: ER Steps

Nanda Anzana
4 min read · Mar 16, 2023


There are five chapters in this series on entity resolution; this is Chapter 3. Check out Chapter 1 and Chapter 2 in case you missed them.

In this chapter, we get more hands-on with performing entity resolution using your internal data.

Let’s start with a simulated case from the fast-paced fintech industry. To broaden the scope, assume we are working at one of the top e-money players and have been given the task of increasing the number of win-back users by providing promotional offers to accounts that have been inactive for a long time. We want to maximize our budget by knowing how many “real people” we can give promos to. But you are confused: how do we identify a “real person,” and what are the proper steps to start?

With your analytical experience, you know you need precise, clean, and standardized data before you can analyze anything. So you start looking for data spread across the various platforms you have: account data in the application database, KYC data in the operational database, and even addresses stored in a NoSQL database that has not been synchronized to the data mart. You collect the data and choose which fields can serve as parameters for determining whether accounts belong to the same person. But you notice a problem: the fields you select sometimes have different names even though they carry the same context. You identify which fields have this problem and map them together (a small sketch follows the figure below).

Multiple sources of data
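
To make this concrete, here is a minimal sketch of pulling those sources into one shared schema. The source names and column names here are hypothetical; yours will differ.

```python
import pandas as pd

# Hypothetical field names: each source calls the same attribute something different
COLUMN_MAP = {
    "app_db": {"acct_id": "account_id", "msisdn": "phone", "mail": "email"},
    "kyc_db": {"customer_id": "account_id", "phone_no": "phone", "email_addr": "email"},
    "address_nosql": {"uid": "account_id", "contact_phone": "phone", "contact_email": "email"},
}

def consolidate(sources: dict) -> pd.DataFrame:
    """Rename source-specific columns to one shared schema and stack the rows."""
    frames = []
    for name, df in sources.items():
        renamed = df.rename(columns=COLUMN_MAP[name])
        renamed["source"] = name  # keep provenance for debugging later
        frames.append(renamed[["account_id", "phone", "email", "source"]])
    return pd.concat(frames, ignore_index=True)
```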

After the data is collected, there will inevitably be problems with it: many missing values, writing errors, or non-standard formats. When this happens, we must do data cleansing. Data cleansing plays a significant role because we want to avoid garbage in, garbage out as much as possible. For example, after collecting the data, we may find that not all telephone number fields actually contain telephone numbers; some contain alphabetic characters that must be cleaned out. When an admin enters a telephone number manually, the digit 0 may be mistyped as the letter O, or the digit 9 as the letter g. If two phone numbers are really the same but one carries such an error, cleaning them lets the system recognize that the two numbers match even though they are attached to two different addresses.
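
A minimal cleaning sketch for the phone-number case might look like the following; the look-alike table and the country-code handling are assumptions you would adapt to your own data.

```python
import re

# Look-alike characters an admin may mistype in place of digits
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "g": "9"})

def clean_phone(raw: str) -> str:
    """Fix look-alike letters, then keep digits only."""
    digits = re.sub(r"\D", "", raw.translate(LOOKALIKES))
    if digits.startswith("62"):  # fold the Indonesian country code into local form
        digits = "0" + digits[2:]
    return digits

print(clean_phone("+62 812-345g-67O8"))  # -> 081234596708
```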

After correcting the erroneous data, the next step is to transform the data so that it all shares the same context. Suppose we have address data; addresses are often poorly standardized and require several steps to normalize. Eliminating punctuation, fixing letter casing, and removing extra spaces go a long way toward standardizing address data. Furthermore, Indonesian addresses carry a complexity all of their own.
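
Those three steps translate almost directly into code. Here is a minimal sketch; it deliberately ignores the Indonesian-specific complexity mentioned above.

```python
import re
import string

def standardize_address(raw: str) -> str:
    """Upper-case, strip punctuation, and collapse extra whitespace."""
    upper = raw.upper()
    no_punct = upper.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()

print(standardize_address(" Jl. Sudirman No.5,  RT.3/RW.7 "))
# -> JL SUDIRMAN NO5 RT3RW7
```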

Oppna can extract your Indonesian address and will help you standardize your address data.

After the data is cleaned and standardized, we can identify similar records. Duplicates like these usually survive because the system cannot see the similarity in the raw data; once we have done the cleaning, the system can compare the records properly. We can then deduplicate records that turn out to be the same, making the later entity resolution more efficient. With all of the above done, our data is ready for entity resolution.
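
As a small illustration, once the phone numbers are cleaned, duplicates that the raw data hid become a simple group-by. The records here are made up, and the cleaning function is a simplified version of the earlier sketch.

```python
import re
import pandas as pd

def clean_phone(raw: str) -> str:
    """Fix look-alike letters, then keep digits only (simplified)."""
    return re.sub(r"\D", "", raw.translate(str.maketrans({"O": "0", "o": "0", "g": "9"})))

# Hypothetical accounts: the raw phone values look different
df = pd.DataFrame({
    "account_id": ["A1", "A2", "A3"],
    "phone": ["0812-3459-6708", "O81234596708", "0899 0000 1111"],
})

df["phone_clean"] = df["phone"].map(clean_phone)
dupes = df[df.duplicated("phone_clean", keep=False)]
print(dupes["account_id"].tolist())  # -> ['A1', 'A2']
```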

The mechanics of entity resolution are relatively straightforward. Only two things are done: record linkage and canonicalization.

Record linkage is how we connect the identifiers that define an entity. The connections can be made in several ways:

  • Deterministic approach: Deterministic matching leverages first-party data that customers have provided to unify device-level data into unique customer profiles with 100% confidence. Device-level engagement is linked only when common PII (Personally Identifiable Information) has been shared, prioritizing the accuracy of the customer profiles.
  • Probabilistic approach: Probabilistic modeling ties engagements made by a single user across multiple devices to a unified customer profile, using predictive algorithms to link information such as IP address, operating system, location, Wi-Fi network, and behavioral data to an individual at a given confidence level.

We should use the deterministic approach for data the customer declares, for example ID, phone number, and email. For anonymous data, we can use the probabilistic approach, since it defines connections between identifiers more flexibly and classifies them more reliably. It is also possible to combine both methods for data that falls in the gray area.
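
Here is a rough sketch of how the two approaches might sit side by side in code. The field names, the similarity measure (difflib's SequenceMatcher), and the 0.85 threshold are all illustrative assumptions; real probabilistic linkage typically uses trained models over many more signals.

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    """Link only when declared PII matches exactly."""
    return any(a.get(k) and a.get(k) == b.get(k) for k in ("national_id", "phone", "email"))

def probabilistic_score(a: dict, b: dict) -> float:
    """Crude string similarity over softer signals, averaged per field."""
    fields = ("name", "address", "device_os")
    scores = [SequenceMatcher(None, a.get(f, ""), b.get(f, "")).ratio() for f in fields]
    return sum(scores) / len(fields)

def linked(a: dict, b: dict, threshold: float = 0.85) -> bool:
    return deterministic_match(a, b) or probabilistic_score(a, b) >= threshold

a = {"phone": "081234596708", "name": "Nanda A."}
b = {"phone": "081234596708", "name": "Nanda Anzana"}
print(linked(a, b))  # -> True, via the deterministic rule
```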

Canonicalization unifies records, converting the data into a standard form. For every set of connected identifiers, a new identifier is created that defines all of them as one entity.
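
One common way to implement this step is a union-find (disjoint-set) structure over the pairs produced by record linkage; the root of each group then serves as the canonical entity ID. The identifiers below are made up.

```python
class UnionFind:
    """Group identifiers that record linkage connected into one entity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairs produced by record linkage (hypothetical identifiers)
links = [("acct:A1", "phone:081234596708"), ("phone:081234596708", "acct:A2")]
uf = UnionFind()
for a, b in links:
    uf.union(a, b)

# Both accounts now resolve to the same canonical entity
print(uf.find("acct:A1") == uf.find("acct:A2"))  # -> True
```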

Oppna will help you manage your data in one place. Save time and get insights more quickly!
