Approach to Understanding User Better: Entity Resolution

Nanda Anzana
5 min read · Mar 16, 2023


There are five chapters on entity resolution. This is Chapter 2; check out Chapter 1 if you missed it.

Things get more interesting here: this chapter walks through a practical definition of Entity Resolution (ER).

Entity resolution (ER) determines whether multiple records or labels refer to the same real-world object or to different ones. It has some prerequisites: the data must already be cleaned and broken down into its smallest parts. So there is a fair amount of work to do before the resolution itself.

Entity resolution is not limited to solving one person’s identity problem (Identity Resolution); taking it further, we can expand into two more perspectives: Family Resolution and Group Resolution. See the figure below for an overview of the entity levels that can be resolved.

Figure 1. Entity leveling

If you are curious about hands-on data cleaning, implementation tools, and how they are applied to big data, check out Chapter 3, Chapter 4, and Chapter 5.

(1). Identity Resolution

Identity resolution is the process of finding the different identities in our data that belong to the same person. Why would the same real identity be recorded differently?

This can happen because of manual recording by humans, data not being cleaned before processing, multiple systems recording without a shared unique identifier, or non-standard user input. All of these make it difficult to know whether records belong to the same person, and they create a gap (sometimes quite significant) between our data and the actual users.

Table 1. Example User Master Data
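As a rough illustration of how non-standard input hides a shared identity, here is a minimal sketch that normalizes names and phone numbers and then groups accounts with identical normalized values. The records, the field names (`account_id`, `name`, `phone`), and the rule of matching on an exact normalized pair are all assumptions made for this example, not part of any particular tool.

```python
import re
from collections import defaultdict

# Hypothetical sample records; field names and values are invented for illustration.
records = [
    {"account_id": 1, "name": "Budi Santoso", "phone": "+62 812-3456-7890"},
    {"account_id": 2, "name": "budi  santoso ", "phone": "081234567890"},
    {"account_id": 3, "name": "Siti Aminah", "phone": "0813 1111 2222"},
]


def normalize_name(name: str) -> str:
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def normalize_phone(phone: str) -> str:
    """Keep digits only and rewrite a leading country code 62 as a leading 0."""
    digits = re.sub(r"\D", "", phone)
    if digits.startswith("62"):
        digits = "0" + digits[2:]
    return digits


def resolve_identities(rows):
    """Group account IDs whose normalized (name, phone) pair is identical."""
    groups = defaultdict(list)
    for row in rows:
        key = (normalize_name(row["name"]), normalize_phone(row["phone"]))
        groups[key].append(row["account_id"])
    return list(groups.values())


identities = resolve_identities(records)
print(identities)  # [[1, 2], [3]]
```

Exact matching after normalization is the simplest possible rule; real data usually also needs fuzzy comparison for misspelled names.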

By resolving identities, we can optimize everything we do for the user. Imagine we have a budget to promote to all accounts, the cost for one account is $1, and our user base is 100k accounts. Because we have done identity resolution, we know that 25% of those accounts belong to users who hold multiple accounts (assume two accounts each), and we only want to target each user’s most recently active account. So the budget we need is:

75,000 + 25,000 / 2 = 87,500 accounts, or $87,500 instead of $100,000.

That is a saving of 12.5%, and the amount saved grows further when ER is applied to a bigger account base.
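The budget arithmetic can be sketched as a small script. It assumes the 25% is read as the share of accounts that belong to two-account holders, which is the reading consistent with the 12.5% figure; the variable and function names are made up for this example.

```python
# Assumptions for this sketch (not from a real dataset):
# - 25% of all accounts belong to users who hold exactly two accounts each
# - promoting to one account costs 1 USD
total_accounts = 100_000
cost_per_account_usd = 1
duplicate_share_pct = 25
accounts_per_duplicate_user = 2


def accounts_to_target(total: int, share_pct: int, per_user: int) -> int:
    """Count one account per real user after identity resolution."""
    duplicated = total * share_pct // 100  # accounts held by multi-account users
    single = total - duplicated            # accounts that map to exactly one user
    return single + duplicated // per_user


targets = accounts_to_target(
    total_accounts, duplicate_share_pct, accounts_per_duplicate_user
)
budget_usd = targets * cost_per_account_usd
saving = 1 - budget_usd / (total_accounts * cost_per_account_usd)

print(targets, budget_usd, saving)  # 87500 87500 0.125
```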

(2). Family Resolution

Now let’s look at the family resolution level. Here we want to know whether accounts belong to the same household. How can we do this? The approach is straightforward: connect identifiers such as correspondence address, citizen ID (e.g. NIK for Indonesian citizens), bill payment transactions, and other identifiers that indicate accounts belong to the same household. Group resolution, on the other hand, is more complex: we can combine probabilistic and deterministic approaches to determine whether connected users are in the same group, such as office co-workers or members of the same community.

Should we care whether our users belong to the same family? Wouldn’t it be great to know how many of our users turn out to be related? Although people have their own individual characteristics, members of the same family usually share the same values, so we can assume they will react similarly to the same things. As business owners, that makes it easier to reach people whose family values fit our target market. In addition, knowing that several users are in one family lets us promote items related to family needs.
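Connecting accounts through shared identifiers can be sketched with a union-find structure: any two accounts that share an identifier value end up in the same household. This is only a deterministic sketch under assumed data; the account IDs, the `address` and `bill_id` fields, and their values are hypothetical.

```python
from collections import defaultdict

# Hypothetical accounts; identifier names and values are invented for illustration.
accounts = {
    "acc_1": {"address": "Jl. Melati 10", "bill_id": "PLN-001"},
    "acc_2": {"address": "Jl. Melati 10", "bill_id": "PLN-777"},
    "acc_3": {"address": "Jl. Kenanga 5", "bill_id": "PLN-777"},
    "acc_4": {"address": "Jl. Mawar 2", "bill_id": "PLN-314"},
}


def find(parent, x):
    """Return the representative of x's set, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def resolve_households(accts):
    """Group accounts that share at least one identifier, directly or transitively."""
    parent = {acc: acc for acc in accts}
    first_seen = {}  # (identifier type, value) -> first account carrying it
    for acc, identifiers in accts.items():
        for pair in identifiers.items():
            if pair in first_seen:
                union(parent, first_seen[pair], acc)
            else:
                first_seen[pair] = acc
    households = defaultdict(list)
    for acc in accts:
        households[find(parent, acc)].append(acc)
    return sorted(sorted(members) for members in households.values())


households = resolve_households(accounts)
print(households)  # [['acc_1', 'acc_2', 'acc_3'], ['acc_4']]
```

Note that `acc_3` joins the household only through `acc_2`. Transitive linking like this can merge too much when an identifier is shared widely, which is one reason to add probabilistic checks on top.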

(3). Group Resolution

Groups can be divided into many different kinds of entities. This article discusses two large groups that commonly exist in society: Office (co-workers) and Community.

What value do we get if the people are in the same office?

Suppose we run an e-money company that wants to launch a pay later product targeting office workers. User A, who has been on our application for a long time, wants to use pay later. Before using the product, there is a verification stage where the user consents to having their data verified with their office. Assume that, after office resolution, we find user B, who is in the same office as user A and appears close to user A based on transactions. We can then contact user B to verify that user A really works for the company they mentioned, and also confirm that user B works there too. In the end, we can place more confidence and trust in the information provided by user A.
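Picking user B can be sketched as follows, assuming office resolution has already produced an office label per user and that "close in transactions" simply means the highest count of transactions with user A. The `office_of` mapping, the transaction log, and the `pick_verifier` helper are hypothetical names used only for this example.

```python
from collections import Counter

# Hypothetical output of an earlier office-resolution step.
office_of = {
    "user_a": "office_1",
    "user_b": "office_1",
    "user_c": "office_1",
    "user_d": "office_2",
}

# Hypothetical transaction log as (sender, receiver) pairs.
transactions = [
    ("user_a", "user_b"),
    ("user_b", "user_a"),
    ("user_a", "user_c"),
    ("user_a", "user_d"),
    ("user_a", "user_d"),
    ("user_a", "user_d"),
]


def pick_verifier(applicant, offices, txns):
    """Return the same-office user who transacts most with the applicant, or None."""
    office = offices[applicant]
    counts = Counter()
    for sender, receiver in txns:
        if sender == receiver or applicant not in (sender, receiver):
            continue
        other = receiver if sender == applicant else sender
        if offices.get(other) == office:
            counts[other] += 1
    if not counts:
        return None
    # Sorting first makes ties resolve alphabetically instead of by log order.
    return max(sorted(counts), key=counts.__getitem__)


verifier = pick_verifier("user_a", office_of, transactions)
print(verifier)  # user_b
```

Here `user_d` transacts most with `user_a` but is skipped because the office differs, which is the point of resolving offices first.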

Why is it good to know people are in the same community?

Now for community resolution. Imagine we own the largest e-commerce company and want to offer a wholesale purchase promo to communities whose members are active on our platform. We can detect these communities among the users we already have and choose the user at the center of each community to target. Which items suit the promo? We can choose those based on the community’s interests as well.
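Choosing the center of a community and its promo items can be sketched like this. The sketch assumes the community has already been detected and uses the number of direct connections (degree) as a simple stand-in for "center"; other centrality measures are possible. The user IDs, connections, and purchases are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical connections between members of one already-detected community.
community_edges = [
    ("u1", "u2"),
    ("u1", "u3"),
    ("u1", "u4"),
    ("u2", "u3"),
    ("u4", "u5"),
]

# Hypothetical purchase history per member.
purchases = {
    "u1": ["rice", "cooking oil"],
    "u2": ["rice"],
    "u3": ["rice", "sugar"],
    "u4": ["cooking oil"],
    "u5": ["rice"],
}


def community_center(edges):
    """Return the member with the most distinct direct connections."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(neighbors), key=lambda user: len(neighbors[user]))


def top_items(history, n=2):
    """Return the n items bought most often across the community."""
    counts = Counter(item for items in history.values() for item in items)
    return [item for item, _count in counts.most_common(n)]


center = community_center(community_edges)
promo_items = top_items(purchases)
print(center, promo_items)  # u1 ['rice', 'cooking oil']
```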

Furthermore, we can see how a neighborhood behaves by looking at the families living in it. For example, we can check whether a neighborhood is problematic by counting how many families that commit fraud come from it.

It’s always nice to know how our customers are connected to each other and to derive business value from that. How can we do that? What initial steps do we need to take? What type of data do we need to analyze?

Oppna will help you understand the connections between your users in just one hit!
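A neighborhood-level view can be sketched as a simple aggregate over resolved families. The rows below, including the neighborhood names and fraud flags, are hypothetical, and the share of flagged families is just one possible indicator.

```python
from collections import defaultdict

# Hypothetical resolved families: (family_id, neighborhood, has_fraud_record).
families = [
    ("fam_1", "north", True),
    ("fam_2", "north", False),
    ("fam_3", "north", True),
    ("fam_4", "north", False),
    ("fam_5", "south", False),
    ("fam_6", "south", False),
]


def fraud_share_by_neighborhood(rows):
    """Return, per neighborhood, the fraction of families with a fraud record."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for _family_id, neighborhood, has_fraud in rows:
        totals[neighborhood] += 1
        if has_fraud:
            flagged[neighborhood] += 1
    return {name: flagged[name] / totals[name] for name in totals}


shares = fraud_share_by_neighborhood(families)
print(shares)  # {'north': 0.5, 'south': 0.0}
```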
