Finding The Common Customers of Two or More Organizations While Keeping Personal Data Secret (Part 1)

Galin Dinkov

Published in

Mimirium

9 min readFeb 3, 2020

Part 1/3 — Introduction

Intro

This is the first of three articles where I’ll discuss the the problem how to find who are the common customers of two or more companies without revealing any information about these customers. The full whitepaper can be found here.

This first part will review the problem and will outline a potential solution. The second part will discuss some of the cryptographic algorithms performing private set intersection. And the final third part will demonstrate a command line tool implementing these algorithms.

The Problem

Customers are probably the most valuable asset for most companies. Customers generate the revenue of the company. Also the number of customers affects the evaluation of the company. It takes great effort and cost for the companies to create and keep their customer base. Therefore, it is not a surprise that the companies will always keep their customers information private. Moreover, the personal identifiable information (PII) from the customer data is protected by privacy laws in many countries (GDPR for EU). These laws are usually very restrictive in cases of sharing with third parties and the fines can reach hundreds of millions.

However there are cases where knowing the common customers they have could be beneficial for the companies. They can use this information for cross-selling, special offers or fraud detection (see Applications). Of course this must be done in such a way, so that the other (non-common) customers are not revealed to the third party.

In general there are two or more organizations having their customer data in a software database or a cloud. These organizations want to use specialized software or a service to determine one or more of the following:

Do they share common customers
How many common customers they share
Who are their common customers

The mathematical explanation of the problem will be:

If we have two organizations and their customers are represented as corresponding sets A and B (see Fig. 1) then to answer the questions above we need to solve:

Though this must be achieved without revealing the values of the sets thus keeping them private. This can be achieved using cryptographic technique called Private Set Intersection or PSI.

Fig. 1 — Common Customers — Figure 1 — Common Customers

Solution

There are several protocols for PSI each with different characteristics such as security, speed, amount of data to transfer, number of intermediaries, complexity and more.

Based on the encryption type they could be:

Unencrypted (plain) — the parties exchange data in plain text with each other or with a trusted party (insecure)
Hash-based
Encrypted with public-key cryptosystem
Encrypted with homomorphic cryptosystem
Other (not covered here) — e.g. circuit based or based on oblivious transfer

Based on the 3rd parties involved they could be:

With a trusted third party — two organizations are trusting a third party and providing it with some of their customer data. If the trusted party follows the protocol honestly it will not reveal that data and will give the correct intersection.

Fig. 2 — Using a Trusted Server — Figure 2 — Using a Trusted Server

Here the data is encrypted by the organizations but the keys are controlled by the trusted party. So eavesdroppers will not be able to see the raw data but the trusted party will. This scenario requires extra legal prerequisites like signed contracts between the trusted party and the organizations and could also require the consent of the customer. Hence it is not universally applicable.

With a non-trusted 3rd party — the organizations use a 3rd party in the process of determining its customers set intersection but this party does not receive access to the raw information but only processes it and gives back the result to the organizations.

Fig 3 — Using Semi-Honest Server — *Figure 3 — Using Semi-Honest Server*

In this case encryption keys are controlled by the companies, but they decide to use a third party for some or all of the computations. The third party cannot see the customer data but can compare or process it.

Without a third party — in this case the organizations communicate peer 2 peer with no intermediaries.

In these schemes usually both organizations control parts of the encryption key (SMC) or they use a public-key cryptosystem.

Applications

The use-case presented in this paper is to find the common customers of two or more organizations without revealing any information to third parties. In the real world this could be used by insurance companies or financial institutions to determine risk and detect fraud, or by retailers for cross-selling offers.

However, there are other applications of the technology, some already applied in practice.

For example, some chat messengers use similar techniques to see if two users have common contacts in their contact lists.

Another option is to perform secure database join based on the customer’s data available in both databases, thus extending one with the data from the other but only for the common rows.

Terminology

Encryption — the process of encoding a message or information in such a way that only authorized parties can access it and those who are not authorized cannot.

Decryption — the reverse of encryption, in other words, moving from the unintelligible ciphertext back to plaintext.

Public-key Cryptography — or asymmetric cryptography, is a cryptographic system that uses pairs of keys: public keys which may be disseminated widely, and private keys which are known only to the owner. In such a system, any person can encrypt a message using the receiver’s public key, but that encrypted message can only be decrypted with the receiver’s private key.

Hash Function — one-way function that digests data of arbitrary size to fixed sized value (number). The function cannot be reversed and is deterministic — with given input it always generates the same output.

Secure Multiparty Computation — has the goal of creating methods for parties to jointly compute a function over their inputs while keeping those inputs private.

Homomorphic Encryption — type of encryption allowing some arithmetic operations to be done over the encrypted values in such a way that when decrypted they give the correct result.

Partially Homomorphic Encryption — allows only some operations for example only addition, or only multiplication.

Fully Homomorphic Encryption — supports arbitrary computation on ciphertexts.

General Protocol

All listed protocols here follow the same steps:

Identify customer data.
Unify data fields.
Hash the data.
Perform private set intersection.
Report the final results.

These steps are discussed in details in the following chapters.

Customer Data Identification

We define two types of customer data to:

Personal

Names — John Smith
Date of Birth — 09/02/1978
Address — 66 E. Surrey Drive, Lexington, NC 27292
ID — 606958274
Email — john.smith@…com
Phone — 607–209–111

Non-Personal

Purchase History
Login History
Preferences
Meta data

The organizations need to agree on the chosen data fields, in order to compare their customers. The field(s) chosen need to be:

As few as possible since more fields will increase the amount and speed of transferred data. Also if some customers may exist in both sets but with slightly different information, e.g. same names, different email.
As unique as possible. If there are many entries with the same data it will be useless to compare because there will not be a guarantee that the matches found are correct. For example, if customers are compared based on the birth date. (see birthday paradox).

Data Unification

For the customer data to be compared, both organizations need to agree on common data formats and to unify their data sets according to that.

For example, if the parties are comparing customer phones to find a customer in common, they may have the phone numbers stored like this:

607–108–899 for organization A
+64607108809 for organization B

Another example is the name of the customer:

John Smith for organization A
JOHN SMITH for organization B

They both need to be in the same case in order to be compared correctly.

Since in most of the cases we will compare encrypted data (which cannot be additionally formatted) we need the unification to be done by each organization prior to comparison.

Data Hashing

As mentioned above, hashing or applying a hash function to the data can convert it to a single number.

It has the following important properties:

Deterministic — will generate the same output every time specific input is given

One-Way — having only the hash makes it impossible to get the original data
Uniformly distributed — the numbers generated by the hash function are truly random
Collisions — it is very unlikely to find two messages generating the same hash
Changing a single bit generates entirely new hash, indistinguishable from a random number

We will use that function in all scenarios for several reasons:

Hash function provides some privacy when applied to personal data compared to the plain version of that data
We can combine multiple data fields with unlimited length into a single hash, thus fixing the length of the data to be compared
If we add additional source of entropy to the hash it becomes secure enough to be transferred to a 3rd party which will not be able to see the raw data

Many different hash functions exist each with different pros and cons however the de facto standard is the NSA family of standards Secure Hash Algorithms (SHA2, SHA3). Most popular among them is SHA256 — which accepts arbitrary input and outputs 256 bit (32 bytes) digest.

Examples

Let’s have a customer’s phone such as: +64607108809

What you can see as a result from the hash function is the HEX representation of the 256-bit number generated. The decimal representation is: 7.247375870513271e76

Now we can take the hash and compare it to the data of the other organization. If we find a match we will know which phone it was before hashing and we will know that we have a common customer and who he is. This may seem like a good idea although it has some issues we will demonstrate later.

Let us see some more examples:

We have the name of a customer such as: John Smith

Usually his data will be in multiple data fields such as FirstName and LastName. Also during the unification process his names may be capitalized: JOHN SMITH to match both organizations databases. Finally we need to represent it as a single string so we concatenate first and last names and put a dot between them so the final name becomes: JOHN.SMITH

However, there are many John Smiths (47,732 people in the U.S. to be precise). Therefore, it may be a good idea to add some other data field to ensure that the other organization has the same John Smith in its database as our customer. We add his birthdate:

Again we may think we have found a good combination of data fields to uniquely identify a user but due to the birthday paradox there is still a high probability that we will find two or more customers, who share the same name and birthday. Hence it is considered good practice to add unique fields as identifiers, such as the phone number field. However, sometimes phone numbers are reused by mobile operators for different users so we need to combine it with a name for example.

To Be Continued …

In the next Part 2 — Private Set Intersection, I will review some of the algorithms to do PSI and will show examples.