Finding The Common Customers of Two or More Organizations While Keeping Personal Data Secret (Part 2)

Galin Dinkov

Published in

Mimirium

11 min readFeb 3, 2020

Part 2/3 — Private Set Intersection

Previously

This is the second of the three articles where I discuss the the problem how to find who are the common customers of two or more companies without revealing any information about these customers. The full whitepaper can be found here. And if you want to read the introduction to the problem read the first part first.

This part will discuss some of the cryptographic algorithms performing private set intersection (PSI).

Naïve Algorithm — Hash Compare

The idea is to prepare hashes from some data fields in the data sets of both organizations and compare them.

Protocol

We have two organizations and each of them has a database with customers.

The steps will be:

1. Both sides need to agree on which data fields they are going to compare.

We choose FirstName + LastName + Phone

2. Prepare and unify the customer ID based on that fields.

Concatenate the data fields, separated by dots

3. Hash all fields using SHA-256 hash function

4. Now since we have our hashed data we need to compare them with the other party. So here the options are two use a 3rd party to compare the data or to exchange it directly with the other party.

Below are pros and cons for both options:

5. Perform the intersection

6. Mark the matches and prepare the report

Security

This method is very easy fast and implemented already in the practice though its security is not very high.

The biggest problem is that even the hash function is one-way which means we cannot de-hash the digest we have and get the source customer data, we still can achieve that but putting a little more effort.

There are two attacks an adversary can perform over the hashes — brute-force attack and rainbow table attack. We will demonstrate the brute-force attack.

Unlike when hashing general random data, here we know what kind of data we are hashing (this is known to both parties and adversaries too, since they know the protocol).

We know we are hashing FirstName + LastName + Phone. So here to find the source value (customer data) of a hash we need to try all possible (or enough) combinations between these data fields. This may look like a hard problem but actually it is not.

According to the web site How Many of Me

“There are 330,136,629 people in the United States of America. The U.S. Census Bureau statistics tell us that there are at least 151,671 different last names and 5,163 different first names in common use in the United States. Some names are more common than others.”

So if somehow we have the names of all US citizens the list will be with about 330 million names.

If we don’t we can calculate the total possible name combinations to be:

In the USA the formatting convention for phone numbers is NPA-NXX-XXXX, where NPA is the three digit area code and NXX-XXXX is the seven digit subscriber number

So very roughly we need to find all permutations of 7 digit numbers which is:

And the total number of all possible first name, last name, phone combinations will be:

Yes, this is a huge number. And we have not only to find all the combinations but also to calculate the hashes of these combinations.

(Un)fortunately there is an ASIC microchip called Antminer for about $500 that is able to produce about 14 TH/s (terahashes per second) or 1,000,000,000,000 hashes per seconds.

That makes the total time for calculating all possible names / phone combinations to:

And once we have all these combinations we can save them along with their hashes, thus creating creating a rainbow table to compare in the future (without the need to brute-force again).

The conclusion is that either we need to add many fields, some random numbers or other sources of additional entropy that will make the brute-force harder or impossible or we need to change the algorithm. Keep in mind that both organizations need to know the entire protocol and have all the parameters for it.

The algorithm could be extended to use cryptographic salt, pepper or HMAC , however additional communication and negotiation between the parties has to be performed.

Based on Diffie-Hellman Key Exchange

Diffie–Hellman key exchange is a method of securely exchanging cryptographic keys over a public channel. The following diagram demonstrates its principle :

1. The process begins by having the two parties, A and B, agree on an arbitrary starting color that does not need to be kept secret in this example, the color is yellow.

2. Each of them also selects a secret color that they keep to themselves — in this case, red and blue-green.

3. The crucial part of the process is that Alice and Bob each mix their own secret color together with their mutually shared color, resulting in orange-tan and light-blue mixtures respectively, and then publicly exchange the two mixed colors.

4. Finally, each of the two mixes the color they received from the partner with their own private color. The result is a final color mixture (yellow-brown in this case) that is identical to the partner’s final color mixture.

If a third party listened to the exchange, it would only know the common color (yellow) and the first mixed colors (orange-tan and light-blue), but it would be computationally difficult for this party to determine the final secret color (yellow-brown). In fact, when using large numbers rather than colors, this action is computationally expensive: It is impossible to do in a reasonable amount of time even for modern supercomputers.

A cryptographic explanation of the protocol can be done using multiplicative cyclic groups.

Parties A and B (Alice and Bob) agree on a finite cyclic group G of order n and a generating element g in G. (This is usually done long before the rest of the protocol; g is assumed to be known by all attackers.) The group G is written multiplicatively.
Alice picks a random natural number a, where 1 ≤ a < n, and sends g^a to Bob. (g^a here means g to the power of a)
Bob picks a random natural number b, which is also 1 ≤ b < n, and sends g^b to Alice.
Alice computes (g^b)^a.
Bob computes (g^a)^b.
Both Alice and Bob are now in possession of the group element gab, which can serve as the shared secret key. The group G satisfies the requisite condition for secure communication if there is not an efficient algorithm for determining gab given g, g^a, and g^b.

Protocol

So getting back to the customer data sharing, Taking the same data from Hash Compare we have:

To ease the computation we again hash the data fields:

Now we perform the diffie-hellman exchange, but instead of sharing a secret key with the other party we exchange the hashed data fields.

1. First both organizations need to agree on the parameters: generator gand modulo n. It is important to mention that since we will use sha256 hashes for customer data and this will be 256 bits in length and we have this requirement 1 ≤ m < n that means that n > 2^256.

2. Then they choose their random private secret a and b

3. And then they use it to encrypt their customer data

4. They exchange their encrypted elements with each other and exponentiate them with their corresponding secrets.

Using the commutative property they achieve the same private exponent and if they have the same customer the result will be equal. At this point they have their data encrypted in the same way but they haven’t compared the encrypted values.

5. Compare — again this can be done using or not using a 3rd party. However this time the 3rd party is not necessary to be trusted since it will not know anything about the keys, and even one of the parties is dishonest and gives the 3rd party his key it still cannot decrypt the data of the honest party.

Security

The security of the algorithm is based on the difficulty to find the discrete logarithm. It comes from the assumption that for random a, b it will be very difficult to find their values just by having g^a (mod n), g^b (mod n), g^ab(mod n).

NIST suggests at least 3072 bit group to be used for some reasonable level of security.

The algorithm could also be implemented to use elliptic curves thus requiring much smaller keys for the same level of security (e.g. 256 bit for the same level of security as 3072 bit discrete logarithm based).

Based on RSA

RSA is a public-key cryptosystem named after its creators Ron Rivest, Adi Shamir, and Leonard Adleman.

The principle behind it is to find three large primes e, d, n such that with modular exponentiation of all integers m, 0 ≤ m < n:

(m^e)^d ≡m (mod n) and that even knowing e and n or even m it can be extremely difficult to find d.

RSA involves a public key and a private key. The public key can be known by everyone, and it is used for encrypting messages. The intention is that messages encrypted with the public key can only be decrypted in a reasonable amount of time by using the private key. The public key is represented by the integers n and e; and, the private key, by the integer d; m represents the message.

Protocol

Unlike the previous protocols this allows only one of the parties to check the data.

1. Party A chooses its public-private key pair, and gives its public key to party B

2. Party B multiplies its data fields (hashed) with random numbers and sends this to A.

3. A computes a’ and b’’ and gives them to B such as:

4. Next B performs calculations over the A’s set to achieve the same encryption

At this stage B can have the encrypted sets and find duplicates or common customers. B cannot decrypt the data provided by A because it doesn’t have the secret d.

Onе disadvantage of this protocol is that both sides cannot work in parallel to achieve the same sets.

Security

The suggested key length again is at least 3072 bit. The security comes from the assumption that it is hard to find the prime factorization of the modulo n.

Unlike Diffie-Hellman here elliptic curves cannot be used.

Using Fully Homomorphic Encryption

As defined earlier homomorphic encryption allows some arithmetic operations to be performed over the encrypted data. In the case of fully homomorphic encryption all possible operations can be performed.

Example:

Protocol

The protocol can be implemented again using or not using a 3rd party, We will review the one not involving a 3rd party.

1. Both organizations A and B agree on a homomorphic encryption scheme. Also each of them generate public(pk) — private(sk)key pair

2. Each party encrypts its customer data (hashed) with its corresponding public key so the encrypted data will be indistinguishable from a random data for the other party.