Calculating Similarities in Identities at Scale

Jordan Violet
SailPoint Engineering Blog
5 min read · Feb 25, 2022

Author: Rajat Kabra

CPUs and GPUs process tasks in different ways, and they are often compared to the brain and the brawn of computing, respectively. A CPU (the brain) can work through a wide variety of complex calculations, while a GPU (the brawn) is best at focusing all of its computing power on one specific kind of task. Think of the CPU as the Swiss army knife of computing and the GPU as the machete. The Swiss army knife is useful for many different things, from cutting a piece of paper to opening a can of food, because it is designed to perform reasonably well across a range of tasks. But you probably wouldn't want an army knife when you have to find your way through a jungle; you would rather have a machete, which is not as versatile as a Swiss army knife but does its one simple job very well. Now, let's put this in computing terms.

A CPU is designed to maximize performance on a single task, or a few tasks in parallel, across a wide range of workloads. A GPU, on the other hand, uses thousands of smaller, more efficient cores in a massively parallel architecture aimed at handling many simpler calculations at the same time. GPUs are far less versatile than CPUs, but for the right workloads they can process data orders of magnitude faster. A traditional desktop CPU has around 16 cores, and a server CPU usually has about 24–48, while a single GPU can have 2,000 or more cores. This makes GPUs very efficient for repetitive, highly parallel computing.

GPUs in Identity?

Until now we have mostly heard about GPUs in the context of video games, video editing, artificial intelligence, and now cryptocurrency. So why are we discussing them in the world of identity governance and administration (IGA)? The answer is simple: we have to do some heavy calculations in the IGA world as well, and this is where the GPU comes in. One such example is calculating the similarity between identities using the Jaccard similarity metric. This metric measures how similar two sets of data are. Its value ranges from 0 to 1, where values closer to 1 mean higher similarity and values closer to 0 mean lower similarity. We calculate Jaccard similarity by counting the items the two sets have in common and dividing by the total number of unique items across both sets. In IGA terms, we can define similarity between identities based on the entitlements that they share.

How It Works

Jaccard Similarity = count of common entitlements / count of unique entitlements across the two identities

For example, say we have two identities, Bob and Anne. Bob has entitlements [a, b] and Anne has [b, c]. The only entitlement in common is b, so the count of common entitlements is 1 and the total number of unique entitlements is 3. The Jaccard similarity is therefore 1/3, or about 0.33. We can use GPUs to compute Jaccard similarity at scale using matrix multiplication.
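As a minimal sketch in Python (the identities and entitlement names are just the ones from the example above):

```python
# A minimal, set-based version of the calculation above.
def jaccard(entitlements_a, entitlements_b):
    a, b = set(entitlements_a), set(entitlements_b)
    common = len(a & b)   # entitlements the two identities share
    unique = len(a | b)   # unique entitlements across both identities
    return common / unique if unique else 0.0

print(jaccard(["a", "b"], ["b", "c"]))  # 0.333... -> 1 common out of 3 unique
```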

Jaccard Similarity Using Matrix Multiplication

The idea is to compute the Jaccard similarity between identities using matrix multiplication. When we multiply a matrix by its transpose, we compute the pairwise intersections between the rows of the matrix. For example, say we have 4 identities and 4 entitlements. In the following table, a value of 1 in an entitlement column means the identity has that entitlement, and a value of 0 means it does not.

Now, if we treat the above table as a matrix and multiply it with its transpose, the output would be this:
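Here is a NumPy sketch of that product (Bob and Anne have the entitlements from the earlier example; the two extra identities and their 0/1 values are hypothetical, filled in only to make the example runnable):

```python
import numpy as np

# Rows = identities, columns = entitlements a, b, c, d.
# A 1 means the identity holds that entitlement.
M = np.array([
    [1, 1, 0, 0],  # Bob:  a, b
    [0, 1, 1, 0],  # Anne: b, c
    [1, 0, 0, 1],  # hypothetical identity: a, d
    [0, 0, 1, 1],  # hypothetical identity: c, d
])

common = M @ M.T  # common[i, j] = number of entitlements identities i and j share
print(common)
# The diagonal holds each identity's own entitlement count;
# common[0, 1] == 1 -> Bob and Anne share one entitlement (b).
```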

As we can see from the first table, Bob and Anne share one entitlement. So the matrix multiplication gives us the count of entitlements in common between each pair of identities. To compute the Jaccard similarity from that result, we use the following formula:

So Jaccard Similarity between Bob and Anne will be:
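In other words, using the counts from the matrix product (the notation here is a sketch, where count(i) is the diagonal entry for identity i):

Jaccard Similarity (i, j) = common(i, j) / (count(i) + count(j) - common(i, j))

For Bob and Anne that works out to 1 / (2 + 2 - 1) = 1/3, or about 0.33, matching the set-based calculation earlier.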

Jaccard Similarity Using GPU

As mentioned at the beginning, we can use a GPU to do simple computations in a massively parallel way. Since we know that we can compute Jaccard similarity with matrix multiplication, we can use a GPU's cores to parallelize that multiplication.
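As a sketch of how that can look in code, here is the same calculation using CuPy, a GPU-backed, NumPy-compatible array library (the tiny matrix is the hypothetical one from above; a real workload would use the full identity-by-entitlement matrix):

```python
import cupy as cp  # GPU-backed drop-in for much of the NumPy API

# Hypothetical binary identity-by-entitlement matrix, moved into GPU memory.
M = cp.asarray([
    [1, 1, 0, 0],  # Bob
    [0, 1, 1, 0],  # Anne
    [1, 0, 0, 1],  # hypothetical identity
    [0, 0, 1, 1],  # hypothetical identity
], dtype=cp.float32)

common = M @ M.T              # shared-entitlement counts, computed on the GPU
counts = cp.diag(common)      # each identity's own entitlement count
union = counts[:, None] + counts[None, :] - common
jaccard = common / union      # pairwise Jaccard similarity matrix
print(cp.asnumpy(jaccard))    # copy back to host memory to inspect
```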

Let's assume that we must calculate Jaccard similarity among 100,000 identities. To compute it between all possible identity pairs we would have to do about 10,000,000,000 computations, since pairwise Jaccard has a time complexity of O(n²). That can be cut roughly in half, to about 5 billion computations, by exploiting the symmetry of the similarity matrix (we only need one of each pair). This should be easy for a computer, right? Yes, it is easy, but it is time consuming. If we do parallel processing on a 16-core CPU where each core can do 100 of these computations per second, 5 billion computations will take about 36 days to finish.
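A quick back-of-the-envelope check of those numbers (same assumptions as above: symmetry cuts the pair count roughly in half, and each CPU core handles 100 comparisons per second):

```python
n = 100_000
all_pairs = n * n                # ~10,000,000,000 pairwise computations
half_pairs = n * (n + 1) // 2    # ~5,000,050,000 once we exploit symmetry

cpu_rate = 16 * 100              # 16 cores x 100 computations per second
print(half_pairs / cpu_rate / 86_400)  # ~36 days
```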

Conclusion

Earlier we discussed the significant speed-up a GPU and its parallel computing capabilities can deliver on simple, repetitive tasks, and multiplying a matrix is a perfect problem to hand to a GPU. A normal, inexpensive cloud GPU can have up to 2,048 parallel processing cores, compared with the 16 cores of the CPU above. Using the same logic as before and assuming each core can still do 100 computations per second, the GPU will take about 6 hours to finish the computation: 36 days on the CPU versus 6 hours on the GPU. Not bad!

GPU scaling comes with its own set of problems, though. CPU memory is cheap and can extend into hundreds of gigabytes, but GPU memory is significantly more expensive and limited, often to 8 GB or 16 GB on a single GPU. This means we cannot simply load an entire, large dataset into GPU memory, or it will run out of memory and fail. We need some careful optimizations and more advanced matrix multiplication techniques to make sure we can compute the Jaccard similarities within limited GPU memory. Those optimizations let us turn that 6 hours into 6 minutes. Stay tuned for that!
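As a rough sketch of one common way to stay within that limit, the multiplication can be done in row blocks so that only a slice of the output ever lives on the GPU at once. The block size, the CuPy usage, and the assumption that the identity-by-entitlement matrix itself fits on the GPU (while the full identity-by-identity output does not) are illustrative choices here, not the specific optimizations teased above:

```python
import cupy as cp
import numpy as np

def jaccard_in_blocks(M_host, block_rows=4_096):
    """Compute the pairwise Jaccard matrix block by block to bound GPU memory use.

    M_host: NumPy array of shape (identities, entitlements) with 0/1 values.
    Assumes every identity has at least one entitlement (avoids division by zero).
    """
    M_gpu = cp.asarray(M_host, dtype=cp.float32)  # assumed small enough to fit on the GPU
    counts = M_gpu.sum(axis=1)                    # entitlements per identity
    n = M_host.shape[0]
    result = np.empty((n, n), dtype=np.float32)   # final matrix kept in cheaper host memory

    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        block = M_gpu[start:stop]                 # a slice of identities
        common = block @ M_gpu.T                  # shared entitlements: this block vs everyone
        union = counts[start:stop, None] + counts[None, :] - common
        result[start:stop] = cp.asnumpy(common / union)  # move each finished block off the GPU

    return result
```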
