Streaming Similarity Search for Fraud Detection

Derrick Burns
Smyte Blog
Apr 24, 2017

Introduction

Bad actors on the internet often pretend to be who they are not. But if one looks closely at their actions, one can see who they really are. At Smyte, we seek to identify who those bad actors are and stop them in their tracks before they can do you or your company significant harm. In this brief post, we explore one of those techniques: similarity search.

The Problem

There are a large number of bad actor pseudonyms on the internet, but there are a relatively small number of actual bad actors. What do I mean by this? Let us consider a spammer. The person or company that produces the spam is the actual bad actor. The spammer assumes a number of pseudonyms, distinguished perhaps by email address and/or login name. If we look closely at the behavior of these actors, we can see patterns. Do they use the same credit card number? Did they create their accounts from the same machine or around the same time? Etc. If so, then we need only identify one pseudonym and then find other pseudonyms that exhibited similar behavior.

Initial Solution

Our initial solution to this problem was to build a graph. Nodes in the graph represent entities, such as pseudonyms, email addresses, credit card numbers, login IP addresses, etc. We connect nodes in the graph by edges when, for example, we see a pseudonym using a credit card number. This relation graph encodes the past behavior of all actors.

To find pseudonyms of a given bad actor, we search the relation graph one step from the bad actor to a feature and another step back to a different pseudonym. This provides us with a list of possible aliases for the bad actor. We then score the aliases according to the strength of the association. For example, if two pseudonyms use the same credit card number, we can be fairly confident that they represent the same actual bad actor.

In practice, this approach works extremely well. Why? Masking behavior is expensive. The more expensive Smyte makes it for bad actors, the less incentive they have to operate. By focusing on behavior, we find and stop bad actors.

Problems with the Initial Solution

One problem with our initial solution is that the relation graph becomes expensive to maintain and search over time. The reason is simple: it never stops growing. The larger the graph, the more expensive it is to store and to query in real time, since breadth-first searches are an O(n) operation. To make matters worse, graph traversals like this are difficult to shard, so we could not simply add hardware to improve performance; our only option was to scale the existing shards vertically. As a result, we were unable to enable this analysis for our larger customers.

Our Current Solution

To address the space problem, we need a solution that does not store all features but still allows us to find similar actors given their features. In abstract computer science terms, we need to map the high-dimensional space of features (e.g. 2³² IP addresses, 10¹⁶ credit card numbers, etc.) into a much lower-dimensional space that we can store. In other words, we need a way to maintain an approximation of the feature graph.

Let us shift away from the relation graph abstraction for a moment and consider instead a simple model of an actor and a set of actions that the actor has taken. Pseudonym A logged in from IP address B. Pseudonym A used image C for his avatar. Pseudonym A later used credit card D to make a purchase. We can represent A by the set of actions:

(LOGGED_IN_FROM_IP, B)
(USED_IMAGE_AS_AVATAR, C)
(USED_CREDIT_CARD, D)

Instead of placing these in a graph, we create a representation of these values in a constant space array. With only three actions, this is rather straightforward, but over time the list of actions can grow without bound. With constant space, we cannot store an unbounded number of actions.

Let us assume that we have an array of 64 floating point values to represent all actions ever taken by an actor. The actual number 64 is not important. What is important is that it is a) constant and b) small enough that we can store the array for each pseudonym without great expense.

How would we map the actions into the array? We resort to one of the old workhorses of computer science: hashing. If we hash each action into a few of the buckets of the array, and increment or decrement the value in those buckets according to the type of action (the first entry in the action tuple) and the value of the action (the second entry in the action tuple), then we can compactly represent the actions. The specific variant of hashing that we employ is known as sparse random projection.
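
To make this concrete, here is a minimal sketch of such a projection in Python. The dimension 64 comes from the discussion above; the number of buckets per action, the SHA-1-based seeding, and the helper names are illustrative assumptions, not our exact implementation.

import hashlib
import random

DIM = 64                  # length of the per-actor array, as above
BUCKETS_PER_ACTION = 4    # assumption: how many buckets each action touches

def project(action, weight=1.0):
    # Map one (action_type, value) tuple to a few signed buckets.
    # The action hashes to a deterministic seed, which picks the bucket
    # indices and a +1/-1 sign for each.
    seed = hashlib.sha1(repr(action).encode()).digest()
    rng = random.Random(seed)
    indices = rng.sample(range(DIM), BUCKETS_PER_ACTION)
    return {i: (1.0 if rng.random() < 0.5 else -1.0) * weight for i in indices}

def actor_vector(actions):
    # Sum the sparse projections of all of an actor's actions into one array.
    vec = [0.0] * DIM
    for action in actions:
        for i, v in project(action).items():
            vec[i] += v
    return vec

vec_a = actor_vector([
    ("LOGGED_IN_FROM_IP", "B"),
    ("USED_IMAGE_AS_AVATAR", "C"),
    ("USED_CREDIT_CARD", "D"),
])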

Given two arrays representing two different pseudonyms, how can we compare their similarity? If we consider the arrays to be vectors in a 64-dimensional vector space, then we can measure the angle between the vectors. This is known as cosine similarity, and it is easily computed in constant time from the two vectors (treating the length of the vector, 64, as a constant).
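
For reference, a cosine similarity helper is only a few lines (the function name is ours); later sketches reuse it.

import math

def cosine_similarity(a, b):
    # cos(angle) = dot(a, b) / (|a| * |b|); constant time for fixed-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)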

Finding Similar Behavior

We have a constant space representation of the behavior, but we still have a problem. Given a single vector, how do we find similar vectors in time proportional to the number of similar vectors? Clearly we cannot simply compare each vector to the given vector.

In abstract terms, we need a data structure to find nearest neighbors in this 64 dimensional vector space. Further, we need the data structure to support low cost updates as we ingest the stream of actions that each actor takes. We also need to make real time queries as we update the data structure. Yikes!

Locality Sensitive Hashing Forest

Our chosen data structure is the locality sensitive hashing forest (LSH Forest). While the name is long, the fundamental concept is simple. We hash each vector of 64 floats into a bucket according to the sign bit of each float. Effectively, we quantize each 32-bit float into a single (sign) bit. To find similar items, we look up the 64 float vector for the query actor, quantize it into 64 bits, then find the matching bucket. For each item in that bucket, we find the 64 float vector and compute the cosine similarity between them. We return the items with the highest scores.
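
A toy version of that bucket lookup might look like the following; the in-memory dictionaries stand in for the real hash tables, and the function names are ours.

from collections import defaultdict

def quantize(vec):
    # One sign bit per dimension: a 64-float vector becomes a 64-bit bucket key.
    key = 0
    for x in vec:
        key = (key << 1) | (1 if x >= 0.0 else 0)
    return key

buckets = defaultdict(set)   # bucket key -> actor ids (stand-in for the hash table)
vectors = {}                 # actor id -> 64-float vector

def insert(actor_id, vec):
    vectors[actor_id] = vec
    buckets[quantize(vec)].add(actor_id)

def query(vec, top_k=10):
    # Look up the query's bucket and rank its members by cosine similarity
    # (cosine_similarity as sketched earlier).
    candidates = buckets.get(quantize(vec), set())
    scored = [(cosine_similarity(vec, vectors[a]), a) for a in candidates]
    return sorted(scored, reverse=True)[:top_k]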

This search algorithm has high precision but low recall. In other words, if we return items, they really are similar, but this approach may miss many similar items.

With the LSH Forest, instead of searching just the original query bucket, we also search and score multiple hash buckets “near” the original query bucket. This is where theory and practice meet. A bad choice of the probing strategy may yield a correct but slow response. We must match a probing strategy and an implementation of the hash table that together provide good performance.

A Persistent and Flexible Key-Value Store

At Smyte we deal with scale. Our clients have tens and sometimes hundreds of millions of users, many of them active daily. We need a storage system that scales to that level and is also persistent. We could shard a traditional SQL database by actor, at significant expense in both operational cost and response time. However, we only need a key-value store. We could use a hosted key-value store such as Amazon DynamoDB, but hosted services carry expense and suffer network latency. We could host our own Redis servers, but as we have discussed previously, Redis is not suitable for our needs. Instead, we chose RocksDB.

Developed at Facebook, RocksDB is a high performance, persistent, embedded key-value store. By sharding our data across a number of RocksDB instances, we can get both persistence and high performance at low operational cost. Importantly, RocksDB supports very high performance sequential queries. By implementing a probing strategy that largely relies on sequential queries, we get high performance.

Consider a single query for a 64-bit bucket. We may treat the 64-bit value as a key in a totally ordered key space using the natural lexicographic ordering, e.g. with 5 bits:

… < 00001 < 00010 < 00011 < 00100 < 00101 < …

If we search buckets 00010 (and 00100) in addition to 00011, we know that the items we find will match the query on at least 4 (respectively 2) bits. In fact, the guaranteed minimum number of matching bits is non-increasing the further we move from the initial probe, because that number is bounded from below by the length of the longest shared prefix. If we consider the key (e.g. 00011) as a path in a binary tree whose leaves are the items, then the length of the longest shared prefix is simply the depth of the nearest common ancestor (NCA). As we move away from the initial leaf, this depth is monotonically non-increasing.
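
The following sketch shows how a probing strategy can exploit that ordering. A plain sorted list of integer keys stands in for RocksDB's ordered key space, and the threshold parameter is an assumption; the point is that the scan in each direction can stop as soon as the shared-prefix guarantee drops below the threshold.

import bisect

KEY_BITS = 64

def shared_prefix_len(a, b, bits=KEY_BITS):
    # Length of the common prefix of two integer keys viewed as fixed-width bit strings.
    diff = a ^ b
    return bits if diff == 0 else bits - diff.bit_length()

def probe(sorted_keys, query_key, min_prefix):
    # Scan outward from the query position; because the shared-prefix length is
    # non-increasing as we move away in sorted order, each direction stops early.
    out = []
    pos = bisect.bisect_left(sorted_keys, query_key)
    i = pos
    while i < len(sorted_keys) and shared_prefix_len(query_key, sorted_keys[i]) >= min_prefix:
        out.append(sorted_keys[i])
        i += 1
    i = pos - 1
    while i >= 0 and shared_prefix_len(query_key, sorted_keys[i]) >= min_prefix:
        out.append(sorted_keys[i])
        i -= 1
    return out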

Because of this prefix property, locality sensitive hashing actually uses multiple hash tables, where the keys of each additional hash table are simply a fixed permutation of the keys of the original hash table. This trades space for recall.
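
A fixed bit permutation per extra table is easy to sketch; the table count of four below is an arbitrary assumption.

import random

KEY_BITS = 64

def make_permutation(seed):
    # One fixed, random permutation of bit positions per additional hash table.
    perm = list(range(KEY_BITS))
    random.Random(seed).shuffle(perm)
    return perm

def permute_key(key, perm):
    # Rearrange the bits of a bucket key according to the fixed permutation, so a
    # different subset of bits forms the prefix that sequential probing exploits.
    out = 0
    for new_pos, old_pos in enumerate(perm):
        bit = (key >> (KEY_BITS - 1 - old_pos)) & 1
        out |= bit << (KEY_BITS - 1 - new_pos)
    return out

permutations = [make_permutation(seed) for seed in range(4)]   # four extra tables (assumption)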

Our Extensions

The standard locality sensitive hashing forest is built statically. We build ours incrementally, in a streaming fashion. As new actions associated with an actor/item arrive, we update the 64-float vector that represents the actor/item and we update the hash tables accordingly, using the RocksDB compaction mechanism to perform lazy updates.
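
A simplified view of that streaming update, building on the earlier sketches (project, quantize, DIM) and using plain dictionaries in place of RocksDB, might look like this; the real lazy cleanup happens during RocksDB compaction, so here we merely record a tombstone.

tombstones = set()   # (stale bucket key, actor id) pairs awaiting lazy removal

def ingest_action(actor_id, action, vectors, table):
    # Fold one new action into the actor's vector and re-index it under the new
    # bucket key. The stale entry is dropped lazily rather than immediately.
    vec = vectors.get(actor_id, [0.0] * DIM)
    old_key = quantize(vec) if actor_id in vectors else None
    for i, v in project(action).items():
        vec[i] += v
    vectors[actor_id] = vec
    new_key = quantize(vec)
    table.setdefault(new_key, set()).add(actor_id)
    if old_key is not None and old_key != new_key:
        tombstones.add((old_key, actor_id))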

Expiration

Over time, actors go away. To address this, we tag each item and each hash table entry with an expiration date that we (optionally) update on each addition of a new action to that item. We use the RocksDB compaction feature to lazily purge expired entries.
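
In sketch form, with an assumed TTL, the compaction-time predicate and the refresh-on-write step are both trivial:

import time

TTL_SECONDS = 30 * 24 * 3600   # assumption: expire entries untouched for 30 days

def should_drop(expires_at, now=None):
    # The check a compaction-style pass applies to each entry.
    now = time.time() if now is None else now
    return expires_at <= now

def touch(entry, now=None):
    # Refresh an entry's expiration when a new action arrives for that item.
    now = time.time() if now is None else now
    entry["expires_at"] = now + TTL_SECONDS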

Age Attenuation

We also (optionally) assign greater importance to new actions. We do this by weighting the actions according to their age. To achieve this, we apply exponential decay to the floating point values when they are accessed.
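
With an assumed half-life, decay-on-read looks like this:

import math
import time

HALF_LIFE_SECONDS = 7 * 24 * 3600   # assumption: one-week half-life

def attenuated(value, last_updated, now=None):
    # A value written `age` seconds ago counts for value * 0.5 ** (age / half_life).
    now = time.time() if now is None else now
    age = max(0.0, now - last_updated)
    return value * math.pow(0.5, age / HALF_LIFE_SECONDS)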

Importance Weighting

Some actions are more indicative of similarity than others, such as sharing a credit card number. Consequently, we weigh such actions more heavily with a simple three level scheme of “high”, “medium”, and “low”.
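
The scheme amounts to a lookup table feeding the projection weight; the numbers and mappings below are illustrative, not our production values.

IMPORTANCE = {"high": 3.0, "medium": 1.5, "low": 0.5}

ACTION_IMPORTANCE = {
    "USED_CREDIT_CARD": "high",
    "USED_IMAGE_AS_AVATAR": "medium",
    "LOGGED_IN_FROM_IP": "low",
}

def importance_weight(action_type):
    # Feed this into project(action, weight=...) from the earlier sketch.
    return IMPORTANCE[ACTION_IMPORTANCE.get(action_type, "medium")]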

Frequency Weighting

Some actions are less indicative of similarity, such as accessing a common IP address. To support this, we maintain (approximate) counters of all actions. Less frequent actions (across all actors for a given customer) are weighed more heavily than more frequent actions.
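
In sketch form, with a plain Counter standing in for the approximate counters and an assumed 1/log weighting:

import math
from collections import Counter

action_counts = Counter()   # stand-in for the approximate per-customer counters

def frequency_weight(action):
    # Common actions (e.g. a popular IP address) get a smaller weight.
    count = max(1, action_counts[action])
    return 1.0 / (1.0 + math.log(count))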

Performance

Our current solution shards items across multiple RocksDB embedded stores, ingesting new actions at a rate of over 91,000 actions per second per shard. Using the prefix search features of RocksDB, we are able to store datasets from all of our customers together with no risk of leakage.

Queries are broadcast to all shards, with each shard returning matches from the items that it stores. The client then merges these partial results using a simple priority queue to select the best matches. Read performance is slightly lower than write performance due to the computation of the cosine similarity scores.
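
The merge itself is small; here is a sketch using Python's heapq, assuming each shard returns a list of (score, item) pairs:

import heapq
from itertools import chain

def merge_shard_results(per_shard_results, top_k=10):
    # Each shard returns its own (score, item) matches; a bounded priority
    # queue keeps only the overall best top_k.
    return heapq.nlargest(top_k, chain.from_iterable(per_shard_results))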

Conclusion

One way Smyte stops bad actors is by tracking their behavior and observing similarities. By combining the right algorithms with the right implementations, Smyte achieves high performance at low cost. Check out our website to learn more.
