Feedzai Techblog

Welcome to Feedzai Techblog, a compilation of tales on how we fight villainous villains through data science, AI and engineering.

Light Speedy Profiling to Fight Fraud


15 min read · Dec 22, 2020


Image created from this graphics repository

To catch fraud in the messy world of real-time online transactions, the name of the game is effectiveness, speed, reliability and lightness. At Feedzai, we strive to push the boundaries of our streaming engine solution to allow for even lighter ways to process events, while encoding all necessary information in the profile features that fuel our fraud prevention ML models. In this post, we dive into the world of profiles and discuss state-of-the-art strategies used to compute them.

If you’re wondering “but what are profile features?” or “how do they help a ML model?” or even “how lightly can they be computed?” then you’re in the right place! Together we will understand:

  1. How profiles can help capture the dynamics of fraud,
  2. How they can be estimated in many different ways (some lighter than others),
  3. The wide range of engineering challenges that they pose.

In a previous post of the Feedzai TechBlog, we discussed Railgun, a new streaming engine which aims to provide accurate profiles by storing events efficiently. In this post, we focus on a complementary approach that removes the storage cost by using alternative profile estimates.

Profiles as features that capture dynamics

The concept of profiling is quite common in our digital era. Most of us will have accounts in several social networks. Take Twitter, for example. If you are an average Jane/Joe you probably tweet once or twice a day. If, instead, you are a famous artist or a blogger, maybe you tweet 10 times a day, to keep your fans’ interest alive. This example of tweets per day provides a very simple illustration of a profile.

Profile features can characterize types of users or, more broadly, entities/groups/categories. In domains dealing with real-time data, profiles are computed through some aggregation, over a time period, for a given entity — the group-by field. In the example above, the aggregation is the count of the number of tweets, in a period of one day, grouped by user. A very useful property of profile features is that they are able to encode dynamical information. Before going into that, let us first discuss how to define profiles in more detail.
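To make this concrete, here is a toy count profile, aggregated over a fixed one-day period and grouped by user; the tweet data and function name are invented purely for illustration:

from collections import Counter
from datetime import datetime, timedelta

# Toy stream of (user, timestamp) tweet events -- invented for illustration.
tweets = [
    ("jane", datetime(2020, 12, 1, 9, 15)),
    ("famous_artist", datetime(2020, 12, 1, 9, 20)),
    ("famous_artist", datetime(2020, 12, 1, 11, 2)),
    ("jane", datetime(2020, 12, 1, 18, 40)),
    ("famous_artist", datetime(2020, 12, 1, 21, 5)),
]

def tweets_per_day(events, day_start):
    """Count profile: number of tweets per user in [day_start, day_start + 1 day)."""
    day_end = day_start + timedelta(days=1)
    return Counter(user for user, ts in events if day_start <= ts < day_end)

print(tweets_per_day(tweets, datetime(2020, 12, 1)))
# Counter({'famous_artist': 3, 'jane': 2})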

Sliding window profiles

Representation of a sliding window used to compute a profile.

The figure above represents a stream of events (dark green circles) distributed in time. Consider, for example, the fraud prevention domain, where the events could be transactions. Whenever a new event enters the system, the sliding window (e.g., containing the last hour of events) moves forward in time to contain the new event. In the figure, when an event exits the window, the corresponding circle drops down to represent that it no longer contributes to the aggregation. At any time, the window may contain events for several cards as illustrated in the following figure:

Example snapshot of a window state at a particular time.

We represent events for three different cards with different colors. For this particular snapshot we have cards with two, three and one transaction. Thus, in addition to a window with the events, we also keep track of a state containing the value of the aggregation for each of the cards (here a count). Therefore, when the newest event, on the right, arrived:

  1. The window slid in time to the right to add it,
  2. The faint red event (leftmost) was expired from the events in the window, together with the corresponding card, D, from the state (whose count dropped to zero),
  3. The count for card A increased by one.

In the same way as card D was expired, if a new card appears in the next event, then that new card will be appended to the state.

From this example, we can see that, for sliding window profiles, we have the following requirements (see the sketch after this list):

  • Memory requirement 1: Save all the events in the window.
  • Memory requirement 2: Save a dictionary of all cards in the window with their respective aggregation values.
  • Processing requirement 1: Events and cards have to be inserted when arriving at the system and the aggregations have to be updated.
  • Processing requirement 2: Events and cards have to be expired when they leave the window and the aggregations have to be updated.
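In code, these four requirements could look roughly like the following sketch of a count profile grouped by card; the class, its names, and the numeric timestamps (in seconds) are our own illustration, not Feedzai's engine:

from collections import deque

class SlidingWindowCount:
    """Sliding window count profile per card (simplified sketch; timestamps in seconds)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # memory requirement 1: every event currently in the window
        self.counts = {}       # memory requirement 2: card -> aggregation value (a count)

    def on_event(self, card, timestamp):
        # Processing requirement 1: insert the new event and update its card's count.
        self.events.append((card, timestamp))
        self.counts[card] = self.counts.get(card, 0) + 1
        # Processing requirement 2: expire events that slid out of the window.
        while self.events and self.events[0][1] <= timestamp - self.window_seconds:
            old_card, _ = self.events.popleft()
            self.counts[old_card] -= 1
            if self.counts[old_card] == 0:
                del self.counts[old_card]  # e.g., card D leaving the state
        return self.counts[card]  # current profile value for the incoming card

Every event has to be kept until it leaves the window, which is exactly the storage cost that the EMA approach discussed later removes.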

In real applications, a ML model will typically make use of a few tens or hundreds of profile features, covering periods that range from a few seconds or minutes to days, weeks or even months. This helps the model to learn the patterns of behavior at different time scales (more on that in the next section). In most fraud use-cases, millions of transactions for thousands of cards are processed every day, thus requiring considerable amounts of resources such as CPU and storage.

However, transactions need to be processed in a small fraction of a second, so that the user experience is not degraded while keeping hardware costs to a minimum. Therefore, finding more efficient ways of computing profiles that may save storage and/or processing time (latency) is a challenge of great importance. In the next section, after a small detour into the dynamical interpretation of profiles, we will discuss a strategy that achieves both goals.

Alternative profiles

A detour into Poisson processes

To better understand what profiles are really measuring, and to understand how they can capture the dynamics of fraud (or of any other domain), let us discuss a statistical perspective of the data stream. A very natural interpretation is to see the stream as being generated by some stochastic process where, at any given time, there’s a certain probability (per unit time) to receive a new transaction — this is usually known as the Poisson rate, r.

If we imagine time as being discretized in small time steps δt, and if, for simplicity, we take r as being a constant, then: i) the probability to receive a new event at a given time is rδt, and ii) the probability of not receiving a new event is 1-rδt. You can think of it as a coin toss, at each time step, where the probability of heads (observing an event) is not necessarily the same as tails (not observing an event). The following animation illustrates this perspective:

Schematic representation of a Poisson process in discrete time steps. At each time step δt a binary decision is made to generate (or not) an event, based on a probability rδt.

In reality, time is not discrete and, strictly, we need to take the limit δt→0 to describe the process in a mathematically rigorous way. But in practice, we can indeed simulate a Poisson process, numerically, using small time steps. Poisson processes are used to describe a wide range of phenomena, including radioactive decay counts, earthquake occurrences, trees appearing and disappearing in a forest, or really almost anything that relates to streams of events, such as cars flowing through a street or customers arriving at a store.
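As a quick illustration, here is a minimal simulation of a constant-rate Poisson process using small discrete time steps; the rate, step size, and seed are arbitrary choices for the example:

import random

def simulate_poisson(rate, total_time, dt=0.001, seed=42):
    """Generate event times for a Poisson process with constant rate (events per unit time)."""
    random.seed(seed)
    times, t = [], 0.0
    while t < total_time:
        if random.random() < rate * dt:  # a "coin toss" with probability r*dt at each step
            times.append(t)
        t += dt
    return times

events = simulate_poisson(rate=2.0, total_time=1000.0)
print(len(events) / 1000.0)  # empirical rate: close to the true rate of 2.0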

A count profile, as described earlier in the example, provides a way of estimating the Poisson rate. If r is constant, all it really takes is to count the number of events in the window and divide by the window size. But… wait! This means that a count profile over the last hour or over the last day will, in expectation, give us exactly the same estimate of the Poisson rate! One may then ask:

What’s the point of having various count profile windows?

The answer is that, in reality, for many of the examples above, the rate r is really not a constant, but rather a function of time r(t). In the fraud domain, this happens for many reasons such as:

  • fraud patterns change frequently (because fraudsters adapt their strategies to try to circumvent our systems),
  • the markets themselves change,
  • seasonal changes (e.g., the volume of transactions will change according to the time of the day, or the specific day in the week/year).

Then, count profiles with different window sizes (e.g., 10 seconds, 1 hour, 1 day, 1 week) in fact provide information on short and long term properties of the non-stationary process that generates the transactions, giving us a more complete view of r(t). Furthermore, by grouping by some field, like card, the profiles of each card help to characterize the individual (non-stationary) Poisson rates for that card. This is essential for the ML model to learn the patterns of behavior on different time scales.

The event rate information captured by r(t) is, however, only a small piece of the information that the ML model uses (otherwise we could catch fraud just by counting events!). Thus, ML models also need profiles involving aggregations that estimate changes in the distribution of feature values. For example, a fraud event may be characterized by an unusually high rate of events, but also by transaction amounts that are suspiciously large compared to normal customers (e.g., measured by the average amount), or by the fact that the range of values transacted is unusually similar, or diverse (e.g., measured by the standard deviation of the amount).

Thus, to completely characterize the process, besides r(t), one can define a time dependent probability density function p(X,t), of observing a transaction with features X if an event is generated at time t. Then profiles such as the average or standard deviation of feature values, will provide information on the underlying p(X,t). If we compute a set of profiles that is diverse enough (covering aggregations over many features and several window sizes) we have a good chance of characterizing well both r(t) and p(X,t), and, as a consequence, the underlying stochastic process. In sum:

Profiles provide an excellent way of encoding information on the dynamics of the data stream — both its frequency and its evolving distribution of features.

And that’s a great way of making a ML model happy 😄.

The anatomy of EMAs

We have seen above that sliding window profiles help us to estimate the dynamical properties of our data stream. We have also seen that sliding window profiles require storing all the events in the windows, and processing events to update aggregations when a new event arrives or whenever one is expired from any window. Given that the natural moment to process events is when they arrive at the system, a question arises:

Is there a way to estimate profiles by processing events once without storing them?

It turns out that the answer is: Yes! We now describe a strategy based on the well known method of Exponential Moving Averages (EMAs).

The basic idea of EMAs is very simple to visualize when applied to profiles. Consider the following figure of the events we used earlier to illustrate sliding window profiles:

Schematic representation of the EMA weighting strategy of the events.

Now, instead of including only the events in the window, say for the last hour, we include all past events, but we weigh them down by a factor that decays exponentially fast into the past. The EMA tail will have a lifetime τ, that controls how fast older events are forgotten so that they contribute less to the estimate. The following figure illustrates more explicitly what happens when a new event arrives at the system compared to a sliding window:

EMA weights compared to sliding window weights.

The sliding window (rectangle) and EMA (exponential tail) at time t are represented in fainter colors. After the newest event arrives, at time t0, both the sliding window and the EMA move to the right. The sliding window only contains the events inside it, whereas the EMA contains all past events, but their contributions are suppressed by an exponential factor. The nice (recursive) property of EMAs is that the new value of the aggregation, after the new event arrives, is the old aggregation multiplied by a common suppression factor (the effect is represented by the red arrows and the formula with the exponential) plus a contribution from the new event. For example,

Count(t0) = 1 + Suppression(t0 - t) * Count(t)

where the 1 is the contribution from the new event.

This procedure can be easily generalized for many other EMA aggregations. The bottom line is that we have eliminated memory requirement 1 (because the incoming event is only used once) and processing requirement 2 (because the suppression factor implicitly takes care of forgetting events, so no expiration is needed).
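As a minimal sketch of this recursive scheme, the following class keeps an EMA count and an EMA sum for a single group-by key (e.g., one card), from which an EMA average follows. The natural-exponential decay with a fixed lifetime and all the names are our own illustration, not Feedzai's implementation:

import math

class EmaAggregations:
    """Recursive EMA count, sum, and average for a single group-by key (sketch)."""

    def __init__(self, lifetime_seconds):
        self.lifetime = lifetime_seconds
        self.count = 0.0
        self.total = 0.0
        self.last_ts = None

    def on_event(self, timestamp, amount):
        if self.last_ts is not None:
            # Common suppression factor applied to all previous contributions.
            suppression = math.exp(-(timestamp - self.last_ts) / self.lifetime)
            self.count *= suppression
            self.total *= suppression
        self.count += 1.0     # contribution of the new event to the count
        self.total += amount  # contribution of the new event to the sum
        self.last_ts = timestamp

    def average(self):
        return self.total / self.count if self.count > 0 else 0.0

Note that nothing about past events needs to be stored beyond these few numbers per key, which is where the memory and latency savings come from.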

This recursive nature of EMAs makes them very efficient!

The trade-off is that an EMA count profile is no longer a conventional integer count, but a sum of exponential weights. In spirit, this is not so different from a sliding window count profile where the weights are just 1 (inside the window) or 0 (outside the window). The beauty of the EMA approach is that, as long as you still have several EMAs with different lifetimes, you are equally able to estimate the properties of your underlying process without hurting your ML model (to be further discussed).

Delayed EMAs

Another type of EMAs that we studied are delayed EMAs. These are motivated by the corresponding delayed sliding window profiles, which can be useful to compare different periods of activity. For example, say you want to compare the activity of a card on the last day with the activity on the day before. The ratio of these two count profiles may give you a good hint of recent unusual activity. For sliding windows, the second profile is computed with a delayed window shifted to the previous day.

It turns out that we can also exclude the effect of the most recent events, to produce a delay, with EMAs. This is done by subtracting an EMA profile with a shorter lifetime from an EMA profile with a long lifetime. The following figure illustrates this:

Delayed EMA weights (dark shaded area). Note that the dark shaded area is obtained by subtracting the short tail EMA (faint gray) from the long tail EMA (faint green).

We can see that the final weight is zero at the most recent event, then rises to its maximum to the left, and finally it tails off to zero for older events. We can tune the shorter lifetime to suppress the most recent events, as appropriate, and give higher weight to events in the delayed period. The short and long lifetimes can be thought of, respectively, as the counterparts of the delay and the window size of a delayed sliding window.
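Continuing with the same assumptions as the sketch above, a delayed EMA count can be obtained by maintaining two EMA counts with different lifetimes and taking their difference; the lifetimes below are illustrative:

import math

class DelayedEmaCount:
    """Delayed EMA count: a long-lifetime EMA minus a short-lifetime EMA (sketch)."""

    def __init__(self, long_lifetime, short_lifetime):
        self.lifetimes = (long_lifetime, short_lifetime)
        self.counts = [0.0, 0.0]  # [long-tail EMA count, short-tail EMA count]
        self.last_ts = None

    def on_event(self, timestamp):
        for i, lifetime in enumerate(self.lifetimes):
            if self.last_ts is not None:
                self.counts[i] *= math.exp(-(timestamp - self.last_ts) / lifetime)
            self.counts[i] += 1.0
        self.last_ts = timestamp

    def value(self):
        # The short tail cancels the newest events, leaving the weight on the delayed period.
        return self.counts[0] - self.counts[1]

# Roughly analogous to comparing "the day before" with "today":
# long tail ~ 2 days, short tail ~ 1 day (illustrative values, in seconds).
profile = DelayedEmaCount(long_lifetime=2 * 86_400, short_lifetime=86_400)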

EMA “all the things”

From https://imgflip.com/memegenerator/X-All-The-Y

That’s it! You now understand all the main ideas used to implement EMA based profiles. This can be applied to all sorts of aggregations that are useful for ML models. We have done so in our experiments (discussed below), where we used delayed and non-delayed versions of EMA counts, sums, averages, standard deviations, etc.

Performance, performance, performance!

The final hurdle in this project was to apply these ideas to real use cases to find out if we could preserve the Data Science (DS) performance of our ML models, while reducing Engineering costs (storage) and maintaining or improving on latency.

In our experiments, we used three types of datasets. Namely, transactions from a banking dataset, an online merchant, and a payments processing platform that aggregates many different merchants. The idea was to cover many different domains in the DS experiments to give us extra confidence. In the Engineering experiments, we focused on the banking dataset (with around one million transactions per week), which was the heaviest. It contained around 200 different profile features with many aggregations, as already discussed.

Data Science performance

From a ML model perspective, all that is required is that the target Data Science performance metrics are not degraded by replacing sliding window profiles with EMA profiles. In our experiments we found that, for all datasets, the DS performance was indeed very similar. In the following table we show an example comparing sliding windows vs. EMAs. We include the increment in the ROC area under the curve (AUC) to illustrate the results, for various models (one per merchant type) trained with either of the two profile options:

AUC of ML models for several merchant types from a payments processing platform.

We can see that this metric is very similar or better (in green) for EMAs. We have also verified that the conclusions hold for more relevant business metrics (e.g., recall at a given low false positive rate required by the business).

Engineering performance

While EMAs sound amazing in theory, implementing them in Feedzai’s system was not completely trivial. This is because cards tracked by EMA profiles will technically never expire. However, in practice, when a profile for a card has been suppressed for long enough (because the card was not seen again), its EMA value will be very close to zero and will no longer encode any useful information, so we may as well discard it and restart the EMA computation when the card is seen again. With a naive implementation, on the other hand, the number of such cards would grow forever, blowing up the memory usage.

This observation allows us to add the notion of a key time-to-live (TTL) and apply expiration policies (here the key is an identifier for the group-by field, e.g., the card). The extra information we need to save in memory is the timestamp when a card will be eligible for expiration if no new events arrive for that card in the meantime. The following dictionary is an example of such TTLs:

{
cardA: Wed, 11 Dec 2019 17:18:03 GMT,
cardB: Thu, 12 Dec 2019 20:15:50 GMT,
cardC: Thu, 12 Dec 2019 22:17:09 GMT,
… ,
}

In our experiments, we found that a good criterion to preserve the DS performance and obtain the engineering gains discussed below is to use a half-life for the EMAs that is 1/3 of the sliding window size, and a key TTL of around 4x to 5x the half-life. Note that the half-life is defined as the time it takes for the EMA suppression factor of an event to reach 1/2 or, equivalently, the lifetime when we use base 2 for the exponential.
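As a concrete illustration of this rule of thumb (the function names and defaults are ours, and the right values always depend on the use case), converting a sliding window size into an EMA configuration could look like this:

def ema_config_from_window(window_seconds, half_life_fraction=1 / 3, ttl_multiplier=4):
    """Heuristic: half-life = 1/3 of the window size, key TTL = 4x to 5x the half-life."""
    half_life = window_seconds * half_life_fraction
    key_ttl = ttl_multiplier * half_life
    return half_life, key_ttl

def suppression(elapsed_seconds, half_life):
    """Base-2 suppression factor: an event's weight halves after each half-life."""
    return 0.5 ** (elapsed_seconds / half_life)

# Example: a 1-day (86 400 s) sliding window becomes an EMA with an 8 h half-life,
# and a card's entry becomes eligible for expiration after 32 h without new events.
half_life, key_ttl = ema_config_from_window(86_400)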

Astute readers may realize that the data structure above, while useful to detect old keys, does nothing to save memory unless we introduce a mechanism to expire them. To tackle this, we note that events enter the system through a single thread in a time-ordered fashion, and so keys are inserted into the data structure in time order as well. Therefore, if there are keys to be expired, they must be the oldest ones in the data structure. This pattern lends itself to a linked list-like structure, so we created a custom dictionary that behaves like a linked list, with:

  • O(1) time-access to the head of the dictionary,
  • O(1) time-access to an entry by its key,
  • O(1) time-access to the next entry given an existing one.

The last property is important to implement the garbage collection logic: if the oldest key can be deleted, we also try to delete the next oldest. This ensures that, as long as the event rate does not vary too much across the dataset, the dictionary tends to hold no keys that are still eligible for expiration. We perform at most 2 lookups per event, to spread out the processing involved and avoid localized latency peaks.
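In Python, for instance, an insertion-ordered dictionary already provides these access patterns, so a minimal sketch of the expiration logic (assuming, as in our system, that events arrive in time order; all names are illustrative) could be:

from collections import OrderedDict

class ExpiringKeyStore:
    """Keeps per-key state plus an expiration timestamp; the oldest keys sit at the head."""

    def __init__(self, key_ttl_seconds):
        self.key_ttl = key_ttl_seconds
        self.entries = OrderedDict()  # key -> (expiration_timestamp, state)

    def on_event(self, key, timestamp, state):
        # Refresh (or create) the key: move it to the tail with a new expiration time.
        self.entries.pop(key, None)
        self.entries[key] = (timestamp + self.key_ttl, state)
        # Garbage collection: try to expire at most 2 of the oldest keys per event,
        # spreading the work to avoid localized latency peaks.
        for _ in range(2):
            oldest_key, (expiration, _) = next(iter(self.entries.items()))
            if expiration <= timestamp:
                del self.entries[oldest_key]
            else:
                break

Because a key is re-inserted at the tail on every event, the head of the dictionary always holds the key that is closest to expiration.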

Similarly to the DS experiments, to validate our approach, we used a real-world banking dataset with a ML model and converted all sliding window profiles to their EMA equivalents. The idea here was to replicate a realistic production model scenario. We used a streaming engine with in-memory storage, though, in theory, similar storage gains should be possible if EMAs are implemented in a more sophisticated streaming engine, such as Railgun. We analyzed the memory consumption and the latencies of both the sliding window profiles and the EMA profiles, in the same environment, for a variety of heap-sizes (our systems are JVM-based), various event loads (transactions / sec) and various EMA configurations.

Regarding memory, we were able to comfortably reduce our heap usage to half. These savings can be even larger if the cardinality of the group-by field is smaller (in a banking dataset we can easily be dealing with millions of cards in a few weeks). For profiles with a small number of distinct group-by values (e.g., hour of the day [0,…,23] or country code) the memory savings are even more massive (thousands of times better or more).

Despite the extra key-expiration garbage collection logic introduced in the critical path of our event processing, we found that EMAs are faster to compute than their standard counterparts: between two and three times faster, in fact, as observed in the following figure:

Operator latency improvement ratio (sliding window latency/EMA latency) at various percentiles in the distribution of latencies for our engineering experiments.

These savings exist because we no longer need to manage events in a window (only keys), so there are no event insertions and, most importantly, no expirations. Even better, our systems become more resilient to event bursts, such as fraud attacks or high-volume periods (e.g., Black Friday or product launches), whose impact was noticeable in the higher latency percentiles (99.99% and beyond) for sliding window profiles.

Conclusion

We have introduced EMA profiles as an alternative to sliding window profiles for ML model feature engineering to fight fraud. Our experiments showed that the DS performance can be left untouched while achieving significant memory and latency savings when processing events in real time. We eliminated the need to store events (a property of EMAs) and introduced a custom data structure to manage group-by keys with a suitable expiration policy. We also found that the EMAs’ latency distribution is remarkably stable, because aggregation adjustments for events leaving a window no longer exist.
