Are Streaming Engines lying to you too?
Do you have a use case where you need to process streams of events in real-time and perform temporal analytics where recent events are more relevant than earlier ones?
If so, perhaps you have heard of streaming engines like Flink, Apache Spark, or Kafka Streams. These systems allow you to compute what are known as continuous queries in the form of sliding window aggregations.
Generically, these queries can be defined as a function (aggregator) applied to a collection of data points which change over time as events from a stream enter and leave a window:
These aggregations are rather popular today and their usage is very widespread. Examples include calculating the most popular hashtag in Twitter for trending purposes, IoT analytics or fraud detection.
Have you ever tried to use one of these engines only to find they don’t quite address your problem?
They come in different flavours. Some are in-memory engines and promise low latency at the cost of higher memory consumption but don’t scale. Others promise scalability at the cost of restricting the type of aggregations you want to do.
Choosing a streaming engine is hard! You know it, we know it. It is a fact.
In the end, we just want our metrics to be calculated accurately, quickly, and in a scalable system for our use case. Turns out that choosing a streaming engine is hard! You know it, we know it. It is a fact.
Why talk about this?
At Feedzai we develop real-time fraud detection systems using AI. To be as effective as possible we need to have up-to-date historical data informing our models and rules at every moment. We use sliding window aggregations for this purpose. As you can imagine we also have a streaming engine component as part of our solution which has some challenging requirements:
- Low latencies even at high percentiles (<250ms @ 99.9%);
- Accurate metrics event by event;
And more recently:
- Distributed, scalable and fault-tolerant.
Or what we call the LAD requirements for mission-critical streaming engines.
Recently in the Systems Research team at Feedzai, we were tasked with overhauling our streaming component which is over 10 years old and not distributed (only LA in the LAD definition).
This is one of its biggest limitations so our first instinct was to replace it with a modern distributed streaming engine that met our criteria. Surprisingly, existing engines are not silver bullets and as a result, they can’t meet our criteria despite promising to do so.
In this blog post, we will dive a little deeper into why.
Not every window is born the same way
Depending on your use case you may want to use different types of window slides. There are 3 main types.
- Sliding Windows: As shown in the animation above, these windows slide per event/time unit and are the most frequently thought of and intuitive ones from a business perspective.
- Stepping Windows: Also known as hopping windows, these windows slide every x events/time units instead of every event. These hops mark when a window will be evaluated. I.e: a 10-second window with a 1-second hop means that you will get an updated aggregate value every second, instead of at every event. These are the most commonly found windows in popular streaming engines.
- Tumbling Windows: These windows are stepping windows taken to the extreme where the hop size is equal to the size of the window, where an event contributes only to one window. These windows are usually useful when you want to reset periodic counters, i.e: Total likes in a blog post in a day, reset every midnight.
Popular streaming engines today claim they have sliding window support but in truth what they really have are stepping windows. Identifying these limitations firsthand when considering a streaming engine can be difficult. A tell-tale sign is checking for wording like hops, steps, or micro-batching when describing sliding windows.
Case in point:
Spark Streaming: https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
“A tell-tale sign to understand if a streaming engine supports real sliding windows is to check for words like hop and step in the documentation on windows.”
Why should you care?
While for most streaming use cases hopping windows are enough, there are use cases which need true sliding windows.
Fraud detection is one such case. These systems make decisions over financial transactions, e.g., by blocking a transaction or raising an alarm when money laundering is suspected.
Such systems use streaming aggregations as inputs for models and rules to make decisions. For instance, the following SQL-like statements can be used to profile cards or merchants, and detect suspicious behaviour:
Q1:SELECT SUM(amount),COUNT(*) FROM payments GROUP BY card [RANGE 5 MINUTES]Q2:SELECT AVG(amount) FROM payments GROUP BY card [RANGE 30 DAYS]
If we were to use hopping windows with a 1-minute hop for the first query we would have something like this (the circles represent the time of events coming into the window):
Given this situation suppose you are trying to prevent fraud with the following rule:
If the number of transactions of a card in 5 minutes is higher than 4, then block the transaction.
As you can see, the rule should trigger on the fifth event as it arrives within 5 minutes of the first event, but there is no hopping window including all 5 events in its boundaries using a 1-minute hop.
You could argue that a 1-minute hop is simply too large and that we should reduce the step size. But what step to use? There is nothing stopping an event from occurring between steps, whatever their size.
Then 1-millisecond steps perhaps? To behave almost like real sliding windows.
Well, what most streaming engines don’t tell you is that by reducing the step size (doesn’t even have to be to 1-millisecond!) you are degrading the latency and overall performance of the streaming engine considerably, to the point where it could become useless to your use case.
In our case, this is a reality. We have strong SLOs (Service-level objectives) where we need to guarantee latencies lower than 200ms at the 99% percentile.
The performance problem with small-hop windows
Besides the functional limitations of not updating on a per-event basis, hopping windows also have non-functional drawbacks with respect to latency, CPU usage and state scalability.
Recall Q1SUM(amount), COUNT(*) GROUP BY payment [RANGE 5 minutes]
implemented with a 1-minute hop as described in the previous section. Any event for this window affects five window states, and in this case, two variables (the total summed amount and counter) per-window state. Every minute and, for every card active in the last 5 minutes, new variables are created and the oldest expired.
This property makes hopping windows interesting when the ratio windowSize/hopSize
is low, as it is independent of throughput. When the ratio is higher, hopping windows start having problems as the number of concurrent states that have to be updated increases. Suppose for example that instead of updating every minute, we updated every millisecond to ensure we catch-all events. In that case, we would have 300,000 states to update every millisecond, this leads to very very high latencies. In the following blog post you will see experimental results that prove this point.
So why are hopping windows used? The answer is that hopping windows can avoid storing events since the number of physical window states is fixed and exactly windowSize/hopSize
. Arriving events update those window states, but can be discarded once its contribution has been applied. Hence, besides saving storage, these solutions also avoid processing event expiration. The same can not be said for true sliding windows, since event-rate is not deterministic in data streams; there is no upper bound to how many states there are at any given time. This forces true sliding window systems to store events and calculate which ones to expire based on the incoming event timestamp.
So what’s the solution?
As usual, the answer is, “It depends.” We are not gonna lie to you and state that we found the one-size-fits-all silver bullet.
If your use case does not demand accurate, per-event aggregations then hopping windows may be enough for you and allow you to save some bucks on storage cost. This is typically the case for non-mission-critical analytical streaming jobs.
If on the other hand, you NEED 100% accurate sliding aggregations, then I invite you to keep an eye out for the next blog post in this series. There we will go over Feedzai’s solution to this problem, Railgun, and how it compares with existing open-source alternatives like Flink which use hopping windows.
If you liked this post and want to know more about streaming engines and our work check out our paper! If you want to work on this sort of thing feel free to take a look at Feedzai’s career page!
This work was done in collaboration with João Oliveirinha and Sofia Gomes