Hello Streaming, fancy meeting you here

Gabriel Candal
Published in Feedzai Techblog
5 min read · Aug 1, 2018

At Feedzai we work daily on products that have to cope with large amounts of data across different use cases, and those use cases have contradictory requirements.

In the beginning there is the data science loop: the first step in tackling fraud is understanding it.

In this phase someone, either from our teams or the client’s, looks at whatever historical data exists for the problem at hand, cleans it, enhances it, trains machine learning models, maybe writes some rules on top of it, and keeps iterating on this whole process until they are reasonably confident those same techniques would work in a production environment.

If you are processing data in a production environment with demanding performance requirements on how much time you have to process a new event (such as deciding whether to approve a given payment or block it as likely fraud, which has to be done within a few milliseconds), the most well-known data processing solutions, such as Hadoop’s MapReduce, Apache Spark or data warehousing solutions, are not ideal. They optimize for how many events the system can process per second, which is great if you are doing data science exploration, but not so much if your customer is staring at a loading screen while waiting for your system to make a decision.

This subtle difference has deep repercussions on which data processing engine you should use. Would you want to optimize for the former or the latter? You have to be conscious that there is even a choice to make; otherwise the cost of change might be too great.

This is the typical software engineering struggle between throughput and latency.

This tradeoff appears in a number of scenarios but can be summarised as follows: if you group events together in batches and process them all at once, the job as a whole finishes faster, but some individual answers take longer. Consider batches of 10 elements: the first element has to wait for the 10th to arrive before being processed (hurting latency), but this yields a performance gain because you don’t pay the processing overhead for every single event, only once every 10 events (boosting throughput).
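To see those numbers play out, here is a minimal sketch in plain Python, assuming a made-up fixed overhead per processing call and a made-up per-event cost:

```python
import time

OVERHEAD = 0.010   # hypothetical fixed cost paid once per processing call (10 ms)
PER_EVENT = 0.001  # hypothetical cost of the actual work, per event (1 ms)

def process(events):
    """Stand-in for any processing call: one fixed overhead plus per-event work."""
    time.sleep(OVERHEAD + PER_EVENT * len(events))

events = list(range(100))

# Streaming extreme: one call per event. Best latency, worst throughput:
# every event pays the full overhead, but no event waits for its neighbours.
start = time.time()
for e in events:
    process([e])
print(f"one at a time: {time.time() - start:.2f}s")

# Batching: one call per 10 events. The overhead is paid once per batch
# (better throughput), but the 1st event in a batch waits for the 10th (worse latency).
start = time.time()
for i in range(0, len(events), 10):
    process(events[i:i + 10])
print(f"batches of 10:  {time.time() - start:.2f}s")
```

With these made-up costs, processing one event at a time costs 100 × 11 ms ≈ 1.1 s in total, while batches of 10 cost 10 × 20 ms ≈ 0.2 s, at the price of making early events wait for their batch to fill.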

This is where streaming comes in.

A streaming engine is one that gives you the ability to process events on an individual basis in order to achieve the best latency possible; this is the technique used in production by Feedzai’s processing engine.

People tend to add other concepts into the mix, but the only thing that separates batch from streaming processing is whether you’re optimizing for throughput or latency. It’s also important to bear in mind that this is a spectrum: you can balance your trade-offs between one and the other without fully committing to either extreme (it’s not 1 or 0, you can pick something in between). Modern data processing engines such as Cloud Dataflow or Apache Flink let you walk this spectrum on a per-job basis and are not tightly coupled to either end (Spark is also trying to break from the old paradigm with its structured streaming initiative).
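To make walking that spectrum concrete, here is a hand-rolled sketch (not any particular engine’s API, though Kafka producers expose the same idea through their batch.size and linger.ms settings) of a buffer that flushes either when it fills up or when the oldest event has been waiting too long:

```python
import time

class Buffer:
    """Flushes when full (throughput-friendly) or when the oldest event has
    waited max_wait seconds (latency-friendly). Tuning the two knobs places
    you anywhere between the streaming and batching extremes."""

    def __init__(self, process, max_size=10, max_wait=0.05):
        self.process = process
        self.max_size = max_size
        self.max_wait = max_wait
        self.events = []
        self.oldest = None  # arrival time of the oldest buffered event

    def add(self, event):
        if not self.events:
            self.oldest = time.time()
        self.events.append(event)
        if len(self.events) >= self.max_size or time.time() - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        if self.events:
            self.process(self.events)
            self.events = []

buf = Buffer(process=lambda batch: print(f"flushed {len(batch)} events"))
for e in range(25):
    buf.add(e)
buf.flush()  # drain whatever is left
```

Setting max_size to 1 gives you the streaming extreme; a large max_size with a generous max_wait drifts toward pure batching. (A real system would flush on a timer rather than only checking the deadline when a new event arrives; this is just a sketch of the trade-off.)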

The other concepts I mentioned usually resonate better with either batching or streaming, but let’s look at them one by one to understand why they don’t have to be exclusive.

It’s not obvious what the difference is, in terms of data processing, between data science exploration and detecting fraud in a production environment. In both cases the system has to consume raw data, apply some transformations and spit out some kind of result. What tears them apart is the nature of the data being consumed. Whereas in the data science loop we presumably have all the data we need from day 1 (bounded data), in a production environment the data comes one event at a time (unbounded data).

For the first case a batching engine seems more suitable: you already have all the data, so you can make one huge batch so that the job takes as little time as possible. This doesn’t mean a streaming engine couldn’t process the very same data; it actually seems trivial: it would just process a record at a time. For unbounded data you most likely want to answer as soon as possible, which is why streaming engines are usually used for it, but batching engines can also do the job. For instance, Spark’s structured streaming uses micro-batching: it groups a small number of records together and runs a job on them.
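As a rough illustration (plain Python, with a hypothetical payment_stream generator standing in for the unbounded source), micro-batching boils down to slicing the stream into small fixed-size groups and running a job on each:

```python
import itertools
import random
import time

def payment_stream():
    """Hypothetical unbounded source: yields one payment event at a time, forever."""
    while True:
        yield {"amount": random.uniform(1.0, 500.0)}
        time.sleep(0.001)  # events trickle in

def micro_batches(stream, size=32):
    """Slice a stream into small fixed-size batches and hand each one to a job."""
    while True:
        batch = list(itertools.islice(stream, size))
        if not batch:
            break  # only reached if the source turns out to be bounded
        yield batch

# Run a tiny "job" on each micro-batch; islice caps the demo at 3 batches.
for batch in itertools.islice(micro_batches(payment_stream()), 3):
    total = sum(event["amount"] for event in batch)
    print(f"processed {len(batch)} events, total amount {total:.2f}")
```

The same loop handles bounded data unchanged: a finite source simply runs out of batches and the job terminates, which is why the bounded/unbounded distinction doesn’t force an engine choice by itself.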

Another dimension to consider is correctness. Batch systems, because they are usually used with bounded data and in scenarios with not-so-strict latency requirements, tend to offer fully correct results: a count returns the exact count, events are processed exactly once, and so on. Streaming systems, on the other hand, often sacrifice correctness for the sake of performance: they can use HyperLogLog (or similar) to count distinct values, or use approximate windows to avoid having to store every event that would belong to a window. But again, this concept is orthogonal to the kind of data processing engine you have: you can use streaming with full correctness and batching with approximate results.
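To give a flavour of what that trade looks like, here is a toy HyperLogLog-style distinct counter; it omits the bias corrections a real implementation needs, but it shows the core idea of trading exactness for a small, constant amount of memory:

```python
import hashlib
import random

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct counts with 2**p registers instead
    of storing every value seen. Omits the small/large-range corrections a
    production implementation would apply."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p            # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the value
        x = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = x & (self.m - 1)     # low p bits choose a register
        rest = x >> self.p         # remaining 64 - p bits
        rank = 64 - self.p - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

hll = HyperLogLog()
exact = set()  # the fully correct answer needs memory proportional to the data
for _ in range(100_000):
    v = random.randrange(50_000)
    hll.add(v)
    exact.add(v)
print(f"exact: {len(exact)}, estimated: {hll.count():.0f}")
```

The estimate typically lands within a couple of percent of the exact count while keeping just 4096 registers in memory, regardless of how many events flow through.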

Summarizing: the type of data processing engine you pick can limit the performance of your system at a conceptual level, so be sure to either pick a flexible engine or be aware of the compromises you might be making.
