Batch and Streaming in the World of Data Science and Data Engineering
By Keira Zhou, Senior Data Engineer, Capital One
The large-scale adoption of “Big Data” has created a multitude of exciting new job roles and technologies. Data scientists and data engineers have both become key members of many technology teams, a coexistence that motivates an ongoing debate: streaming or batch? To answer that question, it is important to understand what streaming and batch mean for both data scientists and data engineers. If data engineers understand more about modeling, and data scientists understand more about building a pipeline, it becomes easier to make architecture decisions.
One way to think of their differences is an analogy around water delivery: batch processing delivers water to your user by the bucket, whereas streaming delivers it through a pipe, continuously.
Take phishing website prediction as an example. To build a model that predicts whether a URL is good or bad, the first thing you will need to do is extract features from the dataset. These can be factors such as the length of the URL, the age of the domain, or whether there is a prefix or a suffix in the URL.
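As a minimal sketch of that feature-extraction step, the function below turns a raw URL into a few numeric features. The specific feature set here (URL length, a hyphenated host as a prefix/suffix signal, subdomain count, HTTPS flag) is a hypothetical illustration, not an actual production feature set:

```python
from urllib.parse import urlparse

def extract_features(url: str) -> dict:
    """Turn a raw URL into a few numeric features for a classifier."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),                 # very long URLs can be suspicious
        "has_prefix_suffix": int("-" in host),  # e.g. "paypal-login.example.com"
        "num_subdomains": max(host.count(".") - 1, 0),
        "uses_https": int(parsed.scheme == "https"),
    }

features = extract_features("http://paypal-login.example.com/verify?acct=1")
```

The same function can later be reused unchanged for scoring, which matters once we get to streaming: slow feature extraction directly limits how fast an online pipeline can run.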
After a set of features is determined, historical datasets can be split into training and test sets and iterated on with different algorithms or model parameters. Eventually a trained model is produced, which can be saved and later used for prediction.
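The train/test split itself is simple. In practice you would likely reach for a library helper such as scikit-learn's `train_test_split`; this stdlib-only sketch shows the idea:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows, then hold out the last test_ratio fraction as a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # shuffle so the split is unbiased
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# Hypothetical historical dataset of (url, label) pairs.
data = [(f"url_{i}", i % 2) for i in range(100)]
train, test = train_test_split(data)
```

The fixed seed makes the split reproducible across iterations, so different algorithms and parameters are compared on the same held-out data.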
Traditionally, prediction is done on new data points in a batch, or offline, fashion: you collect all the new URLs coming into your pipeline, then use your model to produce prediction results on the whole new dataset at once. Nowadays, however, with advancements in technology, most prediction is done in real time, by deploying a model into the streaming pipeline and performing model scoring on every data point in a data stream.
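The contrast can be sketched in a few lines. The "model" below is a stand-in toy rule (a length threshold), not a real classifier; the point is only where the scoring happens:

```python
def score(url: str) -> int:
    """Toy stand-in for a trained model: 1 = predicted phishing."""
    return int(len(url) > 30)

new_urls = [
    "http://a.example.com",
    "http://very-long-suspicious-domain.example.com/login",
]

# Batch/offline: collect all new URLs first, then score the whole set at once.
batch_predictions = [score(u) for u in new_urls]

# Streaming: score each event as it arrives; the loop stands in for
# reading from a message stream.
stream_predictions = []
for url in new_urls:
    stream_predictions.append(score(url))
```

With a fixed model the results are identical; the difference is latency, since the streaming version produces each prediction as soon as the URL arrives.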
Another consideration is how to update a model. The most traditional way is to update a model “offline,” meaning collecting all the new observations, such as all the new URLs users flag as good or bad, and retraining your model overnight or every week.
Alternatively, “online learning” allows for the model to be updated for every new observation. This method makes the model adapt to new trends faster. Also, when a dataset is too large and training on the entire dataset is impossible, online algorithms can be helpful to train the model incrementally.
There are a few takeaways for online learning. Algorithms used in online learning are usually computationally much faster, and they help your model adapt to new trends in the data faster. A couple of classification algorithms (e.g., logistic regression) can be used for online learning because of their optimization method, stochastic gradient descent. However, the majority of algorithms only work in a batch fashion. Also, if feature extraction for your model is slow, it may affect the performance of your online learning and prediction. It can also be hard to always get things right in a completely automatic way, since you don’t have full control of the data that goes into your model when you do online learning.
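The reason stochastic gradient descent enables online learning is that each gradient step needs only a single observation, so the model can be updated event by event. Below is a minimal sketch of online logistic regression written from scratch; in practice you would likely use something like scikit-learn's `SGDClassifier` with `partial_fit`. The single feature (a normalized URL length) and the labels are made up for illustration:

```python
import math

class OnlineLogistic:
    """Logistic regression trained one observation at a time (SGD)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on a single (features, label) observation."""
        err = self.predict_proba(x) - y   # gradient of the log loss w.r.t. z
        self.b -= self.lr * err
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

model = OnlineLogistic(n_features=1)
# Simulated stream: feature = normalized URL length, label = 1 if flagged bad.
for x, y in [([0.9], 1), ([0.1], 0), ([0.8], 1), ([0.2], 0)] * 50:
    model.update(x, y)
```

Each `update` call is cheap and uses only the newest observation, which is exactly what lets the model track new trends without an overnight retrain.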
Building a Streaming Pipeline
Should your team decide to use streaming to enable real-time decisioning, it is important to understand that there are several challenges when trying to build and maintain a streaming pipeline.
Figure 1 shows a simple streaming pipeline, which usually has a queuing system that stores the messages, a stream processing engine such as Spark Streaming, a database, and a serving layer, which can be another API or a frontend web app.
Queuing systems such as Kafka provide the ability to keep a log of the messages. If your streaming process has some delay, or the Spark cluster goes down, you will not lose any messages, because they are all stored in the queuing system.
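The reason a log-based queue prevents message loss can be shown with a toy model: consumers track an offset into a durable log, so after a crash they resume from the last committed offset instead of dropping messages. This is a simplified stand-in for Kafka's consumer-offset mechanism, not its actual API:

```python
log = ["msg-0", "msg-1", "msg-2", "msg-3", "msg-4"]  # durable, append-only log

def consume(log, start_offset, crash_after=None):
    """Read messages from start_offset; optionally 'crash' partway through."""
    processed, offset = [], start_offset
    for msg in log[start_offset:]:
        if crash_after is not None and len(processed) == crash_after:
            return processed, offset    # crash: the committed offset survives
        processed.append(msg)
        offset += 1
    return processed, offset

first_run, committed = consume(log, 0, crash_after=2)  # consumer dies after 2 msgs
second_run, _ = consume(log, committed)                # restart from saved offset
```

Because the log itself is durable, the restarted consumer picks up exactly where the crashed one left off and no message is skipped.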
The nature of data leads us to think about “time.” There are two major ways to reason about time in a streaming pipeline. One is to use “event time,” which is when the event (e.g., a user clicked a link on a website) happened. The other is processing time, or wall clock time, which is when the event message arrives in the pipeline and is processed by your program. From the figure below (Figure 2), we can see that some events that happened first may actually have arrived late in your system.
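A short sketch makes the distinction concrete. Here events are grouped into one-minute tumbling windows by when they happened, not when they arrived, so a late-arriving event still lands in the correct window (the timestamps and payloads are invented for illustration):

```python
from collections import defaultdict

WINDOW = 60  # one-minute tumbling windows, in seconds

def assign_windows(events):
    """events: (event_time, arrival_time, payload) tuples, in ARRIVAL order."""
    windows = defaultdict(list)
    for event_time, _arrival_time, payload in events:
        window_start = (event_time // WINDOW) * WINDOW  # bucket by EVENT time
        windows[window_start].append(payload)
    return dict(windows)

events = [
    (5,  10, "click-A"),   # happened at t=5, arrived at t=10
    (70, 75, "click-B"),
    (50, 80, "click-C"),   # happened early (t=50) but arrived LATE (t=80)
]
by_event_time = assign_windows(events)
```

Bucketing by `_arrival_time` instead would put `click-C` in the wrong window, which is exactly the error that processing-time windowing makes when events arrive out of order.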
When you build your pipeline, you should think about whether event time matters to you or not.
Figure 3 shows a spectrum of the use cases from when event time doesn’t matter that much to when event time matters.
One example of when event time doesn’t matter (also called time-agnostic) is a feature in the Capital One Second Look application. It uses machine learning to identify unusual spending patterns and sends customers an alert for double charges, a repeating charge that is higher than last month, or a tip that is higher than a customer’s norm. Through the feature called Generous Tips, we notify customers when they leave an unusually high tip, for example, 50% of the original bill. This could mean that our customers really enjoyed their dinner, or maybe there was a mistake. We want to notify our customers while they are still in the restaurant so that they may be able to correct the mistake right away. In this case, the event time doesn’t matter as much, as long as we are able to detect the generous tips and notify our users.
In the middle ground are the use cases where we can use approximation. For example, a credit card company may be interested in what its customers’ top merchants are for the past hour. In this case, it is okay if we don’t get an exact count: even if some messages get delayed, we will probably still arrive at the same list of top merchants. The other end of the spectrum is where event time matters. For example, we should use event time for our deduplication process.
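Event-time deduplication can be sketched as keying each event on its identifier plus its event time, so a duplicate delivery is collapsed no matter when each copy arrives in the pipeline. The transaction ids and payloads below are invented for illustration:

```python
def deduplicate(events):
    """events: (event_id, event_time, payload) tuples; keep the first copy of each key."""
    seen, unique = set(), []
    for event_id, event_time, payload in events:
        key = (event_id, event_time)      # keyed on EVENT time, not arrival time
        if key not in seen:
            seen.add(key)
            unique.append((event_id, event_time, payload))
    return unique

events = [
    ("txn-1", 100, "charge $20"),
    ("txn-2", 105, "charge $35"),
    ("txn-1", 100, "charge $20"),   # duplicate delivery of txn-1, arriving late
]
deduped = deduplicate(events)
```

Keying on processing time instead would make the late duplicate look like a brand-new event, defeating the deduplication; in a real pipeline the `seen` set would also need an expiry policy so it doesn't grow without bound.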
When deciding whether you need batch or streaming for your use case, there are a couple of things to consider:
- How will it affect your model accuracy?
- How much control do you want over your model updating process?
- How much latency do you allow for your pipeline?
- Is it easier to maintain one pipeline rather than two?
If you want to enable real-time decisioning, it’s better to build everything in a streaming fashion, with batch built on top of streaming (Kappa architecture), rather than running a batch process alongside a streaming process (Lambda architecture).
In summary, the challenges of a streaming pipeline come from both the data stream itself and your infrastructure. Streaming gives you more timely results, but at a cost. Not every application needs to be real-time. And architecture decisions become easier to make when data engineers and data scientists share their knowledge and understand each other’s challenges.
These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2017 Capital One.