Flavours of Streaming Processing
In his excellent recent blog post “Streaming 101”, Tyler Akidau made a great contribution to the streaming community by teasing apart notions that have been confounded by standard, albeit increasingly obsolete, practices. In that same spirit of paving the way for the streaming revolution, this blog post wishes to emphasise that streaming does not necessarily mean approximate, with a particular focus on machine learning.
Let’s start with the simplest possible example, that of a linear sum: it can easily be computed in an incremental fashion, processing each datapoint exactly once:
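Here is a minimal Python sketch of such an incremental sum (the function name and data are illustrative, not from the original post):

```python
def streaming_sum(stream):
    """Exact online sum: state is one number, updated once per datapoint."""
    total = 0.0
    for x in stream:      # a single pass over the (possibly unbounded) stream
        total += x        # constant-time, constant-memory update
        yield total       # the running answer is exact at every step

# The final streamed value agrees precisely with the batch sum.
data = [3.0, 1.5, 2.5, 4.0]
assert list(streaming_sum(data))[-1] == sum(data)
```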
The answer in this case is exact: it agrees precisely with a batch sum. Such examples are referred to as exact incremental or exact online processing.
However, not all queries can be computed exactly in an online fashion. A classical example is top-N, or a median. To understand why, it is best to define our terms: what does incremental or online really mean? Put simply, it means that the memory and time requirements of the process do not grow over time — despite the fact that the number of data entries processed does (since data streams are by definition semi-infinite — in Tyler’s terminology: unbounded).
Indeed, to maintain a sum, one only needs to store a single floating-point number in memory (the last value of the sum). To maintain an average, one needs to store two: the sum, and a count of how many data entries have been seen so far. However, to maintain a median, one needs to keep in memory an array whose size grows with the total number of data entries seen, and similarly for top-N. Consequently, the best we can do in a streaming context is offer approximate answers to queries such as a median, top-N, or a percentile calculation.
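To make the contrast concrete, here is a small Python sketch (the class names are ours) of an exact online mean next to an exact median whose memory footprint grows with the stream:

```python
import statistics

class StreamingMean:
    """Exact online mean: state is just (sum, count), constant memory."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, x):
        self.total += x
        self.count += 1
        return self.total / self.count

class ExactMedian:
    """Exact median: must retain every datapoint seen so far."""
    def __init__(self):
        self.values = []          # memory grows with the stream

    def update(self, x):
        self.values.append(x)
        return statistics.median(self.values)
```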
“Approximate answers” has been a longstanding criticism of streaming processing. MapReduce enthusiasts have always argued that, no matter how large your dataset grows over time, unless you desperately need real-time answers, you should always scale out and run batch jobs to ensure correctness.
Needless to say, as Tyler also mentions, approximate queries can be implemented in batch mode too, in time-critical contexts. In fact, the initial motivation behind online processing was precisely that: to generate fast approximations by processing massive datasets sequentially.
However, as soon as we move beyond simple analytics such as sums and top-Ns, and into the realm of machine learning, the situation becomes a lot more interesting. Very few batch machine learning algorithms have exact online counterparts (one example being linear regression, which can be computed exactly online via the Recursive Least Squares algorithm). Therefore, your favourite batch random forest will only approximately match its online version.
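As an aside, here is a minimal sketch of that RLS recursion, assuming squared loss; the initialisation constant `delta` is a standard regularisation device, so the recursion matches the batch least-squares solution on all data seen so far, exactly up to that initialisation:

```python
import numpy as np

class RecursiveLeastSquares:
    """Exact online linear regression via Recursive Least Squares."""
    def __init__(self, dim, delta=1e6):
        self.w = np.zeros(dim)          # current weight estimate
        self.P = delta * np.eye(dim)    # inverse covariance estimate

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (1.0 + x @ Px)         # gain vector
        self.w += k * (y - x @ self.w)  # correct by the prediction error
        self.P -= np.outer(k, Px)       # rank-one downdate of P
        return self.w
```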
Does that automatically make the online algorithm inferior or sub-optimal compared to the batch version? The answer is a definite (and perhaps surprising) no!
A major distinction between analytical queries and machine learning algorithms is that the former are fixed functions of the data, such as a sum or a rank, whereas the latter are processes, rather than functions, whose objective is to discover hidden patterns or rules that generalise well to future datapoints. There is therefore no theoretical reason why a batch random forest must necessarily outperform a well-designed online random forest.
In the seminal book Principles of Data Mining, the chairman of our advisory board, Professor David Hand, and his coauthors make this point really nicely. They break down a data mining algorithm into the following constituents:
- The task to be solved (e.g., classification or clustering)
- The structure of the model/pattern being fit to the data (e.g., decision tree)
- The scoring function which measures how good the fit is (e.g., squared loss)
- The search/optimisation algorithm which seeks high-scoring patterns
Possibly the easiest way to understand this is to consider a neural network with a fixed architecture (i.e., number of hidden layers etc.), which has to be trained on a dataset of labelled examples. Because neural networks are infamous for producing optimisation surfaces with a huge number of local optima, no gradient descent method is guaranteed to identify the global optimum, and, indeed, conjugate GD might produce a different answer than simple GD. In this sense, an online algorithm using Stochastic Gradient Descent is just another candidate, which might or might not outperform the batch candidates. In any case, it makes no sense to consider the SGD answer as an approximation to the GD answer. In truth, they are both approximations of the unknown “true” decision boundary, and, if they are well-designed, they should both approach the “true” answer as the size of the dataset increases. Which one gets there faster is largely dependent on the problem and the class of methods. Put differently, in machine learning, both batch and online algorithms are approximations, that use data and computational resources differently.
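To make this concrete, here is a hedged Python sketch on synthetic data (names, data, and hyperparameters are all illustrative): batch gradient descent and single-pass SGD minimise the same squared-loss surface, and both should be judged by how close they get to the true weights, not to each other:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                      # the unknown "truth"
X = rng.normal(size=(1000, 2))
y = X @ true_w + rng.normal(scale=0.1, size=1000)

def batch_gd(X, y, lr=0.1, epochs=100):
    """Batch candidate: full-dataset gradient at every step."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def online_sgd(X, y, lr=0.01):
    """Online candidate: one noisy single-point gradient per datapoint."""
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        w -= lr * (x_i @ w - y_i) * x_i
    return w

# Both estimates approach true_w as the dataset grows.
print(batch_gd(X, y), online_sgd(X, y))
```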
A clarification is in order at this point: sliding windows are an extremely inefficient way to perform streaming learning, precisely because they are designed as approximations to batch learning, yet are starved of information because they use only a fraction of the available data. Overarching frameworks for online learning, such as stochastic gradient descent and stochastic approximation, ensure instead that information is extracted and stored in the most efficient manner possible, and updated sequentially.
But, surely, you might argue, the batch paradigm has more computational resources at its disposal and can therefore support superior algorithms, right? As a general statement, this argument makes sense, but in practice several statistical phenomena conspire to make online algorithms extremely competitive. We mention only a few here, and promise to analyse them further in a future post:
- Overfitting. Online learning improves its answers by continuously comparing its predictions with the next datapoint. This means that, by design, online algorithms are less prone to overfitting than offline methods, where a great amount of discipline is needed to avoid that pitfall.
- Simulated annealing. When faced with complex optimisation surfaces riddled with multiple optima, the noise introduced by sequential data processing is a very successful way to escape local optima. In fact, online learning was first introduced in the machine learning literature as a way to train neural networks even on batch datasets, as it was observed that feeding the data to the learning algorithm sequentially can result in better answers.
- Temporal drift. Most real datasets feature a certain amount of temporal variation which might be too subtle or too complex to model explicitly (as opposed to, for example, clear periodicities or jumps). The data decay that is a built-in feature of most online learning algorithms offers robustness against this ubiquitous phenomenon (see the sketch below), whereas batch algorithms need substantial refactoring to accommodate drift. This is in fact a critical and under-appreciated point, which we may address in a subsequent post.
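As a simple illustration of built-in data decay, consider an exponentially weighted mean; the decay rate `alpha` is a hypothetical tuning parameter:

```python
class EWMean:
    """Exponentially weighted mean: old observations are gradually forgotten,
    giving robustness to temporal drift at constant memory cost."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha   # weight given to the newest datapoint
        self.mean = None

    def update(self, x):
        if self.mean is None:
            self.mean = x
        else:
            self.mean += self.alpha * (x - self.mean)  # decay old data
        return self.mean
```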
Finally, are streaming algorithms really “more complicated”? We think this question is moot. Some models admit simple algorithms whereas others don’t, and the same holds for their streaming versions. Moreover, online learning is no longer as fragmented as it was ten years ago, since overarching frameworks such as Stochastic Approximation and Stochastic Gradient Descent have matured to the extent that it is now possible to reuse a lot of components in a streaming machine learning library.
Let’s recap our key observations:
- In discussing the comparative accuracy of streaming vs batch processing, one must crucially distinguish between analytics and machine learning.
- For analytical queries, some, such as a sum or a standard deviation, can be computed exactly online (i.e., using constant resources; see the sketch after this list), whereas others (e.g., top-N) can only be approximated, albeit very well.
- In machine learning, both batch and online algorithms offer approximations to an “unknown truth” (e.g., a decision boundary or a cluster structure), and should be assessed according to their generalisation ability rather than their in-sample performance, so as to avoid overfitting. Therefore it makes no sense to consider online algorithms as approximations to batch algorithms.
- Despite using far fewer computational resources, online learning algorithms have built-in features that can often make them outperform their batch counterparts.
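For completeness, the standard-deviation claim above can be demonstrated with Welford’s algorithm, a classical exact online recursion using constant memory:

```python
import math

class StreamingStd:
    """Welford's algorithm: exact online (population) standard deviation.
    State is just the count, the mean, and the sum of squared deviations."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the updated mean
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
```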
Originally published at www.ment.at on 28-Sep-2015