Second Act of Realtime Analytics: Stream Processors and CEP Engines are merging

In 2015 December, I wrote a Quora answer explaining the difference between Stream Processors and CEP, which later converted to a blog, which again later picked up of SE daily blog.

Following is an quote from the blog.

Traditional CEP cannot handle those IoT use cases in their current form. Most IoT use cases would have very high event rates. Therefore, whatever event technology used in those use cases needed to be able to scale up. Stream processing can scale much better than CEP.
At the same time, I believe it is a mistake to ignore the higher level temporal operators introduced by CEP and asking the end users to write their own operators. You can find my thoughts from Patterns for Streaming Realtime Analytics and SQL-like Query Language for Real-time Streaming Analytics.
The good news is that both technologies: CEP and Stream Processing are merging and the differences are diminishing. Both can learn from the other, where CEP needs to scale and process events reliably while event processing needs high-level languages and lower latencies. IBM infosphere, which is a stream processing engine, have had CEP like operators for a long time. WSO2 CEP can now accept SQL-like queries and runs on top of Apache Storm (more details). SQL stream is a CEP engine that is highly parallel. My belief is that we will end up with a combination of both and we all will be better off for it.

To my obvious satisfaction, the merging on Stream Processing and CEP has begun. As I told in the eariler post, IBM Infosphere and WSO2 CEP have both CEP as well as stream processing for some time. Now, this space is crowded.

New contenders are

  1. Apache Flink: Introducing Complex Event Processing (CEP) with Apache Flink
  2. Storm SQL: https://issues.apache.org/jira/browse/STORM-1040 and http://storm.apache.org/releases/2.0.0-SNAPSHOT/storm-sql.html
  3. Spark Streaming ( from 2.0): http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations

Among the three, Fink version is most sophisticated with support for simple temporal patterns ( although it still falls short of classical CEP operators).

In my opinion, with this, we are entering the second act of realtime analytics. We have moved beyond running simple counting example in streaming fashion, and now looking to do more and more real use cases. In the process, we have figured out that it is not practical to rewrite each operator from the scratch, and instead building reusable operators which we compose to build our solutions ( You can find a detailed discussion about this from Why we need a SQL like query language for Realtime Streaming Analytics?)

Finally, Stream processing will be driven by IoT use cases ( as per Forester, IoT analytics will be as seven times big as the devices market in IoT [1]). Following are some features, in my opinion, which will decide winners and losers.

  1. Scalability
  2. Fault Tolerant Processing (with support for Exactly once processing when needed)
  3. Our of order and missing event handling — IoT events, by definition will come from lot of devices, and there is no way you can order events by their occurrence before sending them.
  4. Ability to backtrack and replay events — some analysis would need to backtrack, because you found a interesting pattern or because you found a bug in your calculation.
  5. First class support for time series processing
  6. Ability to apply machine learning within Stream processing ( either train in batch fashion and use models with streaming data, or do streaming machine learning e.g. see https://samoa.incubator.apache.org/)

In summary, If you like streaming analytics, exciting times are ahead.

Update 2017 September: WSO2 CEP is now called WSO2 Stream Processor, which is freely available under Apache Licence 2.

[1] Paul Miller, Brief: Streaming Data From The Internet Of Things
Will Be The Big Data World’s Bigger Second Act, July 5, 2016