Stream Processing: Migrating to MapR
Stream processing is a hot topic nowadays, but the technologies behind it are changing constantly.
The team tried different queueing tools until we landed on Apache Kafka. Kafka is simple to use, faster than anything else we tested, and removes the complex logic of tracking subscribers. Kafka lets subscribers manage their own channels: they decide how to read, where to read from, and they control their own reading pace. Kafka is deliberately a dumb pipe, and we built smart producers and consumers around it.
We also gave Apache Storm a chance to prove itself; however, Apache Spark was our final choice, since we could do more with it, more easily.
We went to the cloud and back, we went to Hortonworks, and then to Cloudera, finally ending up with MapR.
MapR recently released MapR Streams, and at that moment we asked ourselves: why keep the extra infrastructure that Kafka requires if MapR Streams is ready to use and integrated into the data platform?
Is it a good move to change to a new technology now that we have everything working?
Enterprise application changes are often difficult decisions to make, and they rarely depend on the dev team alone. So how do we prove that we can make such changes with minimum impact on our existing environment?
The answer is short: we don't. We are actually changing our queueing system, but our applications are not aware of it.
Basically, we only need to change the configuration endpoints of our apps, since MapR Streams uses the same API as Kafka. Not all of Kafka's configuration parameters are required by MapR Streams, but keeping them will not hurt us, since they will simply be ignored.
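To give a feel for how small the change is, here is a minimal sketch. The stream path `/streams/pipeline` and the topic name are hypothetical; the key point is that in MapR Streams the "endpoint" moves out of the broker list and into the topic name itself (addressed as `<stream path>:<topic>`), while the existing Kafka properties can stay untouched.

```scala
import java.util.Properties

object StreamConfig {
  // Kafka-style configuration: the endpoint is a broker list.
  def kafkaProps(): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }

  // MapR Streams: the same properties object works; settings that do not
  // apply (like bootstrap.servers) are simply ignored by the platform.
  def maprProps(): Properties = kafkaProps()

  // A MapR Streams topic is addressed as "<stream path>:<topic>".
  // The stream path used by callers is an illustrative assumption.
  def maprTopic(streamPath: String, topic: String): String =
    s"$streamPath:$topic"
}

object Demo extends App {
  println(StreamConfig.maprTopic("/streams/pipeline", "events"))
}
```

The producer and consumer code that receives these properties does not need to know which backend it is talking to; only the topic string differs.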
We did face a small challenge: the MapR documentation only shows Java examples, especially for our app's dependencies. Since we are using Scala, it would be nice if those examples were also in the language of our choice; still, not a big deal.
Another small complication we encountered was managing the sbt dependencies on the MapR Streams libraries. That should not be a hard task, especially if you use the sbt-maven-plugin, which lets sbt read a Maven pom.xml without problems. However, it would be nice if we could get the MapR binaries from the official Maven repository.
We just need to add this line to the plugins.sbt file:
addSbtPlugin("com.github.shivawu" % "sbt-maven-plugin" % "0.1.2")
and then copy the pom.xml from the MapR documentation into our project folder.
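As a sketch, the pom.xml from the MapR documentation essentially points Maven at MapR's own repository and at MapR's build of the Kafka client, which is why the binaries are not on the official Maven repository. The repository URL and the artifact version below are illustrative assumptions; check the MapR documentation for the coordinates matching your cluster version.

```xml
<!-- Hypothetical fragment: MapR's repository plus their Kafka client build. -->
<repositories>
  <repository>
    <id>mapr-releases</id>
    <url>http://repository.mapr.com/maven/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <!-- MapR-patched build of the Kafka client; version is illustrative -->
    <version>0.9.0.0-mapr-1602</version>
  </dependency>
</dependencies>
```

With the sbt-maven-plugin in place, sbt picks these coordinates up from the pom.xml without further changes to build.sbt.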
MapR Streams has been built on top of the Kafka API, and there are a few advantages to using it. Most of these specifics can be found here, in a very interesting post by Jim Scott.
If you are using a streaming technology, you need to ask yourself how much longer it will hold your increasing flow of data and events. Will your pipe be smart enough to scale while keeping its simplicity and performance? Most of the technologies out there will not, especially when processing billions of events. Kafka was certainly built for these tasks. However, MapR Streams solves much more complex problems without losing the simplicity and performance Kafka offers, since it sits on top of the Kafka API and adds solutions to some common problems.
The changes to your application should not be hard to make. Even if configuration settings are hard-wired in code, only very small changes will be required to migrate from Kafka to MapR Streams. It is just a matter of dedicating some time to it while thinking about all the benefits you will get from an architecture that takes Kafka to the extreme.
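One way to keep that migration cheap, sketched below with hypothetical names, is to resolve the topic endpoint from configuration rather than hard-wiring it. Switching from a plain Kafka topic name to a MapR Streams path (`<stream>:<topic>`) then touches no application code, only the config value.

```scala
// Minimal sketch: look up the topic from configuration so the queueing
// backend can change without touching producer/consumer code.
// The config key "app.topic" and the values below are hypothetical.
object TopicResolver {
  def resolve(config: Map[String, String]): String =
    config.getOrElse(
      "app.topic",
      throw new IllegalArgumentException("app.topic is not set")
    )
}

object MigrationDemo extends App {
  // Before: a plain Kafka topic name.
  val kafkaConfig = Map("app.topic" -> "events")
  // After: the same key, now holding a MapR Streams path.
  val maprConfig = Map("app.topic" -> "/streams/pipeline:events")

  println(TopicResolver.resolve(kafkaConfig))
  println(TopicResolver.resolve(maprConfig))
}
```

The producer and consumer never see which backend they are wired to; the whole migration collapses into a one-line configuration change.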