Beauty of Apache Samza

It’s not easy to deploy and manage stream processing applications written in Samza. Deploying and managing Storm topologies (stream processing applications written in Storm are called topologies) is easy compared to Sazma. Samza doesn’t provide something like Storm topology out of the box. You have to build a topology by connecting multiple Samza jobs by writing output to Kafka topics and reading them downstream. In my opinion, we can add all these to Samza by writing another layer on top of base Samza. But Samza is still attractive as a stream processing framework due to:

Samza Job is a YARN application

A Samza job is a YARN application that takes care of resource allocation and fault-tolerance. Once you deploy a job, its life-cycle is entirely handled by YARN and the Samza application master specific to that job. So you don’t need to run and manage master processes or daemon processes. If you already have a YARN and Kafka cluster running in your environment, you can quickly get started with Samza. This method allows Samza to utilize any improvements come to YARN including resource isolation and security. Samza job being a YARN application also reduces the number of systems you need to maintain. You only need to manage YARN and Kafka.

Each Samza application being a separate YARN job has some downsides too. If you need Storm like a central dashboard for your streaming applications, you will have to build your own thing. But Samza provides extensible metrics system and per job web app if you need to create a custom monitoring and management tool. Also, if you are not familiar with YARN, it may take some time to getting used to Samza. But Samza comes with an excellent example to help you get started with Samza.

Everything is a stream

Samza tries to use streams as much as possible to implement everything from metrics, fault-tolerance to stream-to-relation joins. Samza encourages you to write your metrics to a Kafka topic in production and provides necessary tools for that. Samza implements checkpointing based fault-tolerance where Samza checkpoints to a Kafka topic (or basically to a stream). Then Samza has a concept called bootstrap stream where you use a stream to bootstrap a job. This stream may be a database change-log stream and during the bootstrapping process you can load existing data from a table to task local storage before start to joining incoming messages from an actual stream against the local storage.

I like the use of streams as described above and hope to utilize bootstrap streams to implement stream-to-relation joins in SamzaSQL.

Natively durable and fault tolerant

When using with Kafka, Samza uses Kafka to guarantee message delivery order (messages are processed in the order they were written to a partition). This feature is crucial when processing time-based window aggregations and joins and SamzaSQL uses this guarantee to implement window operators.

Samza takes care of snapshotting and restoration of local state; during task failures Samza restores the local state of the task to a known snapshot. Samza utilizes Kafka streams for snapshotting.

Good integration with Kafka

Even though we can theoretically plug any messaging system to Samza; IMO, current implementation’s design was heavily inspired by Kafka, and personally I think Samza is a computational layer for Kafka. Utilizing consumer managed offset concept, Samza allows to start reading Kafka topics from the beginning, or the first message comes after the job is started. Configurable initial offset enables replaying of streams for processing historical data. Or can be utilized to deploy a new version of job that starts from the beginning of the stream and when that job catches up we can stop the old version. Also checkpointing tasks and local state also utilize Kafka.

As with any software tool, there are some downsides as well:

  • It’s not easy to get started with Samza. At least, when comparing with famous Storm.
  • Each streaming app is a separate YARN application/job, and Samza lacks inbuilt support for centrally monitoring and managing them.
  • Lack of support for describing DAG of stream processing jobs is an another issue. It’s not impossible to build something, but it requires a considerable amount of effort.
  • Also, I personally prefer if there is a Java API or something similar to deploy Samza jobs easily.
  • Samza doesn’t have spouts like in Storm. Samza users need to implement a custom stream if they need to generate a set of test data to a Kafka topic. I think there should be a special type of tasks which can act as data source instead of asking users to implement a system.

IMHO, above are not deal breakers, and Samza is still a nice platform to develop stream processing applications.

Show your support

Clapping shows how much you appreciated Milinda Pathirage’s story.