Experiences with the SMACK stack for Data Analytics Infrastructure
Here is a summary of the experience gathered on various projects done with the SMACK stack.
For details about SMACK you might want to take a look at the following blog — Hands on with the SMACK stack.
- Apache Spark — the S in SMACK — is used for analysis of data — real time data streaming into the system or already stored data in batches.
- Apache Mesos — the M in SMACK — is the foundation of the stack. All of the applications do run on it. In our cases we’ve been using Mesospheres DC/OS on top of Apache Mesos for the installation and administration of the stack and our own applications.
- Lightbend’s Akka — the A in SMACK — is used for fast data stream processing. In most use cases it’s been either used for fast ingestion of data or for fast extraction through in-stream-processing.
- Apache Cassandra — the C in SMACK — is a fast write and read storage for the fast-data-processing platform.
- Apache Kafka — the K in SMACK — is the intermediate storage for streaming data. It helps to decouple in and out application logic while still being fast enough to add no overhead of time on the stream of data.
In those projects the architecture has looked roughly like this
The ingestion, implemented with Akka, is kind of like Enterprise Integration on steroids. Instead of having a lot of different connectors you’ll end up with just a few entry points, but doing so in a very, very fast way. For each of the projects it had been a requirement to have fast input and storage of the data as well as having that data visible in near real-time. That’s where Akka comes into play again — either being connected to Kafka for real-time streaming of data via Websockets or as connector for Cassandra.
All of the scenarios running this stack have been build on AWS as cloud provider. This made it especially easy to setup and tear down the stack for development.
DC/OS — Apache Mesos
Apache Mesos in combination with Mesospheres DC/OS is the foundation for working with the SMACK stack. The Mesos
kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.
On top of Mesos, Mesosphere’s DC/OS will give you a command line interface and a nice integration with Mesosphere’s Marathon. Especially the first one in combination with Ansible can be used to automate the setup of a whole cluster.
The following shell command installs the Cassandra framework used by DC/OS.
dcos package install cassandra
In combination with the dcos command dcos package list it’s possible to verify the success of the installation.
While Apache Mesos is used as “Kernel”, Marathon is used as “init.d” on top of it. Marathon makes sure deployed applications are running successfully. In case of a failure those applications are restarted by Marathon. While Marathon takes care of long running tasks, Mesosphere’s Metronome is in charge of short running cron-like tasks.
Marathon and Metronome are so called Mesos frameworks. A framework takes care of reserving resources from Mesos, schedules the application to be launched and sometimes monitors those applications.
The nice thing about DC/OS is that it is very easy to run Apache Cassandra via a specialized Mesos framework. This framework not only helps in installing a Cassandra cluster on top of Mesos, it also makes sure of handling recovery of failed instances. The configuration and sizing of the Cassandra is crucial for having a high performance fast data platform based on this SMACK stack.
For performance reasons it’s best to start with a SSD hard drive, but also EBS volumes can be used. Using EBS we never experienced any shortcomings, though it seemed we had an increase in write queues. This usually happens if commit log and SSTables are written to the same storage. In cases like these it’s crucial to have a fast connected EBS at hand.
In short, more CPU will give you more throughput. As Cassandra uses different thread pools for write and read paths, an increase in number of CPUs helps tremendously, as those thread-pools have more dedicated CPUs.
As Cassandra had commodity hardware in mind when being designed, a lot of heap is of no use for a cassandra instance. A lot of RAM is rather useful for the underlying system itself, as the page cache will consume all the memory not used by other applications. Therefore configuring Cassandra with 8GB of RAM as heap plus 24GB for the underlying system for page caches is sufficient. It’s rather crucial to make sure the new generation heap is configured properly. As recommended in the Cassandra documentation the new generation of the heap should be set to 1400MB. Which is equivalent to 100MB times the number of CPUs. The rule of thumb is to either use 100MB times number of CPUs or a quarter of the maximum Heap, where the lesser number needs to be used.
As the system is run with a Java 8 runtime, garbage collection can be set to the GC1 garbage collector.
Like Apache Cassandra and other Big-Data systems, Kafka has also been designed with commodity hardware in mind. Therefore around 5GB of Heap for the Kafka process is enough. Again here it’s more important to have enough RAM for the hard drive caches as for Kafka heap usage. Regarding storage for Kafka the same principle applies. SSD should be favored but a fast connected EBS storage is sufficient. Kafka and its thread pools also profit greatly by a higher number of CPUs.
As Cassandra and Kafka in a production use-case are consuming rather lot of cpu and harddrive, one should consider to run those frameworks on dedicated machines. This does break the rule of “every framework is treated equal”, but especially the page caching mechanism only works if those applications are the only user of all resources.
Apache Spark is already optimized to run on Apache Mesos. So after the easiness of installing it via a dcos package install spark, spark is ready to be used. Spark comes with a default Mesos scheduler, the MesosClusterDispatcher also known as Spark Master. All spark jobs will register themselves with the master and in turn will also be a Mesos framework. This driver is negotiating with Mesos about the required resources. From there on it takes care about the executors for Spark. Within this scenario of using DC/OS the executors are docker images with a DC/OS optimized configuration. They already contain configurations for access of a HDFS inside the DC/OS cluster.
Spark metrics are nice while being watched in real time, but it would be nice to have the metrics available all the time, especially since after the death of the driver this data is gone. One way is to use the Spark history server. The history server is nice, but requires an HDFS to be available. This alone isn’t much of a downside but the requirements on running HDFS on DC/OS is rather high. At least 5 instances are required with lots of hard-drive. This just for taking a look at the Spark metrics is rather expensive. Therefore a good possibility is to use ELK (ElasticSearch, Logstash and Kibana) for monitoring. But how do we get to the logs of Spark. In a “regular” environment, you’ll usually just add some logging details to the executors, but as our executors are started by mesos and managed by the Spark Driver this needs some extra tweaking in the DC/OS world.
To enable the spark metrics a configuration is needed as the following:
# Enable Slf4jSink for all instances by class name
# Polling period for Slf4JSink
With these settings Spark logs all metrics available to the std logging mechanism. The tricky part is to actually have those settings enabled inside the docker image provided by DC/OS. Actually this isn’t possible, so we built a custom Docker image already containing these settings.
The Dockerfile for such a preconfigured Docker image can be seen below, it’s not much magic.
ADD ./conf/metrics.properties /opt/spark/dist/conf/
As you’ve seen in the previous section, enabling a HDFS system can be quite costly in terms of storage. In this scenario an extra HDFS system isn’t required, therefore storing data in S3 is a quick win. As with the metrics, DC/OS’ own spark image is optimized for HDFS and therefore needs some tweaking for accessing S3.
First, it is crucial to have those S3 accessing libraries available in your docker image. Second, you’ll need to make sure those newly available libraries are present in the configuration of your executor.
In the configuration above, there is an additional section setting the filesystem type to S3A which is needed for a faster access to it. The Apache Hadoop driver supports two different ways of accessing S3, one the default S3 is a block based access on the S3, while the S3N or S3A use object based access on S3. Hadoop now only supports S3A as it’s the successor of S3N (native). The extraClassPath entries are needed for the driver and also for the executor. An additional step of creating your own Docker image is to make sure those libraries are available.
Using a custom Docker spark image
When submitting Spark jobs via DC/OS usually you’ll issue a command as the following:
dcos spark run --submit-args='--driver-cores 0.1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com/spark/assets/spark-examples_2.10-1.4.0-SNAPSHOT.jar 10000000'
Now if you want to use your own custom docker image you need to adapt the command to look like the following.
dcos spark run –submit-args =’…’ –docker-image=my-own-docker/spark-driver
In that case you need to make sure this docker image is accessible to your Mesos agents.
Because installing those Spark job can be cumbersome, we used Metronome for an easy installation of those Spark jobs.
After all those technical implications let’s take a look at what can be achieved with such a SMACK platform. Our requirements contained the ability to process approximately 130 thousand messages per second. Those messages needed to be stored in a Cassandra and also be accessible via a frontend for real time visualization.
This scenario is build on the basic architecture. Akka Streams are used for ingestion and also for the real-time visualization. The ingestion does some minor protocol transformation and publishes the data to a Kafka sink. While 4 nodes are capable of handling these 130k msg/s, 8 ingestion instances are needed to keep up with 520k msg/s.
A high throughput in the ingestion is nice, but how much delay does it produce? For example: are those messages queuing somewhere internally Is the back-pressure to high? While being able to handle the throughput of 520k msg/s the delay has an average of 250ms of latency between message creation and storage and Kafka. If the network and ntpd fuzziness is taken into account, the delay produced by the ingest itself is comparable to be nonexistent.
So the ingestion and therefore the visualisation consumer are capable of handling the data in real-time. How does storing the data in Cassandra compare to this?
The easiest use-case for Apache Spark in this scenario is to stream the incoming data into the Apache Cassandra database. While Kafka topics are perfectly suited to be consumed in a streaming way, Cassandra isn’t capable of doing streamed inserts. That’s one of the reasons the largest amount of time is consumed by connecting and communicating with Cassandra. Therefore it’s better to have a batch interval of about 10 seconds. The base-line for connection and minimum time to just get going is about 1 to 2 seconds depending on the underlying hardware and amount of CPUs allocated for the executors. The 10 second window frame also helps handling possible but not envied downtimes of the Spark Job. With this time-buffer it’s possible to handle a longer downtime. Another reason to keep the 10 second window, data might happen to be hold back in the pipeline before consuming in Spark. This will also address this issue of a peak-load.
So let’s talk numbers: a Spark job storing data into Cassandra batched for 10 seconds takes about 4 seconds for 1.3 Million events within the batch window. To store the 5.2 Million events inside the batch window, Spark takes 8 seconds to store those events in Cassandra. As Cassandra is a rather large stakeholder when streaming data from Kafka to Cassandra via Spark, it also needs to scale horizontally with the amount of data. The following table gives a brief overview of amount of messages, average latency per batch and amount of Cassandra nodes needed to handle the amount of messages per second.
Batched 10sCassandra Nodes130k msg/s4s5260k msg/s6s10520k msg/s8s15
When measuring the throughput of simple events of approximately 60 byte size, the initially upper limit was reached by around 500k msg/s. It turned out to be a network capacity limit of the selected AWS box. This limit of 20MB/s correlated nicely with the approximately 500k msg/s and their corresponding byte size. When selecting another type of box with better network settings as entrypoint for the incoming data, processing 520k msg/s was easily achieved.
Sometimes working with bleeding-edge software is more like grabbing into falling daggers, on the other hand it’s great to see it turn into an extremely powerful and capable stack. With time progressing, the Mesosphere DC/OS provided functionality turned more stable and more sophisticated. The out-of-the-Box frameworks for the SMACK stack play nicely, and due to those frameworks the stack is resilient concerning failures. A “misbehaving” app will be stopped by the framework while being restarted in the same minute. This failsafe and failure-tolerant behavior gives great confidence in the running cluster as it rarely needs operation to be in charge.
It turned out to be obvious, but this stack scales linearly regarding performance, throughput and used hardware. But not only does it scale up, it also helps to start with a smaller stack. Especially using Apache Mesos as foundation helps to get the most out of your provided resources. This is especially useful when still developing the stack. In this case Apache Cassandra and Apache Kafka may be on the same node, as throughput isn’t the main goal, yet.
While Monitoring is needed it’s still not provided out of the box, a custom solution is needed. For monitoring purposes it’s still required to have your custom solution, with DC/OS 1.10 you’ll most likely will have a solution for metrics. DC/OS provided Frameworks place a collecting module next to the application, collecting the metrics of that process. Those metrics are locally collected and sent to a central server based on Apache Kafka. This will be an exciting new functionality for measuring metrics of your cluster in future.