SKACK IS THE NEW SMACK

People who are familiar with the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) often find themselves working across the big data spectrum. The stack has proven useful because each of these frameworks is tested and proven to scale.

Spark is an open-source, distributed, general-purpose computing framework. It is often deployed alongside Hadoop (for example on YARN, reading from HDFS), but it keeps data in memory where it can, which suits both iterative and exploratory data processing. It is built on resilient data structures, namely RDDs (Datasets and GraphFrames are layered on top of RDDs), and it builds a DAG of transformations before executing a job.
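To make the "build a plan first, execute later" idea concrete, here is a minimal toy sketch in plain Python. It is not Spark's API: `LazyDataset` and its methods are hypothetical names, and the lineage here is a simple chain rather than a full DAG. It only illustrates that transformations are recorded and nothing runs until an action such as `collect` is called.

```python
from typing import Any, Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source: Iterable[Any],
                 parent: Optional["LazyDataset"] = None,
                 op: Optional[str] = None,
                 fn: Optional[Callable[[Any], Any]] = None) -> None:
        self._source = source
        self._parent = parent
        self._op = op
        self._fn = fn

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: only remember what to do.
        return LazyDataset(self._source, parent=self, op="map", fn=fn)

    def filter(self, fn: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: only remember what to do.
        return LazyDataset(self._source, parent=self, op="filter", fn=fn)

    def _chain(self) -> List["LazyDataset"]:
        # Walk back to the source and return the recorded steps, oldest first.
        steps = []
        node: Optional["LazyDataset"] = self
        while node is not None and node._op is not None:
            steps.append(node)
            node = node._parent
        return list(reversed(steps))

    def lineage(self) -> List[str]:
        return [step._op for step in self._chain()]

    def collect(self) -> List[Any]:
        # Action: replay the recorded lineage over the source data.
        data = list(self._source)
        for step in self._chain():
            if step._op == "map":
                data = [step._fn(x) for x in data]
            else:
                data = [x for x in data if step._fn(x)]
        return data


calls = []


def square(x: int) -> int:
    calls.append(x)  # record when the function really runs
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
planned = odd_squares.lineage()   # ["map", "filter"]
ran_before_action = len(calls)    # 0: nothing has executed yet
result = odd_squares.collect()    # [1, 9, 25]
```

Because the result is always recomputed from the source plus the recorded steps, a lost result can be rebuilt by replaying the lineage, which is the intuition behind the "resilient" in RDD.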

Research paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

Mesos is a cluster management tool designed to

  • Abstract data center resources
  • Improve cluster utilization by colocating diverse workloads
  • Manage applications: deployment, self-healing, scaling, and upgrades
  • Provide evergreen extensibility
  • Scale elastically from ten to thousands of nodes and more

The following paper is a must-read to understand the design principles of Mesos.

However, in my experience, SKACK (Spark, Kubernetes, Akka, Cassandra, and Kafka) is more flexible. A direct comparison between Kubernetes and Mesos is not quite apt, since k8s is a container orchestration tool while Mesos is a full-fledged cluster management tool. But in service-oriented architectures, where teams manage their workloads on IaaS (infrastructure as a service) or PaaS (platform as a service), k8s has spread easily thanks to its simple and elegant abstractions. Though k8s does not offer the data locality across different stateful applications that Mesos does, its ease of use has led to widespread adoption in the industry.

k8s is an open-source cluster management tool for container orchestration. It covers:

  • Provisioning and deployment
  • Service Discovery and DNS resolution
  • Scaling
  • Monitoring (Health and Liveness)
  • Management (Rollouts & rollbacks)

Everything in k8s is designed around its RESTful API server. A developer describes what they intend using declarative abstractions, and the system acts on that description through the API server. There are no private, privileged APIs or other magic system-only calls. Abstractions such as Pods, Jobs, Services, ReplicaSets, Deployments, and StatefulSets together form the mental model of Kubernetes.
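The declarative style can be illustrated with a toy reconciliation loop: the user states a desired replica count, and a controller keeps comparing it with what is observed and closes the gap. This is a plain-Python sketch under simplifying assumptions, not the Kubernetes client or controller code; `ToyCluster` and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for the observed cluster state."""
    pods: Dict[str, List[str]] = field(default_factory=dict)  # app -> pod names
    next_id: int = 0


def reconcile(desired: Dict[str, int], cluster: ToyCluster) -> List[str]:
    """Compare desired replica counts with observed pods and return the actions taken."""
    actions: List[str] = []
    for app, replicas in desired.items():
        running = cluster.pods.setdefault(app, [])
        while len(running) < replicas:   # too few pods: create more
            cluster.next_id += 1
            name = f"{app}-{cluster.next_id}"
            running.append(name)
            actions.append(f"create {name}")
        while len(running) > replicas:   # too many pods: remove the newest
            actions.append(f"delete {running.pop()}")
    return actions


cluster = ToyCluster()
first = reconcile({"web": 3}, cluster)    # creates web-1, web-2, web-3
second = reconcile({"web": 3}, cluster)   # nothing to do, state already matches
cluster.pods["web"].remove("web-2")       # simulate a crashed pod
healed = reconcile({"web": 3}, cluster)   # creates web-4 to replace it
scaled = reconcile({"web": 1}, cluster)   # deletes web-4, then web-3
```

The caller never says "start a pod" or "kill a pod"; it only changes the desired number, and self-healing and scaling fall out of the same comparison.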

Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients.
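Masterless replication is easier to picture with a toy token ring. The sketch below is a simplification in plain Python, not Cassandra's implementation: it uses MD5 and one token per node, whereas Cassandra defaults to a Murmur3-based partitioner and virtual nodes. `ToyRing` and the node names are hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used here only for simplicity)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Every node owns a token; a key is stored on the next N nodes clockwise."""

    def __init__(self, nodes: List[str], replication_factor: int = 3) -> None:
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.replication_factor = replication_factor
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str) -> List[str]:
        # First node whose token is greater than the key's token, wrapping around.
        start = bisect.bisect_right(self._tokens, token(partition_key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.replication_factor)]


nodes = [f"cassandra-{i}" for i in range(5)]
ring = ToyRing(nodes, replication_factor=3)
owners = ring.replicas("user:42")

# Simulate losing the first replica: the other two still hold the data.
smaller_ring = ToyRing([n for n in nodes if n != owners[0]], replication_factor=3)
owners_after_failure = smaller_ring.replicas("user:42")
```

Any node can compute `replicas` for any key, so there is no master to consult and no single point of failure; losing one node leaves the remaining replicas in place.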

Apache Kafka is a distributed publish-subscribe (pub-sub) messaging system that can handle a high volume of data. It is suitable for both offline and online message consumption. Messages are persisted on disk and replicated within the cluster to prevent data loss in the event of node or network failure. It integrates very well with Spark for real-time streaming data analysis.
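Why the same topic can serve both online and offline consumers becomes clearer with a toy model of a persisted log. The sketch below is plain Python, not the Kafka client API: it models a single partition in memory, and `ToyTopic`, `publish`, `poll`, and the group names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class ToyTopic:
    """Append-only log; each consumer group keeps its own read offset."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["a", "b", "c"]:
    topic.publish(event)

live_first = topic.poll("realtime")     # online consumer reads as data arrives
topic.publish("d")
batch_all = topic.poll("nightly-batch") # offline consumer starts from the beginning
live_second = topic.poll("realtime")    # online consumer only sees what is new
```

Reading does not remove messages from the log, so a late batch consumer still sees everything the real-time consumer already processed.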

SKACK

Spark: general-purpose data processing framework (batch processing).

Akka: actor-based toolkit designed for parallelism and concurrency at scale (streaming and other actor-model workloads).

Kafka: general-purpose publish-subscribe messaging system (intermediate storage).

Cassandra: horizontally scalable NoSQL database for data persistence (permanent storage).
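Of the four roles above, the actor model is probably the least familiar, so here is a minimal sketch of the idea in plain Python. It is not Akka's API (Akka targets Scala and Java); `CounterActor` and `tell` are hypothetical names. It only shows the core rule: an actor owns its state, and the outside world changes that state solely by sending messages to its mailbox.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, and one thread that handles messages in order."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self) -> None:
        self._mailbox: "queue.Queue[object]" = queue.Queue()
        self._total = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message: object) -> None:
        """Fire-and-forget send; safe to call from any thread."""
        self._mailbox.put(message)

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._total += message  # no lock needed: messages are handled one at a time

    def stop(self) -> int:
        """Ask the actor to finish its queued messages, then return the final total."""
        self.tell(self._STOP)
        self._thread.join()
        return self._total


actor = CounterActor()


def send_many(count: int) -> None:
    for _ in range(count):
        actor.tell(1)


senders = [threading.Thread(target=send_many, args=(1000,)) for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()

total = actor.stop()  # 4000: every message was applied exactly once
```

Four threads send concurrently, yet the counter needs no locks, because only the actor's own thread ever reads or writes `_total`.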

All of the applications mentioned above can be containerized and deployed on k8s, as follows.

Image depicting SKACK architecture
Image Showing k8s pods for SKACK

Spark: Deployment with master and worker pods.

Akka: Deployment with pods acting as seed nodes and worker nodes.

Kafka: StatefulSet with each pod acting as a broker.

Cassandra: StatefulSet with pods forming a ring.
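One reason StatefulSets fit Kafka brokers and Cassandra ring members is that their pods get stable, ordinal names and, with a headless Service, stable DNS entries, whereas Deployment pods get generated names that change on every restart. The helper below sketches that naming scheme in plain Python. The StatefulSet, Service, and namespace names are hypothetical examples, and the default cluster domain `cluster.local` is assumed.

```python
from typing import List


def statefulset_pod_names(name: str, replicas: int) -> List[str]:
    """StatefulSet pods are named <statefulset-name>-<ordinal>, starting at 0."""
    return [f"{name}-{ordinal}" for ordinal in range(replicas)]


def stable_dns_names(name: str, replicas: int, headless_service: str,
                     namespace: str = "default",
                     cluster_domain: str = "cluster.local") -> List[str]:
    """Per-pod DNS names provided by the headless Service that governs the StatefulSet."""
    return [f"{pod}.{headless_service}.{namespace}.svc.{cluster_domain}"
            for pod in statefulset_pod_names(name, replicas)]


# Hypothetical names, for illustration only.
kafka_brokers = stable_dns_names("kafka", 3, headless_service="kafka-headless")
cassandra_seeds = stable_dns_names("cassandra", 3, headless_service="cassandra")[:2]
```

With addresses like these, a restarted broker comes back under the same name, and new Cassandra pods can be pointed at a fixed list of seed nodes to join the ring.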

Helm charts for installing the SKACK stack can be found at SKACK charts.