Data Science Infrastructure with the SMACK Stack
We have reached an era where silicon is becoming obsolete. Sorry, don't get it twisted! I mean that data systems have embraced open source tools for data mining, processing, storage and visualization. These tools have exhibited the best properties suited for big data and data science.
Approximately 90% of the world's data has been produced since 2016, according to an IBM Marketing Cloud study. Reports in 2017 indicated that there were 3.8 billion Internet users. This huge amount of data is sourced from businesses, social media, sensors, mobile phones, cryptocurrency and more. There is a need to mine this data as it streams in, analyze it and visualize it before it becomes stale and loses meaning. Failing to do so is a major downfall of many systems and business entities that strive to dominate and optimize their operations.
The need for a fast, robust, scalable, reliable and cost-effective data science infrastructure gave rise to the SMACK stack: a combination of powerful open source tools put together to handle a big data pipeline. Isn't that amazing? Well, the SMACK stack is capable of handling streaming data and processing it in a flash. This is advantageous for decision making, maintaining uptime, and giving instant feedback.
Someone may be wondering: what does this mean? I hope no one is lost. This is just an introduction to setting up DC/OS by Mesosphere on the Google Cloud Platform. Later we will configure Google Compute Engine, set up the SMACK stack using the CLI, deploy a Twitter streaming application, and sit back and watch the tweets stream in.
To just scratch the surface, let's look at what constitutes the SMACK stack. SMACK stands for Spark, Mesos, Akka, Cassandra and Kafka.
Apache Spark processes and runs workloads up to 100x faster than its counterpart, MapReduce in Hadoop. It achieves high performance for both batch and streaming data using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively. This is aided by Marathon, schedulers, brokers and many more.
Akka is a Lightbend technology: a toolkit for building highly concurrent, distributed, and resilient message-driven applications in Java and Scala. Akka is an implementation of the Actor Model on the JVM. Applications built with it withstand high traffic. No wonder you get it easy with Amazon, PayPal and Walmart.
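Akka itself ships as a JVM library, but the Actor Model idea it implements can be sketched in a few lines of plain Python (a stand-in for illustration, not Akka's actual API): each actor owns a private mailbox and processes one message at a time, so its state is never touched by two threads at once.

```python
# Minimal sketch of the Actor Model (plain-Python stand-in, not Akka):
# an actor owns a mailbox queue and is the only thing that mutates
# its own state, one message at a time.
import queue
import threading

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg != "stop":
                self.count += 1  # state changed only inside the actor
            self.mailbox.task_done()
            if msg == "stop":
                break

    def tell(self, msg):
        # Fire-and-forget: senders never block on the actor's work
        self.mailbox.put(msg)
```

Senders just `tell()` the actor and move on; there are no shared locks, which is what lets actor systems scale to very high traffic.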
Apache Cassandra is a database. It is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a perfect platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. It also allows data decentralization and tunable consistency.
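As a small taste of that multi-data-center replication, a keyspace spanning two data centers might be declared in CQL roughly like this (`tweets`, `dc_east` and `dc_west` are made-up names for illustration):

```sql
-- Hypothetical keyspace replicated across two data centers:
-- 3 copies of every row in dc_east, 2 copies in dc_west.
CREATE KEYSPACE tweets
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_east': 3,
    'dc_west': 2
  };
```

With replicas in both data centers, reads can be served close to the user and the loss of one region does not lose data.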
Kafka is used for building real-time data pipelines and streaming applications. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production at thousands of companies.
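Kafka's core abstraction is an append-only, partitioned log: producers write records to the end, and each consumer group reads from its own offset at its own pace. A plain-Python stand-in for that idea (not Kafka's actual API) looks like this:

```python
# Plain-Python stand-in (not Kafka itself) for Kafka's core idea:
# an append-only log that producers write to and consumer groups
# read from, each group tracking its own offset.
class PartitionLog:
    def __init__(self):
        self.records = []   # the append-only log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, group):
        # Hand the group every record it has not seen yet,
        # then advance its offset to the end of the log.
        off = self.offsets.get(group, 0)
        batch = self.records[off:]
        self.offsets[group] = len(self.records)
        return batch
```

Because offsets belong to the consumer groups rather than the log, a new group can replay the whole stream from the beginning while existing groups keep reading only fresh records.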
When all these tools are installed and well configured on virtual machines running a Debian environment under Mesos, a massive, fast and robust data infrastructure is built, able to process data from different APIs, such as electrical systems, businesses, sensors and social media.
Let's familiarize ourselves with these tools before we embark on setting up this mind-blowing data infrastructure. You don't need physical servers, so don't panic. It is a DIY hack.
This is just the tip of the iceberg of what we are going to do. We shall shed more light on each of the relevant tools later.
Give this post more claps to signal others so that we can journey together in setting up this project. See you next time.