Apache Spark — The Largest Open Source Project In Data Processing

Simon Sugob
Published in HireDevOps · 2 min read · Dec 28, 2018

What is Apache Spark?

Apache Spark has quickly become the largest open source community in Big Data, with over 1,000 contributors from 250+ organizations. Big internet players such as Netflix, eBay, and Yahoo have already deployed Spark. In just eight years, this 100% open source project has become synonymous with real-time Big Data analytics [https://databricks.com/spark/about].

Matei Zaharia, Co-Founder and CTO at Databricks, explains what Apache Spark is:

In about 80 percent of use cases, people’s end goal is to do data science or machine learning. But to do this, you need to have a pipeline that can reliably gather data over time.

Both are important, but you need the data engineering to do the rest. We target users with large volumes, which is more challenging. If you are using Spark to do distributed processing, it means you have lots of data. [Matei Zaharia]

The latest version, Apache Spark 2.4, was released on 2 Nov 2018. This release brings many new features, such as barrier execution mode for ML applications, optional eager evaluation for previewing DataFrames in Jupyter, higher-order functions in SQL, and Scala 2.12 support. [https://spark.apache.org/releases/spark-release-2-4-0.html]
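To make the release notes concrete, here is a minimal PySpark sketch (the app name and sample data are illustrative) that enables the optional eager evaluation for notebooks and uses one of the new higher-order SQL functions, transform:

```python
# A minimal sketch of two Spark 2.4 features: eager DataFrame evaluation
# for notebooks, and higher-order functions in SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark24-features-demo")  # illustrative app name
    # Eager evaluation: render DataFrames as tables in Jupyter
    # without an explicit .show() call
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .getOrCreate()
)

# Tiny sample data: one row with an id and an array column
df = spark.createDataFrame([(1, [1, 2, 3])], ["id", "values"])
df.createOrReplaceTempView("t")

# Higher-order function (new in 2.4): apply a lambda to each array element
spark.sql("SELECT id, transform(values, x -> x * 2) AS doubled FROM t").show()
```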

Why has Apache Spark conquered the Big Data world?

Spark's biggest advantages are its speed and ease of use. Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster even when running on disk.
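As a quick illustration of what in-memory processing looks like in practice, here is a minimal PySpark sketch (the file path and column name are hypothetical): cache a dataset after the first action, and subsequent actions are served from cluster memory instead of rereading from disk.

```python
# A minimal sketch of Spark's in-memory processing: cache a dataset once,
# then reuse it across multiple actions without rereading it from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# "events.parquet" is a placeholder path for illustration
df = spark.read.parquet("events.parquet")
df.cache()  # keep the data in cluster memory once it has been computed

print(df.count())  # first action: reads from disk and populates the cache
print(df.filter(df["status"] == "error").count())  # served from memory
```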

What Apache Spark is about, and why you should treat it not as a competitor to Hadoop but as a complement to it, is clearly explained in Apache Spark Introduction for Beginners by Vikash Kumar, Tatvasoft. In this nicely written article, you will find a brief overview of:

• Apache Spark Architecture

• Ecosystem Components

• Abstractions & Concepts

• Features

• When to Use and When Not to Use Apache Spark

How to Get Started with Apache Spark?

Databricks provides a series of tutorial modules. “You will learn the basics of creating Spark jobs, loading data, and working with data. You’ll also get an introduction to running machine learning algorithms and working with streaming data.” Here is the link to comprehensive resources: https://docs.databricks.com/getting-started/spark/index.html
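If you want to try the basics locally before diving into those modules, here is a minimal first-job sketch in PySpark (the input file and its columns are hypothetical): create a session, load a CSV, and run a simple aggregation.

```python
# A minimal "first Spark job": create a session, load data, aggregate.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("getting-started").getOrCreate()

# "people.csv" with columns name,age is a hypothetical input
people = spark.read.csv("people.csv", header=True, inferSchema=True)

# Count people per age and print the result, ordered by age
(people
    .groupBy("age")
    .agg(F.count("*").alias("n"))
    .orderBy("age")
    .show())
```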

What else can you do?

→ Create your account on GitHub and contribute to Apache Spark development https://github.com/apache/spark

→ Do not forget to register for Spark + AI Summit 2019, the world's largest event for the Apache Spark community (April 23–25, San Francisco): https://databricks.com/sparkaisummit/north-america

Other resources:

  1. How Spark Conquered Big Data
  2. YouTube video from the previous Spark + AI Europe Summit in London [2–4 Oct 2018]: Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling (Logical Clocks AB)
  3. Apache Spark creators set out to standardize distributed machine learning training, execution, and deployment

The original post is available at https://www.hiredevops.org.
