Hey, Apache Spark
If you’re familiar with big data, then you’ve probably heard of Apache Spark. Spark is a powerful open-source data processing engine that makes it easy to analyze large datasets. In this post, we’ll look at some of the reasons why Spark is so popular and explore some of its key features.
Apache Spark is a computation engine and a stack of tools for big data. It has capabilities around streaming, querying your dataset, Machine Learning (Spark MLlib), and graph processing (GraphX). In this post, we will go over what Apache Spark is and its use cases. Spark is developed in Scala but has bindings for Python, Java, SQL, and R, too. Spark processes data primarily in memory (spilling to disk when necessary), which makes it many times faster than the corresponding Hadoop MapReduce workloads.
Spark’s core is the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in memory across operations. This fundamental abstraction enables Spark to support both batch-style computations, e.g., map-reduce jobs, as well as interactive queries. RDDs are constructed by starting with one or more external datasets in HDFS or any Hadoop-supported file system and transforming this data using parallel operations such as map, reduce, join, and cogroup.
Spark also ships with a standalone cluster manager that automates common cluster resource management tasks (e.g., monitoring, scheduling) through built-in primitives that can be composed into higher-level application logic. Spark is available on standard Hadoop distributions or as a separate download.
Spark is a multi-purpose engine. It can be used for batch processing and for interactive queries on datasets stored in HDFS or other file systems. To handle streaming data from various sources, Apache Spark provides real-time stream processing with Spark Streaming, which can analyze and process data in memory or with disk persistence. With its Python and R bindings and interactive shells, Spark also makes interactive analytics on large datasets possible with very little code.
MapReduce and Spark comparison
With the advent of Spark, the MapReduce framework took a backseat due to several reasons mentioned below:
- Iterative jobs: Certain Machine Learning algorithms make multiple passes on a dataset to compute results. Each pass can be expressed as a distinct MapReduce job. However, each job reads its input data from the disk and then dumps its output to the disk for the next job to read. When disk I/O is involved, a job takes many times longer to execute than when the same data is accessed from main memory.
- Interactive analysis: Users can run ad-hoc SQL queries on large datasets using tools such as Hive or Pig. If the user issues multiple queries targeting the same dataset, each query may translate to a MapReduce job, read the same dataset from disk, and operate on it. Having multiple MapReduce jobs read the same dataset from the disk is inefficient, and increases query execution latency.
- Rich APIs: Spark, by offering a variety of rich APIs, can succinctly express an operation that would otherwise consist of many lines of code when expressed in MapReduce. The user and developer experience is relatively simpler when working with Spark, as compared to MapReduce.
Resilient Distributed Datasets (RDDs)
Properties of RDD
- Resilient: This means an RDD is fault-tolerant and able to recompute missing or damaged partitions due to node failures. This self-healing is made possible using an RDD lineage graph that we will cover later. An RDD remembers how it reached its current state and can trace back the steps to recompute any lost partitions.
- Distributed: When data making up an RDD is spread across a cluster of machines.
- Datasets: This refers to representations of the data records we work with. External Data can be loaded using various sources such as JSON files, CSV files, text files, or databases via JDBC.
Why switch from Hadoop
If you are using Hadoop and want to level up your data processing, migrating to Spark may be a good choice.
All migrations, though, cost work hours. Here are some good reasons to migrate:
- Spark does its processing in-memory, making it many times faster than Hadoop MapReduce; the project claims speedups of up to 100x.
- The project’s wide adoption means high-quality documentation, courses, and books that can help you do it right.
- Unlike MapReduce, where you deal exclusively with key-value text streams, Spark offers various high-level abstractions, the most common one being the RDD.
- Pretty much everything we need to do regarding our data can be done with Spark. As mentioned above, Spark can perform streaming activities, Machine Learning, querying data, and graph processing. Of course, we can also do any data manipulation at the RDD level.
- Spark shell for interactive data wrangling. Just like we prototype our Python code in the Python shell, we can do the same in the spark-shell with our Spark jobs.
MLlib vs. TensorFlow/PyTorch
There are numerous super-popular and well-documented Machine Learning frameworks, like TensorFlow and PyTorch. So why would the need for MLlib arise?
A big reason to use any tool from the Spark ecosystem is its distributed nature. In addition to performing in-memory computation, Spark can do it over a distributed file system.
This helps with scaling the process, and you don’t have to learn a second technology that might not be compatible with Spark.
Remember the tennis ball example from an earlier lesson? We could write a model that predicts which color the next ball would be. Since we already have the data in HDFS, we could utilize Spark’s integration with HDFS and run our Machine Learning model there.
Spark and Hadoop MapReduce
Hadoop was a breakthrough in big data processing when it came out. Though it is still a popular tool, Spark outperforms it in many areas, such as performance and real-time needs.
Spark runs on memory, and this alone can be a game-changer.
Applications of Spark
- Trend calculations.
- Business intelligence.
- Summarizing a corpus using graph algorithms like TextRank with GraphX.
- Real-time detection of fraudulent payments using Spark Streaming and MLlib.
- Implementing ETL pipelines with Spark Streaming.
To learn more about how Spark works, visit: https://spark.apache.org/docs/latest/index.html#how-it-works
Apache Spark is an open-source cluster computing system that can be used to perform analytics on large datasets. The more you know about how it works, the better equipped you’ll be to analyze big data and use its power to your advantage with machine learning techniques. Until next time, adios!