Apache Spark Internals — Part 1, don’t get carried away with ease of use
By: Suresh Matlapudi (IBM) - Opinions are my own
There are many reasons for Apache Spark to rapidly gain a huge mind share success in the Analytics world. One reason stands out, it’s ease of use with well-designed high level APIs. Spark shines in this area compared to other parallel data processing frameworks that I have worked or studied with.
The down side of this is, it is easy to get our work done without a need to know the internals or to understand the architecture of this well-designed framework. Apache spark is increasing its market share rapidly and many predict that it will become the de-facto data processing framework for Big Data and Analytics.
If you are using Apache Spark or plan to use Spark, it’s worth your time to learn the internals for optimal usage. Also, it is fun to actually understand a good designed system instead of just using it.
While everyone has their own way for studying software systems, I want to share my way of doing it, that is to run various tests to gain more understanding of its internals. Not all concepts can be tested. But the thinking process involved in how to test a concept or component improves our understanding of it.
Background on my experience while starting with Spark
My Big Data journey started with Apache Hadoop before Apache Spark gained popularity. After initial experimentation with Java map-reduce API, I quickly settled with Apache Pig for data processing, Hive for querying and Flume/Sqoop for data ingestion as my technical stack for building data pipelines.
Meanwhile, Apache Spark was gaining lot of mind share (and market share) and becoming part of various Hadoop distributions. During that time, I visited Apache Spark’s website to understand the following two images (below), I nodded my head (with appreciation) and told myself to start evaluating spark for my next projects.
It continued until I had to implement a project which involved high volume streaming data ingestion and processing. So, I evaluated the available options for steaming platforms for Big Data, at that time, I decided to use Spark Streaming mainly due to its generality and opportunity to use a single framework for different workload types.
This project was like riding a motorcycle before learning how to ride a bicycle. That is using - spark streaming API without much knowledge on Spark core APIs and internals. By using Spark’s good documentation (API), with help from many articles from the ASF community, within a short time, I could have it ready for its release. But the implementation and optimization took a long time for obvious reasons. For me, it is a testament to Spark API’s ease of use.
After this project, I decided to dig deep into Spark’s architecture and learn its internals. To overcome my weakness and focus on theory via documentation/books, I primarily ran run small tests on what I read and confirmed my understanding.
I wanted to share the test results which focused on gaining in-depth understanding of Spark internals. In this part of the blog I will focus on a test environment setup and running a couple of basic tests.
· It will help to have some basic understanding of Hadoop as I am using YARN as cluster manager and HDFS for storage for advanced tests.
· I will be using Spark Scala API for advanced examples. Some foundational understanding of Scala will help as the code samples are used in subsequent parts of this blog.
While Apache Spark can run without Hadoop, given the ubiquity of Hadoop in existing Analytics environments, it is common to use Spark along with Hadoop. I will include examples mostly using Spark on Hadoop.
If you don’t have an existing Hadoop cluster with Spark installed, you can sign up on IBM Bluemix and spin up BigInsights for Apache Hadoop with Apache Spark within a few minutes. Follow instructions from below video.
Make sure to select Apache Spark in the components while creating BigInsights cluster with 4 data nodes. Also, let’s take look at my test cluster in below video.
I am using spark version 1.6.1 for executing my tests. So, I will be referring to spark documentation link for this version. I have yet to dig deep into current version (2.0.1) of Spark to gather information on the new changes. But most of the points should be relevant with new versions as well.
High Level Architecture
It is important to get a mental picture of high level architecture of the system to begin with. Take a good look at this below diagram from spark documentation and read high level overview of spark architecture in cluster mode (executing on Hadoop using YARN as cluster manager in our case) from spark documentation at http://spark.apache.org/docs/1.6.1/cluster-overview.html. Make sure to give special attention to Glossary section to get familiar with various components/terms involved in the architecture.
Each Spark application runs as independent sets of processes called executors and they are orchestrated by a single process called a driver program.
Test #1 — Spark Application, Driver Program and Executors
One important point from Spark documentation is, each spark application should have its own instance of driver process and set of executors. In fact, no processes (containers in the context of Hadoop cluster) or services should be running if there are no spark applications executing within that cluster.
Let’s validate these simple concepts by running Spark applications. Below is the video for these tests executed on my test cluster.
Test #2 — Spark Application Size
Considering that data processing happens on the executor’s side, it is obvious that the number of executors determines size of the Spark application. At this point we should ask “who decides on the number of executors” for the application. You may recall from test #1 that we can pass a number of executors while starting the applications. Let’s take a close look at Spark documentation on Submitting Applications at http://spark.apache.org/docs/1.6.1/submitting-applications.html.
In addition to number of executors (with — num-executors flag or spark.executor.instances parameter), the size of each executors (memory and number of cores) determines the overall capacity of a Spark application. There are multiple ways to configure the size of spark executors. While default values can be set in Spark configuration files (spark-env.sh or sprak-defaults.conf), most common way is to set this while submitting the application by passing executor-memory flag or spark.executor.memory parameter for memory and executor-cores flag (or spark.executor.cores parameter) or total-executor-cores flag for number of cores. Also on Hadoop cluster, YARN container size limitations (min/max values) need to be considered while deciding on executor resources.
Let’s run few tests to validate these concepts.
While setting the number of executors during job submission gives you control on application size, it may not be sufficient in all scenarios. For example, if you have to schedule a job with variable workloads between the runs, configuring the job for max work load is going to waste cluster resources during low workloads, and vice versa. Spark’s Dynamic Resource Allocation can help us with this issue by providing an ability to adapt to workload changes. Dynamic Resource allocation scales the number of executors registered with an application up and down based on the workload
Testing Spark’s Dynamic Resource Allocation needs some preparation as we need to setup an external shuffle service in the cluster. I plan to continue this testing along with other advanced concepts such as; Spark Jobs, Stages, Tasks, Job scheduling, RDDs and Memory Management.