Run the Simple End-to-End Example from the Book “Spark: The Definitive Guide” Using COVID-19 Vaccination Data

Feng Li
7 min read · Nov 29, 2021


In chapter 2, “A Gentle Introduction to Spark”, of the book by Bill Chambers and Matei Zaharia, the authors provide an end-to-end example to introduce basic Spark concepts. Now we’ll run this very simple code in our lab environment described in “Run pySpark job on YARN cluster using JEG” (attached below). The purpose is to see how Spark splits data into partitions and how data shuffling happens under the hood.

The flight data used in the book’s end-to-end example is only 7KB, which is too small to show Spark’s partition/shuffle behaviors clearly, so we’ll use a bigger data set (147.5MB) from Kaggle called “COVID-19 Vaccinations in the United States, County”. We also changed the example code a little bit for this test.

The code we’ll run is as follows:

covid_vaccination = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv")
spark.conf.set("spark.sql.shuffle.partitions", "2")
covid_vaccination.createOrReplaceTempView("covid_vaccination_tbl")
maxSql = spark.sql("""
SELECT Recip_State, sum(Series_Complete_Yes) as recip_total
FROM covid_vaccination_tbl
GROUP BY Recip_State
ORDER BY sum(Series_Complete_Yes) DESC
LIMIT 60
""")
maxSql.cache()
maxSql.createOrReplaceTempView("maxSql_tbl")
selectSql = spark.sql("""
SELECT Recip_State, recip_total
FROM maxSql_tbl
limit 5
""")
selectSql.show()

The lab environment is as follows:

1 A Spark application on YARN

As we saw in “Run pySpark job on YARN cluster using JEG”, when we start a pySpark kernel from the notebook UI, JEG submits a Spark application to the YARN cluster.

JEG UI

We can see this application on the YARN Resource Manager UI at fig1:8088. Scroll to the right to find the “Tracking UI” column, which links to the Application UI where job, stage, and task information for this application can be found. The JEG UI also provides a “Spark UI” link in the screenshot above which leads to the same Application UI.

YARN Resource Manager — RUNNING Applications
Application UI

2 Run the code

JEG UI

2.1 Run the first cell to read the vaccination data from a CSV file into a Spark DataFrame.

covid_vaccination = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv")

We’ll see 2 jobs related to this first cell: job 17 and job 18.

Spark Application UI

Job 17 is a lightweight job in which Spark quickly scans just 1 record of the data file, while job 18 scans the whole data file to infer the schema, such as the column types. That’s because we set “inferSchema” to “true” in spark.read.

Note: for better performance, don’t set “inferSchema” to “true” if the dataset is large, since the extra full scan takes a lot of time. Instead, you can explicitly tell Spark the data schema in the spark.read command, for example:

from pyspark.sql.types import StructType, StructField, LongType

spark.read \
    .schema(StructType([StructField("id", LongType(), False)])) \
    .csv("test.csv")
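
For a wide file like ours, a practical way to get that explicit schema is to infer it once during development, print it, and paste the result into the job so later runs skip the inference scan (a small sketch using the covid_vaccination DataFrame we just read):

# Inspect the schema Spark inferred for the vaccination file; the printed
# StructType can be copied into .schema(...) so future reads skip inference.
covid_vaccination.printSchema()     # human-readable tree
print(covid_vaccination.schema)     # StructType(...) form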

2.2 Run the second cell to set the number of partitions for the shuffle write output.

spark.conf.set("spark.sql.shuffle.partitions", "2")

By default, a Spark shuffle write creates 200 partitions. In our lab environment there are 2 executors on two nodes, kale and onion, so in the second cell we set “spark.sql.shuffle.partitions” to 2. We’d expect Spark to handle 1 partition on each executor node.

Another parameter, “spark.executor.cores 1”, was set in spark-defaults.conf on each executor node before starting the JEG pySpark kernel, so each executor starts only 1 task. Otherwise, only 1 executor would be seen, starting 2 tasks to handle the 2 partitions.

[root@kale1 ~] vi /opt/spark/conf/spark-defaults.conf
spark.executor.cores 1

This setting is only for monitoring the shuffle behavior. In some production cases, multiple tasks in one executor can achieve better performance because they can reuse data in the same executor memory.
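
Both settings can be sanity-checked from the notebook session itself, for example with something like this (a minimal sketch; the expected values reflect the configuration described above):

# Verify the two settings from the notebook session.
print(spark.conf.get("spark.sql.shuffle.partitions"))                   # expect '2'
print(spark.sparkContext.getConf().get("spark.executor.cores", "1"))    # expect '1', from spark-defaults.conf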

But “spark.sql.shuffle.partitions” only controls the number of partitions after a shuffle. When Spark reads a data file, how many partitions does it create to hold the data in memory? In the blog post “Building Partitions For Processing Data Files in Apache Spark”, Ajay Gupta gives a very good explanation.

From that blog, three parameters and one formula determine how Spark splits an input data file:

(a) spark.default.parallelism (default: total no. of CPU cores)
(b) spark.sql.files.maxPartitionBytes (default: 128 MB)
(c) spark.sql.files.openCostInBytes (default: 4 MB)

maxSplitBytes = min(maxPartitionBytes, bytesPerCore)
where bytesPerCore =
(sum of sizes of all data files + no. of files * openCostInBytes) / default.parallelism

In our lab environment:

No. of CPU cores: 2
maxPartitionBytes: default is 128MB
openCostInBytes: default is 4MB

Our data file COVID-19_Vaccinations_in_the_United_States_County.csv is 147.5MB. Using the above formula, Spark would split this data file as follows:

maxSplitBytes = min(128MB, (147.5MB + 1 * 4MB) / 2) = 75.75MB

So we’d expect Spark to use 2 partitions for our data file: one of about 75.75MB, the other of about 147.5 - 75.75 = 71.75MB. We’ll see if this is true later on.

As expected, running this second cell doesn’t trigger any job.
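
If you want to check right away, the same arithmetic takes a few lines of Python, and the DataFrame from the first cell can report how many input partitions Spark actually created (a sketch; the file size is the figure quoted above):

# Recompute maxSplitBytes for our file (sizes in MB).
file_size_mb = 147.5
open_cost_mb = 4.0
cores = 2
max_partition_mb = 128.0

bytes_per_core_mb = (file_size_mb + 1 * open_cost_mb) / cores   # 75.75
max_split_mb = min(max_partition_mb, bytes_per_core_mb)         # 75.75
print(max_split_mb)

# Number of input partitions Spark created when reading the CSV; expect 2 here.
print(covid_vaccination.rdd.getNumPartitions())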

2.3 Run the rest of the code to group the vaccination records by state, sum how many people are fully vaccinated in each state, order by that total descending, and show the top 5 states.

covid_vaccination.createOrReplaceTempView("covid_vaccination_tbl")
maxSql = spark.sql("""
SELECT Recip_State, sum(Series_Complete_Yes) as recip_total
FROM covid_vaccination_tbl
GROUP BY Recip_State
ORDER BY sum(Series_Complete_Yes) DESC
LIMIT 60
""")
maxSql.cache()
maxSql.createOrReplaceTempView("maxSql_tbl")
selectSql = spark.sql("""
SELECT Recip_State, recip_total
FROM maxSql_tbl
limit 5
""")
selectSql.show()
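
For reference, the same aggregation written with the DataFrame API should compile to the same physical plan and therefore the same three-stage job described below (a sketch, not part of the original example):

from pyspark.sql import functions as F

# DataFrame-API equivalent of the maxSql query above.
top_states = (covid_vaccination
    .groupBy("Recip_State")
    .agg(F.sum("Series_Complete_Yes").alias("recip_total"))
    .orderBy(F.col("recip_total").desc())
    .limit(60))
top_states.show(5)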

Job 19 is triggered, and this job has three stages, 35, 36 and 37, as shown below:

I have drawn a diagram to describe the partitioning and shuffling in this job:

Data partitioning/shuffling in Job 19
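
The shuffle boundaries in the diagram also show up in the physical plan, where each Exchange operator marks a shuffle (a quick check; exact operators vary with caching and Spark version):

# Print the physical plan; Exchange operators correspond to the shuffle
# boundaries between the stages discussed below.
maxSql.explain()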

2.3.1 Stage 35

Spark reads the whole 147.5MB data file into memory using 2 tasks for the two partitions, one task/partition pair on each executor. Task 55 on node kale reads in 75.8MB and task 56 on node onion reads in 71.7MB. Those numbers are about right compared to our earlier calculation.

In stage 35, the aggregation runs as a narrow, per-partition step: “group by Recip_State” and “apply ‘sum’ to each state category”. We can verify there are 60 distinct states, so the shuffle write results show 60 records in the screenshot above.
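
One way to verify the 60 distinct values is a quick query against the temp view we registered earlier (a minimal check; it runs as its own small job):

# Count distinct Recip_State values; expect 60 for this dataset.
spark.sql("SELECT COUNT(DISTINCT Recip_State) AS n_states FROM covid_vaccination_tbl").show()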

Note that up to this point, the “groupBy” and “sum” aggregations apply only to the data within each partition. So each state, like “CA”, has 1 aggregated record in executor 1 and 1 aggregated record in executor 2. Those two records per state will be combined in the next stage.

The shuffle write ends up creating shuffle files in a local directory on each of the two nodes. The local directory can be found in the executor log. The one on node kale looks like:

[hadoop@kale1 ~]$ vi /home/hadoop/hadoop/logs/userlogs/application_1637859265233_0008/container_1637859265233_0008_01_000002/stderr
...
21/11/28 15:02:08 INFO DiskBlockManager: Created local directory at /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1637859265233_0008/blockmgr-8a73e634-787c-4537-bdf7-d259691dd634
[hadoop@kale1 ~]$ ls /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1637859265233_0008/blockmgr-8a73e634-787c-4537-bdf7-d259691dd634
00 02 04 06 08 0a 0c 0e 10 12 14 16 19 1b 1d 1f 21 23 25 27 29 2e 32 35 38 3a 3c 3e
01 03 05 07 09 0b 0d 0f 11 13 15 18 1a 1c 1e 20 22 24 26 28 2a 30 33 36 39 3b 3d 3f

If you count the data pieces under the blockmgr-xxx directory, there are 60. Those are the shuffle write results on node kale at this moment. You can see the same number, 60 “records”, in the Spark UI above as well.

2.3.2 Stage 36

Executor 1 on kale talks to the AM/driver to find out which data it needs to read. The answer is that it needs to read data both locally and from the remote node onion, so it can add up the two aggregated records per state from the two partitions. What we can see is that executor 1 shuffle reads 50 records and executor 2 reads 70 records.

So “groupBy” and “sum” are applied again on top of that, resulting in 25 records in executor 1 and 35 records in executor 2. Both executors then shuffle write the data.

For a little more detail about the shuffle read in this stage, take a look at this post.

2.3.3 Stage 37

Spark is smart enough to know that only 1 executor is needed to return the final results to the Application Master/driver. So executor 1 on node kale reads all 60 records, from local (25 records) and from the remote node onion (35 records). It applies the “orderBy” and “limit” before returning the results to finish this job.

Finally, all the shuffle details above are saved in the executor log files. You can find them in the Hadoop logs directory:

/home/hadoop/hadoop/logs/userlogs/application_1637859265233_0008/

Happy Reading!

