Run the Simple End-to-End Example from the Book “Spark: The Definitive Guide” Using COVID-19 Vaccination Data

Feng Li
7 min read · Nov 29, 2021


In chapter 2, “A Gentle Introduction to Spark”, of the book by Bill Chambers and Matei Zaharia, the authors provide an end-to-end example to introduce basic Spark concepts. Now we’ll run this very simple code in our lab environment described in “Run pySpark job on YARN cluster using JEG” (attached below). The purpose is to see how Spark splits data into partitions and how data shuffling happens under the hood.

The flight data used in the book’s end-to-end example is only 7KB, which is too small to show Spark’s partition/shuffle behaviors clearly, so we’ll use a bigger data set (147.5MB) from Kaggle called “COVID-19 Vaccinations in the United States, County”. We also changed the example code a little bit for this test.

The code we’ll run is as follows:

covid_vaccination = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv")
spark.conf.set("spark.sql.shuffle.partitions", "2")
covid_vaccination.createOrReplaceTempView("covid_vaccination_tbl")
maxSql = spark.sql("""
SELECT Recip_State, sum(Series_Complete_Yes) as recip_total
FROM covid_vaccination_tbl
GROUP BY Recip_State
ORDER BY sum(Series_Complete_Yes) DESC
LIMIT 60
""")
maxSql.cache()
maxSql.createOrReplaceTempView("maxSql_tbl")
selectSql = spark.sql("""
SELECT Recip_State, recip_total
FROM maxSql_tbl
limit 5
""")
selectSql.show()

The lab environment is as follows:

1 A Spark application on YARN

As we saw in “Run pySpark job on YARN cluster using JEG”, when we start a pySpark kernel from the notebook UI, JEG submits a Spark application to the YARN cluster.

JEG UI

We can see this application on the YARN Resource Manager UI at fig1:8088. Scroll to the right to find the “Tracking UI” column, which links to the Application UI where job, stage, and task information for this application can be found. The JEG UI also provides a “Spark UI” link in the screenshot above which leads to the same Application UI.

YARN Resource Manager — RUNNING Applications
Application UI

2 Run the code

JEG UI

2.1 Run the first cell to read the vaccination data from a CSV file into a Spark DataFrame.

covid_vaccination = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("file:///shared/data/COVID-19_Vaccinations_in_the_United_States_County.csv")

We’ll see 2 jobs related to this first cell: job 17 and job 18.

Spark Application UI

Job 17 is a lightweight job in which Spark quickly scans just 1 record of the data file, while job 18 scans the whole data file to infer the schema, such as the column types. That’s because we set “inferSchema” to “true” in spark.read.

Note: for better performance, don’t set “inferSchema” to “true” if the dataset is large, since the extra full scan takes a lot of time. Instead, you can explicitly tell Spark the data schema in the spark.read command, for example:

from pyspark.sql.types import StructType, StructField, LongType

spark.read \
    .schema(StructType([StructField("id", LongType(), False)])) \
    .csv("test.csv")
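
For a wide file like ours, a practical way to get that explicit schema is to infer it once during development, print it, and paste the result into the job so later runs skip the inference scan (a small sketch using the covid_vaccination DataFrame we just read):

# Inspect the schema Spark inferred for the vaccination file; the printed
# StructType can be copied into .schema(...) so future reads skip inference.
covid_vaccination.printSchema()     # human-readable tree
print(covid_vaccination.schema)     # StructType(...) form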

2.2 Run the second cell to set the number of partitions for the shuffle write output.

spark.conf.set("spark.sql.shuffle.partitions", "2")

By default, a Spark shuffle write creates 200 partitions. In our lab environment there are 2 executors on two nodes, kale and onion, so in the second cell we set “spark.sql.shuffle.partitions” to 2. We’d expect Spark to handle 1 partition on each executor node.

Another parameter, “spark.executor.cores 1”, was set in spark-defaults.conf on each executor node before starting the JEG pySpark kernel, so each executor starts only 1 task. Otherwise, only 1 executor would be seen, starting 2 tasks to handle the 2 partitions.

[root@kale1 ~] vi /opt/spark/conf/spark-defaults.conf
spark.executor.cores 1

This setting is only for monitoring the shuffle behavior. In some production cases, multiple tasks in one executor can achieve better performance because they can reuse data in the same executor memory.
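
Both settings can be sanity-checked from the notebook session itself, for example with something like this (a minimal sketch; the expected values reflect the configuration described above):

# Verify the two settings from the notebook session.
print(spark.conf.get("spark.sql.shuffle.partitions"))                   # expect '2'
print(spark.sparkContext.getConf().get("spark.executor.cores", "1"))    # expect '1', from spark-defaults.conf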

But “spark.sql.shuffle.partitions” only controls the number of partitions after a shuffle. When Spark reads a data file, how many partitions does it create to hold the data in memory? In the blog post “Building Partitions For Processing Data Files in Apache Spark”, Ajay Gupta gives a very good explanation.

From that blog, three parameters and one formula determine how Spark splits an input data file:

(a) spark.default.parallelism (default: total no. of CPU cores)
(b) spark.sql.files.maxPartitionBytes (default: 128 MB)
(c) spark.sql.files.openCostInBytes (default: 4 MB)

maxSplitBytes = min(maxPartitionBytes, bytesPerCore)
where bytesPerCore =
(sum of sizes of all data files + no. of files * openCostInBytes) / default.parallelism

In our lab environment:

No. of CPU cores: 2
maxPartitionBytes: default is 128MB
openCostInBytes: default is 4MB

Our data file COVID-19_Vaccinations_in_the_United_States_County.csv is 147.5MB. Using the above formula, Spark would split this data file as follows:

maxSplitBytes = min(128MB, (147.5MB + 1 * 4MB) / 2) = 75.75MB

So we’d expect Spark to use 2 partitions for our data file: one of about 75.75MB, the other of about 147.5 - 75.75 = 71.75MB. We’ll see if this is true later on.

As expected, running this second cell doesn’t trigger any job.
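
If you want to check right away, the same arithmetic takes a few lines of Python, and the DataFrame from the first cell can report how many input partitions Spark actually created (a sketch; the file size is the figure quoted above):

# Recompute maxSplitBytes for our file (sizes in MB).
file_size_mb = 147.5
open_cost_mb = 4.0
cores = 2
max_partition_mb = 128.0

bytes_per_core_mb = (file_size_mb + 1 * open_cost_mb) / cores   # 75.75
max_split_mb = min(max_partition_mb, bytes_per_core_mb)         # 75.75
print(max_split_mb)

# Number of input partitions Spark created when reading the CSV; expect 2 here.
print(covid_vaccination.rdd.getNumPartitions())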

2.3 Run the rest of the code to group the vaccination records by state, sum how many people are fully vaccinated in each state, order by that total descending, and show the top 5 states.

covid_vaccination.createOrReplaceTempView("covid_vaccination_tbl")
maxSql = spark.sql("""
SELECT Recip_State, sum(Series_Complete_Yes) as recip_total
FROM covid_vaccination_tbl
GROUP BY Recip_State
ORDER BY sum(Series_Complete_Yes) DESC
LIMIT 60
""")
maxSql.cache()
maxSql.createOrReplaceTempView("maxSql_tbl")
selectSql = spark.sql("""
SELECT Recip_State, recip_total
FROM maxSql_tbl
limit 5
""")
selectSql.show()
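
For reference, the same aggregation written with the DataFrame API should compile to the same physical plan and therefore the same three-stage job described below (a sketch, not part of the original example):

from pyspark.sql import functions as F

# DataFrame-API equivalent of the maxSql query above.
top_states = (covid_vaccination
    .groupBy("Recip_State")
    .agg(F.sum("Series_Complete_Yes").alias("recip_total"))
    .orderBy(F.col("recip_total").desc())
    .limit(60))
top_states.show(5)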

Job 19 is triggered, and this job has three stages, 35, 36 and 37, as shown below:

I have drawn a diagram to describe the partitioning and shuffling in this job:

Data partitioning/shuffling in Job 19
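
The shuffle boundaries in the diagram also show up in the physical plan, where each Exchange operator marks a shuffle (a quick check; exact operators vary with caching and Spark version):

# Print the physical plan; Exchange operators correspond to the shuffle
# boundaries between the stages discussed below.
maxSql.explain()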

2.3.1 Stage 35

Spark reads the whole 147.5MB data file into memory using 2 tasks for the two partitions, one task/partition pair on each executor. Task 55 on node kale reads in 75.8MB and task 56 on node onion reads in 71.7MB. Those numbers are about right compared to our earlier calculation.

In stage 35, the aggregation runs as a narrow, per-partition step: “group by Recip_State” and “apply ‘sum’ to each state category”. We can verify there are 60 distinct states, so the shuffle write results show 60 records in the screenshot above.
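
One way to verify the 60 distinct values is a quick query against the temp view we registered earlier (a minimal check; it runs as its own small job):

# Count distinct Recip_State values; expect 60 for this dataset.
spark.sql("SELECT COUNT(DISTINCT Recip_State) AS n_states FROM covid_vaccination_tbl").show()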

Note that up to this point, the “groupBy” and “sum” aggregations apply only to the data within each partition. So each state, like “CA”, has 1 aggregated record in executor 1 and 1 aggregated record in executor 2. Those two records per state will be combined in the next stage.

The shuffle write ends up creating shuffle files in a local directory on each of the two nodes. The local directory can be found in the executor log. The one on node kale looks like:

[hadoop@kale1 ~]$ vi /home/hadoop/hadoop/logs/userlogs/application_1637859265233_0008/container_1637859265233_0008_01_000002/stderr
...
21/11/28 15:02:08 INFO DiskBlockManager: Created local directory at /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1637859265233_0008/blockmgr-8a73e634-787c-4537-bdf7-d259691dd634
[hadoop@kale1 ~]$ ls /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1637859265233_0008/blockmgr-8a73e634-787c-4537-bdf7-d259691dd634
00 02 04 06 08 0a 0c 0e 10 12 14 16 19 1b 1d 1f 21 23 25 27 29 2e 32 35 38 3a 3c 3e
01 03 05 07 09 0b 0d 0f 11 13 15 18 1a 1c 1e 20 22 24 26 28 2a 30 33 36 39 3b 3d 3f

If you count the data pieces under the blockmgr-xxx directory, there are 60. Those are the shuffle write results on node kale at this moment. You can see the same number, 60 “records”, in the Spark UI above as well.

2.3.2 Stage 36

Executor 1 on kale talks to the AM/driver to find out which data it needs to read. The answer is that it needs to read data both locally and from the remote node onion, so it can add up the two aggregated records per state from the two partitions. What we can see is that executor 1 shuffle reads 50 records and executor 2 reads 70 records.

So “groupBy” and “sum” are applied again on top of that, resulting in 25 records in executor 1 and 35 records in executor 2. Both executors then shuffle write the data.

For a little more detail about the shuffle read in this stage, take a look at this post.

2.3.3 Stage 37

Spark is smart enough to know that only 1 executor is needed to return the final results to the Application Master/driver. So executor 1 on node kale reads all 60 records, from local (25 records) and from the remote node onion (35 records). It applies the “orderBy” and “limit” before returning the results to finish this job.

Finally, all the shuffle details above are saved in the executor log files. You can find them in the Hadoop logs directory:

/home/hadoop/hadoop/logs/userlogs/application_1637859265233_0008/

Happy Reading!

