Apache Spark Is Fun

Nick Rafferty · Published in The Startup · Sep 7, 2020

A hands-on tutorial into Spark DataFrames.


If you’ve ever attempted to learn Data Engineering, it’s likely that you were quickly overwhelmed. Whether it’s complicated terminology or tutorials that assume extensive programming experience, Data Engineering is typically far from accessible.

This guide is meant not only to be accessible to beginners but also to be fun along the way! You’ll get your hands on a project and dive into the concept of a Spark DataFrame.

Before we get started, I have one ask of you. Don’t skim this guide! I can promise you that you’ll get much more out of this if you dive into it fully and attempt the exercises yourself.

So without further ado, let’s jump in!

Let’s get your computer set up to run some Spark programs.

For this (and future) tutorials, you’ll need to have IntelliJ, Git, Java, and Apache Spark installed. Follow my step by step guide for the installation walkthrough: Setting up IntelliJ, Git, Java, and Apache Spark

Now that you’re through the installations, the fun begins!

What is Apache Spark?

Beehive filled with worker bees and a queen bee
Photo by Damien TUPINIER on Unsplash

My brain works best with analogies. So hang tight. Think of a beehive. You have a single queen and hundreds or thousands of worker bees. They have very distinct roles. The queen is largely responsible for the ‘brains’ of the entire operation, conducting the orchestra of worker bees who are fulfilling the hundreds of tasks that need to be accomplished. The worker bees are the executors, putting in the work required to accomplish those tasks.

What do Bees have to do with Apache Spark?

Good question; they’ll come in handy soon. The formal definition of Apache Spark is that it is a general-purpose distributed data processing engine. It is also known as a cluster computing framework for large-scale data processing. Let’s break that down.

  • Distributed Data: Spark is built to handle extremely large-scale data. The sheer amount of data being loaded into a Spark application is enough to overwhelm almost any single computer. To handle that, Spark uses multiple computers (called a cluster) that process the tasks required for a job and work together to produce the desired output. This is where the bee analogy comes in. Let’s start with a diagram.
In this diagram, the queen bee is an analogy for the Spark Driver. The worker bees are an analogy for the Spark Executors.

Spark Driver: The Queen Bee of the operation. The Spark Driver is responsible for generating the Spark Context. The Spark Context is extremely important since it is the entryway into all of Spark’s functionality. Using the Cluster Resource Manager (typically YARN, Mesos, or Standalone), the Driver will access and divide work between the cluster of Spark Executors (worker nodes). The Spark Driver is where the main method is run, meaning that any written program will first interact with the driver before being sent in the form of tasks to the worker nodes.

Spark Executors: The worker bees. The executors are responsible for completing the tasks assigned to them by the driver with the help of the Cluster Resource Manager. As they carry out those tasks, they store the results in memory, referred to as a cache. If any one of these nodes crashes, the tasks assigned to that executor are reassigned to another node. Every node can have up to one executor per core. Results are then returned to the Spark Driver upon completion.
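
To make the driver/executor split concrete, here is a minimal sketch. It is not part of the tutorial repository (the object name is made up), and you don’t need to run it yet; it just previews how a program written against the driver turns into parallel tasks on the executors.

import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not from the tutorial repo.
object DriverExecutorSketch extends App {
  // The driver runs this main method and creates the Spark Context via the session.
  val spark = SparkSession.builder()
    .appName("DriverExecutorSketch")
    .master("local[*]") // on one machine: the driver plus executor threads, one per core
    .getOrCreate()

  // The driver only records a plan here; no data is processed yet.
  val numbers = spark.range(0, 1000000)

  // Triggering an action splits the work into tasks that run in parallel on the executors.
  val total = numbers.selectExpr("sum(id) AS total").first().getLong(0)
  println(s"Summed ${numbers.count()} numbers across ${numbers.rdd.getNumPartitions} partitions: $total")

  spark.stop()
}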

We’ve made it through a lot. Let’s start to get a little bit more practical. Don’t worry, the hands-on piece is coming soon!

Key Concept: Spark DataFrames

One of the most important aspects of Spark that we’ll be diving into in this tutorial is the DataFrames API.

You can think of DataFrames as distributed spreadsheets with rows and columns. The spreadsheet is split across different nodes. Each node contains the schema of the DataFrame and some of the rows.

The schema in a DataFrame is a list describing the column names and column types. A schema can hold an arbitrary number of columns, and every row on every node conforms to exactly the same schema. Spark uses a concept called partitioning to split the data into chunks (partitions) that are distributed across the nodes in the cluster.
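
To make partitioning a little more concrete, here is a minimal, hypothetical sketch; df stands in for any DataFrame you have already loaded.

// df is a stand-in for any DataFrame you have already loaded.
println(s"Spark split this DataFrame into ${df.rdd.getNumPartitions} partitions")

// repartition returns a NEW DataFrame spread across 8 partitions; df itself is untouched.
val redistributed = df.repartition(8)
println(s"The new DataFrame has ${redistributed.rdd.getNumPartitions} partitions")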

A few concepts that you should know when working with DataFrames:

  • They are immutable. Once you’ve created a DataFrame, it cannot be changed.
  • DataFrames can handle a wide range of data formats. For this tutorial, we’ll start with JSON data.
  • With the help of Spark SQL, you can query a DataFrame as you would any relational database (see the sketch after this list).
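
Here is a minimal sketch of those ideas in code. It assumes an active SparkSession named spark, the wildcard import of org.apache.spark.sql.functions, and a hypothetical DataFrame named reviewsDF; we’ll build the real thing shortly.

// reviewsDF is a hypothetical, already-loaded DataFrame; spark is the active SparkSession.

// Immutability: filter does not modify reviewsDF, it returns a brand new DataFrame.
val fiveStarReviews = reviewsDF.filter(col("overall") === 5.0)

// Spark SQL: register the DataFrame as a temporary view and query it like a table.
reviewsDF.createOrReplaceTempView("reviews")
val goodSummaries = spark.sql("SELECT summary, overall FROM reviews WHERE overall >= 4.0")
goodSummaries.show()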

Try Your First DataFrame!

Now for the fun part! Open up a new tab and navigate to this GitHub repository: https://github.com/nickrafferty78/Apache-Spark-Is-Fun.

By now I’m assuming you’ve created a GitHub account yourself. If you haven’t, you’ll need to do that first before moving on to the next step.

First, fork the repository.

Fork the Apache Spark is Fun repository on Github

Next, navigate to your own GitHub and open up the repository that you just forked. (If these concepts are unfamiliar to you, I’d encourage you to read more here: https://guides.github.com/introduction/git-handbook/.) Click the ‘Code’ button and copy the URL as shown below. Next, type this command into your terminal (make sure to replace ‘your-github-username’ with your own username):

git clone https://github.com/your-github-username/Apache-Spark-Is-Fun.git

Now let’s get IntelliJ started. Open up IntelliJ and open the project that you just cloned to your computer. Navigate to the Tutorial class under src/main/scala/Tutorial.scala.

Creating the Spark Session

There is a little starter code to help this app get set up. It should look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.LongType
import Utilities._

object Tutorial extends App {

  Utilities.setupLogging()

}

Underneath Utilities.setupLogging(), copy and paste this code:

val spark = SparkSession
  .builder()
  .appName("InstrumentReviews")
  .master("local[*]")
  .getOrCreate()

Congrats, you’ve written your first Spark Session! Let’s break that down.

As we mentioned above, the Spark Session is your entry point to begin using DataFrames. The main things you need to know are:

  • The .appName method names your Spark Session. Since this section uses the InstrumentReviews.json dataset, I am going to name it ‘InstrumentReviews’.
  • The .master method tells Spark which URL to connect to. Since we are running this on your own personal computer, ‘local’ tells Spark to use your machine instead of an external cluster. The ‘[*]’ after local is an interesting piece you might not know. One of the core tenets of Spark is distributing data efficiently, and your computer likely has more than one core. Each core can process tasks independently of the others, so by writing [*] you are telling Spark to run on every core available on your machine (see the sketch after this list).
  • Finally, .getOrCreate() creates your session.
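
If you ever want to limit Spark to a specific number of cores instead of using every one, pass a number inside the brackets. A minimal sketch (the value name and app name here are just examples, not part of the tutorial):

// "local[2]" runs the driver and executor threads on exactly 2 cores of your machine.
val sparkTwoCores = SparkSession
  .builder()
  .appName("InstrumentReviewsTwoCores") // example name, not part of the tutorial
  .master("local[2]")
  .getOrCreate()

println(sparkTwoCores.sparkContext.defaultParallelism) // prints 2 when this is the first session in the JVM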

Reading Data

We’re going to be working with musical instrument review data in this example. Before reading the data, it’s always good to get the lay of the land and see what you’ll be working with. Navigate to /data/InstrumentReviews.json and you should see an extremely large file filled with instrument review data. Here is an example entry:

{
  "reviewerId": "A2IBPI20UZIR0U",
  "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...",
  "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
  "overall": 5.0,
  "summary": "good"
}

Now back in Tutorial.scala, underneath the Spark Session you created, copy and paste this code:

val firstDataFrame = spark
  .read
  .format("json")
  .option("inferSchema", "true")
  .load("data/InstrumentReviews.json")

You’ve already created the variable spark above, so in this code block you are using that Spark session to read the incoming data.

  • The .format method specifies the structure of the incoming data. Our musical instrument review data is in JSON.
  • .option("inferSchema", "true") tells Spark to infer the schema of the musical instrument review data without you explicitly defining it.
  • .load gives Spark the path to the data file. (A shorthand for this whole chain is shown after this list.)
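
As a side note, the DataFrameReader also has a json shortcut that collapses the format and load calls into one. A minimal equivalent sketch:

// Equivalent shorthand: .json() implies format("json"), and the JSON reader
// infers the schema automatically.
val sameDataFrame = spark.read.json("data/InstrumentReviews.json")

We’ll stick with the explicit chain above for this tutorial, since it makes each step easy to see.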

Let’s test if that will work!

Underneath your code for the firstDataFrame, write these two lines:

firstDataFrame.show()
firstDataFrame.printSchema()

If all went well you should see something like this:

Console print out of Apache Spark DataFrame.
root
|-- overall: double (nullable = true)
|-- reviewText: string (nullable = true)
|-- reviewerId: string (nullable = true)
|-- reviewerName: string (nullable = true)
|-- summary: string (nullable = true)

If you see that, congrats on writing your first DataFrame!

Wait, what does all this mean??

Remember what we said in the section above: a DataFrame is a distributed collection of rows that conform to this schema.

This is your schema:

root
|-- overall: double (nullable = true)
|-- reviewText: string (nullable = true)
|-- reviewerId: string (nullable = true)
|-- reviewerName: string (nullable = true)
|-- summary: string (nullable = true)

When you told Spark to infer the schema, this is what it came up with based on the structure of the file InstrumentReviews.json.

But be careful: you don’t want to get caught with Spark inferring the wrong schema. So let’s write the schema manually ourselves.

First, remove the line .option("inferSchema", "true") from your firstDataFrame variable.

Now we are going to write our first schema manually. In Spark, you can create a StructType that holds an array of StructFields.

A what?!

In Spark, a StructType is an object that is used to define a schema. It is populated with StructFields, each of which defines a column’s name, its type, and whether it can be null. So let’s see this in action.

Go to the file /data/InstrumentReviews.json. Here is the sample record we’ll base our schema on:

{
  "reviewerId": "A2IBPI20UZIR0U",
  "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...",
  "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
  "overall": 5.0,
  "summary": "good"
}

OK, so we know that we need to name the first column ‘reviewerId’ and that it will be a StringType.
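
In code, that one column is a single StructField; here is the pattern we’re about to repeat for every column:

// Column name, Spark SQL type, and whether the value is allowed to be null.
StructField("reviewerId", StringType, nullable = true)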

Let’s start writing our schema based on that. Go to Utilities.scala and let’s manually create our Schema.

Again you should see starter code like this:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{ArrayType, DoubleType, IntegerType, LongType, StringType, StructField, StructType}

object Utilities {

  def setupLogging() = {
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
  }

}

Underneath and outside of the setupLogging method, declare a variable.

val instrumentReviewsSchema = StructType(Array(
  /* Your code goes here */
))

Now try to complete this part without checking the solution. What would you fill this with?

You might have been a little off on the syntax, but hopefully you were able to guess based on what was explained above. Here is what the entire instrument review schema should look like:

val instrumentReviewsSchema = StructType(Array(
  StructField("reviewerId", StringType, nullable = true),
  StructField("reviewerName", StringType, nullable = true),
  StructField("reviewText", StringType, nullable = true),
  StructField("overall", DoubleType, nullable = true),
  StructField("summary", StringType, nullable = true)
))

You’ll need to tell Spark to use the schema you just created. So go back to Tutorial.scala and replace .option("inferSchema", "true") with .schema(instrumentReviewsSchema), as shown below.
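
After that change, firstDataFrame should look something like this (instrumentReviewsSchema is visible in Tutorial.scala thanks to the import Utilities._ line in the starter code):

val firstDataFrame = spark
  .read
  .format("json")
  .schema(instrumentReviewsSchema)
  .load("data/InstrumentReviews.json")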

Let’s run it!

Try it yourself: Exercise #1

Now it’s time for your first exercise. Navigate to src/main/scala/ExerciseOne.scala. You’ll complete the next steps in that file.

  1. Go to data/yelp.json and start to understand the structure of the JSON file coming in.
  2. In Utilities.scala, write a schema for yelp.json. You’ll have a challenge that we haven’t gone over yet with objects inside of a JSON field. I encourage you to research what that will look like.
  3. Create a spark session. You are welcome to call this whatever you like. In my solution I will call it “YelpReviews”
  4. Read the DataFrame from the data/yelp.json file. Make sure to include the schema you wrote in #2.
  5. Show and print schema for the DataFrame you wrote.

That’s it! This will get you started on your first DataFrame. Don’t continue reading until you’ve completed the exercise!

If you get stuck, refer to the notes above or try Googling the problem before checking the solution below.

Spoiler alert! Here are the answers to the exercises.

All of these are written out in AnswerKey.scala.

  1. This structure is slightly more complicated than the musical instrument data that we’ve worked with. I hope that was a bit of a challenge for you. Here is what the data looks like:
{
  "name": "Peace of Mind and Body Massage",
  "city": "Akron",
  "stars": 5.0,
  "review_count": 3,
  "state": "OH",
  "hours": {
    "Friday": "9:0-17:0",
    "Monday": "9:0-17:0",
    "Saturday": "9:0-20:0",
    "Tuesday": "9:0-17:0",
    "Wednesday": "9:0-17:0"
  }
}

2. The big challenge with this data set is the nested JSON object within “hours”. To handle this, you’ll need to do what we’ve already done and write a StructType holding an array of StructFields inside the “hours” StructField. Here is an example:

StructField("hours", StructType(
Array(
StructField("Monday", StringType),
StructField("Tuesday", StringType),
StructField("Wednesday", StringType),
StructField("Thursday", StringType),
StructField("Friday", StringType),
StructField("Saturday", StringType),
StructField("Sunday", StringType)
)
))

Here is the entire schema:

val yelpSchema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("city", StringType, nullable = true),
  StructField("stars", DoubleType, nullable = true),
  StructField("review_count", IntegerType, nullable = true),
  StructField("hours", StructType(
    Array(
      StructField("Monday", StringType),
      StructField("Tuesday", StringType),
      StructField("Wednesday", StringType),
      StructField("Thursday", StringType),
      StructField("Friday", StringType),
      StructField("Saturday", StringType),
      StructField("Sunday", StringType)
    )
  ))
))

3. The Spark Session is very similar to the one we wrote above. Your answer should have looked like this:

val spark = SparkSession
  .builder()
  .appName("YelpReviews")
  .master("local[*]")
  .getOrCreate()

4. With the schema you’ve defined, read the yelp.json file.

val yelpDF = spark.read
  .schema(yelpSchema)
  .format("json")
  .load("data/yelp.json")

5. Finally, show the DataFrame and print its schema.

yelpDF.show()
yelpDF.printSchema()

Your output should look something like this:

Console print out of Spark program with the Yelp Dataset.
root
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- stars: double (nullable = true)
|-- review_count: integer (nullable = true)
|-- hours: struct (nullable = true)
| |-- Monday: string (nullable = true)
| |-- Tuesday: string (nullable = true)
| |-- Wednesday: string (nullable = true)
| |-- Thursday: string (nullable = true)
| |-- Friday: string (nullable = true)
| |-- Saturday: string (nullable = true)
| |-- Sunday: string (nullable = true)

And that’s it!

Huge congrats! You have successfully made it through your first exercise in Spark. You now know how to create a Spark session, write a schema, and read a JSON file in Spark. Not to mention, you’ve set yourself up to learn a ton more about Spark. There is much more to learn, so get ready for part two, where we continue into writing DataFrames and performing aggregations.
