How does PySpark work? — step by step (with pictures)

Obi R. · Published in Analytics Vidhya · Jun 3, 2020 · 15 min read

Do you find yourself talking about Spark without really understanding all the words you’re using? Do you feel like you don’t have a clear mental model of how Spark is used in Python?

This guide is meant to help you get up to speed in a couple of minutes.

(If you don’t know what Spark is, best start here, then play around with the code, and come back to this article later).

Key questions

If you’ve worked with data for a while, especially Big Data, you probably know Spark is a pretty great tool. And if you’re like me, and you use Python for pretty much everything, you’ve probably come across PySpark — aka the Python API for Spark.

But what is PySpark, actually?

And isn’t Python kind of slow?

Why would anyone do Big Data processing in this language?

What actually happens to my data when I use PySpark?

We’ll answer those questions in just a moment.

Before we jump into the nitty-gritty, though — let’s have a look at some core Data Engineering concepts. I promise, it will help.

Key mental models

(feel free to skip this entire section if this is all old hat to you)

Let’s have a brief look at the following terms:

  • numbers every programmer should know
  • big data
  • thrashing
  • parallel computing
  • inter-process communication
  • JVM

You’ll want to understand each of these concepts before you deep-dive into (Py)Spark.

Why?

You want your knowledge to form a solid tree, not a heap of leaves. So we’re starting with the trunk and the thick branches: the core concepts. This will make learning the finer-grained details much more fun and manageable.

Also, you don’t want this core knowledge to evaporate after 2 weeks. If you want to remember what you’ve read here, you need to use spaced repetition. Anki is a great tool to quickly get you going with that.

Here’s a ready-made Anki deck for you; it will help you firm up what you’ve learned from this article.

Numbers every programmer should know

Speed: processor > memory > disk > network

Several years ago, a very clever man by the name of Peter Norvig (Research Director at Google) posted online a table of numbers describing the speed of various data-related operations that take place on a computer.

These numbers came to be known as the numbers every programmer should know and they happen to be pretty important for understanding Data Engineering.

Without getting into too much detail, here’s the gist:

Processors are extremely fast. Memory is also quite fast. Disks are slow-ish. Accessing other computers in a network is even slower.

An overworked computer is extremely slow.*

* (my own words, not Peter Norvig’s)

This is a (slightly dated) analogy to help you understand what “fast” and “slow” means on a computer. Assume you’re in a hotel room in Seattle, USA, and that you need to walk somewhere to get your data.
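If you want rough numbers to go with the analogy, here’s an order-of-magnitude version of Norvig’s table as a small Python snippet. The exact figures are dated and drift with every hardware generation, so treat them as approximations; it’s the ratios that matter:

# rough, order-of-magnitude latencies (in nanoseconds) from the classic table
latency_ns = {
    "read from main memory": 100,
    "read 1 MB sequentially from memory": 250_000,
    "round trip inside one datacenter": 500_000,
    "read 1 MB sequentially from an SSD": 1_000_000,
    "disk seek": 10_000_000,
    "send a packet across the world and back": 150_000_000,
}

for operation, ns in sorted(latency_ns.items(), key=lambda item: item[1]):
    print(f"{operation:<42} ~{ns // 100:>9,}x the cost of a memory read")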

Big data

Difficult to process on your laptop

“Big data” is an amount of data that can’t be comfortably processed on your computer. It’s exactly the amount of data that might make your PC start working really slowly.

You know you’re dealing with big data when it’s faster to send some part of it to another computer over the network, and divide the computation between 2 processes on two different computers.

Parallel computing

Multiple computers crunching numbers together

It’s when two processes (programmes) are crunching through different parts of the same dataset at the same time. It’s often faster than one process trying to do all the work.

Distributed computing is similar, but the processes live on different computers. Spark very much takes advantage of distributed computing.
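Here’s a tiny, plain-Python sketch of the idea on a single machine: a made-up word-counting task, split across four worker processes (no Spark involved yet):

from multiprocessing import Pool

def count_words(chunk):
    # each worker process counts the words in its own slice of the dataset
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["the quick brown fox"] * 1_000_000
    chunks = [lines[i::4] for i in range(4)]        # split the dataset into 4 parts
    with Pool(processes=4) as pool:                 # 4 processes crunch the parts at the same time
        partial_counts = pool.map(count_words, chunks)
    print(sum(partial_counts))                      # combine the partial results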

Thrashing

Too much data will choke your PC

Your computer can slow down massively if some programme is using too much memory.

Processes need random-access memory (RAM) to run fast. When you start a process (programme), the operating system will start assigning it memory. But what if there’s not enough memory to fit all of your data in it?

If your programme is trying to process more data than it has memory for, it will start spending a lot of time writing data to and reading it back from the disk instead. This is pretty slow.

Imagine you’re sitting a Maths exam and you’re trying to solve a long equation. Suddenly, you run out of space on your exam sheet, and have to start scratching your notes onto a clay tablet! This will take a lot of time, and slow you down. Similarly, your application is doing less computation because it’s busy doing all the housekeeping!

To make matters worse, your programme is not talking to the processor a lot anymore (it’s busy scratching on the clay tablet, aka your hard disk). That makes the processor think you don’t need it as much and it will start accepting other jobs, which puts even more strain on the available resources.

The computation will probably finish at some point, but it will take a LOT of time. You want to avoid this kind of situation at all costs. It might be faster to send some of the data to another computer, even though sending things over a network is relatively slow (in the world of computers).

Inter-process communication (IPC)

It’s when one program has to talk to another

How do programs talk to each other? One process isn’t allowed to create an object in memory (for example, a Python dictionary) and then pass that bit of memory directly to another process.

Two programs are not allowed to access each other’s assigned memory space, so they usually have to “write stuff down” somewhere else, in a place that is accessible to both processes.

In short — processes leave messages for one another. It can be a file. It can be data sent into a network socket (e.g. think of what happens when you send a web form over the internet).

For example, in Python, two different processes can save data for one another as a bunch of bytes on the hard disk (Python calls this serialization “pickling”).
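Here’s a minimal sketch of that idea (the file path and the message contents are made up):

import pickle

# process A "writes the message down" somewhere both processes can reach
message = {"rows_processed": 1_000_000, "status": "done"}
with open("/tmp/message.pkl", "wb") as f:
    pickle.dump(message, f)             # serialize the dict into bytes on disk

# later, process B picks the message up and rebuilds its own copy of the object
with open("/tmp/message.pkl", "rb") as f:
    received = pickle.load(f)           # deserialize; B never touches A's memory directly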

In the world of computers, IPC is something that is rather slow.

JVM — Java Virtual Machine

It’s the way Java processes run on your computer

Java has its own special way of translating code (text) into instructions for the computer — the JVM. It makes Java pretty fast and portable. For that reason, many languages have been designed to run in a JVM — Java itself, Scala, Clojure, and even fun experiments such as Jython.

For simplicity, whenever you see the word JVM in this article, just think “Java”.

Now that we’ve covered the relevant CS concepts, let’s relate them to how PySpark works, and to its strengths and weaknesses.

PySpark step-by-step

What happens when you type `pyspark` in your Terminal?

This command launches a Python programme you can interact with. It also launches… a JVM programme!

You can see the running Python and JVM processes by using ps aux in another terminal window.

So what PySpark does is let you use a Python programme to send commands to a JVM programme named Spark!

Confused?

Well the key point here is that Spark is written in Java and Scala, but not in Python. All the computation, all the query optimization, all the cool Spark stuff happens outside of the Python programme.

So why have the Python application in the first place? Doesn’t it make things more complicated?

Well, yes. But Spark is very useful to Data Scientists, Data Analysts and other folks who help your business thrive, yet often don’t like using Java/Scala.

Data Scientists are too busy to study Java

So instead of having people learn Java or Scala, someone came up with a way to control Spark via Python.

The `spark` object in PySpark

Your PySpark shell comes with a variable called spark. And as variables go, this one is pretty cool.

For example, you can launch the pyspark shell, type spark.sql(.....), and immediately run some transformations on data from some database. In one line of code! That’s pretty great.
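Something like this (a made-up one-liner that needs no database at all; the output is a small table along these lines):

>>> spark.sql("SELECT 'hello' AS greeting, 1 + 1 AS answer").show()
+--------+------+
|greeting|answer|
+--------+------+
|   hello|     2|
+--------+------+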

This is because the spark variable gives you access to something called a sparkContext.

This sparkContext you interact with is a Python object from the pyspark library. As you can see below, it seems to do a lot of important things, like runJob, stop, setLogLevel and cancelAllJobs:

>>> dir(spark.sparkContext)
[... 'accumulator', 'addFile', 'addPyFile', 'appName', 'applicationId', 'binaryFiles', 'binaryRecords', 'broadcast', 'cancelAllJobs', 'cancelJobGroup', 'defaultMinPartitions', 'defaultParallelism', 'dump_profiles', 'emptyRDD', 'environment', 'getConf', 'getLocalProperty', 'getOrCreate', 'hadoopFile', 'hadoopRDD', 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'parallelize', 'pickleFile', 'profiler_collector', 'pythonExec', 'pythonVer', 'range', 'runJob', 'sequenceFile', 'serializer', 'setCheckpointDir', 'setJobDescription', 'setJobGroup', 'setLocalProperty', 'setLogLevel', 'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser', 'startTime', 'statusTracker', 'stop', 'textFile', 'uiWebUrl', 'union', 'version', 'wholeTextFiles']
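A couple of those attributes are handy for checking what you’re connected to. In a local pyspark shell the values will typically look something like this (they depend on your setup):

>>> spark.sparkContext.appName
'PySparkShell'
>>> spark.sparkContext.master
'local[*]'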

Let’s see what this mysterious SparkContext actually is, then.

What does pyspark.SparkContext really do?

Let’s have a look at its official documentation:

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster.

Well, that’s a bit misleading. The only thing the SparkContext in Python really does is connect to a network port on your computer; through that port, it reaches a SparkContext object inside the JVM programme (Spark). Your Python SparkContext then tells the Java SparkContext what you want done.

Imagine you’re on your private jet, and you tell your assistant, John Smith, that you want to go to Nairobi, Kenya. John Smith goes to the cockpit and talks to the pilot (also a John Smith). Then the pilot takes the plane to Nairobi. They’re named the same, but only John Smith-the-pilot does the actual flying.

Similarly, the SparkContext in Java is the one which actually does all the cool, heavy Spark stuff, like starting jobs and receiving results. The Python object is essentially a mouthpiece that you talk through to the Java object.

Similarly, the entire PySpark library is essentially a thin layer of Python sitting next to heavy-duty Java machinery.

Java SparkContext — Pilot (does the flying)
Python SparkContext — Not a pilot (only does the talking)

Wait, how does PySpark talk to the Java process? What is this devilry?

PySpark is able to make stuff happen inside a JVM process thanks to a Python library called Py4J (as in: “Python for Java”). Py4J allows Python programmes to:

  • open up a port to listen on (25334)
  • start up a JVM programme
  • make the JVM programme listen on a different network port (25333)
  • send commands to the Java process and listen for responses
Py4J can build a bridge between a Python app and a JVM app! (diagram by Piotr Łusakowski, https://deepsense.ai/cooperative-data-exploration/)

Small wonder PySpark’s authors decided to use that library to create a bridge between the Python shell and Spark.
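Here’s roughly what using Py4J on its own looks like. This is a sketch that assumes a Py4J GatewayServer (the JVM side of the bridge) is already running on the default port; when you start pyspark, that wiring is done for you:

from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                   # connect to the JVM's gateway (default port 25333)
random = gateway.jvm.java.util.Random()   # create a java.util.Random object *inside the JVM*
print(random.nextInt(100))                # the call crosses the socket, runs in Java, and the result comes back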

Uh, why does PySpark talk to Spark via a port? Didn’t you say network connections were slow?

Yes. Using PySpark makes your Spark-based applications slower. You’re dealing with Inter-Process Communication here!

If you want your Python process and your JVM process to communicate with each other, they can’t just have access to the same data in memory — the operating system does not allow that.

So the two processes can either write (serialize) messages into a file, or they can talk to each other over a network socket. This requires serialization and deserialization of data as well.

Why is that a problem?

Imagine loading a 4GB pandas dataframe in Python, doing something to it, and then sending it to Spark. This 4GB dataframe will have to be serialized (pickled) into a stream of bytes and passed via a temporary file or a special server to the JVM process, which will then deserialize it… And then Spark will have to serialize the results again to give them back to Python.
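In code, that round trip looks roughly like this (a smaller, made-up dataframe; the column names are hypothetical):

import pandas as pd

pdf = pd.DataFrame({"user_id": range(1_000_000), "score": [0.5] * 1_000_000})

sdf = spark.createDataFrame(pdf)        # the pandas data is serialized and shipped to the JVM
counts = sdf.groupBy("score").count()   # this part runs in Spark, on the JVM side
counts.toPandas()                       # the (small) result is serialized again and sent back to Python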

There’s no doubt using PySpark makes your Spark-based application do a lot of extra stuff.

What are the benefits of using PySpark then?

As we said earlier, a lot of people who work with data love PySpark, because it allows them to use Python. Python is pretty great for whenever you need to prototype something fast, explore data, or think through a problem.

It also has a gentle learning curve, which means people with domain knowledge (Analysts) can come in and easily turn their knowledge into code, without being seasoned programmers. This can be pretty great for your business, since knowledge gets quickly turned into solutions (the fact that these solutions sometimes make developers want to cry is another discussion :) ).

Can I see an example of what happens to my data step-by-step?

Sure. Let’s walk through a minimal example of executing a job from PySpark.

Let’s start our Python shell and the JVM:

pyspark

You can see Python and Java running, and a tiny bit of network communication between the two processes:

Now, let’s create some data in Python:

>>> some_string = spark.sparkContext.parallelize("hello hello hello")

Nothing has happened yet. If we go to localhost:4040, we can see that the Spark UI is not showing any jobs yet.

That’s expected. parallelize is executed lazily. That’s because some_string represents a Spark RDD (an interface for working with distributed datasets):

>>> type(some_string)
<class 'pyspark.rdd.RDD'>

A lot of RDD operations are lazy — that’s part of the design of Spark. They will not execute unless you call an action operation on the RDD (i.e. you ask Spark to give you a result).
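For example, chaining a transformation onto our RDD still doesn’t trigger anything; the Spark UI keeps showing zero jobs until we call an action (the variable name here is made up, purely for illustration):

>>> shouted = some_string.map(lambda c: c.upper())   # lazy: returns a new RDD instantly, no job runs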

Let’s call an action on the data:

>>> some_string.take(5)
['h', 'e', 'l', 'l', 'o']

Nice, we can see Spark has executed something. And you can see Python has sent a request to the JVM, and the JVM has responded (and performed some additional network activity):

So what happened to your data in this simple example?

Python:

  • Serialize "hello hello hello" into a temporary file
  • Tell the JVM (via Py4J) to pick up the file and create a Java RDD ("parallelize" the data)
  • Create a Python variable to store information about the Java RDD

JVM:

  • Read the temporary file into a collection of (byte) arrays
  • Create a Java RDD object from this collection (this partitions and distributes the data in memory - which is the point of the parallelize operation)
  • Tell Python where to find the object in the JVM

Python:

  • Ask the JVM for the first 5 records from the data stored in Spark's memory

JVM:

  • Return the results via Py4J

Python:

  • Unpickle the incoming data
  • Display the result: ['h', 'e', 'l', 'l', 'o']

In this simple case, Spark didn’t do anything particularly useful: it simply took the serialized data from Python, chopped it up into partitions and stored them in its own (distributed) memory.

Storing data in-memory across many computers is very useful when processing Big Data

What if we wanted to run some kind of transformation, like uppercasing the text data? How does the JVM know how to execute Python code, like .upper(), on a string? Does Spark map all the functions in the Python language to the equivalent functions in Scala/Java?

It doesn’t. Spark creates Python worker processes to execute Python functions.

It hands them the serialized data, and some serialized Python function(s) to execute.

Let’s create a much larger input to illustrate this:

>>> something = 'hello ' * 1000000
>>> another_string = spark.sparkContext.parallelize(something)

Let’s run a map function on it and request a result:

>>> another_string.map(lambda a: a.upper()).take(100)
20/06/01 16:39:24 WARN TaskSetManager: Stage 2 contains a task of very large size (1507 KB). The maximum recommended task size is 100 KB.
['H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L', 'O', ' ', 'H', 'E', 'L', 'L']

You can see the JVM sending out a large file. This is the input data being sent to the Spark cluster to be parallelized (i.e. to live in Spark workers’ memory).

(Caveat: here, the “cluster” isn’t very impressive — it’s just one process on your local computer; but normally it would be a group of processes on a group of computers in the cloud).

Interestingly, you can see new Python processes being created (these are the Python workers we mentioned earlier).


Each of those Python worker processes unpickles the data and the code it received from Spark. It executes the .upper() function on the data, and then serializes the results back into the pickle format. The serialized data is passed back to Spark and stored in the Spark workers’ memory.

Finally, Spark sends the result (about 1 KB) back to our original PySpark process.

PySpark then deserializes the results and prints out a Python array for you.

Whew, that’s a lot of back and forth between various Python and JVM processes.

Fortunately, you’ll rarely simply create a million lines of text by hand and work on a bare RDD. You’ll usually read the data from some SQL database, or from pandas. Your data will also be more structured: for example, there’ll be columns with names.

This allows PySpark to use more optimized solutions, like the DataFrame class or Apache Arrow serialization format, and let Spark do most of the heavy computation (data joins, filtering etc).

Depending on your code, the Python processes will have to do either very little, or a lot of work. A good example is deciding whether to do a table join in pandas or in Spark — of course, you want the latter. It’s not called PySpark for nothing.
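Here’s a sketch of what that looks like in practice. The file paths and column names below are made up, and the Arrow setting differs between Spark versions, so treat this as an illustration rather than a recipe:

# let Apache Arrow speed up pandas <-> Spark conversions
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")   # Spark 3.x key
# (on Spark 2.x the key is "spark.sql.execution.arrow.enabled")

orders = spark.read.parquet("/data/orders")    # hypothetical datasets
users = spark.read.parquet("/data/users")

# the join, filter and aggregation all run inside Spark, on the JVM side
summary = (orders.join(users, "user_id")
                 .filter(orders.amount > 100)
                 .groupBy("country")
                 .count())

summary.toPandas()   # only the small, aggregated result crosses back into Python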

Summary

What is PySpark, actually?

It’s a Python wrapper over the actual Spark application, which is written in Java and Scala.

Isn’t Python kind of slow?

Yes, and using PySpark makes things even slower — it has to talk to the actual Spark over a network connection. On top of that, your data gets serialized and deserialized a lot throughout the job execution.

Why would anyone do Big Data processing with PySpark?

Because PySpark is extremely convenient and gets people productive.

What happens to my data when I use PySpark?

The data gets serialized into a file and picked up by the Spark JVM process. Spark distributes the data in its workers’ memory.

Spark can then run built-in Spark operations like joins, filters and aggregations on the data — if it’s able to read the data.

Otherwise, Spark can launch a group of new Python processes, pass them some serialized Python code and the serialized data and ask them to execute the code on the data.

In either case, the results are stored as serialized data in Spark’s memory again.

Finally, your data is sent back from Spark, and deserialized from a bunch of bytes into Python objects. You can now print those objects, pass them into other functions etc.

Note:

I’ve simplified some of the implementation details so as to help you focus on the key mental models. I’ll write a more advanced article in the future to cover the more geeky minutiae.

Attributions

header image — https://www.reddit.com/r/woahdude/comments/55klru/lightning_snake/?utm_source=share&utm_medium=web2x

“thrashing” — https://upload.wikimedia.org/wikipedia/commons/6/67/Thrashing.GIF

“data scientist” — https://www.pxfuel.com/en/free-photo-jmtew

“storage latency” — https://blog.codinghorror.com/content/images/2014/May/storage-latency-how-far-away-is-the-data.png

About The Author

Obi is a Data Engineer with a background in software testing. He likes calisthenics, complaining, and geeking out about productivity.

You can connect with him on LinkedIn.
