PySpark for Data Noobs Part 1 — Local PySpark Setup

Farooq Mahmud
3 min read · Mar 22, 2020

--

One of the benefits of working with a forward-thinking company like Marel is the opportunity to be challenged in different domains. In the next few months, I will get to deep dive daily into the data engineering domain — uncharted territory for me.

Azure Databricks is a crucial SaaS platform used at Marel for processing large amounts of data. I would summarize Azure Databricks as an easy way to spin up an Apache Spark cluster. That’s all I am going to say about Spark and Databricks because these two technologies have been wonderfully explained by people far more knowledgeable than I am. Here is an overview of Azure Databricks. Here is an overview of Apache Spark. For this article, it is safe to treat Azure Databricks and Apache Spark as the same thing, and I will refer only to Apache Spark from this point forward.

The title of this article mentions “PySpark.” What in the world is that? Spark is a data processing platform, so there has to be an interface for telling it what to do with your data. PySpark is one of those interfaces: it lets you work with Spark through a Python API. A lot of people in the data engineering space already know Python, so this is a good thing.
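To give you a feel for what that looks like, here is a minimal sketch of the kind of code you write against the PySpark API. The file name and column are made up for illustration; we will get a real environment running in a moment:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a CSV file into a DataFrame and filter it, all from Python
    df = spark.read.csv("orders.csv", header=True, inferSchema=True)
    df.filter(df["amount"] > 100).show()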

Fun fact: Apache Spark is written in Scala, which runs on the Java Virtual Machine. PySpark code therefore has to communicate with the JVM (and move data between the Python and JVM processes), which generally makes it slower than the equivalent Scala code. However, I would argue that the tradeoff between Python’s ease of use and Scala’s performance is worth it.

Earlier, I alluded to the fact that Apache Spark is not the easiest platform to get up and running. Azure Databricks makes this somewhat easier, but you need an Azure subscription. The real value in Databricks is its ability to scale. I don’t need that right now. My approach when learning something new is to try to understand the fundamentals before adding bells and whistles. Just give me an easy way to set up Spark on my laptop and run code. I don’t need a multi-node cluster because I won’t be running intense analyses. When I get to the point where I need the power of multiple nodes, I’ll set up an Azure Databricks cluster.

Thanks to Docker, getting a local PySpark environment up and running is straightforward:

  1. Pull the jupyter/pyspark-notebook Docker image:
     docker pull jupyter/pyspark-notebook
  2. Start a container using the image. Do not change the port number. For example:
     docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
  3. The container will output a URL to browse to. It will look something like this:
     http://127.0.0.1:8888/?token=f9209b0077ca72334b2d608ecc51f4106c
  4. Browse to that URL, and the Jupyter IDE opens.

If you see something like the image below, then you have cleared the first hurdle.

Jupyter is an IDE for running PySpark code

The second hurdle to clear is loading the PySpark library in a new notebook.

Click the New button at the right and select Python 3. You should see a cell where you can write code. Type import pyspark and press Shift+Enter to run the cell.

Now type pyspark followed by a dot, then press Tab. This activates auto-completion. Arrow down to SparkContext and press Enter. Then press Shift+Enter to run the cell. The expected result is shown below.
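If you want to prove that Spark actually does something on your laptop, here is a small sketch you can run in the next cell. The app name and the numbers are arbitrary; the point is that parallelize and sum run through Spark rather than plain Python:

    import pyspark

    # Create a SparkContext against the local Spark that ships with the image
    sc = pyspark.SparkContext(appName="noob-test")  # app name is arbitrary

    # Distribute a small range of numbers and sum it on the (single-node) cluster
    rdd = sc.parallelize(range(1000))
    print(rdd.sum())  # expected output: 499500

    # Stop the context when you are done so you can create a fresh one later
    sc.stop()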

Finally, let’s explore the available magic commands. Magic commands are similar to macros. To list the available magic commands, type %lsmagic in a cell and press Shift+Enter to run the cell. The output should look like this:
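Beyond %lsmagic, a couple of magics tend to be handy when you are poking at Spark code in a notebook. The snippet below is just a sketch; both are standard IPython magics rather than anything specific to this Docker image:

    # List every available line and cell magic
    %lsmagic

    # Time a single statement (useful for quick, rough comparisons)
    %time sum(range(1_000_000))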

Fun fact: Magic commands can vary by platform. For example, Azure Databricks has a markdown magic command (%md), which does not exist in the Docker image.

You are now set up to do fun data analyses using Spark and Python!

--

Farooq Mahmud

I am a software engineer at Marel, an Icelandic company that makes machines for meat, fish, and poultry processing.