Debugging PySpark with PyCharm and AWS EMR

Flomin Ron · Explorium.ai · May 4, 2021 · 4 min read

Have you ever found yourself developing PySpark inside EMR notebooks? Have you debugged PySpark locally, but wanted to run it over a real, big data set and couldn’t because your computer lacked the resources?

On one of my projects at Explorium, I was implementing a scalable entity resolution process to automatically match and merge entities from dozens of data sources. (Stay tuned for more details in the following articles).

This process is very resource-heavy, so I decided to implement it using PySpark.

PySpark gave me two advantages: the ability to run very complex, computation-heavy processes at scale, and the ability to reuse our entity data model, which is implemented in Python.

When I started implementing the process, I ran into the issue of not being able to debug it against a real data set on my local computer, so I tried to do so against AWS EMR.

I was sure I could find some solutions in a quick search, but surprisingly, I didn’t. I even opened a Stack Overflow thread regarding this most basic need: “How to debug PySpark on EMR using PyCharm”, but no one answered.

After doing some research, I found a setup that works, and I’d like to share my insights on how to debug PySpark with PyCharm and AWS EMR.

Requirements:

  • PyCharm Professional (I used version 2020.3.2)
  • A running AWS EMR cluster with Spark installed (I used release emr-6.2.0)

The first step is to create an SSH Python interpreter.

Fill in the AWS master public DNS as the host (this can be found inside the EMR UI), and put “hadoop” as the username.

Afterward, use your pem / ppk file for authentication (look here for more details).

Use “/usr/bin/python3.7” as the Python path. This is where Python is located on the master server.

PyCharm will automatically create a temporary folder on the master server for the project’s files.
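Before moving on, a quick sanity check (my own addition, not a step from the setup itself): run a small script with the new interpreter to confirm it executes on the master node rather than locally.

    import socket
    import sys

    # With the SSH interpreter selected, these print the remote Python binary
    # and the EMR master's hostname, not your local machine's.
    print(sys.executable)        # expect /usr/bin/python3.7
    print(socket.gethostname())  # expect the master node's hostname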

Next, let’s make sure that we have a remote deployment configuration so all of the project files will be uploaded to the master server.

In PyCharm, go to:

  1. Tools
  2. Deployment
  3. Configuration

Make sure that you have an existing configuration that points to the SSH configuration you have just created.

If not, create one:

  1. Click on the + symbol at the top left
  2. Choose SFTP
  3. Name your deployment configuration
  4. Point your deployment configuration to your SSH configuration

Eventually, your configuration should look like this:

Next, click on Mappings and make sure that the local path points to the current project.

If the deployment path is empty, please fill it with this value: /tmp/pycharm_project_<project_name>

Exit the popup, go back to Tools > Deployment, and make sure that “Automatic Upload” is on. This ensures that whenever we change a file, the changed file is uploaded to the master server automatically.

Last but not least, edit the “Run/Debug” configuration of your runnable file and add these values to the environment variables:

    YARN_CONF_DIR=/etc/hadoop/
    SPARK_HOME=/usr/lib/spark
    HADOOP_HOME=/usr/lib/hadoop
    PYSPARK_PYTHON=/usr/bin/python3
    PYTHON_PATH=/usr/bin/python3

These variables are important so that PyCharm can locate the Spark, Hadoop, and YARN installations on the master server.
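With everything in place, a minimal script like the one below should run and be debuggable straight from PyCharm. This is just a sketch of mine (the app name and sample DataFrame are placeholders); the key detail is .master("yarn"), which submits the job to the cluster while the driver runs under the remote interpreter, so breakpoints behave as usual.

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # The environment variables above let PySpark find Spark, Hadoop
        # and YARN on the master server.
        spark = (
            SparkSession.builder
            .master("yarn")                    # submit to the cluster's YARN
            .appName("pycharm-remote-debug")   # placeholder app name
            .getOrCreate()
        )

        # Put a breakpoint on the next line: the driver runs on the master
        # node under the remote interpreter, so the debugger attaches as usual.
        df = spark.range(1000).toDF("id")
        print(df.count())

        spark.stop()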

Summary

So what did we just do?

  1. We created an SSH configuration so PyCharm knows where to connect.
  2. Then, we specified the Python path, so PyCharm knows where Python is located on the master server.
  3. We created a remote deployment that takes care of uploading the project files into the master server.
  4. We added important environment variables.

And that’s it. You should be able to run and debug your PySpark application from your local PyCharm.
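And since the original motivation was debugging against a real, big data set, you can now swap the sample DataFrame in the sketch above for real data in S3. The bucket and path below are placeholders of mine, not from the article; substitute a data set your cluster can access:

    # Placeholder location -- replace with data your cluster can read.
    df = spark.read.parquet("s3://your-bucket/path/to/data/")
    df.printSchema()
    print(df.count())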

Ron Flomin is an expert data and machine learning engineer on Explorium’s data team. Explorium offers the industry’s first automated External Data Platform for Advanced Analytics and Machine Learning. Explorium empowers data scientists and business leaders to drive decision-making by eliminating the barrier to acquiring and integrating the right external data and dramatically decreasing the time to superior predictive power. Learn more at www.explorium.ai
