Get started with PySpark on Mac using an IDE (PyCharm)

achilleus · 3 min read · Apr 29, 2019

I found running Spark with Python in an IDE somewhat tricky, so I'm writing this post to help you get started with PySpark development in an IDE.


I have been writing all my Spark jobs in Scala using IntelliJ + sbt for a while now, and decided to give PySpark in an IDE a try. It was definitely not straightforward. Getting started with PySpark in the pyspark shell or a Jupyter notebook is relatively easy; an IDE takes more setup.

1. Installing Homebrew:

You will need Homebrew installed on your Mac.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
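Once the install script finishes, you can sanity-check that Homebrew is on your PATH:

$ brew --version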

2. Installing Python:

Next up is Python. I already had Python 2.7 on my machine, but I wanted to develop my PySpark applications with Python 3.6 to try out the new Koalas library from Databricks.

Luckily, there is a cool way of running multiple Python versions side by side. The easiest is to use pyenv.

brew install pyenv

Now, add the lines below to your ~/.bash_profile to change the pyenv root path and make sure pyenv is initialized.

export PYENV_ROOT=/usr/local/opt/pyenv
eval "$(pyenv init -)"

Now, to apply the changes to your bash_profile, run the following. Please note that the $ is required as part of the command: $SHELL is an environment variable that expands to the path of your shell, and -l starts it as a login shell so that ~/.bash_profile is re-read.

$SHELL -l
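Alternatively, you can source the file in your current shell:

source ~/.bash_profile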

Now you can install the required python version.

pyenv install 3.6.8 #Install python
pyenv rehash #Rebuilds the shim files
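To list every Python that pyenv now manages (the currently active one is marked with a *), run:

$ pyenv versions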

The main advantage of pyenv is that you can scope the Python version to the folder/directory you are working in.

In my case, I need this only in my Pyspark project.

cd ~/PycharmProjects/pyspark-examples
pyenv local 3.6.8
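Behind the scenes, pyenv local simply writes a .python-version file into the directory, which pyenv picks up whenever you run python from there:

$ cat .python-version
3.6.8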

You can verify it by running:

~/PycharmProjects/pyspark-examples $ python
Python 3.6.8 (default, Apr 29 2019, 15:40:26)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

3. Downloading Spark binaries and setting the right env variables:

On your terminal:

$ curl -O http://ftp.wayne.edu/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
$ tar -xzf spark-2.4.2-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.2-bin-hadoop2.7 /opt/spark
$ export SPARK_HOME=/opt/spark
$ export PATH=$SPARK_HOME/bin:$PATH
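Note that these exports only last for the current terminal session. To make them permanent, you can add the same two lines to your ~/.bash_profile (assuming the /opt/spark install path used above):

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH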

Verify by running pyspark. It should print something along the lines of:

$ pyspark
Python 3.6.8 (default, Apr 29 2019, 15:40:26)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.10.44.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
19/04/29 17:11:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/
Using Python version 3.6.8 (default, Apr 29 2019 15:40:26)
SparkSession available as 'spark'.
>>>

4. Installing findspark:

Now, why do we need this?! Trust me, I had the same question. The thing is, PySpark isn't on sys.path by default, so findspark adds it to sys.path at runtime.
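Conceptually, findspark.init does something like the following. This is a simplified sketch, not findspark's actual source; it assumes Spark lives in /opt/spark and that the bundled py4j zip matches the glob below.

import glob
import os
import sys

spark_home = "/opt/spark"  # assumption: where we unpacked Spark above
os.environ["SPARK_HOME"] = spark_home
# PySpark's Python sources live under $SPARK_HOME/python
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j (the Java gateway PySpark talks to) ships as a zip; its name varies by Spark version
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])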

To install findspark, we can either install it with pip or add it to requirements.txt so that the IDE (PyCharm) downloads it for us.
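From the terminal, that is simply:

pip install findspark

For the requirements.txt route, a file containing the single line findspark is enough.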

Clicking on Install requirement should do the trick for you. If it doesn't, navigate to PyCharm Preferences (Cmd + ,) → Tools → Python Integrated Tools and update the "Package requirements file:" field to point to the location of requirements.txt.

Something like: /Users/PycharmProjects/pyspark-examples/requirements.txt

5. Finally, write some Spark code!

All we have to do is import findspark, call findspark.init, and then write our regular Spark code.

import findspark
findspark.init("/opt/spark")  # point findspark at SPARK_HOME so pyspark is importable

from pyspark import SparkContext

sc = SparkContext(appName="getEvenNums")
x = sc.parallelize([1, 2, 3, 4])        # distribute a local list as an RDD
y = x.filter(lambda num: num % 2 == 0)  # keep only the even numbers
print(y.collect())                      # [2, 4]
sc.stop()
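If you prefer the DataFrame API, the same job can also be written with SparkSession, the standard entry point since Spark 2.x. A minimal sketch, assuming the same /opt/spark install:

import findspark
findspark.init("/opt/spark")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("getEvenNumsDF").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["num"])  # one-column DataFrame
df.filter(df.num % 2 == 0).show()  # shows the rows containing 2 and 4
spark.stop()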

This is definitely not the only way, just one of the ways I found as I was exploring this myself; there may well be an easier or more familiar approach. Thanks for reading! Please do share the article if you liked it. Any comments or suggestions are welcome! Check out my other articles here.
