PySpark on IntelliJ with packages & auto-complete

Gaurav M Shah
3 min read · Dec 13, 2018


Most PySpark folks are used to working with notebooks, mostly Jupyter and sometimes Zeppelin. Notebooks provide a wonderful way to execute code line by line and see the evaluated result at every paragraph. As developers, we have just one problem with them: a notebook is far from a full-blown IDE.

Why an IDE

  1. An IDE gives you a wonderful way to get inside the code of libraries and understand them.
  2. It provides you with wonderful typeahead (autocomplete).
  3. It lets you run things in debug mode.

Let's get rolling

  1. Prerequisites:
  • Install Python 3.6:
brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/f2a764ef944b1080be64bd88dca9a1d80130c558/Formula/python.rb
python --version

2. Install the Python plugin for IntelliJ and restart

  • Go to Preferences -> Plugins -> Marketplace -> Python -> Install

Let's get to the actual code. We will set up a Spark deep learning project with Keras: https://towardsdatascience.com/deep-learning-with-apache-spark-part-2-2a2938a36d35

3. Create a new project in IntelliJ, choose Python style, and give it a name

4. Open Project Structure and click New Project SDK. Choose Virtual Environment, select a location within your project as the venv, and pick the base interpreter from the previously installed Python

5. Let's add requirements.txt from https://github.com/FavioVazquez/deep-learning-pyspark/blob/master/requirements.txt so that we have all the Python requirements in place. We also need to add pyspark and ipython
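The resulting requirements.txt then looks roughly like this. The first entries are a sketch of what the linked file pulls in (exact packages and pins may differ; check the file itself); pyspark and ipython are the two lines we add on top:

```
# from the linked requirements.txt (illustrative; exact contents/pins may differ)
tensorflow
keras
# added for this tutorial
pyspark
ipython
```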

6. Create a new file SparkDL.py and copy over the contents from the notebook; it will show a lot of red underlines indicating unresolved packages. https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/SparkDL.py

7. We need to add the deep-learning library, which means we need the deep-learning jar as well as its transitive dependencies. For this we create a pom.xml with the deep-learning dependency, then use Maven to copy all the dependencies into a jars folder. https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/pom.xml
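A minimal sketch of such a pom.xml, assuming the spark-deep-learning coordinates published on the spark-packages repository (the group/artifact/version here are assumptions; the Spark/Scala suffix in the version must match your Spark build, so check the linked pom.xml for the real values):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>pyspark-jars</artifactId>
  <version>1.0</version>

  <repositories>
    <!-- spark-packages artifacts are not on Maven Central -->
    <repository>
      <id>spark-packages</id>
      <url>https://repos.spark-packages.org/</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- coordinates assumed; pick the version matching your Spark/Scala -->
    <dependency>
      <groupId>databricks</groupId>
      <artifactId>spark-deep-learning</artifactId>
      <version>1.5.0-spark2.4-s_2.11</version>
    </dependency>
  </dependencies>
</project>
```

With that in place, copying everything into the jars folder is one goal of the standard maven-dependency-plugin: `mvn dependency:copy-dependencies -DoutputDirectory=jars`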

8. Now we need to add these jars to the IDE environment. Go to Project Structure, then Libraries, then add a Java library and select the jars folder. This will make your IDE understand the Python code inside the jars

9. We still need to add the jars to the PySpark startup. Open Run and then Edit Configurations. Go to Templates, Python, and add the following environment variables:

  • PYSPARK_SUBMIT_ARGS --driver-memory 2g --jars jars/*.jar pyspark-shell
  • OBJC_DISABLE_INITIALIZE_FORK_SAFETY YES
  • no_proxy *
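If you'd rather not touch the run-configuration templates, the same variables can be set at the top of the script itself, before pyspark is imported. A sketch (the values mirror the settings above; adjust memory and paths to your project):

```python
import os

# Build the same PYSPARK_SUBMIT_ARGS the run configuration would set.
driver_memory = "2g"
jars_glob = "jars/*.jar"
submit_args = f"--driver-memory {driver_memory} --jars {jars_glob} pyspark-shell"

# Must be set before the first `import pyspark`, which reads this variable.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args

# macOS fork-safety workaround for the Objective-C runtime.
os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"

# Bypass any proxy for local driver/executor traffic.
os.environ["no_proxy"] = "*"

print(submit_args)
```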

10. Add the flower_photos and sample_photos datasets as per the tutorial

11. Finally, right-click and run your SparkDL

Success!!!

All of the source code is published at: https://github.com/Gauravshah/pyspark-intellij-tutorial
