PySpark on IntelliJ with packages & auto-complete


Most pyspark folks are used to working in notebooks, mostly Jupyter and sometimes Zeppelin. Notebooks provide a wonderful way to execute code line by line and see the evaluated result at every paragraph. As developers, we have just one problem with them: a notebook is far from a full-blown IDE.

Why IDE

  1. An IDE gives you a wonderful way to get inside the code of libraries and understand them.
  2. It provides wonderful typeahead (autocomplete).
  3. It lets you run things in debug mode.

Let's get rolling

  1. Prerequisites:
  • Install Python 3.6
brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/f2a764ef944b1080be64bd88dca9a1d80130c558/Formula/python.rb
python --version

2. Install the Python plugin for IntelliJ and restart the IDE

  • Go to Preferences -> Plugins -> Marketplace -> Python -> Install

Let's get to the actual code. We will set up the Spark deep learning project with Keras from https://towardsdatascience.com/deep-learning-with-apache-spark-part-2-2a2938a36d35

3. Create a new project in IntelliJ, choose the Python type, and give it a name


4. Open Project Structure and click New Project SDK. Choose Virtual Environment, select a location within your project for the venv, and pick the base interpreter from the previously installed Python


5. Let's add requirements.txt from https://github.com/FavioVazquez/deep-learning-pyspark/blob/master/requirements.txt so that all the Python requirements are in place. We also need to add pyspark and ipython
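The two extra entries can simply be appended to the downloaded requirements.txt. A sketch of what that looks like (the version pin here is illustrative, not from the linked file — pin whichever pyspark release matches the Spark jars you will add later):

```text
# appended to requirements.txt
pyspark==2.4.4   # illustrative version; match your Spark distribution
ipython
```

IntelliJ will offer to install the requirements into the virtual environment when you open the file; otherwise `pip install -r requirements.txt` from the venv does the same.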

6. Create a new file SparkDL.py and copy over the contents from the notebook; it will show a lot of red underlines indicating unresolved packages. https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/SparkDL.py

7. We need to add the deep-learning library, which means pulling in the deep-learning jar as well as its transitive dependencies. For this we create a pom.xml that declares the deep-learning library, then use Maven to copy all the dependencies into a jars folder. https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/pom.xml
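The copying is done by the standard maven-dependency-plugin. A minimal sketch of the relevant build section is below; the actual dependency list and any repository entries live in the linked pom.xml, and the output path is just a convention for this tutorial:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>copy-dependencies</goal>
          </goals>
          <configuration>
            <!-- drop every dependency jar into ./jars for the IDE and pyspark -->
            <outputDirectory>${project.basedir}/jars</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Running `mvn package` (or `mvn dependency:copy-dependencies` directly) then fills the jars folder.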

8. Now we need to add these jars to the IDE environment. Go to Project Structure -> Libraries, add a new Java library, and select the jars folder. This lets the IDE resolve the Python code shipped inside the jars

9. We still need to add the jars to the pyspark startup. Open Run and then Edit Configurations. Go to Templates -> Python and add the following environment variables:

  • PYSPARK_SUBMIT_ARGS --driver-memory 2g --jars jars/*.jar pyspark-shell
  • OBJC_DISABLE_INITIALIZE_FORK_SAFETY YES
  • no_proxy *
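If you prefer not to touch the run-configuration template, the same environment can be set from code before pyspark is first imported. A minimal sketch, assuming the jars folder from step 7 sits in the project root; note that Spark's --jars flag expects a comma-separated list, so expanding the glob ourselves keeps the value unambiguous:

```python
import glob
import os

# Collect the dependency jars that Maven copied into ./jars (step 7).
jar_paths = sorted(glob.glob(os.path.join("jars", "*.jar")))
jars_arg = "--jars {} ".format(",".join(jar_paths)) if jar_paths else ""

# pyspark reads PYSPARK_SUBMIT_ARGS when it launches the JVM, so this
# must be set before the first `import pyspark`.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 2g " + jars_arg + "pyspark-shell"
)

# macOS workaround: avoid fork-safety crashes in the Objective-C runtime
os.environ["OBJC_DISABLE_INITIALIZE_FORK_SAFETY"] = "YES"
```

This is equivalent to the template variables above, just scoped to the one script.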

10. Add the flower_photos and sample_photos folders as per the tutorial

11. Finally, right-click and run your SparkDL.py

Success!!!


All of the source code is published at: https://github.com/Gauravshah/pyspark-intellij-tutorial
