PySpark on IntelliJ with packages & auto-complete
Most PySpark folks are used to working with notebooks, mostly Jupyter and sometimes Zeppelin. Notebooks provide a wonderful way to execute code line by line and see the evaluated result at every paragraph. As developers we have just one problem with them: a notebook is far from a full-blown IDE.
Why an IDE?
- An IDE gives you a wonderful way to step into the code of libraries and understand them.
- It provides wonderful typeahead (autocomplete).
- It lets you run things in debug mode.
Let's get rolling
1. Prerequisites:
- Install Python 3.6
brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/f2a764ef944b1080be64bd88dca9a1d80130c558/Formula/python.rb
python --version
- Download IntelliJ — https://www.jetbrains.com/idea/download/ (the Community Edition will do fine)
2. Install the Python plugin for IntelliJ and restart
- Go to Preferences -> Plugins -> Marketplace -> Python -> Install
Let's get to the actual code. We will set up the spark-deep-learning project with Keras: https://towardsdatascience.com/deep-learning-with-apache-spark-part-2-2a2938a36d35
3. Create a new project in IntelliJ, choose the Python project type, and give it a name
4. Open Project Structure and click New under Project SDK. Choose Virtual Environment, select a location within your project as the venv, and pick the previously installed Python as the base interpreter
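If you prefer the terminal, this IDE step is roughly equivalent to creating and activating the venv yourself (a rough equivalent, not what IntelliJ literally runs):
python -m venv venv
source venv/bin/activate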
5. Let's add requirements.txt from https://github.com/FavioVazquez/deep-learning-pyspark/blob/master/requirements.txt so that we have all the Python requirements in place. We also need to add pyspark and ipython to it
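For reference, the appended lines can be as simple as the following (the version pin here is an assumption; match it to the Spark version you intend to run):
pyspark==2.3.2
ipython
Then install everything into the venv with pip install -r requirements.txt from the activated environment.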
6. Create a new file, SparkDL.py, and copy over the contents from the notebook. It will show a lot of red underlines indicating unresolved packages: https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/SparkDL.py
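For orientation, the top of the script looks something like this (a minimal sketch; the full file is in the repo linked above):
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer  # stays red until the jars are linked in the next steps

# a plain local session is enough for this tutorial
spark = SparkSession.builder.appName("SparkDL").getOrCreate()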
7. We need to add the deep-learning library, which means we need its jar as well as its transitive dependencies. For this we create a pom.xml that declares the deep-learning dependency, then use Maven to copy all the dependencies into a jars folder: https://github.com/Gauravshah/pyspark-intellij-tutorial/blob/master/pom.xml
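With that pom.xml in the project root, the copy itself is a single Maven command (using jars as the output folder, per the convention above):
mvn dependency:copy-dependencies -DoutputDirectory=jars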
8. Now we need to add these jars to the IDE environment. Go to Project Structure, Libraries, then add a Java library and select the jars folder. This will make your IDE understand the Python code shipped inside the jars
9. We still need to add the jars to the PySpark startup. Open Run and then Edit Configurations. Go to Templates, Python, and add the following environment variables:
- PYSPARK_SUBMIT_ARGS: --driver-memory 2g --jars jars/*.jar pyspark-shell
- OBJC_DISABLE_INITIALIZE_FORK_SAFETY: YES
- no_proxy: *
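OBJC_DISABLE_INITIALIZE_FORK_SAFETY works around a macOS fork-safety check that can otherwise crash forked Python worker processes, and no_proxy is presumably there to keep local driver traffic off any configured proxy. To confirm the template variables actually reach a run, a quick print inside any script is enough:
import os
# both should echo the values set in the run configuration template
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))
print(os.environ.get("OBJC_DISABLE_INITIALIZE_FORK_SAFETY"))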
10. Add the flower_photo and sample_photos datasets as per the tutorial
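Loading the photos then follows the tutorial's pattern (a sketch assuming Spark 2.3's ImageSchema; the folder names and labels are illustrative):
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import lit

# read two classes of images and attach integer labels
tulips_df = ImageSchema.readImages("flower_photos/tulips").withColumn("label", lit(1))
daisy_df = ImageSchema.readImages("flower_photos/daisy").withColumn("label", lit(0))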
11. Finally, right-click SparkDL.py and run it
Success!!!
All of the source code is published at: https://github.com/Gauravshah/pyspark-intellij-tutorial