The following post shows how to configure the JetBrains PyCharm CE IDE for developing applications with the Apache Spark 2.0+ framework.

1. Download the Apache Spark distribution pre-built for Hadoop (link).

2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.

3. Start PyCharm and create a new project: File → New Project. Call it "spark-demo".

4. Inside the project create a new Python file: New → Python File. Call it run.py.

5. Write a simple script that counts the occurrences of a's and b's in Spark's README.md file. Don't worry about the errors; we will fix them in the next steps.
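A minimal sketch of such a script, modeled on the classic example from the Spark quickstart, might look like the following. The README.md path is an assumption; point it at the copy inside your $SPARK_HOME.

```python
def count_lines_with(lines, letter):
    """Count how many lines contain the given letter (case-sensitive).

    Mirrors the driver logic below so it can be tested without Spark.
    """
    return sum(1 for line in lines if letter in line)


def main():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("spark-demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Assumed path -- replace with $SPARK_HOME/README.md on your machine.
    lines = sc.textFile("README.md")
    num_as = lines.filter(lambda s: "a" in s).count()
    num_bs = lines.filter(lambda s: "b" in s).count()
    print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
    sc.stop()


if __name__ == "__main__":
    main()
```

Until the Spark libraries are added to the project, PyCharm will flag the pyspark import as unresolved; that is exactly what the next steps fix.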

6. Add the required libraries: PyCharm → Preferences… → Project: spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib and apply the changes.

7. Create a new run configuration: Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.

8. Add environment variables. Inside the newly created configuration add the environment variables Spark needs at runtime (typically SPARK_HOME and PYTHONPATH). Save all changes.
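The exact values depend on where you unpacked Spark in step 2, but a typical pair of variables (the paths below are placeholders, and the Py4J archive version varies between Spark releases) looks like:

```
SPARK_HOME=/path/to/spark
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-<version>-src.zip
```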

9. Run the script: Run → Run 'Run with Spark'. You should see the script execute properly within a Spark context.

Now you can improve your working experience with advanced IDE features like debugging and code completion.

Happy coding.

--

Norbert Kozlowski
Parrot Prediction

I’m hopeful that one day, when machines rule the world, I will be their best friend.