The following post shows how to configure the JetBrains PyCharm CE IDE for developing applications with the Apache Spark 2.0+ framework.

1. Download the Apache Spark distribution pre-built for Hadoop (link).

2. Unpack the archive. This directory will later be referred to as $SPARK_HOME.

3. Start PyCharm and create a new project: File → New Project. Call it "spark-demo".

4. Inside the project create a new Python file: New → Python File. Call it run.py.

5. Write a simple script that counts the occurrences of a's and b's in Spark's README.md file. Don't worry about the errors; we will fix them in the next steps.
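A minimal sketch of such a script, modeled on the classic example from the Spark quickstart, might look like the following. The README.md path is an assumption; point it at the copy inside your $SPARK_HOME.

```python
def count_lines_with(lines, letter):
    """Count how many lines contain the given letter (case-sensitive).

    Mirrors the driver logic below so it can be tested without Spark.
    """
    return sum(1 for line in lines if letter in line)


def main():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("spark-demo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Assumed path -- replace with $SPARK_HOME/README.md on your machine.
    lines = sc.textFile("README.md")
    num_as = lines.filter(lambda s: "a" in s).count()
    num_bs = lines.filter(lambda s: "b" in s).count()
    print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
    sc.stop()


if __name__ == "__main__":
    main()
```

Until the Spark libraries are added to the project, PyCharm will flag the pyspark import as unresolved; that is exactly what the next steps fix.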

6. Add the required libraries: PyCharm → Preferences… → Project: spark-demo → Project Structure → Add Content Root. Select all ZIP files from $SPARK_HOME/python/lib and apply the changes.

7. Create a new run configuration: Run → Edit Configurations → + → Python. Name it "Run with Spark" and select the previously created file as the script to be executed.

8. Add environment variables. Inside the newly created configuration add the environment variables Spark needs at runtime (typically SPARK_HOME and PYTHONPATH). Save all changes.
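The exact values depend on where you unpacked Spark in step 2, but a typical pair of variables (the paths below are placeholders, and the Py4J archive version varies between Spark releases) looks like:

```
SPARK_HOME=/path/to/spark
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-<version>-src.zip
```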

9. Run the script: Run → Run 'Run with Spark'. You should see the script execute properly within a Spark context.

Now you can improve your working experience with advanced IDE features like debugging and code completion.

Happy coding.

--

Norbert Kozlowski
Parrot Prediction

I’m hopeful that one day, when machines rule the world, I will be their best friend.