How to kick-start Spark development on IntelliJ IDEA in 4 steps

Michael Hausenblas
Large-scale Data Processing
2 min read · Feb 8, 2015

I’m gonna walk you through setting up your environment to develop an Apache Spark application in Scala, using IntelliJ IDEA 14.

  1. Download and install IntelliJ 14 and make sure the Scala plugin is enabled.
  2. Get the Scala Spark skeleton and move it into a directory of your choice; I’m assuming ~/sandbox/sparkgrep/ as the base in the following. cd into the base, create the directory src/main/scala/spark/example/ (under Linux/MacOS, use mkdir -p to create the intermediate directories) and move SparkGrep.scala into the newly created directory.
  3. Now head back to IntelliJ and import the base directory (File → Import Project). Wait a bit until everything is imported, then select the file SparkGrep.scala, right-click it and choose ‘Run SparkGrep.scala’. That will fail with the message Usage: SparkGrep <host> <input_file> <match_term> because we haven’t supplied any program arguments yet.
  4. To fix this, open Run → Edit Configurations, make sure the SparkGrep application is selected, and insert the following into the Program arguments input field (found in the Configuration tab):
local[*] src/main/scala/spark/example/SparkGrep.scala val

These arguments run the app locally, using as many worker threads as there are cores (local[*]), with src/main/scala/spark/example/SparkGrep.scala (the file moved in step 2) as the input file, and count the lines that contain val. This should yield something like the following as the final output line:

5 lines in src/main/scala/spark/example/SparkGrep.scala contain val
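
If you are curious how the three arguments end up producing that output line, here is a minimal sketch of what a program like SparkGrep could look like. It is not necessarily identical to the skeleton you downloaded: the package name spark.example is an assumption derived from the directory created in step 2, and the body is my own reconstruction from the usage message and the output shown above. It only uses the standard Spark core API (SparkConf, SparkContext, textFile, filter, count).

```scala
package spark.example // assumed to match src/main/scala/spark/example/

import org.apache.spark.{SparkConf, SparkContext}

object SparkGrep {
  def main(args: Array[String]): Unit = {
    // Without the three program arguments, print the usage message and bail out.
    if (args.length < 3) {
      System.err.println("Usage: SparkGrep <host> <input_file> <match_term>")
      System.exit(1)
    }
    val Array(master, inputFile, matchTerm) = args.take(3)

    // "local[*]" as master runs Spark inside this JVM, one worker thread per core.
    val conf = new SparkConf().setAppName("SparkGrep").setMaster(master)
    val sc = new SparkContext(conf)
    try {
      // Read the file line by line and count the lines containing the term.
      val matching = sc.textFile(inputFile).filter(_.contains(matchTerm)).count()
      println(s"$matching lines in $inputFile contain $matchTerm")
    } finally {
      sc.stop() // release Spark resources even if the job fails
    }
  }
}
```

Note that this counts matching lines, not individual occurrences, so a line with two vals still counts once.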

Well done! You deserve your first spark. Please peel it off from the left and stick it on your laptop ;)

Where to go from here? Well, first I’d say you check out:

And once you’ve toyed around with Spark a bit and gathered enough experience deploying Spark in a cluster, mixing SQL with machine learning code, and tuning stream code, I strongly recommend:

Have fun, and I’m looking forward to your comments!
