A woman sits in front of a computer to represent a data engineer learning IntelliJ, Git, Java and Apache Spark
Source: https://www.vecteezy.com/free-vector/data-engineer

Getting Set Up With IntelliJ, Git, Java, and Apache Spark

Nick Rafferty
Published in Analytics Vidhya
8 min read · Sep 7, 2020

Installations. The dread is already sinking in at the word.

BUT WAIT… Before you throw your computer across the room, let’s see if we can do this together.

Source: https://www.reddit.com/r/reactiongifs/comments/2vr1uf/mrw_document1_word_not_responding/

Setting Up IntelliJ

Let’s warm up with an easy one. IntelliJ is an Integrated Development Environment (IDE) that will give you a great user interface to work with for data engineering.

I’d recommend starting with the Community edition. It’s free and has all the functionality you need to get started on some data engineering projects. Download IntelliJ here: https://www.jetbrains.com/idea/download

Open the downloaded file and you should see the installer start.

Installation Bar for IntelliJ

Click ‘Next’ through a couple of default options. At one point you’ll be asked to pick your theme. You can choose either light or dark, and don’t worry, you can change it at any time!

Customize Dark or Light User Interface in IntelliJ

Important! Make sure to add the Scala Plugin

Scala will be the language that we use to write our Spark Applications. This plugin adds some built-in functionality to make learning Apache Spark and Scala more beginner-friendly.

Scala plugin under featured plugins section

And that’s it! You’re good to go. You should see a welcome message like this:

IntelliJ welcome screen

Setting Up Git

Before you install Git on your computer, let’s first run a quick test to see if you already have it installed. Open up the terminal (if you’re using a Mac) or the Command Prompt (if using Windows). You can search for them on your computer if you have never used them before.

Once you’ve opened the terminal, type: git --version. If you already have git installed, it will print out something similar to git version 2.21.1. If not, no worries, that’s what this tutorial is for!
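If you’d like to run the same check from code later on, here is a minimal Scala sketch. It assumes Scala 2.12 and uses only the standard library. The names GitCheck, parseVersion, and installedVersion are made up for this example; they are not part of Git or of the tutorial repository.

```scala
import scala.sys.process._
import scala.util.Try

object GitCheck {
  // Matches the start of a line such as "git version 2.21.1".
  private val VersionLine = """git version (\d+(?:\.\d+)*)""".r

  // Pulls the numeric version out of the text printed by `git --version`.
  def parseVersion(output: String): Option[String] =
    VersionLine.findFirstMatchIn(output).map(_.group(1))

  // Runs `git --version`; returns None if the command cannot be started or fails.
  def installedVersion(): Option[String] =
    Try(Seq("git", "--version").!!).toOption.flatMap(parseVersion)

  def main(args: Array[String]): Unit =
    installedVersion() match {
      case Some(version) => println(s"Git $version is installed")
      case None          => println("Git was not found on the PATH")
    }
}
```

Platform suffixes such as .windows.1 or (Apple Git-122.3) are dropped on purpose, so only the numeric part of the version is reported.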

Windows and Mac users will have different steps, so follow along to whichever is applicable to you.

Windows Users:

  1. Download Git: https://git-scm.com/downloads/. Make sure to click Windows under the downloads section.
  2. Click on the download once it completes and you should see an installer pop up. You’ll install Git under C:\Program Files\Git. This is the default, so you should not have to change anything.
Windows Git installer to install in folder C:\Program Files\Git

3. There will be multiple options that you are given for each step along the way. Click ‘Next’ and accept the default options that you are presented with. Here are a couple of examples below:

Adjusting your Path Environment: Git from the command line and also 3rd party software is checked
Choose the default behavior of git pull: default is checked.

4. Edit your system environment variables. This is a crucial step for configuring Git on Windows. To find them, type ‘edit the system environment variables’ in your Windows search bar. You should see this popup.

Edit the system environment variables on Windows.

5. Click on ‘Path’ to highlight it blue, then click ‘Edit.’

Path is clicked under the windows environment variables.

6. Add two new paths here. Click ‘New’ and type in these two paths:

C:\Program Files\Git\bin\

C:\Program Files\Git\cmd\

Add two new paths.

7. Click ‘Ok’ and you’re done! Now close the Command Prompt and re-open it, so it picks up the new Path. Type git --version and you should see something like git version 2.28.0.windows.1
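If git --version still isn’t recognized, the Path edit is the first thing to double-check. Here is a small Scala sketch that reports which of the two Git folders are missing from a Path value. It assumes Git was installed to the default location from step 2, and PathCheck, gitFolders, normalize, and missingEntries are hypothetical names invented for this example.

```scala
import java.io.File
import java.util.regex.Pattern

object PathCheck {
  // The two folders from step 6 (assumes the default install location).
  val gitFolders: Seq[String] =
    Seq("C:\\Program Files\\Git\\bin", "C:\\Program Files\\Git\\cmd")

  // Trim whitespace, drop trailing slashes, and lowercase,
  // because Windows treats paths as case-insensitive.
  def normalize(entry: String): String =
    entry.trim.replaceAll("""[\\/]+$""", "").toLowerCase

  // Returns the required folders that do not appear in the given Path value.
  def missingEntries(pathValue: String, required: Seq[String], separator: String): Seq[String] = {
    val present = pathValue.split(Pattern.quote(separator)).map(normalize).toSet
    required.filterNot(folder => present.contains(normalize(folder)))
  }

  def main(args: Array[String]): Unit = {
    val pathValue = Option(System.getenv("PATH")).getOrElse("")
    val missing = missingEntries(pathValue, gitFolders, File.pathSeparator)
    if (missing.isEmpty) println("Both Git folders are on the Path")
    else missing.foreach(folder => println(s"Missing from Path: $folder"))
  }
}
```

The comparison ignores case and trailing backslashes, so C:\Program Files\Git\bin\ and c:\program files\git\bin count as the same folder.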

Mac Users

  1. In IntelliJ’s welcome screen, click on ‘Get from Version Control.’ You’ll be prompted with a notification that says the “git” command requires the command line developer tools. Click Install.
Git command requires the command line developer tools. Click install.

2. After waiting for the installation (it can take a few minutes), you should be good to go! Open the terminal by pressing Command + Space to open Spotlight and searching for Terminal.

3. In terminal, type git --version. If successful, you should see something like: git version 2.21.1 (Apple Git-122.3)

Setting Up Java

Similarly to Git, you can check whether you already have Java installed by typing java --version in your terminal or Command Prompt. For Apache Spark, we will use Java 11 and Scala 2.12. If you do not have Java 11 installed, follow these steps:

Windows Users

  1. Navigate to Oracle’s Java 11 download: https://www.oracle.com/java/technologies/javase-jdk11-downloads.html
  2. Click to download the Windows x64 Installer. It will be a .exe file.
Windows x64 Installer

3. After accepting the Oracle agreement, you’ll be prompted to create an Oracle account.

Create an Oracle Account

4. Create your Oracle account in order to download Java 11. Oracle will not let you download unless you create an account.

Oracle webpage: Create your Oracle Account

5. Once you complete your account, you may need to go back to https://www.oracle.com/java/technologies/javase-jdk11-downloads.html and click the Windows x64 Installer again. Now it should allow you to download the .exe file to your computer.

6. Click ‘Next’ through the default steps until you see that Java 11 was successfully installed.

Java SE Development Kit Successfully Installed

7. Search for ‘Edit System Environment Variables’ as shown in the Git setup section for Windows.

8. Under ‘User variables’ click ‘New.’ Add JAVA_HOME as the variable with your path: C:\Program Files\Java\jdk-11.0.8

Adding JAVA_HOME variable to the user variables

9. Click ‘Path’ to highlight it blue, then click ‘Edit.’

10. Add one new path here. Click ‘New’ and type:

%JAVA_HOME%\bin

Pay attention to which version you downloaded. The 11.0.8 in my example may change based on your version.

11. Open a new Command Prompt, and type in java --version. You should now see java 11.0.8 pop up. Congrats on setting up Java!

Command prompt command: java --version
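If java --version doesn’t respond as expected, the JAVA_HOME variable from step 8 is worth a second look. The Scala sketch below checks that the variable is set and that a Java launcher sits in its bin folder. JavaHomeCheck, javaLauncher, and describe are hypothetical names chosen for this example, and the code uses only the standard library.

```scala
import java.nio.file.{Files, Path, Paths}

object JavaHomeCheck {
  // Where the java launcher is expected to live inside a JDK folder.
  def javaLauncher(javaHome: String, windows: Boolean): Path =
    Paths.get(javaHome, "bin", if (windows) "java.exe" else "java")

  // Summarizes the state of a JAVA_HOME value in one sentence.
  def describe(javaHome: Option[String], windows: Boolean): String = javaHome match {
    case None =>
      "JAVA_HOME is not set"
    case Some(home) if Files.exists(javaLauncher(home, windows)) =>
      s"JAVA_HOME looks good: $home"
    case Some(home) =>
      s"JAVA_HOME is set to $home, but no java launcher was found under its bin folder"
  }

  def main(args: Array[String]): Unit = {
    val windows = System.getProperty("os.name").toLowerCase.startsWith("windows")
    println(describe(sys.env.get("JAVA_HOME"), windows))
  }
}
```

A common cause of the last message is a version mismatch in the folder name, for example jdk-11.0.8 in the variable when a different 11.x release was installed.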

Mac Users

  1. Navigate to Oracle’s Java 11 download: https://www.oracle.com/java/technologies/javase-jdk11-downloads.html
  2. Click the download for the macOS Installer. It will be a .dmg file.
Mac OS Installer

3. After accepting the Oracle agreement, you’ll be prompted to create an Oracle account.

Create an Oracle Account

4. Create your Oracle account in order to download Java 11. Oracle will not let you download unless you create an account.

5. Once you complete your account, you may need to go back to https://www.oracle.com/java/technologies/javase-jdk11-downloads.html and click the macOS Installer again. Now it should allow you to download the .dmg file to your computer.

6. Click to open the download and continue through the JDK 11.0.8 installer.

JDK 11.0.8 Installer: Click Continue

7. Open up a terminal and type: java --version. You should now see Java SE 11.0.8. You’re all set up!
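Once Java is installed on either platform, you can also confirm the version from inside a program. This is a minimal Scala sketch, assuming only the standard library, that reads the java.version system property and extracts the major version. JavaVersionCheck and majorVersion are names invented for this example.

```scala
object JavaVersionCheck {
  // Extracts the major version from strings such as "11.0.8" or "1.8.0_261".
  // Older releases use the "1.x" scheme, so "1.8.0_261" means Java 8.
  def majorVersion(version: String): Int = {
    val parts = version.split("[._-]")
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt else parts(0).toInt
  }

  def main(args: Array[String]): Unit = {
    val version = System.getProperty("java.version")
    val major = majorVersion(version)
    if (major == 11) println(s"Running on Java $version, which matches this tutorial")
    else println(s"Running on Java $version, but this tutorial uses Java 11")
  }
}
```

The parsing is deliberately simple and would throw on an unusual version string that does not begin with a number.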

Getting Set Up with Apache Spark

We are in the home stretch! Let’s finish strong with your first Apache Spark Program.

  1. In a new tab, pull up your GitHub account. If you don’t have one, now is the perfect time to start tracking your Data Engineering progress. You can create one here: https://github.com/
  2. Navigate to https://github.com/nickrafferty78/Apache-Spark-Is-Fun and Fork the repository.
  3. On your repository, click Code and copy the URL that pops up below.
Clone GitHub repository for this tutorial with: “Clone with Https”

4. In your terminal, navigate to the folder where you want to store your project. For this example, I created a folder called temp under Documents. Then type this command: git clone https://github.com/nickrafferty78/Apache-Spark-Is-Fun.git (or use the URL you copied from your fork in step 3).

Clone repository in your terminal with the command ‘git clone’

5. Open IntelliJ and click ‘Open.’ Navigate to the folder that you just cloned the repository into and click ‘Ok.’

6. In the top left corner of IntelliJ, click the dropdown on Project and click Project Files.

Project Files

7. Navigate to src/main/scala/Tutorial.

Navigate to src/main/scala/Tutorial.scala

8. You’ll see a popup that says ‘Project JDK is Not Defined.’ Click ‘Setup JDK.’ Make sure you don’t click Download JDK or Add JDK. Instead, you’ll click the link to the detected SDKs to set up Java 11 as your project version.

Detected SDK: JDK 11

9. After you click that, be patient because indexing will take a few minutes. You can check the status from a progress bar at the bottom of IntelliJ.

10. You should now be able to run your first Spark Program! Click the green play button that shows up next to object Tutorial extends App { in the src/main/scala/Tutorial.scala file.

One final note for Windows Users

You have one more step! If you run this program as is, you’ll likely see an error that says: Failed to locate the winutils binary in the hadoop binary path.

To fix this you’ll need to follow these steps:

  1. Download the winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
  2. Create a new folder to store this. In my example I use C:\winutils\bin (The file winutils.exe is placed inside of the bin folder nested in the winutils folder)
  3. Move winutils.exe inside the folder you just created: C:\winutils\bin
  4. Back in IntelliJ, let’s set up your Hadoop home directory. Add this line of code right under object Tutorial extends App {

System.setProperty("hadoop.home.dir", "C:\\winutils")
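That single line is all you need on Windows. If the same project will also be run on a Mac, here is a slightly more defensive Scala sketch that sets the property only on Windows and only when winutils.exe is actually in place. HadoopHome, isWindows, resolve, and configure are hypothetical helper names for this example, and C:\winutils is the folder assumed from step 2.

```scala
import java.nio.file.{Files, Paths}

object HadoopHome {
  // True when the os.name system property describes a Windows machine.
  def isWindows(osName: String): Boolean =
    osName.toLowerCase.startsWith("windows")

  // Returns the folder to use as hadoop.home.dir, or None when
  // the property should be left alone.
  def resolve(osName: String, candidate: String): Option[String] = {
    val winutils = Paths.get(candidate, "bin", "winutils.exe")
    if (isWindows(osName) && Files.exists(winutils)) Some(candidate) else None
  }

  // Sets hadoop.home.dir when resolve finds a usable folder.
  // The backslash is doubled because it is an escape character in Scala strings.
  def configure(candidate: String = "C:\\winutils"): Unit =
    resolve(System.getProperty("os.name"), candidate)
      .foreach(dir => System.setProperty("hadoop.home.dir", dir))
}
```

With this in the project, calling HadoopHome.configure() as the first line inside object Tutorial would take the place of the direct System.setProperty call, and it does nothing on a Mac.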

And you’re done!

Source:
nbctv.tumblr.com/post/60082868678/working-hard-or-hardly-working-we-hope-its-the

You just made it through everything you need to be set up for Data Engineering! Feel free to have a little Office themed dance party 🎉

Continue Learning

Continue learning Apache Spark through my series: Apache Spark is Fun!
