Getting Set Up With IntelliJ, Git, Java, and Apache Spark
Installations. The dread is already sinking in at the word.
BUT WAIT… Before you throw your computer across the room, let’s see if we can do this together.
Setting Up IntelliJ
Let’s warm up with an easy one. IntelliJ is an Integrated Development Environment (IDE) that will give you a great user interface to work with for data engineering.
I’d recommend starting with the community version. It’s free and has all the functionality you need to get started on some data engineering projects. Download IntelliJ here: https://www.jetbrains.com/idea/download
Click the download and you should see it start to install.
Click Next through a couple of default options; at one point you’ll be asked to pick your theme. You can choose either light or dark, and don’t worry, you can change it at any time!
Important! Make sure to add the Scala Plugin
Scala will be the language that we use to write our Spark Applications. This plugin adds some built-in functionality to make learning Apache Spark and Scala more beginner-friendly.
And that’s it! You’re good to go. When you launch it, you should see IntelliJ’s welcome screen.
Setting Up Git
Before you install Git on your computer, let’s first run a quick test to see if you already have it installed. Open up the terminal (if you’re using a Mac) or the Command Prompt (if using Windows). You can search for them on your computer if you have never used them before.
Once you’ve opened the terminal, type `git --version`. If you already have Git installed, it will print out something similar to `git version 2.21.1`. If not, no worries, that’s what this tutorial is for!
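That check can also be wrapped in a small guard so it prints a friendly message either way. Here is a minimal sketch for macOS Terminal or Git Bash (`command -v` is the portable way to test whether a program is on your PATH):

```shell
# Print the Git version if installed, otherwise a hint.
if command -v git >/dev/null 2>&1; then
  git --version
else
  echo "git is not installed yet"
fi
```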
Windows and Mac users will have different steps, so follow along to whichever is applicable to you.
Windows Users:
1. Download Git: https://git-scm.com/downloads/. Make sure to click Windows under the downloads section.
2. Click on the download once it completes, and you should see an installer pop up. You’ll install Git under `C:\Program Files\Git`. This is the default, so you should not have to change anything.
3. You’ll be given multiple options at each step along the way. Click ‘Next’ and accept the defaults you are presented with.
4. Edit your system environment variables. This is a crucial step for configuring Git on Windows. To find them, type “edit the system environment variables” into your Windows search bar and open the result.
5. Click on Path to highlight it, then click ‘Edit.’
6. Click ‘New’ and add these two paths:
   - `C:\Program Files\Git\bin\`
   - `C:\Program Files\Git\cmd\`
7. Click ‘OK’ and you’re done! Close the Command Prompt, re-open it, and type `git --version`. You should see something like `git version 2.28.0.windows.1`.
Mac Users
1. In IntelliJ’s welcome screen, click ‘Get from Version Control.’ You’ll be prompted with a notification that says the “git” command requires the command line developer tools. Click Install.
2. After the installation finishes (it can take a few minutes), you should be good to go! Press Command + Space to open Spotlight and search for Terminal.
3. In the terminal, type `git --version`. If successful, you should see something like `git version 2.21.1 (Apple Git-122.3)`.
Setting Up Java
Similarly to Git, you can check whether you already have Java installed by typing `java --version`. For Apache Spark, we will use Java 11 and Scala 2.12. If you do not have Java 11 installed, follow these steps:
Windows Users
1. Navigate to Oracle’s Java 11 download: https://www.oracle.com/java/technologies/javase-jdk11-downloads.html
2. Click to download the Windows x64 Installer. It will be a .exe file.
3. After accepting the Oracle agreement, you’ll be prompted to create an Oracle account. Oracle will not let you download unless you create one.
4. Once your account is set up, click the Windows x64 Installer on the download page again. Now it should allow you to download the .exe file to your computer.
5. Click ‘Next’ through the default steps until you see that Java 11 was successfully installed.
6. Search for ‘Edit the system environment variables,’ as shown in the Git setup section for Windows.
7. Under ‘User variables,’ click ‘New.’ Add `JAVA_HOME` as the variable name, with your install path as the value: `C:\Program Files\Java\jdk-11.0.8`. Pay attention to which version you downloaded; the 11.0.8 in my example may change based on your version.
8. Click ‘Path’ to highlight it, then click ‘Edit.’
9. Click ‘New’ and add one new path: `%JAVA_HOME%\bin`
10. Open a new Command Prompt and type `java --version`. You should now see `java 11.0.8` pop up. Congrats on setting up Java!
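The JAVA_HOME setup above follows a pattern worth knowing: one variable points at the install, and PATH references it, so upgrading Java only means changing JAVA_HOME. A POSIX-shell sketch of the same idea (the path here is illustrative; Windows writes `%JAVA_HOME%` where Unix shells write `$JAVA_HOME`):

```shell
# Point JAVA_HOME at a JDK install (illustrative path), then
# prepend its bin directory to PATH.
JAVA_HOME="/usr/lib/jvm/jdk-11.0.8"
PATH="$JAVA_HOME/bin:$PATH"
# The first PATH entry is now the JDK's bin directory.
echo "$PATH" | cut -d: -f1
```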
Mac Users
1. Navigate to Oracle’s Java 11 download: https://www.oracle.com/java/technologies/javase-jdk11-downloads.html
2. Click the download for the macOS Installer; it will be a .dmg file.
3. After accepting the Oracle agreement, you’ll be prompted to create an Oracle account. Oracle will not let you download unless you create one.
4. Once your account is set up, click the macOS Installer on the download page again. Now it should allow you to download the .dmg file to your computer.
5. Open the download and continue through the JDK 11.0.8 installer.
6. Open up a terminal and type `java --version`. You should now see `Java SE 11.0.8`. You’re all set up!
Getting Set Up with Apache Spark
We are in the home stretch! Let’s finish strong with your first Apache Spark Program.
1. In a new tab, pull up your GitHub account. If you don’t have one, now is the perfect time to start tracking your Data Engineering progress: create an account at https://github.com/
2. Navigate to https://github.com/nickrafferty78/Apache-Spark-Is-Fun and fork the repository.
3. On your fork, click Code and copy the URL that pops up below it.
4. Navigate to a new folder that you want to store your project in. For this example, I created a folder called temp under Documents. Type this command: `git clone https://github.com/nickrafferty78/Apache-Spark-Is-Fun.git`
5. Open IntelliJ and click ‘Open.’ Navigate to the folder you just cloned the repository into and click ‘OK.’
6. In the top-left corner of IntelliJ, click the dropdown on Project and select Project Files.
7. Navigate to src/main/scala/Tutorial.
8. You’ll see a popup that says ‘Project JDK is Not Defined.’ Click ‘Setup JDK.’ Make sure you don’t click Download JDK or Add JDK. Instead, click the detected SDK to set up Java 11 as your project version.
9. After you click that, be patient, because indexing will take a few minutes. You can check the status on the progress bar at the bottom of IntelliJ.
10. You should now be able to run your first Spark program! Click the green play button that shows up next to `object Tutorial extends App {` in the src/main/scala/Tutorial file.
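For context on how a project like this pins the versions we installed: sbt, the Scala build tool, declares the Scala version and the Spark dependency in a build.sbt file. The forked repo already contains its own build definition; the sketch below is only illustrative, and its name and version numbers are assumptions, not the repo’s actual ones:

```scala
// build.sbt (illustrative sketch): target Scala 2.12 and depend on Spark SQL.
name := "apache-spark-is-fun"          // hypothetical project name
scalaVersion := "2.12.15"              // any 2.12.x works with this Spark line
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
```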
One final note for Windows Users
You have one more step! If you run this program as is, you’ll likely see an error that says: `Failed to locate the winutils binary in the hadoop binary path`.
To fix this you’ll need to follow these steps:
1. Download winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
2. Create a new folder to store it. In my example I use `C:\winutils\bin` (the file winutils.exe is placed inside the bin folder nested in the winutils folder).
3. Move `winutils.exe` inside the folder you just created: `C:\winutils\bin`
4. Back in IntelliJ, let’s set up your Hadoop home directory. Add this line of code right under `object Tutorial extends App {` (note the doubled backslash, which Scala string literals require):
   `System.setProperty("hadoop.home.dir", "C:\\winutils")`
And you’re done!
You just made it through everything you need to be set up for Data Engineering! Feel free to have a little Office-themed dance party 🎉
Continue Learning
Continue learning Apache Spark through my series: Apache Spark is Fun!