Running Spark on a Local Machine

Apache Spark is a fast and general-purpose cluster computing system. To get the most out of it, Spark should run on a distributed computing system. However, one might not always have access to a distributed system, and especially for learning purposes one might want to run Spark on one's own computer. This is actually a very easy task, and there are a handful of ways to do it. I will show what I have done to run Spark on my laptop.

The first step is to download Spark from this link (in my case I put it in the home directory). Then extract the archive from the command line, or by right-clicking on the *.tgz file. The following figure shows my unzipped folder, from where I would run Spark.

Downloaded spark (unzipped)
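If you prefer the command line for the extraction, a minimal sketch of that step (assuming the archive kept its default name and was saved in the home directory):

    cd ~
    # Extract the downloaded Spark archive in place
    tar -xzf spark-2.2.1-bin-hadoop2.7.tgz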

Running Spark from the command line

Now, we can easily run Spark from the command line. We need the location of the Spark folder; in my case it is /home/user_name/spark-2.2.1-bin-hadoop2.7 (it can be copied from the file manager's address bar using CTRL + L). Next I need to set two environment variables, $SPARK_HOME and $PYSPARK_PYTHON. After doing this, I can run Spark by writing ${SPARK_HOME}/bin/pyspark. The lines I used in my terminal were roughly the following (the exact paths will differ on your machine):
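    # Point SPARK_HOME at the unzipped Spark folder
    export SPARK_HOME=/home/user_name/spark-2.2.1-bin-hadoop2.7
    # Tell Spark which Python interpreter to use for PySpark
    export PYSPARK_PYTHON=python3
    # Launch the interactive PySpark shell
    ${SPARK_HOME}/bin/pyspark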

Now, running a Spark script is easy as well. I just need to run a line like the following in the command line (sparkcode.py is the file where I have written a few lines of Spark code):
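    # Submit a local PySpark script through spark-submit
    ${SPARK_HOME}/bin/spark-submit sparkcode.py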

Okay, so everything works fine so far. However, if we close the terminal and write ${SPARK_HOME}/bin/pyspark in a new terminal, it will not work, because the environment variables we set are no longer defined. Do we need to set the variables every time we open a new terminal? To make them permanent, we can edit the .bashrc file in the home directory. Let's check if the file is there.
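Listing all files, including the hidden ones, will show it:

    # -a includes hidden (dot) files in the listing
    ls -a ~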

The file is hidden (its name starts with a dot), which is why the -a flag is needed; it sits in the home directory, as shown in the figure below.

Now, we need to open .bashrc in any editor (I am opening it with nano):
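    # Open .bashrc in the nano editor
    nano ~/.bashrc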

Add lines along the following at the end of the file, then save and exit (CTRL + X, then confirm). The PATH entry is what lets us type pyspark directly later on:
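    # Make the Spark variables available in every new shell
    export SPARK_HOME=/home/user_name/spark-2.2.1-bin-hadoop2.7
    export PYSPARK_PYTHON=python3
    # Put the Spark binaries on the PATH so pyspark can be run directly
    export PATH=${SPARK_HOME}/bin:$PATH

New terminals will pick these up automatically; for the current session, run source ~/.bashrc once.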

We are done! Now we can just write pyspark in the command line to start Spark, as shown in the following figure.

While the Spark application is running, we can monitor it on the web front-end at http://localhost:4040/.

Running Spark in a Jupyter notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text [1]. It is very popular among data scientists. I use Jupyter a lot, especially for small projects where I need to add explanations and visualizations. As I also use Spark for different projects, I need to run Spark from my Jupyter notebook. There are a few ways of doing that; the simplest is to install the package findspark.

I already have Jupyter installed on my laptop, so I just need to install findspark with the following command. When called without arguments, findspark reads the $SPARK_HOME environment variable, so the previous step where we set $SPARK_HOME in .bashrc is a prerequisite:
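    # Install findspark (use pip3 if pip points to Python 2)
    pip install findspark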

Now we can start the notebook server by writing jupyter notebook on the command line:
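    # Start the Jupyter Notebook server; it opens in the default browser
    jupyter notebook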

This opens Jupyter Notebook in a browser. In a new notebook, code along the following lines initializes findspark, after which we can run Spark and do anything we want:
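    import findspark
    findspark.init()  # with no arguments, findspark uses $SPARK_HOME

    # Once findspark has done its job, pyspark imports normally
    import pyspark
    sc = pyspark.SparkContext(appName="jupyter_spark")  # appName is an arbitrary label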

As a test of whether Spark really works, something as simple as the following will do, continuing in the same notebook with the sc from the previous cell:
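    # Distribute a small list across the local workers and transform it
    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16, 25]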

So, we are all set. Let’s have some fun with Spark.
