Introduction and Installation of PySpark for Mac Users

Mrinalini M · Published in Analytics Vidhya · 5 min read · Apr 15, 2020

This blog briefly introduces you to a not-so-new but one of the most widely used technologies in Big Data and AI: Apache Spark!

Yes, Spark is almost 10 years old by now. This blog also throws some light on PySpark and the steps required to install it on your machine. Let’s get started.

What is Spark? Why Spark at all?

Spark is an open-source big data platform that helps you deal with large datasets in a fast and efficient way. It is a large-scale data processing framework that performs cluster computing: it splits the data across several nodes (think of them as computers) and runs the computation on each node of the cluster in parallel. We usually reach for big data technologies when the data is so large that it doesn’t fit into a single machine’s RAM. Many of you might be wondering why we don’t just use Hadoop’s MapReduce for such work. Well, that’s because Spark can be up to 100x faster than Hadoop’s MapReduce for in-memory workloads. Here is how:

MapReduce writes the data to disk after each map and reduce operation, while Spark keeps most of the data in memory after each transformation.
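
To make this concrete, here is a minimal sketch (assuming PySpark is already installed, which we cover below; the numbers are made up purely for illustration) of Spark splitting data into partitions, processing them in parallel and keeping an intermediate result in memory:

from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses every core on this machine.
spark = SparkSession.builder.master("local[*]").appName("intro-demo").getOrCreate()

# Spark splits the numbers into 8 partitions and processes them in parallel.
rdd = spark.sparkContext.parallelize(range(1000000), numSlices=8)

# cache() keeps the transformed data in memory, so the second action below
# reuses it instead of recomputing it from scratch.
doubled = rdd.map(lambda x: x * 2).cache()
print(doubled.count())
print(doubled.sum())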

Be it a data analyst or a data scientist who wants to rapidly query, analyse, transform or draw insights from “Big” data, Spark is the go-to solution, because Spark ships with libraries for SQL, machine learning, stream processing and graph computation. It also supports various programming languages like Java, Python and R.

PySpark

PySpark is the Python API for the Spark framework, released by Apache Spark to support Python with Spark. One can easily leverage all the capabilities of Spark using this API. Let’s look at how we can set up PySpark on our machine.

Pre-requisites:

*Jump to Installation if you have all these installed*

1. Homebrew

If it is not installed, follow these steps:

  • Go to terminal and run:
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • Once the installation is successful, run the following command to confirm the installation:
brew doctor

If you get a message saying “Your system is ready to brew”, you can move on to the next step.

2. Python

I assume that you already have Python installed; if not, you can install it using Homebrew as well:

brew install python3

3. Jupyter Notebook (optional)

You can install it either via the Anaconda Navigator or via Homebrew:

brew install jupyter

If you have all the prerequisites in place, then you are good to go!

Installation

1. Install Java:

We need to install Java first because Spark is written in Scala, which is a Java Virtual Machine (JVM) language.

brew cask install java

This will install the latest version of Java (it takes quite some time to install).

2. Install Scala:

Scala is a dependency for installing Apache Spark, so install it using brew:

brew install scala

You can confirm the installation with:

brew info scala

and you should see something like:

[image 1: output of brew info scala]

3. Install Spark

Apache Spark is a distributed framework for handling Big Data. We can now install Spark right away.

brew install apache-spark

and confirm (like we always do 🤷🏻‍♀️) with:

brew info apache-spark

[image 2: output of brew info apache-spark]

4. Install PySpark

Install PySpark using pip3:

pip3 install pyspark
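
Optionally (this quick check is not part of the original steps), you can confirm the pip installation from a Python shell:

import pyspark

# Should print the installed PySpark version, e.g. 2.4.5
print(pyspark.__version__)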

5. Setting up the environment variables

We need to define certain environment variables and paths so that Spark is accessible via PySpark.

  • Open your terminal and run:
cd ~
vim .bashrc

Then define the following variables:

export JAVA_HOME=/Library/java/JavaVirtualMachines/adoptopenjdk-8.jdk/contents/Home/
export JRE_HOME=/Library/java/JavaVirtualMachines/openjdk-13.jdk/contents/Home/jre/
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.5/libexec
export PATH=/usr/local/Cellar/apache-spark/2.4.5/bin:$PATH
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Make sure you provide the correct Spark version in the 3rd and 4th variables (SPARK_HOME and PATH). After saving the file, run source ~/.bashrc or open a new terminal so that the variables take effect.
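
As a quick sanity check (again, not one of the original steps), you can verify from Python in a new terminal session that the variables are picked up:

import os

# These should match what you exported in .bashrc; adjust the paths to your versions.
print(os.environ.get("SPARK_HOME"))      # e.g. /usr/local/Cellar/apache-spark/2.4.5/libexec
print(os.environ.get("PYSPARK_PYTHON"))  # e.g. /usr/local/bin/python3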

Almost done! All that is left now is to check whether PySpark works.

Now type:

pyspark

in your terminal. If Spark is successfully installed with all its dependencies and the environment is set up properly, you should see:

[image 3: Spark welcome banner showing the SparkSession]

Congratulations!!! 👏🏻 You have set up PySpark successfully on your machine. Now you can either launch a Jupyter notebook to play around with PySpark functionalities further, or start coding in your .py scripts. Well, we have already taken the first step in learning PySpark!

Now launch your Jupyter notebook and execute the following commands.

import os
import pyspark

PySpark isn’t on sys.path by default, so it can’t be used as a regular library straight away. Hence there’s a need to add PySpark to sys.path at runtime. How do we do this? No worries, the “findspark” package does that for you. Install findspark using the following command:

pip3 install findspark

findspark searches for the path where your Spark is set up and initialises it in your Jupyter notebook.

The following code does the same:

import findspark
findspark.init()
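
If findspark cannot locate Spark on its own, it also accepts the Spark home as an argument; the path below is the SPARK_HOME we set in step 5, so adjust it to your version:

import findspark

# Point findspark at the brew-installed Spark explicitly (adjust the version number).
findspark.init("/usr/local/Cellar/apache-spark/2.4.5/libexec")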

Now, to ensure all is well in Jupyter, try the following command:

# execute shell.py from SPARK_HOME to initialise the PySpark shell (this creates sc and spark)
exec(open(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py')).read())

Upon running the exec command, you’ll again see:

[image 4: Spark welcome banner inside the Jupyter notebook]

As you can see in images 3 and 4, a SparkSession has already been created for us. What you can see in ‘spark’ is:

[image 5: details of the SparkSession object]

Voilà!! All set!! 👍🏼 You can now make use of Spark’s capabilities along with Python and deal with large datasets very efficiently and swiftly.
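
For example, here is a small sketch of working with a Spark DataFrame (the data and column names are made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("first-steps").getOrCreate()

# A tiny made-up dataset, just to show the DataFrame API.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# Show only the rows where age is greater than 30.
df.filter(df.age > 30).show()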

In the coming stories, I’ll be covering concepts related to RDDs, DataFrames and other relevant topics. Till then, goodbye.
