Apache Spark on my Mac

PySpark + Mac

Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing.

Installation and setup

Head to the Spark download page and download the spark-2.1.1-bin-hadoop2.7.tgz package (pre-built for Hadoop 2.7).

Navigate to the download directory and execute the following command in the terminal

tar xvf spark-2.1.1-bin-hadoop2.7.tgz

The above command extracts the gzipped tarball and creates a folder named spark-2.1.1-bin-hadoop2.7. Once the extraction is complete, execute the following commands in the terminal to move the Spark files to /usr/local/spark

cd <path where spark is extracted>
mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark

Let’s go ahead and add the Spark bin directory to the PATH in ~/.bash_profile (Mac) or ~/.bashrc (Linux)

export PATH=$PATH:/usr/local/spark/bin
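
Optionally, you can also set SPARK_HOME and tell PySpark which Python interpreter to launch; Spark respects these environment variables if they are present. This is an extra step on top of the original instructions, so adjust the values to your own setup:

# optional additions to the same profile file
export SPARK_HOME=/usr/local/spark
export PYSPARK_PYTHON=python   # interpreter PySpark should launch (here, Anaconda's python)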

Now source the profile so that the updated environment variables take effect.

#mac
source ~/.bash_profile
#linux,unix
source ~/.bashrc
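
To double-check that the shell now picks Spark up from the new location (this quick check is my addition, not part of the original steps), ask which pyspark binary gets resolved:

which pyspark
-------------------
output
-------------------
#/usr/local/spark/bin/pyspark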

In order to run Spark, we need a few prerequisites installed.

Prerequisites:

Java 8 (JDK)
Python 3 (the Anaconda distribution is used here)
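
If either prerequisite is missing, one way to install them on a Mac is through Homebrew. This is an assumption on my part (the original steps don't cover installation), and the cask names may differ between Homebrew versions:

# assumes Homebrew with cask support is installed
brew cask install java
brew cask install anaconda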

Once we are done with all of the above setup, we can confirm the installation using the following commands.

Verify Java

java -version
-------------------
output
-------------------
#java version "1.8.0_121"
#Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
#Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Verify python

python -V
-------------------
output
-------------------
#Python 3.6.1 :: Anaconda custom (x86_64)

Verify Anaconda

conda -V
-------------------
output
-------------------
#conda 4.3.22

Verify Spark installation

Execute the following command in the terminal

pyspark
-------------------
output
-------------------
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Python version 3.6.1 (default, May 11 2017 13:04:09)
SparkSession available as 'spark'.
>>> sc.version
'2.1.1'

Now that we are done setting up Spark, let us run an example to test our installation.

Let’s use the README.md that ships with Spark, now located at /usr/local/spark/README.md.

>>> data = sc.textFile('/usr/local/spark/README.md')
>>> data.count()
104
>>> data.first()
'# Apache Spark'
>>>
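
As a slightly larger exercise (my addition, not part of the original walkthrough), the classic word count can be run on the same RDD, data, straight from the pyspark shell:

>>> words = data.flatMap(lambda line: line.split())
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
>>> counts.takeOrdered(5, key=lambda kv: -kv[1])  # five most frequent (word, count) pairs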

You have successfully set up Apache Spark on your Mac.

Happy coding!