Apache Spark on my Mac

PySpark + Mac

Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing.

Installation and setup

Hit the Spark download link and download the spark-2.1.1-bin-hadoop2.7.tgz package.

Navigate to the download directory and execute the following command in a terminal:

tar xvf spark-2.1.1-bin-hadoop2.7.tgz

The above command extracts the gzipped tarball and creates a folder named spark-2.1.1-bin-hadoop2.7. Once extraction completes, execute the following commands in the terminal to move the Spark files to /usr/local/spark:

cd <path where spark is extracted>
mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark

Let’s go ahead and add Spark’s bin directory to the PATH in ~/.bash_profile (Mac) or ~/.bashrc (Linux):

export PATH=$PATH:/usr/local/spark/bin

Now source the profile to use the updated “PATH” variable.

source ~/.bash_profile
source ~/.bashrc
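If you want to sanity-check the PATH change from Python, the following sketch tests whether a PATH-style string contains the Spark bin directory (the directory matches the move above; the helper function itself is hypothetical, not part of Spark):

```python
import os

def spark_on_path(path_var, spark_bin="/usr/local/spark/bin"):
    """Return True if spark_bin appears as an entry in a PATH-style string."""
    return spark_bin in path_var.split(os.pathsep)

# After sourcing the profile, the live PATH should contain the Spark bin dir.
print(spark_on_path(os.environ.get("PATH", "")))
```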

In order to run Spark, we need a few prerequisites: Java (Spark runs on the JVM) and, for PySpark, a Python installation (here provided by Anaconda).

Once these are in place, we can confirm the installation using the following commands.

Verify Java

java -version
#java version "1.8.0_121"
#Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
#Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Verify Python

python -V
#Python 3.6.1 :: Anaconda custom (x86_64)

Verify Anaconda

conda -V
#conda 4.3.22
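The three checks above can also be scripted. This sketch (a hypothetical helper, not part of Spark or conda) runs a version command and returns its output, or None when the tool is missing:

```python
import subprocess

def tool_version(cmd):
    """Run a version command; return its combined output, or None if the tool is absent."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # Some tools (e.g. java, older pythons) print the version to stderr.
    return (result.stdout + result.stderr).strip() or None

for cmd in (["java", "-version"], ["python", "-V"], ["conda", "-V"]):
    print(cmd[0], "->", tool_version(cmd))
```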

Verify Spark installation

Launch the PySpark shell by executing the following command in the terminal:

pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Python version 3.6.1 (default, May 11 2017 13:04:09)
SparkSession available as 'spark'.

Now that Spark is set up, let’s run a quick example to test it.

Let’s use the README.md which is located inside the extracted folder.

>>> data = sc.textFile('/usr/local/spark/README.md')
>>> data.count()
>>> data.first()
'# Apache Spark'
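For intuition about what the two actions return, here is a plain-Python sketch of the same computation on a tiny stand-in file (local Python only, not Spark; the sample file name and contents are made up):

```python
from pathlib import Path

# A tiny stand-in for README.md so the sketch is self-contained.
sample = Path("README_sample.md")
sample.write_text("# Apache Spark\n\nSpark is a fast and general engine.\n")

lines = sample.read_text().splitlines()  # textFile(): one element per line
print(len(lines))   # count(): number of lines -> 3
print(lines[0])     # first(): '# Apache Spark'
```

In real Spark the lines live in a distributed RDD and `count()`/`first()` trigger a job, but the results are the same idea.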

You have successfully set up Apache Spark on your Mac.

Happy coding!