Install Spark on Ubuntu (PySpark)
The video above demonstrates one way to install Spark (PySpark) on Ubuntu. The following instructions walk you through the same installation process.
Prerequisites: Anaconda. If you already have Anaconda installed, skip to step 2.
1. Download and install Anaconda. If you need help, please see this tutorial. Then go to the Apache Spark website (link) and download a pre-built Spark package (these instructions assume spark-2.0.0-bin-hadoop2.7).
2. Make sure you have Java installed on your machine. If you don’t, I found the link below useful.
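If you are not sure whether Java is already present, you can check from the terminal. This is a quick sketch; the OpenJDK 8 package named in the message is one common choice on Ubuntu, not the only option:

```shell
# Check whether a Java runtime is on the PATH; Spark needs one to run.
if command -v java >/dev/null 2>&1; then
    # Prints the installed version (note: java -version writes to stderr)
    java -version
else
    echo "Java not found; on Ubuntu you can install it with: sudo apt-get install openjdk-8-jdk"
fi
```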
3. Go to your home directory using the following command.
cd ~
4. Unzip the folder in your home directory using the following command.
tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
5. Use the following command to see that you have a .bashrc file.
ls -a
6. Next, we will edit our .bashrc file so we can open a Spark notebook from any directory.
7. Don’t remove anything already in your .bashrc file. Add the following to the bottom of your .bashrc file.
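The exact lines from the video are not reproduced here, but a .bashrc addition consistent with the parameters described in the Notes below might look like the sketch that follows. The function name `snotebook` and the Spark path are assumptions; adjust the path to match where you unzipped Spark:

```shell
# Hypothetical helper function for ~/.bashrc -- the name and path are assumptions.
# Opens a Jupyter notebook backed by a PySpark shell running locally on 2 cores.
function snotebook {
    # Path to the unpacked Spark folder from step 4
    SPARK_PATH=~/spark-2.0.0-bin-hadoop2.7

    # Make PySpark start inside Jupyter Notebook instead of the plain REPL
    export PYSPARK_DRIVER_PYTHON="jupyter"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

    # Launch Spark locally on 2 cores
    $SPARK_PATH/bin/pyspark --master local[2]
}
```

After reloading your .bashrc, typing `snotebook` in any directory would then launch the notebook.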
8. Save and exit your .bashrc file. Then either close the terminal and open a new one, or type the following in your terminal:
source .bashrc
Notes: The PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS parameters are used to launch the PySpark shell in Jupyter Notebook. The --master parameter sets the master node address; here we launch Spark locally on 2 cores for local testing.
Please let me know if you have any questions. You can also test your PySpark installation here!
Common issues: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
If you get this type of error message, the next couple of steps can help.
1. Download the Hadoop binary (link) and put it in your home directory (you can choose a different Hadoop version if you like and change the next steps accordingly).
2. Unzip the folder in your home directory using the following command.
tar -zxvf hadoop-2.8.0.tar.gz
3. Now add the following line to your .bashrc file. Open a new terminal and try again.
export HADOOP_HOME=~/hadoop-2.8.0