Working With Apache Spark, Python and PySpark

This article is a quick guide to installing Apache Spark on a single node and to using Spark's Python library, PySpark.

1. Environment

  • Hadoop Version: 3.1.0
  • Apache Kafka Version: 1.1.1
  • Operating System: Ubuntu 16.04
  • Java Version: Java 8

2. Prerequisites

Apache Spark requires Java. To make sure that Java is installed, first update the operating system, then install Java:

sudo apt-get update
sudo apt-get -y upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
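Spark 2.x requires Java 8. One way to verify the installation, sketched here in Python, is to parse the first banner line printed by java -version (the sample strings below are illustrative):

```python
import re

def java_major_version(version_line):
    """Extract the major Java version from a `java -version` banner line.

    Handles both the legacy scheme ("1.8.0_181" means Java 8) and the
    newer scheme ("10.0.2" means Java 10).
    """
    match = re.search(r'version "(\d+)\.(\d+)', version_line)
    if not match:
        raise ValueError("unrecognized version string: " + version_line)
    major, minor = int(match.group(1)), int(match.group(2))
    return minor if major == 1 else major

# Banner as printed by the Oracle Java 8 installer
print(java_major_version('java version "1.8.0_181"'))  # -> 8
```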

3. Installing Apache Spark

3.1. Download and install Spark

First, we need to create a directory for Apache Spark.

sudo mkdir /opt/spark

Then, we need to download the Apache Spark binaries package:

wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

Next, we need to extract the Apache Spark files into the /opt/spark directory:

sudo tar -xzvf spark-2.3.1-bin-hadoop2.7.tgz --directory=/opt/spark --strip-components=1
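The --strip-components=1 flag drops the leading spark-2.3.1-bin-hadoop2.7/ directory so the files land directly under /opt/spark. The same effect can be sketched with Python's standard library (the throwaway archive below stands in for the Spark tarball):

```python
import os
import tarfile
import tempfile

def extract_strip1(archive_path, dest_dir):
    """Extract a .tgz into dest_dir, stripping the top-level directory,
    like `tar -xzf archive --directory=dest --strip-components=1`."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/", 1)
            if len(parts) < 2 or not parts[1]:
                continue  # skip the top-level directory entry itself
            member.name = parts[1]
            tar.extract(member, dest_dir)

# Demonstration with a throwaway archive
work = tempfile.mkdtemp()
src = os.path.join(work, "spark-x.y.z-bin")
os.makedirs(os.path.join(src, "jars"))
open(os.path.join(src, "jars", "demo.jar"), "w").close()
archive = os.path.join(work, "spark.tgz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="spark-x.y.z-bin")
dest = os.path.join(work, "opt_spark")
extract_strip1(archive, dest)
print(sorted(os.listdir(dest)))  # -> ['jars']
```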

3.2. Configure Apache Spark

When Spark launches jobs, it transfers its jar files to HDFS so they are available to all worker machines. These files add significant overhead on smaller jobs, so we can package them once, copy them to HDFS, and tell Spark it does not need to copy them over any more:

jar cv0f ~/spark-libs.jar -C /opt/spark/jars/ .
hdfs dfs -mkdir /spark-libs
hdfs dfs -put ~/spark-libs.jar /spark-libs/
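A jar file is an ordinary zip archive, and the 0 in jar cv0f means "store without compression". The equivalent packaging step can be sketched with Python's standard library (the demo files below are made-up stand-ins for the Spark jars):

```python
import os
import tempfile
import zipfile

def pack_jars(jars_dir, archive_path):
    """Bundle every file under jars_dir into one uncompressed archive,
    mirroring `jar cv0f archive -C jars_dir .` (ZIP_STORED = no compression)."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
        for root, _, files in os.walk(jars_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, jars_dir))

# Demonstration with throwaway files
work = tempfile.mkdtemp()
jars = os.path.join(work, "jars")
os.makedirs(jars)
open(os.path.join(jars, "a.jar"), "w").close()
open(os.path.join(jars, "b.jar"), "w").close()
out = os.path.join(work, "spark-libs.jar")
pack_jars(jars, out)
print(sorted(zipfile.ZipFile(out).namelist()))  # -> ['a.jar', 'b.jar']
```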

After copying the files, we must tell Spark, through the spark-defaults configuration file, to use this archive instead of copying the jar files:

sudo gedit /opt/spark/conf/spark-defaults.conf

Add the following lines:

spark.master spark://localhost:7077
spark.yarn.preserve.staging.files true
spark.yarn.archive hdfs:///spark-libs/spark-libs.jar
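spark-defaults.conf is a plain whitespace-separated key/value file, with blank lines and # comments ignored. As an illustration, the format can be parsed like this:

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf content: one `key value` pair per line,
    separated by whitespace; blank lines and # comments are skipped."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

sample = """
# single-node setup
spark.master spark://localhost:7077
spark.yarn.preserve.staging.files true
spark.yarn.archive hdfs:///spark-libs/spark-libs.jar
"""
print(parse_spark_defaults(sample)["spark.master"])  # -> spark://localhost:7077
```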

In this article we configure Apache Spark to run on a single node, so the only worker will be localhost:

sudo gedit /opt/spark/conf/slaves

Make sure that it contains only the value localhost.

Before running the services, we must open the .bashrc file using gedit:

sudo gedit ~/.bashrc

And add the following lines

export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_MASTER_HOST=localhost

Now, we have to run Apache Spark services:

sudo /opt/spark/sbin/start-master.sh
sudo /opt/spark/sbin/start-slave.sh spark://localhost:7077

4. Installing Python

4.1. Getting latest Python release

Ubuntu 16.04 ships with both Python 3 and Python 2 pre-installed. To make sure that our versions are up-to-date, we must update and upgrade the system with apt-get (mentioned in the prerequisites section):

sudo apt-get update
sudo apt-get -y upgrade

We can check the version of Python 3 that is installed in the system by typing:

python3 -V

It should print the installed Python release (for example: Python 3.5.2).
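The same check can be made from inside the interpreter, where sys.version_info exposes the running version as a comparable tuple:

```python
import sys

# sys.version_info holds the running interpreter's version as a tuple,
# so it can be compared directly against a minimum requirement
print("Python %d.%d.%d" % sys.version_info[:3])
assert sys.version_info >= (3,), "Python 3 is required for the venv module"
```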

4.2. Install Python utilities

To manage software packages for Python, we must install pip utility:

sudo apt-get install -y python3-pip

There are a few more packages and development tools to install to ensure that we have a robust set-up for our programming environment.

sudo apt-get install -y build-essential libssl-dev libffi-dev python3-dev

4.3. Building the environment

First, we need to install the venv module, which allows us to create virtual environments:

sudo apt-get install -y python3-venv

Next, we have to create a directory for our environment:

mkdir testenv

Now we have to go into this directory and create the environment (all environment files will be created inside a directory called my_env):

cd testenv
python3 -m venv my_env

When this finishes, we can inspect the files created for the environment with ls my_env.

To use this environment, you need to activate it:

source my_env/bin/activate
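Once activated, the environment's interpreter reports a prefix different from the base installation, which gives a programmatic way to confirm the activation (a small sketch):

```python
import sys

def in_virtualenv():
    """Return True when running inside a virtual environment: sys.prefix
    points at the environment while sys.base_prefix keeps the system
    installation; outside a venv the two are equal."""
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base

print(in_virtualenv())
```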

5. Working with PySpark

5.1. Configuration

First, we need to open the .bashrc file:

sudo gedit ~/.bashrc

And add the following lines:

export PYTHONPATH=/usr/lib/python3.5
export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
export PYSPARK_PYTHON=/usr/bin/python3.5

5.2. FindSpark library

If we have Apache Spark installed on the machine, we don't need to install the pyspark library into our development environment. Instead, we install the findspark library, which is responsible for locating the pyspark library that ships with Apache Spark.

pip3 install findspark

In each Python script file we must add the following lines:

import findspark
findspark.init()

5.3. PySpark example

5.3.1. Reading from HDFS

The following script reads a file stored in HDFS:

import findspark
findspark.init()

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName('example-pyspark-hdfs').getOrCreate()
# choose the reader (text, csv, parquet, ...) matching the file format;
# text is used here as an example
df_load = sparkSession.read.text('hdfs://localhost:9000/myfiles/myfilename')
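The hdfs:// location is an ordinary URI naming the NameNode host and port plus the file path; for illustration, it can be decomposed with the standard library:

```python
from urllib.parse import urlparse

# Decompose the HDFS URI used above into its components
uri = urlparse('hdfs://localhost:9000/myfiles/myfilename')
print(uri.hostname, uri.port, uri.path)  # -> localhost 9000 /myfiles/myfilename
```

The host and port must match the NameNode address configured in Hadoop's core-site.xml (9000 in this setup).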

5.3.2. Reading from Apache Kafka consumer

We first must add the spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar library to the Apache Spark jars directory /opt/spark/jars; it can be downloaded from the Maven central repository.

The following code reads messages from a Kafka topic and prints them line by line:

import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

KAFKA_TOPIC = 'KafkaTopicName'
ZOOKEEPER = 'localhost:2181'

sc = SparkContext('local[*]', 'test')
ssc = StreamingContext(sc, 60)  # 60-second batch interval
kafkaStream = KafkaUtils.createStream(ssc, ZOOKEEPER, 'spark-streaming', {KAFKA_TOPIC: 1})
lines = kafkaStream.map(lambda x: x[1])  # keep only the message payload
lines.pprint()
ssc.start()
ssc.awaitTermination()
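KafkaUtils.createStream delivers each record as a (key, message) pair, so taking x[1] keeps only the message payload. The transformation itself is plain Python and can be illustrated on a list (the sample records are made up):

```python
# Simulated (key, message) records as delivered by the Kafka receiver
records = [(None, 'first message'), (None, 'second message')]

# Same transformation the DStream applies: keep only the payload
lines = list(map(lambda x: x[1], records))
print(lines)  # -> ['first message', 'second message']
```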
