100 Scripts in 30 Days challenge: Scripts 26 and 27, Learning PySpark
In this series of scripts I will explore the basics of PySpark. PySpark provides Python API bindings for Apache Spark: it enables the full Python ecosystem on every node of the cluster (objects are shipped using Python's pickle serialization) and, more importantly, gives access to Python's rich machine-learning and data-processing libraries, such as Scikit-Learn and Pandas.
How PySpark Works
When we initialize a Spark program, the first thing it must do is create a SparkContext object, which tells Spark how to access the cluster. In PySpark, the Python program creates its own SparkContext, and Py4J is the gateway that binds it to the SparkContext running in the Spark JVM.
The JVM SparkContext serializes the application code and closures and sends them to the Spark cluster for execution. The cluster manager allocates resources, schedules the work, and ships the closures to the Spark workers in the cluster, which start Python processes as required. On each machine, the Spark worker is managed by an executor that controls computation, storage, and caching.
Spark applications run as independent sets of processes, coordinated by the SparkContext in a driver program. The SparkContext is allocated system resources (machines, memory, CPU) by the cluster manager and manages the executors running on the workers in the cluster. The driver program holds the Spark jobs that need to run; each job is split into tasks submitted to the executors for completion. The executor takes care of computation, storage, and caching on each machine.
The key building block in Spark is the RDD (Resilient Distributed Dataset). A dataset is a collection of elements. Distributed means the dataset can live on any node in the cluster. Resilient means the dataset can be lost, or partially lost, without major harm to the computation in progress: Spark re-computes it from the data lineage, also known as the DAG (short for Directed Acyclic Graph) of operations. Essentially, Spark snapshots the state of an RDD in the in-memory cache; if one of the machines crashes during a computation, Spark rebuilds the lost RDDs from the cached RDDs and the DAG of operations. This is how RDDs recover from node failure.
An RDD is immutable: once created, it cannot be changed, and each transformation creates a new RDD. Transformations are lazily evaluated; they execute only when an action occurs. In the case of failure, the data lineage of transformations is used to rebuild the RDD.
An action on an RDD triggers a Spark job and yields a value. An action operation causes Spark to execute the (lazy) transformation operations that are required to compute the RDD returned by the action.
Below is the set of code that I created in a Jupyter notebook; it covers the basics of PySpark and RDDs.
To install Jupyter, please check the documentation at jupyter.readthedocs.io. Jupyter installation requires Python 3.3 or greater, or Python 2.7.
To install PySpark, please check the documentation at spark.apache.org. The Spark Python API (PySpark) exposes the Spark programming model to Python.
Given that Jupyter notebooks and PySpark are installed, you can start PySpark inside a Jupyter notebook on Mac or Linux by running the command below:
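A hedged sketch of that launch command; the environment variables below are the standard Spark launcher settings that tell `pyspark` to use Jupyter as its driver:

```shell
# Tell the pyspark launcher to start the driver inside a Jupyter notebook.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark  # launches Spark and opens the notebook server in your browser
```

This is a launch/configuration snippet: any notebook you then create will have a ready-made `sc` SparkContext available.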
If all goes well, you should see the Jupyter notebook open in the browser.
Note: at the time of writing, PySpark works with Python 2.6 and above and has not been tested against Python 3.
The script below gives a basic example of a PySpark program where we read a text file and count the non-empty lines.
The next script gives a basic example of a PySpark program where we read a text file and count the number of words.