Tutorial on PySpark Transformations and Spark MLlib

Learn how to transform data and apply regression techniques using PySpark

Ratna Sabbineni
5 min read · Oct 24, 2018

Why PySpark?

Apache Spark is one of the most widely used frameworks for handling and working with Big Data, and Python is one of the most widely used programming languages for data analysis, machine learning, and much more. So, why not use them together? This is where Spark with Python, also known as PySpark, comes into the picture.

PySpark MLlib is a machine-learning library. It is a wrapper over PySpark Core for doing data analysis with machine-learning algorithms. It works on distributed systems and is scalable. We can find implementations of classification, clustering, linear regression, and other machine-learning algorithms in PySpark MLlib.

Initializing a Spark Session and importing necessary libraries

By default, Spark uses 200 shuffle partitions (the spark.sql.shuffle.partitions setting); this can be changed based on your requirements. We can even repartition the data based on specific columns.

  • Example: dataframe1 = dataframe.repartition(x), where x can be the number of partitions or the name of the column on which you want to partition the data (see the sketch below).
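A minimal sketch of this setup, assuming nothing beyond a plain PySpark install; the application name is a placeholder, and the repartition calls assume a DataFrame named dataframe with a Store column already exists.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession - the entry point for the DataFrame and MLlib APIs.
spark = SparkSession.builder.appName("pyspark-mllib-tutorial").getOrCreate()

# Repartition an existing DataFrame either by a fixed number of partitions
# or by a column, as described above.
# dataframe1 = dataframe.repartition(8)        # 8 partitions
# dataframe2 = dataframe.repartition("Store")  # partition by the Store column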

Dataframes in Spark

In Spark, a DataFrame is a distributed collection of rows under named columns. In simple terms, it can be thought of as a table in a relational database or an Excel sheet with column headers. It has the following characteristics:

  • Immutable in nature: We can create a DataFrame once but cannot change it; applying a transformation produces a new DataFrame.
  • Lazy evaluation: A task is not executed until an action is performed. Action commands in Spark include count(), collect(), aggregate(), reduce(), etc. (see the example after this list).
  • Distributed: DataFrames are distributed in nature.
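As a small illustration of lazy evaluation, assuming a hypothetical DataFrame df with a Sales column (not from the original post):

# filter() is a transformation - it only builds the execution plan.
high_sales = df.filter(df["Sales"] > 5000)

# count() is an action - only now does Spark actually run the job.
print(high_sales.count())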

Loading CSV data into a Spark DataFrame and displaying the top 5 records
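A sketch of the loading step; the file paths train.csv and store.csv are assumptions based on the store/sales data described below.

# Read the CSV files with a header row and let Spark infer the column types.
train_df = spark.read.csv("train.csv", header=True, inferSchema=True)
store_df = spark.read.csv("store.csv", header=True, inferSchema=True)

# Display the top 5 records.
train_df.show(5)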

Data Preparation:
Joining the store info to the sales (train) data and obtaining the schema of the dataset
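A sketch of that join, assuming the two DataFrames share a Store key column:

# Left-join the store information onto the sales (train) data.
data = train_df.join(store_df, on="Store", how="left")

# Obtain the schema of the combined dataset.
data.printSchema()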

Finding any missing values, i.e. nulls, present in the Spark DataFrame
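One common way to count nulls per column, sketched here against the joined data DataFrame:

from pyspark.sql.functions import col, count, when

# Build one aggregate per column that counts the rows where that column is null.
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).show()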

Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of machine learning and helps increase the accuracy of the model, so creating the right features is essential. I will be using a Random Forest Regressor model for this data, so I split the date into discrete components to let the decision trees make better guesses.

Extracting the month, year, day, and week from the date column
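A sketch of that decomposition, assuming the column is named Date:

from pyspark.sql.functions import col, to_date, month, year, dayofmonth, weekofyear

# Parse the Date column and derive discrete components from it.
data = (data
        .withColumn("Date", to_date(col("Date")))
        .withColumn("Month", month("Date"))
        .withColumn("Year", year("Date"))
        .withColumn("Day", dayofmonth("Date"))
        .withColumn("Week", weekofyear("Date")))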

Dealing with Categorical and Continuous Features

Defining variables with the categorical and continuous columns
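The exact lists depend on the dataset's schema; the column names below are illustrative Rossmann-style store/sales columns rather than taken from the original notebook.

# Columns treated as categorical (to be string-indexed) and as continuous.
categorical_cols = ["StoreType", "Assortment", "StateHoliday", "PromoInterval"]
continuous_cols = ["CompetitionDistance", "Promo", "SchoolHoliday",
                   "Month", "Year", "Day", "Week"]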

Spark ML Pipeline

A Spark Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. The Pipeline API, introduced in Spark 1.2, is a high-level API for MLlib. Inspired by the popular implementation in scikit-learn, the concept of Pipelines is to facilitate the creation, tuning, and inspection of practical ML workflows.

Applying StringIndexer to Categorical Data

StringIndexer is used to convert string columns into numeric ones. It encodes a string column of labels into a column of label indices. The indices are in [0, numLabels), ordered by label frequency, so the most frequent label gets index 0. StringIndexer functions somewhat like LabelEncoder from scikit-learn.

  • In the code below, indexers is a Pipeline with a series of StringIndexers applied to the columns defined as categorical variables.
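A sketch of that pipeline, reusing the illustrative categorical_cols list defined earlier:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One StringIndexer per categorical column, chained into a single Pipeline.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed")
            for c in categorical_cols]
indexer_pipeline = Pipeline(stages=indexers)
data = indexer_pipeline.fit(data).transform(data)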

Standard Scaler on Continuous Values

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored and applied to later data via the transform method. Standardization of a dataset is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like standard normally distributed data.

  • To use StandardScaler in Spark, the input data must be in the form of vectors, so let's apply a VectorAssembler before passing the data to the model.

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

Importing VectorAssembler and creating our Features

We must transform our data with VectorAssembler into a single column in which each row of the DataFrame contains a feature vector. To make predictions, we first select the columns from which the features column will be created.
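A sketch of assembling the feature vector and standardizing it, following the illustrative column lists used above:

from pyspark.ml.feature import VectorAssembler, StandardScaler

# Combine the indexed categorical columns and the continuous columns into one vector.
feature_cols = [c + "_indexed" for c in categorical_cols] + continuous_cols
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data)

# Scale each feature to unit standard deviation (withMean=True would also center,
# but it densifies sparse vectors).
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withStd=True, withMean=False)
data = scaler.fit(data).transform(data)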

Random Forest

Random Forest is an ensemble technique used for classification and regression. It operates by building a large number of decision trees at training time and outputting the average prediction (or majority class) of the individual trees, which reduces the risk of overfitting.

Training the model and predicting on test data
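A sketch of the training and prediction step, assuming Sales is the label column and an 80/20 random split:

from pyspark.ml.regression import RandomForestRegressor

# Split the prepared data into training and test sets.
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Train a Random Forest Regressor on the scaled feature vector.
rf = RandomForestRegressor(featuresCol="scaled_features", labelCol="Sales",
                           numTrees=50, maxDepth=10)
model = rf.fit(train)

# Predict on the held-out test data.
predictions = model.transform(test)
predictions.select("Sales", "prediction").show(5)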

Regression Evaluator

Available metrics:

  • Mean Squared Error
  • Root Mean Squared Error
  • Mean Absolute Error
  • Coefficient of Determination (R2)
  • Explained Variance

Computing the test error
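A sketch of computing a few of those metrics with RegressionEvaluator on the predictions DataFrame from the previous step:

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="Sales", predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)
mae = evaluator.setMetricName("mae").evaluate(predictions)
r2 = evaluator.setMetricName("r2").evaluate(predictions)
print("RMSE: %.3f  MAE: %.3f  R2: %.3f" % (rmse, mae, r2))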

Summary

This article gives a basic overview of PySpark and how it can be used for machine learning models. There is still scope for decreasing the error by tuning parameters. Fortunately, Spark MLlib contains a CrossValidator tool that makes tuning hyper-parameters a little less painful. The CrossValidator can be used with any algorithm supported by MLlib (a short sketch follows).
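A minimal sketch of how CrossValidator could be wired around the Random Forest above; the grid values are illustrative, not tuned.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Try a small grid of Random Forest hyper-parameters with 3-fold cross-validation.
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [20, 50])
              .addGrid(rf.maxDepth, [5, 10])
              .build())

cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)
best_predictions = cv_model.transform(test)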

The source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.
