Photo by NASA on Unsplash

Use PySpark for Your Next Big Problem

Harnessing the power of big data

Tom Allport
Published in The Startup · 6 min read · Jun 15, 2019


Big Data is becoming a buzzword in all areas of business, with its promise to answer more questions more completely and to give more confidence in the quality of the data. The question is: how can you harness this power for your next problem?

Spark is a platform for cluster computing. A cluster is a set of loosely or tightly connected computers that work together to perform the same task. Spark lets you spread information and computations over clusters with multiple nodes (think of each node as a distinct computer). Splitting up your data makes working with huge datasets easier since each node only works with a small amount of data.

Simple Cluster Computing Architecture

Each node works on its own subset of the total data and also performs part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.

However, as the computational power increases, so does the complexity of dealing with the data and creating meaningful solutions. Deciding whether or not Spark is the best solution for your problem is not always clear cut, but some useful questions to ask are:

  • Is my data too big to work with on a single machine?
  • Can my calculations be easily parallelised?

Connecting to a cluster is the first step in using Spark. Usually, the cluster will be hosted on a remote machine that's linked to all the other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called slaves. The master sends data and calculations to the slaves to run, and they send their results back to the master. When you're just getting started with Spark it's simpler to run a cluster locally; the principles remain the same though.

Spark’s core data structure is the Resilient Distributed Dataset (RDD). This is a low-level object that allows Spark to work its magic by splitting data across multiple nodes in the cluster. However, RDDs are difficult to work with directly, so for this article I will be working with its DataFrame abstraction which is built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Spark DataFrames are considerably easier to understand and work with than RDDs, and they are also better optimised for complicated operations.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimise the query, but the beauty of the Spark DataFrame implementation is that it has much of this optimisation built in.

You can always convert a Spark DataFrame to a pandas DataFrame using .toPandas(), and in the other direction the SparkSession's .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame. However, it's easy to work in PySpark without using pandas at all.
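As a quick sketch (assuming an active SparkSession named spark and a Spark DataFrame named data; note that .toPandas() pulls everything onto the driver machine, so it only makes sense for small results):

# Collect a Spark DataFrame to the driver as a pandas DataFrame
pandas_df = data.toPandas()
# Convert a pandas DataFrame back into a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)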

For this article I am going to be using the Default of Credit Card Clients Dataset provided by UCI Machine Learning. The dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

First, we need to set up a SparkSession in our environment.
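A minimal sketch of this setup, assuming we run locally on all available cores (the application name is arbitrary):

from pyspark.sql import SparkSession

# Build (or retrieve) a SparkSession running locally on all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("credit_card_default") \
    .getOrCreate()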

Now we import the data into our environment. You can read csv files and other formats using the SparkSession's .read interface.
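For example, assuming the UCI file has been downloaded locally as UCI_Credit_Card.csv (the path is illustrative):

# Read the csv file into a Spark DataFrame, inferring column types from the data
data = spark.read.csv("UCI_Credit_Card.csv", header=True, inferSchema=True)
# Check the schema and a few rows
data.printSchema()
data.show(5)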

A basic understanding of SQL is going to be helpful in managing Spark DataFrames as the operations are analogous.

# Add column to DataFrame
data = data.withColumn("colname",new_column)
# Filter data with a SQL string
filtered_data = data.filter("colname > value")
# Filter data with a boolean column
filtered_data2 = data.filter(data.colname > value)
# Select the first set of columns
selected1 = data.select("colname1", "colname2", "colname3")
# Group by column
grouped_data = data.groupBy("colname")
# Join the DataFrames
joined_data = data1.join(data2, on="common_colname", how="join_type")

Now we have the data, the next step is to build models to gain insight from it. The first thing to note here is that PySpark needs numeric data to perform machine learning. To get data into numeric form there is the .cast() method, which can be used with .withColumn() to create new columns of the desired datatype.
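For example, if a column had been read in as strings, it could be cast like this (the column names follow the UCI dataset and are only illustrative):

# Cast columns to numeric types using .withColumn() and .cast()
data = data.withColumn("LIMIT_BAL", data["LIMIT_BAL"].cast("double"))
data = data.withColumn("AGE", data["AGE"].cast("integer"))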

For this article I won’t be doing any data preprocessing so I will skip to the last step in the Pipeline — combining all of the columns containing our features into a single column. This has to be done before modelling can take place because every Spark modelling routine expects the data to be in this form. You can do this by storing each of the values from a column as an entry in a vector. Then, from the model’s point of view, every observation is a vector that contains all of the information about it and a label that tells the modeller what value that observation corresponds to.

The pyspark.ml.feature submodule contains a class called VectorAssembler. This Transformer takes all of the columns you specify and combines them into a new vector column.
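A minimal sketch, using a handful of the UCI columns as features (the column choice and the output column name are assumptions):

from pyspark.ml.feature import VectorAssembler

# Combine the chosen feature columns into a single vector column
vec_assembler = VectorAssembler(
    inputCols=["LIMIT_BAL", "AGE", "BILL_AMT1", "PAY_AMT1"],
    outputCol="features")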

Pipeline is a class in the pyspark.ml module that combines all the Estimators and Transformers that you’ve already created. This lets you reuse the same modelling process over and over again by wrapping it up in one simple object.

Depending on the datatypes, different methods will need to be included in the pipeline. PySpark has functions for handling strings built into the pyspark.ml.feature submodule. You can create 'one-hot vectors' to represent unordered categorical data. A one-hot vector is a way of representing a categorical feature where every observation has a vector in which all elements are 0 except for at most one element, which has a value of 1. Each element in the vector corresponds to a level of the feature, so it's possible to tell what the right level is by seeing which element of the vector is equal to 1.

The first step to encoding your categorical feature is to create a StringIndexer. Members of this class are Estimators that take a DataFrame with a column of strings and map each unique string to a number. Then, the Estimator returns a Transformer that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.

The second step is to encode this numeric column as a one-hot vector using a OneHotEncoder. This works exactly the same way as the StringIndexer by creating an Estimator and then a Transformer. The end result is a column that encodes your categorical feature as a vector that’s suitable for machine learning routines!

# Import the string-handling stages
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Create a StringIndexer
string_indexer = StringIndexer(inputCol="inputCol", outputCol="outputCol")
# Create a OneHotEncoder
one_encoder = OneHotEncoder(inputCol="inputCol", outputCol="outputCol")
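These stages can then be wrapped up with the VectorAssembler in a Pipeline, fit to the data, and used to transform it into model-ready form (a sketch; the stage objects and their placeholder column names are the illustrative ones from above):

from pyspark.ml import Pipeline

# Chain the indexing, encoding and assembling steps into one reusable object
pipeline = Pipeline(stages=[string_indexer, one_encoder, vec_assembler])
# Fit the pipeline to the data and transform it into model-ready form
piped_data = pipeline.fit(data).transform(data)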

Now split the data into train and test sets using .randomSplit(), then create an instance of the ML model of your choice.
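A sketch using logistic regression as the classifier (the 70/30 split, the seed, and the label column name are assumptions; the UCI target column would need to be renamed or referenced accordingly):

from pyspark.ml.classification import LogisticRegression

# Split the assembled data into training and test sets
train_data, test_data = piped_data.randomSplit([0.7, 0.3], seed=42)
# Create the model, pointing it at the feature vector and label columns
lr = LogisticRegression(featuresCol="features", labelCol="label")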

Now we fit the model on the training set and then test it on the test set.
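A sketch of this step, evaluating with area under the ROC curve (the evaluator settings are assumptions):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Fit the model on the training data
lr_model = lr.fit(train_data)
# Generate predictions on the held-out test set
predictions = lr_model.transform(test_data)
# Score the predictions using area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
print(evaluator.evaluate(predictions))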

And that's it: you've used PySpark to classify whether someone is likely to default on their credit card payment, using a framework that lets you scale this project up to much larger datasets, which might provide more accurate results.

Conclusion

PySpark is a great tool for performing machine learning on large datasets that are too big to work with on a single machine and whose calculations can be easily parallelised. The workflow and code are very similar to those of any other Python library and allow you to harness the power of big data and cluster computing.

Key vocab

PySpark — the Python API for Apache Spark

Cluster — a set of loosely or tightly connected computers that work together to perform the same task

Node — an individual computer in a cluster that stores part of the data and performs part of the computation

Parallel computation — a type of computation in which many calculations or the execution of processes are carried out simultaneously.

RDD — Resilient Distributed Dataset, Spark's core low-level data structure
