Spark SQL Use Case — the Python Version
Over at acadgild.com, Kiran Krishna provided a nice example of how to use Spark SQL in his article: Spark SQL Use Case 911 Emergency Helpline Number Data Analysis. But there was one problem. I hadn’t used Scala since grad school and then I only used it for one class. And we didn’t even write whole programs, we only used it to show functional programming concepts.
For the past several months I have been using Python for a lot of different tasks. Writing small scripts to query databases and saving the results in an Excel file. Using Python for web programming using Flask and of course using pandas for general data crunching.
But then I find out about Apache Spark and want to learn its API for distributed data processing. I’ve read through most of the documentation but wanted something a bit “meatier” to get into. As a firefighter-paramedic that moonlights as a programmer and want-to-be quant, you can imagine my pleasure at finding Mr. Krishna’s article. But then there was the whole Scala issue.
So what better way to learn a new framework and its API than to work through an example and convert it from one language to another.
Mr. Krishna’s article does a great job of describing the data and what he wants to accomplish with it, so I won’t bother rehashing what he has already done. Is article also has links to down load the data.
Here is my Python version:
In his first code block, Mr. Krishna loads the data from a CSV file, removes the 1st line header info and creates a schema that describes the data.
I am going to do things a bit different. I am going to keep the headers and let Spark infer the schema. Before I do that I need to create a Spark SQL context. If using pyspark (or linking ipython to pyspark) it automatically gives you a Spark context,
sc. From there you can create a Spark SQL context using this line:
# create a spark sql context
sql_sc = SQLContext(sc)
Now load the the 911 data from a the CSV file. (You will have to download the data file which a link is provided in the acadgild article.)
e911_df = sql_sc.read.load('/path/to/file/911.csv',
Create an in memory temporary table (Spark documentation for createOrReplaceTempView)
Now do the same thing for the zipcode CSV file.
zipcode_df = sql_sc.read.load('/path/to/file/zipcode.csv',
For readability I like to create my query string separately and then pass the string to Spark SQL. The following query joins the two data sets on the zip codes contained in each set.
query = (
"select e.title, z.city, z.state from emergency_911 e "
"join zipcode_table z "
"on e.zip = z.zip "
build1 = sql_sc.sql(query)
The first problem statement is: what types of problems are prevalent and in which state?
Here is how to solve that in Python.
# split the title on spaces and
# grab the first element.
# [:-1] removes the colon (:) from the end of the first element
ps1 = ps.map(lambda x: (x,1)).reduceByKey(lambda a,b: a+b)
To view the result in pyspark (or ipython) type:
The 2nd problem statement is: What kind of problems are prevalent and in which city?
This can be achieved as follows:
ps2 = build1.rdd.map(lambda x: x.title.split(" ") + "-->"
ps3 = ps2.map(lambda x: (x,1))
.reduceByKey(lambda a,b: a+b)
.map(lambda: (x, x))
# combine and show the results
By going working through an example and converting it to a different code allowed me to get a little more comfortable with the Spark framework and also learn and understand functional programming a little better.
When I encountered an error or didn’t get the same results as Mr. Krishna it forced me to go searching for the solution. Not only did I learn how to get the solution but also gained a better understanding of a statement works the way it does.
Working through this by no means made me an expert in Spark but it sure gave me a better feel for the framework.
Maybe someone can convert his example using Java or R.