Hacking with Apache Spark

Happy data science

(λx.x)eranga
Effectz.AI
May 6, 2019


About Spark

Apache Spark is a distributed cluster computing framework which enables parallel processing of large amounts of data. Spark can run on top of the Hadoop Distributed File System (HDFS). Instead of using the Hadoop MapReduce framework, Spark uses its own parallel data processing framework, which is built around the Resilient Distributed Dataset (RDD).

The RDD is the fundamental data structure of Apache Spark. It is an immutable collection of objects which is computed on the different nodes of the cluster. The data in an RDD are partitioned and distributed among the different nodes in the cluster, which means the data are processed on different nodes. Unlike with Hadoop MapReduce, the data in an RDD are persisted and processed in memory (data can be persisted on disk as well if needed). For these reasons Apache Spark can be much faster and more flexible than Hadoop MapReduce. The Spark framework comes with different components:

  1. Spark Core (the core engine for distributed data processing)
  2. Spark Streaming (used to process real-time data)
  3. Spark SQL (supports relational data processing)
  4. GraphX (Spark API for graphs and graph-parallel computation)
  5. MLlib (used for machine learning)
  6. SparkR (R package that provides a distributed data frame implementation)

Spark can be integrated with various third-party systems as well. Kafka, HDFS, Redis, Solr, Cassandra and Elasticsearch are some examples of such third-party systems.

Spark architecture

A Spark cluster follows a master-slave architecture. It comes with a Master (Driver), Workers (Executors) and a Cluster Manager. There is a single master in the Spark cluster. The master node runs the driver program, which drives the Spark application/Spark job. A Spark job is split into multiple tasks by the master node (each task works on a partition of the RDD) and distributed over the worker nodes.

Worker/Executor nodes are the slave nodes whose job is to execute the tasks assigned by the master node. These tasks are executed on the partitioned RDD. The executor stores the computed results in memory, in cache or on disk. After executing a task, the worker node returns the result to the master node (the master node aggregates the results from all worker nodes).

The Cluster Manager does all the resource allocation work. It allocates resources to the worker nodes based on the tasks created by the master, and then distributes the tasks to the worker nodes. Once a task finishes, it takes the result back from the worker node to the master node. Spark can work with various cluster managers, such as the Standalone Cluster Manager, Yet Another Resource Negotiator (YARN), Mesos and Kubernetes.

Spark with Scala

Spark programs/jobs can be written in different programming languages such as Scala, Python, Java and R. Scala functional programming is the standard way to write Spark programs. In this article I’m gonna show various types of analytics that can be performed on a large data set with Spark. The data set contains New York State baby names, which are aggregated and displayed by year, name, county, sex and count. I downloaded the data set from here as a CSV file. I’m using the Scala functional programming language to do the analytics on this data set. All the source code and the data set related to this post are published in this GitLab repository. Please clone the repository and follow along with the post.

Sbt dependency

I’m using IntelliJ IDEA as my IDE to work with Scala Spark applications. First I need to create an sbt project and add the Spark dependencies to the build.sbt file. Following is the build.sbt file.
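
A minimal build.sbt could look like the sketch below; the Scala and Spark versions shown here are assumptions rather than necessarily the exact ones used in the repository.

```scala
name := "spark-baby-names"
version := "0.1"
scalaVersion := "2.11.12"

// spark-core is enough for the RDD examples in this post; spark-sql is optional
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-sql"  % "2.4.0"
)
```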

Spark context

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (the driver program). In order to work with a Spark application I need to obtain a SparkContext first.
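
A minimal sketch of obtaining the SparkContext; the application name and the local master URL below are placeholders, not necessarily what the repository uses.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs Spark locally with one worker thread per CPU core;
// point setMaster at a cluster manager URL when running on a real cluster
val conf = new SparkConf()
  .setAppName("baby-names")
  .setMaster("local[*]")

val sc = new SparkContext(conf)
```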

Load data from CSV file

The data set I have downloaded contains CSV records. Next I load the data in the CSV file into Row case class objects and cache them in memory. The main reason to cache the records in memory is to avoid redoing the loading every time (note that caching requires more memory).
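
A rough sketch of this loading step, with a Row case class mirroring the CSV columns; the file path and the header check are assumptions for illustration.

```scala
// one Row object per CSV record: Year, First Name, County, Sex, Count
case class Row(year: Int, name: String, county: String, sex: String, count: Int)

// skip the header line, split each record on commas and cache the resulting RDD
// so the file is not re-read and re-parsed for every action that follows
val rows = sc.textFile("src/main/resources/rows.csv") // hypothetical path
  .filter(line => !line.startsWith("Year"))
  .map(_.split(","))
  .map(f => Row(f(0).toInt, f(1), f(2), f(3), f(4).toInt))
  .cache()
```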

Spark analytics

Now the data set is loaded into memory and the environment is ready to do analytics on the data set. Following are the various analytics I can do with the baby names.

1. Unique names

Extract the unique names in the CSV rows with the distinct() function. Then sort them by the name field.
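
A possible version of this step, assuming the rows RDD built above.

```scala
// distinct names, sorted alphabetically; take(10) limits the printed output
val uniqueNames = rows.map(_.name)
  .distinct()
  .sortBy(name => name)

uniqueNames.take(10).foreach(println)
```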

Following is the output of this program. It prints only the first 10 lines since I have included take(10).

2. Occurrences of Luke name

Extract the number of rows (occurrences) which contain the name Luke. Use the filter() function to get the names which contain Luke.
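
A sketch of the filter; matching on names that contain "Luke" is my assumption about how the condition is written.

```scala
// count the rows whose name field contains "Luke"
val lukeOccurrences = rows.filter(_.name.contains("Luke")).count()

println(lukeOccurrences)
```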

Following is the output of this program. There are 262 occurrences of the name Luke in the data set.

3. Occurrences of all names

Get the total occurrences of all the different names. First create a mapping of name -> 1, then sum up with the reduceByKey() function. Finally sort the result by the name field.
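
A possible version of this word-count style aggregation, again on the rows RDD.

```scala
// (name, 1) pairs summed per name, then sorted alphabetically by name
val nameOccurrences = rows.map(r => (r.name, 1))
  .reduceByKey(_ + _)
  .sortBy(_._1)

nameOccurrences.take(10).foreach(println)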

Following is the output of this program. It prints only the first 10 results since I have used take(10).

4. Names which occur more than 400 times

First get the total occurrences of the different names. Then filter the names which occur more than 400 times.
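
A sketch of this, reusing the same (name, 1) aggregation as before.

```scala
// keep only the names whose number of rows exceeds 400
val frequentNames = rows.map(r => (r.name, 1))
  .reduceByKey(_ + _)
  .filter { case (_, occurrences) => occurrences > 400 }

frequentNames.collect().foreach(println)
```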

Following is the output of this program. There are 5 names in the data set which occur more than 400 times.

5. Top 10 occurring names

First get the total occurrences of the different names. Then sort the result by occurrence count (descending order) and take the first 10 with take(10).
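
A possible version, sorting the aggregated pairs by occurrence count.

```scala
// sort the (name, occurrences) pairs by occurrences, largest first
val topNames = rows.map(r => (r.name, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

topNames.take(10).foreach(println)
```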

Following is the output of this program. The output is sorted by occurrence count.

6. Total count of Luke name

The CSV rows contain a count field. First filter the names which contain Luke, then sum up the counts with reduceByKey().
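
A sketch, again assuming a contains-based match on the name field.

```scala
// sum the count field for every distinct name that contains "Luke"
val lukeCounts = rows.filter(_.name.contains("Luke"))
  .map(r => (r.name, r.count))
  .reduceByKey(_ + _)

lukeCounts.collect().foreach(println)
```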

Following is the output of this program. There are 13 different Luke names in the data set.

7. Top 10 Luke name counts

First filter the names which contain Luke. Then sort the result by the count field (descending order) and take the first 10 records.
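
One way this could be written, sorting the filtered rows directly by their count field.

```scala
// filtered rows ordered by their count field, largest first
val topLukeRows = rows.filter(_.name.contains("Luke"))
  .sortBy(_.count, ascending = false)

topLukeRows.take(10).foreach(println)
```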

Following is the output of this program. The output is sorted in descending order of the count field. It prints only the first 10 results since I have used take(10).

8. Top 10 baby born years

First create a mapping of year -> count, then sum up with reduceByKey(). Finally sort the result by count (descending order) and take the first 10 records.
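
A possible version of the per-year aggregation.

```scala
// total babies per year, ordered by the summed count, largest first
val topYears = rows.map(r => (r.year, r.count))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

topYears.take(10).foreach(println)
```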

Following is the output of this program. The output is sorted in descending order of the count field.

9. Top 10 male baby born years

The CSV rows contain a sex field. First filter the rows which contain M (male) as the sex, then map to year -> count and sum up with reduceByKey(). Finally sort the result by count (descending order) and take the first 10 records.
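
A sketch that restricts the same aggregation to male rows.

```scala
// male rows only, summed per year and sorted by the total count
val topMaleYears = rows.filter(_.sex == "M")
  .map(r => (r.year, r.count))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

topMaleYears.take(10).foreach(println)
```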

Following is the output of this program. The output is sorted in descending order of the count field.

10. Male count in each year

First filter the rows with sex M, then map to year -> count and sum up with reduceByKey(). Finally sort the result by count (descending order).
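
A possible version, identical to the previous one except that all results are printed rather than only the first 10.

```scala
// male baby count per year, sorted by count in descending order
val maleCountByYear = rows.filter(_.sex == "M")
  .map(r => (r.year, r.count))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

maleCountByYear.collect().foreach(println)
```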

Following is the output of this program. It prints the year and the number of male babies born in each year.

