Big Data Analytics with Spark in Azure HDInsight

‘Big data’ usually refers both to the complexity of dealing with large datasets and to the value of extracting insights from them. It is usually infeasible to perform the computations needed to draw these insights on a single machine. For example, the ClueWeb09 dataset is around 6TB compressed, so if you wanted to run the PageRank algorithm on it, chances are you would need to distribute the computation across several machines. The main reason for scaling ‘out’ to several machines instead of scaling ‘up’ to a beefier machine is lower cost. However, parallelization has its own challenges: handling the communication between workers (e.g., to exchange state) and synchronizing access to shared resources (e.g., data). Parallelization is a hard problem and can easily become a headache for developers. Infrastructure tools such as Spark provide a reliable framework built around the MapReduce programming model, which lets developers focus on the logic of the program instead of lower-level issues such as threading and locking. These tools handle complicated problems such as task scheduling, synchronization and fault tolerance in the case of machine failure. In this article, I will introduce the MapReduce programming model and point to a tutorial where you will deploy a Spark cluster on Azure HDInsight, interactively explore data with SparkSQL and train a machine learning model with SparkML using Jupyter notebooks. Although it is not necessary to know the MapReduce model to work with SparkSQL or SparkML, it’s still helpful, since these libraries are built on top of the core Spark engine, which uses the MapReduce model.

The stack of libraries in Spark built on top of the core engine.

It’s surprising to see how many problems in big data boil down to simple counting and taking averages. A typical problem has the following structure:
1) Iterate over a large number of records in a dataset
2) Extract something of interest from each record — MAP
3) Shuffle (and sort) intermediate results
4) Aggregate intermediate results — REDUCE
5) Generate final output
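
To make these steps concrete, here is a minimal pure-Python sketch (not Spark code) that walks through the same pipeline to compute the average temperature per city for a handful of made-up records; the record format and function names are purely illustrative.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (city, temperature) readings.
records = [("sfo", 20), ("nyc", 10), ("sfo", 24), ("nyc", 14), ("nyc", 12)]

def map_phase(records):
    # Steps 1-2: iterate over the records and emit a (key, value) pair for each one.
    for city, temp in records:
        yield city, temp

def shuffle_phase(pairs):
    # Step 3: sort and group the intermediate (key, value) pairs by key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Steps 4-5: aggregate the values for each key and generate the final output.
    for key, values in grouped:
        yield key, sum(values) / len(values)

print(dict(reduce_phase(shuffle_phase(map_phase(records)))))
# {'nyc': 12.0, 'sfo': 22.0}
```

In a real framework the shuffle phase is handled for you: intermediate pairs are partitioned across machines by key rather than grouped in memory on one machine.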

A MapReduce program consists of two methods: map() and reduce(). The map() method operates on a single record in the dataset and emits a (key, value) pair for each record. After the map phase, the framework partitions the intermediate results so that all values with the same key are sent to the same machine for the reduce phase. The reduce() method is called once for each key and is given an iterator over all the values associated with that key. In SparkSQL, if you issue a query to count the frequency of words in a document, the query is executed as a MapReduce program that maps over each word in the text and emits the word as a key with a count of 1. Then, in the reduce phase, the program sums up the counts for each key (in this case, the word) and returns the frequency of that word in the document.

The figure below illustrates the workflow of such a program in the Hadoop framework; the execution in Spark is similar, except that Spark does not automatically sort the values in the shuffle phase, although users can sort the data through a function call. As shown in the figure, the words (‘a’, ‘b’, ‘c’) in the text are distributed to different machines, where they are mapped to their “local” frequencies. After the partition phase, the values for each unique key are accumulated together. Although it is not shown in the figure, the frequency of each word can then be obtained by summing over its values: for the word ‘a’, we iterate over the values [1, 5] and sum them to get 6, which is the frequency of ‘a’ in the text.

MapReduce workflow for a word count program in the Hadoop framework.
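
As a concrete sketch of how this word count looks in Spark’s RDD API: the snippet below assumes a SparkSession already bound to the name spark, as it is in an HDInsight Jupyter notebook, and uses a placeholder input path.

```python
# A minimal PySpark word count; the input path is only a placeholder.
text = spark.sparkContext.textFile("wasb:///example/data/sample.txt")

counts = (text.flatMap(lambda line: line.split())  # split each line into words
              .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))    # reduce: sum the counts per word
# .sortByKey() could be chained above if sorted output is needed

counts.take(10)
```

Here reduceByKey plays the role of the reduce phase, summing the counts for each word, and the optional sortByKey call is the explicit sort mentioned above.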

Below you will find the links to a tutorial where you get to explore and analyze a Food Inspection dataset using Spark SQL and then train a machine learning model to predict whether restaurants will be successful or not. You will work on a Spark cluster that you deploy on Azure HDInsight, which should give you a real feel for working on big data problems in the ‘cloud’.

GitHub Tutorial: Data Analytics with Spark

GitHub Tutorial: Jupyter Notebooks
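
To give a rough flavor of what the tutorial covers, here is a hedged sketch of both pieces: exploring the data interactively with Spark SQL and training a small SparkML pipeline. The file path, the column names (results, violations) and the label definition are illustrative assumptions rather than the tutorial’s exact schema.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import Tokenizer, HashingTF

# Illustrative only: the path and column names are assumptions, not the tutorial's exact schema.
df = spark.read.csv("wasb:///example/data/food_inspections.csv",
                    header=True, inferSchema=True)
df.createOrReplaceTempView("inspections")

# Interactive exploration with Spark SQL.
spark.sql("SELECT results, COUNT(*) AS n FROM inspections GROUP BY results").show()

# Assume a binary label: 1.0 if the inspection result is 'Pass', 0.0 otherwise.
labeled = spark.sql("""
    SELECT CASE WHEN results = 'Pass' THEN 1.0 ELSE 0.0 END AS label, violations
    FROM inspections WHERE violations IS NOT NULL
""")

# A small SparkML pipeline: tokenize the free-text violations, hash the tokens
# into feature vectors, and fit a logistic regression on the labeled data.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="violations", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10, regParam=0.01),
])
model = pipeline.fit(labeled)
```

The Tokenizer and HashingTF stages turn the free-text violations into feature vectors, a common way to feed text into a SparkML classifier; the tutorial walks through the full workflow on the cluster itself.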

Acknowledgements: Some of the material presented in this article was taken from the slides of the Big Data course taught by Jimmy Lin. I would also like to thank Jeff Prosise for contributing the GitHub tutorial that I linked to.