Processing the BIGG data

Sai Beathanabhotla
Published in Holler Developers
9 min read · Sep 13, 2021

Data wrangling is the process of transforming and mapping data from one “raw” form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure both the quality and the usefulness of the data. Data wrangling typically follows a set of general steps: extracting the data in raw form from the data source, “munging” the raw data (e.g. sorting) or parsing it into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use. Data wrangling is one of the most crucial steps of any data science or machine learning project, and it is usually the most time-consuming one as well.

One of the downstream purposes of data wrangling, as mentioned above, is data analysis.

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. In today’s business world, data analysis (descriptive analytics, exploratory data analysis, predictive analytics) plays a role in making decisions more scientifically and helping businesses operate more effectively.

In this article, we focus on processing big data. We will first take a look at what Pandas offers for data processing and its limitations, and then dive into Apache Spark for processing big data.

Pandas is one of the most famous data science tools and it’s definitely a game-changer for wrangling and analyzing data. Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool.

Let us now look at a few examples of pandas.

Reading the data from a CSV file

We can either read from a CSV file or create a DataFrame directly. For simplicity, we will create a DataFrame manually and work with a small dataset in this article (the code to read from a CSV file is included as well).

import pandas as pd

# Reading the data from a CSV file (path_to_file is a placeholder for your file path)
data = pd.read_csv(path_to_file)
data.head()

# Creating a DataFrame manually
d = {'id': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
     'words': ['hello', 'how are you', 'pizza', 'thank you', 'lol', "what's up", 'beer', 'i love you', 'good morning', 'rent'],
     'value': [100, 300, 50, 10, 100, 344, 588, 22, 211, 323]}
data = pd.DataFrame(data=d)

Grouping by a certain column to generate counts and aggregations

data_grouped = data.groupby('id').count().reset_index()
Group by operation on the DataFrame
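
The count() above only counts rows per group. Pandas can also compute several aggregations in a single pass with agg; a minimal sketch on the same DataFrame (the output column names here are just illustrative):

# Multiple aggregations per 'id' in one pass
data_agg = data.groupby('id').agg(
    total_value=('value', 'sum'),     # sum of 'value' per group
    average_value=('value', 'mean'),  # mean of 'value' per group
    num_messages=('words', 'count')   # number of rows per group
).reset_index()
data_agg.head()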

Applying Lambda functions on the data frame

Applying a small lambda function that returns the number of words in a string by splitting it on whitespace.

def number_of_words(text):
    return len(text.split(" "))

data['total_words'] = data['words'].apply(lambda x: number_of_words(x))
data.head()
Created ‘total_words’ column using lambda function

We apply all these operators/functions as part of our day-to-day analysis of the data. Let’s say we have a dataset of 100K records. We will not have any issues running all the above code on it. Now, let’s say we have a dataset of 1 million records. We might still not have any issues, but it might take some time to finish the task. Let’s say we now have a dataset of 1 billion records. We would run into memory issues, because Pandas loads the entire dataset into the memory of a single machine and is therefore limited by the amount of RAM available.
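
To get a feel for why this happens, you can check how much RAM a pandas DataFrame already occupies; a minimal sketch on the small DataFrame above:

# Approximate memory footprint of the DataFrame in bytes;
# with a billion rows this quickly exceeds the RAM of a single machine.
print(data.memory_usage(deep=True).sum())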

WE WOULD NEED TO SCALE THE RESOURCES VERTICALLY TO PERFORM THE TASK, WHICH IS NEITHER ADVISABLE NOR EFFICIENT.

Here comes Apache Spark!!

Apache Spark is a distributed data processing platform specialized for big data applications, and it has become the de facto standard for processing big data. Because it works in a distributed, in-memory fashion, it is fast by design.

Let’s take a look at the features of Spark.

  • Swift Processing: Apache Spark achieves high data processing speeds, reported to be up to ~100x faster in memory and ~10x faster on disk than traditional MapReduce. This is made possible by reducing the number of read-write operations to disk.
  • Dynamic in Nature: Spark provides a rich set of high-level operators, which makes it easy to build parallel applications.
  • In-Memory Computation in Spark: With in-memory processing, the processing speed increases because data is cached and does not need to be fetched from disk every time. Spark’s DAG execution engine facilitates in-memory computation and acyclic data flow, resulting in high speed.
  • Reusability: Spark code can be reused for batch processing, for joining streams against historical data, or for running ad-hoc queries on stream state.
  • Lazy Evaluation in Apache Spark: Transformations are not executed until an action is triggered, which lets Spark optimize the overall execution plan (more on this below).
  • Fault Tolerance in Spark: Apache Spark provides fault tolerance through its RDD abstraction. Spark RDDs are designed to handle the failure of any worker node in the cluster, ensuring that data loss is reduced to zero.

Execution Model

In contrast to Pandas, Spark uses a lazy execution model. So when you apply some transformation to a DataFrame, the data is not processed immediately. Transformations are lazy in nature and execution will not start until an action is triggered. Let us take a look at how Spark stages tasks internally:

  1. A logical execution plan is created from the transformations which essentially is a graph (a DAG) where nodes represent operations (like reading data or applying a transformation).
  2. Now an analyzed execution plan is derived where all column names and external references to data sources are checked and resolved.
  3. Spark iteratively applies the available optimization strategies to transform the analyzed execution plan into an optimized execution plan. For example, filter operations are pushed as close as possible to the data sources in order to reduce the number of records as early as possible.
  4. Finally, the result of the previous step is transformed into a physical execution plan where transformations are pipelined together into so-called stages.
  5. The physical execution plan then is split up along the data into so-called tasks, which then can be executed in parallel and distributed by the machines in the cluster.

It is important to understand that, by lazily following execution plans, the total working data set never completely materializes in RAM at any point in time. Instead, all the data (including input data and intermediate results) is split up into small chunks that are processed independently; even the results are eventually stored in small chunks, and those never need to fit into RAM at once.
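
You can inspect these plans yourself with explain(). A minimal sketch, assuming a SparkSession named spark is available (as in the PySpark shell or a notebook); the DataFrame here is hypothetical:

# Build a tiny DataFrame and apply a transformation; nothing executes yet
df = spark.range(1000).withColumnRenamed('id', 'value')
df_filtered = df.filter(df.value > 100)
# Print the parsed, analyzed, optimized and physical plans described above
df_filtered.explain(extended=True)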

Spark Architecture Overview

The Apache Spark framework uses a master-slave architecture that consists of a driver, which runs as the master node, and many executors that run across the worker nodes of the cluster. Spark can be used for both batch processing and real-time processing.

High-level Spark Architecture

The driver is responsible for analyzing, distributing, monitoring, scheduling, and maintaining all the necessary information during the lifetime of the Spark application. Executors are only responsible for executing the part of the code assigned to them by the driver and reporting the status back to the driver. Each Spark application has its own driver and its own exclusive executors.

Deep Dive into Spark Architecture

The driver program, in the master node, drives our application. The code you write behaves as the driver program; if you are using the interactive shell, the shell acts as the driver program.

Inside the driver program, the first thing you do is create a SparkContext. SparkContext is the entry point of Spark functionality, and generating it is the most important step of any Spark driver application. It is similar to a database connection: any command you execute in your database goes through the database connection, and likewise, anything you do on Spark goes through the Spark context.
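
In notebooks and the PySpark shell a session is already provided as spark, and the SparkContext is available as spark.sparkContext. A minimal sketch of creating one yourself (the application name and master URL below are just placeholders):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the SparkContext sits underneath it
spark = (
    SparkSession.builder
    .appName("big-data-processing")  # placeholder application name
    .master("local[*]")              # placeholder: run locally on all cores; point this at your cluster manager in production
    .getOrCreate()
)
sc = spark.sparkContext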

Now, this Spark context works with the cluster manager to manage various jobs. The driver program and Spark context take care of the job execution within the cluster. A job is split into multiple tasks which are distributed over the worker nodes.

Worker nodes are the slave nodes whose job is to execute the tasks. These tasks are executed on the partitioned data in the worker nodes, and the results are returned to the Spark context.

To put it all together: the SparkContext takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs/DataFrames, perform operations, collect the results, and return them to the main Spark context.

Now that we have learned about Spark and how it works, let us take a look at some examples of data aggregations and manipulations in PySpark.

Reading a CSV file into a Spark data frame

We can create a Spark DataFrame either by reading from a CSV file or by building it manually. Let’s convert the Pandas DataFrame we created above into a Spark DataFrame.

# importing required libraries
import pyspark.sql.functions as pf
from pyspark.sql.types import *
from pyspark.sql import Window
from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

# Reading from a CSV file (path_to_file is a placeholder; `spark` is the SparkSession)
data_spark_csv = spark.read.csv(path_to_file, header=True)
display(data_spark_csv)

# Converting the Pandas DataFrame to a Spark DataFrame
data_spark = spark.createDataFrame(data)
Spark DataFrame created by converting a pandas DataFrame

Any time a DataFrame is created in the Spark context, it is distributed across the worker nodes and can be cached there. Any operations/transformations we perform on the DataFrame are also distributed across the worker nodes.
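
A minimal sketch of inspecting and controlling that distribution (the partition count of 8 is chosen purely for illustration):

# Number of partitions the DataFrame is split into across the cluster
print(data_spark.rdd.getNumPartitions())
# Repartition and cache so that repeated operations reuse the in-memory copy
data_spark = data_spark.repartition(8).cache()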

Grouping by a column to perform aggregations

data_grouped = data_spark.groupby("id").count()
display(data_grouped)

# Sum of 'value' per id
data_group2 = data_spark.groupby('id').agg(pf.sum('value'))
display(data_group2)

# Collect the distinct 'words' of each group into a list
data_group3 = data_spark.groupby('id').agg(pf.collect_set('words').alias('words_list'))
display(data_group3)
collect_set function to collect all the ‘words’ into a list in a group
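
Several aggregations can also be combined in a single agg call; a minimal sketch on the same DataFrame (the output column aliases are just illustrative):

# Count rows, sum 'value' and collect the distinct 'words' per id in one pass
data_group4 = data_spark.groupby('id').agg(
    pf.count('*').alias('count'),
    pf.sum('value').alias('total_value'),
    pf.collect_set('words').alias('words_list')
)
display(data_group4)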

User-Defined Functions

Spark also provides support for User-Defined Functions that can be used in data manipulation and data analysis. We can write UDFs that operate on each row and generate a single value for it.

def length_of_words(term):
    return len(term.split(" "))

# Register the Python function as a UDF that returns an integer
length_of_word_udf = udf(lambda x: length_of_words(x), IntegerType())
data_spark = data_spark.withColumn('length_of_word', length_of_word_udf(data_spark.words))
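
For a simple operation like this, the same result can usually be expressed with Spark’s built-in functions, which avoids the serialization overhead of a Python UDF; a minimal sketch:

# Built-in equivalent: split 'words' on whitespace and take the size of the resulting array
data_spark = data_spark.withColumn('length_of_word', pf.size(pf.split(pf.col('words'), ' ')))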

Window Functions

Spark also provides features for Window Functions which can operate on a group of rows (like frame, partition) and return a single value for every input row.

w = Window.partitionBy('id').orderBy('value')
data_spark_words_list = data_spark.withColumn('words_list', pf.collect_list('words').over(w)).withColumn('value_list', pf.collect_list('value').over(w))
display(data_spark_words_list)
Window function to collect all words for each window so far
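
The same window specification can be reused for numeric aggregates, for example a running sum of ‘value’ within each id; a minimal sketch:

# Running sum of 'value' per id, ordered by 'value'
data_spark_running = data_spark.withColumn('running_value', pf.sum('value').over(w))
display(data_spark_running)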

Grouped Map Pandas UDFs

Grouped map Pandas UDFs are designed for split-apply-combine patterns in data analysis, and they operate on all the data for a group, e.g., “for each ID, apply this operation”. A grouped map Pandas UDF first splits a Spark DataFrame into groups based on the conditions specified in the groupby operator, applies a user-defined function (pandas.DataFrame -> pandas.DataFrame) to each group, then combines and returns the results as a new Spark DataFrame. Pandas UDFs are optimized for grouped operations and are faster than row-at-a-time Python UDFs.

schema = StructType([
    StructField("id", IntegerType()),
    StructField("words", StringType()),
    StructField("value", IntegerType()),
    StructField("total_words", IntegerType()),
    StructField("mean_subtracted", DoubleType())  # fractional result, so use a double
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # Keep only the columns declared in the output schema, then add the mean-subtracted value
    pdf = pdf[['id', 'words', 'value', 'total_words']]
    return pdf.assign(mean_subtracted=pdf.value - pdf.value.mean())
data_pandas_udf = data_spark.groupby('id').apply(subtract_mean)
display(data_pandas_udf)
Created mean_subtracted column using Pandas UDF
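
In Spark 3.0 and later, the same split-apply-combine pattern is expressed with applyInPandas, which replaces the PandasUDFType.GROUPED_MAP decorator; a minimal sketch reusing the schema defined above:

# Plain Python function (no decorator) passed to applyInPandas together with the output schema
def subtract_mean_plain(pdf):
    pdf = pdf[['id', 'words', 'value', 'total_words']]
    return pdf.assign(mean_subtracted=pdf.value - pdf.value.mean())

data_pandas_udf = data_spark.groupby('id').applyInPandas(subtract_mean_plain, schema=schema)
display(data_pandas_udf)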

The best part here is that the data is distributed across the worker nodes, the aggregations are performed on big data in parallel, and the results are obtained efficiently using Spark.

P.S.: The goal of this article is just to show the functionality of Spark and what it offers (with some examples). The dataset used in this article is very small, so if you want to test these features, try them out on large-scale datasets.

Conclusion

Spark is inherently multi-threaded and can make use of all cores of your machine. By breaking down the total amount of work into individual tasks, which then can be processed independently (as soon as each task’s input data is available) in parallel, Spark makes very efficient use of the available cluster resources.

Spark also scales very well with huge amounts of data. It not only scales across multiple machines in a cluster; the ability to spill intermediate results to disk is also at the core of Spark’s design. Therefore, Spark is almost never limited by the total amount of main memory, but only by the total amount of available disk space.

The virtually unlimited scalability both in terms of data and processing power makes Spark a distributed and parallel relational execution engine for processing big data.
