Apache Spark at a glance
This story began as a presentation about Apache Spark and was converted into a blog post. It is a general introduction to Apache Spark for development teams.
What is Apache Spark?
Apache Spark is a general-purpose platform for quickly processing large-scale data, developed in the Scala programming language.
- A framework for distributed computing
- In-memory, fault tolerant data structures
- API that supports Scala, Java, Python, R, SQL
- Open source
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
- Unified: a single platform for developing big data applications. Within the same platform you can perform data loading, SQL queries, data transformations, machine learning, streaming, and general computation.
- Computing engine: Spark performs computations over data wherever it lives (cloud storage, files, SQL databases, Hadoop, Amazon S3, etc.); it does not store the data itself.
- Libraries: built-in and external libraries. The core engine changes little; Spark SQL, MLlib, Spark Streaming, and GraphX build on top of it, and third-party packages are listed at spark-packages.org.
Map-Reduce
Map: shape the data into a structure convenient for analysis.
Reduce: perform the analysis on the shaped data.
Analogy to an RDBMS (Relational Database Management System):
- Map: select the desired fields with the SELECT clause and filter rows with the WHERE clause.
- Reduce: perform calculations on the data with COUNT, SUM, HAVING, etc.
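As a concrete illustration of the two phases, here is a minimal word count in Spark; the local master setting and the input.txt file are assumptions made purely for this sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    // "Map" phase: reshape raw lines into (word, 1) pairs
    val pairs = sc.textFile("input.txt")            // hypothetical input file
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))

    // "Reduce" phase: aggregate the pairs per key, like COUNT/SUM with GROUP BY
    val counts = pairs.reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```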
Why Apache Spark?
Performance
Development Productivity
Rich API
Apache Spark General Usage Areas
Interactive Query
- Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
- Faster time to job completion allows analysts to ask the “next” question about their data & business
Large-Scale Batch
- Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
- Nightly ETL processing from production systems
Complex Analytics
- Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
- Data mining across various types of data
Event Processing
- Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
- Responsive monitoring of RFID-tagged devices
Model Building
- Predictive modeling answers questions of “what will happen?”
- Self-tuning machine learning, continually updating algorithms, and predictive modeling
Apache Spark Structure and Architecture
Driver Process:
- Runs the main function; the user's code executes in this process.
- Maintains all relevant information about the Spark application.
- Responds to the user's program and its input.
- Distributes work to the executors and schedules it.
Executor Process:
- Runs the tasks assigned to it by the driver.
- Executes the user code shipped to it by the driver.
- Reports the status of its assigned tasks back to the driver.
Cluster Manager:
- Controls the physical machines and allocates resources to Spark applications.
- Cluster managers: the Spark standalone cluster manager, YARN, Mesos, Kubernetes.
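As a rough sketch of how an application plugs into this architecture: the entry point builds a SparkSession (the driver), and the master URL decides which cluster manager allocates the executors. The local[*] master below is an assumption for illustration; a real deployment would usually pass the master via spark-submit.

```scala
import org.apache.spark.sql.SparkSession

// The driver process starts here: main() builds the SparkSession.
// The master URL selects the cluster manager; "local[*]" runs everything
// in-process, while e.g. "yarn" or "spark://host:7077" would hand
// resource allocation to YARN or a standalone master.
val spark = SparkSession.builder()
  .appName("ArchitectureSketch")
  .master("local[*]")              // illustrative only
  .getOrCreate()

// Work defined here is split into tasks and shipped to the executors.
val sum = spark.sparkContext.parallelize(1 to 1000).reduce(_ + _)
println(sum)
```

The later snippets in this post reuse this spark session.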
Apache Spark APIs
Low-Level APIs- RDD
- Resilient Distributed Dataset
- Basic Spark Abstraction
- Virtually everything in Spark is built on RDDs
- Data operations are performed on RDDs
- An immutable, partitioned collection of records that can be operated on in parallel.
- No schema; each record is a plain Java/Python object.
- Can be persisted in memory or on disk.
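A minimal sketch of these properties, reusing the spark session from the architecture example (the numbers and partition count are arbitrary):

```scala
import org.apache.spark.storage.StorageLevel

// An RDD is an immutable, partitioned collection of records.
val numbers = spark.sparkContext.parallelize(1 to 100, 4)

// Records are plain JVM objects; there is no schema.
val squares = numbers.map(n => n * n)

// An RDD can be persisted in memory, on disk, or both.
squares.persist(StorageLevel.MEMORY_AND_DISK)
println(squares.count())
```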
Apache Spark DataSources
Apache Spark Data Operations
Apache Spark Data Operations- Transformations
- map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
- filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
- groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
- union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
- distinct([numTasks])): Return a new dataset that contains the distinct elements of the source dataset.
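An illustrative sketch that strings several of these transformations together (the sample sentences are made up):

```scala
val lines = spark.sparkContext.parallelize(Seq("spark makes big data simple", "big data at scale"))

val tokens   = lines.flatMap(_.split(" "))          // flatMap: one line -> many words
val longOnes = tokens.filter(_.length > 3)          // filter: keep words longer than 3 chars
val byLetter = longOnes.map(w => (w.head, w))       // map: (first letter, word) pairs
val grouped  = byLetter.groupByKey()                // groupByKey: (Char, Iterable[String])
val combined = tokens.union(tokens).distinct()      // union followed by distinct

// Nothing has executed yet: transformations are lazy until an action runs.
```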
Apache Spark Data Operations- Actions
- reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
- collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
- count(): Return the number of elements in the dataset.
- first(): Return the first element of the dataset (similar to take(1)).
- take(n): Return an array with the first n elements of the dataset.
- takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
- saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
- saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface.
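And a matching sketch of actions, again reusing the same session; the output path is a made-up example:

```scala
val nums = spark.sparkContext.parallelize(1 to 10)

val total   = nums.reduce(_ + _)                    // 55, computed in parallel
val all     = nums.collect()                        // Array(1, ..., 10) on the driver
val howMany = nums.count()                          // 10
val firstN  = nums.take(3)                          // Array(1, 2, 3)

// Each element is converted with toString and written as a line of text.
nums.saveAsTextFile("/tmp/spark-demo-output")       // hypothetical output directory
```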
Apache Spark APIs- Low-Level APIs- DAG
- Transformations and actions define an application's Directed Acyclic Graph (DAG).
- From the DAG, a physical execution plan is derived.
- The DAG scheduler splits the DAG into multiple stages based on the transformations.
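One quick way to look at the lineage the DAG scheduler works from is RDD.toDebugString; the chain below is illustrative only:

```scala
val lineage = spark.sparkContext.parallelize(1 to 1000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)        // shuffle dependency: forces a new stage

// Prints the lineage that the DAG scheduler turns into stages and tasks.
println(lineage.toDebugString)
```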
Apache Spark APIs- Dataset and DataFrame
Dataset
- distributed collection of data
- strongly typed
- uses the Spark SQL engine
- uses Encoders to optimize filtering, sorting, and hashing without deserializing the objects
DataFrame
- a Dataset organized into named columns, i.e. Dataset[Row]
- the equivalent of a relational database table (it has a schema)
- not strongly typed
- uses the Catalyst optimizer on the logical plan, e.g. pushing down filters and aggregations
- uses the Tungsten execution engine on the physical plan, optimizing memory usage
As of Spark 1.6, the DataFrame API was stable and the Dataset API experimental; with Spark 2.x the Dataset API became stable as well.
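A brief, spark-shell-style sketch of the difference between the two abstractions (the Person case class and the sample rows are invented for illustration):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// A Dataset is strongly typed: each record is a Person, checked at compile time.
case class Person(name: String, age: Int)
val people: Dataset[Person] = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()

// A DataFrame is a Dataset[Row] with named columns and a schema.
val df: DataFrame = people.toDF()
df.printSchema()

// Both run through the same SQL engine (Catalyst + Tungsten).
people.filter(_.age > 30).show()      // typed, lambda-based
df.filter($"age" > 30).show()         // untyped, column-expression-based
```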
Apache Spark APIs- SparkSQL
- Provides relational queries expressed in SQL, HiveQL, and Scala
- Seamlessly mix SQL queries with Spark programs
- DataFrame/Dataset provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files
- Leverages Hive frontend and metastore:
- Compatibility with Hive data, queries, and UDFs
- HiveQL limitations may apply
- Not ANSI SQL compliant
- Little to no query rewrite optimization, automatic memory management or sophisticated workload management
- Graduated from alpha status with Spark 1.3
- Standard connectivity through JDBC/ODBC
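A minimal sketch of mixing SQL with DataFrame code, reusing the df DataFrame from the previous example:

```scala
// Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()

// SQL results are DataFrames, so they mix freely with Dataset/DataFrame code.
println(adults.count())
```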
Apache Spark APIs- Streaming
- Component of Spark
- Project started in 2012
- First alpha release in Spring 2013
- Out of alpha with Spark 0.9.0
- Discretized Stream (DStream) programming abstraction
- Represented as a sequence of RDDs (micro-batches)
- RDD: set of records for a specific time interval
- Supports Scala, Java, and Python (with limitations)
- Fundamental architecture: batch processing of datasets
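A minimal DStream sketch; the socket source, host, port, and batch interval are assumptions for illustration (e.g. feed it with `nc -lk 9999`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch streaming: each 5-second interval becomes one RDD in the DStream.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Hypothetical source: text lines arriving on a local socket.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```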
Apache Spark APIs- Machine Learning
- Spark ML is Spark's machine learning library
- The RDD-based package spark.mllib is now in maintenance mode
- The primary API is now the DataFrame-based package spark.ml
- Provides common algorithms and utilities:
- Classification
- Regression
- Clustering
- Collaborative filtering
- Dimensionality reduction
- Essentially, Spark ML provides a toolset for building "pipelines" of machine-learning-related transformations on your data.
- It makes it easy, for example, to chain feature extraction, dimensionality reduction, and the training of a classifier into a single model, which as a whole can later be used for classification (see the sketch below).
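A small pipeline sketch along those lines, with invented training data; it chains a tokenizer, a hashing feature extractor, and logistic regression into one model:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training data: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "slow and tedious", 0.0)
)).toDF("id", "text", "label")

// Chain feature extraction and a classifier into a single pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// The fitted pipeline is used as one model for classification.
model.transform(training).select("text", "prediction").show()
```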
Apache Spark APIs- GraphX
- Flexible graph processing
- GraphX unifies ETL, exploratory analysis, and iterative graph computation
- You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API
- GraphX is Apache Spark’s API for graphs and graph-parallel computation.
- http://graphframes.github.io/
- https://spark.apache.org/graphx/
- In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
- http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
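A toy GraphX sketch with an invented follower graph, running the built-in PageRank algorithm:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical toy graph: users as vertices, "follows" relations as edges.
val users = spark.sparkContext.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")
))
val follows = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(users, follows)

// Built-in graph algorithm: PageRank over the follower graph.
graph.pageRank(0.001).vertices
  .join(users)
  .map { case (_, (rank, name)) => (name, rank) }
  .collect()
  .foreach(println)
```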
References
- https://spark.apache.org/docs/latest/
- https://databricks.com/resources
- The Data Scientist’s Guide to Apache Spark, DataBricks
- Spark: The Definitive Guide Big Data Processing Made Simple, Bill Chambers and Matei Zaharia
- Introduction to Apache Spark — edX https://courses.edx.org/asset-v1:BerkeleyX+CS105x+1T2016+type@asset+block@Lecture1s.pdf
- Intro to Apache Spark https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
- Apache Spark 101 — Duke University https://www2.cs.duke.edu/courses/spring15/compsci516/Lectures/LanceIntroSpark.pdf
- Big Data and Apache Spark info.cs.pub.ro/scoaladevara/prez2017/ubis2.pdf
- Introduction to Apache Spark csis.pace.edu/ctappert/dps/d860–17/T2-Spark.pptx