Apache Spark at a glance

Gökhan Gürgeç · cloudnesil · Jul 17, 2019

This story was a presentation about Apache Spark and converted to a blog post. It is a general presentation of Apache Spark to development teams.

What is Apache Spark?

Apache Spark is a general-purpose platform for fast processing of large-scale data, developed in the Scala programming language.

  • A framework for distributed computing
  • In-memory, fault tolerant data structures
  • API that supports Scala, Java, Python, R, SQL
  • Open source

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.

  • Unified: a unified platform for developing big data applications. Within the same platform you can perform data loading, SQL queries, data transformation, machine learning, streaming and graph computation.
  • Computing Engine: Spark performs computations over data that lives somewhere else (cloud storage, files, SQL databases, Hadoop, Amazon S3, etc.); it does not store the data itself.
  • Libraries: built-in and external libraries. The core engine changes little; libraries such as Spark SQL, MLlib, Spark Streaming and GraphX, plus third-party packages on spark-packages.org, extend it.
(Image source: https://towardsdatascience.com/deep-learning-with-apache-spark-part-2-2a2938a36d35)

Map-Reduce

Map: reshape the data into a structure that is convenient for analysis.

Reduce: perform the analysis on the mapped data.

Analogy to an RDBMS (Relational Database Management System):

  • Map: select the wanted fields with the SELECT clause and filter rows with the WHERE clause.
  • Reduce: make calculations on the data with COUNT, SUM, HAVING, etc.
(Image source: http://devveri.com/hadoop/hadoop-mapreduce-ornek-uygulama)
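
As a rough illustration of the same idea in Spark terms, here is a minimal word-count sketch for the Spark shell; it assumes a SparkContext named sc is available and the input path is hypothetical:

```scala
// Word count as a map/reduce job in the Spark shell (sc is the SparkContext).
// The input path below is hypothetical.
val lines = sc.textFile("hdfs:///logs/access.log")

// "Map" phase: reshape each line into (key, value) pairs convenient for analysis.
val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

// "Reduce" phase: aggregate the values per key.
val counts = pairs.reduceByKey(_ + _)

counts.take(10).foreach(println)
```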

Why Apache Spark?

Performance

Development Productivity

Rich API

Apache Spark General Usage Areas

Interactive Query

  • Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
  • Faster time to job completion allows analysts to ask the “next” question about their data & business

Large-Scale Batch

  • Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
  • Nightly ETL processing from production systems

Complex Analytics

  • Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
  • Data mining across various types of data

Event Processing

  • Web server log file analysis (human-readable file formats that are rarely read by humans) in
    near-real time
  • Responsive monitoring of RFID-tagged devices

Model Building

  • Predictive modeling answers questions of “what will happen?”
  • Self-tuning machine learning, continually updating algorithms, and predictive modeling

Apache Spark Structure and Architecture

Driver Process:

  • Runs the main function; the user code lives in this process.
  • Maintains all relevant information about the Spark application.
  • Responds to the user program and its input.
  • Distributes the jobs to the executors and schedules them.

Executor Process:

  • Runs the tasks assigned to it by the driver.
  • Reports the status of those tasks back to the driver.

Cluster Manager:

  • Controls the physical machines and allocates resources to Spark applications.
  • Available cluster managers: Spark Standalone, YARN, Mesos, Kubernetes.
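
As a minimal sketch of how these pieces come together in code (assuming the Spark dependencies are on the classpath), the master URL passed to the SparkSession selects the cluster manager; local mode is used here only for illustration:

```scala
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // The driver process starts here. The master URL selects the cluster manager:
    // "local[*]" for a single JVM, "yarn", "mesos://...", "k8s://...",
    // or "spark://host:7077" for the standalone manager.
    val spark = SparkSession.builder()
      .appName("spark-at-a-glance")
      .master("local[*]")
      .getOrCreate()

    // Work described here is split into tasks and shipped to the executors.
    val sum = spark.sparkContext.parallelize(1 to 1000).reduce(_ + _)
    println(s"sum = $sum")

    spark.stop()
  }
}
```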

Apache Spark APIs

Low Level APIs- RDD

  • Resilient Distributed Dataset
  • The basic Spark abstraction
  • Virtually everything in Spark is built on top of RDDs
  • Data operations are performed on RDDs
  • An immutable, partitioned collection of records that can be operated on in parallel
  • No schema; a row is a plain Java/Python object
  • Can be persisted in memory or on disk
(Image source: https://www.slideshare.net/sparkInstructor/apache-spark-rdd-101)
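
A short sketch of working with RDDs directly from the Spark shell, where a SparkContext named sc is already available:

```scala
// Create an RDD from a local collection; records are plain Scala objects, no schema.
val numbers = sc.parallelize(1 to 100, numSlices = 4)   // 4 partitions

// RDDs are immutable: each operation returns a new RDD instead of mutating the old one.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Optionally keep the result in memory (or on disk with other storage levels).
squares.cache()

println(squares.count())
```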

Apache Spark Data Sources

Apache Spark Data Operations

Apache Spark Data Operations- Transformations

  • map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
  • filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
  • groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
  • union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
  • distinct([numTasks])): Return a new dataset that contains the distinct elements of the source dataset.
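
A brief sketch chaining some of these transformations in the Spark shell (sc being the SparkContext); transformations are lazy, so nothing runs until an action is called:

```scala
val words  = sc.parallelize(Seq("spark", "scala", "spark", "hadoop"))
val others = sc.parallelize(Seq("hadoop", "flink"))

// map / filter: element-wise, lazily evaluated.
val upper = words.map(_.toUpperCase)
val sOnly = words.filter(_.startsWith("s"))

// flatMap: each input element can produce zero or more output elements.
val lines  = sc.parallelize(Seq("spark is fast", "scala is fun"))
val tokens = lines.flatMap(_.split(" "))

// union + distinct.
val all = words.union(others).distinct()

// groupByKey on (K, V) pairs -> (K, Iterable[V]).
val byLength = words.map(w => (w.length, w)).groupByKey()

// Nothing has executed yet; an action (next section) triggers the computation.
```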

Apache Spark Data Operations- Actions

  • reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
  • collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
  • count(): Return the number of elements in the dataset.
  • first(): Return the first element of the dataset (similar to take(1)).
  • take(n): Return an array with the first n elements of the dataset.
  • takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
  • saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
  • saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface.
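
A sketch of a few of these actions from the Spark shell; the output path is hypothetical:

```scala
val nums = sc.parallelize(1 to 10)

val total    = nums.reduce(_ + _)          // 55
val asArray  = nums.collect()              // all elements returned to the driver
val howMany  = nums.count()                // 10
val firstTwo = nums.take(2)                // Array(1, 2)
val sample   = nums.takeSample(withReplacement = false, num = 3, seed = 42L)

// Write results out; each element is converted to a line of text with toString.
nums.saveAsTextFile("/tmp/spark-demo-output")   // hypothetical path
```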

Apache Spark APIs- Low Level APIs- DAG

  • Transformations and actions define an application’s Directed Acyclic Graph (DAG).
  • From the DAG a physical execution plan is derived.
  • The DAG scheduler splits the DAG into multiple stages based on the transformations (a shuffle marks a stage boundary).
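
One way to peek at this from the shell is RDD.toDebugString, which prints the lineage; a quick sketch where groupByKey introduces the shuffle and hence a new stage:

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()          // wide transformation -> shuffle boundary
val sized   = grouped.mapValues(_.size)   // narrow transformation, same stage

// Prints the lineage; the indentation marks the stage split at the shuffle.
println(sized.toDebugString)
```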

Apache Spark APIs- Dataset DataFrame

Dataset

  • A distributed collection of data
  • Strongly typed
  • Uses the Spark SQL engine
  • Uses Encoders to optimize filtering, sorting and hashing without deserializing the objects

DataFrame

  • A Dataset organized into named columns, i.e. Dataset[Row]
  • The equivalent of a relational database table (it has a schema)
  • Not strongly typed
  • Uses the Catalyst optimizer on the logical plan, pushing down filters and aggregations
  • Uses the Tungsten execution engine on the physical plan to optimize memory usage

The DataFrame API arrived as stable in Spark 1.3 and the Dataset API as experimental in Spark 1.6; in Spark 2.x the Dataset API became stable and the two were unified (a DataFrame is a Dataset[Row]).

(Image source: https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797)
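
A small sketch contrasting the typed Dataset API with the untyped DataFrame API, assuming a SparkSession named spark (as in the Spark shell):

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Person(name: String, age: Int)

// Dataset: strongly typed, checked at compile time, backed by an Encoder.
val people: Dataset[Person] = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()
val adults = people.filter(_.age >= 30)          // plain Scala lambda over Person

// DataFrame: Dataset[Row] with named columns and a schema, untyped at compile time.
val df = people.toDF()
df.filter($"age" >= 30).select($"name").show()
```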

Apache Spark APIs- SparkSQL

  • Provides relational queries expressed in SQL, HiveQL and Scala
  • Seamlessly mixes SQL queries with Spark programs
  • DataFrame/Dataset provide a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
  • Leverages the Hive frontend and metastore:
  • Compatibility with Hive data, queries and UDFs
  • HiveQL limitations may apply
  • Not ANSI SQL compliant
  • Little to no query rewrite optimization, automatic memory management or sophisticated workload management
  • Graduated from alpha status with Spark 1.3
  • Standard connectivity through JDBC/ODBC
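
A short sketch of mixing SQL with DataFrame code, assuming a SparkSession named spark and a hypothetical people.json input file:

```scala
// Load structured data (the path is hypothetical).
val people = spark.read.json("examples/people.json")

// Register it as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

adults.show()
```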

Apache Spark APIs- Streaming

  • Component of Spark
  • Project started in 2012
  • First alpha release in Spring 2013
  • Out of alpha with Spark 0.9.0
  • Discretized Stream (DStream) programming abstraction
  • Represented as a sequence of RDDs (micro-batches)
  • RDD: set of records for a specific time interval
  • Supports Scala, Java, and Python (with limitations)
  • Fundamental architecture: stream processing built as batch processing of small datasets (micro-batches)
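
The classic DStream sketch is a word count over micro-batches; it assumes text arriving on a local socket (for example one opened with nc -lk 9999) and a SparkSession named spark:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 10-second batch interval produces one RDD in the DStream.
val ssc   = new StreamingContext(spark.sparkContext, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()

ssc.start()
ssc.awaitTermination()
```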

Apache Spark APIs- Machine Learning

  • Spark ML is Spark’s machine learning library.
  • The RDD-based package spark.mllib is now in maintenance mode.
  • The primary API is now the DataFrame-based package spark.ml.
  • Provides common algorithms and utilities:
  • Classification
  • Regression
  • Clustering
  • Collaborative filtering
  • Dimensionality reduction
  • Basically, Spark ML gives you a toolset for creating “pipelines” of different machine-learning-related transformations on your data.
  • It makes it easy, for example, to chain feature extraction, dimensionality reduction and the training of a classifier into one model, which as a whole can later be used for classification.
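
A hedged sketch of such a pipeline with the DataFrame-based spark.ml API, chaining tokenization, feature hashing and a logistic regression classifier into a single model (toy data, assuming a SparkSession named spark):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data: id, text, label.
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "slow batch job", 0.0),
  (2L, "spark streaming rocks", 1.0)
)).toDF("id", "text", "label")

// Feature extraction and the classifier chained into a single pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// The fitted PipelineModel can later be used as a whole for classification.
val model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()
```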

Apache Spark APIs- GraphX
