Apache Spark at a glance
This story began as a presentation about Apache Spark and was converted into a blog post. It is a general introduction to Apache Spark for development teams.
What is Apache Spark?
Apache Spark is a general-purpose platform for quickly processing large-scale data, developed in the Scala programming language.
- A framework for distributed computing
- In-memory, fault tolerant data structures
- API that supports Scala, Java, Python, R, SQL
- Open source
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
- Unified: a single platform for developing big data applications. Within the same platform you can perform data loading, SQL queries, data transformations, machine learning, streaming, and general computation.
- Computing engine: Spark performs computations over data wherever it lives (cloud storage, files, SQL databases, Hadoop, Amazon S3, etc.); it does not store the data itself.
- Libraries: built-in and external libraries. The core engine changes little; Spark SQL, MLlib, Spark Streaming, and GraphX build on top of it, and third-party packages are listed at spark-packages.org.
Map-Reduce
Map: shape the data into a structure convenient for analysis.
Reduce: perform the analysis on the shaped data.
Analogy to an RDBMS (Relational Database Management System):
- Map: select the desired fields with the SELECT clause and filter rows with the WHERE clause.
- Reduce: perform calculations on the data with COUNT, SUM, HAVING, etc.
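As a concrete illustration of the two phases, here is a minimal word count in Spark; the local master setting and the input.txt file are assumptions made purely for this sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    // "Map" phase: reshape raw lines into (word, 1) pairs
    val pairs = sc.textFile("input.txt")            // hypothetical input file
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))

    // "Reduce" phase: aggregate the pairs per key, like COUNT/SUM with GROUP BY
    val counts = pairs.reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```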
Why Apache Spark?
Performance
Development Productivity
Rich API
Apache Spark General Usage Areas
Interactive Query
- Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
- Faster time to job completion allows analysts to ask the “next” question about their data & business
Large-Scale Batch
- Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
- Nightly ETL processing from production systems
Complex Analytics
- Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
- Data mining across various types of data
Event Processing
- Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
- Responsive monitoring of RFID-tagged devices
Model Building
- Predictive modeling answers questions of “what will happen?”
- Self-tuning machine learning, continually updating algorithms, and predictive modeling
Apache Spark Structure and Architecture
Driver Process:
- Runs the main function; the user's code executes in this process.
- Maintains all relevant information about the Spark application.
- Responds to the user's program and its input.
- Distributes work to the executors and schedules it.
Executor Process:
- Runs the tasks assigned to it by the driver.
- Executes the user code shipped to it by the driver.
- Reports the status of its assigned tasks back to the driver.
Cluster Manager:
- Controls the physical machines and allocates resources to Spark applications.
- Cluster managers: the Spark standalone cluster manager, YARN, Mesos, Kubernetes.
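As a rough sketch of how an application plugs into this architecture: the entry point builds a SparkSession (the driver), and the master URL decides which cluster manager allocates the executors. The local[*] master below is an assumption for illustration; a real deployment would usually pass the master via spark-submit.

```scala
import org.apache.spark.sql.SparkSession

// The driver process starts here: main() builds the SparkSession.
// The master URL selects the cluster manager; "local[*]" runs everything
// in-process, while e.g. "yarn" or "spark://host:7077" would hand
// resource allocation to YARN or a standalone master.
val spark = SparkSession.builder()
  .appName("ArchitectureSketch")
  .master("local[*]")              // illustrative only
  .getOrCreate()

// Work defined here is split into tasks and shipped to the executors.
val sum = spark.sparkContext.parallelize(1 to 1000).reduce(_ + _)
println(sum)
```

The later snippets in this post reuse this spark session.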
Apache Spark APIs
Low-Level APIs- RDD
- Resilient Distributed Dataset
- Basic Spark Abstraction
- Virtually everything in Spark is built on RDDs
- Data operations are performed on RDDs
- An immutable, partitioned collection of records that can be operated on in parallel.
- No schema; each record is a plain Java/Python object.
- Can be persisted in memory or on disk.
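A minimal sketch of these properties, reusing the spark session from the architecture example (the numbers and partition count are arbitrary):

```scala
import org.apache.spark.storage.StorageLevel

// An RDD is an immutable, partitioned collection of records.
val numbers = spark.sparkContext.parallelize(1 to 100, 4)

// Records are plain JVM objects; there is no schema.
val squares = numbers.map(n => n * n)

// An RDD can be persisted in memory, on disk, or both.
squares.persist(StorageLevel.MEMORY_AND_DISK)
println(squares.count())
```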
Apache Spark DataSources
Apache Spark Data Operations
Apache Spark Data Operations- Transformations
- map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
- filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
- groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
- union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
- distinct([numTasks])): Return a new dataset that contains the distinct elements of the source dataset.
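An illustrative sketch that strings several of these transformations together (the sample sentences are made up):

```scala
val lines = spark.sparkContext.parallelize(Seq("spark makes big data simple", "big data at scale"))

val tokens   = lines.flatMap(_.split(" "))          // flatMap: one line -> many words
val longOnes = tokens.filter(_.length > 3)          // filter: keep words longer than 3 chars
val byLetter = longOnes.map(w => (w.head, w))       // map: (first letter, word) pairs
val grouped  = byLetter.groupByKey()                // groupByKey: (Char, Iterable[String])
val combined = tokens.union(tokens).distinct()      // union followed by distinct

// Nothing has executed yet: transformations are lazy until an action runs.
```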
Apache Spark Data Operations- Actions
- reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
- collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
- count(): Return the number of elements in the dataset.
- first(): Return the first element of the dataset (similar to take(1)).
- take(n): Return an array with the first n elements of the dataset.
- takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
- saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
- saveAsSequenceFile(path): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface.
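And a matching sketch of actions, again reusing the same session; the output path is a made-up example:

```scala
val nums = spark.sparkContext.parallelize(1 to 10)

val total   = nums.reduce(_ + _)                    // 55, computed in parallel
val all     = nums.collect()                        // Array(1, ..., 10) on the driver
val howMany = nums.count()                          // 10
val firstN  = nums.take(3)                          // Array(1, 2, 3)

// Each element is converted with toString and written as a line of text.
nums.saveAsTextFile("/tmp/spark-demo-output")       // hypothetical output directory
```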
Apache Spark APIs- Low-Level APIs- DAG
- Transformations and actions define an application's Directed Acyclic Graph (DAG).
- From the DAG, a physical execution plan is derived.
- The DAG scheduler splits the DAG into multiple stages based on the transformations.
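One quick way to look at the lineage the DAG scheduler works from is RDD.toDebugString; the chain below is illustrative only:

```scala
val lineage = spark.sparkContext.parallelize(1 to 1000)
  .map(_ * 2)
  .filter(_ % 3 == 0)
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)        // shuffle dependency: forces a new stage

// Prints the lineage that the DAG scheduler turns into stages and tasks.
println(lineage.toDebugString)
```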
Apache Spark APIs- Dataset and DataFrame
Dataset
- distributed collection of data
- strongly typed
- uses the Spark SQL engine
- uses Encoders to optimize filtering, sorting, and hashing without deserializing the objects
DataFrame
- a Dataset organized into named columns, i.e. Dataset[Row]
- the equivalent of a relational database table (it has a schema)
- not strongly typed
- uses the Catalyst optimizer on the logical plan, e.g. pushing down filters and aggregations
- uses the Tungsten execution engine on the physical plan, optimizing memory usage
As of Spark 1.6, the DataFrame API was stable and the Dataset API experimental; with Spark 2.x the Dataset API became stable as well.
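A brief, spark-shell-style sketch of the difference between the two abstractions (the Person case class and the sample rows are invented for illustration):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// A Dataset is strongly typed: each record is a Person, checked at compile time.
case class Person(name: String, age: Int)
val people: Dataset[Person] = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()

// A DataFrame is a Dataset[Row] with named columns and a schema.
val df: DataFrame = people.toDF()
df.printSchema()

// Both run through the same SQL engine (Catalyst + Tungsten).
people.filter(_.age > 30).show()      // typed, lambda-based
df.filter($"age" > 30).show()         // untyped, column-expression-based
```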
Apache Spark APIs- SparkSQL
- Provides relational queries expressed in SQL, HiveQL, and Scala
- Seamlessly mix SQL queries with Spark programs
- DataFrame/Dataset provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files
- Leverages Hive frontend and metastore:
- Compatibility with Hive data, queries, and UDFs
- HiveQL limitations may apply
- Not ANSI SQL compliant
- Little to no query rewrite optimization, automatic memory management or sophisticated workload management
- Graduated from alpha status with Spark 1.3
- Standard connectivity through JDBC/ODBC
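A minimal sketch of mixing SQL with DataFrame code, reusing the df DataFrame from the previous example:

```scala
// Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()

// SQL results are DataFrames, so they mix freely with Dataset/DataFrame code.
println(adults.count())
```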
Apache Spark APIs- Streaming
- Component of Spark
- Project started in 2012
- First alpha release in Spring 2013
- Out of alpha with Spark 0.9.0
- Discretized Stream (DStream) programming abstraction
- Represented as a sequence of RDDs (micro-batches)
- RDD: set of records for a specific time interval
- Supports Scala, Java, and Python (with limitations)
- Fundamental architecture: batch processing of datasets
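A minimal DStream sketch; the socket source, host, port, and batch interval are assumptions for illustration (e.g. feed it with `nc -lk 9999`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch streaming: each 5-second interval becomes one RDD in the DStream.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Hypothetical source: text lines arriving on a local socket.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```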
Apache Spark APIs- Machine Learning
- Spark ML is Spark's machine learning library
- The RDD-based package spark.mllib is now in maintenance mode
- The primary API is now the DataFrame-based package spark.ml
- Provides common algorithms and utilities:
- Classification
- Regression
- Clustering
- Collaborative filtering
- Dimensionality reduction
- Essentially, Spark ML provides a toolset for building "pipelines" of machine-learning-related transformations on your data.
- It makes it easy, for example, to chain feature extraction, dimensionality reduction, and the training of a classifier into a single model, which as a whole can later be used for classification (see the sketch below).
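A small pipeline sketch along those lines, with invented training data; it chains a tokenizer, a hashing feature extractor, and logistic regression into one model:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training data: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "slow and tedious", 0.0)
)).toDF("id", "text", "label")

// Chain feature extraction and a classifier into a single pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// The fitted pipeline is used as one model for classification.
model.transform(training).select("text", "prediction").show()
```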
Apache Spark APIs- GraphX
- Flexible graph processing
- GraphX unifies ETL, exploratory analysis, and iterative graph computation
- You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API
- GraphX is Apache Spark’s API for graphs and graph-parallel computation.
- http://graphframes.github.io/
- https://spark.apache.org/graphx/
- In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
- http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
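A toy GraphX sketch with an invented follower graph, running the built-in PageRank algorithm:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical toy graph: users as vertices, "follows" relations as edges.
val users = spark.sparkContext.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")
))
val follows = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(users, follows)

// Built-in graph algorithm: PageRank over the follower graph.
graph.pageRank(0.001).vertices
  .join(users)
  .map { case (_, (rank, name)) => (name, rank) }
  .collect()
  .foreach(println)
```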
References
- https://spark.apache.org/docs/latest/
- https://databricks.com/resources
- The Data Scientist’s Guide to Apache Spark, DataBricks
- Spark: The Definitive Guide Big Data Processing Made Simple, Bill Chambers and Matei Zaharia
- Introduction to Apache Spark — edX https://courses.edx.org/asset-v1:BerkeleyX+CS105x+1T2016+type@asset+block@Lecture1s.pdf
- Intro to Apache Spark https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
- Apache Spark 101 — Duke University https://www2.cs.duke.edu/courses/spring15/compsci516/Lectures/LanceIntroSpark.pdf
- Big Data and Apache Spark info.cs.pub.ro/scoaladevara/prez2017/ubis2.pdf
- Introduction to Apache Spark csis.pace.edu/ctappert/dps/d860–17/T2-Spark.pptx