Apache Spark 3.0 — What's New?

Pradyumna Kashyap
Big Data Processing
4 min read · Jan 27, 2020

Spark was first developed as a data processing framework at UC Berkeley's AMPLab by Matei Zaharia in 2009. In 2010, it became an open-source project under a Berkeley Software Distribution license. In 2013, the project was donated to the Apache Software Foundation and its license was changed to Apache 2.0; it has since become a major part of most Big Data applications.

Apache Spark is a unified analytics engine used mainly to process Big Data. It comprises many built-in modules supporting streaming, SQL, machine learning, and graph processing. Spark has seen rapid adoption by enterprises across a wide range of industries; internet powerhouses like Facebook, Hotels.com, Cisco, Microsoft, and Netflix have deployed Spark at massive scale, processing multiple petabytes of data on clusters of over 8,000 nodes.

After this brief introduction to Spark, let's dive right into the new and exciting feature add-ons introduced with Spark 3.0:

  1. Starting with language support, a major change in Spark 3.0 is the deprecation of all Python 2.x versions: users see a deprecation warning prompting them to migrate to Python 3, and a migration guide is available to help. Scala is upgraded to 2.12, and JDK 11 is now fully supported.
  2. YARN — auto-discovery of GPUs on a cluster or on a single-node system. The properties shown in the sketch below must be set for this to work; users are also allowed to schedule GPUs.
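
A minimal configuration sketch in Scala, assuming Spark 3.0's accelerator-aware scheduling properties; the discovery-script path is an assumption and will vary by installation (Spark ships an example getGpusResources.sh script):

import org.apache.spark.sql.SparkSession

// Accelerator-aware scheduling (Spark 3.0). The discovery script reports
// the GPU addresses available on each node; the path below is assumed.
val spark = SparkSession.builder()
  .appName("gpu-discovery-example")
  .config("spark.executor.resource.gpu.amount", "1") // GPUs per executor
  .config("spark.task.resource.gpu.amount", "1")     // GPUs per task
  .config("spark.executor.resource.gpu.discoveryScript",
    "/opt/spark/scripts/getGpusResources.sh")        // assumed path
  .getOrCreate()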

3. A basic yet important feature update allows binary files to be read directly as a data source into Spark DataFrames:

val df = spark.read.format("binaryFile").load(dir.getPath)

4. Dynamic Partition Pruning is a feature that applies to both the logical and physical planning phases; the aim is to skip scanning unneeded partitions during joins. It works by taking the filter set applied to the dimension table (the smaller table) in a broadcast hash join (map-side join) and using it to prune partitions of the fact table, as the sketch below shows.
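
A minimal sketch, assuming hypothetical sales (fact table, partitioned by date_id) and dates (dimension) tables; the feature is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, which is on by default:

// Hypothetical tables: sales is a large fact table partitioned by date_id,
// dates is a small dimension table.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

val sales = spark.table("sales") // fact table
val dates = spark.table("dates") // dimension table

// The filter on the dimension side is reused to prune fact partitions,
// so only the sales partitions matching 2019 dates are scanned.
val q = sales
  .join(dates, sales("date_id") === dates("date_id"))
  .where(dates("year") === 2019)

q.explain() // look for "dynamicpruning" in the partition filters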

5. Spark 3.0 has interesting updates for data engineers and data scientists as well: it supports heterogeneous GPUs like AMD, Intel, and NVIDIA, and offers NVIDIA GPU acceleration that can run across multiple GPUs.

6. Execution of Spark SQL sees some major improvements. In earlier versions of Spark, when the ANALYZE command was run, optimization could happen only in the planning phase (for example, its statistics would help decide whether a broadcast hash join or a sort-merge join should be used on the data). With Spark 3.0, adaptive execution of Spark SQL is rolled out: Spark can examine the data at run time (even when the decision could not be made in the planning phase) and opt in to a broadcast hash join over an expensive sort-merge join after the data is loaded.
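
A minimal sketch for turning adaptive query execution on; in Spark 3.0 it is disabled by default, and the threshold value below is illustrative:

// Enable adaptive query execution (off by default in Spark 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")

// With AQE on, a sort-merge join can be converted to a broadcast hash
// join at run time if one side turns out to be smaller than the
// broadcast threshold (the value here is illustrative).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")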

7. SparkGraph is a new module introduced with Spark 3.0 carrying major graph-processing features. It includes support for the Cypher query language (from Neo4j), also known as the "SQL" for graphs. It has its own Catalyst-based optimizer, allowing Cypher queries on graphs to be processed in a way similar to how Spark SQL operates.

8. The introduction of Delta Lake brings ACID transactions to Apache Spark 3.0, making data lakes more reliable. A simple way to put this: Delta Lake lets users focus on their logic rather than on inconsistencies, since it handles simultaneous modifications by multiple writers. A short read/write sketch follows.
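
A minimal sketch, assuming the Delta Lake dependency (io.delta:delta-core) is on the classpath, an existing DataFrame df, and a hypothetical storage path:

// df is an existing DataFrame; the path below is a hypothetical location.
// Each write is an ACID transaction recorded in the table's transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

// Concurrent readers always see a consistent snapshot of the table.
val events = spark.read.format("delta").load("/tmp/delta/events")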

9. DataSource V2 improvements are included with Spark 3.0, like pluggable catalog integration and improved predicate push-down. In simple terms, this helps performance: a basic scan reads all the data first and applies a filter afterwards, whereas push-down applies the filter first and scans only the filtered data.

import spark.implicits._ // needed for the $"year" column syntax
df.writeTo("catalog.db.table").overwrite($"year" === "2019")

10. Kubernetes support gets a major boost with Spark 3.0. Support for Kubernetes in the 2.x versions was primitive, with major performance issues compared to the YARN cluster manager. Spark 3.0 brings big improvements: a new shuffle mechanism for Spark on Kubernetes that allows dynamic scaling, and GPU support with pod isolation for executors, making scheduling more flexible on clusters with multiple GPUs. A configuration sketch follows.
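
A minimal sketch of dynamic allocation on Kubernetes with Spark 3.0, which relies on shuffle-file tracking rather than an external shuffle service; the master URL and container image below are assumed placeholders:

import org.apache.spark.sql.SparkSession

// Dynamic allocation on Kubernetes in Spark 3.0 uses shuffle tracking
// instead of an external shuffle service. URL and image are placeholders.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.example.com:6443") // assumed URL
  .config("spark.kubernetes.container.image", "myrepo/spark:3.0.0") // assumed image
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()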

With these exciting feature add-ons promised in the new Spark 3.0, the wait for the stable release is on!

References :

  1. Spark JIRAs — https://issues.apache.org/jira/browse/SPARK-26078?jql=statusCategory%20%3D%20done%20AND%20project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012339177%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
  2. https://spark.apache.org/docs/3.0.0-preview/
  3. https://databricks.com/session_eu19/new-developments-in-the-open-source-ecosystem-apache-spark-3-0-delta-lake-and-koalas
  4. https://databricks.com/session_eu19/graph-features-in-spark-3-0-integrating-graph-querying-and-algorithms-in-spark-graph
  5. https://www.signifytechnology.com/blog/2019/08/a-glimpse-at-the-future-of-apache-spark-3-dot-0-with-deep-learning-and-kubernetes-with-oliver-white-and-holden-karau
