C++ Engines and the Performant Future of Spark SQL

Intel Granulate Tech Blog Team
Intel Granulate
4 min read · Oct 31, 2023


Apache Spark has long been a reliable framework for processing petabyte-scale datasets. Still, the Spark community has had to work continuously on its performance challenges, introducing various optimizations over time.

Databricks' announcement of Photon has been quite disruptive, and the community is clearly taking an interest: by combining Gluten with Velox, it is building an open source alternative with support from Intel and Kyligence.

Photon: The Next-Generation Engine for the Lakehouse

Photon is a part of the Databricks Lakehouse Platform and is designed to provide extremely fast query performance at a low cost for various workloads - data ingestion, ETL, streaming, data science, and interactive queries, directly on your data lake. It is compatible with Apache Spark APIs, which means no code changes are required to get started.

Photon is built from the ground up for the fastest performance at lower cost, providing up to 80% total cost of ownership (TCO) savings and up to 12x speedups for data and analytics workloads.

However, Photon-enabled compute also costs nearly three times as much, so to actually realize those savings the engine needs to complete workloads at least three times faster. Customers are therefore advised to verify that their total cost of ownership is in fact lower.

Key features of Photon include:

  1. Compatibility with modern Apache Spark APIs: Works with existing code - SQL, Python, R, Scala, and Java.
  2. ANSI-compliant engine: Ensures workloads run seamlessly without code changes.
  3. Optimized for all use cases: Standardizes one set of APIs for all workloads - ETL, analytics, and data science, in batch or streaming.
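Because Photon is selected at the cluster level rather than in application code, turning it on is a configuration change, not a code change. A minimal sketch using the Databricks CLI is below; the `runtime_engine` field is Databricks' documented switch, but the cluster name, node type, and runtime version here are illustrative placeholders, and exact CLI syntax varies by version:

```shell
# Create a cluster with the Photon runtime engine enabled.
# Existing Spark SQL / DataFrame code runs on it unchanged.
databricks clusters create --json '{
  "cluster_name": "photon-demo",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "runtime_engine": "PHOTON"
}'
```

Setting `"runtime_engine": "STANDARD"` (or omitting the field) gives the regular JVM engine, which makes side-by-side TCO comparisons of the kind recommended above straightforward.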

Accelerate Spark SQL Queries with Gluten

Project Gluten is an open-source project that replaces Spark's JVM-based execution engine with native engines, including the Meta-led Velox vectorized execution engine and a ClickHouse execution backend developed by Kyligence. With Gluten and Velox, Apache Spark users can expect performance gains and higher resource utilization.

Gluten connects Apache Spark to vectorized SQL engines and libraries, opening up numerous optimization opportunities: offloading functions and operators to a vectorized library, introducing just-in-time compilation engines, and enabling the use of hardware accelerators (e.g., GPUs and FPGAs). These optimizations can improve performance by 1.5x to 8x.

Key components of Gluten include:

  1. Plan Conversion: Converts Spark’s physical plan to a Substrait plan for each backend.
  2. Fallback Processing: Checks whether each operator is supported by the native library and falls back to the existing Spark JVM engine when it is not.
  3. Memory Management: Leverages Spark’s existing memory management system.
  4. Columnar Shuffle: Reuses Gazelle’s Apache Arrow-based Columnar Shuffle Manager.
  5. Shim Layer: Supports multiple versions of Spark.
  6. Metrics: Supports Spark’s Metrics functionality and extends it with a column-based API and additional metrics.

This open source, community-led solution will have major implications for the Spark market overall. Its improvements could be integrated directly into managed offerings such as AWS EMR, Dataproc, and HDInsight, making them immediately available to all of those services' customers. Indeed, any data engineering team that uses Spark, or plans to, will likely benefit from Gluten and should be preparing to incorporate it into its Big Data strategy for 2024.

Fast Gets Faster With Intel Granulate Autonomous Tuning

The introduction of Photon and Project Gluten represents a significant leap towards the performant future of Spark SQL. Photon, with its compatibility with Apache Spark APIs and optimizations for all data use cases and workloads, and Gluten, with its ability to offload Spark SQL queries to native engines, contribute to achieving higher performance gains and resource utilization.

Add to that the unique orchestration and runtime optimization abilities that Intel Granulate provides for Databricks workloads, including those with Photon, and the benefits for Data Engineers grow significantly. Granulate continuously and securely optimizes large-scale Databricks workloads to lower DBUs, improve job completion time, and cut cloud infrastructure costs with no code changes required.

Intel Granulate optimizes Databricks workloads with:

  • Runtime optimization to reduce CPU usage and increase throughput, for faster job completion and more efficient data processing.
  • Dynamic capacity management to cut costs, streamline governance, and optimize workloads through node and DBU reduction.
  • Seamless scaling, whether you use fixed clusters or Databricks autoscaling.

As the processing of petabyte-scale datasets continues to be a challenge, these advancements will play a crucial role in optimizing data processing and analytics workloads.


Intel Granulate empowers enterprises and DNBs with real-time, continuous workload optimization and orchestration, leading to reduced cloud costs.