Databricks Photon Engine: Optimizing Data Workflows

Ashwin
4 min read · Mar 26, 2023


Maximizing Efficiency and Improving Decision-Making with Databricks Photon Engine

Databricks Photon Engine is a high-performance query execution engine designed to accelerate complex workloads on Delta Lake tables. It is written in C++, is compatible with Apache Spark APIs, and accelerates SQL and DataFrame workloads, including the data-preparation stages of machine learning.

Photon Engine is an integral part of the Databricks Unified Data Analytics Platform, which includes other tools like Databricks Runtime, Databricks Workspace, and Databricks Delta. It leverages Databricks Delta Lake as its data storage layer and provides advanced optimizations for data processing.

Photon Engine fits into the Databricks ecosystem as a complementary tool that can be used alongside other Databricks Runtime features to improve data processing performance. For example, users can use Photon Engine to accelerate SQL queries on Delta Lake tables while using other Databricks Runtime features like Spark Streaming or Structured Streaming for real-time data processing.

Here’s an example code snippet in Python that shows the kind of SQL-style query on a Delta Lake table that Photon Engine accelerates automatically when it is enabled on the cluster:

from pyspark.sql.functions import sum
from delta.tables import DeltaTable

# Load the Delta Lake table into a DataFrame
df = DeltaTable.forPath(spark, "/path/to/my/table").toDF()

# On a Photon-enabled cluster, this aggregation runs on Photon
# automatically; no Photon-specific API calls are needed
result = df.groupBy("country") \
    .agg(sum("sales").alias("total_sales")) \
    .orderBy("total_sales", ascending=False)

# Display the result
result.show()

This snippet is ordinary PySpark; when Photon is enabled on the cluster, the scan, aggregation, and sort are executed by Photon Engine’s native, vectorized operators.
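To make the semantics concrete without a cluster, the query above is just a group-by sum followed by a descending sort. A plain-Python sketch of the same logic, using hypothetical sample rows (the country and sales values are made up for illustration), looks like:

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the Delta table
rows = [
    {"country": "US", "sales": 100},
    {"country": "DE", "sales": 50},
    {"country": "US", "sales": 75},
]

# Group by country and sum sales
totals = defaultdict(int)
for row in rows:
    totals[row["country"]] += row["sales"]

# Sort by total_sales, descending — mirrors orderBy(..., ascending=False)
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('US', 175), ('DE', 50)]
```

Photon performs the same computation, but over columnar batches with native vectorized operators distributed across the cluster.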

The Databricks Photon Engine addresses some of the limitations of conventional data processing engines such as Hadoop and stock Apache Spark. Below are some essential ways in which Photon Engine differs from these traditional engines:

  1. Specialization: Photon Engine is purpose-built to accelerate SQL and DataFrame workloads, including the data-preparation stages of machine learning. Hadoop and Apache Spark are more general-purpose engines that can manage a broad spectrum of workloads.
  2. Query optimization: Photon Engine employs advanced techniques such as columnar data layout and vectorized execution to accelerate queries. Classic Hadoop MapReduce and Spark’s traditional execution paths, by contrast, process data row by row.
  3. Low-latency processing: Photon Engine targets interactive analytics that demand low latency and high throughput. Hadoop and, historically, Apache Spark were designed primarily for batch processing.
  4. Integration with Delta Lake: Photon Engine is tightly integrated with Databricks Delta Lake, which provides advanced data management features like ACID transactions, versioning, schema enforcement, and more.
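The difference in point 2 can be illustrated with a toy example. In a row-oriented layout each record stores all of its fields together, so an aggregate over a single column still touches every record; in a columnar layout each field lives in its own contiguous array, so the engine reads only the column it needs (the field names below are made up for illustration):

```python
# Row-oriented: each record stores every field together
rows = [
    {"country": "US", "sales": 100, "units": 3},
    {"country": "DE", "sales": 50, "units": 1},
]

# Column-oriented: one contiguous list per field
columns = {
    "country": ["US", "DE"],
    "sales": [100, 50],
    "units": [3, 1],
}

# A SUM(sales) over row storage must visit every record and pick out
# the one field it needs...
row_total = sum(r["sales"] for r in rows)

# ...while columnar storage scans just the one array, which is also
# far friendlier to CPU caches and SIMD instructions
col_total = sum(columns["sales"])

assert row_total == col_total == 150
```

Both layouts produce the same answer; the columnar one simply does less work per query, which is the effect Photon exploits at scale.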

The Databricks Photon Engine is a performance-focused query engine that accelerates analytics and the data-preparation stages of machine learning workloads. Below are some of its fundamental features and capabilities:

  1. Columnar storage: The Photon Engine stores and processes data by columns rather than rows. This can lead to substantial performance benefits for queries and analytics workloads.
  2. Vectorized execution: The Photon Engine processes data in batches instead of row by row. This can deliver noteworthy performance improvements, particularly for complex data processing tasks.
  3. Native execution: The Photon Engine is implemented in C++ and exploits modern CPU features such as SIMD instructions, avoiding JVM overhead for the operators it supports.
  4. Low-latency data processing: The Photon Engine is optimized for low latency and high throughput, which also benefits streaming-style workloads.
  5. Integration with Delta Lake: The Photon Engine is closely intertwined with Delta Lake, providing transactional ACID guarantees, schema enforcement, and other advanced data management capabilities.
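Item 2, vectorized execution, can be sketched in plain Python: instead of handling one value per loop iteration, the engine operates on a whole batch at a time (real engines like Photon apply SIMD instructions to each columnar batch; the batch size below is an illustrative choice):

```python
def rowwise_sum(values):
    # Tuple-at-a-time: one value handled per loop iteration
    total = 0
    for v in values:
        total += v
    return total

def batched_sum(values, batch_size=1024):
    # Vector-at-a-time: operate on a whole batch per iteration;
    # native engines run SIMD instructions over each batch
    total = 0
    for i in range(0, len(values), batch_size):
        total += sum(values[i:i + batch_size])
    return total

data = list(range(10_000))
assert rowwise_sum(data) == batched_sum(data) == 49_995_000
```

Both functions compute the same sum; the batched form amortizes per-row overhead across each batch, which is where the speedup comes from in a real vectorized engine.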

Here is an example code snippet in Python showing how a machine-learning workflow can benefit from the Photon Engine. Note that there is no Photon-specific Python API to call: Photon accelerates the Spark-side data loading and preparation transparently, while model training itself happens in scikit-learn:

import mlflow
import mlflow.sklearn
import databricks.koalas as ks
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load data into a Koalas DataFrame; on a Photon-enabled cluster the
# underlying Spark scan and any heavy transformations run on Photon
df = ks.read_csv("/path/to/data.csv")

# Collect the prepared features to the driver for scikit-learn training
pdf = df.to_pandas()
X = pdf.drop(columns=["target"])
y = pdf["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train a random forest regression model
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Log metrics and model in MLflow
mlflow.log_metric("r2_score", r2_score(y_test, y_pred))
mlflow.sklearn.log_model(model, "model")

In this example, Photon accelerates the Spark-side loading and preparation of the data, while the random forest itself is trained by scikit-learn. Faster, columnar, vectorized data preparation shortens the end-to-end workflow; it does not by itself change model accuracy.
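For intuition on the r2_score logged above: it measures the fraction of the target’s variance explained by the predictions, with 1.0 meaning a perfect fit and 0.0 no better than predicting the mean. A stdlib-only sketch of the same formula (scikit-learn’s r2_score computes this plus edge-case handling):

```python
def r2(y_true, y_pred):
    # R² = 1 - (residual sum of squares / total sum of squares)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # 1.0 — perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # 0.0 — no better than the mean
```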
