Azure Synapse, Databricks, and Azure Data Explorer (Kusto)

Gor Hayrapetyan
7 min read · Sep 24, 2021


Azure Synapse

Azure Synapse started as a cloud data warehousing solution but has recently evolved into a multipurpose data processing platform. Not only does it store data inside dedicated SQL pools, but with Synapse Studio we can also do ad-hoc analysis and data processing, create and schedule pipelines, and even create Power BI reports. In short, Synapse Studio provides a unified environment for end-to-end data analysis, from raw data on ADLS to Power BI dashboards. Spark notebooks are first-class citizens alongside SQL. With Spark pools and notebooks, we can use a Spark cluster while paying only for the time we use it. Spark notebooks also support .NET for Apache Spark, so you can reuse C# code. These notebooks and other artifacts can be versioned with Git: Synapse Studio provides Git integration like ADF does, and whatever you do in ADF can also be done in Synapse Studio.

Dedicated and Serverless pools in Azure Synapse

With decoupled storage and compute, Synapse SQL lets you size compute power independently of your storage needs. For a serverless SQL pool, scaling is done automatically, while for a dedicated SQL pool you can:

  • Grow or shrink compute power, within a dedicated SQL pool, without moving data.
  • Pause compute capacity while leaving data intact, so you only pay for storage.
  • Resume compute capacity during operational hours.
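To make the pause and resume benefit concrete, here is a back-of-the-envelope sketch. The prices and the `PoolCostModel` helper are made-up placeholders for illustration, not Azure list prices or an Azure API; the only point is that compute is billed for running hours while storage is billed all month.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolCostModel:
    """Hypothetical hourly prices (placeholders, not Azure list prices)."""
    compute_per_hour: float      # charged only while the pool is running
    storage_per_tb_hour: float   # charged whether the pool runs or not


def monthly_cost(model: PoolCostModel, running_hours: float,
                 stored_tb: float, hours_in_month: float = 730.0) -> float:
    """Compute is billed for running hours, storage for the whole month."""
    if not 0 <= running_hours <= hours_in_month:
        raise ValueError("running_hours must be between 0 and hours_in_month")
    compute = model.compute_per_hour * running_hours
    storage = model.storage_per_tb_hour * stored_tb * hours_in_month
    return compute + storage


model = PoolCostModel(compute_per_hour=10.0, storage_per_tb_hour=0.1)

always_on = monthly_cost(model, running_hours=730, stored_tb=5)
office_hours = monthly_cost(model, running_hours=22 * 10, stored_tb=5)  # 22 days x 10 h
paused = monthly_cost(model, running_hours=0, stored_tb=5)              # storage only

print(f"always on:    {always_on:.2f}")
print(f"office hours: {office_hours:.2f}")
print(f"paused:       {paused:.2f}")
```

With these placeholder numbers, running the pool only during office hours costs roughly a third of keeping it always on, and a fully paused pool costs only the storage component.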

A serverless SQL pool allows you to query your data lake files, while a dedicated SQL pool allows you to query and ingest data from your data lake files.

While a dedicated SQL pool serves as a more classical data warehouse, a serverless SQL pool lets you do ad-hoc analysis and processing on top of a logical data warehouse.

Every Azure Synapse Analytics workspace comes with serverless SQL pool endpoints that you can use to query data in the Azure Data Lake (Parquet, Delta Lake, delimited text formats) and Cosmos DB.
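As a rough illustration of what querying the data lake from a serverless SQL pool looks like, the sketch below builds a T-SQL statement that uses OPENROWSET to read Parquet files in place. The helper name `openrowset_parquet_query` and the storage account, container, and path values are assumptions made up for this example; only the query text is produced here, and sending it to the workspace's serverless endpoint (for example through an ODBC driver) is left out.

```python
def openrowset_parquet_query(storage_account: str, container: str,
                             path: str, top: int = 10) -> str:
    """Build a T-SQL query that reads Parquet files in place via OPENROWSET.

    All arguments are hypothetical example values supplied by the caller.
    """
    if top <= 0:
        raise ValueError("top must be positive")
    url = (f"https://{storage_account}.dfs.core.windows.net/"
           f"{container}/{path.lstrip('/')}")
    escaped = url.replace("'", "''")  # escape single quotes for the T-SQL literal
    return (
        f"SELECT TOP {top} *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{escaped}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS [result];"
    )


# Example with made-up names: preview a few rows from a folder of Parquet files.
query = openrowset_parquet_query("mydatalake", "raw", "events/*.parquet", top=5)
print(query)
```

Because the files are read where they already live, there is no load step: the same pattern works for exploring newly landed data before deciding whether it belongs in a dedicated pool.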

Please refer to the dedicated and serverless SQL pools feature comparison page for more details.

In contrast to SQL pools, an Apache Spark pool is more convenient for data engineering and machine learning use cases. You can use Spark's built-in machine learning library, MLlib. Spark can also query data stored in ADLS as Parquet, Avro, ORC, delimited text, Delta Lake, and many more formats, and it can connect to other databases such as ADX. It is easy to create data pipelines using Spark notebooks together with the orchestration functionality inside Synapse Studio. Azure Synapse allows sharing databases and tables between its serverless Apache Spark pools and serverless SQL pool. At the time of writing, Spark pools support Apache Spark runtime 3.1 and are compatible with Delta Lake.
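A minimal sketch of the notebook side, assuming a PySpark notebook where a `spark` session already exists. The helper names, the account and container values, and the table name are hypothetical; `spark.read.parquet` and `DataFrame.write ... saveAsTable` are standard PySpark calls. Saving a Parquet-backed table is one way the result could become visible to the serverless SQL pool through the shared metadata mentioned above.

```python
def abfss_uri(account: str, container: str, path: str) -> str:
    """Build an ADLS Gen2 URI of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def load_parquet(spark, account: str, container: str, path: str):
    """Read Parquet files from ADLS into a Spark DataFrame.

    `spark` is expected to be the SparkSession that a notebook provides.
    """
    return spark.read.parquet(abfss_uri(account, container, path))


def save_as_shared_table(df, table_name: str) -> None:
    """Persist a DataFrame as a Parquet-backed Spark table (hypothetical table name)."""
    df.write.mode("overwrite").format("parquet").saveAsTable(table_name)


# In a notebook this might be used as follows (names are made up):
#   events = load_parquet(spark, "mydatalake", "raw", "events/2021")
#   save_as_shared_table(events, "analytics.events")
print(abfss_uri("mydatalake", "raw", "events/2021"))
```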

Finally, Synapse Link allows you to run near real-time analytics over operational data in Azure Cosmos DB. Azure Synapse Analytics currently supports Synapse Link with Synapse Apache Spark and serverless SQL pool.

When choosing a data store, it is important to keep data lock-in in mind. Synapse provides rich functionality and separates compute from storage, with support for open-source data formats. Thus, it does not lock in your data, unless you mainly keep it inside dedicated SQL pools.

In terms of streaming, data can be delivered either to ADLS or to dedicated SQL pools via Azure Stream Analytics.

Azure Databricks

Azure Databricks is a Spark-based analytics platform. It offers three environments, depending on the use case:

  • Databricks SQL (Preview)
  • Databricks Data Science & Engineering (known as “Databricks Workspace”)
  • Databricks Machine Learning (Preview)

Databricks Data Science & Engineering provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, you use Azure Databricks to read data from multiple data sources and turn it into insights using Spark.

Azure Databricks workspace

There are several differences between the Databricks workspace and a Synapse Spark pool. Databricks supports the classic set of languages for the Spark API: Python, Scala, Java, R, and SQL, while Synapse supports Python, Scala, SQL, and C#. Another important difference is the runtime. Synapse uses Azure Synapse for Spark, which is based on Apache Spark but optimized by the Synapse team. Databricks uses its own custom runtime, also based on Apache Spark but optimized by Databricks (founded by the creators of Spark). There are benchmarks comparing each one to vanilla Apache Spark, but I am not aware of a direct Synapse vs. Databricks runtime comparison.

Also, you can stream data directly into Azure Databricks from Event Hubs, thanks to the Spark connector for Event Hubs provided by Azure.

The Databricks workspace is great for working with Delta Lake. Delta Lake is a modern approach to building data lakes, along with Apache Hudi and Apache Iceberg. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
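To give some intuition for how a table format can offer transactions on top of plain files, here is a toy sketch loosely modeled on the idea of an ordered commit log with "add" and "remove" file actions. It is not Delta Lake's implementation or API: the `commit` and `snapshot` helpers are invented for this example, the log lives in a local temporary directory, and exclusive file creation stands in for the atomic "create if absent" step that lets only one writer claim a given version.

```python
import json
import os
import tempfile


def commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to write commit number `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail when the file exists, so only one
        # writer can claim a given version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def snapshot(log_dir: str) -> set:
    """Replay all commits in version order to get the set of live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):  # zero-padded names sort by version
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
# A rewrite swaps one data file for another in a single commit.
second = commit(log_dir, 1, [{"remove": {"path": "part-0.parquet"}},
                             {"add": {"path": "part-1.parquet"}}])
# A concurrent writer that also targets version 1 loses and must retry.
lost_race = commit(log_dir, 1, [{"add": {"path": "part-2.parquet"}}])
live_files = snapshot(log_dir)
print(first, second, lost_race, live_files)
```

Readers that replay the log see either the state before a commit or the state after it, never a half-applied rewrite, which is the property that makes batch and streaming readers safe to run against the same table.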

As you already know, Apache Spark comes with machine learning libraries, but the ML lifecycle is not only about model training. As ML matures as an engineering discipline, new tools and practices emerge to support it. Databricks is among the contributors to this effort and focuses heavily on the machine learning use case of its platform. I think this is the reason behind splitting the original Azure Databricks into multiple environments.

Databricks Machine Learning

According to the documentation, Databricks Machine Learning (Preview) is an integrated end-to-end machine learning platform incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.

With Databricks Machine Learning, you can:

Databricks ML also comes with a custom runtime. It automates the creation of ML-optimized clusters and includes popular ML libraries (TensorFlow, PyTorch, Keras, XGBoost, Horovod).

Finally, Databricks SQL allows you to run quick ad-hoc SQL queries on your data lake and build dashboards using fully managed SQL endpoints sized according to query latency and the number of concurrent users.

To sum up, it is hard to say whether Databricks is competing with Azure Synapse or complementing it. However, since most Synapse functionality is new compared to Databricks, it seems that Azure intended to bring a Spark-based analytics platform into the Azure family before building its own alternative. In any case, at the moment most architecture guidelines treat them not as competitors but as complementary tools. However, this is subject to change.

Azure Data Explorer (Kusto)

Azure Data Explorer serves a different role than Azure Databricks and Synapse. Its main purpose is interactive analytics of structured and unstructured data such as logs and telemetry. Data can be ingested from streaming sources such as Event Hubs, and it becomes available for querying immediately after ingestion. ADX can also serve as backing storage for dashboards in tools such as Grafana and Power BI.

ADX is not designed for data processing or modification. It is an append-only database, with limited data deletion functionality that was added at later stages of its development.

ADX has some ML functionality as well. It has embedded ML models for anomaly detection and forecasting on time-series data. Also, you can export ML models to Azure Data Explorer for scoring data.
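As an illustration of the built-in time-series functionality, the sketch below assembles a Kusto (KQL) query that buckets a metric with `make-series` and flags outliers with `series_decompose_anomalies`. The helper name `anomaly_query` and the table and column names are hypothetical; the function only builds the query text and does no input validation, and running it against a cluster (for example with a Kusto client library) is not shown.

```python
def anomaly_query(table: str, timestamp_col: str, value_col: str,
                  lookback: str = "7d", step: str = "1h",
                  threshold: float = 1.5) -> str:
    """Build a KQL query that flags anomalies in a time series.

    Table and column names are caller-supplied, hypothetical examples.
    """
    return (
        f"{table}\n"
        f"| where {timestamp_col} > ago({lookback})\n"
        f"| make-series avg_value = avg({value_col}) default = 0 "
        f"on {timestamp_col} from ago({lookback}) to now() step {step}\n"
        f"| extend (anomalies, score, baseline) = "
        f"series_decompose_anomalies(avg_value, {threshold})"
    )


# Example with made-up names: hourly CPU averages over the last week.
print(anomaly_query("Telemetry", "Timestamp", "CpuPercent"))
```

The whole analysis is expressed as one piped query, which is a large part of why the language feels approachable for interactive exploration.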

From a data lock-in perspective, ADX is the most restrictive of the solutions above, but it is not hard to export data, and integrations such as ADF and the Spark connector are available.

ADX is great for real-time data analytics and exploration. It has a convenient query language that is easy to learn. It outperforms competitors such as Elasticsearch and is widely adopted. ADX is a great option when you have a “write once, read many” use case and need to do interactive analytics. It is complementary to Synapse and Databricks, as the architecture below shows.

Big Data analytics with ADX

Further reading

I have selected a list of architectures provided by Azure involving Synapse, Databricks, and ADX:

Azure also provides a great comparison page for choosing analytical data stores.
