Introduction to Azure Synapse Analytics

Eleonora Fontana · Betacom · Mar 22, 2021

Introduction

Welcome to the first article of a series covering Azure Synapse Analytics and its features. Today we are going to introduce this new Microsoft service and explain its main properties.

First of all, we will go through some history and terminology. Then, in the Architecture section, we will discuss the main properties of Azure Synapse.

History

“Today we are announcing Azure Synapse Analytics, a limitless analytics service that brings together enterprise data warehousing and Big Data analytics.”

In November 2019, Rohan Kumar, Corporate Vice President for Azure Data, presented Azure Synapse Analytics with these words: a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.

It was born as an extension of Azure SQL Data Warehouse (SQL DW), which was first released in 2015 and upgraded to Generation 2 in 2018.

Terminology

Before exploring Azure Synapse Analytics and its features, we need to understand some terminology. If you are already familiar with it, feel free to skip directly to the next section.

Azure Data Lake Storage (ADLS) Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage, Microsoft’s object storage solution for the cloud. An ADLS Gen2 account can be associated with a Synapse workspace to perform cloud-based enterprise analytics in Azure. A workspace also has an associated file system for storing temporary data, lives inside a resource group, and lets you perform analytics with SQL and Apache Spark.
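As a concrete illustration, here is a minimal Python sketch that connects to an ADLS Gen2 account of the kind you would attach to a workspace and lists files in it. The account and container names are hypothetical, and it assumes the azure-identity and azure-storage-file-datalake packages are installed.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to a (hypothetical) ADLS Gen2 account with Azure AD credentials.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# File systems (containers) are the top-level namespace of the lake.
fs = service.get_file_system_client("raw")

# List everything under a folder in the lake.
for path in fs.get_paths(path="sales/2021"):
    print(path.name)
```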

A workspace can contain any number of Linked services, which are much like connection strings: they define the connection information the workspace needs to connect to external resources.

Synapse SQL is the ability to run T-SQL based analytics in a Synapse workspace. It has two consumption models: dedicated and serverless. The first refers to dedicated SQL pools, of which a workspace can contain any number; the second refers to the serverless SQL pool that comes with every workspace. Inside Synapse Studio, you work with SQL pools by creating and running SQL scripts.
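For instance, here is a hedged sketch of an ad hoc query against a workspace’s serverless SQL pool endpoint from Python via pyodbc. The endpoint, credentials and file path are placeholders; serverless SQL can read files in the lake directly with OPENROWSET.

```python
import pyodbc

# Connect to the (hypothetical) serverless SQL endpoint of the workspace.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;Uid=sqladminuser;Pwd=<password>;Encrypt=yes;"
)

# Serverless SQL pools can query files in the lake directly with OPENROWSET.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""
for row in conn.execute(sql):
    print(row)
```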

To use Spark analytics, you can create and use serverless Apache Spark pools in your Synapse workspace. When you start using a Spark pool, the workspace automatically creates a Spark session to manage the resources associated with that session.
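As a minimal illustration, this is what a few exploration steps look like in a Synapse Studio notebook attached to a Spark pool. The `spark` session is created for you when the notebook attaches; the ABFSS path below is hypothetical.

```python
# `spark` is predefined in a Synapse notebook attached to a Spark pool.
df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/sales/2021/"
)

# Typical exploration: schema, row count, a quick aggregate.
df.printSchema()
print(df.count())
df.groupBy("region").sum("amount").show()
```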

Pipelines are how Azure Synapse provides Data Integration, allowing you to move data between services and orchestrate activities. Let’s see some details (a sketch of a pipeline definition follows the list):

  • Pipelines are logical groupings of activities that together perform a task.
  • Activities are actions defined within a pipeline to perform on data, e.g. copying data, running a notebook or a SQL script.
  • Data Flows are a specific kind of activity used to transform data through a no-code experience that runs on Synapse Spark under the covers.
  • A Trigger executes a pipeline. It can be fired manually or automatically (on a schedule, a tumbling window or an event).
  • An Integration dataset is a named view of data belonging to a Linked service; it simply points to or references the data to be used in an activity as input or output.
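To make these pieces concrete, here is a hedged sketch of the JSON shape behind a simple pipeline with one Copy activity, expressed as a Python dict. The pipeline and dataset names are hypothetical; in practice Synapse Studio generates this definition for you as you build the pipeline visually.

```python
# Hypothetical pipeline definition: one Copy activity moving a CSV dataset
# into a SQL pool dataset. The inputs/outputs are integration datasets,
# which in turn reference Linked services.
pipeline = {
    "name": "CopySalesData",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvToSqlPool",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SalesCsvDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "SalesSqlDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "SqlDWSink"},
                },
            }
        ]
    },
}
```

A trigger attached to this pipeline would then decide when the copy runs: manually, on a schedule, on a tumbling window, or on an event.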

Architecture

As already stated, Azure Synapse Analytics is a new analytics service that combines data integration, enterprise data warehousing and Big Data analytics in a single shared workspace.

You might wonder why a Data Warehouse (DW) in the cloud is a good choice. Let’s look at some reasons.

  • There is no need for large capital expenses.
  • No hardware maintenance is required.
  • Thanks to scalability, it is much faster and cheaper to scale a cloud data warehouse than an on-premises system: no new hardware has to be purchased, and scaling can happen automatically as needed (see the sketch after this list).
  • The cost is quite low since storage and computing are paid separately and only when needed.
  • It allows integration with other available services.
  • Massively Parallel Processing, which will be explained in the next article, increases the speed of queries.
  • The time to market is shorter than with other DW solutions, and there is no concept of a “version” or software upgrade.
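Since scaling a dedicated SQL pool is just a T-SQL statement, it can even be scripted. Below is a minimal sketch using pyodbc; the server, login and pool names are placeholders, and the statement must run against the master database rather than the pool itself.

```python
import pyodbc

# Connect to the (hypothetical) dedicated SQL endpoint of the workspace.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=master;Uid=sqladminuser;Pwd=<password>;Encrypt=yes;",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)

# Scale the pool up to 400 cDWUs; scaling down works the same way.
conn.execute("ALTER DATABASE mypool MODIFY (SERVICE_OBJECTIVE = 'DW400c');")
```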

A traditional on-premises DW architecture used to look like the following picture.

Source: Microsoft

As you can see, there were many data sources in different formats, e.g. relational databases, CSV files, Excel files, structured or unstructured data. You then had to extract, transform and load that data through an ETL process that typically ran in memory. This process standardizes, consolidates and cleans the data before finally loading it into the DW, along with metadata describing it. Note that there was also work going on behind the scenes to administer and monitor the database.

Then there was the analytics layer, where you ran queries and did analysis against dedicated servers. An important point to note here is that in this traditional architecture the servers’ compute and storage are tightly coupled together. Finally, you would publish the results as reports to business users, using Power BI or some other reporting tool.

The modern Data Warehouse architecture can be summarized as follows:

Source: What is dedicated SQL pool (formerly SQL DW)? — Azure Synapse Analytics

Again, the first thing to do is to bring in data from a variety of sources. There are a number of services in Azure that can do this; one of them is Data Factory. This service can bring data from different sources and ingest it into storage. For storage, more and more customers are adopting the Data Lake architecture, which offers a convenient way to store data now and decide what to do with it later.

This data can then be explored by data scientists and data engineers: they look at the data and typically prepare and clean it. Since data can come from different sources and can be structured or unstructured, combining it often surfaces a lot of quality issues. Ensuring data quality is therefore critical in a modern data warehousing system, and the majority of people’s time is spent cleaning and preparing data.

Once the data is prepared, we can model it for the business so that the business can actually ask questions. That means creating a structured model of the data, a single source of truth on which the business can rely. Finally, BI, reporting and advanced analytics are built on top of the solution.

Now, one very important thing to note is that in a modern DW solution compute and storage are separate. They are billed separately, and they are on only while in use. For storage you pay only for the amount of data stored, with no cost per storage transaction, whereas compute power is billed in Data Warehousing Units (DWUs), a bundled measure of CPU, memory and IO.

For example, say your DW gets no use during the weekend. You can then pause it completely and save that cost. Note that while compute is paused you are still charged for the storage, which is however very cheap.
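A weekend job could even do this pause and resume programmatically. The following is a hedged sketch using the azure-mgmt-synapse management SDK; the resource group, workspace and pool names are placeholders, and the exact method names may vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Friday night: pause the pool. Compute billing stops; storage is still
# billed, but storage is cheap.
client.sql_pools.begin_pause("my-rg", "myworkspace", "mypool").result()

# Monday morning: resume. The data is exactly where you left it.
client.sql_pools.begin_resume("my-rg", "myworkspace", "mypool").result()
```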

The architecture pattern described above is a very common way to achieve a modern data warehousing solution and that’s why a lot of customers are moving their data warehousing solution to the cloud.

Now let’s see this architecture more in terms of services.

Source: Modern Data Warehouse Architecture — Azure Solution Ideas

As you probably know, there are many services in Azure, and all of them provide great capabilities. If we drill into the modern DW solution, there is a set of services that we can bring together.

  1. The first one is Azure Data Factory, a serverless data integration service. It has been on the market for a number of years and has matured to enable data integration in hybrid scenarios: Data Factory can bring data from on-premises systems, SaaS applications or practically any type of source. It can handle data at massive scale, in huge velocity and volume, and it ships with more than 90 built-in connectors. It also gives you complete control over the flow of data through its lifecycle: creation, ingestion and transformation.
  2. Once you have brought the data in, you put it in your storage solution for preparation and cleaning. This is the place where data comes together, is combined and cleaned, and where all kinds of transformations take place. Azure Data Lake Storage is the cheapest option here, since plain storage involves no real processing.
  3. When data is in the Data Lake, there are also services to explore it. The DW itself has a SQL pool based capability that lets you query data sitting in the Data Lake or in the DW itself. Another way is to use PolyBase to read data directly from the Data Lake and load it into the DW. Databricks is yet another service for exploring the data, and it supports many languages for querying and exploiting the data in the Data Lake: Python, Scala, SQL, etc.
  4. Databricks can either write this data into Azure Synapse Analytics or work with your analysis services. So after bringing data into a big data store and doing some preparation and exploration, you are finally ready to publish the data (see the sketch after this list). As we discussed, a cloud DW separates compute and storage, so you can scale compute up when you have more users and workload and scale it down in less busy periods.
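As an illustration of step 4, here is a hedged sketch (PySpark on Azure Databricks, where a `spark` session is predefined) of publishing a prepared DataFrame into a dedicated SQL pool through the Databricks Synapse connector. All URLs, credentials and table names below are placeholders.

```python
# Read curated data from the lake (hypothetical path).
staged = spark.read.parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/sales/"
)

# Write it into a dedicated SQL pool. The connector stages data through
# tempDir in the lake, then bulk-loads it into the target table.
(staged.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
                   "database=mypool;user=sqladminuser;password=<password>")
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/tmp/")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.Sales")
    .mode("overwrite")
    .save())
```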

The four services we just presented, in addition to being available individually, are also grouped under the Azure Synapse Analytics Service.

The core of Azure Synapse Analytics is the SQL pool (previously Azure SQL Data Warehouse), a massively parallel processing database. We will discuss its architecture in the next article. For now, you should know that there are dedicated and serverless SQL pools. In the first, resources are dedicated to your data and always ready, whether you are using them or not; in the second, resources are spun up only when needed, typically for ad hoc operations.

Conclusion

You should now understand Azure Synapse Analytics architecture and main properties. In the next article we will explore the SQL pool architecture and engine.

See you in two weeks! 👋
