Hello World to Data Lakehouse with Azure Synapse Analytics

Xin Cheng
Jun 18 · 7 min read

Data Lakehouse is a new data architecture paradigm that combines the benefits of the data warehouse (transaction support, schema enforcement, BI support) and the data lake (storage decoupled from compute, diverse data types, diverse workloads, streaming), and adds benefits of its own (direct query on raw data without ETL/ELT). Several good articles discuss the data lakehouse concept in depth.

Azure and AWS both talk about the data lakehouse now. Azure already has Azure Data Lake, Azure Databricks, and the former Azure SQL DW, but Microsoft aims to provide a more integrated experience with Azure Synapse Analytics.

In this post, I am going to explore the data lakehouse with Azure Synapse Analytics, focusing on

  • Serverless SQL
  • Synapse pipelines
  • Spark pool

Create Azure Synapse workspace

  • In the Azure portal, search for Azure Synapse. In the search results, under Services, select Azure Synapse Analytics.
  • Select Add to create a new workspace.
  • In the Basics tab, give the workspace a unique name. In the same tab, create a new ADLS Gen2 account: under Select Data Lake Storage Gen2, click Create New and choose a name for the account, then click File System and name it users.

After the resource is provisioned, go to the Synapse workspace overview, where you will find a URL to access the workspace (something like web.azuresynapse.net). Click it (or select Open Synapse Studio) and authenticate.
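The portal steps above can also be scripted. Here is a rough Azure CLI equivalent (a sketch, not what I ran; all names, the password, and the region are placeholders):

```shell
# Create a Synapse workspace backed by a new ADLS Gen2 account's "users" file system
az synapse workspace create \
  --name <workspace-name> \
  --resource-group <resource-group> \
  --storage-account <adls-gen2-account> \
  --file-system users \
  --sql-admin-login-user sqladminuser \
  --sql-admin-login-password <strong-password> \
  --location eastus
```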

General UI

Home: home screen

Data: manage databases, datasets, etc.

Develop: create SQL scripts, notebooks, data flows

Integrate: manage pipelines, access the Copy Data tool

Monitor: pipelines, SQL requests, Spark applications

Manage: where you manage analytics pools (SQL, Spark), external connections, access control, etc.

Code-free ETL with Azure Synapse Analytics

Let’s copy from Azure Open Dataset to our Data lake storage.

  • Set up a new linked service by going to the Manage hub, clicking on Linked services, then clicking on +New. Type Azure Blob Storage into the search box and select the service.
  • Name your linked service, change “Authentication method” to “SAS URI”, set “SAS URL” to https://azureopendatastorage.blob.core.windows.net/, and for “SAS token” input any character, e.g. “aa”. Click “Test connection”; you should see “Connection successful”. Click Create.
  • Create a new pipeline in the Integrate hub and add a Copy data activity. Point the source dataset to the Azure Open Dataset linked service set up in the previous step. Select Azure Blob Storage for the data source and Parquet as the format. Set the data source properties as shown in the screenshot.
  • Point the data sink to a Data Lake Storage account. Select Azure Data Lake Storage for the data sink and Parquet as the format. Fill out the data sink properties similar to those shown in the screenshot.
  • Select “Wildcard file path”
  • Publish the pipeline and trigger it now
  • You can go to Monitor | Pipeline runs to check the pipeline’s progress
  • After the pipeline finishes, you can go to Data hub | Linked | your storage account and query the data with serverless SQL
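Once the data lands in the lake, a typical serverless SQL query looks like the following (a sketch; the account name and path are placeholders for wherever your pipeline wrote the Parquet files):

```sql
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<adls account>.dfs.core.windows.net/users/nyctlc/yellow/*.parquet',
    FORMAT = 'PARQUET'
) AS rows
```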

I regard serverless SQL as Azure’s counterpart to Amazon Athena: query as a service. However, so far, creating an external table with partitioning does not seem straightforward or intuitive. Here is the supported “create external table” over a partitioned folder.

CREATE EXTERNAL TABLE Taxi (
vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count INT,
trip_distance FLOAT,
fare_amount FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
total_amount FLOAT
) WITH (
LOCATION = 'yellow/puYear=*/puMonth=*/*.parquet',
DATA_SOURCE = nyctlc,
FILE_FORMAT = ParquetFormat
)

With external tables you cannot leverage partition elimination. Partitioned views can support it, but the syntax just feels non-intuitive.

CREATE VIEW TaxiView
AS SELECT *, nyc.filepath(1) AS [year], nyc.filepath(2) AS [month]
FROM OPENROWSET(
BULK 'parquet/taxi/year=*/month=*/*.parquet',
DATA_SOURCE = 'sqlondemanddemo',
FORMAT = 'PARQUET'
) AS nyc
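With such a view, filters on the exposed folder columns let serverless SQL prune partitions. A hedged sketch (TaxiView is a hypothetical name for such a view):

```sql
-- Only files under year=2017/month=6 should be scanned
SELECT COUNT(*) AS trips
FROM TaxiView
WHERE [year] = '2017' AND [month] = '6'
```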

Compared to the HiveQL DDL supported by Amazon Athena, the following feels more natural.

CREATE EXTERNAL TABLE users (
first string,
last string,
username string
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://bucket/folder/'

The integration with Delta Lake is a big advantage. Delta Lake is an open-source storage layer that brings ACID capability to the data lake.
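As a minimal sketch of that capability in a Synapse notebook (paths and the account name are placeholders; `spark` is provided by the notebook session):

```python
# Convert the copied Parquet data to a Delta table and read it back with ACID guarantees
df = spark.read.parquet("abfss://users@<adls account>.dfs.core.windows.net/nyctlc/yellow/")
df.write.format("delta").mode("overwrite") \
    .save("abfss://users@<adls account>.dfs.core.windows.net/delta/yellow/")
delta_df = spark.read.format("delta") \
    .load("abfss://users@<adls account>.dfs.core.windows.net/delta/yellow/")
```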

Analyze with Spark notebook

  • Create a Spark pool: give it a name; choose a node size (Small, Medium, Large, XLarge, XXLarge); enable “Autoscale” and set the number of nodes; then select “Automatic pausing” and the Spark version
  • Select the data, then select “New notebook” | “Load to DataFrame”
  • Synapse generates starter code (manually remove the path after yellow/ to access all data); select the Spark pool and execute the code
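The generated starter code looks roughly like this (an approximation; the account and container names come from your linked storage, and `spark` is provided by the notebook session):

```python
# Read all yellow taxi Parquet files, across every puYear/puMonth partition
df = spark.read.load(
    'abfss://users@<adls account>.dfs.core.windows.net/nyctlc/yellow/',
    format='parquet')
print(df.count())
```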

The open dataset includes 1,571,671,152 records.

Direct access to Azure open dataset from Notebook

.NET Spark

.NET for Apache Spark caters to C# developers. The main example is below:

var df = spark.Read().Parquet("abfss://nyctlc@<adls account>.dfs.core.windows.net/yellow/puYear=2001/puMonth=1/");

Output is pretty raw:

+--------+--------+------+------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+
|       2|2001-01-01 22:55:43|2001-01-02 00:05:43|             1|        4.67|         234|         162|    null|    null|  null|  null|         1|              N|          2|      31.5|  0.0|   0.5|                 0.3|      0.0|        0.0|       32.3|
|       2|2001-01-06 06:03:30|2001-01-06 14:08:58|             1|         0.0|         193|         193|    null|    null|  null|  null|         1|              N|          2|       2.5|  0.0|   0.5|                 0.3|      0.0|        0.0|        3.3|


var newDF = df.SelectExpr("replace(left(tpepPickupDateTime, 7), \"-\", \"_\") as yearMonth", "tipAmount/totalAmount as tipPct");
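To make the two SQL expressions concrete, here is the same per-row logic mirrored in plain Python (a hypothetical standalone sketch for illustration, not Synapse code):

```python
def year_month(pickup_ts: str) -> str:
    # left(ts, 7) keeps "YYYY-MM"; replace turns it into "YYYY_MM"
    return pickup_ts[:7].replace("-", "_")

def tip_pct(tip_amount: float, total_amount: float) -> float:
    # tipAmount/totalAmount as tipPct
    return tip_amount / total_amount

print(year_month("2001-01-01 22:55:43"))  # 2001_01
print(round(tip_pct(2.0, 32.3), 4))
```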

Spark UI monitoring

Stop the session when you are done (the button to the left of the settings button).

This article is based on the Azure Synapse Analytics toolkit provided by Microsoft.

Reference architecture
