The data lakehouse is a new data architecture paradigm that combines the benefits of the data warehouse (transaction support, schema enforcement, BI support) and the data lake (storage decoupled from compute, diverse data types, diverse workloads, streaming), and provides additional benefits such as direct queries on raw data without ETL/ELT. Here are some good articles on the data lakehouse:
What is a Lakehouse? - The Databricks Blog
Data LakeHouse — Paradigm of Decade
Both Azure and AWS now talk about the data lakehouse. Azure has Azure Data Lake, Azure Databricks, and the former Azure SQL DW, but Microsoft aims to provide a more integrated experience with Azure Synapse Analytics.
Build a Lake House Architecture on AWS | Amazon Web Services
In this post, I am going to explore the data lakehouse with Azure Synapse Analytics, focusing on:
- Serverless SQL
- Synapse pipelines
- Spark pool
Create an Azure Synapse workspace
Quickstart: create a Synapse workspace - Azure Synapse Analytics
- In the Azure portal, search for Azure Synapse. In the search results, under Services, select Azure Synapse Analytics.
- Select Add to create a new workspace.
- In the Basics tab, give the workspace a unique name, then create a new ADLS Gen2 account: under Select Data Lake Storage Gen2, click Create New and choose a name for the account, then click File System and name it users.
After the resource is provisioned, go to the Synapse workspace; there will be a URL (something like web.azuresynapse.net) for you to access the workspace. Click it (or Open Synapse Studio) and authenticate. Synapse Studio is organized into hubs:
Home: the home screen
Data: manage databases, datasets, etc.
Develop: create SQL scripts, notebooks, and data flows
Integrate: manage pipelines and access the copy data tool
Monitor: pipeline runs, SQL requests, Spark applications
Manage: where you manage analytics pools (SQL, Spark), external connections, access control, etc.
Code-free ETL with Azure Synapse Analytics
Load data into Azure Synapse Analytics - Azure Data Factory
Let’s copy data from an Azure Open Dataset to our data lake storage.
- Set up a new linked service by going to the Manage hub, clicking on Linked services, then clicking on +New. Type Azure Blob Storage into the search box and select the service.
- Name your linked service, change “Authentication method” to “SAS URI”, set “SAS URL” to https://azureopendatastorage.blob.core.windows.net/, and for “SAS token”, input any character, e.g. “aa”. Click “Test connection”; you should see “Connection successful”. Click Create.
- Create a new pipeline in the Integrate hub and add a Copy data activity. Point the source dataset to the Azure Open Dataset via the linked service set up in the previous step. Select Azure Blob Storage for the data source and Parquet as the format. Set the data source properties as shown in the screenshot.
- Point the data sink to a Data Lake Storage account. Select Azure Data Lake Storage for the data sink and Parquet as the format. Fill out the data sink properties similar to those shown in the screenshot.
- Select “Wildcard file path”
- Publish the pipeline and trigger it now.
- You can go to Monitor | Pipeline runs to check the pipeline’s progress.
- After the pipeline finishes, you can go to Data hub | Linked | your storage account and query the copied files with serverless SQL, as sketched below.
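For example, a minimal serverless SQL query over the copied Parquet files might look like the following; the storage account and folder names here are placeholders for whatever you chose as the copy activity’s sink:
-- Placeholders: substitute the account, container, and folder used as the sink.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<adls account>.dfs.core.windows.net/users/yellow/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;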
I regard serverless SQL as Azure’s counterpart to Amazon Athena: query as a service. However, so far, creating an external table with partitioning is neither straightforward nor intuitive. Here is the supported CREATE EXTERNAL TABLE over a partitioned folder layout:
CREATE EXTERNAL TABLE Taxi (
    vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2
    -- remaining columns omitted here
) WITH (
    LOCATION = 'yellow/puYear=*/puMonth=*/*.parquet',
    DATA_SOURCE = nyctlc,
    FILE_FORMAT = ParquetFormat
);
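Note that this statement assumes the external data source and file format already exist. A minimal sketch of those prerequisites (the names match the example above; the location URL is a placeholder):
-- Prerequisites for the external table above; the URL is a placeholder.
CREATE EXTERNAL DATA SOURCE nyctlc
WITH (LOCATION = 'https://<adls account>.dfs.core.windows.net/nyctlc');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);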
With an external table defined this way, you cannot leverage partition elimination. Partitioned views do support it, but the syntax just feels non-intuitive.
CREATE VIEW TaxiView
AS SELECT *, nyc.filepath(1) AS [year], nyc.filepath(2) AS [month]
FROM OPENROWSET(
        BULK 'yellow/puYear=*/puMonth=*/*.parquet',
        DATA_SOURCE = 'sqlondemanddemo',
        FORMAT = 'PARQUET'
    ) AS nyc
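Once the view exists, filtering on the filepath-derived columns is what enables partition elimination. For example (the year and month values are arbitrary here):
-- Only folders matching puYear=2019/puMonth=1 are scanned.
SELECT COUNT(*)
FROM TaxiView
WHERE [year] = '2019' AND [month] = '1';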
Compared to the HiveQL supported by Amazon Athena, the following feels more natural (the column list and location are placeholders here):
CREATE EXTERNAL TABLE users (
    name string  -- hypothetical column; the original column list is elided
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://<bucket>/users/';  -- placeholder location
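One Athena caveat worth noting: partitions are not discovered automatically after the table is created; you load them explicitly, for example:
-- Discover partitions laid out as .../id=<value>/ under the table location.
MSCK REPAIR TABLE users;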
Create and use external tables in Synapse SQL pool - Azure Synapse Analytics
Create and use views in serverless SQL pool - Azure Synapse Analytics
The integration with Delta Lake is a big advantage. Delta Lake is an open-source storage layer that brings ACID capabilities to the data lake.
On transactional data lake
Query Delta Lake format using serverless SQL pool (preview) - Azure Synapse Analytics
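A minimal sketch of such a Delta Lake query, based on the linked article (the account and folder path are placeholders; the folder must contain a Delta table, i.e. a _delta_log directory):
-- FORMAT = 'DELTA' reads the table through its transaction log, not as raw Parquet files.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<adls account>.dfs.core.windows.net/users/delta/taxi/',
    FORMAT = 'DELTA'
) AS rows;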
Analyze with a Spark notebook
- Create a Spark pool: give it a name; choose a node size (small, medium, large, xlarge, or xxlarge); choose “auto scale” and the number of nodes; select “automatic pausing” and the Spark version.
Quickstart: Create a serverless Apache Spark pool using Synapse Studio - Azure Synapse Analytics
- Select the data, then select “New notebook” and “Load to DataFrame”.
- Synapse generates starter code (manually remove the path after yellow/ to access all the data); select the Spark pool and execute the code.
The open dataset includes 1,571,671,152 records.
Direct access to the Azure open dataset from a notebook
.NET for Apache Spark caters to C# developers. The main example is below:
dotnet/spark on GitHub: .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
var df = spark.Read().Parquet("abfss://nyctlc@<adls account>.dfs.core.windows.net/yellow/puYear=2001/puMonth=1/");
The output is pretty raw:
+--------+--------+------+------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+-----------+
| 2|2001-01-01 22:55:43|2001-01-02 00:05:43| 1| 4.67| 234| 162| null| null| null| null| 1| N| 2| 31.5| 0.0| 0.5| 0.3| 0.0| 0.0| 32.3|
| 2|2001-01-06 06:03:30|2001-01-06 14:08:58| 1| 0.0| 193| 193| null| null| null| null| 1| N| 2| 2.5| 0.0| 0.5| 0.3| 0.0| 0.0| 3.3|
var newDF = df.SelectExpr("replace(left(tpepPickupDateTime, 7), '-', '_') AS yearMonth", "tipAmount/totalAmount AS tipPct");
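If you register the DataFrame as a temporary view first, e.g. df.CreateOrReplaceTempView("taxi") (a view name chosen here for illustration), the same projection can also be written in Spark SQL:
-- Assumes the notebook ran df.CreateOrReplaceTempView("taxi") beforehand.
SELECT replace(left(tpepPickupDateTime, 7), '-', '_') AS yearMonth,
       tipAmount / totalAmount AS tipPct
FROM taxi;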
Spark UI monitoring
Stop the session when you are done (to the left of the settings button).
This article is based on the Azure Synapse Analytics toolkit provided by Microsoft.
Azure Synapse Analytics Toolkit