Tales from the Fabric — Chapter 4

Breaking down the External Table walls

Riccardo Perico
Rik in a Data Journey
3 min read · Apr 6, 2024


Prelude

In this blog post series we're telling tales from a POC whose aim is to understand whether a migration from Synapse to Fabric is feasible, what the impact could be in terms of effort and money, and which pitfalls to avoid and benefits to gain.

The following index will be updated any time a new chapter is released.

External Table heritage

With the rise of data lakes, it's becoming more and more common to interact with data stored in these kinds of structures (ADLS, S3 buckets…).

Microsoft added to SQL Server a feature that makes it possible to query external data sources, such as files in lakes, directly from a SQL Server instance; this feature is called External Table.

The SQL Server instance hosts only the External Table's metadata, while the data is physically stored in another location. The following script, for example, creates an external table called "Region" with 3 fields, but the data feeding this table is located in an S3 bucket in Parquet format.

-- Register the S3-compatible endpoint as an external data source;
-- this assumes a database scoped credential named s3_dc already exists
CREATE EXTERNAL DATA SOURCE s3_ds
WITH
(
    LOCATION = 's3://<ip_address>:<port>/',
    CREDENTIAL = s3_dc
);
GO

-- Declare the format of the underlying files
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- Only this table's metadata lives in SQL Server;
-- the data stays in the /region/ folder of the bucket
CREATE EXTERNAL TABLE Region
(
    r_regionkey BIGINT,
    r_name CHAR(25),
    r_comment VARCHAR(152)
)
WITH
(
    LOCATION = '/region/',
    DATA_SOURCE = s3_ds,
    FILE_FORMAT = ParquetFileFormat
);
GO

Going deep into External Tables is out of the scope of this article; for any further details, please refer to the official docs.

This feature is also available in PDW and Synapse Analytics (both Serverless SQL Pool and Dedicated SQL Pool), and we used it extensively in our customer projects to seamlessly connect to data hosted on ADLS, loading it into the Dedicated Pool for further cleansing and enrichment.
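For instance, a common pattern in Dedicated SQL Pool was a CTAS statement materializing the external table into an internal, distributed table for downstream processing. The sketch below assumes the Region external table defined above; the staging table name and distribution choice are illustrative.

-- Materialize the external table into an internal, distributed table
-- for further cleansing and enrichment (illustrative names)
CREATE TABLE dbo.Region_Staging
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT r_regionkey, r_name, r_comment
FROM dbo.Region;
GO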

The rise of "The Lake"

With the advent of Fabric your data should "flow" through (though not necessarily reside in) OneLake, and you can achieve this in multiple ways:

  • Instructing data producers to generate data in OneLake directly in Delta format
  • Instructing data producers to generate data in OneLake and then manually converting it to Delta format
  • Instructing data producers to generate data in OneLake and then building pipelines to load it into Delta format
  • Shortcutting Delta tables residing somewhere else (you can use the API, currently in preview)

See chapter three for more details.

The natural replacement for the Dedicated SQL Pool in Fabric is the Warehouse workload, which combines the best of both Synapse SQL pools. The Warehouse workload is indeed serverless and "always on", yet still gives you the possibility to create databases and load data into them.

From a Warehouse you can read Delta Tables loaded in a Lakehouse using three-part naming in the FROM clause.

-- Query a Delta table loaded into the Lakehouse
SELECT *
FROM MyLakehouseName.dbo.MyDeltaTableName

-- Query a shortcut exposed by the Lakehouse
SELECT *
FROM MyLakehouseName.dbo.MyShortcut

In this way the Lakehouse completely abstracts the underlying storage, and you no longer need to create External Tables in the relational engine to interact with the data lake. Interaction with that layer happens seamlessly, as if it were just another database on the same SQL Server instance.
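For example, nothing stops you from joining a Warehouse table with a Lakehouse Delta table in a single statement, just like a cross-database query on a single instance. In this sketch, MyWarehouseName, SalesOrders and the join keys are hypothetical names.

-- Hypothetical cross-database join between a Warehouse table
-- and a Delta table exposed by the Lakehouse
SELECT o.OrderId, o.Amount, r.r_name
FROM MyWarehouseName.dbo.SalesOrders AS o
JOIN MyLakehouseName.dbo.MyDeltaTableName AS r
    ON o.RegionKey = r.r_regionkey;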

Conclusions

In our use case we could eliminate the need for a standalone ADLS instance by instructing data producers to load data directly into OneLake in CSV format, or in Delta where suitable. CSVs are loaded into the Warehouse in Delta format using data pipelines, while shortcuts are used to bridge already-prepared Delta Tables and the Warehouse.
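In our POC these loads run through data pipelines, but as a minimal T-SQL sketch of the equivalent ingestion, the Warehouse also supports COPY INTO for CSV files; the storage path and target table below are hypothetical.

-- Hypothetical example: ingest CSVs landed in storage into a Warehouse
-- table, which Fabric persists in Delta format under the hood
COPY INTO dbo.Region
FROM 'https://<account>.blob.core.windows.net/<container>/region/*.csv'
WITH
(
    FILE_TYPE = 'CSV',
    FIRSTROW = 2 -- skip the header row
);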

This approach removes the dependency on External Table technology, takes a resource off the pitch (a standalone ADLS is no longer needed) and strengthens data governance, centralizing raw data and prepared data in one single place: OneLake.
