PolyBase in SQL Server — A Dark Horse ETL Solution?

Rohit Bathija
Analytics Vidhya
Published in
4 min readApr 1, 2020

As of 2020, there are thousands of organizations and Microsoft shops running on SQL Server, using a variety of ETL tools like SSIS (SQL Server Integration Services) and Azure Data Factory for Extracting, Transforming & Loading data to and from various disparate data sources. Here, we will look at how PolyBase can be a one stop shop for all your ETL needs, as it often goes under the radar.

What is PolyBase?

PolyBase allows SQL Server instances to easily access external data sources, like a linked server or Db link on steroids. This feature was first introduces in SQL Server 2016 which allowed access to Hadoop/HDFS and Azure Blob Storage.

With the release of SQL Server 2019 — we can also access external data stored in SQL Server, Oracle, Teradata and MongoDB in addition to Hadoop and Azure Blob Storage. It also extends to products under Microsoft Data Platform namely PDW (Analytics Platform) and Azure Synapse Analytics (previously called Azure SQL Data Warehouse).

Hold up! Isn’t PolyBase the same as a linked server?

Not quite, while linked servers have been around forever — PolyBase is a more fleshed out feature to enable analytics by fetching data from external sources. A few differences between PolyBase and Linked servers are:

PolyBase is Database scoped wherein you are effectively querying data like you would from any other table within the Db and can use scale out features effectively. On the other hand Linked servers are instance scoped where the host Db effectively makes a single-threaded call to an external Db. (Side note: Oracle Db link actually does a better job here by supporting multi-threaded calls)

PolyBase used ODBC connectivity while Linked servers in SQL Server use OLEDB. Microsoft has a full fledged FAQ here.

Why use SSIS and Azure Data Factory instead of PolyBase?

While PolyBase does make your SQL Server a one-stop shop for accessing external data sources, there is a still a need to use full fledged data integration tools available in the Microsoft ecosystem i.e. Azure Data Factory and SSIS:

  • Latency and performance issues — Both SSIS and Azure Data Factory manage lrage workloads very well and allow for fine grained control to improve avoid latency and performance roadblocks
  • Fancy drag and drop UI — Most of the groundwork and configuration for using PolyBase is done by SQL queries, while both Azure Data Factory and SSIS offer intuitive interfaces for building powerful ETL pipelines

Its fair to say PolyBase isn’t meant to fully replace your ETL pipelines.

So, when and how do we actually use PolyBase?

In my experience using all of these tools, PolyBase is more suited for an ODS system and staging, QA or test environments where rapid prototyping and exchange of data is needed between multiple data sources OR for importing a fixed amount of data for analytical purposes.

Another strong use case is exporting old data from your Db or Data warehouse for archival to say Azure Blob Storage.

Installing and using PolyBase

Feature selection while installing SQL Server

Select the check box for PolyBase during installation of SQL Server and most importantly, we need to enable the feature by running the following query:

exec sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE

You also need to restart the SQL Server engine from services.msc

On Linux (Ububtu):

Install PolyBase and restart SQL Server:

sudo apt-get remove mssql-server-polybase
sudo apt-get update
sudo apt-get install mssql-server-polybase
sudo systemctl restart mssql-server

Adding external data sources and running queries

You can add external data sources via SSMS (SQL Server Management Studio) via the wizard and view tables under External tables and Data Sources under External Resources.

Configuring external data sources

Key steps to configure external data sources include:

  1. Creating a master key for the secret & the database scoped credential
  2. Create external data source (Hadoop, Oracle, Azure Blob Storage, etc.)
  3. Create external data format (CSV, parquet, etc.
  4. Create external table
  5. Create statistics and indexes (clustered, non-clustered or column store)

The above steps for adding external data sources and tables, say to import and export data to and from Azure Blob Storage — a sample use case case is embedded in the GitHub repo below:

PolyBase-SQLServer-AzureBlobStorage

Configuring other data sources like Oracle and Hadoop can be found here.

As we have seen, PolyBase can and should be used in tandem with traditional ETL tools to enjoy the best of both worlds.

--

--

Rohit Bathija
Analytics Vidhya

Technology & Data Engineering Consultant. Avid traveler.