Meet Snowflake on Azure

Andrei Babich
Rock Your Data
Published in
10 min readMay 30, 2019
Snowflake on Azure

Organizations can now experience Snowflake’s cloud-built architecture, instant elasticity, secure data sharing, and per-second-pricing on Microsoft Azure.

To understand why customers choose Cloud for their data is to understand that their focus is on DATA and not on the infrastructure. Customers are focussed on digital transformation of their businesses and they don’t want to spend time on managing their infrastructure. And so having a managed service in the Cloud is one of the key value propositions that customers are looking for. In other words, not just being in the Cloud but having it actively managed.

Moreover, customers are looking to scale their storage and compute elastically by taking advantage of the cloud capability to add storage as needed instead of having to go through the procurement of new storage area arrays at time when they need them. And to be able to grow with the data growth. As we know customers do not know which data is relevant and which is irrelevant and they try to collect and store a lot of data which obviously increases cost over time.

To have an elastic and cost-efficient storage in the Cloud is a key value proposition that customers are looking for.

Additionally customers are looking to be able to analyze the data in time that is relevant to their business processes. This means that they want to be able to scale the compute and storage capacity when they need faster processing or to pause it or scale it down as necessary and not to pay extra.

The next pattern that we see is with customers modernizing data platforms. As on-premises data warehouse appliances are coming of age they need to be refreshed. Many customers who anyways have a goal of moving their data centres into the Cloud completely are looking for Cloud native solution for their data warehouses. Some customers are keeping their data on-premises for the foreseeable future but they are still using Cloud. They are using Cloud for hybrid scenarios where the main data set is on-premises, but they are feeding that data into the Cloud creating Data Marts. These Data Marts are isolated from their on-premises systems that might already be overloaded. These customers are looking for ways to give their data science teams ways to run reports and do advanced analytics without putting additional stress on overloaded hardware on-premises.

Other customers might need databases consolidation. They maybe over the years had a lot of databases, maybe purchased companies and each one had a different infrastructure and different type of software system running. They oftentimes are trying to bring it all together as much as possible and bring it into the Cloud at the same time, bring it into a scalable system that can take all of that and run it as one holistic system.

Finally the most important part of the scenario is building the new advanced analytics applications. The whole goal of this is digital transformation — customers are looking for additional value to their businesses that are not just engineering work or improvements to their IT. They are looking to build applications that will enable them to drive new value. Many of these applications such as SasS services are Cloud-born applications with data being generated in the Cloud in a variety of formats already. Not just structured data but some unstructured data. Data form applications running on mobile devices, sensors such as IoT. Customers are looking for ways to take all of that data and process it together. But sometimes the existing systems on-premises might not be capable of doing that. And other times it might not make sense to bring the data from the Cloud back on-premise. So using Cloud native data warehouses is a solution that solves that problem.

Finally one of the most important parts is applying machine learning and more advanced analytics to that data which may be running inside of the database engine but oftentimes outside of it. Let’s consider some specific integrations that already exist with Snowflake Service on Azure and Azure’s own services.

The integral part of Snowflake Cloud native architecture is storing data on elastic and massively scalable storage that is durable, that is already replicated multiple times in a Cloud and that does not disappear. And that is Azure Blob storage. That is what Snowflake uses as its internal storage engine but that same data can be loaded from your own storage accounts as well.

Today you can load data from your Azure Blob storage account and in the future you will be able to load data from Azure Data Lake Generation II after interoperability with the existing Blob storage API becomes available.

Today Blob Storage data can be loaded into Snowflake in an efficient manner in a very high performance. Instead of going through one connection you are loading in parallel. Snowflake has a standard SQL ODBC connector allowing you to be invoked from Azure Data Factory. Today by running a self hosted integration runtime on Azure, which is a very simple installation, or in a serverless manner via Azure function activities that can be invoked by the Data Factory pipeline.

You can also use Power BI Desktop to connect directly to Snowflake and upload those reports into Power BI Server to share with your team and the reports will be refreshed either directly in Power BI or on schedule via Data Gate VM running in a Cloud. Snowflake provides a Spark connector that can be used from Azure Databricks service or Azure HDInsight service. From both of these platforms importing the Spark connector allows Spark applications written in Perl or Python to be able to store the data in and to pull data out of Snowflake.

As many customers are moving from on-premise to the Cloud they are looking for efficient ways to move a lot of data. Multiple options exist for that. One option is to copy the data into Azure Blob Storage using efficient transfer tools that are available from Azure such as Data Factory, EasyCopy and other tools. In addition it is possible to ship data from on-premise in rugged and encrypted secure devices that are called Azure Data box. Those devices will arrive in Azure Data Centres and the data will be uploaded to storage accounts which can then be loaded easily into the Snowflake.

Snowflake also provides integration with Azure Active Directory through federated authentication SSO allowing user credentials not to be stored in two locations. Integration with many other services is currently possible and easily achievable via Azure Blob Storage as the hub for staging and source destination of that data. A lot of work is ongoing by both companies to make these integrations even more efficient and convenient for joint customers.

Figure 1: Snowflake Data Warehouse on Azure

If you look from the left side of figure 1 you will see the common data sources for modern organization. It includes relational databases, social media networks, mobile applications, clickstream data and so on. Moreover, modern organizations consuming data from sources such as IoT, Telemetry, Images and Video.

Azure Data Factory which is a managed data integration and orchestration service allows to automate data movement and transformation with dozens of data source connectors built-in. The data can be moved from on-premise sources or from other databases in a Cloud or on-premise and land it in a scalable Blob Storage or an Azure Data Lake storage destination. In future both of these will be accessible via Blob APIs and can be loaded into Snowflake. At the same time, you may leverage Snowflake partner ecosystem and load data with native Cloud ETL Matillion ETL.

Azure Data Storage Gen 2 provides hierarchical names with capabilities above what an object storage does but under the surface they both use the same Blob Storage infrastructure in Azure Cloud.

Once the data lands in the Cloud it can be left in its raw form in a cost efficient tiered manner or it can be processed by all of the Spark engines or all of the machine learning applications that you can have running in Azure on top of Azure Databricks service, within Spark or within other applications in a Cloud or on-premise. The data could be aggregated, processed and loaded efficiently — only the subsets that you care about for your BI insights and reports — into Snowflake on Azure in a very scalable manner. You can load all of the raw data into Snowflake as well if you choose so but loading subsets of data allows you to use different levels of storage for different types of data making it more cost efficient depending on your use case scenario.

Once the data is in Snowflake it can be accessed from Power BI service reports via Data Gateway that is running on the virtual machine in the Cloud making it completely seamless. Moreover, you may access Snowflake with popular BI tools like Tableau that is using Snowflake ODBC driver.

Figure 2: Streaming Analytics with Snowflake on Azure

On the figure 2 we have reference architecture for Streaming Analytics workflow. Similarly to previous diagram there is source layer on the left. Azure Data Factory continues to be used in this scenario to move data to Azure Blob storage. At the same time unstructured data may be streaming in millions of events per day and that data also needs to be ingested. Azure provides services, mostly managed services such as Azure Event Hub and Azure IoT Hub for ingesting those events and streaming them directly into Spark processing engine or through our managed Azure Streaming event complex processing service. And then this allows you to aggregate and to count by hours, minutes or seconds in various time windows of that data flowing in and store the aggregations in Blob storage. Once the data is in Blob Storage Snowflake has a very unique Snowpipe feature to load this data efficiently without writing code into relevant tables by watching for new files landing in Blob Storage and loading them into tables that were pre-defined. On the right side of the diagram is the business layer for end user access.

Everything that was described in both architecture diagrams with data ingestion and processing — all of it is done for those personas of users who are shown on the right of both diagrams — either the business decision makers looking at the dashboards through BI tool or members of the data science team. All of the data is accessible and is mashed together allowing for new insights, new AI models and new business value.

What else you might use on Microsoft Azure for your Data Solution?

This is a good question. Based on our experience, we can recommend several 3rd party tools that works perfectly with Snowflake and have unique value proposition.

Cloud Native ETL for Snowflake

For Data Integration, we highly recommend Matillion ETL as a robust cloud native integration tool. It has many benefits and has the highest rank among users. There are some key benefits of Matillion ETL:

  • Platform-Native — It was designed and built it specifically for the cloud data warehouses, supports — Amazon Redshift, Google BigQuery, and Snowflake — so that our push-down ELT architecture can take full advantage of their raw computing power.
  • Cloud-Native — Matillion utilizes the near-infinite storage capacity of the cloud — meaning your projects get near-infinite scalability.
  • Simple — By working in the cloud, Matillion reduce the complexity involved in moving large amounts of data.
  • Fast — Process a billion rows of data in fifteen minutes — and go from launch to live in just five.
Image result for matillion etl
Matillion ETL

The 1st cloud data modelling tool for Snowflake

Another great product is SqlDBM. With SqlDBM you can improve your productivity as you develop and model your Snowflake DW. And of course, you can take advantage of and adapt SqlDBM features into your Snowflake projects such as Subject Areas, Relationships, Diagrams, View Modes, Revision history, Team Collaboration etc.

SqlDBM — Cloud Data Modelling tool for Snowflake

Visual Analytics Platfrom

The final tool is Tableau Software. This is a Visual Analytics Platform, It allows us to step out of the box and look at data in a totally different way. It is available as Tableau Online or you can deploy Tableau Server on Azure.

Tableau Dashboard

Tableau and Microsoft Azure provide a comprehensive business intelligence solution that can be implemented quickly, secured easily, and used by everyone in the enterprise. Learn how to configure Tableau Server for hosting on Azure to help your organization see, understand, and share data so you can fully realize your investments in Azure and the Microsoft Data Platform.

About Rock Your Data

Rock Your Data is a consulting and technology firm that delivers secure and scalable cloud analytics solutions for large and medium-sized enterprises in Canada.

Rock Your Data help organizations to make distinctive, lasting, and substantial improvements in their performance by leveraging their data and cutting-edge technology.

--

--