How to build a data platform with Azure and Snowflake

Aditya Gandhi
Engineered @ Publicis Sapient
7 min read · May 19, 2020

Big Data is changing the way enterprises manage their systems of insight. It helps them monitor business performance and generate actionable intelligence from an ever-increasing amount of structured and unstructured data pouring in from traditional OLTP systems, web and log analytics, PoS systems and IoT devices. This requires building a data platform that can store data securely and provide analytics that support both curated insights and the ability to explore the data for hidden correlations and patterns.

As data and processing needs grow, such a platform needs to be able to scale storage and compute elastically and economically. Cloud-based data services provide the ideal answer for this. Azure offers many options for data processing and storage, including Databricks (a collaborative Apache Spark-based analytics platform), HDInsight (a fully managed cloud Hadoop and Spark service), Azure Synapse Analytics (a fully managed elastic data warehouse) and Snowflake (an Azure partner that provides a hosted elastic data warehouse solution). The solution I outline below uses Azure Blob Storage, Azure Data Factory, Azure Functions and Snowflake to build a data platform.

Here is a reference architecture for batch processing:

Let’s understand each of these components.

Sources: Data arrives from different sources with both structured and semi-structured content, e.g. FTP locations, REST APIs for various internal and external applications, cloud storage buckets, applications like Salesforce, and databases like Azure SQL and MySQL.

Data Acquisition and Orchestration: Azure Data Factory (ADF) is a data integration service that allows you to create workflows to orchestrate and automate the movement and transformation of data. It has been used to build all the data pipelines in this solution. It comes with a number of ready-made connectors, e.g. to connect to SQL sources, blob locations, etc. These connectors reduce build time significantly, and with a simple graphical user interface, a large part of the pipelines can be set up. In cases where connectors are not available or do not support the use case, we use an Azure Function to connect to the source, extract data and load it into Azure Blob Storage. Azure Functions is the FaaS offering from Azure that provides an event-driven, compute-on-demand experience for APIs and functions that handle specific tasks. As an example, there is currently no direct ADF connector for Snowflake, so an Azure Function is used instead to connect to Snowflake and execute data load/transformation tasks.
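Because the Azure Function ultimately just submits SQL to Snowflake, the load step it executes looks roughly like the following. This is a minimal sketch: the stage, container and table names are hypothetical, and it assumes a SAS token with read access to the blob container.

```sql
-- Hypothetical external stage pointing at the raw-data container in Azure Blob Storage.
CREATE OR REPLACE STAGE raw_blob_stage
  URL = 'azure://mystorageaccount.blob.core.windows.net/raw-data'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Bulk-load the extracted files into a Snowflake staging table.
COPY INTO stg_orders
  FROM @raw_blob_stage/orders/
  PATTERN = '.*orders_.*[.]csv'
  ON_ERROR = 'ABORT_STATEMENT';
```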

Data Storage and Processing: Azure Blob Storage is an object storage solution that allows for the storage of massive amounts of unstructured data. It is used to store all incoming data as CSV/JSON/Parquet files, etc. in the raw layer of the application. Snowflake is a highly elastic, performant, low-maintenance, pay-as-you-go cloud-hosted data warehousing solution. It is used to store all processed data. First, the raw data from blob storage is loaded into the Snowflake staging zone; then data from multiple raw sources is cross-referenced and transformed to create the curated layer; and finally, Snowflake views are used to create the analytical layer. Snowflake has a unique architecture that allows storage and compute to scale independently. It has also been designed to be very low-maintenance: as a developer or administrator, you do not have to worry about partitions, indexes, etc.; all of that is managed automatically by Snowflake. It is highly performant and can store and query semi-structured data in formats like JSON and Parquet. It can load large files (GBs of data, millions of records) very quickly. It also uses well-known languages (SQL for queries and JavaScript for stored procedures), making it easy to use.
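As an illustration of this layering, semi-structured payloads can land in a VARIANT column in the staging zone and be exposed through views in the analytical layer. A minimal sketch, using hypothetical table, view and field names:

```sql
-- Staging table holding raw JSON events in a VARIANT column (hypothetical).
CREATE OR REPLACE TABLE stg_events (
  payload    VARIANT,
  loaded_at  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Analytical view that flattens the JSON payload into typed columns.
CREATE OR REPLACE VIEW v_order_facts AS
SELECT
  payload:customer_id::STRING                AS customer_id,
  payload:order.total::NUMBER(12,2)          AS order_total,
  TO_DATE(payload:order.created_at::STRING)  AS order_date
FROM stg_events;
```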

Let’s dive deeper into a sample pipeline that pulls data from a MySQL database and loads it into Snowflake.

The pipeline starts by querying Snowflake for job metadata (such as the last incremental load parameters) and registers a new job run. Data is then pulled from MySQL using the out-of-the-box ADF SQL connector, applying the incremental load criteria, and written to blob storage for further processing. To load data into Snowflake, we create a Snowflake external stage that points to the blob storage, and the data is copied via this stage into a standardized table in Snowflake. During this process, data is merged to handle updates and changes. A generic, metadata-driven stored procedure was created to handle this data load for any source.

After data for multiple sources/entities has been loaded into the staging area, it is loaded into the Snowflake curated zone. In the curated zone, depending on requirements, some data follows the SCD2 (slowly changing dimension, type 2) pattern, while for the rest only the latest version of the data is maintained. For SCD2-type data, a checksum implementation using SHA1 as the hashing function was used to identify changes. The curated layer combines data from different sources/tables to create business-domain views of the data set. All of this data is then used to build aggregate/analytical views using Snowflake views. Finally, the job is marked as completed (or failed, with error details) in the job run tables in Snowflake. Below is a simplified view of this. In reality, data pipelines are designed to load groups of data together from the source and then, based on data dependencies, jobs are run per curated table to load the required data.
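For the "latest version only" tables, the merge step can be expressed roughly as below; SCD2 tables additionally expire the previous row and insert a new one. This is a simplified sketch with hypothetical table and column names; the actual implementation was a generic, metadata-driven stored procedure.

```sql
MERGE INTO curated_customer AS tgt
USING (
    SELECT customer_id,
           name,
           email,
           SHA1(CONCAT_WS('|', name, email)) AS row_checksum
    FROM stg_customer
) AS src
  ON tgt.customer_id = src.customer_id
-- Update only when the checksum shows the row has actually changed.
WHEN MATCHED AND tgt.row_checksum <> src.row_checksum THEN UPDATE SET
    tgt.name         = src.name,
    tgt.email        = src.email,
    tgt.row_checksum = src.row_checksum,
    tgt.updated_at   = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT
    (customer_id, name, email, row_checksum, updated_at)
    VALUES (src.customer_id, src.name, src.email, src.row_checksum, CURRENT_TIMESTAMP());
```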

Some learnings/tips to keep in mind:

1. Azure Data Factory has a number of ready-made connectors. Use them to save time and effort.

2. Azure Functions with HTTP triggers have a 230-second timeout. If you need longer-running methods, use Durable Functions instead (note that, at the time of writing, Durable Functions are only supported in .NET). Also, configure your function host for this timeout so that worker threads do not cause any issues.

3. Handle transient errors using the built-in retry mechanism in Azure Data Factory activities, such as Azure Function calls and Azure Blob operations.

4. Wherever possible, use managed identity for authentication, so that you do not have to manage credentials.

5. Ensure that you have an incremental load strategy to manage data volumes.

6. ADF will throw errors for non-existent files, leading to pipeline failures when there is no data to process. Use the Get Metadata activity to handle such cases.

7. Unlike a try/catch/finally construct, ADF only supports success, failure and completion actions per activity, not for a group of activities. So if you want a common activity that is invoked on failure of activity1, activity2 or activity3, that is not yet supported. You will need to set up three different failure activities, one for each of the original activities, or package the original activities into a pipeline and add a failure task on that pipeline.

8. The ADF REST connector works for a number of use cases. However, it has some limitations in its support for pagination, because of which you may need to use Azure Functions.

9. Azure Functions triggered by ADF should always return a JSON response, whether they succeed or fail.

10. Snowflake storage and compute can scale independently.

11. Snowflake requires almost no administration, which reduces maintenance effort significantly.

12. Identify data where change history is important, so that you can design your data processing and storage accordingly.

13. Snowflake allows the creation of constraints but does not enforce them. This gives you flexibility in managing data loading and processing, but it means you need to handle any potential data quality issues yourself.

14. Handling time zones in Snowflake requires special attention. Storing data in UTC is simple. If you need to store timezone information, however, Snowflake only supports storing the offset from UTC, not the timezone name. If the source provides data without an offset, you will need to take extra care to calculate the offset (including DST handling) before saving it in the database.

15. Snowflake has an out-of-the-box time travel feature that lets you look at data as it was at a point in the past. This is useful for analyzing changed or deleted data and makes analysis and debugging easier.

16. Snowflake provides a very useful data sharing feature which can help share data securely with business partners. This can create a great ecosystem in which data is shared instantaneously without the need for any special connectors or ETL tools.

17. Snowflake database replication provides data durability and availability across regions and even across different cloud providers, helping avoid downtime.

18. Cloning is a very useful feature in Snowflake. Say you encounter a problem in production and want to reproduce and debug it in development. You can create a clone of a database, schema or table and have it available almost instantaneously in your dev environment. Snowflake uses its metadata repository to manage cloning, so it does not spend time physically copying the data, and you do not pay for the extra storage; that is why this is also called zero-copy cloning. Only when data is modified are the changed data sets stored separately. Cloning becomes even more powerful when combined with time travel: you can create a clone as of a certain time, or of the state before a particular query was run (see the sketch after this list).
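As referenced in the cloning tip above, here is a minimal sketch of time travel and zero-copy cloning, using hypothetical object names and a placeholder query ID:

```sql
-- Time travel: query a table as it was one hour ago.
SELECT * FROM curated_customer AT (OFFSET => -60*60);

-- Time travel: query the state just before a particular (problematic) query ran.
SELECT * FROM curated_customer BEFORE (STATEMENT => '<query-id>');

-- Zero-copy clone of a whole database for debugging, combined with time travel.
CREATE DATABASE analytics_debug CLONE analytics
  AT (TIMESTAMP => '2020-05-18 09:00:00'::TIMESTAMP_LTZ);
```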

Overall, Azure and Snowflake provide a very powerful tool set for quickly building data platforms. They allow the platform to scale dynamically as processing or storage needs evolve, and they give developers and operations teams great tools to manage it.

Here are some other useful links:

1. Snowflake Architecture — Learn How Snowflake Stores Table data

2. How to load data in Snowflake from Azure

3. SCD Type 2 with Snowflake
