Snowflake Cortex ML + Matillion = Low Code ML + ELT = new skill unlocked for data teams
Snowflake's Cortex ML capabilities are low-code SQL functions that empower data engineers, who traditionally don't have deep statistical or ML modeling and operations knowledge, to generate ML-based insights from their data, such as forecasting, anomaly detection, classification and more. Snowflake provides the appropriate ML algorithm for each function, so you don't have to be a machine learning developer to take advantage of them. All you need is your data.
So what happens when you combine the power of these engineer-friendly SQL functions with a code-optional data pipeline platform like Matillion Data Productivity Cloud (DPC)? This blog explores exactly that, so read on!
But wait… what is the Matillion Data Productivity Cloud?
The goal of the platform is to make data teams more productive, utilizing a single application for any Extract, Load, and Transform process. In addition to being a GUI-based ELT Platform, Data Productivity Cloud offers data engineer-friendly features such as:
- Component-based Orchestration, providing a visual view of the end-to-end pipeline
- Notification capabilities like Send Email, Webhook Posting, and the posting of messages to AWS SQS queues
- High-code capabilities such as the SQL Script, Pushdown Python, and Commands for dbt Core
- Team collaboration and version control tooling using Git
- Integration of AI functionality like Auto-Generated Documentation, Amazon Bedrock Prompt, and OpenAI prompt
Data Productivity Cloud utilizes pushdown SQL to make this whole process work, which makes it a natural fit for users looking for a platform to incorporate Cortex ML functions. Components are the tools in the Data Productivity Cloud users’ toolkit; with them, users can define a step-by-step process of potentially loading data from a variety of sources, transforming data from a raw state to production-readiness, then creating and calling native Cortex functions to enable Machine Learning capabilities.
Cortex ML + Matillion DPC in action
The quickest way to get started and see the joint benefits is to follow along with the existing quickstart guide for Cortex ML functions. The guide covers a fictional company, Tasty Bytes, and the build-out of forecasting and anomaly detection capabilities on the company's sales data. This article follows the quickstart from the perspective of a team developing a Data Productivity Cloud pipeline.
Section 2 of the guide lays out how to set up your Snowflake environment. Please complete Section 2 of the guide, which achieves the following (a condensed SQL sketch of the resulting objects follows the list):
- Access to the Tasty Bytes dataset, found by searching for the Cybersyn Government Essentials dataset on Snowflake Marketplace
- Creation of a warehouse (quickstart_wh) and database objects (quickstart.ml_functions.tasty_byte_sales) which will be used in Matillion.
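For reference, the objects that Section 2 of the Quickstart creates look roughly like the following (a condensed sketch; the guide has the exact DDL and also populates the tasty_byte_sales table from the Marketplace dataset):

-- Condensed sketch of the Snowflake objects from Section 2 of the Quickstart
USE ROLE ACCOUNTADMIN;
CREATE WAREHOUSE IF NOT EXISTS quickstart_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS quickstart;
CREATE SCHEMA IF NOT EXISTS quickstart.ml_functions;
-- quickstart.ml_functions.tasty_byte_sales is then created and loaded per the guide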
Once Section 2 of the Quickstart is complete, proceed to the Getting Started with Matillion section.
0. Getting Started with Matillion
Any journey in Data Productivity Cloud starts by registering an account from the Matillion Hub. All that is needed is a valid email address and password to sign up. Upon signing in as the newly created user, the account will be on a 14-day, 500-credit trial.
Once registered for the Matillion Hub, enter the Designer and create a project; we will name ours CortexQuickstart. When setting up the project, an Environment will also be created, which represents the connection between Matillion and Snowflake. Set the following defaults for how Matillion will interact with Snowflake (a rough SQL equivalent follows the list):
- ROLE: ACCOUNTADMIN
- WAREHOUSE: QUICKSTART_WH
- DATABASE: QUICKSTART
- SCHEMA: ML_FUNCTIONS
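These Environment defaults define the session context that Matillion uses when it pushes SQL down to Snowflake, roughly equivalent to:

-- Session context implied by the Environment defaults above
USE ROLE ACCOUNTADMIN;
USE WAREHOUSE QUICKSTART_WH;
USE DATABASE QUICKSTART;
USE SCHEMA ML_FUNCTIONS;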
1. Launching Data Productivity Cloud Designer
The Data Productivity Cloud Designer is where design of the end-to-end pipeline will take place. To begin, let’s prepare the git branch that our team will be working against. Data Productivity Cloud embeds git as a core piece of functionality, to allow teams to collaborate and institute version control.
The Branches screen will show that there is one branch in existence: main. Let's treat this branch as the Production branch, into which we will later merge code from a development branch. Speaking of that development branch, create a new one using the Add new branch button and name it dev.
Once created, the dev branch should now be present on the Branches screen. Click into it to begin designing your pipeline.
The first step of Designer is usually the creation of an Orchestration pipeline. This can be done by clicking the Add button in the Project Explorer and selecting Orchestration Pipeline. Let’s create an Orchestration pipeline and call it CortexOrchestration.
2. Preparing datasets via DPC Transformation Pipeline
Now that our project in Data Productivity Cloud is set up, we will move on to Step 3 in the Cortex Quickstart Guide. Here, the Data Engineering (DE) team creates views and tables derived from the original Tasty Bytes dataset. Instead of hand-writing SQL as in the Quickstart guide, Matillion gives users the option of dragging and dropping components, which improves the clarity and maintainability of the pipeline.
The DE team creates a Transformation pipeline and drags it out as the first step in the Orchestration pipeline.
Within the Transformation pipeline, three different datasets will be created, all minor derivations from the original Tasty Bytes dataset. They all include only the latest year's worth of data, and the date field also needs to be converted to a timestamp data type.
The DE team will complete this step of the pipeline with four components (a rough SQL equivalent follows the list):
- Table Input to read the TASTY_BYTE_SALES table
- Filter to filter the DATE field for only dates greater than May 28, 2022
- Calculator to make the data type conversion
- Rename to retitle the DATE column as TIMESTAMP
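For readers who think in SQL, the pushdown generated by these four components is roughly equivalent to the following sketch (the exact SQL Matillion generates, the target timestamp type, and the pass-through column names come from the Quickstart):

-- Rough SQL equivalent of Table Input -> Filter -> Calculator -> Rename
SELECT
    TO_TIMESTAMP_NTZ(date) AS timestamp,  -- Calculator converts DATE, Rename retitles it TIMESTAMP
    primary_city,                         -- remaining column names are illustrative
    menu_item_name,
    total_sold
FROM tasty_byte_sales
WHERE date > '2022-05-28';                -- Filter keeps only the latest year's worth of data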
The DE team has also made use of the Notes feature, explaining each step of the Transformation pipeline.
Using the Sampling feature in a Matillion pipeline, the DE team confirms that the state of the data and the number of rows look correct, and is ready to progress with the Transformation pipeline.
The DE team creates the three views needed for the Forecast and Anomaly Detection tasks (a SQL sketch follows the list):
- lobster_sales to train the forecasting function; this dataset consists of only Lobster Mac & Cheese items and only the TIMESTAMP and TOTAL_SOLD fields
- vancouver_anomaly_analysis_set consisting of only the latest month's worth of data
- vancouver_anomaly_training_set consisting of the latest year's worth of data, without the last month included
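In SQL terms, the three views look roughly like the sketch below; in the actual pipeline they are built from the transformed dataset above, and the exact definitions (including column names like menu_item_name and primary_city) come from the Quickstart:

-- lobster_sales: training data for the forecast (TIMESTAMP and TOTAL_SOLD only)
CREATE OR REPLACE VIEW lobster_sales AS
SELECT TO_TIMESTAMP_NTZ(date) AS timestamp, total_sold
FROM tasty_byte_sales
WHERE menu_item_name = 'Lobster Mac & Cheese'
  AND date > '2022-05-28';

-- vancouver_anomaly_training_set: latest year of Vancouver sales, minus the last month
CREATE OR REPLACE VIEW vancouver_anomaly_training_set AS
SELECT TO_TIMESTAMP_NTZ(date) AS timestamp, menu_item_name, total_sold
FROM tasty_byte_sales
WHERE primary_city = 'Vancouver'
  AND date > '2022-05-28'
  AND date < DATEADD(MONTH, -1, (SELECT MAX(date) FROM tasty_byte_sales));

-- vancouver_anomaly_analysis_set: only the latest month of Vancouver sales
CREATE OR REPLACE VIEW vancouver_anomaly_analysis_set AS
SELECT TO_TIMESTAMP_NTZ(date) AS timestamp, menu_item_name, total_sold
FROM tasty_byte_sales
WHERE primary_city = 'Vancouver'
  AND date >= DATEADD(MONTH, -1, (SELECT MAX(date) FROM tasty_byte_sales));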
These three views will be created in three separate, concurrently running branches of the pipeline. Once complete, the DE team clicks Run Pipeline to create the three new views derived from the TASTY_BYTE_SALES dataset. This materializes three new views in the Snowflake database.
3. Create and Execute Cortex ML Forecast function
Now that the three views have been created via a Matillion Transformation pipeline, it is time for the Data Science (DS) team to get to work. The DS team wishes to:
1) create a 10-day Forecast which will calculate the next 10 days' worth of Lobster Mac & Cheese sales
2) detect any anomalies in sales for all items sold in the latest month
To do so, the DS team will leverage Cortex ML in Data Productivity Cloud, taking advantage of its native Snowflake SQL pushdown capabilities. Returning to the CortexOrchestration pipeline, the DS team will set up two SQL Script components for each flow, and arrange the process so the two flows run concurrently.
For the Forecast Lobster Mac & Cheese flow, one SQL Script component will be used to create the Forecast function, and another to call the function to predict the next 10 days of sales. Results will be saved to a new table called macncheese_predictions. The two components use the code from Section 3, Step 2 and Step 3 of the Quickstart, respectively.
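For reference, the two SQL Script components carry code along these lines, following Section 3 of the Quickstart (a sketch; the model name lobstermac_forecast is illustrative, and the guide has the authoritative statements):

-- SQL Script component 1: create the forecast model from the lobster_sales view
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST lobstermac_forecast(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'lobster_sales'),
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'TOTAL_SOLD'
);

-- SQL Script component 2: forecast the next 10 days and persist the results
CALL lobstermac_forecast!FORECAST(FORECASTING_PERIODS => 10);
CREATE OR REPLACE TABLE macncheese_predictions AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));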
4. Create and Execute Cortex ML Anomaly Detection function
The DS team also wishes to train a model and detect anomalies on the past year's worth of all Vancouver sales, and to do so concurrently while the Mac & Cheese forecasting functions run. The code used here is derived from Section 5 of the Quickstart guide: first create the vancouver_anomaly_model trained on the first 11 months of data, then call the trained model to detect anomalies in the latest month of sales data. Any anomalies detected will be saved in the vancouver_anomalies table.
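Along the same lines, the anomaly detection components use code of roughly this shape, derived from Section 5 of the Quickstart (a sketch; the SERIES_COLNAME value and other parameter values are illustrative):

-- Train the anomaly detection model on the training set (first 11 months)
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION vancouver_anomaly_model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_training_set'),
    SERIES_COLNAME => 'MENU_ITEM_NAME',  -- per-item series; feeds the SERIES column queried later
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'TOTAL_SOLD',
    LABEL_COLNAME => ''                  -- unsupervised: no labeled anomalies
);

-- Detect anomalies in the latest month and persist them to vancouver_anomalies
CALL vancouver_anomaly_model!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),
    SERIES_COLNAME => 'MENU_ITEM_NAME',
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'TOTAL_SOLD'
);
CREATE OR REPLACE TABLE vancouver_anomalies AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));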
Once the SQL Script components are configured, the CortexOrchestration pipeline appears as follows. The DS team clicks the Run Pipeline button to execute the recreation of the views as well as the Cortex ML processes.
5. Trigger Email Notifications
While the DS team is confident in their pipelines, they do not want to assume everything will run perfectly, especially as new data is introduced to further train the Cortex functions. As the pipelines run on a regular basis, the team wishes to set up a process by which emails will be sent in either the event of success or failure. Data Productivity Cloud contains robust notification methods including the Send Email component which can accomplish this task.
But how to create the logic of either: a) Send Success email detailing the most frequent anomaly, when both the Mac & Cheese Forecast and Anomaly Detection succeed, or b) Send Failure email if either process were to fail?
Luckily for the team, Data Productivity Cloud has And and Or components for making exactly this evaluation. Using these components, the DS team creates a process by which a Success or Failure email is dynamically sent based on the outcome of the job.
As a final cherry on top, the DS team also wants the success email to communicate the most frequent anomaly. The Query Result to Scalar component can be used to query the vancouver_anomalies table with the following statement, which will map the result to a Matillion variable included in the Success email:
WITH TOP_ANOMALY AS (
    SELECT series, is_anomaly, COUNT(is_anomaly) AS num_records
    FROM vancouver_anomalies
    WHERE is_anomaly = 1
    GROUP BY ALL
    ORDER BY num_records DESC
    LIMIT 1)
SELECT CONCAT(SERIES, ' : ', NUM_RECORDS) AS TOP_ANOMALY_COUNT
FROM TOP_ANOMALY
The resulting Matillion pipeline looks like the following:
Upon running, our team receives the following email, which indicates that the process has run successfully and that Hot Ham & Cheese is the food item with the most anomalies, at 18.
Success! The DE and DS teams' collaboration has led to an end-to-end process of sourcing and building training datasets, then building and calling native Cortex functionality. And as the process will run on a regular basis, the team has also built notification methods to communicate success or failure. There are still a couple more steps the team can take to promote their pipelines to production and run them on a schedule.
6. Productionizing the pipeline
The first step of the Launching Data Productivity Cloud Designer section (Section 1) involved the creation of a new branch that the Data Engineering and Data Science teams were working against. But how can the team: 1) promote this code to a production branch, and 2) ensure that it runs automatically on a regular basis?
The first step is to Commit the changes on the dev branch, then push to remote. Within the Designer UI, the user has Git Action buttons at their disposal; the group will commit the changes to the branch, then push the local changes to make the latest version of the pipeline available to other branches.
Next, the team will want to promote the code to the main branch, which serves as the production environment. Returning to the Branches screen, the team clicks into the production branch, main, and merges from the dev branch. The changes take effect immediately, and the team can now find the pipelines in the Project Explorer.
Finally, the team will place the main branch on a schedule. Any future runs will be initiated at the frequency the team configures, and the team can now expect an email to be sent at the conclusion of every run.
So, to recap…
Here are some key takeaways from this article:
- Cortex ML functions are a powerful, fully managed service that enables you to quickly generate ML-based insights — all you need is your data
- When used in tandem with a GUI-based ELT product like Data Productivity Cloud, these functions allow users to build end-to-end, orchestrated data transformation + ML pipelines in a visual, low-code manner
- Data Productivity Cloud's developer-centric capabilities, involving Git flows, segregation of environments, and easy CI/CD, enhance the productivity of data teams
- Data Productivity Cloud allows for the build-out of pipeline logic (And, Or components) and notifications (Send Email), in addition to ELT actions.
How can I replicate this job to my Data Productivity Cloud environment?
- Fork this public GitHub repository containing the end-to-end pipeline
- Set up a Data Productivity Cloud Project with Advanced settings
- Connect Data Productivity Cloud to the forked repository in the Project setup menu. This article details the prerequisites that need to be completed for the process.
- Here is a FAQ for the connection between Data Productivity Cloud and your GitHub account.
- Once the project has been set up, you should be able to find the available jobs ready to use.
Authors:
Ripu Jain, Sr Partner Solutions Engineer, Snowflake
Karey Graham, Partner Engineering Manager, Matillion