Creating (and displaying!) a Simple Predictive Model in Snowflake — Part 1

Peter Beck
7 min read · Apr 29, 2024


Predictive analytics has numerous real-world applications, but structuring a robust data set, creating a predictive model, and publishing the results to a diverse group of users are distinct steps that can feel disjointed and require different skills.

Doing this in Python with the standard scikit-learn package is fairly straightforward, but Snowflake has wrapped predictive functionality in a simple wizard that can be invoked over the data set to provide a nearly no-code solution. Once we have our model results, we can easily display and share them using a simple Streamlit application.

I recently spent some time creating a simple Snowflake predictive model over a set of publicly available data provided by the Canadian government as part of its Open Government project (https://open.canada.ca/en). The data in question is a daily report of commercial trucks and truck drivers entering or returning to Canada: essentially, a count, by border crossing (also known as Port of Entry), of trucks entering Canada from the United States each day. The data set can be found here: https://open.canada.ca/data/en/dataset/4e414765-29de-4189-b1ed-21de1adb2b33

This data provides a nice set to attempt to build a predictive model using Snowflake. A question we might attempt to answer is: given the data in the data set, how many entries should we expect over the next “x” number of days in the future at crossing “y”? If you are attempting to staff crossings with Border Service Officers, for example, being able to have some visibility into what is expected to happen over the next 30 days could be quite valuable.

Once the predictive model has been run, a Streamlit app can be used to publish the results, which can then be easily shared with other users in your Snowflake account.

Here are the steps to build this simple application:

Prerequisites:

1) A Snowflake account. A trial account is fine — as of this writing, all the functionality required is available in a trial account. You can get a trial account here: https://signup.snowflake.com/

2) The data we want to analyze, available from the Government of Canada Open Government website: https://open.canada.ca/data/en/dataset/4e414765-29de-4189-b1ed-21de1adb2b33 The data set we are analyzing is called “Leading indicator, Trucks and drivers entering or returning to Canada, by vehicle licence plate”.

3) SnowSQL, installed on your local machine. The data set is about 1GB in size. The maximum file size you can upload to Snowflake through the Snowsight web user interface is 250MB, so in our case we are going to use the local command-line SQL interface to issue a PUT command to move the data into a stage.

4) Basic knowledge of Snowflake. You should know how to connect to Snowflake from SnowSQL, and how to run basic code in Snowsight.

Goal:

We are going to load the data and create a predictive model based on data from 2019 to 2023. The model will predict the expected border crossings for trucks for the first 28 days of January 2024. We will then create a simple Streamlit app that will show the predicted results and the actual results for January 2024 for each Port of Entry, to see how accurate our model actually is.

Steps:

1) Make sure you have met the prerequisites above

2) Once you have downloaded the data, unzip it to extract the CSV we want to load. I gave the file a new name so I could keep track of it more easily — I called it Trucks_Entering_Canada.csv

3) Load the data into a table in Snowflake

a. First, we’ll need a Database to load the data into. We will call the Database TRUCKING_DATA:

create or replace database TRUCKING_DATA;
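
If you are working in a SQL Worksheet or in SnowSQL, it is worth setting the session context so that everything that follows lands in the right place:

use database TRUCKING_DATA;
use schema PUBLIC;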

b. To load the data into Snowflake we are going to first create a Stage. A Stage is like an S3 bucket on AWS — basically, a place where we can land any kind of file for subsequent processing. In a SQL Worksheet in the PUBLIC schema of the TRUCKING_DATA database, execute the following:

create or replace stage TRUCKING_DATA_STAGE;

c. Now we need to PUT the file from our local computer into the stage, so we can subsequently load it into a table for analysis.

i. Connect to Snowflake from SnowSQL

ii. Issue the following command: put file://C:\[path_to_your_data]\Trucks_Entering_Canada.csv @TRUCKING_DATA_STAGE;
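
iii. You can confirm the upload by listing the contents of the stage. Note that PUT compresses the file with gzip by default, so it will appear as Trucks_Entering_Canada.csv.gz:

list @TRUCKING_DATA_STAGE;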

d. Once the file has been loaded, we can see it in the TRUCKING_DATA_STAGE in Snowsight:

e. Now we need to create a table into which we can load the data. The easiest way to do this is to load it through the Snowsight interface.

On the right side of the interface above, select the 3 dots and then select Load into table

Select Create New Table, and then give the table the name TRUCKING_DATA_CROSSINGS:

On the next page, select Delimited Files. Snowflake will attempt to generate a File Format for you. However, you may receive some errors — simply select Autofix Errors from the drop down in red:

…after the fix you should see this:

Click Load. Once the data has loaded, you will have a table with about 6.4 million rows of crossings data:
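
As an aside, the wizard is essentially generating a file format and a COPY command on your behalf. If you would rather stay in SQL, a rough equivalent looks like the sketch below — this assumes TRUCKING_DATA_CROSSINGS has already been created with columns matching the CSV layout:

-- A simple file format for a comma-delimited file with one header row
create or replace file format TRUCKING_CSV_FORMAT
type = csv
skip_header = 1
field_optionally_enclosed_by = '"';

-- PUT gzips the file by default, and COPY decompresses it automatically
copy into TRUCKING_DATA_CROSSINGS
from @TRUCKING_DATA_STAGE/Trucks_Entering_Canada.csv.gz
file_format = (format_name = 'TRUCKING_CSV_FORMAT')
on_error = 'continue';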

4) Create a View of the Data

Now that we have the data loaded into a table, we want to simplify it by creating a view.

The data in TRUCKING_DATA_CROSSINGS contains rows which are irrelevant to our analysis. We will create a somewhat aggregated view of the data to enable the creation of a predictive model:

create or replace view TRUCKING_DATA.PUBLIC.VW_TRUCKS_ENTERING_CANADA_2019_2023(
REF_DATE,
GEO,
VALUE
) as
SELECT REF_DATE
,GEO
,SUM(VALUE)
FROM TRUCKING_DATA_CROSSINGS
WHERE GEO <> 'Canada'
AND UNITS_OF_MEASURE = 'Trucks'
AND REF_DATE BETWEEN '2019-01-01' AND '2023-12-31'
AND VEHICLE_LICENCE_PLATE NOT IN
('American-plated trucks entering Canada',
'Canadian-plated trucks entering Canada',
'Trucks entering Canada')
GROUP BY REF_DATE,
GEO;

This view gives us a rolled-up count of trucks entering Canada between 2019 and 2023 for each Port of Entry, and also a rolled-up number for each province that has a Port of Entry that is a truck crossing (some provinces, like Prince Edward Island, have no way for trucks to enter Canada directly, so they are excluded).
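
A quick sanity check on the view, to confirm each location has a sensible range of daily data, might look like this:

SELECT GEO
,COUNT(*) AS DAYS
,MIN(REF_DATE) AS FIRST_DAY
,MAX(REF_DATE) AS LAST_DAY
FROM VW_TRUCKS_ENTERING_CANADA_2019_2023
GROUP BY GEO
ORDER BY GEO;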

5) Build a Model

Now the fun stuff begins. We will use the Snowsight interface to build a simple model that attempts to predict the number of truck crossings in the first 28 days of January 2024.

In Snowsight, select Cortex from the AI & ML menu:

Next, select Forecasting:

We’ll call the model trucking_model, and for now we can leave Generate evaluation metrics off. We also need to select a warehouse; in this case I selected my default COMPUTE_WH:

Click Let’s go

Now select the view we created for our data:

On the next page, select the target column — the data we are trying to predict, in this case Value.

Now we select the column that holds the individual dates for our training data, in this case REF_DATE:

Finally, we select the Series Identifier — in this case, the GEO column, which holds the location of our Ports of Entry, and also rows for individual Provinces.

In the case of our data, there are no additional features, so we can skip the next screen:

We now select the number of days we want to make predictions for after the end of our data. In this case, since our data ends at the end of 2023, our predictions will be for the first 28 days of January, 2024:

Click Done. Snowflake will now generate a script that will create your model:
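
The script it produces will look roughly like the sketch below (simplified here; the wizard's actual output wraps this in a few more housekeeping statements):

-- Train a forecast model over the view; GEO identifies each series
CREATE SNOWFLAKE.ML.FORECAST trucking_model(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'VW_TRUCKS_ENTERING_CANADA_2019_2023'),
SERIES_COLNAME => 'GEO',
TIMESTAMP_COLNAME => 'REF_DATE',
TARGET_COLNAME => 'VALUE'
);

-- Generate 28 days of predictions and save them to a table
CALL trucking_model!FORECAST(FORECASTING_PERIODS => 28);
CREATE TABLE My_Trucking_Forecast AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));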

Select a warehouse and run your script. How long it takes depends on the size of the warehouse you chose; with an X-Small, expect to wait 10 minutes or so.

At the end, you will have a new table with the name you assigned in the Configure your predictions step, in this case My_Trucking_Forecast. SELECT from this table to see the results:
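
For example (SERIES, TS, FORECAST, LOWER_BOUND and UPPER_BOUND are the standard output columns of a Cortex forecast):

SELECT SERIES, TS, FORECAST, LOWER_BOUND, UPPER_BOUND
FROM My_Trucking_Forecast
ORDER BY SERIES, TS;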

We can see that for each day, the model has provided a forecast of the number of crossings at each location, with an upper and lower bound for the forecast. Given how we configured the model, we would expect 95% of our observations to fall within the upper and lower bounds.

In Part 2, we will create a simple Streamlit application to display the results of our model, and to compare the results of the model to what actually happened in January, 2024.


Peter Beck

Peter Beck is an analytics and data warehousing specialist based in Ottawa, Canada.