Published in Snowflake

Easiest ARIMA predictions in Snowflake with a Python UDF

Let’s find out the easiest way to get ARIMA predictions in Snowflake with a Python UDF, available right now in Public Preview. We’ll also see why December 2025 could be a significant date for Snowflake’s presence on Stack Overflow.

Past & Future projections with naive ARIMA(1,1,1): Quarterly Stack Overflow pageviews for BigQuery, Redshift, Snowflake, Databricks, and Synapse.

2022–09 update

Facebook Prophet now runs inside Snowflake — check the post for more.

Running pmdarima

Let’s get straight to the point. The following is a Python UDF that receives an array of numbers and returns an array of predictions, as forecasted by the ARIMA implementation in the pmdarima package:

create or replace function simple_arima(data array, predictions int)
returns variant
language python
runtime_version = 3.8
packages = ('pmdarima')
handler = 'udf'
as $$
from pmdarima.arima import ARIMA

def udf(data, predictions):
    return ARIMA(
        order=(1, 1, 1)
    ).fit(y=data).predict(predictions).tolist()
$$;

That’s it, and this is available right now in Public Preview for all Snowflake accounts. Let’s test it by asking for 5 forecasted points for a simple series:

select simple_arima([1,2,3,4,5], 5);

It’s a naive approach, but it works:

Simple ARIMA over a 1,2,3,4,5 series.

Alternative approaches

Before we look at the caveats of this method, let’s review some alternatives to the Python UDF:

  • Facebook Prophet as an external function: Before Python UDFs were available in Snowflake, I wrote about how to connect Prophet as an external function. Check my previous post.
  • Michel Adar’s presentation about ML in Snowflake at Summit 2022.

Naive ARIMA caveats

In the proposed UDF you can see that I simply called ARIMA(order=(1,1,1)). In that call, (1,1,1) are the ARIMA (p,d,q) parameters, which deserve way more attention than I’ve given them here. Finding the ideal values is an art, as is the seasonal_order (P,D,Q,s) that we totally ignored here.
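To make at least one of those parameters concrete: d is the number of differencing passes applied to the series before the AR and MA terms are fit. A minimal pure-Python sketch, separate from the UDF itself:

```python
# Illustration of ARIMA's "d" parameter: the series is differenced
# d times before fitting, to remove trends and make it stationary.
def difference(series, d=1):
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A linear trend becomes a constant after one differencing pass:
print(difference([1, 2, 3, 4, 5]))  # [1, 1, 1, 1]
```

This is why d=1 works well for series with a roughly linear trend, like the 1,2,3,4,5 test above.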

The package pmdarima also offers the class AutoARIMA, which automates finding the best parameters for an ARIMA model. There’s no room to cover these important details in this post, but they deserve your attention as you go deeper into ARIMA and forecasting.

AutoARIMA docs in pmdarima
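As a sketch of what this could look like in Snowflake, here’s an untested variant of the UDF above (the name simple_auto_arima is mine) that lets pmdarima’s auto_arima search for the parameters:

```sql
create or replace function simple_auto_arima(data array, predictions int)
returns variant
language python
runtime_version = 3.8
packages = ('pmdarima')
handler = 'udf'
as $$
from pmdarima import auto_arima

def udf(data, predictions):
    # auto_arima searches for a good (p, d, q) instead of
    # the hardcoded (1, 1, 1) used in simple_arima
    return auto_arima(y=data).predict(predictions).tolist()
$$;
```

Expect it to run noticeably slower than simple_arima, since it fits many candidate models per call.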

Let’s play with predictions: Snowflake in Stack Overflow

I love Stack Overflow, and I love tracking how many pageviews Snowflake is getting compared to other products in the space. Coincidentally, this quarter Snowflake finally caught up to Redshift: both are getting an almost identical number of pageviews per quarter on Stack Overflow. We can see Databricks slightly behind, and Azure Synapse at the bottom, as visualized in Snowflake’s Snowsight:

Quarterly Stack Overflow pageviews for BigQuery, Redshift, Snowflake, Databricks, and Synapse since 2017

I’m not surprised that BigQuery has way more quarterly pageviews than the rest — I was part of that team for 8 years. Since those days I’ve loved seeing Snowflake grow (I switched teams almost 2 years ago). I’m proud of my work with BigQuery, and I’m personally invested in making Snowflake the best alternative for you — now, and in the future.

So the natural question is: When will Snowflake catch up to the quarterly pageviews that BigQuery gets? BigQuery had the head-start, but the chart above shows that Snowflake has huge momentum. Let’s get the ARIMA projections, using the basic UDF we wrote:

select *, simple_arima(data, 16) predictions
from (
  select tag
    , array_agg(q_views::int) within group (order by quarter) data
    , array_agg(quarter) within group (order by quarter) quarters
  from tag_stats
  group by 1
)

Past & Future projections with naive ARIMA(1,1,1): Quarterly Stack Overflow pageviews for BigQuery, Redshift, Snowflake, Databricks, and Synapse.

This chart shows that naive ARIMA can feel the momentum too: it says December 2025 is the quarter when Snowflake will get more quarterly pageviews than BigQuery on Stack Overflow.

Note that unadjusted ARIMA usually underestimates exponential growth curves, so the date could be even earlier. But this is all speculation; only time will tell what happens next.
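A common mitigation, not used in this post, is to fit the model on the log of the series and exponentiate the predictions: multiplicative growth becomes an additive trend, which plain ARIMA handles much better. A minimal sketch of just the transform, with hypothetical helper names:

```python
import math

# Exponential growth is additive in log space; fit ARIMA on the
# logged series, then exponentiate to get back to original units.
def to_log(series):
    return [math.log(v) for v in series]

def from_log(series):
    return [math.exp(v) for v in series]

# Round-trip check on a series that doubles every step:
print([round(v) for v in from_log(to_log([10, 20, 40, 80]))])  # [10, 20, 40, 80]
```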

SQL for visualizing with Snowsight

Since the Python UDF proposed here receives arrays and returns arrays, some work is needed to flatten these arrays into rows that Snowsight can visualize:

with x as (
  select *, simple_arima(data, 16) predictions
  from (
    select tag
      , array_agg(q_views::int) within group (order by quarter) data
      , array_agg(quarter) within group (order by quarter) quarters
    from tag_stats
    group by 1
  )
)
select tag, y.value::int v, quarters[y.index]::date quarter
from x, table(flatten(data)) y
union all
select tag
  , y.value::int
  , dateadd(
      quarter
      , 1 + y.index
      , quarters[array_size(quarters) - 1]::date)
from x, table(flatten(predictions)) y

A better approach would be to create a UDF that receives and returns rows — but let’s call that homework for now.

Next steps

  • If you are serious about ARIMA, learn how to find the right parameters with Auto-ARIMA.
  • Take this challenge: Perform a hyperparameter optimization driving the UDF from a stored procedure, for better performance than Auto-ARIMA.
  • Rewrite the Python UDF to receive and return rows.
  • Share your results!

Auto-ARIMA adjusted predictions

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.