Published in Snowflake

# 2022–09 update

Facebook Prophet now runs inside Snowflake — check the post for more.

# Running pmdarima

Let’s get straight to the point. The following is a Python UDF that receives an array of numbers, and returns an array of predictions — as forecasted by ARIMA provided in the package `pmdarima`:

```sql
create or replace function simple_arima(data array, predictions int)
returns variant
language python
runtime_version = 3.8
packages = ('pmdarima')
handler = 'udf'
as $$
from pmdarima.arima import ARIMA

def udf(data, predictions):
    return ARIMA(
        order=(1,1,1)
    ).fit(y=data).predict(predictions).tolist()
$$;
```

That’s it — and this is available right now in Public Preview for all Snowflake accounts. Let’s test it by asking for 5 forecasted points for a simple series:

```sql
select simple_arima([1,2,3,4,5], 5);
```

It’s a naive approach, but it works.

# Alternative approaches

Before we look at the caveats of this method, let’s review an alternative to the Python UDF above:

• Facebook Prophet as an external function: Before having Python UDFs available in Snowflake, I wrote how to connect Prophet as an external function. Check my previous post.

# Naive ARIMA caveats

In the proposed UDF you can see that I simply called `ARIMA(order=(1,1,1))`. In that call, `(1,1,1)` are the ARIMA `(p,d,q)` parameters, which deserve way more attention than I’ve given them here. Finding the ideal values is an art, as is choosing the seasonal_order `(P,D,Q,s)` that we totally ignored here.

The package `pmdarima` also offers the class `AutoARIMA`, which automates finding the best parameters for an ARIMA model. There’s no room to cover these important details in this post, but they deserve your attention as you go deeper into ARIMA and forecasting.

# Let’s play with predictions: Snowflake in Stack Overflow

I love Stack Overflow, and tracking how many pageviews Snowflake is getting — compared to other products in the space. Coincidentally, this quarter Snowflake finally caught up to Redshift: both are getting an almost identical number of pageviews per quarter on Stack Overflow. We can see Databricks slightly behind, and Azure Synapse at the bottom, as visualized in Snowflake’s Snowsight:

I’m not surprised that BigQuery has way more quarterly pageviews than the rest — I was part of that team for 8 years. Since those days I’ve loved seeing Snowflake grow (I switched teams almost 2 years ago). I’m proud of my work with BigQuery, and I’m personally invested in making Snowflake the best alternative for you — now, and in the future.

So the natural question is: When will Snowflake catch up to the quarterly pageviews that BigQuery gets? BigQuery had the head-start, but the chart above shows that Snowflake has huge momentum. Let’s get the ARIMA projections, using the basic UDF we wrote:

```sql
select *, simple_arima(data, 16) predictions
from (
   select tag
      , array_agg(q_views::int) within group (order by quarter) data
      , array_agg(quarter) within group (order by quarter) quarters
   from tag_stats
   group by 1
)
```

This chart shows that naive ARIMA can feel the momentum too: it predicts that December 2025 is the quarter when Snowflake will get more quarterly pageviews than BigQuery on Stack Overflow.

Note that unadjusted ARIMA usually underestimates exponential growth curves, so the date could be even earlier. But this is all speculation; only time will tell what happens next.
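To make that “catch up” date concrete: given the two forecast arrays and the last historical quarter, finding the crossover is a small exercise in quarter arithmetic. A minimal stdlib sketch — the helper names and the sample numbers below are made up for illustration:

```python
from datetime import date

def add_quarters(d, n):
    # advance a quarter-start date by n calendar quarters
    total = d.month - 1 + 3 * n
    return date(d.year + total // 12, total % 12 + 1, 1)

def crossover_quarter(last_quarter, preds_challenger, preds_leader):
    """First forecast quarter where the challenger overtakes the leader,
    or None if it never does within the forecast horizon."""
    for i, (a, b) in enumerate(zip(preds_challenger, preds_leader)):
        if a > b:
            return add_quarters(last_quarter, i + 1)
    return None

# toy numbers: challenger grows faster, overtakes at the 3rd forecast point
print(crossover_quarter(date(2022, 7, 1), [1, 2, 5, 9], [3, 4, 4, 5]))
```

The same idea would run over the 16 forecasted points from `simple_arima(data, 16)` for the `snowflake` and `google-bigquery` tags.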

# SQL for visualizing with Snowsight

Since the Python UDF proposed here receives arrays and returns arrays, some work is needed to flatten these arrays into rows that Snowsight can visualize:

```sql
with x as (
  select *, simple_arima(data, 16) predictions
  from (
    select tag
      , array_agg(q_views::int) within group (order by quarter) data
      , array_agg(quarter) within group (order by quarter) quarters
    from tag_stats
    group by 1
  )
)

select tag, y.value::int v, quarters[y.index]::date quarter
from x, table(flatten(data)) y

union all

select tag
  , y.value::int
  , dateadd(
      quarter
      , 1+y.index
      , quarters[array_size(quarters)-1]::date)
from x, table(flatten(predictions)) y
```
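For reference, the flatten-and-dateadd logic in that query is easy to mirror in plain Python — a sketch with hypothetical helper names, pairing each prediction with a quarter offset past the last historical label:

```python
from datetime import date

def add_quarters(d, n):
    # advance a quarter-start date by n calendar quarters (like dateadd)
    total = d.month - 1 + 3 * n
    return date(d.year + total // 12, total % 12 + 1, 1)

def flatten_forecast(tag, data, quarters, predictions):
    # historical points keep their own quarter labels
    rows = [(tag, v, q) for v, q in zip(data, quarters)]
    # forecasted points continue 1, 2, ... quarters past the last label
    last = quarters[-1]
    rows += [(tag, v, add_quarters(last, i + 1))
             for i, v in enumerate(predictions)]
    return rows
```

Each returned tuple corresponds to one row that Snowsight can chart: `(tag, value, quarter)`.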

A better approach would be to create a UDF that receives and returns rows — but let’s call that homework for now.

# Next steps

• If you are serious about ARIMA, learn how to find the right parameters with Auto-ARIMA.
• Take this challenge: Perform a hyperparameter optimization driving the UDF from a stored procedure, for better performance than Auto-ARIMA.
• Rewrite the Python UDF to receive and return rows.
• Share your results!

# Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
