Snowflake
Published in

Snowflake

Generating all the Holidays in SQL — with a Python UDTF

“I want a table with all the Holidays” is a typical request in data analytics — but figuring this out is not as easy as it sounds. For which country, for which sub-region, for which timeframe — are only the initial questions. Then come some technical challenges, that we’ll figure out in this post with Snowflake and Python — including how to update the Anaconda pre-installed packages in Snowpark.

US Holidays in 2024, source ‘python-holidays’
US Holidays in 2022, source ‘python-holidays’. Note the “Observed” dates for Christmas and Juneteenth.
US (California) Holidays in 2022, source ‘python-holidays’. Note some California specific dates.
US (Alabama) Holidays in 2022, source ‘python-holidays’.
select *
from table(holidays('US', [2024]));
select *
from table(holidays('US', [2022]));
select *
from table(holidays('US-CA', [2022]));
select *
from table(holidays('US-AL', [2022]));

Installing python-holidays in Snowflake

First attempt: Outdated library

It should be straightforward to get the python-holidays package installed with Anaconda in Snowflake, but it’s not.

create or replace function debug_holidays()
returns string
language python
runtime_version = '3.8'
packages = ('holidays')
handler='x'
as $$
import holidays
def x():
return holidays.__version__
$$;
select debug_holidays();
-- 0.11.3.1

Second attempt: Install your own package

If Anaconda doesn’t provide you the desired package in Snowflake, you can bring your own. In this case we’re going to clone the latest python-holidays from GitHub, compress it into .zip file, and then we can bring it into a Snowflake stage with snowsql:

$ git clone https://github.com/dr-prodigy/python-holidays.git$ cd python-holidays$ zip -r holidays holidays$ snowsql> put 'file://holidays.zip' @~/ auto_compress=false overwrite=true+--------------+--------------+-------------+-------------+
| source | target | source_size | target_size |
|--------------+--------------+-------------+-------------+
| holidays.zip | holidays.zip | 174097 | 174112 |
+--------------+--------------+-------------+-------------+
create or replace function debug_holidays()
returns string
language python
runtime_version = '3.8'
imports=('@~/holidays.zip')
handler='x'
as $$
import holidays
def x():
return holidays.__version__
$$;
select debug_holidays();# ModuleNotFoundError: No module named 'dateutil' in function DEBUG_HOLIDAYS with handler x

Third attempt: Python dependencies

The previous message says, that dateutil is missing, and that’s easy to fix with the modules provided by Anaconda:

create or replace function debug_holidays()
returns string
language python
runtime_version = '3.8'
packages = ('python-dateutil')
imports=('@~/holidays.zip')
handler='x'
as $$
import holidays
def x():
return holidays.__version__
$$;
select debug_holidays();-- ModuleNotFoundError: No module named 'hijri_converter' in function DEBUG_HOLIDAYS with handler
create or replace function debug_holidays()
returns string
language python
runtime_version = '3.8'
packages = ('python-dateutil', 'hijri-converter', 'convertdate', 'korean_lunar_calendar')
imports=('@~/holidays.zip')
handler='x'
as $$
import holidays
def x():
return holidays.__version__
$$;
select debug_holidays();# 0.14.2

Bonus attempt: Import precedences

The above solutions solves our dependencies problem, but I wanted to share this finding. What happens if we mix both our own provided packages with Anaconda’s?

create or replace function debug_holidays()
returns string
language python
runtime_version = '3.8'
packages = ('holidays')
imports=('@~/holidays.zip')

handler='x'
as $$
import holidays
def x():
return holidays.__version__
$$;
select debug_holidays();
-- 0.14.2
  • Importing our .zip version of python-holidays takes precedence over the specific one brought for Anaconda, so we get the latest version running.

Should we use these hacks?

Since Python UDFs in Snowflake are in Public Preview, I expect python-holidays to work without the need of knowing these workarounds in the future. However, these tricks will still be useful whenever you run into a similar problem or want to provide your own patches to existing packages.

Designing a Python table UDF to generate a table

This table UDF can generate a list of Holidays for any country, sub-region, and range of years:

create or replace function holidays(
country_region string, years array
)
returns table(day date, holiday string)
language python
runtime_version = '3.8'
imports=('@~/holidays.zip')
packages = ('python-dateutil', 'hijri-converter', 'convertdate', 'korean_lunar_calendar')
handler='X'
as
$$
from datetime import date
import holidays
class X:
def __init__(self):
pass
def process(self, country_region, years):
parsed_country_region = country_region.split('-')
country = parsed_country_region[0]
region = (
parsed_country_region[1]
if len(parsed_country_region) > 1
else None
)
range_years = range(min(years), 1 + max(years))
holidays_dict = holidays.country_holidays(
country
, subdiv=region
, years=range_years
)
return(((k, v) for k, v in holidays_dict.items()))
def end_partition(self):
pass
$$;

select * from table(holidays('US', [2022]));
select * from table(holidays('US-CA', [2023]));
select * from table(holidays('US', [2010, 2030]));
select * from table(holidays('Chile', [2007]));
select * from table(holidays('UnitedKingdom', [2007]));
  • Find the full list of covered countries and sub-regions on GitHub.
  • python-holidays wants the sub-region as an additional parameter, but for SQL usability I decided to mix both with a - (like in ‘US-CA’), and split them manually in the Python UDF.
  • python-holidays wants each year to generate in the input array, but for SQL usability I decided to look at the min/max year in the array and generate a range of years in between.
  • python-holidays also supports a Holiday calendar for the European Central Bank and the New York Stock Exchange — but I have not added support for them on the UDF above (yet).
  • At the end of process() the UDF returns a generator that Snowflake will use to output the desired rows.
  • The most efficient use of the above tabular UDF would be to create a table with its results and then use the static results thereafter.

A simple UDF: is_holiday()

Instead of generating a table of Holidays, we can also create a UDF that simply returns if a date is a Holiday or not:

create or replace function is_holiday(
date date, country_region string
)
returns boolean
language python
runtime_version = '3.8'
packages = ('holidays')
imports=('@~/holidays.zip')
handler='x'
as $$
import holidays
def x(date, country_region):
parsed_country_region = country_region.split('-')
country = parsed_country_region[0]
region = (
parsed_country_region[1]
if len(parsed_country_region) > 1
else None
)
return date in holidays.country_holidays(country, subdiv=region)
$$;
select is_holiday('2021-01-01', 'US')
# TRUE

Next steps

Check these posts from Dash Desai and Miles Adkins with details on how to use GitHub Actions to deploy custom Python code to Snowflake — and solve dependency issues:

Want more?

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.