How to use Python UDFs for Geospatial use cases in Snowflake

“About 80% of data is geographic”, or so goes an urban legend popular in the GIS community. It’s easy to believe that number just by looking at how many Snowflake customers use a spatial data type (Geography or Geometry) or store their geospatial coordinates in latitude and longitude fields. We recently looked — it’s more than two-thirds of our customers!

Snowflake has been offering the Geography data type since 2020 and recently (August 2022) introduced Geometry in Public Preview. For each of those types, Snowflake offers 50+ native functions that follow the OGC specifications and cover a large variety of use cases. But what if users need a function not supported out of the box? An easy solution to that are Python user-defined functions that can be called as though they were regular SQL functions. Did you know you can use our Snowpark development and execution environment to create and run Python UDFs easily?

What are Snowflake Python UDFs and UDTFs?

A Python UDF is a user-defined function written in Python instead of SQL that can be called from a Snowflake query or script in the same way a built-in function can. Python UDFs can be scalar functions, meaning for each row passed to the UDF, the UDF returns a single value. They can be vectorized UDFs that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. Or they can be user-defined table functions (UDTFs) that return tabular results.

Inside the Python UDF you can use many of the Geospatial libraries you might be already familiar with. Here are some of the libraries that Snowflake supports today:

Shapely — for manipulating and analyzing geometric objects in the Cartesian plane.

GDAL — arguably the most popular library for reading and writing raster and vector geospatial data formats.

Fiona — a library that reads and writes spatial data files.

PyPROJ — Python interface to PROJ, a library for performing conversions between cartographic projections.

Geopandas — Geographic pandas extensions.

H3 — wrapper for the H3 core library.

You can check the full list of libraries supported out of the box in the Snowflake Anaconda Channel. If you need a library not listed there, you can bring it in via stage imports.

What can you do with Python UDFs?

Let’s look at some scenarios where users could benefit from Python UDFs. While some of these functions might be released as part of our native Geospatial library in the future, you can get things done today by following one of the following approaches.

Create extended geo functions. Users who need an ST_ function that is not on the currently supported list can create it by leveraging UDFs. Here is the code sample for a function that does aggregated union and uses the Shapely library:

CREATE OR REPLACE FUNCTION PY_UNION_AGG(g1 array)
returns geography
language python
runtime_version = 3.8
packages = ('shapely')
handler = 'udf'
AS $$
from shapely.ops import unary_union
from shapely.geometry import shape, mapping
def udf(g1):
shape_union = unary_union([shape(i) for i in g1])
return mapping(shape_union)
$$;

PY_UNION_AGG takes as a parameter an array of JSON values. You can create it like this:

SELECT PY_UNION_AGG(ARRAY_AGG(st_asgeojson(<COLUMN_NAME>))) FROM <TABLE_NAME>;

Here is an example of using the PY_UNION_AGG UDF:

Repair invalid spatial objects. Today, Snowflake allows storing shapes with self-intersections or spikes (this feature is in preview). After ingesting an invalid spatial object, users might want to fix it. The Shapely library can help with it, and here is an example of a vectorized UDF with batch size = 1000 (rows):

CREATE OR REPLACE FUNCTION PY_MAKEVALID(geo GEOGRAPHY)
returns geography
language python
runtime_version = 3.8
packages = ('shapely', 'pandas')
handler = 'udf'
AS $$
import shapely
import pandas
from _snowflake import vectorized
from shapely.geometry import shape, mapping
from shapely.validation import make_valid
@vectorized(input=pandas.DataFrame, max_batch_size=1000)
def udf(df):
valid_df = df[0].apply(lambda x: mapping(make_valid(shape(x))))
return valid_df
$$;

Here is an example of how to call this function:

select py_makevalid(to_geography('POLYGON((-98.854908 39.518387,
-83.3978497 38.744473,-104.649423 31.751698,
-84.254971 31.570112,-98.854908 39.518387))', True)) as repaired_shape;

Visually, the shapes look the same before and after repair. What changes is how they are defined, namely polygons, after repairing, often become multipolygons.

Reading shapefiles. Although the Shapefile format was introduced in the early 1990s, today, it’s still widely used to store spatial data. If users need to import spatial data from Shapefile-formatted files to Snowflake, they can use Python UDTFs and the Fiona library. Here are a few steps users need to take to load data from this type of data source.

  1. Upload the zip archive with Shapefiles files to the pre-created stage
PUT file:///Users/<LOCAL_PATH>/archive.zip @geostage AUTO_COMPRESS = FALSE

2. Create a function to read Shapefiles from the stage and convert it to a table. Note that the UDF contains the path to the zip file you uploaded in the previous step.

CREATE OR REPLACE FUNCTION PY_LOAD_SHAPEFILE(PATH_TO_FILE string)
returns table (wkb binary, properties object)
language python
runtime_version = 3.8
imports=('@geostage/archive.zip')
packages = ('fiona', 'shapely')
handler = 'ShapeFileReader'
AS $$
import fiona
from shapely.geometry import shape
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
class ShapeFileReader:
def process(self, PATH_TO_FILE: str):
shapefile = fiona.open(f"zip://{import_dir}/archive.zip/{PATH_TO_FILE}")
for record in shapefile:
yield ((shape(record['geometry']).wkb, dict(record['properties'])))
$$;

Below is a SQL command you need to run to import spatial data from the staged Shapefile:

SELECT * FROM TABLE(PY_LOAD_SHAPEFILE('<PATH_TO_FILE>'));

Similarly, you can read other data formats such as MID/MIF/TAB/etc. Here is an example of UDF for reading KML:

CREATE OR REPLACE FUNCTION PY_LOAD_KML(PATH_TO_FILE string)
returns table (wkb binary, properties object)
language python
runtime_version = 3.8
imports=('@geostage/archive.zip')
packages = ('fiona', 'shapely')
handler = 'KMLReader'
AS $$
import fiona
from shapely.geometry import shape
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class KMLReader:
def process(self, PATH_TO_FILE: str):
fiona.drvsupport.supported_drivers['libkml'] = 'rw'
fiona.drvsupport.supported_drivers['LIBKML'] = 'rw'
shapefile = fiona.open(f"zip://{import_dir}/archive.zip/{PATH_TO_FILE}")
for record in shapefile:
yield ((shape(record['geometry']).wkb, dict(record['properties'])))
$$;

Note that since we used UDTF, both functions, PY_LOAD_SHAPEFILE and PY_LOAD_KML, return a table and not a singular value.

Transformations between spatial reference systems. In August 2022, Snowflake introduced support of the GEOMETRY data type that allows users to use local coordinate systems. For transformations between coordinate systems, they can write a UDF that uses a Python version of the PyPROJ library:

CREATE OR REPLACE FUNCTION PY_TRANSFORM(g1 GEOMETRY, srid_from NUMBER, srid_to NUMBER)
returns geometry
language python
runtime_version = 3.8
packages = ('pyproj','shapely')
handler = 'udf'
AS $$
import pyproj
import shapely
from shapely.ops import transform
from shapely.geometry import shape
from shapely import wkb, wkt
def udf(g1, srid_from, srid_to):
source_srid = pyproj.CRS(srid_from)
target_srid = pyproj.CRS(srid_to)
project = pyproj.Transformer.from_crs(source_srid, target_srid, always_xy=True).transform
shape_wgs = shape(g1)
shape_tr = transform(project, shape_wgs)
return shapely.geometry.mapping(shape_tr)
$$;

When calling this function, you should set a new SRID via the Snowflake-native ST_SETSRID function:

SELECT ST_SETSRID(PY_TRANSFORM(to_geometry('POLYGON((-97.266975 37.9387717,
-100.042662 34.337848,-88.284724 34.205826,
-97.266975 37.938771))'), 4326, 3443), 3443) as transformed_shape;

In this post, we showed just a few examples of how Python UDFs can be used for geospatial use cases in Snowflake. Please review the Snowflake documentation on Python UDFs and Geospatial Data Types for more information.

--

--