Load Shapefiles Into Snowflake — The Easy Way

Dynamic file access unlocks new use cases

Background

A few months ago, I published an article about leveraging shapefiles in Snowflake. If you are not familiar with shapefiles, it’s a file format developed by Esri that is frequently used to store vector geospatial features, such as points, lines, and polygons. Well, things move so fast around here that it is time to update the article. What’s changed? In a nutshell, Snowflake’s new dynamic file access feature enables you to simply drop your shapefile into a Snowflake stage, create a Python UDF using Snowpark, run the UDF and — voila! — your shapefile is converted into a Snowflake table with geospatial features ready to be used for queries.

Note #1: The dynamic file access feature is now generally available. To learn more about this feature, check out this article by my colleague Jeremiah Hansen.

Note #2: Credit for the UDF code goes entirely to my colleague Oleksii Bielov and Snowflake’s amazing geospatial team.

Here’s how to leverage shapefiles with this new approach, step by step:

  1. Upload your shapefile in the form of a .zip archive into a Snowflake stage. If you are using an internal stage, you can now simply load the file using the Snowflake UI — assuming the file is under 50MB in size. For larger files, run the PUT command from SnowSQL or a Python client to load the file into a stage.

For the sake of this example, I grabbed a shapefile of LA County airport noise contours from the City of Los Angeles geohub.
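A shapefile is really a bundle of sidecar files (.shp for geometry, .shx for the index, .dbf for attributes, and usually .prj for the spatial reference), and the UDF in the next step expects them zipped together. If your components are sitting loose on disk, a quick way to bundle them with Python’s standard library (the filenames below are illustrative, matching my example) is:

```python
import zipfile

# Component files of one shapefile; names are illustrative.
# The .shp holds geometry, .shx the index, .dbf the attributes,
# and .prj the spatial reference.
components = [
    "Airport_Noise_Contours.shp",
    "Airport_Noise_Contours.shx",
    "Airport_Noise_Contours.dbf",
    "Airport_Noise_Contours.prj",
]

def bundle_shapefile(zip_name: str, files: list[str]) -> None:
    # Write all component files into a single compressed archive,
    # ready to PUT into a Snowflake stage.
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f)
```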

2. Create a user-defined Python table function that accesses the file in the stage and outputs a table with two columns: wkb, containing geospatial information in well-known binary format, and properties, which contains a JSON object with all other metadata contained in the shapefile (e.g., IDs, area, and anything else that happens to be included).

CREATE OR REPLACE FUNCTION PY_LOAD_GEOFILE(PATH_TO_FILE string, FILENAME string)
returns table (wkb binary, properties object)
language python
runtime_version = 3.8
packages = ('fiona', 'shapely', 'snowflake-snowpark-python')
handler = 'GeoFileReader'
AS $$
from shapely.geometry import shape
from snowflake.snowpark.files import SnowflakeFile
from fiona.io import ZipMemoryFile

class GeoFileReader:
    def process(self, PATH_TO_FILE: str, filename: str):
        # Open the zipped archive from the stage, then read the
        # .shp file inside it with fiona.
        with SnowflakeFile.open(PATH_TO_FILE, 'rb') as f:
            with ZipMemoryFile(f) as zip:
                with zip.open(filename) as collection:
                    for record in collection:
                        # Skip records with no geometry; emit one row
                        # of (WKB geometry, properties dict) per record.
                        if record['geometry'] is not None:
                            yield (shape(record['geometry']).wkb, dict(record['properties']))
$$;
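The wkb column holds each geometry in well-known binary format: a byte-order flag, a geometry-type code, then the coordinates. As a rough sketch of the encoding (a hand-built little-endian WKB point, not actual Snowflake output):

```python
import struct

def wkb_point(lon: float, lat: float) -> bytes:
    # WKB point layout: 1 byte order flag (0x01 = little-endian),
    # uint32 geometry type (1 = Point), then two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, lon, lat)

def decode_wkb_point(wkb: bytes) -> tuple[float, float]:
    # Reverse of the above; only handles little-endian points.
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    assert byte_order == 1 and geom_type == 1
    return (x, y)
```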

3. Run the function:

-- extract a geometry for Los Angeles County airport noise contours
create or replace table GEODATA.PUBLIC.LAC_AIRPORT_NOISE_CONTOURS as
select
    properties,
    to_geography(wkb, True) as geography
from table(
    PY_LOAD_GEOFILE(
        build_scoped_file_url(@GEODATA.PUBLIC.GEO_STG, 'Airport_Noise_Contours.zip'),
        'Airport_Noise_Contours.shp'
    )
);

-- review results
select *
from GEODATA.PUBLIC.LAC_AIRPORT_NOISE_CONTOURS limit 100;
Result: shapefile ingested as a Snowflake table

4. (Optional) Parse extracted properties, using Snowflake’s simple dot notation. The available objects will be unique to your shapefile, so study the properties column to determine what you need.

--parse properties
select
properties:OBJECTID::string id,
properties:AIRPORT_NA::string airport,
properties:CLASS::string class,
geography
from GEODATA.PUBLIC.LAC_AIRPORT_NOISE_CONTOURS limit 100;
Result with parsed properties
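If it helps to reason about what the dot notation is doing, the properties column is just a key-value object, so `properties:KEY::string` is morally equivalent to a dictionary lookup plus a string cast. A minimal sketch (the keys mirror my example shapefile and will differ in yours):

```python
# One record's properties object, as the UDF would emit it.
# Keys come from my example shapefile; yours will differ.
properties = {"OBJECTID": 42, "AIRPORT_NA": "LAX", "CLASS": "65 CNEL"}

def get_str(props: dict, key: str) -> str:
    # Rough equivalent of properties:KEY::string in Snowflake.
    return str(props[key])
```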

And that’s it! Before you go, a few things to point out:

  • The PY_LOAD_GEOFILE function accepts two inputs: the location of the zipped shapefile archive and the name of the .shp file inside the archive. The location of the archive is passed in the form of a scoped file URL.
  • PY_LOAD_GEOFILE is a table UDF (rather, a UDTF) — so we use the table() wrapper to parse the results into columns and rows.
  • It’s important to be aware of the shapefile’s spatial reference. If the geographic coordinates are unprojected (i.e., representing locations on the Earth’s surface as points on a 3D sphere using the WGS84 geographic coordinate system), it is appropriate to parse them into Snowflake’s geography data type. In my example, the data is in WGS84, and parsed using the to_geography function. If the coordinates are projected (representing locations on the Earth’s surface using a two-dimensional Cartesian coordinate system), they should be parsed into the geometry data type using one of the geometry functions. To figure out whether the shapefile is projected or unprojected, you can look at the .prj file inside the zipped shapefile archive.
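One quick way to make that check without GIS tooling: the .prj file holds a WKT string that begins with GEOGCS for a geographic (unprojected) coordinate system or PROJCS for a projected one. A minimal sketch, with a truncated WGS84 WKT as the example input:

```python
def is_projected(prj_wkt: str) -> bool:
    # PROJCS => projected CRS, parse into GEOMETRY in Snowflake;
    # GEOGCS => geographic CRS, parse into GEOGRAPHY.
    return prj_wkt.lstrip().upper().startswith("PROJCS")

# Truncated WGS84 .prj contents, for illustration.
wgs84_prj = 'GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]]]'
```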

Bonus: Can I please have my coffee in peace?

I sometimes travel to LA for work. Now that I have the data on areas in the city likely impacted by the airport noise, I can be more picky about where I stop by for coffee. I can grab a free Starbucks locations dataset from the Snowflake Marketplace, courtesy of SafeGraph, and join it to the airport noise contours:

--spatially join Starbucks locations and airport noise contours
select distinct
st_within(st_point(sbx.longitude,sbx.latitude),geo.geography) TOO_NOISY,
sbx.LOCATION_NAME,
sbx.STREET_ADDRESS,
sbx.CITY,
sbx.OPEN_HOURS
from GEODATA.PUBLIC.LAC_AIRPORT_NOISE_CONTOURS geo,
STARBUCKS_LOCATIONS.PUBLIC.CORE_POI sbx
order by 1 desc
;
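Conceptually, ST_WITHIN for a point and a polygon is a point-in-polygon test. A toy ray-casting version over a simple lon/lat ring gives the flavor (it treats edges as straight lines in the plane, whereas Snowflake’s geography type correctly handles geodesic edges; the coordinates below are a made-up square near LAX):

```python
def point_in_ring(lon: float, lat: float, ring: list[tuple[float, float]]) -> bool:
    # Ray casting: shoot a ray from the point toward +x and count
    # how many ring edges it crosses; an odd count means "inside".
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Edge straddles the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A made-up square "noise contour" near LAX, for illustration only.
contour = [(-118.5, 33.9), (-118.3, 33.9), (-118.3, 34.0), (-118.5, 34.0)]
```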

Here’s where I may or may not go:

Thanks for reading this! Please feel free to leave feedback.


Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data