How to Leverage Esri Shapefiles in Snowflake

Image credit: Arlo James Barnes, public domain, via Wikimedia Commons

Update

Since this article was published, Snowflake has released new features that simplify the conversion of shapefiles. The new, simpler method is described in this new article.

TL;DR

Esri shapefiles can be manipulated with the Python GeoPandas package and uploaded into Snowflake for use in geospatial analysis, such as spatial joins, distance calculations and much more. Head to this repo for the script and sample data.

Introduction

The Esri shapefile is a data format frequently used to store geospatial vector data, such as points, lines, and polygons. Shapefiles are often used in specialized GIS software, such as ArcGIS, to understand spatial relationships between objects or to enrich non-geospatial data. For example, you can spatially join customers to service areas to understand the customer distribution.

As organizations adopt Snowflake, more and more of the granular data about customers and citizens moves to Snowflake for analytical processing. To leverage their existing investment in spatial data, analysts need to upload shapefiles into Snowflake so that the shapes can be used in geospatial analysis alongside the customer and citizen data, taking advantage of Snowflake's processing power while minimizing data movement.

In an earlier blog post, I explained how to geocode address data in Snowflake using external functions. In this post, I show how you can upload shapefiles into Snowflake and leverage them for geospatial analysis.

Geospatial Data in Snowflake

Snowflake supports two geospatial data types: GEOGRAPHY and GEOMETRY (the latter in public preview as of this writing). Supported input formats include Well-Known Text (WKT), Well-Known Binary (WKB), extended WKT/WKB, and GeoJSON. Shapefiles must be converted into one of these formats before they can be loaded into Snowflake. For small shapefiles, it is possible to simply convert them to GeoJSON, upload the file into a Snowflake stage, and copy it into a Snowflake table as a variant column. You can then flatten and parse the data to create one row per geographic feature (e.g., one row per polygon, line, or point). However, the variant data type has a size limit of 16 MB compressed, so this method will not work for larger shapefiles. One way to get around this limit is to preprocess the shapefile with the Python package GeoPandas and upload it as a Pandas dataframe using the Snowflake Python connector.
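
As a rough sketch, the small-shapefile route could look like the following (the table, stage, and file names are illustrative, and conn is a Snowflake Python connector session like the one created later in this post):

#Sketch of the small-shapefile route; DISTRICTS_RAW, MY_STAGE and districts.geojson are illustrative names
conn.cursor().execute("CREATE OR REPLACE TABLE DISTRICTS_RAW (V variant)")
conn.cursor().execute("PUT file://districts.geojson @MY_STAGE")
conn.cursor().execute("COPY INTO DISTRICTS_RAW \
    FROM @MY_STAGE/districts.geojson \
    FILE_FORMAT = (TYPE = 'JSON')")
#Flatten the GeoJSON FeatureCollection into one row per feature and parse each geometry
conn.cursor().execute("SELECT f.value:properties:DISTRICT::varchar AS DISTRICT, \
    TO_GEOGRAPHY(f.value:geometry) AS BOUNDARY \
    FROM DISTRICTS_RAW, LATERAL FLATTEN(input => V:features) f").fetchall()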

Sample Data

In this exercise, I am using a shapefile of California legislative districts obtained from the California State Geoportal. My objective is to spatially join these districts to a table of US cell towers in Snowflake so I can check for disparities in access to telecommunications across the state. The cell towers data is already in Snowflake and contains a geography column with each tower's point location.

Sample Cell Towers Data

Loading a Shapefile into Snowflake

This process involves the following steps:

Step 1: Prepare the shapefile in Python

First, use GeoPandas to load the shapefile in Python and (optionally) drop unneeded columns.

#Import packages
import geopandas
import pandas as pd
import json
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
#Open shapefile
shapefile=geopandas.read_file('CA_Legislative_Districts.zip')
#Drop unneeded columns
shapefile=shapefile[['ID','AREA','DISTRICT','POPULATION','geometry']]
shapefile.head()
Sample Shapefile Data viewed with GeoPandas

Step 2: Change the polygon projection to match that of the point data

Geospatial data is tricky because it can be expressed in many different projections, identified by EPSG codes. A projection can be thought of as a method for placing a coordinate from a sphere (the Earth) onto a 2-dimensional plane.
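
For example, the same physical location looks very different in two coordinate systems. Here is a quick illustration using pyproj, a library that GeoPandas itself builds on (the point chosen is illustrative):

#Illustration: convert one Web Mercator (EPSG:3857) coordinate to WGS84 (EPSG:4326)
from pyproj import Transformer
transformer = Transformer.from_crs(3857, 4326, always_xy=True)
#A point near Sacramento expressed in meters becomes degrees of longitude/latitude
print(transformer.transform(-13525000, 4666000))  #approximately (-121.5, 38.7)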

The Snowflake GEOGRAPHY data type expects the coordinates to be in EPSG:4326, which corresponds to the WGS84 spherical (unprojected) coordinate system. For a spatial join to work correctly, both sets of coordinates must use the same projection. The Snowflake GEOMETRY data type enables you to use any projection, but changing projections in Snowflake is not yet supported.

Since the cell towers table uses the geography data type to store the towers’ coordinates, and therefore uses EPSG:4326, we need to ensure that the shapefile data also uses EPSG:4326 when it is loaded in Snowflake. We can use GeoPandas to check and change the projection of the shapefile data to match that of the cell towers.

#Check shapefile projection
shapefile.crs

We can see that the projection is different from EPSG:4326, so we convert it to EPSG:4326:

#Convert shapefile projection to WGS84 (unprojected)
shapefile_wgs84=shapefile.to_crs(4326)
shapefile_wgs84.crs

Note that it would not be necessary to change the projection of the shapefile if you were loading it into the GEOMETRY data type in Snowflake. The key is to ensure that all data involved in a spatial query uses the same data type and projection.
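
As a sketch of that GEOMETRY alternative (assuming a connection conn as created in Step 4; the coordinates and SRID shown are illustrative), TO_GEOMETRY can tag coordinates with their original projection instead of reprojecting them first:

#Sketch: keep the original projection by tagging the geometry with its SRID (illustrative values)
conn.cursor().execute("SELECT TO_GEOMETRY('POINT(-13525000 4666000)', 3857)").fetchone()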

Step 3: Convert the GeoPandas dataframe to a Pandas dataframe

We will be loading the shapefile into Snowflake using the Snowflake Python connector, which expects a Pandas dataframe, so we need to convert our GeoPandas dataframe to a Pandas dataframe. In my testing, I also found that the connector loads the shapefile's geometry column as a string and expects the column name to be uppercase. These minor tweaks are included in the snippet below. For clarity, we also rename the column containing the polygon coordinates to POLYGON_COORDINATES:

#Convert to pandas dataframe and convert and rename the geometry column
df = pd.DataFrame(shapefile_wgs84)
df=df.astype({'geometry':'str'}).rename(columns={'geometry':'POLYGON_COORDINATES'})
df.info()

The shapefile data is now ready to be loaded into Snowflake.

Step 4: Connect to Snowflake and Create a Table

Next, we connect to our Snowflake account and create a database, schema and table for storing the shapefile data. Notice that initially we create a temporary table (which will disappear when the Snowflake session ends) and use the varchar type for the POLYGON_COORDINATES column.

#Connect to Snowflake
with open('my_account_auth.json') as f:
    data = json.load(f)
username = data['username']
password = data['password']
account = data['account']
conn = snowflake.connector.connect(
    user=username,
    password=password,
    account=account)
#Set up context and create a table to load the data
conn.cursor().execute("USE WAREHOUSE ADHOC")
conn.cursor().execute("CREATE DATABASE IF NOT EXISTS GEO") conn.cursor().execute("USE DATABASE GEO")
conn.cursor().execute("CREATE OR REPLACE SCHEMA GEO_BOUNDARIES")
conn.cursor().execute("USE SCHEMA GEO_BOUNDARIES")
conn.cursor().execute("CREATE OR REPLACE TEMPORARY TABLE \
CA_LEGISLATIVE_DISTRICTS_TMP \
(ID number, AREA float, \
DISTRICT varchar, \
POPULATION number, \
POLYGON_COORDINATES varchar)").fetchone()

Step 5: Load Data into Snowflake

Next we use the write_pandas function to copy the Pandas dataframe into a Snowflake table.

#Load data into Snowflake
success, nchunks, nrows, _ = write_pandas(conn, df, 'CA_LEGISLATIVE_DISTRICTS_TMP')
print(success, nchunks, nrows)

Step 6: Convert the POLYGON_COORDINATES column into the geography data type

Finally, we create a permanent table with the CA legislative district boundaries, ready to use for geospatial analysis:

#Convert boundaries to geography data type and save as a permanent table
conn.cursor().execute("CREATE OR REPLACE TABLE \
CA_LEGISLATIVE_DISTRICTS \
AS SELECT \
ID, \
AREA, \
DISTRICT, \
POPULATION, \
TO_GEOGRAPHY(POLYGON_COORDINATES) \
as DISTRICT_BOUNDARY \
FROM CA_LEGISLATIVE_DISTRICTS_TMP").fetchone()

Looking at a sample of this data in Snowflake, we can confirm that it is formatted as expected.

Shapefile data in Snowflake
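
A quick sanity check can also be run from Python (a sketch; ST_NPOINTS returns the number of vertices in each boundary):

#Sanity check: count the vertices of each district boundary
conn.cursor().execute("SELECT ID, DISTRICT, POPULATION, \
    ST_NPOINTS(DISTRICT_BOUNDARY) AS NUM_VERTICES \
    FROM CA_LEGISLATIVE_DISTRICTS LIMIT 5").fetchall()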

Step 7: Run a sample spatial query

We can finally have some fun and run a spatial analysis. Let’s see how our legislative districts differ in terms of the number of cell phone towers per 1,000 population.

#Sample query: Which legislative districts have the highest and the lowest number of cell towers per 1,000 population?
conn.cursor().execute( "SELECT \
LD.DISTRICT, \
LD.POPULATION, \
COUNT(*) TOWER_CNT, \
ROUND((TOWER_CNT/POPULATION)*1000,0) \
TOWERS_PER_1000 \
FROM GEO.TELCO.CELL_TOWERS_US CT, \
GEO.GEO_BOUNDARIES.CA_LEGISLATIVE_DISTRICTS \
LD \
WHERE \
ST_CONTAINS(LD.DISTRICT_BOUNDARY,CT.GEO) \
GROUP BY 1,2 \
ORDER BY 4 DESC;").fetchall()

Visualizing this same query result in Snowsight, we can see that the number of towers per 1,000 residents varies between 1 and 14. Although all districts have roughly the same population, District 57 has only one tower per 1,000 residents:

Putting It All Together

#Import packages
import geopandas
import pandas as pd
import json
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
#Open shapefile
shapefile=geopandas.read_file('CA_Legislative_Districts.zip')
#Drop unneeded columns
shapefile=shapefile[['ID','AREA','DISTRICT','POPULATION','geometry']]
shapefile.head()
#Check shapefile projection
shapefile.crs
#Convert shapefile projection to WGS84 (unprojected)
shapefile_wgs84=shapefile.to_crs(4326)
shapefile_wgs84.crs
#Convert to pandas dataframe and convert and rename the geometry column
df = pd.DataFrame(shapefile_wgs84)
df=df.astype({'geometry':'str'}).rename(columns={'geometry':'POLYGON_COORDINATES'})
df.info()
#Connect to Snowflake
with open('my_account_auth.json') as f:
    data = json.load(f)
username = data['username']
password = data['password']
account = data['account']
conn = snowflake.connector.connect(
    user=username,
    password=password,
    account=account)
#Set up context and create a table to load the data
conn.cursor().execute("USE WAREHOUSE ADHOC")
conn.cursor().execute("CREATE DATABASE IF NOT EXISTS GEO") conn.cursor().execute("USE DATABASE GEO")
conn.cursor().execute("CREATE OR REPLACE SCHEMA GEO_BOUNDARIES")
conn.cursor().execute("USE SCHEMA GEO_BOUNDARIES")
conn.cursor().execute("CREATE OR REPLACE TEMPORARY TABLE \
CA_LEGISLATIVE_DISTRICTS_TMP \
(ID number, AREA float, \
DISTRICT varchar, \
POPULATION number, \
POLYGON_COORDINATES varchar)").fetchone()
#Load data into Snowflake
success, nchunks, nrows, _ = write_pandas(conn, df, 'CA_LEGISLATIVE_DISTRICTS_TMP')
print(success, nchunks, nrows)
#Convert boundaries to geography data type and save as a permanent table
conn.cursor().execute("CREATE OR REPLACE TABLE \
CA_LEGISLATIVE_DISTRICTS \
AS SELECT \
ID, \
AREA, \
DISTRICT, \
POPULATION, \
TO_GEOGRAPHY(POLYGON_COORDINATES) \
as DISTRICT_BOUNDARY \
FROM CA_LEGISLATIVE_DISTRICTS_TMP").fetchone()
#Sample query: Which legislative districts have the highest and the lowest number of cell towers per 1,000 population?
conn.cursor().execute( "SELECT \
LD.DISTRICT, \
LD.POPULATION, \
COUNT(*) TOWER_CNT, \
ROUND((TOWER_CNT/POPULATION)*1000,0) \
TOWERS_PER_1000 \
FROM GEO.TELCO.CELL_TOWERS_US CT, \
GEO.GEO_BOUNDARIES.CA_LEGISLATIVE_DISTRICTS \
LD \
WHERE \
ST_CONTAINS(LD.DISTRICT_BOUNDARY,CT.GEO) \
GROUP BY 1,2 \
ORDER BY 4 DESC;").fetchall()

Conclusion

With some initial preprocessing, shapefiles can be put to work inside Snowflake. Once your shapefiles are loaded, you can run your spatial analyses directly in Snowflake without the need to move the data around.


Daria Rostovtseva

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data.