Geoclustering with Snowflake and CARTO’s shared functions

This example explores Snowflake’s geospatial capabilities, and how to enhance them using CARTO shared functions in the Snowflake Data Marketplace.

Felipe Hoffa
Sep 24 · 5 min read

The Data Cloud not only allows you to share data, but also logic. Let’s check out the power of this by playing with GIS UDFs shared by CARTO.

Watch on YouTube

Let’s start by plotting 1,000 random points around Italy:

1,000 random points around Italy. As expected, they look like a boot.

Getting these random points was easy with a SQL query:

select sfcarto.random.st_generatepoints(
(
select geog
from countries_geojson_simple
where country_iso3='ITA'
), 1000)

This query uses 2 sources:

We are going to check how to set these up later, but for now let’s transform the array of points into a table of numbers to visualize in Snowflake, as in the earlier scatter chart:

select st_x(point), st_y(point)
from (
select to_geography(p.value) point
from (
select sfcarto.random.st_generatepoints((select geog from countries_geojson_simple where country_iso3='ITA'), 1000) points
), table(flatten(points)) p
)

When you have so much data around a map or chart, it’s nice to cluster it in separate groups. The CARTO shared libraries are ready to help with the st_clusterkmeans() shared UDF:

1,000 random points around Italy, separated in 5 clusters

Done with this SQL query:

select sfcarto.clustering.st_clusterkmeans(points, 5) kmeaned
from (
select sfcarto.random.st_generatepoints(
(select geog from countries_geojson_simple where country_iso3='ITA')
, 1000) points
)

And to visualize as a scatter chart in Snowflake:

with clustering as (
select sfcarto.clustering.st_clusterkmeans(points, 5) kmeaned
from (
select sfcarto.random.st_generatepoints(
(select geog from countries_geojson_simple where country_iso3='ITA')
, 1000) points
)
)
select st_x(point), st_y(point), cluster
from (
select to_geography(p.value:geom) point
, p.value:cluster::int cluster
from clustering, table(flatten(kmeaned)) p
)

We can do the same with other countries (guess the names?):

1,000 random points clustered around 2 countries. Can you guess which countries they are?
1,000 random points clustered around 2 countries. Can you guess which countries they are?

Setup

To run this demo, first we need a table with the countries’ shapes. The project datasets/geo-countries offers a public domain JSON with this data, and loading it into Snowflake is straightforward (after applying some tricks to split the massive JSON object):

-- wget https://github.com/datasets/geo-countries/blob/master/data/countries.geojson
-- snowsql: put file:///tmp/countries.geojson @~;
create or replace file format my_csv_format
type = 'csv'
field_delimiter = none;
create or replace table countries_geojson
as
select j:properties.ADMIN::string country
, j:properties.ISO_A3::string country_iso3
, geog
from (
select parse_json($1) j, try_to_geography($1) geog
from @~/countries.geojson.gz (file_format => my_csv_format)
where geog is not null
)
;

Note that the shapes of many of these countries are too complicated for the algorithm generating random points. A solution is to simplify the shape of each country by choosing only the largest single polygon for any country that is represented by more than one polygon:

create or replace table countries_geojson_simple as
select country, country_iso3
, to_geography(object_construct(
'coordinates'
, iff(
j:geometry.type='MultiPolygon'

, xv
, j:geometry.coordinates)
, 'type', 'Polygon'
)) geog
from (
select j:properties.ADMIN::string country
, j:properties.ISO_A3::string country_iso3
, j
, x.value xv
from (
select parse_json($1) j, try_to_geography($1) geog
from @~/countries.geojson.gz (file_format => my_csv_format)
where geog is not null
), table(flatten(j:geometry.coordinates)) x
qualify row_number() over(partition by country_iso3 order by array_size(x.value[0]) desc) = 1
)
;

To import CARTO shared UDFs into your account, ask your account admin to search for “CARTO spatial extension” in the Marketplace tab. It’s free, and ready to use. If you import it with the suggested db name “sfcarto” all the queries above will work.

CARTO also offers detailed docs for their Snowflake spatial extension:

Next steps

  • Find the “center of mass” of each cluster with sfcarto.clustering.st_centerofmass().
  • Write a query that transforms a traditional scatter chart into geo-points to apply CARTO’s geo-clustering UDF. Or write a UDF that finds 2d-clusters without the need to go through geo-transformations.

Watch “Spatial Analytics in the Cloud Using Snowflake & CARTO” with CARTO founder Javier de la Torre and Snowflake PM Sergei Sokolenko. In this webinar they explore many more of the datasets and functions that CARTO shares in Snowflake — while also exploring advanced use cases when connecting both.

Watch “Spatial Analytics in the Cloud Using Snowflake & CARTO

Also check CARTO’s announcement blog post:

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.

Snowflake

Articles for engineers, by engineers.