How to Anonymize Location Data in Snowflake

Background

Organizations frequently need to obfuscate geospatial data in order to share it internally or externally without divulging sensitive location information. For example, cities need to share the location of 911 calls with the public, but how do you do that without disclosing private citizens’ information? You could, for example, aggregate it to a zip code or census tract level, but that would strip the data of the level of detail that the public needs. Another method would be to “snap” the 911 locations to the nearest intersection. This ensures that the data retains usefulness while the caller’s identity is protected. This is the process the City of San Francisco follows now.

Source: DataSF

Typically this type of data transformation would require specialized geospatial data tools. However, with Snowflake’s geospatial data type and geospatial functions, you can perform this transformation directly in Snowflake, with the following benefits:

  • You can transform large amounts of data at scale
  • Sensitive location data doesn’t leave the Snowflake environment and stays protected by Snowflake’s end-to-end encryption, robust role based access and other governance controls
  • The anonymized data can be shared seamlessly with other accounts of Snowflake using Snowflake data sharing

In this article, I am using a dataset of street intersections provided through the City of San Francisco’s open data portal called DataSF, and a few randomly selected addresses within the city boundaries to demonstrate how you can anonymize locations by snapping them to the nearest intersection.

Data Preparation

First, we need to ingest the intersections dataset into Snowflake. There are many ways data can be injected into Snowflake. Here, I am reading it from an external stage with a previously configured storage integration.

/* set context */
use schema geo.public;
use warehouse adhoc;
/*Upload intersections dataset to Snowflake from an external stage*/
create or replace stage geo.public.geo
url = 's3://xxxxxxxxx/CCSF_Geo/'
storage_integration=s3_int;
create or replace file format csv_format
type = 'csv' skip_header=1;
create or replace table geo.public.ccsf_intersections
(ID number, CNN varchar, Street_Name_1 varchar,
Street_Name_2 varchar, X_Coord float, Y_Coord float,
Latitude float, Longitude float, ZIP_Code varchar);
copy into geo.public.ccsf_intersections
from @geo file_format=csv_format;
select * from ccsf_intersections limit 100;

Result:

Intersections table

To leverage Snowflake’s geospatial functions, lets modify this table by adding a geospatial point column derived from the intersections’ latitude and longitude. We’ll use the ST_MAKEPOINT function to convert the coordinates into Snowflake’s GEOGRAPHYdata type. Materializing this new column (as opposed to creating it on the fly in a view) is recommended for best performance, particularly for large datasets.

/*Add a geospatial point derived from the intersections' lat & lon*/
alter table ccsf_intersections add intersection_pt geography;
update ccsf_intersections
set intersection_pt=st_makepoint(longitude, latitude);
select * from geo.public.ccsf_intersections limit 100;

Result:

Intersection coordinates presented as geography data type

Next, let’s create a table of sample addresses. Let’s insert geographic coordinates directly into Snowflake’s GEOGRAPHY data type.

/*Create some sample addresses in San Francisco*/
create or replace table geo.public.sample_locations
(address varchar, address_pt geography);
insert into geo.public.sample_locations
values
('373 Frederick St, San Francisco, CA 94117', 'POINT(-122.44925113290317 37.76677567637685)'),
('1030 Stanyan St, San Francisco, CA 94117', 'POINT(-122.45207657651818 37.765137794256205)'),
('732 12th Ave, San Francisco, CA 94118','POINT(-122.46976217570905 37.77451094909229)')
;
select * from geo.public.sample_locations;

Result:

Sample Addresses

Geographic Anonymization

The query below finds the nearest intersection for each of the sample locations and displays the original vs anonymized coordinates. The ST_DWITHIN(a.address_pt,b.intersection_pt,200) in the WHERE clause subsets potential matches to intersections within 200 meters of a sample address. The window function with QUALIFY syntax QUALIFY ROW_NUMBER() OVER (PARTITION BY...)<=1 finds the single nearest intersection for each address.

If you are not familiar with QUALIFY, it does with window functions what HAVING does with aggregate functions and GROUP BY : for each partition — in this case, each unique address — you indicate how many matching rows to return. If this solution strikes you as simple, it is! IT’S REALLY THAT SIMPLE. That’s the beauty of Snowflake.

select 
a.address call_address,
ST_X(address_pt) original_lon,
ST_Y(address_pt) original_lat,
street_name_1||' & '||street_name_2 nearest_intersection,
ST_X(intersection_pt) anonymized_lon,
ST_Y(intersection_pt) anonymized_lat,
ST_DISTANCE(a.address_pt,b.intersection_pt) AS meters_apart
from
geo.public.sample_locations a,
geo.public.ccsf_intersections b
where
ST_DWITHIN(a.address_pt,intersection_pt,200) --distance in meters
QUALIFY ROW_NUMBER() OVER
(PARTITION BY a.address ORDER BY meters_apart) <= 1;

Result:

Conclusion

This article described an approach to anonymization of geographic locations by replacing them with the location of the nearest intersection. This approach may not work for every scenario, for example, in rural areas the addresses may be too sparsely distributed, allowing for re-identification. So you may need to look for an alternative reference layer to snap to.

Looking more broadly, this approach is useful not only for anonymization, but also for finding the nearest one or more neighbors to a reference point. You can use it, for example, to understand distances to the nearest pharmacy, grocery store, bank, etc. Keep in mind that this approach calculates distances as the crow flies, rather than a routed driving/walking distance. The latter is something I am exploring further — stay tuned.

Hope you enjoyed reading this short article. I encourage you to leave comments if you have questions or feedback.

--

--

Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data