How to Geocode Addresses in Snowflake

Geocoding is a basic data enhancement function that enables you to turn addresses into points on the map, calculate distances between geographic locations, determine which geographic boundaries an entity falls into — county, zip code, congressional district, etc., — and so much more. In my work as a sales engineer at Snowflake, the question of how to geocode street address data in Snowflake comes up all the time. In this article I will describe how to leverage the AWS Location Service for geocoding addresses in Snowflake using external functions.

What is a Snowflake external function?

A Snowflake external function is a user-defined function that is executed outside of Snowflake. An external function is required to call an external API for tasks such as geocoding data, processing images and text, executing machine learning models or complex business logic, or accessing live data such as stock prices. User-defined functions and secure data sharing in Snowflake can accomplish some of these tasks directly in Snowflake, but if you need to leverage an external API, you must use external functions.

From technical standpoint, external functions must expose an HTTPS endpoint, accept JSON input and return JSON output, and be callable from a proxy service. External functions are scalar, meaning that they return one value per row of input.

The main steps in the execution of an external function are as follows:

1. A client program passes Snowflake a SQL statement that calls an external function. For example:

Sample statement calling an external function

2. Snowflake reads the definition of the external function which contains the URL of the proxy service and the name of the API integration (a Snowflake object) that contains the authentication information for that proxy service.

3. Snowflake reads information from the API integration and composes an HTTP POST request which contains the HTTP header information, the data to be processed, the proxy service resource to use, and the authentication information for that proxy service.

4. The POST request is sent to the proxy service, which reads the request and forwards it to the actual remote service.

5. The remote service processes the data and passes it back to Snowflake.

External Function Flow

This excellent tutorial explains in detail how to get a basic external function to work in AWS using the AWS API Gateway proxy service and an AWS Lambda function (remote service). The tutorial follows the Snowflake documentation and the sample Lambda code written in Python. I will let you hop through the steps in the tutorial on your own, and my geocoding example picks up from there.

AWS Lambda Function to Geocode an Address String

Building on the example from the Snowflake tutorial, I modified the basic function to accept an address string and return a geocoded result, using the AWS Location Service. The AWS Location Service provides a secure and cost-effective way to add location functionality to a variety of business applications.

To call the AWS Location Service inside the Lambda function, I used Boto3. Boto3 is an AWS SDK for Python which enables you to call a variety of AWS services. The specific Boto3 client that is designed for various location-related tasks is called LocationService, and the method for geocoding an address string is called search_place_index_for_text. This method has a number of arguments: TextandIndexNameare required, and the rest are optional. In my example, I used the following arguments:

  • Text (required) — address string to geocode
  • IndexName (required)— geographic data provider
  • FilterCountries (optional)— filter for the country code(s)
  • BiasPosition (optional) — preference for places that are closer to a specified position (geographic coordinates)
  • MaxResults (optional) — the maximum number of results I want to see

AWS Lambda Function Code:

#AWS Lambda Function for Geocoding#import json#import AWS Boto3
import boto3
#define boto3 client
client = boto3.client('location')
def lambda_handler(event, context):# 200 is the HTTP status code for "ok".
status_code = 200
# The return value will contain an array of arrays
#(one inner array per input row).
array_of_rows_to_return = [ ]
try:
# From the input parameter named "event",
#get the body, which contains the input rows.
event_body = event["body"]
# Convert the input from a JSON string into a JSON object.
payload = json.loads(event_body)

# This is basically an array of arrays.
# The inner array contains the row number, and a value
# for each parameter passed to the function.
rows = payload["data"]
# For each input row in the JSON object...
for row in rows:
# Read the input row number
row_number = row[0]
# Read in the string to geocode
input_value = row[1]

# Compose the output
# Refer to Boto3 documentation for info about
# location service parameters
response = client.search_place_index_for_text(
IndexName='explore.place',
FilterCountries=["USA"],
BiasPosition=[-177.386,33.938],
MaxResults=1,
Text=input_value
)
location_response = response["Results"]
print(location_response)

# Put the returned row number
# and the returned value into an array.
row_to_return = [row_number, location_response]

# ... and add that array to the main array.
array_of_rows_to_return.append(row_to_return)
json_compatible_string_to_return = json.dumps({"data" : array_of_rows_to_return})except Exception as err:
# 400 implies some type of error.
status_code = 400
# Tell caller what this function could not handle.
json_compatible_string_to_return = event_body
# Return the return value and HTTP status code.
return {
'statusCode': status_code,
'body': json_compatible_string_to_return
}

To confirm that the function works as expected, I configured a test in the AWS console with some sample data:

AWS Lambda test

… and received this response:

AWS Lambda test response

If you get a 400 error, you are most likely executing the function with a role that does not have the privileges to use the Location Service. To fix this, I created an AWS IAM policy with permissions to all location actions:

… and attached this policy to the role executing my lambda function.

The remaining steps on the AWS side are the same as described in the external function tutorial. For simplicity, I reused the API Gateway that was created for earlier lambda functions and added the geocoding function as a new POST method. My deployed API endpoint is https://*********.execute-api.us-east-1.amazonaws.com/sf-stage/sf-address-geocode

Snowflake external function for geocoding

Assuming that you have implemented the API integration in Snowflake as part of the tutorial, there are just a couple of steps left to configure on the Snowflake side.

Create and test the external function

I defined the external function in Snowflake as shown in the snippet below. The function accepts a single input — the address string — and returns a JSON object in the Snowflake variant data type.

//Define Geocoding function
create or replace external function sf_geocode_address(address string)
returns variant
api_integration = my_api_integration
as 'https://**********.execute-api.us-east-1.amazonaws.com/sf-stage/sf-address-geocode';

Let’s test the function.

//Create sample data use role sysadmin;
create or replace database geo;
use database geo;
use schema public;
create table sample_addresses (address string);insert into sample_addresses
values
('4132 Locust St, Riverside, CA 92501'),
('4276 11th St, Riverside, CA 92501'),
('4265 El Dorado St, Riverside, CA 92501');
//Test the function
select
address,
sf_geocode_address(address) resp
from sample_addresses;

Here’s the output we get for the first row:

Sample geocoding output

All that is left to do is traverse the JSON output to get the data elements you need. In my case, I just needed the latitude and the longitude. Here’s the code to pull them out:

//Extract coordinateswith result as
(select
address,
sf_geocode_address(address) resp
from sample_addresses)
select
address,
resp[0].Place.Geometry.Point[0]::float longitude,
resp[0].Place.Geometry.Point[1]::float latitude
from result;

And the results:

Addresses matched to geographic coordinates

Just for fun, we can now through these results on the map in Tableau, using its native connector to Snowflake:

A quick note on troubleshooting when processing large volumes of addresses. If your function is executing fine on smaller amounts of data but you are running into errors when the row count exceeds 10,000, you may be running into an AWS API limit. You can confirm this by reviewing logs in AWS CloudTrail — you will see the “too many requests” error. AWS sets quotas for certain services to ensure that all customers have access to a service, and this is one of them. If you are processing a large volume of data as part of your initial batch process, you can break up your data into batches at or under 10,000 rows. Make sure that you are sending distinct addresses to avoid reprocessing the same address. If you expect to process large volumes of addresses, talk to your AWS account team.

In this article, I described the process of street address geocoding in Snowflake with the AWS Location Service, using a generic tutorial as starting point. If you are looking for more examples of external functions, the Snowflake Quick Starts contains self-paced tutorials for external functions using AWS, Azure and Google.

--

--

Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data