Loading data from GitHub or a public website to Snowflake

Often, the data you need is stored in places like GitHub, or has to be consumed from openly accessible URLs. This adds complexity to the overall data engineering pipeline. Snowpark lets you access these external network locations securely using UDFs and stored procedures.

For those who don't know Snowpark, it is a robust open-source framework for data engineers that takes advantage of Snowflake's powerful platform, with its built-in high availability, unlimited scalability, performance, governance, and security features.

Snowpark Architecture

So, let’s see how easily this can be done in Snowpark:

Step 1: Set up network rules and an external access integration (a private preview feature as of July 2023)

USE ROLE ACCOUNTADMIN;

-- create a separate role for network admin as a best practice
CREATE ROLE network_admin;
GRANT CREATE NETWORK RULE ON SCHEMA demodb.extfunc TO ROLE network_admin;
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE network_admin;
GRANT USAGE ON DATABASE demodb TO ROLE network_admin;
GRANT USAGE ON SCHEMA demodb.extfunc TO ROLE network_admin;
GRANT USAGE ON WAREHOUSE demo_wh TO ROLE network_admin;

-- create role hierarchy
GRANT ROLE network_admin TO ROLE SYSADMIN;

USE ROLE network_admin;
USE WAREHOUSE demo_wh;

-- create a network rule so that, for compliance, only these sites are allowed
CREATE OR REPLACE NETWORK RULE web_access_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('drive.google.com',
'raw.githubusercontent.com',
'api.exchangerate-api.com');

-- create an external access integration with the network rule created above
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION external_access_int
ALLOWED_NETWORK_RULES = (web_access_rule)
ENABLED = true;
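
As a quick sanity check before moving on, you can list the objects you just created:

-- verify the network rule and the integration exist
SHOW NETWORK RULES;
SHOW INTEGRATIONS;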

Step 2: Create a Stored Procedure or UDF

Please note that the following code references the external access integration, which wraps the network rules we set up in Step 1:

EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)

To load data from GitHub, you need the URL that serves the raw file contents, which you can get by clicking the Raw button in the top-right corner of the file page.
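
For reference, the Raw button simply rewrites the file page URL into a raw-content URL, roughly as follows (the angle-bracket parts are placeholders):

-- page: https://github.com/<org>/<repo>/blob/<ref>/<path>/file.csv
-- raw:  https://raw.githubusercontent.com/<org>/<repo>/<ref>/<path>/file.csv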

Then, log in to Snowflake, create a new worksheet in Snowsight, and use the code below:

USE ROLE SYSADMIN;
USE SCHEMA DEMODB.EXTFUNC;
USE WAREHOUSE DEMO_WH;

CREATE OR REPLACE PROCEDURE readcsv_proc(tname varchar, url varchar)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'readdata'
EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)
PACKAGES = ('snowflake-snowpark-python','pandas','urllib3')
AS
$$
import _snowflake
import pandas as pd
import urllib3

def readdata(session, tname, url):
    # read the CSV straight from the URL into a pandas DataFrame
    df = pd.read_csv(url)
    # write it to a Snowflake table, creating the table if it doesn't exist
    session.write_pandas(df, tname, auto_create_table=True, overwrite=True)
    return 'success'
$$;

-- Call this stored procedure to load data from GitHub
-- make sure you clicked the Raw button to get this URL
call readcsv_proc
('wiki_large_cities',
'https://raw.githubusercontent.com/Snowflake-Labs/fhoffa/a4d48f96e3309c809aa30bc51aff58a02daeadda/knoema_noaa_dbt/data/wikidata_large_cities.csv'
);

Voilà, the data is now available in a Snowflake table.

Snowsight — Data Navigator
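
You can also verify from a worksheet with a quick query. Note that write_pandas quotes identifiers by default, so the auto-created table name is case-sensitive, hence the double quotes:

select * from "wiki_large_cities" limit 10;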

Similarly, if the source data is JSON, you can write a Python UDF and use a VARIANT column to extract values as needed!


CREATE OR REPLACE FUNCTION extjson_udf(url string)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'getjson'
EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)
PACKAGES = ('urllib3')
AS
$$
import _snowflake
import urllib3

def getjson(url):
    # fetch the payload over HTTP and return the body as a string
    http = urllib3.PoolManager()
    resp = http.request("GET", url)
    return str(resp.data.decode())
$$;


select
parse_json(extjson_udf('https://api.exchangerate-api.com/v4/latest/USD'));

-- Insert the result into a table

create or replace table forex_rates
(
ts timestamp default current_timestamp(),
v variant
);

insert into forex_rates (v)
( select
parse_json(extjson_udf('https://api.exchangerate-api.com/v4/latest/USD'))
);

-- Let's see US Dollar to Indian Rupee
select ts, v:rates.INR as usd_to_inr from forex_rates;

-- TIP: create a task to refresh this data periodically (see the sketch below)

Output of SQL
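
Expanding on that tip, here is a minimal sketch of a scheduled refresh using a Snowflake task; the task name and the one-hour cadence are illustrative, not part of the original setup:

-- illustrative: adjust the name, warehouse, and schedule to your needs
create or replace task refresh_forex_rates
  warehouse = demo_wh
  schedule = '60 MINUTE'
as
  insert into forex_rates (v)
  select parse_json(extjson_udf('https://api.exchangerate-api.com/v4/latest/USD'));

-- tasks are created suspended; resume to start the schedule
alter task refresh_forex_rates resume;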

Similarly, you can get data from many other sources, such as Google Drive and Dropbox, using their APIs. As you can see, Snowpark lets data engineers do complex tasks in a simple way, backed by the Snowflake platform's performance, cost, security, and governance capabilities.
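
For instance, since drive.google.com is already in the network rule from Step 1, a publicly shared Google Drive CSV could be loaded with the same procedure through Drive's direct-download URL form; the table name and file ID below are placeholders:

-- <FILE_ID> stands in for the ID of a publicly shared file
call readcsv_proc(
'gdrive_sample',
'https://drive.google.com/uc?export=download&id=<FILE_ID>'
);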

During Snowflake Summit 2023, many Snowpark-related announcements were made; see below:

Snowpark Roadmap

Happy Coding!

Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer (Snowflake).
