Cloudy SQL — Querying Snowflake Inside a Jupyter Notebook

A Jupyter Magics extension designed for Data Scientists to interact seamlessly with Snowflake

Sam Kohlleffel
Hashmap, an NTT DATA Company
6 min read · Jan 28, 2021

Background

The Snowflake Data Cloud is multifaceted, providing scale, elasticity, and performance, all in a consumption-based SaaS offering. Customers can load their data into Snowflake tables and easily transform the stored data whenever the need arises.

Among the many features provided by Snowflake is the ability to establish a remote connection. From this connection, you can leverage the majority of what Snowflake has to offer.

I will focus on two features: running SQL queries and transforming table data via a remote Snowflake connection.

One popular way for data scientists to query Snowflake and transform table data is to connect remotely using the Snowflake Connector for Python inside a Jupyter Notebook.

Sample remote interaction using the Snowflake Connector for Python

The example above uses the Snowflake Connector for Python inside a Jupyter Notebook. I first create a connection object. Then, I create a cursor object from the connection. After creating the cursor, I can execute a SQL query inside my Snowflake environment. Finally, I store the query results as a pandas DataFrame, which I can then transform and upload back to Snowflake as a table.
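
Since the notebook screenshot is not reproduced here, the sketch below illustrates that flow; the credentials, table name, and query are placeholders rather than the exact code from the original example.

import snowflake.connector

# Create a connection object (placeholder credentials)
conn = snowflake.connector.connect(
    user="<username>",
    password="<password>",
    account="<account_identifier>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# Create a cursor from the connection and execute a SQL query in Snowflake
cur = conn.cursor()
cur.execute("SELECT * FROM MY_TABLE LIMIT 100")

# Store the query results as a pandas DataFrame
df = cur.fetch_pandas_all()

cur.close()
conn.close()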

From the example above, you can see that connecting to Snowflake and executing SQL inside a Jupyter Notebook is not difficult, but it can be inefficient. To address this problem, we developed an open-source Python package and Jupyter extension.

Cloudy SQL

Cloudy SQL is a pandas and Jupyter extension that manages the Snowflake connection process and provides a simplified way to execute SQL in Snowflake from a Jupyter Notebook.

Watch a demonstration video of Cloudy SQL in this Hashmap Megabyte:

Supported Methods

Cloudy SQL currently supports:

  1. A Jupyter magic method that allows users to easily execute SQL queries in Snowflake from a Jupyter Notebook
  2. Writing to an existing or new Snowflake table from a pandas DataFrame

Installation and Configuration

Before using Cloudy SQL, a few setup steps need to be completed:

  • Run pip install cloudy-sql to install Cloudy SQL and its dependencies.
  • Upon installation, open an empty Jupyter notebook and run the following code in a Jupyter cell:

%load_ext cloudy_sql

After you run the above code, a configuration file will be created in your HOME directory. Cloudy SQL uses the information in this file to connect to Snowflake for you.

The path to the configuration file: $HOME/.cloudy_sql/configuration_profiles.yml

For Windows, use $USERPROFILE instead of $HOME.

The configuration file has the following format:
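
The original post shows the file's contents in a screenshot. The sketch below only illustrates the kind of fields the file holds; the exact key names and nesting are assumptions, so check the generated file itself for the precise layout.

# Illustrative sketch only -- open the generated configuration_profiles.yml
# for the exact layout.
profiles:
  default:
    username: <your_username>
    password: <your_password>
    account: <your_account>
    role: <your_role>            # optional
    warehouse: <your_warehouse>  # optional
    database: <your_database>
    schema: <your_schema>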

  • Open this file using the path provided above and fill in your Snowflake information in the applicable fields.

Note: Configuration is a one-time setup. Once you’ve configured the credentials file, you can use it for any project that uses Cloudy SQL.

How it works

When you call any Cloudy SQL magic or method, it uses the information stored in the configuration_profiles.yml to seamlessly connect to Snowflake.

Cloudy SQL currently supports two options to pass in Snowflake connection credentials and details:

  1. The configuration file described above. configuration_profiles.yml serves as the place to store default Snowflake credentials.
  2. Passing your Snowflake details as arguments when calling a Cloudy SQL magic or method. Any argument passed in will take priority over its corresponding default value stored in the configuration file.

Using the Extension

To use Cloudy SQL in a Jupyter Notebook, you need to run the following code in a cell:

%load_ext cloudy_sql

API

The intent has been to keep the API as simple as possible by minimally extending the pandas and IPython Magic APIs.

%%sql_to_snowflake

An IPython cell magic that seamlessly connects to Snowflake, runs a query, and optionally returns the result as a pandas DataFrame when applicable.

%%sql_to_snowflake parameter options

Parameters

  • <SQL query> (Required)
    SQL query to be executed in Snowflake
  • <destination_var> (Positional)
    Variable to store the query results. If none is given, the magic will return the first 10 rows of the pandas DataFrame if applicable.
  • --params <params> (Optional)
    Parameters to be used in the SQL Query. Params must be passed in as a dictionary string in the format {"param_name": "param_value"} or reference a dictionary string defined in a previous cell. The use of a parameter in the query should be indicated with {{param_name}}.
  • --username <username> (Optional)
    If provided, the called method will connect to Snowflake with this username instead of the default in the configuration file.
  • --password <password> (Optional)
    If provided, the called method will connect to Snowflake with this password instead of the default in the configuration file.
  • --account <account> (Optional)
    If provided, the called method will connect to Snowflake with this account instead of the configuration file's default.
  • --role <role> (Optional)
    If provided, the called method will connect to Snowflake with this role instead of the default in the configuration file.
  • --warehouse <warehouse> (Optional)
    If provided, the called method will use this warehouse instead of the default in the configuration file.
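
For illustration, a cell using the magic might look like the following; the destination variable, database, schema, and table names are placeholders rather than values from the original post.

%%sql_to_snowflake results_df
SELECT *
FROM MY_DATABASE.MY_SCHEMA.MY_TABLE
LIMIT 100

Here results_df would hold the query results as a pandas DataFrame in later cells, and inline arguments such as --username or --warehouse can be added to the first line to override the configuration file's defaults for that call.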

write_snowflake

This method allows users to create a Snowflake table and write to that table with a pandas DataFrame. Users can also use this method to append data to an existing Snowflake table.

The only argument that must be passed in directly is table.

This method works whether the target Snowflake table already exists or not. If the table you provide does not exist, this method creates a new Snowflake table and writes to it. If the table already exists, the DataFrame data is appended to the existing table by default. If you would like to replace the table with the pandas DataFrame, set overwrite = True when calling the method.

Role and warehouse are optional arguments that can be set up in the configuration_profiles.yml.

Username, password, account, database, and schema are all required but can have default values set up in the configuration file.
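
A minimal sketch of a call might look like the following; the import path, DataFrame, and table name are assumptions for illustration, so confirm the exact signature against the repo or PyPI documentation.

# Illustrative sketch -- confirm the exact import path and signature
# against the Cloudy SQL documentation.
from cloudy_sql import write_snowflake

# Append df to TEST_CLOUDY_SQL, creating the table first if it does not exist.
# Username, password, account, database, and schema fall back to the defaults
# in configuration_profiles.yml.
write_snowflake(df, table="TEST_CLOUDY_SQL")

# Replace the table's contents with df instead of appending.
write_snowflake(df, table="TEST_CLOUDY_SQL", overwrite=True)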

Examples

Example configuration_profiles.yml

The example above shows how a user can leverage both the %%sql_to_snowflake magic and the write_snowflake method. It runs a SQL query with %%sql_to_snowflake and saves the results as a pandas DataFrame by passing in the destination variable df (In [6]). The called %%sql_to_snowflake magic uses the Snowflake credentials found in the configuration file.

The example then shows how to easily write that df to a Snowflake table (In [8]). The write_snowflake method uses the default username, password, account, database, and schema found in the configuration file.

The example above runs a SQL query with passed-in variables. The variables are used directly in the SQL query by placing each one inside {{ }}. A dictionary string, parameters, is passed in when the magic is called by including the --params inline argument and placing a $ to reference the dictionary string created in the previous cell (In [3]). The magic also uses the passed-in snowflake_username instead of the default in the configuration file.

The example then shows how to overwrite the existing test_cloudy_sql table with the data in the df variable by setting overwrite = True (In [5]). The user then drops the table (In [6]).
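
Since the notebook screenshots are not reproduced here, the parameterized-query cells might look roughly like this; the parameter names, table, and username are placeholders, not the values from the original notebook. One cell defines the dictionary string:

parameters = '{"state": "TX", "row_limit": "100"}'

The next cell references it with $ and overrides the username for that call:

%%sql_to_snowflake filtered_df --params $parameters --username <your_username>
SELECT *
FROM MY_DATABASE.MY_SCHEMA.CUSTOMERS
WHERE state = '{{state}}'
LIMIT {{row_limit}}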

Get the Code

Find the GitLab repo here.

Conclusion

Cloudy SQL is a pandas and Jupyter extension that manages the Snowflake connection process and provides a simplified and streamlined way to execute SQL in Snowflake from a Jupyter Notebook. This tool continues to be developed with new features, so any feedback is greatly appreciated. Be sure to check out the PyPI package here!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you are considering moving data and analytics products and applications to the cloud, or if you would like help, guidance, and a few best practices in delivering higher-value outcomes in your existing cloud program, then please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our data and cloud service offerings. We would be glad to work through your specific requirements.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Sam Kohlleffel is in the RTE Internship program at Hashmap, an NTT DATA Company. When he’s not developing data and cloud applications, he’s studying Economics, Math, and Statistics at Texas A&M University.
