Cloudy Warehouses — Simplifying Snowflake & pandas DataFrames Interaction

A pandas library extension designed for streamlined interaction between Snowflake and pandas DataFrames

Sam Kohlleffel
Hashmap, an NTT DATA Company
6 min read · Jan 15, 2021


Background

Snowflake’s Data Cloud enables us to run six core workloads as part of the service: Data Engineering, Data Lake, Data Warehouse, Data Science, Data Apps, and Data Sharing. For this article, I’ll focus on two of those workloads: Data Science and Data Warehouse.

Snowflake users store their data in tables and can transform the uploaded data as they see fit. One popular way for Python developers and data scientists to interact with Snowflake tables is using pandas DataFrames and the Snowflake Connector for Python.

With the Snowflake Connector, developers can establish a direct connection to Snowflake from their local machines and interact with their Snowflake environment. Using the Snowflake Connector isn’t difficult, but, as with any new library, it takes time to learn.

The Connector also requires you to provide your account credentials and connection details for every project that interacts with Snowflake from your local machine. From personal experience, I can tell you this process becomes tedious. Thankfully, we have developed a tool that addresses this problem.

Cloudy Warehouses — a pandas extension for Snowflake

Cloudy Warehouses is a pandas extension that manages the Snowflake connection process and provides simplified interaction methods between Snowflake and pandas DataFrames.

GitLab Repository

GitHub Mirror

Snowflake Support

The cloudy_warehouses pandas extension currently supports:

  1. Writing to an existing or new Snowflake table from a pandas DataFrame
  2. Reading from a Snowflake table
  3. Listing Snowflake tables in a database
  4. Creating a clone of an existing Snowflake table
  5. Creating an empty clone of an existing Snowflake table (copies the columns, but none of the column data)

Installation and Configuration

Follow the steps below, or watch my walkthrough of how to set up and use Cloudy Warehouses in this Hashmap Megabyte:

[Video: Cloudy Warehouses: Simplifying Snowflake & pandas DataFrames Interaction]

To fully utilize Cloudy Warehouses, a few steps need to be completed before use:

  • Run pip install cloudy-warehouses to install Cloudy Warehouses and its dependencies.
  • Upon installation, create a new Python file containing only the following two lines of code:
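Based on the import statements shown later in this article, those two lines are presumably:

```python
# Running this file once prompts Cloudy Warehouses to generate its
# configuration file in your HOME directory.
import cloudy_warehouses
import pandas as pd
```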

After you run the above Python file, a configuration file will be created in your HOME directory. Cloudy Warehouses uses the information in this file to connect to Snowflake for you.

The path to the configuration file: $HOME/.cloudy_warehouses/configuration_profiles.yml

For Windows, use %USERPROFILE% instead of $HOME.

The configuration file has the following format:
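The embedded example doesn’t reproduce here, but as an illustration, the fields mirror the connection parameters discussed below (the generated file shows the exact key names):

```yaml
# $HOME/.cloudy_warehouses/configuration_profiles.yml
# Illustrative sketch only; consult the generated file for the exact keys.
profiles:
  snowflake:
    username: <your_username>
    password: <your_password>
    account: <your_account>
    database: <default_database>
    schema: <default_schema>
    warehouse: <default_warehouse>
    role: <default_role>
```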

  • Open the file at the path provided above and fill in your Snowflake information in the applicable fields.

Note: Configuration is a one-time setup. Once you’ve configured the credentials file, you can use it for any project that uses Cloudy Warehouses.

How it works

When you call any Cloudy Warehouses method, it uses the information stored in configuration_profiles.yml to seamlessly connect to Snowflake.

Cloudy Warehouses currently supports two options to pass in Snowflake connection credentials and details:

  1. The configuration file described above. configuration_profiles.yml serves as the place to store default Snowflake credentials.
  2. Arguments passed when calling a Cloudy Warehouses method. Any Snowflake detail you pass in directly takes priority over its corresponding default value stored in the configuration file.

Using the Extension

To use Cloudy Warehouses in a Python file, you need the following two import statements:

```python
import cloudy_warehouses
import pandas as pd
```

API

The intent has been to keep the API as simple as possible by minimally extending the pandas and Snowflake Connector APIs.

Writing to a Snowflake Table

This method allows users to create a Snowflake table and write to that table with a pandas DataFrame. Users can also use this method to append data to an existing Snowflake table.

The only argument you must pass in directly is table.

The role and warehouse arguments are optional and can be set in configuration_profiles.yml.

The username, password, account, database, and schema arguments are all required, but each can have a default value set in the configuration file.
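A minimal sketch of a write call, using the write_snowflake accessor shown in the use case later in this article (the DataFrame contents are invented):

```python
import cloudy_warehouses
import pandas as pd

# Any DataFrame will do; this one is just an example.
df = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["a", "b", "c"]})

# Creates MY_TABLE if it doesn't exist, otherwise appends to it.
# All connection details fall back to configuration_profiles.yml.
df.cloudy_warehouses.write_snowflake(table="MY_TABLE")
```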

Reading from a Snowflake Table

This method reads from a Snowflake table and returns the data in that table as a pandas DataFrame.

The only argument you must pass in directly is table.

The role and warehouse arguments are optional and can be set in configuration_profiles.yml.

The username, password, account, database, and schema arguments are all required, but each can have a default value set in the configuration file.
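A corresponding read sketch, using the pd.read_snowflake call shown in the use case later in this article:

```python
import cloudy_warehouses
import pandas as pd

# Returns the contents of MY_TABLE as a pandas DataFrame. Everything
# other than table comes from configuration_profiles.yml.
df = pd.read_snowflake(table="MY_TABLE")
print(df.head())
```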

Listing the Tables in a Snowflake Database

This method returns all of the tables in a Snowflake database as a pandas DataFrame.

No arguments need to be passed in directly when calling this method; all of them can have default values set in the configuration file.
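The article doesn’t spell out this method’s name, so the call below is an assumption; check the repository for the exact API:

```python
import cloudy_warehouses
import pandas as pd

# Hypothetical method name. Returns the tables in the configured
# database as a pandas DataFrame.
tables = pd.list_snowflake_tables()
print(tables)
```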

Creating a Clone of an Existing Snowflake Table

This method creates a clone of an existing Snowflake table. The new_table variable names the table that will be created when the method is called, and the source_table variable names the existing Snowflake table being cloned. The optional source_database and source_schema variables identify the database and schema in which source_table resides; if the source table lives in the same database and schema as new_table, you can omit them. The other variables can have defaults set up in the configuration file.
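A sketch with an assumed method name (the repository documents the real one):

```python
import cloudy_warehouses
import pandas as pd

# Hypothetical method name. Clones SOURCE_TABLE into NEW_TABLE.
# source_database and source_schema are only needed when the source
# lives in a different database or schema than the new table.
pd.clone_snowflake(new_table="NEW_TABLE", source_table="SOURCE_TABLE")
```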

Creating an Empty Clone of an Existing Snowflake Table

This method works exactly like the clone above, except that new_table is created with source_table’s columns and none of its data. The same variables apply: new_table, source_table, the optional source_database and source_schema, and configuration-file defaults for the rest.
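Again with an assumed method name:

```python
import cloudy_warehouses
import pandas as pd

# Hypothetical method name. Creates NEW_TABLE with SOURCE_TABLE's
# columns but none of its rows.
pd.clone_empty_snowflake(new_table="NEW_TABLE", source_table="SOURCE_TABLE")
```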

Now, let’s look at an example.

Example

[Image: Example configuration_profiles.yml]
[Image: Cloudy Warehouses use case]

The above example shows how a user can leverage the read and write methods of Cloudy Warehouses after setup, with the configuration stored in configuration_profiles.yml.

In this case, df.cloudy_warehouses.write_snowflake uses the username, password, account, database, schema, warehouse, and role stored in the above configuration file.

Meanwhile, pd.read_snowflake uses the database and schema arguments that are passed in; those two values are used instead of the defaults. Cloudy Warehouses still connects to Snowflake with the default username, password, account, warehouse, and role parameters stored in the configuration file.
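Since the screenshot doesn’t reproduce here, a sketch of that flow (table and identifier names are invented):

```python
import cloudy_warehouses
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "VALUE": [10.0, 20.0]})

# Uses every connection parameter stored in configuration_profiles.yml.
df.cloudy_warehouses.write_snowflake(table="SALES")

# Overrides database and schema; username, password, account, warehouse,
# and role still come from the configuration file.
sales = pd.read_snowflake(table="SALES", database="ANALYTICS", schema="PUBLIC")
```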

The traditional alternative is to manually create a Snowflake connection and run every command through it.

Features We Are Looking to Add

  1. Supported methods for additional cloud-based data platforms (BigQuery, Azure Synapse)
  2. Additional API features
  3. New connection options (key authentication, SSO)

Conclusion

Cloudy Warehouses streamlines and simplifies interactions between pandas and Snowflake. This tool is still in development, with new features to be added, so any feedback is appreciated. Be sure to check out the PyPI package here!

Don’t forget to check out my video walkthrough of Cloudy Warehouses!

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap, an NTT DATA Company, offers a range of enablement workshops and consulting service packages. We would be glad to work through your specifics. Reach out to us here.


Sam Kohlleffel is in the RTE Internship program at Hashmap, an NTT DATA Company. When he’s not developing data and cloud applications, he’s studying Economics, Math, and Statistics at Texas A&M University.
