How to Get a Snowflake Data Quality Assessment in 60 Seconds

Use Snowflake Data Profiler for fast, statistical DQ analysis of Snowflake tables

Sam Kohlleffel
Hashmap, an NTT DATA Company
8 min read · Aug 18, 2020


What is Data Profiling?

The organization and preservation of your data are essential, and when you’re using the Snowflake Data Cloud you’ll find your data organized by warehouse, database, schema, and table or view. Once your data is placed and organized, however, it can be a challenge to keep track of and monitor your tables’ data quality. To get the most value out of Snowflake (and any data platform), we suggest using data profiling as a starting point.

Data profiling is an approach for providing automated, in-depth analyses of data quality and for identifying relationships in your data that aren’t always obvious at first glance. It is a great way to find quality issues at their source and is a common tool for jumpstarting advanced analysis of new data sets.

Snowflake Data Profiler

Snowflake Data Profiler is a simple, open-source tool that can generate a report of any one of your Snowflake tables (or views) in a single click.

We chose to build utilities for Snowflake for a number of reasons, not the least of which is that Hashmap is getting more requests to implement Snowflake for customers than all other cloud data platforms combined, and for good reason.

  1. Snowflake just works, is SQL-first, and requires very little day-to-day management and operational overhead
  2. The virtual warehousing and independent compute are unmatched, with instant up/down elasticity and the cost controls needed to go along with pay-by-second pricing
  3. With Snowflake, you have your choice of all 3 major public clouds — AWS, Azure, and GCP

How does Snowflake Data Profiler work?

Snowflake Data Profiler is a Python-based tool that leverages:

  1. snowflake-connector-python to connect to Snowflake and query tables
  2. pandas and pandas-profiling to generate the statistical report
  3. Flask, Gunicorn, and Nginx to receive requests and serve the app
  4. Docker and Docker Compose to package and deploy the application

Connecting to the Snowflake Database

First, the user submits their Snowflake information through a form on the front-end webpage. Once the post request makes its way to the Flask framework, we use Python to handle the Snowflake user data.
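
As a rough sketch, the Flask side might look like the following. The route path and form field names here are hypothetical; only the overall flow comes from the description above.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/profile", methods=["POST"])  # hypothetical route
def profile_table():
    # Collect the user's Snowflake details from the submitted form
    creds = {field: request.form.get(field)
             for field in ("username", "password", "account", "database",
                           "schema", "role", "warehouse", "table")}
    ...
```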

To access the specified Snowflake table, the Profiler needs to first determine whether the given user information is correct. This is done by creating a connector object with the snowflake-connector-python module.

For our purposes, the connector object requires the user’s Snowflake username, password, account, database, and schema. Role is an optional argument.

The connector object establishes a connection with Snowflake. If the connection fails, the user is notified that the submitted Snowflake information is incorrect.
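
Continuing the hypothetical handler above, the validation step might look like this; snowflake-connector-python raises a DatabaseError when the credentials are wrong.

```python
import snowflake.connector

try:
    # Create the connector object; role is an optional argument
    conn = snowflake.connector.connect(
        user=creds["username"],
        password=creds["password"],
        account=creds["account"],
        database=creds["database"],
        schema=creds["schema"],
        role=creds.get("role") or None,
    )
except snowflake.connector.errors.DatabaseError:
    # connect() fails fast on bad credentials, so we can tell
    # the user their submitted information is incorrect
    return "The submitted Snowflake information is incorrect.", 400
```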

After the connection is established, a cursor object is created from the connector. The cursor is what runs the ‘execute’ and ‘fetch’ commands against Snowflake.

The query run through the cursor needs the database, schema, and table the user would like to profile. The Snowflake warehouse is an optional argument.

Once the cursor is created, it executes a SQL command that selects the table from the Snowflake database. A built-in cursor method is then used to fetch the result and convert it into a pandas data frame.

The cursor object allows us to not only query a table from the Snowflake database but also convert the table into a pandas data frame.
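
Continuing the same sketch, the query-and-fetch step might look like the following. The LIMIT clause is one way the 10,000-row cap described in the next section could be applied; it’s an assumption, not necessarily the project’s exact query.

```python
db, schema, table = creds["database"], creds["schema"], creds["table"]

# The cursor runs our SQL against Snowflake
cur = conn.cursor()
cur.execute(f'SELECT * FROM "{db}"."{schema}"."{table}" LIMIT 10000')

# fetch_pandas_all() converts the result set into a pandas data frame
data = cur.fetch_pandas_all()
```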

Leveraging Technical Debt

The Profiler can generate a report on any table regardless of size, but it currently only looks at the first 10,000 rows. We have understood since the beginning of our development that the 10,000-row limit was a temporary fix.

Our goal is to get the working Snowflake Data Profiler out the door, then to focus on scalability.

Using this method, we are able to provide users a working version of the application and measure how demand impacts backend performance. As more users profile their data and as we implement more advanced autoscaling, we can afford to remove these restrictions.

We are currently working to improve upon the row limit through the implementation of more advanced load scaling with Kubernetes. This means that Snowflake Data Profiler should have the option to greatly increase the row limit in the near future.

Generating the Profile Report

Now that our tool has a pandas data frame, the only thing left to do is create the profile report. Rather than building a custom profiling tool, the Snowflake Data Profiler utilizes pandas-profiling.

This is another example of how we leverage technical debt to our advantage. While we could build our own profiling tool, it would most likely take a while to develop. There are already viable options for completing this kind of task. Below is an example of how easy it is to profile a pandas data frame using pandas-profiling.
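
A minimal sketch of that call follows; the title string and correlation settings are illustrative values matching the description below, not necessarily the project’s exact configuration.

```python
from pandas_profiling import ProfileReport

# Build the report from the data frame fetched earlier
profile = ProfileReport(
    data,
    title="Snowflake Data Profiler Report",  # illustrative title
    explorative=True,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
)
```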

The ProfileReport class from pandas-profiling gives the user many parameters to work with. All that is required to generate a report is a pandas data frame; ours is stored in the variable data.

We choose to expand upon the default report slightly by changing the title of the report, setting explorative to True, and configuring which correlations we need.

Pandas-profiling is an amazing tool that runs a detailed statistical analysis on any pandas data frame it is given (sample reports are available in the pandas-profiling documentation). Once we generate the report, our tool returns it to the user as an HTML response.
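
In the hypothetical Flask handler sketched earlier, that last step can be as simple as returning the rendered HTML string:

```python
# to_html() renders the entire report as one HTML string,
# which Flask sends back as the response body
return profile.to_html()
```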

Moving the App into Production

Hosting the Application

We use an AWS EC2 instance to host our application.

To run Snowflake Data Profiler in our EC2 instance, we use Docker and Docker Compose.

We use both the latest Snowflake Data Profiler image and an Nginx image from DockerHub in our docker-compose file.
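
The file itself isn’t reproduced here, but a minimal docker-compose.yml along these lines would match that description; the image name and Gunicorn entry point are assumptions, not the project’s actual values.

```yaml
version: "3"

services:
  profiler:
    # Hypothetical image name and app module
    image: hashmapinc/snowflake-data-profiler:latest
    command: gunicorn --bind 0.0.0.0:8000 app:app
    expose:
      - "8000"

  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    depends_on:
      - profiler
```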

As you can see in our docker-compose.yml, we pull the latest version of the Snowflake Data Profiler image. We also pull a pre-existing Nginx image from DockerHub and set it to depend on the Snowflake Data Profiler image.

The docker-compose.yml, along with the other project files, is copied to the EC2 instance. Then, from inside the instance, we run docker-compose up to deploy the app on the EC2 VM.

Securing Our Application

We provide a secure connection to Snowflake Data Profiler in the browser using Cloudflare and HTTPS certificates. This means that your Snowflake information is encrypted in transit to the EC2 web server.

We never store your connection details or Snowflake data. Our Flask API is entirely stateless and discards all of your details as soon as it finishes a request. Because nothing is persisted, there is no data store to compromise.

Handling Post Data with WSGI

To handle web requests from the browser, Snowflake Data Profiler utilizes Nginx, Gunicorn, and the Flask framework.

Once the user submits their Snowflake information, the post request is sent from the front-end webpage to the Nginx server. The Nginx server then decides what to do with the post request.

Nginx is a fast and effective way to manage web traffic. It can handle static content quickly while it sends dynamic requests further down the line. In our case, Nginx sends the request to a Gunicorn WSGI (Web Server Gateway Interface).

Gunicorn serves as a necessary go-between for the Nginx server and the Flask framework. Sending the post request to the Gunicorn server allows Flask and Python to serve incoming requests. Once the request is processed and the Snowflake table is profiled, Snowflake Data Profiler sends a profile report back to the user as an HTTP response.

[Figure: sample screenshots of a profile report]

The report makes its way back to the browser through Flask, Gunicorn, and Nginx in a matter of seconds.

Final Thoughts

Snowflake Data Profiler is a safe and secure way to quickly gain a greater understanding of both the data quality and hidden relationships of your Snowflake data.

All it takes is a single click and you can profile the first 10,000 rows of one of your Snowflake tables. You can use it as many times as you’d like on as many tables as you’d like. Be sure to check out Snowflake Data Profiler. It’s quick, open-source, secure, and easy to use!
