How We Automated Snowflake Data Profiler

with a combination of GitHub, unit testing, and CircleCI

Sam Kohlleffel
Hashmap, an NTT DATA Company
8 min read · Sep 21, 2020


To kick things off, if you haven’t checked out Snowflake yet, you definitely should. We get more requests from customers to implement Snowflake than for all other cloud data platforms combined.

Snowflake enables a wide range of data workloads including data warehousing, data lakes, data engineering, data science, and more. Even better, it does this as a true SaaS offering that requires virtually no maintenance or operational overhead.

With all the market demand that Snowflake is experiencing, it makes sense for us, as a Premier Partner, to focus on adding even more value to the overall customer experience. That’s where Snowflake Data Profiler comes in.

Snowflake Data Profiler is a free, open-source tool that profiles the first 10,000 rows of any Snowflake table. It became generally available in early August 2020 and is open for public use.

As with any deployed application, there are always improvements to make and features to add. One way to change a deployed app is to SSH directly into the VM instance and make the edits there. From personal experience, I can tell you this is a terrible way to update a deployed application.

It presents two major problems:

  1. It is manual, error-prone, and inefficient.
  2. The deployed application breaks if you make a mistake (which, in my case, happens often).

The question then becomes: how does a developer make changes to a deployed application securely and efficiently while ensuring that it won’t break?

This is where automation comes into play. Automation is critical to the process of securely making changes to an application without ever having to manually access the running processes in the VM instance. Automation was a foreign concept to me only a few months ago.

Let’s Look at Snowflake Data Profiler’s Automation

There are different paths that one can take when automating an application. For the purposes of this article, I am going to talk about the methods and technologies we used to automate Snowflake Data Profiler.

Snowflake Data Profiler Automation Flow

GitHub

GitHub is a free hosting service for software development that provides version control through Git.

Version control is extremely important for automating an application. It is a method of development in which programmers make changes and add features on a separate, private “branch” of a project. This separate version has no effect on the main, working version: whatever changes or features land on the branch, the main version of the project’s code remains untouched. Below is an example of how version control in GitHub works.

An example of GitHub version control

In GitHub, version control takes the form of branches. Generally, a project keeps its main, functioning version of the application on a branch called “master”. This branch is never changed directly. Instead, feature branches are used to make changes to the master branch. A feature branch starts as a copy of the master branch, created by pulling its code. Additions and changes can then be made and tested on the feature branch in isolation without touching the master branch’s code.

Changes are only made to the master branch when a pull request is made and approved. A pull request merges the additions of the feature branch with the existing master branch code. This means that the master branch now contains the additions of the feature branch. This process is repeated until all desired features are added to the application.
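To make this concrete, here is roughly what that flow looks like from the command line. The branch name and commit message below are hypothetical, and the final merge happens through an approved pull request on GitHub rather than a local command.

```bash
# Start from an up-to-date copy of the main branch
git checkout master
git pull origin master

# Create an isolated feature branch (name is hypothetical)
git checkout -b feature/profile-null-counts

# ...edit the code, then record and publish the changes...
git add .
git commit -m "Add null-count statistics to the profiler"
git push origin feature/profile-null-counts

# From here, open a pull request on GitHub; once it is approved and
# merged, master contains the feature branch's additions.
```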

Developing with version control allows developers to safely make additions to their project without worrying about breaking the main version of the code.

Unit Testing

Unit testing is an integral part of automation. Without tests, a developer cannot be completely certain that their code will work correctly once deployed.

Because Snowflake Data Profiler has a Python-based backend, we use a combination of the built-in unittest library and the pytest library, which we install through pip. The standard approach is to write tests covering each piece of functionality in the project. With pytest, a single terminal command, pytest, is all it takes to run them.

To find and run the tests, pytest relies on a naming convention. Every Python file that contains tests needs a name of the form test_ + desired_filename.py. The rest of the filename can be anything; as long as it starts with test_, pytest knows the file contains tests to run. Within the file, test function names need to start with test_ as well.

In the Snowflake Data Profiler project files, we have a test file that has the name test_profiler.py and a function within that file has the name test_get_profile. The test_profiler.py file contains tests for all of the functions found in our file profiler.py and the test_get_profile function contains tests for the get_profile function found in profiler.py.
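To make that concrete, here is a minimal sketch of what such a test could look like. The file and function names come from the project, but the test body, the sample data, and get_profile’s signature are assumptions made for illustration.

```python
# test_profiler.py -- a hypothetical sketch of one unit test
import pandas as pd

from profiler import get_profile  # the function under test


def test_get_profile():
    # Hypothetical stand-in for rows pulled from a Snowflake table
    df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.5, None]})

    profile = get_profile(df)

    # The real test asserts on the profiler's actual output; here we
    # only check that something non-empty comes back.
    assert profile is not None
```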

CircleCI

CircleCI is a continuous integration and continuous delivery platform. This is where the automation process all comes together. CircleCI needs a config.yml file (kept in the repository’s .circleci directory) to work. Here is the config.yml file for Snowflake Data Profiler.

Complete config file for Snowflake Data Profiler

A CircleCI config file designates the jobs and workflows for project automation. Our config.yml contains three jobs and one workflow. Each job defines a set of steps that serve as its instructions; when a job is called in a workflow, its steps are executed. Together, the three jobs in our workflow take the master branch code from GitHub and use it to update our EC2 VM instance in a matter of minutes.
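Before walking through the jobs one by one, here is a sketch of how a config.yml with this shape is laid out. The workflow name, executor images, and placeholder steps are assumptions; only the orb reference and the three job names come from the project.

```yaml
version: 2.1

orbs:
  docker: circleci/docker@1.3.0   # pre-built Docker commands and jobs (Job 1)

jobs:
  test_profiler:                  # custom job, detailed under Job 2
    docker:
      - image: cimg/python:3.8    # placeholder executor image
    steps:
      - run: pytest
  deploy_profiler:                # custom job, detailed under Job 3
    docker:
      - image: cimg/base:stable   # placeholder executor image
    steps:
      - run: echo "redeploy the container over SSH"

workflows:
  build-test-deploy:              # hypothetical workflow name
    jobs:
      - docker/publish            # job provided by the orb
      - test_profiler
      - deploy_profiler
```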

Job 1: docker/publish

As you can see, we use a CircleCI orb in our config.yml file. A CircleCI orb is an open-source package of pre-built commands and jobs.

The Docker orb we use for our config file

Specifically, we use the pre-built Docker orb circleci/docker@1.3.0, which contains a job for publishing images to Docker Hub. In our workflow, we call the docker/publish job that is pre-defined in the orb.

docker/publish job defined in our workflow

This job takes any feature branch with an open pull request in the Snowflake Data Profiler repository and builds an updated Docker image from the new code. Under the docker/publish job, we define the image we want to update and the tag for the new image version.
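In the workflow, the invocation looks roughly like the snippet below. The image name is a hypothetical stand-in; image and tag are parameters of the orb’s docker/publish job, and $CIRCLE_SHA1 is CircleCI’s built-in variable holding the commit hash being built, so every image version is tied to a specific commit.

```yaml
workflows:
  build-test-deploy:
    jobs:
      - docker/publish:
          image: hashmapinc/snowflake-data-profiler  # hypothetical image name
          tag: $CIRCLE_SHA1                          # one image tag per commit
```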

Job 2: test_profiler

The test_profiler job is a custom job that we build in the config.yml file.

Our test_profiler job defined in the config file

We first designate the Docker image to use. Notice that it is the image we built in the first job, tagged with $CIRCLE_SHA1. We then cd into our project directory and run our unit tests with the pytest command.
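Under the same assumptions about the image name, the job definition looks something like this; the path the code lives at inside the image is also hypothetical.

```yaml
jobs:
  test_profiler:
    docker:
      # the image published in Job 1, pinned to the current commit
      - image: hashmapinc/snowflake-data-profiler:$CIRCLE_SHA1
    steps:
      - run:
          name: Run the unit tests
          command: |
            cd /app      # hypothetical project directory inside the image
            pytest
```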

docker/publish and test_profiler jobs defined in our workflow

When we call the job in the workflow, we require that the docker/publish job is run before this job is executed. This ensures that the test_profiler job is using the most recent Docker image of Snowflake Data Profiler when it runs. After the test_profiler job is run, CircleCI notifies us of the results.
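In the workflow, that ordering is expressed with a requires key on the test_profiler entry:

```yaml
workflows:
  build-test-deploy:
    jobs:
      - docker/publish:
          image: hashmapinc/snowflake-data-profiler  # hypothetical image name
          tag: $CIRCLE_SHA1
      - test_profiler:
          requires:
            - docker/publish   # test only after the new image is published
```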

Job 3: deploy_profiler

The deploy_profiler job is another custom job that we define in the config.yml file.

Our deploy_profiler job defined in config file

This job runs steps that SSH into the EC2 VM instance. Once inside the VM, a command takes down the Docker container the VM is currently running and replaces it with a container based on the latest Docker image that was just created.
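Here is a sketch of what such a job can look like. The SSH user and host variables, key fingerprint, image name, container name, and port mapping are all hypothetical; add_ssh_keys is CircleCI’s built-in step for installing an SSH key stored in the project settings.

```yaml
jobs:
  deploy_profiler:
    docker:
      - image: cimg/base:stable   # placeholder executor image
    steps:
      - add_ssh_keys:             # installs a key added in CircleCI's project settings
          fingerprints:
            - "aa:bb:cc:dd"       # placeholder fingerprint
      - run:
          name: Replace the running container
          command: |
            # every name below is a hypothetical stand-in
            ssh -o StrictHostKeyChecking=no $SSH_USER@$SSH_HOST "
              docker pull hashmapinc/snowflake-data-profiler:$CIRCLE_SHA1 &&
              docker stop profiler && docker rm profiler &&
              docker run -d --name profiler -p 80:5000 \
                hashmapinc/snowflake-data-profiler:$CIRCLE_SHA1
            "
```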

docker/publish, test_profiler, and deploy_profiler jobs defined in our workflow

Notice that this job is only called in the workflow when a feature branch pull request is accepted and the master branch is the branch that changes. This step also requires that the test_profiler job passed. If the new image did not pass the unit tests, this step will not run, and the older Docker image keeps running on the EC2 instance.
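Putting the whole workflow together, the wiring looks roughly like this; the branches filter is what limits deployment to changes that land on master:

```yaml
workflows:
  build-test-deploy:
    jobs:
      - docker/publish:
          image: hashmapinc/snowflake-data-profiler  # hypothetical image name
          tag: $CIRCLE_SHA1
      - test_profiler:
          requires:
            - docker/publish
      - deploy_profiler:
          requires:
            - test_profiler    # deploy only if the tests passed
          filters:
            branches:
              only: master     # deploy only from the master branch
```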

With CircleCI and a config.yml file, the Snowflake Data Profiler deployment process is fully automated.

Closing Thoughts

Without the use of GitHub, unit tests, and CircleCI, it would be nearly impossible for us to continuously add updates to Snowflake Data Profiler. GitHub allows us to efficiently make additions to the project files without worrying about getting it right on the first try. Unit tests ensure that the additions we make do not break the application. CircleCI brings the entire automation process together. All we have to do is approve a PR in GitHub and, assuming all of the build jobs pass, CircleCI automatically updates our deployed application for us.

Automation makes updating a deployed application extremely reliable and repeatable. It removes the unnecessary, manual challenges of the deployment process. By applying automation to Snowflake Data Profiler, we are able to get new features and minor changes into production quickly, because even if a change does not end up working, the automated tests we created catch the failure.

This lets us focus on adding changes and value to the application rather than spending long hours worrying about how to implement those changes correctly in deployment. And if we decide to remove a recently added change, CircleCI lets us roll the application back to a version from before that change was introduced.

Deploying an application always comes with challenges. However, after setting up automation for Snowflake Data Profiler, we don’t have to worry about the headaches that come with deploying additional changes.

Ready to accelerate your digital transformation?

At Hashmap, we work with our clients to build better, together.

If you want to go beyond the blogs and get to the heart of how you can pour jet fuel into your data and analytics tank, please reach out and let us take it from here.
