Load Data from S3 to Snowflake and Use TensorFlow Model

Dash Desai
Mar 11 · 5 min read

Learn how to load data from S3 to Snowflake and serve a TensorFlow model in StreamSets Data Collector, a fast data ingestion engine, by building a data pipeline that scores data flowing from S3 to Snowflake.

Data and analytics are helping us become faster and smarter at staying healthy. Open data sets and analytics at cloud scale are key to unlocking the accuracy needed to make a real impact in the medical field. Data Clouds like Snowflake are prime enablers for opening up analytics, with self-service data access and scale that mirrors the dynamic nature of data today. One prominent pattern, and a key initial step, is loading data from S3 to Snowflake to get quick value out of your data cloud. However, the value of your Snowflake Data Cloud may never be realized if you can’t load it with data from your relational and streaming sources. Smart data pipelines let you send and sync your data from all corners of your business and land it in your data cloud in the right format, ready for analytics. This means you can provide production-ready data for data science and machine learning.

Use Case

The use case we’re going to review is classifying breast cancer tumors as benign or malignant. To demonstrate, we’ll be using the Wisconsin breast cancer dataset made available as part of scikit-learn. To learn how I’ve trained and exported a simple TensorFlow model in Python, check out my code on GitHub. As you’ll note, the model architecture is kept to a minimum and is pretty simple, with only a couple of layers. The important aspect of the code to pay attention to is how the model is exported and saved using TensorFlow SavedModelBuilder*.
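For reference, the Wisconsin breast cancer dataset mentioned above ships with scikit-learn and can be loaded directly; a minimal sketch:

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset bundled with scikit-learn.
data = load_breast_cancer()

# 569 tumor records, each with 30 numeric features (mean radius,
# mean texture, etc.) that the TensorFlow model uses for classification.
print(data.data.shape)  # (569, 30)
```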

*Note: To use TensorFlow models in Data Collector, they must be exported/saved using TensorFlow SavedModelBuilder in a supported language such as Python.

What is TensorFlow?

TensorFlow is an open source machine learning framework built for Deep Neural Networks. It was created by the Google Brain Team. TensorFlow supports scalable and portable training on CPUs, GPUs and TPUs. As of today, it is the most popular and active machine learning project on GitHub.

Data Pipeline Overview

Here’s the overview of the data pipeline.

  • Ingest breast cancer tumors data from Amazon S3
  • Perform transformations
  • Use TensorFlow machine learning model to classify tumor as benign or malignant
  • Store breast cancer tumors data along with the cancer classification in Snowflake

Load Data From S3 to Snowflake

Amazon S3

The origin will load breast cancer tumor records stored in a .csv file on Amazon S3. Some of the important configurations include:

  • AWS credentials — you can use either an instance profile or AWS access keys to authenticate
  • Amazon S3 bucket name
  • Object/file name, or patterns of file names, that contain the input data to load from Amazon S3

Field Converter

  • This processor will convert all input columns from String to Float. In this case, all fields need to be included in the list since they are all features that will be used by the TensorFlow model.
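Conceptually, the conversion this processor performs is equivalent to the following Python sketch (the field names here are illustrative, not the dataset’s actual column names):

```python
def convert_fields_to_float(record):
    """Convert every string-valued field in a record to a float,
    mirroring what the Field Converter does for all feature columns
    before they reach the TensorFlow model."""
    return {name: float(value) for name, value in record.items()}

# Illustrative record, as it might arrive from the .csv file on S3.
raw = {"mean_radius": "17.99", "mean_texture": "10.38"}
print(convert_fields_to_float(raw))  # {'mean_radius': 17.99, 'mean_texture': 10.38}
```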

TensorFlow Evaluator

  • This processor will use the exported TensorFlow model to score each record in real time and classify the tumor as benign (0) or malignant (1). Key configuration includes the path to the saved model, the model’s input and output operations, and the output field that will hold the model’s result.

Field Renamer

  • This processor will enable us to rename columns to make them a little more self-explanatory.

Expression Evaluator

  • This processor will use the expression ${record:value('/classification') == 0 ? 'Benign' : 'Malignant'} to evaluate the model’s classification output and create a new ‘condition’ column with the value ‘Benign’ or ‘Malignant’, depending on the tumor classification.
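In plain Python, the expression above amounts to the following (a sketch; ‘classification’ is the model’s 0/1 output field):

```python
def label_condition(record):
    """Mirror the Expression Evaluator: derive a human-readable
    'condition' field from the model's 0/1 'classification' output."""
    record["condition"] = "Benign" if record["classification"] == 0 else "Malignant"
    return record

print(label_condition({"classification": 0}))  # {'classification': 0, 'condition': 'Benign'}
print(label_condition({"classification": 1}))  # {'classification': 1, 'condition': 'Malignant'}
```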

Field Remover

  • This processor will enable us to remove columns that we don’t need to store in Snowflake.

Snowflake

The transformed data is stored in Snowflake. Some of the important configurations include:

  • Snowflake account, username, password, warehouse, database, schema, and table name. Note: the pipeline will automatically create the table if it doesn’t already exist, so there’s no need to pre-create it in Snowflake.
  • When processing new data, you can configure the Snowflake destination to use the COPY command to load data to Snowflake tables.
  • When processing change data capture (CDC) data, you can configure the Snowflake destination to use the MERGE command to load data to Snowflake tables.
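Under the hood, these two load modes map to standard Snowflake statements; the stage, table, and column names below are illustrative, not ones the pipeline generates:

```sql
-- Bulk load of new data (COPY path); stage and table names are illustrative.
COPY INTO tumor_classifications
  FROM @my_stage/tumor_data
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- CDC path: upsert changed records (MERGE); the join key is illustrative.
MERGE INTO tumor_classifications t
  USING staged_changes s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.condition = s.condition
  WHEN NOT MATCHED THEN INSERT (id, condition) VALUES (s.id, s.condition);
```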

Data Pipeline Execution

Upon executing the pipeline, the input breast cancer tumor records are passed through the data pipeline outlined above, including real-time scoring using the TensorFlow model. The output records sent to the Snowflake Data Cloud include the breast cancer tumor features used by the model for classification, the model output value of 0 or 1, and the respective tumor condition, Benign or Malignant.
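For illustration, a single output record landing in the Snowflake table might look like this (the values are made up for the example, and the real record carries all 30 features):

```python
# Illustrative shape of one record written to the Snowflake table.
output_record = {
    "mean_radius": 17.99,     # tumor features used by the model...
    "mean_texture": 10.38,    # ...(28 more feature fields in practice)
    "classification": 1,      # raw model output: 0 or 1
    "condition": "Malignant", # label derived by the Expression Evaluator
}
print(output_record["condition"])  # Malignant
```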

Conclusion

The integration between StreamSets and Snowflake enables data engineers to build smart data pipelines to load data synchronously as well as asynchronously. A good example of a synchronous load is multi-table, real-time replication using change data capture (CDC); for asynchronous, high-throughput, insert-only workloads, data engineers can leverage the Snowpipe integration in StreamSets.

Get started today by deploying Data Collector, a fast data ingestion engine, on your choice of cloud provider, or download it for local development.

Originally published at https://streamsets.com on March 11, 2021.
