Photo by Tyler Tornberg on Unsplash

Amplify Machine Learning with Snowflake


Learn how machine learning models can leverage Snowflake’s capabilities to ramp up your analytics and predictions, and how Snowflake provides an efficient enterprise platform for your organization.

By Yuvraj Sidhu, Ricky Sharma

Let’s start with Artificial Intelligence and Machine Learning:

Big Data and Analytics have been hot buzzwords for the last five years, according to Forbes. Artificial Intelligence and Machine Learning have generated even more enthusiasm, and the two terms are often incorrectly used interchangeably.

Artificial Intelligence (AI) is a wide-ranging concept that describes machines carrying out tasks in a “smart” way. The idea has been around for centuries; Greek myths, for example, tell of mechanical men constructed to mimic human behavior.

Machine Learning (ML) is an applied branch of AI in which machines are given access to data and empowered to train themselves, learning from that data to produce insightful results. It is a modern approach, developed within the last century, that uses algorithms and increasingly complex calculations, based on example data, to create a generalized solution. Arriving at that solution requires numerous training iterations of the ML model. These models, classification and regression at a high level, are tools applied to the data to produce insightful results.
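To make the classification idea concrete, here is a minimal scikit-learn sketch (our illustration, not from the original article): an algorithm fits a generalized rule from labeled example data, then applies it to new observations.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Classification: learn a generalized rule from labeled example data
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # training runs many iterations internally
print(clf.predict(X[:3]))  # apply the learned rule to observations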

The trending Machine Learning techniques of 2019 are:

1. Generative Adversarial Networks (GAN) is an ML technique for generating data that is indistinguishable from the original data. (A minimal training sketch follows the use cases below.)

Sample use cases that apply GAN are:

· Privacy Preserving: A mechanism to identify sensitive data from your datastore, in order to provide only the non-sensitive data to external parties. See here for examples on Privacy Preserving.

· Anomaly/Outlier Detection: An automated way to analyze patterns that do not conform to expected behavior. See here for examples on Anomaly/Outlier Detection.
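As a toy illustration of the adversarial idea (our own sketch, with made-up data): a generator learns to produce samples indistinguishable from a simple Gaussian “real” distribution, while a discriminator learns to tell the two apart.

import torch
import torch.nn as nn

# Toy GAN: the generator learns to mimic samples from N(4.0, 1.25)
def real_data(n):
    return torch.randn(n, 1) * 1.25 + 4.0

def noise(n):
    return torch.randn(n, 8)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Discriminator update: label real samples 1, generated samples 0
    real, fake = real_data(64), G(noise(64)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to fool the discriminator into outputting 1
    loss_g = bce(D(G(noise(64))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(noise(1000)).mean().item())  # approaches 4.0 as the generator improves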

2. Reinforcement Learning (RL) provides a mathematical framework for solving sequential decision-making problems. (A tabular Q-learning sketch follows the use cases below.)

Sample use cases that apply Reinforcement Learning are:

· Multi-Agent Systems: A framework employed to control multi-intersection networks using multiple agents that learn by dynamically interacting with their environment. See here for examples on Multi-Agent Systems.

· Recommender Systems: A system that seeks to predict the “preference” of a user using collaborative filtering. See here for examples on Recommender Systems.

· Gaming and Deep RL: A technique to train an agent and create an opponent that imitates human-like behavior. See here for examples on Gaming and Deep RL.
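For intuition, here is a self-contained tabular Q-learning sketch (our own toy example): an agent learns, purely from interaction and rewards, to walk right along a small corridor.

import numpy as np

# Tabular Q-learning on a 1-D corridor: start at state 0, reward at state 4
n_states, n_actions = 5, 2  # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Bellman update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:-1])  # learned policy for non-terminal states: all 1s (step right)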

3. Natural Language Processing (NLP) accelerates analytics by transforming unstructured text into impactful insights and relevant data. (A sentiment-classification sketch follows the use cases below.)

Sample use cases that apply NLP are:

· Chatbots: AI that aims to interpret, recognize, and understand users’ requests in free-form text. See here for examples on Chatbots.

· Sentiment Analysis: Assigning a value (i.e., an emotion or a position) to text in order to extract opinions from it. See here for examples on Sentiment Analysis.

· Risk Assessment: Understanding risk using AI-based credit scoring, advanced fraud prevention, and more effective customer assessment. See here for examples on Risk Assessment.
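As a minimal sentiment analysis sketch (our own toy corpus, not a production pipeline), a TF-IDF plus logistic regression model assigns a positive or negative value to free-form text:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus (1 = positive, 0 = negative), purely illustrative
texts = ["great product, works perfectly",
         "awful service, very disappointed",
         "love the new features",
         "terrible experience, would not recommend",
         "excellent support team",
         "broken on arrival, waste of money"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["really happy with this purchase"]))  # expect [1]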

Growing data needs an efficient architecture for ML:

The digital age is expanding rapidly, with data volumes increasing year over year. Performing analysis against such gigantic data assets requires high-performance tools and platforms. Such platforms must handle large-scale data volumes and provide deep processing horsepower to efficiently support data analysis and machine learning use cases. Hence, we need a solid, sustainable architecture that brings together all the technology practices: Data Management, Analytics, Machine Learning, and Visualization. This architecture should provide enterprise-grade delivery layers serving the entire user base and their use cases.

Solving the HOW: Yes, a data processing platform will sustain your ML use cases!

As explained above, a sustainable environment requires data processing and machine learning to work together. Here are the building blocks for this architecture:

Yes, too many cooks spoil the broth! Each block in the above diagram holds a specific purpose; combining blocks raises sustainability and adoptability concerns. Here is the purpose served by each block:

1. Design and Discover:

a. Gather data: The data collection layer brings all the raw data from various source systems into a single storage layer for normalization and consolidation.

b. Transform and manage data: The data management layer transforms raw source data into presentable data assets, enabling a data-as-a-service capability. This empowers consumers such as data scientists, analysts, reporting applications, and even machine learning models to focus on their strengths rather than wrangling raw or unstructured data, which is a common architectural bottleneck.

c. Apply ML Model: This block applies machine learning algorithms to the use case and timeline. It is critical to determine the key questions that will affect the outcome of the model. It may also help to weigh the time spent building your own model against using a pre-built model, if one is available. Lastly, make sure the model meets the business objective before investing the time and resources of a training period. (A connector-based sketch of blocks 1a and 1b follows this list.)
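Here is a minimal sketch of blocks 1a and 1b using the Snowflake Connector for Python. All names (ML_WH, MY_STAGE, RAW_EVENTS, CURATED_EVENTS) are hypothetical placeholders for your own objects:

import snowflake.connector

# Hypothetical connection details and object names; substitute your own
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ML_WH", database="ANALYTICS", schema="PUBLIC")
cur = conn.cursor()

# 1a. Gather: land raw files from a stage into a single raw table
cur.execute("COPY INTO RAW_EVENTS FROM @MY_STAGE FILE_FORMAT = (TYPE = CSV)")

# 1b. Transform: publish a cleaned, consumable data asset for downstream users
cur.execute("""
    CREATE OR REPLACE TABLE CURATED_EVENTS AS
    SELECT event_id,
           TRY_TO_TIMESTAMP(event_ts) AS event_ts,
           LOWER(channel) AS channel
    FROM RAW_EVENTS
    WHERE event_id IS NOT NULL
""")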

2. Develop Machine Learning Models:

a. Train and Evaluate: When implementing machine learning algorithms, determining how your outputs are compared or categorized is a necessary step. You must carefully select the evaluation and validation criteria, and those criteria must fit the business goal. A sound approach is to test the same algorithm on various sizes of your dataset, splitting it between training and test sets to check actual vs. predicted values. There are multiple processes for producing these values, and identifying the appropriate variables becomes the logical next step. (A sketch covering 2a through 2c follows this list.)

b. Hyperparameter Tuning, Error Analysis: This is an iterative technique, used in conjunction with 2a, to select the best combination of parameters for a learning algorithm. The process can require specific industry expertise and can be extensive; still, there are common parameter tuning methods, such as grid search, random search, and Bayesian optimization.

c. Store Predictions: Once error is minimized, for example with a gradient descent algorithm, ML outputs can be integrated back into the same data storage layer, Snowflake. This brings uniformity to the data assets and structures the data publishing requirements.
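A compact sketch of 2a through 2c, assuming df is a pandas DataFrame of features with a LABEL column (for example, read from Snowflake upstream) and conn is the connection from the earlier sketch; table and column names are hypothetical:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from snowflake.connector.pandas_tools import write_pandas

# 2a. Split the data to check actual vs. predicted values on held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["LABEL"]), df["LABEL"], test_size=0.2, random_state=42)

# 2b. Grid search: try each hyperparameter combination with cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid={"n_estimators": [100, 300], "max_depth": [5, 10]},
                      cv=5)
search.fit(X_train, y_train)

# 2a. Evaluate against the business-driven criterion (accuracy here)
preds = search.predict(X_test)
print("test accuracy:", accuracy_score(y_test, preds))

# 2c. Store predictions back in Snowflake (PREDICTIONS is a hypothetical table)
write_pandas(conn, X_test.assign(PREDICTED_LABEL=preds), "PREDICTIONS",
             auto_create_table=True)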

3. Analyze and Report:

These blocks represent the data analysis and visualization layers, which help derive insights and conclusions from your enterprise data assets.

What to expect with Snowflake and Machine Learning integration?

A toilsome part of data science is data wrangling. This arduous work can consume most of an ML engineer’s time; once it is done, engineers can focus on creating robust models. Provisioning Snowflake as the data management layer matures the architecture by providing unlimited storage and dynamically scalable compute. Snowflake provides a consistent and scalable data delivery layer for all your use cases, users, and applications. Because Snowflake integrates different structured and semi-structured data sets, sourcing data becomes much more flexible from a machine learning standpoint.

Faster performance is a key factor in enabling more robust machine learning models. For a machine learning model to run and produce accurate predictions on both training and test data sets, Snowflake can scale compute up or down as needed. ML tools can also leverage a “push down” methodology, which sends data preparation workloads to Snowflake and uses its scalable compute power for efficiency. This lifts the data processing burden from ML tools and lets them focus on machine learning scenarios.
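A sketch of the scale-up, push-down, scale-down pattern, reusing the hypothetical cur cursor, ML_WH warehouse, and CURATED_EVENTS table from the earlier sketch:

# Scale compute up before a heavy data preparation push-down, then back down
cur.execute("ALTER WAREHOUSE ML_WH SET WAREHOUSE_SIZE = 'XLARGE'")
cur.execute("""
    CREATE OR REPLACE TABLE TRAINING_SET AS
    SELECT * FROM CURATED_EVENTS SAMPLE (80)  -- heavy prep runs inside Snowflake
""")
cur.execute("ALTER WAREHOUSE ML_WH SET WAREHOUSE_SIZE = 'XSMALL'")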

Machine Learning with SageMaker:

Here are some key reasons why Amazon SageMaker enhances the solutioning of machine learning models (a minimal train-and-deploy sketch follows the list):

· SageMaker allows engineers and developers to build, train, and deploy machine learning models.

· These software solutions can either be custom-built or purchased production-ready from AWS Marketplace.

· SageMaker provides a fully managed service to deploy and run models in a secure and scalable environment.

· It leverages distributed processing to optimize machine learning models that fit general industry use cases.

· Users can swiftly consume metrics from model training, evaluation, and validation.
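A minimal train-and-deploy sketch using the SageMaker Python SDK; the training script, S3 path, and IAM role below are hypothetical placeholders:

from sagemaker.sklearn.estimator import SKLearn

# Hypothetical script, S3 path, and IAM role; substitute your own
estimator = SKLearn(entry_point="train.py",  # your training script
                    framework_version="1.2-1",
                    instance_type="ml.m5.xlarge",
                    role="arn:aws:iam::123456789012:role/SageMakerRole")
estimator.fit({"train": "s3://my-bucket/train/"})  # managed, scalable training

# Deploy the trained model behind a managed, secure endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")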

How to connect SageMaker with Snowflake:

SageMaker can be easily integrated with Snowflake, which takes away the heavy lifting of machine learning and improves data integrity, as Snowflake maintains a single source of truth for all the data. Here are the key steps to integrate them:
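One common approach (a sketch, not the only integration path): install the Snowflake Connector for Python on the SageMaker notebook instance, configure your connection credentials, and query curated data directly into a pandas DataFrame for training. Connection details below are placeholders:

# Inside a SageMaker notebook
# first: pip install "snowflake-connector-python[pandas]"
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="ML_WH", database="ANALYTICS", schema="PUBLIC")

# Pull curated features straight into a DataFrame for model training
df = conn.cursor().execute("SELECT * FROM CURATED_EVENTS").fetch_pandas_all()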

Example SageMaker Outputs:

This GitHub link (Portfolio Management with Amazon SageMaker RL) presents an Amazon SageMaker example that runs an ML model using the Reinforcement Learning (RL) technique. The ‘Store intermediate training output and model checkpoints’ section notes that outputs can be stored on Amazon S3. However, ML outputs can also be maintained in Snowflake, expanding their accessibility.

Another reference shows how to configure Amazon SageMaker for Reinforcement Learning to solve the cartpole problem: a game whose goal is to keep a pole balanced on a moving cart by applying appropriate forces to its pivot point.
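For a feel of the task, here is a minimal random-policy rollout (our own sketch, unrelated to the linked notebook), assuming the classic gym API (pre-0.26):

import gym

# Cartpole: keep the pole upright by pushing the cart left or right
env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # random actions; a trained RL policy would go here
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)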

Closing notes:

Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, and the elasticity of the cloud. This empowers data consumers, such as machine learning workloads, to easily collaborate with Snowflake and derive accurate results at lower cost and in less time.

Resources:

· Looking for more Snowflake expertise? Let’s talk.

· Looking for more SageMaker examples or expertise? See Slalom Technology, GitHub, and AWS Docs.

About Authors:

Ricky Sharma is a Solution Architect for the Data and Analytics Practice at Slalom Consulting’s New York office (a full-service business and technology consulting firm). He is passionate about learning new tools, technologies, and skill sets to implement best-in-class solutions that solve challenging data problems. He loves to combine the technical and functional aspects of his day-to-day work, seamlessly blending innovative ideas, attitude, and technologies to drive positive change in people and processes.

Yuvraj Sidhu is an Analyst for the Data and Analytics Practice at Slalom Consulting’s New York office. He is excited about predictive and big data analytics technologies that drive insights from enterprise data, and he continuously advances his knowledge of data science. He is passionate about applying solutions with machine learning, optimization, and agile projects. He also mentors young students aspiring to work in the STEM industry, enjoys playing basketball and tennis, and follows trending Internet of Things (IoT) applications.