Unveiling Fraud in Digital Payments with Snowflake ML Functions

Digital payments and financial fraud (Photo Credit: Unsplash)

The digital era has witnessed numerous advancements since its inception, and digital payments have become a cornerstone of modern commerce. From online shopping to mobile banking, the convenience of making transactions with a few taps or clicks has transformed consumer behavior. Online platforms and mobile applications have made digital payment methods more accessible and convenient, driving a significant increase in the volume of transactions conducted electronically. Most of us have used at least one of these innovations, from digital wallets and contactless payments to blockchain, cryptocurrencies, peer-to-peer payment apps, and perhaps even a voice-activated payment.

With all these benefits of technological advancement also comes the dark side: financial fraud. The rise in online transactions has fueled financial fraud, particularly in credit card transactions. Fraudsters have become more sophisticated, employing advanced techniques to exploit vulnerabilities in payment systems and causing billions of dollars in losses globally.

The Power of Snowflake for Seamless Fraud Detection

Snowflake offers unparalleled capabilities that can transform fraud detection. This blog will walk you through how leveraging the Data Cloud’s machine learning capabilities can make your fraud detection strategies more efficient, accurate, and scalable, which not only helps reduce losses but also enhances customer trust and satisfaction.

Types of Fraud Detection Methods

There are several types of fraud detection methods: rule-based detection, supervised ML models such as support vector machines (SVM) and random forests trained on labeled datasets, unsupervised models such as neural networks trained on unlabeled datasets, and AI-assisted pattern detection, all of which are supported natively by Snowflake. We will be leveraging the Snowflake Classification ML Function, powered by a gradient boosting machine, to build a binary classification model that can carry out fraud detection in near-real time.

The Power of Snowflake ML Functions

The advantage of these powerful ML Functions is that they give you automated predictions and insights into your data using machine learning; Snowflake provides an appropriate type of model for each function, so you don’t have to be a machine learning expert to take advantage of them. For businesses needing to process vast amounts of data in real time, options include integrating a real-time pipeline inside Snowflake’s managed container offering, Snowpark Container Services.

The Snowflake Classification ML Function expects a dataset that includes a target column representing the labeled class of each data point and at least one feature column. There is flexibility to provide string, numeric, and Boolean features, and these are handled accordingly by Snowflake.

Let’s dive into the build. Below is the reference architecture of the end-to-end pipeline that we will be building.

Figure 1: End-to-end Architecture using Snowflake ML for Fraud Detection

Setting Up the Environment

To begin, we need to set up our Snowflake environment, including the database, schema, and virtual warehouse, in a Snowflake Notebook. Snowflake Notebooks is a development interface in Snowsight that offers an interactive, cell-based programming environment for Python and SQL.

Since we will be using the Snowflake Python API, we will use the get_active_session() method to get the active session context. Notebooks come with a ton of features that are outside the scope of this blog, and I highly encourage you to read the Snowflake documentation.
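
Below is a minimal setup sketch; the database, schema, and warehouse names are placeholders chosen for illustration rather than the exact ones used in the Quickstart.

from snowflake.snowpark.context import get_active_session

# Inside a Snowflake Notebook a session already exists, so we simply grab it.
session = get_active_session()

# Create the objects the pipeline will use (hypothetical names).
session.sql("CREATE DATABASE IF NOT EXISTS FRAUD_DB").collect()
session.sql("CREATE SCHEMA IF NOT EXISTS FRAUD_DB.FRAUD_SCHEMA").collect()
session.sql("CREATE WAREHOUSE IF NOT EXISTS FRAUD_WH WITH WAREHOUSE_SIZE = 'XSMALL'").collect()
session.use_schema("FRAUD_DB.FRAUD_SCHEMA")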

Loading and Preparing Data

Next, we will load transaction data from a CSV file into Snowflake and create a Snowpark DataFrame for further processing using the Snowpark Python library. To approximate real-world conditions, a simulated dataset was created using a custom Python function and loaded into an external Snowflake stage. Data preparation should also address class imbalance (less than 1% of transactions are fraudulent), a mix of numerical and categorical features (including high cardinality), non-trivial relationships between features, and time-dependent fraud scenarios.
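
As a sketch of this step, the snippet below reads the simulated CSV from a stage into a Snowpark DataFrame and persists it as a table; the stage name, file name, and column list are assumptions for illustration.

from snowflake.snowpark.types import (
    StructType, StructField, StringType, DoubleType, IntegerType
)

# Hypothetical schema matching the simulated transaction attributes.
schema = StructType([
    StructField("TRANSACTION_ID", StringType()),
    StructField("CUSTOMER_ID", StringType()),
    StructField("TRANSACTION_AMOUNT", DoubleType()),
    StructField("MERCHANT", StringType()),
    StructField("LOCATION", StringType()),
    StructField("LATITUDE", DoubleType()),
    StructField("LONGITUDE", DoubleType()),
    StructField("CLICKS", IntegerType()),
    StructField("TIME_ELAPSED", DoubleType()),
    StructField("IS_FRAUD", IntegerType()),
])

# Read the CSV from the external stage and save it as a table.
raw_df = session.read.schema(schema).option("SKIP_HEADER", 1).csv("@fraud_stage/transactions.csv")
raw_df.write.save_as_table("CREDIT_CARD_TRANSACTIONS", mode="overwrite")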

When constructing a credit card fraud detection model, it is very important to use features that allow accurate classification, so we have pre-applied techniques to handle the concerns above. The simulator uses rules to generate both legitimate transactions and fraudulent behaviors, helping the model learn patterns from attributes like the following (a minimal generator sketch appears after the list):

  • Number of clicks
  • Number of pages visited
  • Time Elapsed
  • Transaction Amount
  • Location
  • Merchant
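
The actual generator ships with the Quickstart code; the function below is only a minimal illustration of the idea, with hard-coded ranges chosen to mimic the class imbalance and behavioral differences described above.

import random

def simulate_transaction(fraud_rate: float = 0.01) -> dict:
    # Roughly 1% of generated transactions are labeled fraudulent.
    is_fraud = random.random() < fraud_rate
    return {
        # Illustrative rule: fraudulent sessions click more and move faster.
        "CLICKS": random.randint(20, 60) if is_fraud else random.randint(1, 15),
        "PAGES_VISITED": random.randint(1, 10),
        "TIME_ELAPSED": random.uniform(5, 60) if is_fraud else random.uniform(60, 600),
        "TRANSACTION_AMOUNT": random.uniform(500, 5000) if is_fraud else random.uniform(5, 300),
        "LOCATION": random.choice(["NEW YORK", "LONDON", "SINGAPORE", "SYDNEY"]),
        "MERCHANT": f"MERCHANT_{random.randint(1, 100):03d}",
        "IS_FRAUD": int(is_fraud),
    }

rows = [simulate_transaction() for _ in range(100_000)]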

Creating and Managing the Feature Store

A feature store in Snowflake makes managing and retrieving features for data science and machine learning workloads easier and more efficient. Feature entities are the underlying objects that features and feature views are associated with; they encapsulate the join keys used for feature lookups.
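
Here is a sketch of creating the feature store and registering an entity with the snowflake.ml.feature_store API; the store and entity names are placeholders.

from snowflake.ml.feature_store import FeatureStore, Entity, CreationMode

fs = FeatureStore(
    session=session,
    database="FRAUD_DB",
    name="FRAUD_FEATURES",            # schema that backs the feature store
    default_warehouse="FRAUD_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

# The entity encapsulates the join key used for feature lookups.
customer = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
fs.register_entity(customer)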

Feature Engineering

Feature engineering is a vital part of building high-quality machine learning applications: raw data is transformed into features that can be used to train models. We perform feature engineering to create meaningful features from the customer behavior and transaction data for our machine learning model. Various features related to user spending patterns are generated, including weekly, monthly, and yearly spending, as well as transactional statistics that can be used for further analysis or machine learning tasks such as fraud detection.
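
As an example of this step, the Snowpark snippet below derives per-customer aggregates of the kind listed above; the source table and column names are assumptions carried over from the earlier sketches.

from snowflake.snowpark import functions as F

txns = session.table("CREDIT_CARD_TRANSACTIONS")

# Per-customer transactional statistics for the feature store.
customer_features = txns.group_by("CUSTOMER_ID").agg(
    F.count("TRANSACTION_ID").alias("TOTAL_TRANSACTIONS"),
    F.stddev("TRANSACTION_AMOUNT").alias("STDDEV_TRANSACTION_AMOUNT"),
    F.count_distinct("MERCHANT").alias("NUM_UNIQUE_MERCHANTS"),
    F.avg("TRANSACTION_AMOUNT").alias("MEAN_TRANSACTION_AMOUNT"),
)
customer_features.write.save_as_table("CUSTOMER_FEATURES", mode="overwrite")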

Figure 2: Generated Customer Dataset

The transaction dataset contains generated features related to the nature of the transactions.

Figure 3: Generated Transactions Dataset

Building Feature Views

Feature views help in managing and reusing features efficiently. A feature view acts as a wrapper, encapsulating a pipeline that transforms raw data into one or more related features that are refreshed from the data source at the same time. Multiple versions of a feature view can be created to capture point-in-time trends, and the relevant version should be consumable during model training.
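
Continuing the sketch, the snippet below wraps the customer aggregates in a feature view and registers a version of it; the refresh cadence and names are illustrative.

from snowflake.ml.feature_store import FeatureView

customer_fv = FeatureView(
    name="CUSTOMER_SPEND_FEATURES",
    entities=[customer],
    feature_df=customer_features,    # the Snowpark DataFrame built earlier
    refresh_freq="1 day",            # materialize and refresh once a day
    desc="Spending statistics per customer",
)
registered_fv = fs.register_feature_view(feature_view=customer_fv, version="V1")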

Model Building and Prediction

Since the heavy lifting is done by Snowflake, an analyst can jump in directly and create a new classification model or replace an existing one in the current or specified schema.

Use the CREATE SNOWFLAKE.ML.CLASSIFICATION command to create and train a model.

-- Split the transaction history into training and validation datasets after
-- balancing and scaling; training_fd_table contains the training data.
CREATE OR REPLACE VIEW fraud_classification_training_view AS
SELECT IS_FRAUD, LATITUDE, LONGITUDE, LOCATION, TOTAL_TRANSACTIONS,
       STDDEV_TRANSACTION_AMOUNT, NUM_UNIQUE_MERCHANTS, MEAN_WEEKLY_SPENT,
       MEAN_MONTHLY_SPENT, MEAN_YEARLY_SPENT, TIME_ELAPSED, CLICKS,
       CUMULATIVE_CLICKS, CUMULATIVE_LOGINS_PER_HOUR
FROM training_fd_table;

-- Create and train the classification model on the training view.
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION fraud_classification_model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'fraud_classification_training_view'),
    TARGET_COLNAME => 'IS_FRAUD'
);

-- To view the classification model, use the SHOW command.
SHOW SNOWFLAKE.ML.CLASSIFICATION;

To run inference (prediction) on a dataset, use the model’s PREDICT method. The results are saved to a table.

-- The fraud_classification_val_view contains the validation data.
CREATE OR REPLACE TABLE fraud_predictions AS
SELECT *,
       fraud_classification_model!PREDICT(INPUT_DATA => OBJECT_CONSTRUCT(*)) AS predictions
FROM fraud_classification_val_view;

The model returns output as a prediction object, which includes the predicted probability for each class and the predicted class based on the maximum predicted probability. The predictions are returned in the same order as the input rows were provided.
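
Because the prediction object is semi-structured, its fields can be unpacked with Snowflake’s path syntax. The query below is a sketch that assumes IS_FRAUD takes the values 0 and 1:

session.sql("""
    SELECT
        predictions:class::STRING          AS predicted_class,
        predictions:probability:"1"::FLOAT AS fraud_probability
    FROM fraud_predictions
    ORDER BY fraud_probability DESC
""").show()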

Understanding Evaluation Metrics

In machine learning, metrics are essential for evaluating how accurately a model predicts new data. After training a model with the transactional dataset, it is crucial to assess its performance and outcomes. Various evaluation metrics are available, such as Recall (RC), F1-score, Precision, Area Under the Curve (AUC), Log Loss, and Root Mean Squared Error (RMSE). However, within the realm of credit card fraud detection, certain metrics are typically favored due to the inherent imbalance in the datasets.

To evaluate the model, the following methods have been used to determine overall performance (the corresponding calls are shown after the list):

  • show_evaluation_metrics():
    - Precision: The ratio of true positives to the total predicted positives.
    - Recall (Sensitivity): The ratio of true positives to the total actual positives.
    - F1 Score: The harmonic mean of precision and recall.
  • show_threshold_metrics(): Provides raw counts and metrics for a specific threshold for each class.
  • show_confusion_matrix(): A table used to assess the performance of a model by comparing predicted and actual values and evaluating its ability to correctly identify positive and negative instances.
  • show_feature_importance(): Represents an approximate ranking of the features in your trained model by counting the number of times the model’s trees used each feature to make a decision.
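
Each of these methods is invoked directly on the trained model and returns its results as a table, for example:

session.sql("CALL fraud_classification_model!SHOW_EVALUATION_METRICS()").show()
session.sql("CALL fraud_classification_model!SHOW_THRESHOLD_METRICS()").show()
session.sql("CALL fraud_classification_model!SHOW_CONFUSION_MATRIX()").show()
session.sql("CALL fraud_classification_model!SHOW_FEATURE_IMPORTANCE()").show()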

Build a Fraud Detection App for Geospatial Analysis with Streamlit in Snowflake

At this stage of the project, we have completed building our end-to-end pipeline and are ready to use the trained model for detecting fraudulent patterns in new transactions. We will build a Streamlit app in Snowflake to deliver a fraud detection data app for end users to easily understand the nature of incoming new transactions.

The feature importance ranking shows that merchant, location, time elapsed, and clicks were the decisive factors in detecting and predicting fraudulent transactions, which speaks to the richness of our data for effective analysis, detection, and prediction.
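
A minimal sketch of such an app follows; it assumes the fraud_predictions table carries LATITUDE and LONGITUDE columns alongside the prediction object.

import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()
st.title("Fraud Detection: Geospatial View")

# Pull only the transactions the model flagged as fraudulent.
flagged = session.sql("""
    SELECT LATITUDE, LONGITUDE
    FROM fraud_predictions
    WHERE predictions:class::STRING = '1'
""").to_pandas()

st.metric("Flagged transactions", len(flagged))
st.map(flagged.rename(columns={"LATITUDE": "lat", "LONGITUDE": "lon"}))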

Clip 1: Geospatial Analysis-powered Fraud Detection App built with Streamlit in Snowflake

Interested in trying this out yourself? Here is the link to the Quickstart that contains the instructions and the code.

Conclusion

The key to a successful digital transformation strategy is a data strategy, and there is no AI or ML strategy without one. Maintaining high-quality, consistent data is essential for effective fraud detection. In this blog, we demonstrated an approach to building an automated fraud detection model that monitors all incoming transactions and continuously scores them. This system can be expanded to include automated alerts when fraudulent transactions are detected.

Overall, we implemented an end-to-end pipeline comprising data ingestion, feature engineering, model building, and a data application to tackle the nuances of fraudulent transaction identification within the secure boundaries of the Snowflake Data Cloud. This framework can be extended to various other applications, providing a scalable and adaptable solution for fraud detection across different sectors. By leveraging Snowflake’s powerful capabilities, businesses can stay ahead in the ongoing battle against fraud, ensuring enhanced security and trust in their financial systems.
