Unlock the Power of AI Optimization, Train Models 3 Times Faster.

Artem Arutyunov · Published in The Power of AI · 6 min read · Apr 5, 2023

Wondering how Machine Learning can help with credit card fraud detection? In this blog, we will discuss how to utilize the high-performance IBM Library Snap ML to accelerate the training of your Machine Learning models for detecting fraudulent credit card transactions.

As the world shifts towards online payment methods at a faster pace, enhancing credit card fraud detection has become a priority for financial organizations. With the help of Machine Learning, organizations can detect credit card fraud more easily and efficiently. Want to know how? Read on and find out :)

To see detailed explanations of all the concepts mentioned here, and to analyze and experiment with the code for this blog, click on:

You can also take many FREE courses and projects about data science and other technology topics from Cognitive Class.

Let’s start:

Snap ML is a library for accelerated training and inference of Machine Learning models such as linear models, decision trees, random forests, and boosting machines. It is developed and maintained by IBM Research. The library binaries are freely available on PyPI, and it supports Linux/x86, Linux/Power, Linux/Z, macOS, and Windows; GPU support is also available for Linux. If you are curious, you can find detailed documentation and usage examples online.

We will focus on training acceleration in particular, using two popular classification models to recognize fraudulent credit card transactions: the Decision Tree and the Support Vector Machine.

Scikit-learn applications can be seamlessly accelerated using Snap ML. This seamless integration is possible because the Snap ML library exposes a Scikit-learn-compatible Python API.
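As a minimal sketch of what this compatibility looks like (assuming Snap ML has been installed from PyPI, e.g. with pip install snapml), switching from a Scikit-learn estimator to its Snap ML counterpart is typically just an import change:

# Scikit-learn estimator
from sklearn.tree import DecisionTreeClassifier as SklearnDT
# Snap ML counterpart with the same fit/predict interface
from snapml import DecisionTreeClassifier as SnapMLDT

sk_model = SklearnDT(max_depth=4, random_state=35)
snap_model = SnapMLDT(max_depth=4, random_state=35, n_jobs=4)  # n_jobs: CPU threads used by Snap ML

Both objects can then be trained and evaluated with exactly the same calls.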

Data:

When you are building a model that predicts if a credit card transaction is fraudulent or not, you can model the problem as a binary classification problem. A transaction belongs to the positive class (1) if it is a fraud, otherwise, it belongs to the negative class (0).

The dataset that we will be using is the Credit Card Fraud Detection dataset from Kaggle.

We will use pandas to work with the dataset itself. Let's see how it looks:
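Here is a minimal sketch of loading and inspecting the data (the file name creditcard.csv is an assumption; point it at wherever you saved the Kaggle dataset):

import pandas as pd

# load the Kaggle Credit Card Fraud Detection dataset (the file path is an assumption)
big_raw_data = pd.read_csv('creditcard.csv')

# inspect the first rows and the overall shape of the dataset
print(big_raw_data.head())
print('shape =', big_raw_data.shape)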

Each row in the dataset represents a credit card transaction. As shown above, each row has 31 variables. One variable (the last variable in the table above) is called `Class` and represents the target variable. Our objective will be to train a model that uses the other variables to predict the value of the `Class` variable. Let’s first retrieve basic statistics about the target variable.

Note: For confidentiality reasons, the original names of most features are anonymized as V1, V2, ..., V28. The values of these features are the result of a PCA transformation and are numerical.

Let's look at the distribution of the target classes:
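A short sketch of how these statistics can be retrieved with pandas:

# number of samples in each class (0 = legitimate, 1 = fraudulent)
print(big_raw_data['Class'].value_counts())

# fraction of fraudulent transactions
print('fraud ratio =', big_raw_data['Class'].mean())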

As shown above, the Class variable has two values:

- 0 (the credit card transaction is legitimate)
- 1 (the credit card transaction is fraudulent)

Most transactions are legitimate and only a small fraction are fraudulent, so you typically have access to a highly unbalanced dataset. This is also the case for the current dataset: the positive class (the frauds) accounts for only 0.172% of all transactions (492 out of the 284,807 transactions in the original Kaggle dataset).

Thus, you need to model a binary classification problem. Moreover, the dataset is highly unbalanced: the target variable classes are not represented equally. This requires special attention both when training a model and when evaluating its quality. One way of handling this at training time is to bias the model to pay more attention to the samples in the minority class. The models in this study are configured to take the class weights of the samples into account at train/fit time.

Data Processing:

Data preprocessing, such as scaling and normalization, is typically useful for linear models because it accelerates training convergence. We standardize the features by removing the mean and scaling to unit variance.

from sklearn.preprocessing import StandardScaler, normalize

# standardize the features: remove the mean and scale to unit variance
big_raw_data.iloc[:, 1:30] = StandardScaler().fit_transform(big_raw_data.iloc[:, 1:30])
data_matrix = big_raw_data.values

# X: feature matrix (for this analysis, we exclude the Time variable from the dataset)
X = data_matrix[:, 1:30]

# y: labels vector
y = data_matrix[:, 30]

# data normalization
X = normalize(X, norm="l1")

# print the shape of the features matrix and the labels vector
print('X.shape=', X.shape, 'y.shape=', y.shape)

Output:

X.shape= (2848070, 29) y.shape= (2848070,)

Now that the dataset is ready for building the classification models, we split it: part of the input dataset is used to train the model, and the remaining data is used to assess the quality of the trained model.

from sklearn.model_selection import train_test_split

# hold out 30% of the data (stratified on the labels) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Output (shapes of the resulting splits):

X_train.shape= (1993649, 29) y_train.shape= (1993649,)
X_test.shape= (854421, 29) y_test.shape= (854421,)

Model 1. Decision Trees with Scikit-Learn:

Let’s first use Decision Trees for the problem and evaluate the performance of the classifiers from Scikit-learn and Snap ML.

First, we compute the sample weights (stored in w_train) to be used as input to the training routine so that it takes into account the class imbalance present in this dataset.

from sklearn.utils.class_weight import compute_sample_weight

# weight each sample inversely proportional to its class frequency
w_train = compute_sample_weight('balanced', y_train)

Let's import the Decision Tree Classifier model from Scikit-learn, set the maximum depth of the tree to 4, and train it using our training dataset.

from sklearn.tree import DecisionTreeClassifier
sklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)
sklearn_dt.fit(X_train, y_train, sample_weight=w_train)

It took the model approximately 1 minute to train. We can then compute the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) score of the classifier's predictions using the roc_auc_score function, applied to the probabilities of the test samples belonging to the fraudulent class returned by the model. This results in a ROC-AUC score of 0.966, which is really good.
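Here is a sketch of that measurement: we wrap the fit call in a simple timer and score the predicted probabilities of the fraud class with roc_auc_score from sklearn.metrics.

import time
from sklearn.metrics import roc_auc_score

# time the training of the Scikit-Learn Decision Tree
t0 = time.time()
sklearn_dt.fit(X_train, y_train, sample_weight=w_train)
sklearn_time = time.time() - t0
print("[Scikit-Learn] training time (s): {0:.2f}".format(sklearn_time))

# probabilities of the test samples belonging to the fraudulent class
sklearn_pred = sklearn_dt.predict_proba(X_test)[:, 1]
print("[Scikit-Learn] ROC-AUC score: {0:.3f}".format(roc_auc_score(y_test, sklearn_pred)))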

Model 2. Decision Trees with Snap ML:

Let's import the Decision Tree Classifier model from Snap ML. Thanks to its Scikit-learn-compatible Python API, we can reuse the same sample weights we computed earlier with Scikit-learn's compute_sample_weight function to train the Snap ML Decision Tree.

Unlike Scikit-learn, Snap ML offers multi-threaded CPU/GPU training of decision trees. To set the number of CPU threads used at training time, set the n_jobs parameter. Let's create the model and train it using our training dataset:

from snapml import DecisionTreeClassifier  # note: shadows Scikit-learn's DecisionTreeClassifier

snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)
snapml_dt.fit(X_train, y_train, sample_weight=w_train)

It took the model approximately 10 seconds to train. Wow, nice improvement: more than 5 times faster. We can again compute the ROC-AUC score of the classifier's predictions using the roc_auc_score function and the probabilities of the test samples belonging to the fraudulent class returned by the model. This results in the same ROC-AUC score of 0.966, which is really good.
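The exact same evaluation pattern works for the Snap ML tree, which also lets us compute the speedup directly (a sketch, reusing sklearn_time and roc_auc_score from above):

# time the training of the Snap ML Decision Tree
t0 = time.time()
snapml_dt.fit(X_train, y_train, sample_weight=w_train)
snapml_time = time.time() - t0
print("[Snap ML] training time (s): {0:.2f}".format(snapml_time))
print("Decision Tree speedup: {0:.1f}x".format(sklearn_time / snapml_time))

# probabilities of the test samples belonging to the fraudulent class
snapml_pred = snapml_dt.predict_proba(X_test)[:, 1]
print("[Snap ML] ROC-AUC score: {0:.3f}".format(roc_auc_score(y_test, snapml_pred)))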

As shown above, both Decision Tree classifiers provide the same score on the test dataset. However, Snap ML runs the training routine multiple times faster than Scikit-Learn. This is one of the advantages of using Snap ML: acceleration of training of classical machine learning models, such as linear and tree-based models.

We can do a similar comparison with a Support Vector Machine.

Support Vector Machine

Let's import the linear Support Vector Machine (SVM) model from Scikit-Learn, define the model, indicate the class imbalance at training time by setting class_weight='balanced', and train it.

from sklearn.svm import LinearSVC
sklearn_svm = LinearSVC(class_weight='balanced', random_state=31, loss="hinge", fit_intercept=False)
sklearn_svm.fit(X_train, y_train)

Timing the training and evaluating it with ROC-AUC, we get a training time of around 87 seconds and a ROC-AUC score of 0.984.
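A sketch of that measurement: LinearSVC exposes decision_function rather than predict_proba, and roc_auc_score accepts those decision scores directly.

# time the training of the Scikit-Learn linear SVM
t0 = time.time()
sklearn_svm.fit(X_train, y_train)
sklearn_svm_time = time.time() - t0
print("[Scikit-Learn] training time (s): {0:.2f}".format(sklearn_svm_time))

# decision scores of the test samples (higher means more likely fraudulent)
sklearn_svm_scores = sklearn_svm.decision_function(X_test)
print("[Scikit-Learn] ROC-AUC score: {0:.3f}".format(roc_auc_score(y_test, sklearn_svm_scores)))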

Let's do the same thing with Snap ML:

from snapml import SupportVectorMachine
snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, n_jobs=4, fit_intercept=False)
model = snapml_svm.fit(X_train, y_train)

Timing and testing it in the same manner with ROC-AUC, we get a training time of around 32 seconds and a ROC-AUC score of 0.985.
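And the equivalent sketch for the Snap ML model, whose decision_function follows the same convention:

# time the training of the Snap ML linear SVM
t0 = time.time()
snapml_svm.fit(X_train, y_train)
snapml_svm_time = time.time() - t0
print("[Snap ML] training time (s): {0:.2f}".format(snapml_svm_time))
print("SVM speedup: {0:.1f}x".format(sklearn_svm_time / snapml_svm_time))

# decision scores of the test samples
snapml_svm_scores = snapml_svm.decision_function(X_test)
print("[Snap ML] ROC-AUC score: {0:.3f}".format(roc_auc_score(y_test, snapml_svm_scores)))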

Once again, both SVM models provide essentially the same score on the test dataset. However, as in the case of the Decision Trees, Snap ML runs the training routine considerably faster than Scikit-Learn.

Moreover, as shown above, not only does Snap ML seamlessly accelerate scikit-learn applications, but the library's Python API is also compatible with scikit-learn metrics and data preprocessors.

If you want to know more about the power of Snap ML, check the implementation, and get more detailed explanations of the above concepts, click on Credit Card Fraud Detection using Scikit-Learn and Snap ML, or explore other FREE courses and projects about data science and machine learning on Cognitive Class.

Thanks for reading. Author:

https://www.linkedin.com/in/artem-arutyunov/


Hey, Artem here. I love helping people learn, and learning myself. IBM Data Science Intern + studying Math and Stats at the University of Toronto.