Real-Time Prediction of Credit Card Fraud deployed on AWS using Spark and XGBoost (Part I)

Jie Zhang
5 min read · May 28, 2020


When we swipe or insert our credit card to purchase something, a fraud model kicks off on the backend to either block the transaction or let it go through. The project below uses state-of-the-art technologies, as of today, to mimic that process.

Given the nature of the business goal, the system requires a quick response (real-time) and a high recall rate (we would rather block a legitimate payment than let someone steal money).

The goal of this article is to show you how the various technologies work together on AWS, and how we can deploy the trained model so that a client application can call it and get a result back in real time.

I am going to use the public Credit Card Fraud dataset, which you can find at the link below.

https://www.kaggle.com/mlg-ulb/creditcardfraud

High-level Methodologies

The model we are going to use is XGBoost, the Kaggle superstar. The parameters for the model are pre-tuned. A separate article would be needed to demonstrate why we picked this model and how we tuned the parameters; for the purpose of this article, the parameters are pre-defined.

The entire notebook and code can be found on my GitHub page. In this article, I try to show you the essential steps.

The article consists of a few high-level steps:

  • ETL step: AWS Glue. By executing a SparkML feature-processing job written in Python, the data in AWS S3 will be ready for SageMaker to train on. (Part I)
  • Building and Training: Using SageMaker XGBoost to train on the processed dataset produced by the SparkML job. (Part II)
  • Deploy an Inference Pipeline consisting of the SparkML & XGBoost models as a real-time inference endpoint. (Part II)
  • Set up a Lambda function and API Gateway so that the client application can call the endpoint and get the result back. (Part III)

Part I: Feature Processing by using SparkML job executed by AWS Glue

According to the AWS website, AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. We will use AWS Glue to execute our Spark jobs.

Finding out the current execution role of the Notebook

The very first step is to check your AWS role and make sure it can execute AWS Glue jobs. You can find the current execution role of the notebook as below:

import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()
role = get_execution_role()

Your role can be set up when you create the Jupyter notebook instance.

Make sure your role has the trust relationship below.
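For reference, here is a minimal sketch of such a trust policy, written as a Python dict (your account's actual policy may contain more statements). The key point is that glue.amazonaws.com must be allowed to assume the role so Glue can run the job on your behalf:

# Illustrative trust policy allowing both SageMaker and Glue to assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["sagemaker.amazonaws.com", "glue.amazonaws.com"]},
            "Action": "sts:AssumeRole"
        }
    ]
}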

Creating an S3 bucket and uploading this dataset

import boto3
import botocore

boto_session = sess.boto_session
s3 = boto_session.resource('s3')
account = boto_session.client('sts').get_caller_identity()['Account']
region = boto_session.region_name
default_bucket = 'aws-creditcrdfraud-{}-{}'.format(account, region)
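# Create the bucket if it does not already exist (the upload below assumes it does).
# Note: us-east-1 does not accept a LocationConstraint.
try:
    if region == 'us-east-1':
        s3.create_bucket(Bucket=default_bucket)
    else:
        s3.create_bucket(Bucket=default_bucket,
                         CreateBucketConfiguration={'LocationConstraint': region})
except botocore.exceptions.ClientError as e:
    print('Bucket already exists or could not be created: {}'.format(e))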

sess.upload_data(path='creditcardfraud.csv', bucket=default_bucket, key_prefix='input/creditcard')

You can check your S3 bucket now and make sure the data has been loaded to Amazon S3.

SparkML Python Script for feature processing

You may prefer to use Scala for the Spark work, as it can run faster than the equivalent Python code. As a data scientist, though, I am more familiar with Python and its Spark libraries. As long as we can meet the latency requirement, there is no harm in using Python at this point.

creditcardprocessing.py is the script in which we transform the features. It is actually pretty standard. You can do one-hot encoding or call other SparkML APIs here to complete more complex transformation work.

Once the SparkML pipeline fit and transform are complete, our dataset is split 80/20 into train and test parts. The data is then written to the S3 bucket as train and validation sets.
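The full script is on my GitHub page; below is a minimal sketch of the kind of processing it performs, assuming the Kaggle column layout (V1–V28, Amount, Time, Class). The S3 path and column choices here are illustrative, not the exact script:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName('creditcard-processing').getOrCreate()

# Read the raw Kaggle CSV from S3 (path is illustrative)
df = spark.read.csv('s3://your-bucket/input/creditcard/creditcardfraud.csv',
                    header=True, inferSchema=True)

# Assemble the anonymized PCA features plus Amount into one vector and scale it
feature_cols = ['V{}'.format(i) for i in range(1, 29)] + ['Amount']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features_raw')
scaler = StandardScaler(inputCol='features_raw', outputCol='features')

pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(df)
transformed = model.transform(df)

# 80/20 split; the real script writes these back to S3 as train and validation
train, validation = transformed.randomSplit([0.8, 0.2], seed=42)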

script_location = sess.upload_data(path='creditcardprocessing.py', bucket=default_bucket, key_prefix='codes')

Next, we need to serialize the fitted pipeline by using MLeap. MLeap is a serialization format and execution engine for machine learning pipelines. When you run a SparkML job on AWS Glue, the SparkML pipeline is serialized into MLeap format. Then, you can use it with the SparkML Model Serving Container in an Amazon SageMaker Inference Pipeline.

By using the SerializeToBundle() method from MLeap in the script, we serialize the ML pipeline into an MLeap bundle and upload it to S3 in tar.gz format, as SageMaker expects.
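Continuing the sketch above, the serialization step inside the script might look roughly like this (assuming the mleap-pyspark bindings; paths are illustrative):

import zipfile
import tarfile
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # adds serializeToBundle()

# Serialize the fitted pipeline model into an MLeap bundle (a zip file)
model.serializeToBundle('jar:file:/tmp/model.zip', model.transform(df))

# The SageMaker SparkML serving container expects a tar.gz, so repackage the bundle
with zipfile.ZipFile('/tmp/model.zip') as zf:
    zf.extractall('/tmp/model')
with tarfile.open('/tmp/model.tar.gz', 'w:gz') as tar:
    tar.add('/tmp/model/bundle.json', arcname='bundle.json')
    tar.add('/tmp/model/root', arcname='root')

The tar.gz is then uploaded to the S3 model location so SageMaker can pick it up.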

Upload MLeap dependencies to S3

In order for AWS Glue to execute the ETL work and produce the MLeap bundle, the dependency libraries also need to be uploaded to the same S3 bucket explicitly.

The two packages below, the MLeap Python zip and the MLeap Spark assembly jar, need to be downloaded and then uploaded to S3 to finish our work.

python_dep_location = sess.upload_data(path='python.zip', bucket=default_bucket, key_prefix='dependencies/python')

jar_dep_location = sess.upload_data(path='mleap_spark_assembly.jar', bucket=default_bucket, key_prefix='dependencies/jar')

Defining output locations for the data and model

By defining the output locations, the transformed data will be uploaded to the right place. We also specify a model location where the MLeap-serialized model will be uploaded.

from time import gmtime, strftime

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
s3_input_bucket = default_bucket
s3_input_key_prefix = 'input/creditcard'
s3_output_bucket = default_bucket
s3_output_key_prefix = timestamp_prefix + '/creditcard'
s3_model_bucket = default_bucket
s3_model_key_prefix = s3_output_key_prefix + '/mleap'

Creating Glue Job

With everything ready, we can finally invoke the create_job API for Glue to start its work. We pass all the necessary locations and file names to Glue, as well as the dependency locations.

We use boto3 to create the Glue client. Boto3 is the AWS SDK for Python.

The AllocatedCapacity parameter controls the hardware resources (the number of DPUs) allocated to the job.

glue_client = boto_session.client('glue')
job_name = 'sparkml-creditcard-' + timestamp_prefix

# Create the Glue job: the ETL script plus the MLeap dependencies uploaded above
response = glue_client.create_job(
    Name=job_name,
    Role=role,
    Command={'Name': 'glueetl', 'ScriptLocation': script_location},
    DefaultArguments={
        '--job-language': 'python',
        '--extra-py-files': python_dep_location,
        '--extra-jars': jar_dep_location},
    AllocatedCapacity=5)  # number of DPUs allocated to the job
glue_job_name = response['Name']

We kick off the job by calling start_job_run.

job_run_id = glue_client.start_job_run(JobName=job_name)['JobRunId']
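The processing script also needs to know the S3 locations defined earlier. One common pattern is to pass them as Glue job arguments when starting the run; the argument names below are purely illustrative and would have to match whatever creditcardprocessing.py actually reads:

job_run_id = glue_client.start_job_run(
    JobName=job_name,
    Arguments={
        '--S3_INPUT_BUCKET': s3_input_bucket,
        '--S3_INPUT_KEY_PREFIX': s3_input_key_prefix,
        '--S3_OUTPUT_BUCKET': s3_output_bucket,
        '--S3_OUTPUT_KEY_PREFIX': s3_output_key_prefix,
        '--S3_MODEL_BUCKET': s3_model_bucket,
        '--S3_MODEL_KEY_PREFIX': s3_model_key_prefix})['JobRunId']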

We can use job_run_id to check the job status

job_run_status = glue_client.get_job_run(JobName=job_name, RunId=job_run_id)['JobRun']['JobRunState']
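Glue jobs take a few minutes, so in practice you would poll this status until the run reaches a terminal state. A minimal sketch:

import time

# Poll until the Glue job run finishes
while True:
    job_run_status = glue_client.get_job_run(
        JobName=job_name, RunId=job_run_id)['JobRun']['JobRunState']
    if job_run_status in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    print('Job status: {}'.format(job_run_status))
    time.sleep(30)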

If the job Succeeded, we will have CSV-format data in S3, ready for training the XGBoost model. If the job Failed, you can go to the AWS Glue console to check the reason for the failure.

In the next part, we will train the model and deploy it to an AWS endpoint using an Inference Pipeline.
