Real-Time Prediction of Credit Card Fraud deployed on AWS using Spark and XGBoost (Part II)

Jie Zhang
May 29, 2020 · 5 min read


OK, now we've completed the ETL work and are ready for model training and deployment.

The training we do here in AWS requires a fair amount of offline pre-work. You can certainly use an AWS Hyperparameter Tuning Job to tune the model's parameters before you deploy; that is a topic I may address in another article. For this one, let's assume we already have the best parameters and model, and see how AWS can help us deploy the model quickly.

Again, the entire notebook and code can be found on my GitHub page.

Part II: Using SageMaker XGBoost to train on the processed dataset produced by SparkML job

We will use the SageMaker XGBoost algorithm to train a model on the dataset.
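The snippets below assume a SageMaker session, execution role, and S3 output location were set up earlier in the notebook. A minimal sketch (the bucket and prefix here are placeholders):

import sagemaker
from sagemaker import get_execution_role

# Session, role, and output location used throughout the snippets below
# (the bucket/prefix are placeholders for your own values)
sess = sagemaker.Session()
role = get_execution_role()
bucket = sess.default_bucket()
s3_output_location = 's3://{}/fraud-detection/xgb-output'.format(bucket)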

First, we need to retrieve the XGBoost algorithm image:

from sagemaker.amazon.amazon_estimator import get_image_uri

# Look up the region-specific XGBoost container image
training_image = get_image_uri(sess.boto_region_name, 'xgboost', repo_version="latest")
print(training_image)

Next, we will set up the XGBoost model parameters and dataset details, including the locations of the training data, the validation data, and the output.

As mentioned earlier, the hyperparameters are pre-defined here. eta, gamma, max_depth, num_round, and subsample, among others, are the parameters we pass in.

xgb_model = sagemaker.estimator.Estimator(training_image,
                                          role,
                                          train_instance_count=1,
                                          train_instance_type='ml.m4.xlarge',
                                          train_volume_size=20,
                                          train_max_run=3600,
                                          input_mode='File',
                                          output_path=s3_output_location,
                                          sagemaker_session=sess)

xgb_model.set_hyperparameters(objective="reg:linear",
                              eta=.2,
                              gamma=4,
                              max_depth=5,
                              num_round=10,
                              subsample=0.7,
                              silent=0,
                              min_child_weight=6)
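
The fit call below references a data_channels dictionary that maps channel names to the S3 locations of the processed CSV files from Part I. A minimal sketch, using the SDK v1 s3_input helper (the S3 prefixes are placeholders):

from sagemaker.session import s3_input

# Channels pointing at the CSVs produced by the SparkML job in Part I
# (the S3 prefixes below are placeholders for your own locations)
train_data = s3_input('s3://{}/fraud-detection/train'.format(bucket),
                      content_type='text/csv', s3_data_type='S3Prefix',
                      distribution='FullyReplicated')
validation_data = s3_input('s3://{}/fraud-detection/validation'.format(bucket),
                           content_type='text/csv', s3_data_type='S3Prefix',
                           distribution='FullyReplicated')
data_channels = {'train': train_data, 'validation': validation_data}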

Finally, the XGBoost training job is launched:

xgb_model.fit(inputs=data_channels, logs=True)

2020-05-26 20:55:08 Starting - Starting the training job...
2020-05-26 20:55:11 Starting - Launching requested ML instances.........
2020-05-26 20:56:53 Starting - Preparing the instances for training......
2020-05-26 20:58:10 Downloading - Downloading input data...
2020-05-26 20:58:32 Training - Downloading the training image..
2020-05-26 20:59:04 Uploading - Uploading generated training model
2020-05-26 20:59:04 Completed - Training job completed
Arguments: train
[2020-05-26:20:58:52:INFO] Running standalone xgboost training.
[2020-05-26:20:58:52:INFO] File size need to be processed in the node: 66.51mb. Available memory size in the node: 8458.59mb
[2020-05-26:20:58:52:INFO] Determined delimiter of CSV input is ','
[20:58:52] S3DistributionType set as FullyReplicated
[20:58:52] 227737x20 matrix with 4554740 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-05-26:20:58:52:INFO] Determined delimiter of CSV input is ','
[20:58:52] S3DistributionType set as FullyReplicated
[20:58:52] 57070x20 matrix with 1141400 entries loaded from /opt/ml/input/data/validation?format=csv&label_column=0&delimiter=,
[20:58:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 20 pruned nodes, max_depth=5
[0]#011train-rmse:0.400208#011validation-rmse:0.400245
[20:58:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 20 pruned nodes, max_depth=4
[1]#011train-rmse:0.32045#011validation-rmse:0.320552
[20:58:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 14 pruned nodes, max_depth=5
[2]#011train-rmse:0.256653#011validation-rmse:0.256848
[20:58:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 26 pruned nodes, max_depth=2
[3]#011train-rmse:0.205756#011validation-rmse:0.206044
[20:58:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 16 pruned nodes, max_depth=5
[4]#011train-rmse:0.165056#011validation-rmse:0.165504
[20:58:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 28 pruned nodes, max_depth=1
[5]#011train-rmse:0.132704#011validation-rmse:0.133278
[20:58:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 34 pruned nodes, max_depth=3
[6]#011train-rmse:0.106905#011validation-rmse:0.10764
[20:58:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 40 pruned nodes, max_depth=1
[7]#011train-rmse:0.086489#011validation-rmse:0.087407
[20:58:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 28 pruned nodes, max_depth=2
[8]#011train-rmse:0.070302#011validation-rmse:0.071457
[20:58:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 36 pruned nodes, max_depth=0
[9]#011train-rmse:0.057721#011validation-rmse:0.059106
Training seconds: 54
Billable seconds: 54

Part III: Building an Inference Pipeline consisting of SparkML & XGBoost models for a realtime inference endpoint

OK, now that the model has been trained, we can deploy it to the cloud. However, for real-time prediction, we need to use an AWS Inference Pipeline to deploy the model on an endpoint.

According to AWS, an inference pipeline is an Amazon SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pre-trained Amazon SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.

Deploying a model in SageMaker requires two components:

  • Docker image residing in ECR
  • Model artifacts residing in S3

For SparkML, AWS provides a Docker image for MLeap-based SparkML serving. The SageMaker SparkML Serving Container lets you deploy an Apache Spark ML Pipeline in Amazon SageMaker for real-time and batch prediction, as well as inference pipeline use cases. The container can also be used to deploy a Spark ML Pipeline outside of SageMaker. It is powered by the open-source MLeap library.

For XGBoost, we will use the same Docker image we used for training. The model artifacts for XGBoost were uploaded as part of the training job we just ran.
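
For SparkML, the model artifact is the MLeap-serialized pipeline we saved in Part I. A sketch of how its S3 location might be referenced (the bucket and key prefix are placeholders for the ones used in Part I):

# Location of the MLeap-serialized SparkML pipeline from Part I
# (the bucket and key prefix are placeholders)
sparkml_data = 's3://{}/fraud-detection/sparkml/model.tar.gz'.format(bucket)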

Passing the schema of the payload via environment variable

The SparkML serving container needs to know the schema of the request that will be passed to it when the predict method is called. There are multiple ways to pass the schema: you can pass a JSON schema along with the data when you call the model for prediction, or, preferably, have sagemaker-sparkml-serving read it from an environment variable set when the model definition is created.

Below we use the latter approach.
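
As an illustration, here is roughly what the schema dict might look like. The column names below are hypothetical placeholders for the 19 anonymized features plus the transaction amount; they must match the columns your SparkML pipeline from Part I expects:

import json

# Hypothetical input columns; replace with the names your SparkML pipeline uses
schema = {
    "input": [{"name": "V{}".format(i), "type": "double"} for i in range(1, 20)]
             + [{"name": "amount", "type": "double"}],
    "output": {"name": "features", "type": "double", "struct": "vector"}
}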

schema_json = json.dumps(schema)

Creating a PipelineModel comprising the SparkML and XGBoost models in the right order

Next, we'll create a SageMaker PipelineModel with SparkML and XGBoost. The PipelineModel will ensure that both containers are deployed behind a single API endpoint, in the correct order.

Here, during the Model creation for SparkML, we will pass the schema definition that we built in the previous cell.

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sparkml.model import SparkMLModel

# Pass the schema to the SparkML container via an environment variable
sparkml_model = SparkMLModel(model_data=sparkml_data, env={'SAGEMAKER_SPARKML_SCHEMA': schema_json})
xgb_model = Model(model_data=xgb_model.model_data, image=training_image)

# timestamp_prefix is a timestamp string defined earlier in the notebook
model_name = 'inference-pipeline-' + timestamp_prefix
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])

Deploying the PipelineModel to an endpoint for realtime inference

It is quite amazing that we can deploy the model to an endpoint with just one line of code:

endpoint_name = 'inference-pipeline-ep-' + timestamp_prefix
sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)

You can monitor your endpoint activity in your AWS Console.
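
You can also check the endpoint status from the notebook with boto3; a small sketch using the standard describe_endpoint call:

import boto3

# The endpoint reports 'InService' once deployment has finished
sm_client = boto3.client('sagemaker')
status = sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print(status)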

OK, now let's run some tests.

Passing the payload in CSV format

from sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON
from sagemaker.predictor import RealTimePredictor, csv_serializer, json_serializer

payload = "-1.359807134,-0.072781173,2.536346738,1.378155224,-0.33832077,0.462387778,0.239598554,0.098697901,0.36378697,0.090794172,-0.551599533,-0.617800856,-0.991389847,-0.311169354,1.468176972,-0.470400525,0.207971242,0.02579058,0.40399296,149.62"
predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sess, serializer=csv_serializer,
                              content_type=CONTENT_TYPE_CSV, accept=CONTENT_TYPE_CSV)
print(predictor.predict(payload))

0.0539724230766

Passing the payload in JSON format

payload = {"data": [-1.359807134,-0.072781173,2.536346738,1.378155224,-0.33832077,0.462387778,0.239598554,0.098697901,0.36378697,0.090794172,-0.551599533,-0.617800856,-0.991389847,-0.311169354,1.468176972,-0.470400525,0.207971242,0.02579058,0.40399296,149.62]}
predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sess, serializer=json_serializer,
                              content_type=CONTENT_TYPE_JSON, accept=CONTENT_TYPE_CSV)
print(predictor.predict(payload))

0.0539724230766

Yes, we made it! The model is now live on an AWS endpoint. In the final article, we will see how a client application can call the endpoint for a prediction using AWS Lambda and API Gateway.
