☁️ Income Classification on AWS

Using Amazon SageMaker XGBoost Container as a Built-in Algorithm

Cansu Ergün
HYPATAI
7 min read · Nov 7, 2020


A cloud sea picture from the Black Sea region of Turkey. Photo credit: Me 👩🏻

Recently I spent some time familiarizing myself with some AWS services and decided to give SageMaker a try. To do that, I picked the task from Implementation of XGBoost on Income Data and moved what I did in that story to a new notebook instance I created in Amazon SageMaker. Let’s see what I have created, going through the notebook on my AWS console step by step. The whole notebook, SageMaker_IncomePrediction_XGBoost.ipynb, can be found on my GitHub page.

First, let’s pick an EC2 instance that runs the Jupyter notebook. The instance type ml.t2.medium is the least expensive option in my AWS region, so I picked that one.

My Notebook Instance

By creating an IAM role, we can control the notebook’s access to the Amazon S3 buckets we will be using. More info on IAM roles can be found here.

Identity Access Management Role Creation

The s3 bucket that stores our input and output data for this story will be hypatai-income-bucket. The objects are public since we will basically be using the same input data we used in the story Implementation of XGBoost on Income Data, which is already publicly available on my GitHub page here.

hypatai-income-bucket in s3

In addition to the preprocessed data from my GitHub page, I also uploaded the XGBoost model we created in the story Implementation of XGBoost on Income Data. We will also create the sagemaker/ prefix inside the bucket to hold the data and the XGBoost model that we will produce in the new SageMaker notebook instance.

Objects in hypatai-income-bucket

Now we are ready to go to our Jupyter notebook.

After getting the role that will execute our model training, as we will see in the screenshots below, let’s specify the s3 bucket and prefix to be used. To get my notebook into a shareable state, I ran several training jobs, since I am new to the SageMaker service, and continued after several errors, which left me with a number of different output versions such as multiple model objects and prediction files. In a real data science project it is better to keep each version of a training output; however, for demonstration purposes in this story, I start the notebook by deleting the existing versions under the sagemaker/ prefix.
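A minimal sketch of how this setup and cleanup might look in the notebook, assuming the bucket and prefix names shown in the screenshots; the cleanup call that wipes the sagemaker/ prefix is my own illustration of the step described above:

```python
import boto3
from sagemaker import get_execution_role

# IAM role attached to this notebook instance (grants access to the s3 bucket)
role = get_execution_role()

# Bucket and prefix used throughout the story
bucket = "hypatai-income-bucket"
prefix = "sagemaker"

# For demonstration purposes, remove any outputs left over from earlier runs
s3 = boto3.resource("s3")
s3.Bucket(bucket).objects.filter(Prefix=f"{prefix}/").delete()
```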

I created a function that reads objects from s3, checking their file type, using the boto3 client.

However, it is easier to read the data when we provide a direct s3 path.
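Below is a hedged sketch of both approaches: a small helper that routes the read according to the object’s extension via the boto3 client, and the simpler direct-path read with pandas (which needs s3fs installed). The file name train.csv is a placeholder for the preprocessed data, not necessarily the actual key used in the notebook:

```python
import pickle
from io import BytesIO

import boto3
import pandas as pd

s3_client = boto3.client("s3")

def read_s3_object(bucket, key):
    """Read an object from s3, handling it according to its file extension."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    if key.endswith(".csv"):
        return pd.read_csv(BytesIO(body))
    if key.endswith(".pkl"):
        return pickle.loads(body)
    raise ValueError(f"Unsupported file type for key: {key}")

# The simpler alternative: pandas can read a direct s3 path (requires s3fs);
# "train.csv" is a placeholder file name for the preprocessed data
train_df = pd.read_csv("s3://hypatai-income-bucket/train.csv")
```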

Reading the model we created in Implementation of XGBoost on Income Data helps us recall the hyperparameters and the random state we used before. Remember, this notebook is just a copy of the notebook from the previous story, with the same data, algorithm, and hyperparameters. The only difference is that we use SageMaker’s built-in XGBoost algorithm and work on the AWS console. Therefore we also expect to see the same performance metrics and the same predictions.
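A possible way to inspect the previous model, assuming it was uploaded to the bucket as a pickled XGBClassifier; the object key xgb_income_model.pkl is a placeholder:

```python
import pickle

import boto3

s3_client = boto3.client("s3")

# Placeholder key for the model object uploaded from the previous story
body = s3_client.get_object(Bucket="hypatai-income-bucket",
                            Key="xgb_income_model.pkl")["Body"].read()
previous_model = pickle.loads(body)

# Recall the hyperparameters and random state used before
print(previous_model.get_xgb_params())
```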

Let’s get the XGBoost algorithm container. I used the container with framework version 1.0-1, which is the latest version and brings improved flexibility, scalability, extensibility, and Managed Spot Training.
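One way to retrieve the 1.0-1 container image, sketched for the SageMaker Python SDK v2 (the older SDK exposes the same thing through get_image_uri):

```python
import boto3
from sagemaker import image_uris

region = boto3.Session().region_name

# SDK v2 style; in SDK v1 the equivalent call is
# sagemaker.amazon.amazon_estimator.get_image_uri(region, "xgboost", repo_version="1.0-1")
container = image_uris.retrieve(framework="xgboost", region=region, version="1.0-1")
print(container)
```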

It is also possible to fully customize the training script with this new version; however, as stated here, “Amazon SageMaker provides XGBoost as a built-in algorithm that you can use like other built-in algorithms. Using the built-in algorithm version of XGBoost is simpler than using the open-source version because you don’t have to write a training script.” Therefore I continue with the built-in version to try the script-free alternative, rather than the open-source (script mode) version, which Amazon SageMaker also supports.

Before starting the training job, let's put our target column as the first column in our training data, and get the validation set.
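A sketch of this preparation step, assuming the preprocessed data sits in a single dataframe train_df with a label column named target, and that the validation set is carved out with an 80/20 split; the column name, split ratio, and random state are placeholders:

```python
from sklearn.model_selection import train_test_split

# Move the (assumed) label column "target" to the front, as the built-in
# XGBoost algorithm expects the label in the first column
cols = ["target"] + [c for c in train_df.columns if c != "target"]
data = train_df[cols]

# Hold out a validation set (split ratio and random state are illustrative)
train_data, validation_data = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["target"]
)
```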

Let’s save the prepared training and validation data to s3 under the sagemaker/ prefix, without headers and indices, in csv format, since this is the format SageMaker’s XGBoost algorithm expects. As explained here, version 1.0-1 also supports the parquet format; however, since we are dealing with very small data in this example, I continued with csv. Let’s also convert them into s3_input objects.
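Roughly, the save-and-upload step could look like the following; the local file names and channel prefixes are illustrative, and TrainingInput is the SDK v2 name for what v1 called s3_input:

```python
import boto3
from sagemaker.inputs import TrainingInput

bucket = "hypatai-income-bucket"
prefix = "sagemaker"

# csv without header or index, label already in the first column
train_data.to_csv("train.csv", header=False, index=False)
validation_data.to_csv("validation.csv", header=False, index=False)

s3_client = boto3.client("s3")
s3_client.upload_file("train.csv", bucket, f"{prefix}/train/train.csv")
s3_client.upload_file("validation.csv", bucket, f"{prefix}/validation/validation.csv")

# Wrap the s3 locations as training channels
s3_input_train = TrainingInput(s3_data=f"s3://{bucket}/{prefix}/train",
                               content_type="csv")
s3_input_validation = TrainingInput(s3_data=f"s3://{bucket}/{prefix}/validation",
                                    content_type="csv")
```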

The ml instance I used is ml.m4.xlarge (for pricing see here). I also provided the container, the IAM role, and the output s3 path of the XGBoost model to be created. The path starts with sagemaker/model-xgboost-sagemaker/ to keep our s3 bucket better organized.
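A sketch of the estimator definition and training call; the hyperparameter values below are placeholders, whereas in the actual notebook they match the model from the previous story:

```python
import sagemaker
from sagemaker.estimator import Estimator

sess = sagemaker.Session()

xgb = Estimator(
    image_uri=container,                      # XGBoost 1.0-1 container retrieved above
    role=role,                                # IAM execution role
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path=f"s3://{bucket}/{prefix}/model-xgboost-sagemaker",
    sagemaker_session=sess,
)

# Placeholder hyperparameters -- in the story these match the previous model
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    max_depth=5,
    eta=0.1,
    subsample=0.8,
    num_round=200,
    seed=42,
)

xgb.fit({"train": s3_input_train, "validation": s3_input_validation})
```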

The end of our training is shown below. It seems we get the same AUC values at the same iterations that we got in Implementation of XGBoost on Income Data.

This is how our hypatai-income-bucket in s3 looks right now. We will also save the model name, including its version, so that we can later store the predictions under the corresponding model version.
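One simple way to capture that name is to parse it out of the artifact path the estimator reports (my own shortcut, not necessarily the one used in the notebook):

```python
# The artifact path looks like .../<training-job-name>/output/model.tar.gz,
# so the training job name serves as the model "version"
model_name = xgb.model_data.split("/")[-3]
print(model_name)
```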

In order to use the model we created, let’s download the tar file from s3 and extract it.
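A hedged sketch of the download-and-load step; the 1.0-1 built-in container stores the model inside model.tar.gz as a pickled Booster named xgboost-model:

```python
import pickle
import tarfile

import boto3

s3_client = boto3.client("s3")

# Download the artifact produced by the training job and unpack it locally
artifact_key = xgb.model_data.replace(f"s3://{bucket}/", "")
s3_client.download_file(bucket, artifact_key, "model.tar.gz")

with tarfile.open("model.tar.gz") as tar:
    tar.extractall()

# Load the pickled Booster saved by the built-in XGBoost 1.0-1 container
with open("xgboost-model", "rb") as f:
    booster = pickle.load(f)
```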

Since SageMaker XGBoost currently does not provide an interface to retrieve feature importance directly, we first need to get the feature importances from the get_score() method. The method, however, returns feature names in the ‘fX’ (X: number) format, so we need to map them back to the corresponding feature names from our original train set.
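The mapping could be done roughly as follows; target is again my placeholder for the label column name:

```python
import pandas as pd

# get_score() returns importances keyed by "f0", "f1", ... in the order of the
# training columns (label column excluded)
importance = booster.get_score(importance_type="weight")

# Map "fX" back to the original feature names of the train set
feature_names = [c for c in train_data.columns if c != "target"]
name_map = {f"f{i}": name for i, name in enumerate(feature_names)}

importance_df = (
    pd.DataFrame([(name_map[k], v) for k, v in importance.items()],
                 columns=["feature", "importance"])
    .sort_values("importance", ascending=False)
)

print(importance_df.head(10))   # top ten features
```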

As expected, we get the same features in our top ten list that we got in the previous story.

Let’s deploy our model to a hosted endpoint to make predictions. The same instance type is used for deployment. Let’s also define the serializer so that we can pass our validation data, as a NumPy array, to the model behind the endpoint.
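In SDK v2 terms, the deployment might look like this, with CSVSerializer turning the NumPy rows into the csv payload the endpoint expects:

```python
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a real-time endpoint on the same instance type
xgb_predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    serializer=CSVSerializer(),
)
```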

I also created an endpoint from the configuration corresponding to the model version we have.

Now it is time to make a POST request to our created endpoint to get the predictions for the validation set. It is also necessary to convert the predictions into an array with the correct data type in order to use them in our plot_roc_curve function.
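A sketch of how the validation rows could be sent to the endpoint in mini-batches and parsed back into a float array; the batch size and the target column name are assumptions:

```python
import numpy as np

def predict_in_batches(predictor, data, batch_size=500):
    """Send rows to the endpoint in mini-batches and collect the scores."""
    chunks = np.array_split(data, max(1, len(data) // batch_size))
    raw_scores = []
    for chunk in chunks:
        # the endpoint returns the scores as plain text
        response = predictor.predict(chunk).decode("utf-8")
        raw_scores.append(response.strip().replace("\n", ","))
    return np.array(",".join(raw_scores).split(","), dtype=float)

# Drop the (assumed) "target" label column before sending the features
val_features = validation_data.drop("target", axis=1).values
val_preds = predict_in_batches(xgb_predictor, val_features)
```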

Let’s define the plot function for the ROC curve.
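A minimal version of such a function, using scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

def plot_roc_curve(y_true, y_score):
    """Plot the ROC curve and report the AUC for the given scores."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(6, 6))
    plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.4f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend(loc="lower right")
    plt.show()

plot_roc_curve(validation_data["target"], val_preds)
```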

Again, as expected, we get the same results we had in our previous Hypatai story. Let’s convert the prediction array of the validation set into a dataframe and save it under the prediction prefix related to the current model version in s3.
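A possible way to write the predictions back to s3 under a model-version-specific prefix; the file name and key layout are illustrative:

```python
import boto3
import pandas as pd

preds_df = pd.DataFrame({"prediction": val_preds})
preds_df.to_csv("validation_predictions.csv", index=False)

# Store the predictions under a prefix tied to the current model version
boto3.client("s3").upload_file(
    "validation_predictions.csv",
    bucket,
    f"{prefix}/preds-xgboost-sagemaker/{model_name}/validation_predictions.csv",
)
```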

Our hypatai-income-bucket in s3 now looks like the below. The path starts with sagemaker/preds-xgboost-sagemaker/ to keep our bucket organized. Let’s also delete the endpoint we created, to avoid unnecessary costs, since this endpoint was created only for this story.
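Tearing the endpoint down is a one-liner on the predictor object:

```python
# Remove the hosted endpoint so it stops incurring charges;
# in SageMaker SDK v2 this also removes the endpoint configuration by default
xgb_predictor.delete_endpoint()
```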

Note: I spent $0.21 on AWS for the creation of this notebook 👻

Wish you more clouds ☁️ ☁️ from Hypatai !! ⛅️
