Deploy Machine Learning Model using Amazon SageMaker (Part 2)
In this article, we will continue our series on deploying a machine learning model using Amazon SageMaker.
This is part of a series:
Set up SageMaker — Part 1
Data Preprocessing — Part 2 (You are here)
Train the Model — Part 3
We will use the adult census data, and the main focus of this article is to prepare the dataset for training by loading, exploring, and transforming it.
Steps
- Use the SHAP library to load and explore the dataset.
- Use the scikit-learn (sklearn) library to split the dataset into train, validation, and test datasets.
- Use the SageMaker and Boto3 libraries to upload the datasets to Amazon S3.
Now that we have everything in place, we can proceed.
Load the adult census data
Install the SHapley Additive exPlanations (SHAP) library
In this step, we will use the SHapley Additive exPlanations (SHAP) library to load the adult census data into the notebook instance.
First, we need to install the SHAP library because it is not preinstalled on the Jupyter kernel. We can install it by running the following command:
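The original install cell is not shown here; a minimal sketch, assuming a standard Jupyter notebook environment, would be:

```python
# Install the SHAP library into the active Jupyter kernel
%pip install shap
```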
Explore and Transform Dataset
After installing the SHAP library, we need to explore and transform the dataset. This is achieved by importing the SHAP library and the adult census dataset.
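A sketch of what that code cell likely looks like, assuming the dataset is loaded through shap.datasets.adult() (the exact cell in the original notebook may differ slightly):

```python
import shap

# Numeric-encoded features and labels, used for training
X, y = shap.datasets.adult()

# Human-readable (string-valued) versions of the same data, used for display
X_display, y_display = shap.datasets.adult(display=True)

# List the feature names of the dataset
feature_names = list(X.columns)
feature_names
```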
In the code above, we import the shap library and use its dataset loader to load the adult census dataset.
We create two pairs of variables: X and y hold the numeric-encoded features and labels used for training, while X_display and y_display hold the human-readable versions used for display.
We then print the feature names, a list of the 12 features in this dataset: Age, Workclass, Education-Num, Marital Status, Occupation, Relationship, Race, Sex, Capital Gain, Capital Loss, Hours per week, and Country.
We can also see a statistical overview of the dataset and histograms of the numeric features by running the following commands.
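A sketch of those commands; the figure size matches the 20 by 10 figure described below, while the bin count is an assumption:

```python
import matplotlib.pyplot as plt

# Statistical overview of the numeric features
display(X.describe())

# Histograms of the numeric features on a 20 x 10 inch figure
X.hist(bins=30, figsize=(20, 10))
plt.show()
```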
With a figure size of 20 by 10 inches, the histograms are easy to read and, together with the summary statistics, give a good overview of the entire dataset.
Splitting the Dataset
Using the scikit-learn library, we split the dataset into a training set (80%) and a test set (20%), using the random_state parameter to control the random shuffle.
Next, we carve a validation set out of the training set: 25% of the training data becomes the validation set, and the remaining 75% becomes the final training set.
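A minimal sketch of both splits using scikit-learn's train_test_split; the random_state value here is an assumption for reproducibility, not necessarily the one used in the original notebook:

```python
from sklearn.model_selection import train_test_split

# 80% training / 20% test split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Take 25% of the training set as a validation set;
# the remaining 75% stays as the final training set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1
)
```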
Next, we use the Pandas package, one of the most important packages in data science, to explicitly align each dataset by concatenating the numeric features with the true labels.
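A sketch of that concatenation step; the label column name Income>50K is an assumption, and the label is placed in the first column to match the CSV layout described further down:

```python
import pandas as pd

# Place the label in the first column, followed by the numeric features
train = pd.concat(
    [pd.Series(y_train, index=X_train.index, name="Income>50K", dtype=int), X_train],
    axis=1,
)
validation = pd.concat(
    [pd.Series(y_val, index=X_val.index, name="Income>50K", dtype=int), X_val],
    axis=1,
)
test = pd.concat(
    [pd.Series(y_test, index=X_test.index, name="Income>50K", dtype=int), X_test],
    axis=1,
)
```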
We can print the train dataset to check if the dataset is split and structured as expected.
…and print the validation dataset to see an overview of the validation dataset.
…and finally, print the test dataset to see an overview of it as well.
The number of rows and columns for each dataset is specified at the bottom of each of these outputs.
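In the notebook, those previews can be produced with something as simple as:

```python
# Display each split; the row/column counts appear at the bottom of each output
display(train)
display(validation)
display(test)
```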
Next, we need to convert the train and validation datasets to CSV files. This is important because we are going to use the XGBoost algorithm, which expects CSV as the input file format.
Note that the first column in these files is the output (label) column, which is the layout the SageMaker XGBoost algorithm expects.
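A sketch of the conversion, assuming the label-first DataFrames built above:

```python
# Write the splits as headerless, index-free CSV files,
# the input layout expected by the SageMaker XGBoost algorithm
train.to_csv("train.csv", index=False, header=False)
validation.to_csv("validation.csv", index=False, header=False)
```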
Upload the datasets to Amazon S3
We need to use the SageMaker and Boto3 libraries to upload the training and validation datasets to the default Amazon S3 bucket. The datasets in the S3 bucket will then be used by a compute-optimized SageMaker instance on Amazon EC2 for training.
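A sketch of the upload step; the folder name matches the one described below, while the exact upload calls are assumptions based on common SageMaker and Boto3 usage:

```python
import os
import boto3
import sagemaker

# Default S3 bucket for the current SageMaker session
bucket = sagemaker.Session().default_bucket()
prefix = "demo-sagemaker-xgboost-adult-income-prediction"

# Upload the CSV files to s3://<bucket>/<prefix>/data/
s3 = boto3.Session().resource("s3")
s3.Bucket(bucket).Object(os.path.join(prefix, "data/train.csv")).upload_file("train.csv")
s3.Bucket(bucket).Object(os.path.join(prefix, "data/validation.csv")).upload_file("validation.csv")
```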
The code above sets up the default S3 bucket URI for the current SageMaker session, creates a new demo-sagemaker-xgboost-adult-income-prediction folder, and then uploads the training and validation datasets to its data subfolder.
Check if the datasets have been uploaded to S3
We can check whether the CSV files have been successfully uploaded to the S3 bucket by running the following AWS CLI command.
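A sketch of that check, run from a notebook cell; the {bucket} and {prefix} variables are the ones defined in the upload step:

```python
# List everything under the data/ prefix to confirm both files are there
! aws s3 ls s3://{bucket}/{prefix}/data --recursive
```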
The output shows that the train and validation CSV files have been uploaded.
The next step is to train the model, which we will cover in Part 3 of this series.