Random Forest and XGBoost on Amazon SageMaker and AWS Lambda

Step-by-Step process for implementing regression model using Random Forest and XGBoost on Amazon SageMaker and AWS Lambda Functions.

Sriramya Kannepalli
Analytics Vidhya
Apr 30, 2020


Introduction

I wrote this blog as part of my virtual talk on deploying ML models using Amazon SageMaker and Lambda functions for the Minneapolis chapter of Women in Machine Learning & Data Science (WiMLDS). So, here we go.

The best way to learn Amazon SageMaker is to create, train, and deploy a simple machine learning model on it. We will take a top-down approach: log in to the AWS Console, start a SageMaker notebook instance, understand decision trees (the building block of Random Forest and XGBoost), and then train models and deploy endpoints that we can invoke from AWS Lambda.

Let’s get started.

1. Log in to the AWS Management Console.

2. Search for Amazon SageMaker in ‘Find Services’ and open SageMaker dashboard.

3. Click on Notebook instances and Create Notebook instance.

4. Enter the Notebook instance name.

Select the notebook instance type ‘ml.t2.medium’ from the dropdown. We only plan to use this notebook instance as a development environment and will rely on on-demand instances for the heavy-lifting training and deployment jobs, i.e., we will assign an ‘ml.m4.xlarge’ instance in our training and deployment scripts. For information on other notebook instance types, please refer to Amazon SageMaker Pricing.

5. Grant permissions to the notebook instance through an IAM role, so that the necessary AWS resources can be accessed from the notebook without providing AWS credentials every time.

If you don’t have an IAM role in place, Amazon SageMaker will automatically create one for you with your permission.

6. Click on ‘Create Notebook Instance’.

7. It takes around 1–2 minutes for the status to change from ‘Pending’ to ‘InService’.

8. Now Click on ‘Open Jupyter’.

9. You can upload your own files from your local machine using ‘Upload’, just as you would in a normal Jupyter notebook interface. Remember that these files are saved on the current ‘ml.t2.medium’ notebook instance, and if you decide to delete the notebook instance after your work is done, you will lose the files too.

10. If you are new to SageMaker, you can always refer to the huge list of ‘SageMaker examples’ written by AWS SMEs as a starting point.

Now, moving on to regression with Random Forest and the Amazon SageMaker XGBoost algorithm. To do this, you need the following:

  1. A dataset. We will use the Kaggle dataset House Sales in King County, USA. It contains the sale prices of houses sold in King County (the Seattle area) between May 2014 and May 2015, and it’s a great dataset for evaluating simple regression models.
  2. An algorithm. We will use the Random Forest algorithm in scikit-learn and XGBoost Algorithm provided by Amazon SageMaker to train the model using the housing dataset and predict the prices.

You also need a few resources for storing your data and running the code in Amazon SageMaker:

  1. An Amazon Simple Storage Service (Amazon S3) bucket to store the training data and the model artifacts that Amazon SageMaker creates when it trains the model (don’t worry, we will assign this in our code below).
  2. An Amazon SageMaker notebook instance to prepare and process data and to train and deploy a machine learning model (We already started a notebook instance above)
  3. A Jupyter notebook to use with the notebook instance to prepare your training data and to train and deploy the model (if you are following along from the beginning, we already have our Jupyter notebook open).

We will be writing our code in Python 3.
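Inside the notebook, the first cell typically sets up the SageMaker session, execution role, and S3 location. Here is a minimal sketch (the key prefix is my own placeholder, not something from the original notebook):

```python
import sagemaker
from sagemaker import get_execution_role

# Create a SageMaker session and fetch the IAM role attached to this
# notebook instance (the role we granted in step 5 above).
session = sagemaker.Session()
role = get_execution_role()

# S3 bucket and key prefix for the training data and model artifacts;
# default_bucket() creates or reuses a bucket named after your account and region.
bucket = session.default_bucket()
prefix = 'sagemaker/kc-house-prices'  # illustrative prefix, replace with your own

print(role, bucket)
```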

Important: To train, deploy, and validate a model in Amazon SageMaker, you can use one of these methods:

  • Amazon SageMaker Python SDK.
  • AWS SDK for Python (Boto 3).

Amazon SageMaker Python SDK vs. AWS SDK for Python (Boto 3)

The Amazon SageMaker Python SDK abstracts away several implementation details and is easy to use. If you’re a first-time Amazon SageMaker user, AWS recommends that you use it to train, deploy, and validate the model.

On the other hand, Boto 3 is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3. Boto provides an easy-to-use, object-oriented API, as well as low-level access to AWS services.
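To make the difference concrete, here is a hedged sketch of invoking an already-deployed endpoint both ways. The endpoint name and the feature row are placeholders, and the SDK class shown is the v1-era RealTimePredictor (renamed Predictor in SDK v2):

```python
import boto3
from sagemaker.predictor import RealTimePredictor  # sagemaker.predictor.Predictor in SDK v2

# Low-level route with Boto 3: call the SageMaker runtime service directly.
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='house-price-endpoint',    # placeholder endpoint name
    ContentType='text/csv',
    Body='3,2,1800,5000,1,0,3,7,1990')      # one CSV row of features
print(response['Body'].read().decode('utf-8'))

# Higher-level route with the SageMaker Python SDK: a predictor object
# wraps the same runtime call behind a friendlier interface.
predictor = RealTimePredictor('house-price-endpoint', content_type='text/csv')
print(predictor.predict('3,2,1800,5000,1,0,3,7,1990'))
```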

Today we will learn how to create all of the resources that you need to train and deploy a model using the Amazon SageMaker Python SDK.

The steps include:

  1. Fetch the dataset.
  2. Explore and transform the training data so that it can be fed to Amazon SageMaker algorithms.
  3. Feature engineering and data visualization.
  4. Prepare the data.
  5. Data ingestion (a condensed sketch of steps 1–5 follows this list).
  6. Train a model.
  7. Launch a training job with the Python SDK.
  8. Deploy the model to Amazon SageMaker.
  9. Validate the model.
  10. Integrate Amazon SageMaker endpoints into internet-facing applications.
  11. Clean up.
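Here is the condensed sketch of steps 1–5 promised above, using the King County CSV from Kaggle. The feature subset is my own illustrative choice, and the upload cell reuses the session, bucket, and prefix from the setup sketch:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Steps 1-2: fetch and explore the data.
df = pd.read_csv('kc_house_data.csv')   # the Kaggle download
print(df.shape)
print(df[['price', 'bedrooms', 'bathrooms', 'sqft_living']].describe())

# Steps 3-4: a simple feature selection; 'price' is the regression target.
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
            'floors', 'waterfront', 'condition', 'grade', 'yr_built']
X, y = df[features], df['price']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 5 (data ingestion): the SageMaker XGBoost algorithm expects CSV
# input with the target column first and no header row.
pd.concat([y_train, X_train], axis=1).to_csv('train.csv', header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv('validation.csv', header=False, index=False)

# Upload the CSVs to S3 using the session and bucket from the setup cell.
train_s3 = session.upload_data('train.csv', bucket=bucket, key_prefix=prefix)
val_s3 = session.upload_data('validation.csv', bucket=bucket, key_prefix=prefix)
```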

Before we start working with the data let’s quickly understand —

What is a decision tree, and how do tree ensembles form the basis for Random Forest and XGBoost?

Let’s start with a decision tree :

Decision Tree

  • A decision tree builds regression or classification models in the form of a tree structure.
  • It is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous).

The process of repeatedly partitioning the data to obtain homogeneous groups is called recursive partitioning.

Step 1: Identify the binary question that splits data points into two groups that are most homogeneous.

Step 2: Repeat Step 1 for each leaf node, until a stopping criterion is reached.
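To see recursive partitioning in action, here is a depth-limited regression tree in scikit-learn, continuing with the training split from the sketch above (the depth of 3 is arbitrary):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Each internal node is a binary question on one feature (Step 1);
# max_depth acts as the stopping criterion (Step 2).
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Print the learned questions and the leaf values (mean sale prices).
print(export_text(tree, feature_names=features))
```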

Source: Diego Lopez Yse (Apr 17, 2019). Decision tree. Retrieved from Medium: https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14

Fable of blind men and elephant

Source: Jinde Shubham (Jul 3, 2018). Ensemble learning is Fable of blind men and elephant. Retrieved from Medium: https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e

The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. In the picture above, four blind men are trying to identify an elephant by touching its parts. Though each man’s impression is right from his own perspective, each is a weak learner in terms of predicting an elephant. When these weak learners pool their observations, they can identify the elephant, hence forming an ensemble.

Wisdom of the Crowd

“In an ensemble, predictions could be combined either by majority voting or by taking averages. Below is an illustration of how an ensemble formed by majority voting yields more accurate predictions than the individual models it is based on.”

Source: Annalyn Ng and Kenneth Soo (July 27, 2016). How a tree is created in a random forest. Retrieved from algobeans.com: https://algobeans.com/2016/07/27/decision-trees-tutorial/

Bagging and Boosting:

Source: Zulaikha Lateef (Jun 28, 2019). Bagging and Boosting. Retrieved from Edureka.co: https://www.edureka.co/blog/boosting-machine-learning/

Bagging:

“Bagging refers to non-sequential learning.

- For T rounds, a random subset of samples is drawn (with replacement) from the training sample.

- Each of these draws is independent of the previous round’s draw, but they have the same distribution.

- These randomly selected samples are then used to grow a decision tree (weak learner). The most popular class (or the average prediction value in the case of regression problems) is then chosen as the final prediction value.

The bagging approach is also called bootstrapping.”
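A minimal sketch of bagging with scikit-learn, whose BaggingRegressor uses a decision tree as its default base learner, matching the description above (T = 100 here is arbitrary):

```python
from sklearn.ensemble import BaggingRegressor

# T = 100 rounds: each tree is grown on a bootstrap sample drawn with
# replacement, and the final prediction is the average over all trees.
bagging = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print(bagging.score(X_val, y_val))  # R^2 on the validation split
```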

Boosting:

“Boosting describes the combination of many weak learners into one very accurate prediction algorithm.

- A weak learner refers to a learning algorithm that predicts only slightly better than random guessing.

- Looking at tree-based ensemble algorithms, a single decision tree would be the weak learner, and the combination of multiple of these would result in the AdaBoost algorithm, for example.

- The boosting approach is a sequential algorithm that makes predictions for T rounds on the entire training sample and iteratively improves the performance of the boosting algorithm with the information from the prior round’s prediction accuracy.”

Source: Julia Nikulski (Mar 16, 2020). Bagging and Boosting. Retrieved from Medium: https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f
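For comparison, a hedged AdaBoost sketch in scikit-learn, where shallow trees are fit sequentially and each round re-weights the examples the previous round predicted poorly (the depth and round count are my own choices):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Sequential learning: each of the T = 100 rounds fits a weak learner
# (a depth-3 tree here) and uses the previous round's errors to
# re-weight the training sample for the next round.
boosting = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                             n_estimators=100, random_state=42)
boosting.fit(X_train, y_train)
print(boosting.score(X_val, y_val))  # R^2 on the validation split
```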

Random Forest

Now, Random Forest is a combination of tree ensembles and bagging.

“ A random forest is an example of an ensemble, which is a combination of predictions from different models. It also uses bagging. Bagging is used to create thousands of decision trees with minimal correlation. In bagging, a random subset of the training data is selected to train each tree. Furthermore, the model randomly restricts the variables which may be used at the splits of each tree. Hence, the trees grown are dissimilar, but they still retain certain predictive power.”

Source: Annalyn Ng and Kenneth Soo (July 27, 2016). Wisdom of the crowd. Retrieved from algobeans.com: https://algobeans.com/2016/07/27/decision-trees-tutorial/

“In the above example, there are 9 variables represented by 9 colors. At each split, a subset of variables is randomly sampled from the original 9. Within this subset, the algorithm chooses the best variable for the split. The size of the subset was set to the square root of the original number of variables. Hence, in our example, this number is 3.”

Now, with this understanding, let’s move on to the Random Forest implementation on an Amazon SageMaker notebook instance. For this, you need to download the Jupyter notebook from here and the data from here. Upload them to your SageMaker notebook instance as explained above and follow along.
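If you want a feel for what that notebook does before downloading it, here is a condensed sketch (the hyperparameters are illustrative, not the notebook’s exact values); max_features='sqrt' mirrors the square-root rule described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Bagging plus a random subset of features at each split.
rf = RandomForestRegressor(n_estimators=200, max_features='sqrt',
                           n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

preds = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, preds))
print('Validation RMSE: {:,.0f}'.format(rmse))
```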

XGBoost Algorithm

XGBoost (eXtreme Gradient Boosting) was introduced by Chen & Guestrin in 2016.

It was developed mainly to increase speed and performance, while introducing regularization parameters to reduce overfitting.

  • To begin, let us look at the model choice of XGBoost: decision tree ensembles. The tree ensemble model consists of a set of classification and regression trees (CART). Here’s a simple example of a CART that classifies whether someone will like a hypothetical computer game X.
  • We classify the members of a family into different leaves, and assign them the score on the corresponding leaf. A CART is a bit different from decision trees, in which the leaf only contains decision values. In CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification. This also allows for a principled, unified approach to optimization.

So let’s get started with the XGBoost implementation on SageMaker.

Jupyter notebook for implementing XGBoost on an Amazon SageMaker notebook instance.
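In outline, that notebook launches a managed training job with SageMaker’s built-in XGBoost container and deploys the result. A hedged sketch using the v1-era SageMaker Python SDK, which matches the period of this post (the hyperparameters and the 0.90-1 container version are illustrative; SDK v2 renames several of these calls):

```python
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

# Built-in XGBoost container image for the notebook's region
# (in SDK v2 this becomes sagemaker.image_uris.retrieve).
container = get_image_uri(session.boto_region_name, 'xgboost', repo_version='0.90-1')

xgb = Estimator(
    container,
    role,                                # IAM role from the setup cell
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',  # heavy lifting runs here, not on ml.t2.medium
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=session)

xgb.set_hyperparameters(objective='reg:linear',  # regression on the sale price
                        num_round=100)

# Point the training job at the CSVs uploaded earlier (target first, no header).
train_input = s3_input(train_s3, content_type='text/csv')
val_input = s3_input(val_s3, content_type='text/csv')
xgb.fit({'train': train_input, 'validation': val_input})

# Deploy a real-time endpoint on an on-demand instance.
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
```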

Integrating Amazon SageMaker Endpoints into Internet-facing Applications

In a production environment, you might have an internet-facing application sending requests to the endpoint for inference. The following high-level example shows how to integrate your model endpoint into your application.

For an example of how to use Amazon API Gateway and AWS Lambda to set up and deploy a web service that you can call from a client application:

  • Create an IAM role that the AWS Lambda service principal can assume. Give the role permissions to call the Amazon SageMaker InvokeEndpoint API.
  • Create a Lambda function that calls the Amazon SageMaker InvokeEndpoint API (a sketch of such a function follows this list).
  • Call the Lambda function from a mobile application.
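Here is the hedged sketch of such a Lambda function; the environment variable name and the CSV payload format are my own assumptions:

```python
import os
import boto3

# SageMaker runtime client; created outside the handler so it is
# reused across warm Lambda invocations.
runtime = boto3.client('sagemaker-runtime')

# Endpoint name supplied through a Lambda environment variable
# (a placeholder; set it in the function's configuration).
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']

def lambda_handler(event, context):
    # Assumes API Gateway passes the feature values in the request
    # body as a CSV string, e.g. "3,2,1800,5000,1,0,3,7,1990".
    payload = event['body']

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='text/csv',
        Body=payload)

    prediction = response['Body'].read().decode('utf-8')
    return {'statusCode': 200, 'body': prediction}
```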

Starting from the client side, the request flows as follows (a client-side sketch follows this list):

  • A client script calls an Amazon API Gateway API action and passes parameter values.
  • API Gateway is a layer that provides an API to the client. In addition, it shields the backend so that AWS Lambda runs inside a protected private network.
  • API Gateway passes the parameter values to the Lambda function.
  • The Lambda function parses the value and sends it to the SageMaker model endpoint.
  • The model performs the prediction and returns the predicted value to AWS Lambda. The Lambda function parses the returned value and sends it back to API Gateway. API Gateway responds to the client with that value.
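From the client’s point of view, that whole chain reduces to one HTTPS call. A sketch with the requests library (the invoke URL is a placeholder for a hypothetical API Gateway stage):

```python
import requests

# Invoke URL of the deployed API Gateway stage (placeholder).
url = 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/predict'

# One row of house features as CSV, matching the training data layout.
response = requests.post(url, data='3,2,1800,5000,1,0,3,7,1990')
print('Predicted price:', response.text)
```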

But what is AWS Lambda?

  • AWS Lambda is a serverless compute service provided by Amazon as part of AWS that lets you run code without provisioning or managing servers.
  • It runs code in response to events and automatically manages the computing resources required by that code.

For integrating the endpoints created in this notebook with AWS Lambda, please read my blog Invoke an Amazon SageMaker endpoint using AWS Lambda.

Final words on Amazon SageMaker pricing:

Try Amazon SageMaker for two months, free!

As part of the AWS Free Tier, you can get started with Amazon SageMaker for free. If you have never used Amazon SageMaker before, for the first two months you are offered a monthly free tier of 250 hours of t2.medium or t3.medium notebook usage for building your models, plus 50 hours of m4.xlarge or m5.xlarge for training, plus 125 hours of m4.xlarge or m5.xlarge for deploying your machine learning models for real-time inference and batch transform with Amazon SageMaker. Your free tier starts in the first month in which you create your first SageMaker resource.

References

  1. Diego Lopez Yse (Apr 17, 2019). Decision tree. Retrieved from Medium: https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14
  2. Jinde Shubham (Jul 3, 2018). Ensemble learning is Fable of blind men and elephant. Retrieved from Medium: https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e
  3. Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. DOI: https://doi.org/10.1145/2939672.2939785
  4. Annalyn Ng and Kenneth Soo (July 27, 2016). How a tree is created in a random forest. Retrieved from algobeans.com: https://algobeans.com/2016/07/27/decision-trees-tutorial/
  5. AWS SageMaker screenshots and figures. Retrieved from Amazon Web Services, Inc.: https://docs.aws.amazon.com/sagemaker/#amazon-sagemaker-overview
