Diving Deep into Gradient Boosted Machines with H2O.ai

Saptarshi Dutta Gupta

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit here.

Authors: Roshni Shaik, Sameer Pasha, Saptarshi Dutta Gupta, Harikrishna Karthikeyan

Are you curious about what Ensemble Learning is? Fret not! You will be surprised to know that we have all used ensemble learning methods to make routine decisions without even realizing it. For instance, while writing this blog post on Gradient Boosted Machines, we constantly improved its contents based on the collective feedback we received from multiple people. This is exactly what an ensemble learning method does. Voilà, now you know!

In Machine Learning and Statistics, ensemble methods combine the predictions of multiple learning algorithms to obtain better predictive performance than any of the constituent algorithms could achieve alone. They can be applied to complex, convoluted problems where decision making is difficult.

From weak learners to strong learners: here’s how!

Before we get too technical, let us consider an example of classifying pictures of motorcycles and bicycles to get a better understanding of what we mean by weak learners.

The various rules for distinguishing between the two classes can be as follows:

  • The image has wide wheels (Class Motorcycle)
  • The image has a fuel tank (Class Motorcycle)
  • The image has an engine/motor (Class Motorcycle)
  • The image has pedals (Class Bicycle)
  • The image has an exhaust (Class Motorcycle)
  • The image has a visible chain and gears (Class Bicycle)
  • The image has a single seat (Class Bicycle)

These rules are individually called weak learners because they are not independently strong enough to give an accurate prediction. The solution is to combine all these rules to improve the accuracy of the prediction. One way to make a strong prediction from weak learners is to use the majority rule or weighted averages.

Let’s say our test image has an object containing wide wheels (Class Motorcycle), a fuel tank (Class Motorcycle) and a single seat (Class Bicycle). In this case, by the majority rule, the object is classified as a Motorcycle, since two of the three weak learners vote for that class.
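The snippet below is a toy sketch of that majority vote in Python. The boolean image features and the classify helper are purely illustrative; they just mirror the rules listed above.

# Toy illustration of the majority rule: each "weak learner" checks one cue
# and casts a vote; the ensemble picks the most common vote.
from collections import Counter

def classify(image):
    votes = [
        "Motorcycle" if image["wide_wheels"] else "Bicycle",
        "Motorcycle" if image["fuel_tank"] else "Bicycle",
        "Bicycle" if image["single_seat"] else "Motorcycle",
    ]
    return Counter(votes).most_common(1)[0][0]

test_image = {"wide_wheels": True, "fuel_tank": True, "single_seat": True}
print(classify(test_image))  # -> Motorcycle (two votes against one)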

Ensemble techniques broadly fall into two categories:

  • Bagging (Parallel Ensemble) — Weak learners are produced in parallel during the training phase. Each model is trained on a bootstrapped dataset — a randomly drawn subset of the initial dataset. An example of bagging is Random Forest, where the learners vote with equal weight.
  • Boosting (Sequential Ensemble) — Weak learners are produced sequentially during the training phase. The learners do not vote with equal weight in the case of boosting, as explained in the sections below. For example, decision trees can use boosting to improve their performance.
Bagging vs Boosting
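To make the contrast concrete, here is a minimal scikit-learn sketch on a synthetic dataset (the dataset, tree counts and train/test split are illustrative): a Random Forest grows its trees independently on bootstrapped samples, while gradient boosting grows them one after another.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: independent trees on bootstrapped samples, equal-weight voting.
bagging = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Boosting: trees built sequentially, each one focusing on the current errors.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("bagging accuracy: ", bagging.score(X_test, y_test))
print("boosting accuracy:", boosting.score(X_test, y_test))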

Adaptive boosting (AdaBoost)

AdaBoost, designed in 1995, was the first practical boosting algorithm. It places more emphasis on misclassified data points by assigning them higher weights. It is important to note that AdaBoost is sensitive to noisy data and outliers.

Let’s look at how the algorithm works:

  1. Assign an equal weight to every data point and fit a decision stump on a single input feature.
  2. Analyze the results and increase the weights for misclassified data points.
  3. Draw a new decision stump taking into account the updated weights.
  4. Repeat steps 2 and 3 until the samples are correctly classified or a preset number of weak learners has been trained (see the sketch after the figure below).
Constructing a strong classifier from a set of weighted weak classifiers
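Below is a minimal sketch of this loop in Python, using scikit-learn decision stumps on a synthetic dataset. The dataset, the number of boosting rounds and the {-1, +1} label encoding are illustrative choices, not part of the original post.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem with labels encoded as -1 / +1.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y = np.where(y == 0, -1, 1)

n_rounds = 20
weights = np.full(len(X), 1.0 / len(X))      # step 1: equal weight for every sample
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)   # step 3: stump fit on the weighted data
    pred = stump.predict(X)
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)   # weighted error
    alpha = 0.5 * np.log((1 - err) / err)    # this learner's voting weight
    weights *= np.exp(-alpha * y * pred)     # step 2: boost the misclassified points
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final strong classifier: a weighted vote of all the stumps.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))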

Gradient Boosting

Gradient Boosting is a sequential ensemble learning technique that produces a model in a stage-wise fashion by optimizing a loss function. In a Random Forest classifier, the decision trees are trained in parallel on different parts of the dataset and the final prediction is the one made by the majority of the trees. In the Gradient Boosting algorithm, however, the weak learners are built and trained sequentially, so that each new base learner improves on the ensemble built before it.

To begin with, all the training samples are assigned the same initial weights. Next, a weak base learner, say a Decision Tree T1, is built on a random split of the training data (a bootstrapped dataset). The model then predicts the output for all the training data and the error is computed using the loss function. The loss is then reduced by taking a gradient-descent step: a new Decision Tree T2 is fit to the gradients of the loss and added on top of T1, with its contribution scaled by a step size α chosen to minimize the loss Lₙ, as shown in the formulae below. This process is repeated until an acceptable loss is attained.
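The formulae in the original post were an image; a standard way to write this stage-wise update (our reconstruction, with L denoting the loss, hₘ the new tree and α the step size referred to above) is:

% Stage m adds a new tree h_m, scaled by a step size \alpha_m, to the current model F_{m-1}.
F_m(x) = F_{m-1}(x) + \alpha_m \, h_m(x)

% h_m is fit to the negative gradients (pseudo-residuals) of the loss at the current model.
r_{i,m} = -\,\frac{\partial L\big(y_i, F_{m-1}(x_i)\big)}{\partial F_{m-1}(x_i)}

% The step size is chosen to minimize the loss after the update.
\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \alpha\, h_m(x_i)\big)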

In conclusion, the weak learners are used collectively for making predictions on the test data and the combined output is generated as the final prediction.

eXtreme Gradient Boosting (XGBoost)

XGBoost is an advanced version of the gradient boosting method, designed with a focus on computational speed and model efficiency. Because the Gradient Boosting algorithm learns sequentially, it can be slow to train. XGBoost was introduced to improve on this: it parallelizes tree construction across CPU threads, implements distributed computing methods to build large and complex models, and employs cache optimization techniques to make the best use of hardware resources.

Model parameters: Some of the parameters that affect model performance are max_depth, num_parallel_tree, predictor, etc. The maximum depth of the tree is set by the max_depth parameter; increasing this value makes the model more complex and more prone to overfitting. It is important to note that XGBoost aggressively consumes memory when training a deep tree. The num_parallel_tree parameter specifies the number of trees grown in parallel in each boosting round. Finally, the predictor parameter allows the user to choose between the CPU and the GPU for carrying out prediction.
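The sketch below shows how these parameters are typically passed to XGBoost's Python API. The dataset, boosting-round count and parameter values are illustrative; the commented-out predictor setting applies to XGBoost versions that expose it.

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 6,              # deeper trees: more complex model, more memory, higher overfitting risk
    "num_parallel_tree": 1,      # >1 grows several trees per boosting round (a boosted random forest)
    "tree_method": "hist",       # "gpu_hist" moves tree construction to the GPU
    # "predictor": "cpu_predictor",   # or "gpu_predictor" to run prediction on the GPU
}

booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.eval(dtrain))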

XGBoost-based Flight Delay Prediction Model with H2O.ai

What is H2O?

H2O is an open-source, distributed, in-memory machine learning platform created by H2O.ai. It supports some of the most widely used machine learning algorithms, including generalized linear models, gradient boosted machines and the much-hyped AutoML. With some of the top companies, including PayPal and Cisco, moving towards H2O Driverless AI, let us delve deeper into what this framework has to offer.

Why H2O when we have packages like scikit-learn already?

  • Powerful: H2O does really well on “big” datasets in a distributed manner. ML models can be executed seamlessly on Hadoop, Spark or any other cluster-computing framework.
  • Memory efficiency: H2O data frames have a smaller memory footprint in comparison to Pandas data frames. The training time of the models is much faster than some of its competitors like scikit-learn.
Working of H2O on a Spark cluster
  • Cross-platform support: H2O makes it really easy to incorporate machine learning models into an existing non-Python product, for instance one written in Java. To achieve the same with sklearn, we would need to write a wrapper.
  • Ease of use in tree-based models: For boosted trees in particular, scikit-learn does not automatically one-hot encode categorical variables, whereas H2O handles categorical features implicitly.

Of course, we still love and will keep using sklearn alongside H2O for experimental or complicated data science tasks. However, H2O makes it much easier to scale and productionize data mining applications.
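H2O also has a Python API that mirrors the Flow workflow used in the rest of this post. The sketch below is a minimal, hedged example: the file path and the column name are placeholders, not part of the original post.

import h2o

h2o.init()  # start (or connect to) a local H2O cluster

# Hypothetical local CSV; H2O parses it into a distributed H2OFrame held in the cluster's memory.
flights = h2o.import_file("flights.csv")

# Categorical columns are stored as enum factors, so tree-based models such as GBM and XGBoost
# can consume them directly, with no manual one-hot encoding.
flights["UniqueCarrier"] = flights["UniqueCarrier"].asfactor()  # illustrative column name
print(flights.types)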

Let’s Flow into H2O: Introducing the H2O Flow Framework

We are going to show how easy it is to use H2O Flow to build an Airline Delay prediction model from scratch. H2O Flow is a standalone interface for H2O which combines data parsing, preprocessing, exploratory data analysis, feature selection and training of the models all under one roof. You name it, H2O Flow has got it!

Download the package from here and run the following commands.

cd ~/Downloads
unzip h2o-3.20.0.2.zip
cd h2o-3.20.0.2
java -jar h2o.jar

Navigate to http://localhost:54321 to open up the H2O Flow web GUI.
[If you don’t have Java on your system, follow this link.]

Download the flight delay data here.

Once you have downloaded the data, you can import it to the Flow framework and parse the data as shown below:

File Parsing using H2O Flow Framework

Next, we split the parsed data into three frames: training, validation and testing in the ratio 7:2:1 as demonstrated below:

Splitting the parsed data in the H2O Flow framework
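The same split can be produced from the Python API with split_frame; the remainder after the listed ratios (roughly 10% here) becomes the test frame. The flights frame is the one imported in the earlier sketch.

train, valid, test = flights.split_frame(ratios=[0.7, 0.2], seed=42)
print(train.nrow, valid.nrow, test.nrow)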
EDA of the feature ‘WeatherDelay’

Our main goal for this use case is to predict whether a flight will be delayed (on arrival or departure). For that purpose, we have made use of Flow’s EDA tools to identify the features that contribute most to our classifier’s predictions.

For instance, the feature WeatherDelay has around 79.68% missing values and can thus be disregarded. Similarly, by analyzing the other features, we have chosen 16 features that correlate strongly with the delay outcome and form a good feature set for our classification model.
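A quick way to check this kind of missingness from the Python API is shown below; the WeatherDelay column name comes from the flight data used above, and the ~80% figure is specific to this dataset.

# Fraction of missing values in WeatherDelay (≈0.80 for this dataset).
na_count = flights["WeatherDelay"].nacnt()[0]
print(na_count / flights.nrow)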

After feature engineering, we built our model using the XGBoost algorithm, fixing num_parallel_tree to 50 and max_depth to 6. The animation below shows the training process:

Building the XGBoost Model on H2O Flow framework
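For reference, here is a hedged sketch of an equivalent run from H2O's Python API, reusing the frames from the earlier sketches. It uses H2OXGBoostEstimator (which requires an H2O build with the XGBoost extension); the label column IsArrDelayed and the feature list are illustrative stand-ins for the 16 features chosen above, and we set ntrees and max_depth rather than reproducing every Flow option.

from h2o.estimators.xgboost import H2OXGBoostEstimator

# The label must be a factor for classification; "IsArrDelayed" is an illustrative column name.
train["IsArrDelayed"] = train["IsArrDelayed"].asfactor()
valid["IsArrDelayed"] = valid["IsArrDelayed"].asfactor()

# Drop the label and obvious leakage columns; adjust to your own feature selection.
features = [c for c in train.columns if c not in ("IsArrDelayed", "ArrDelay", "DepDelay")]

model = H2OXGBoostEstimator(ntrees=50, max_depth=6, seed=42)
model.train(x=features, y="IsArrDelayed", training_frame=train, validation_frame=valid)

print(model.model_performance(valid).auc())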

After training the XGBoost model, the visualizations below summarize our results:

Training and validation loss vs. number of trees (left); variable importance of the features (right)

From the above graphs, it can be inferred that as the number of trees increases, the training and validation loss decreases significantly and then remains roughly constant beyond a certain threshold (50 in our case). This indicates that adding weak learners improves the model’s accuracy only up to a point; increasing the number further would cause the model to overfit, as discussed before.

Secondly, from the variable importance plot, it can be observed that Departure Time, CRSDepTime and AirTime are the three main attributes used to make the key decisions when predicting the delay.

Training and validation metrics

Deploying the model on legacy Java systems by converting it into a POJO (Plain Old Java Object)

Converting the model into a POJO
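From the Python API, the trained model can be exported for Java deployment as sketched below. Depending on the H2O version and algorithm, the newer MOJO format may be required; recent H2O releases export XGBoost models as MOJOs rather than POJOs.

# Export the trained model as a self-contained artifact for Java scoring.
model.download_mojo(path="/tmp")           # portable MOJO (plus h2o-genmodel.jar for scoring)
# h2o.download_pojo(model, path="/tmp")    # POJO export, where the algorithm supports it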

Using the AutoML feature in the Flow framework, we can see that the XGBoost algorithm outperforms all the other model implementations.

AutoML on the Flight Prediction Dataset in the H2O Flow framework
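The same comparison can be run from the Python API with H2OAutoML, reusing the frames and feature list from the sketches above; the limits below (number of models, seed) are illustrative.

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=10, seed=42)
aml.train(x=features, y="IsArrDelayed", training_frame=train, validation_frame=valid)

# The leaderboard ranks all trained models by the default metric for the problem type.
print(aml.leaderboard.head())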

Thanks for reading this article! We hope you’ve got an in-depth understanding of gradient boosted machines with the H2O Flow framework. Looking forward to hearing your suggestions and feedback. Please feel free to connect with us on LinkedIn.

Cheers!


