Random Forest Classification with H2O [Python][for Beginners]

Roshan Alwis
Oct 27, 2016 · 3 min read
H2O is an open-source machine learning platform that lets you build models from your own data. This article shows how H2O can be applied to a simple classification problem.

Note: For this tutorial, you need to set up H2O in your Python environment.

To create a Random Forest classification model, instantiate H2ORandomForestEstimator, which gives you a model object.

Check whether it is possible to connect to an existing H2O instance. If that fails, attempt to create a local H2O instance at localhost:54321.
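As a minimal sketch of this step, assuming the h2o package is installed: h2o.init() tries to reach a running instance first and falls back to launching a local one. The helper name connect_to_h2o is hypothetical and used only for illustration.

```python
def connect_to_h2o(ip="localhost", port=54321):
    """Connect to an H2O instance, starting a local one if none is reachable.

    Hypothetical helper name; only h2o.init() itself comes from the H2O API.
    """
    # Local import keeps this sketch loadable even where h2o is not installed yet.
    import h2o

    # h2o.init() first tries to connect to an instance at ip:port.
    # If that fails, it attempts to start a local instance instead.
    h2o.init(ip=ip, port=port)
    return h2o
```

Calling connect_to_h2o() once at the start of the program should be enough; later steps reuse the same connection.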

Copy the iris.csv file into your project folder. This file contains the data required to train your model. You need to add headers to the data set manually.

Figure 1 : Adding headers to the data set
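If you would rather not edit the file by hand, the header row can also be added with a short script. This is only a sketch: the column names below are assumptions based on the four measurements and the class label that the data set contains, and add_headers is a hypothetical helper name.

```python
import csv

# Assumed column names; adjust them to match the headers you prefer.
HEADERS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]


def add_headers(src_path, dst_path, headers=HEADERS):
    """Copy a headerless CSV file to dst_path with a header row prepended."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(headers)
        for row in csv.reader(src):
            if row:  # skip blank lines, such as a trailing empty line
                writer.writerow(row)
```

For example, add_headers("iris.data", "iris.csv") would write a new iris.csv next to the raw file, assuming the raw file is named iris.data.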

Read the iris.csv file and load the data as an H2O frame.
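A minimal sketch of the loading step, assuming an H2O instance is already running and iris.csv sits in the working directory. The helper name load_iris is hypothetical; h2o.import_file is the H2O call that parses a file into an H2O frame.

```python
def load_iris(path="iris.csv"):
    """Load a CSV file into an H2OFrame (requires a running H2O instance)."""
    # Local import keeps this sketch loadable even where h2o is not installed yet.
    import h2o

    # import_file parses the file and uses the first row as column names
    # when it detects a header.
    return h2o.import_file(path=path)
```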

Based on the sepal length, sepal width, petal length and petal width, the task is to identify the class that each iris flower belongs to.

Define the training parameters: the input (predictor) columns and the target column.
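One way to sketch this, assuming the target column is named "class" (an assumption, since the header names are only shown in Figure 1): every other column becomes a predictor. The helper name choose_columns is hypothetical.

```python
def choose_columns(frame, target="class"):
    """Return (predictors, target), using every non-target column as a predictor."""
    predictors = [name for name in frame.columns if name != target]

    # H2O treats the problem as classification only when the target column
    # is categorical, so convert it explicitly to be safe.
    frame[target] = frame[target].asfactor()
    return predictors, target
```

With the four measurement columns and one class column, this should yield four predictors and "class" as the target.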

Split the data set into training and testing sets. The testing data helps you verify the validity of your model after creating it, and it also helps you detect whether the model has overfitted the training data.
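The split can be sketched with the frame's split_frame method. The 80/20 ratio and the seed value here are illustrative assumptions, not values taken from the article, and split_train_test is a hypothetical helper name.

```python
def split_train_test(frame, train_ratio=0.8, seed=42):
    """Split an H2OFrame into (train, test) frames."""
    # split_frame assigns rows at random, so the resulting sizes only
    # approximate the requested ratio. A fixed seed makes the split repeatable.
    train, test = frame.split_frame(ratios=[train_ratio], seed=seed)
    return train, test
```

Passing a different seed gives a different split, which is one reason results can change between runs.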

Define the model with required parameters and train it.
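A minimal sketch of defining and training the model, assuming you already have a training frame, a list of predictor column names and a target column name. The parameter values are illustrative assumptions, and train_random_forest is a hypothetical helper name.

```python
def train_random_forest(train_frame, predictors, target,
                        ntrees=50, max_depth=20, seed=42):
    """Create an H2ORandomForestEstimator and train it on train_frame."""
    # Local import keeps this sketch loadable even where h2o is not installed yet.
    from h2o.estimators.random_forest import H2ORandomForestEstimator

    # ntrees: number of trees in the forest; max_depth: depth limit per tree.
    model = H2ORandomForestEstimator(ntrees=ntrees, max_depth=max_depth, seed=seed)

    # x lists the input columns, y names the column to predict.
    model.train(x=predictors, y=target, training_frame=train_frame)
    return model
```

Fixing the seed should make the trained forest more repeatable between runs; leaving it to vary is fine for experimenting.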

Finally, it is time to see the performance of the model.
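As a sketch, the performance on the held-out data can be obtained with the model's model_performance method, assuming you have a trained model and a testing frame. The helper name evaluate is hypothetical.

```python
def evaluate(model, test_frame):
    """Score a trained H2O model on test_frame and print its confusion matrix."""
    performance = model.model_performance(test_data=test_frame)

    # For a multi-class problem, the confusion matrix shows how many flowers
    # of each actual class were assigned to each predicted class.
    print(performance.confusion_matrix())
    return performance
```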

Figure 2: Test Results

You can see that the model identifies the class of the iris flowers correctly, without any misprediction. When you run this program, the results can vary slightly, because the random forest algorithm uses randomly created trees for ensemble learning. In addition, H2O splits the data into training and testing sets randomly, which can change the data in each frame.

Full Project


