Random Forest Classification with H2O [Python][for Beginners]

H2O is an open-source machine learning platform for building models from your own data. This article shows how to apply H2O to a simple classification problem.

Note: For this tutorial, you need to set up H2O in your Python environment.

import h2o
from h2o.estimators import H2ORandomForestEstimator

H2ORandomForestEstimator is the class you instantiate to create a random forest classification model.

h2o.init()

h2o.init() first checks whether it can connect to an existing H2O instance. If that fails, it attempts to start a local H2O instance at localhost:54321.
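If you are curious whether anything is already listening at that address before calling h2o.init(), a plain TCP check gives a rough answer. This is only a sketch using Python's standard library: the helper name is_port_open is made up for this example, and the check is not how H2O itself decides whether to connect.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False

# 54321 is the local H2O port mentioned above
print(is_port_open("localhost", 54321))
```

An open port only tells you that some process is listening there. It does not confirm that the process is an H2O instance.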

Copy the ‘iris.csv’ file into your project folder. This file contains the data required to train your model. You need to add a header row to the data set manually, using the column names sepal_length, sepal_width, petal_length, petal_width and class.
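If you would rather not edit the file by hand, the header row can also be added with a few lines of Python. The sketch below uses only the standard library. The function name add_header is made up for this example, and it assumes the file is a plain comma-separated file with no header yet.

```python
import csv
from pathlib import Path

# Column names used in the rest of this tutorial
HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

def add_header(csv_path, header=HEADER):
    """Prepend a header row to a CSV file unless it is already there."""
    path = Path(csv_path)
    with path.open(newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # skip blank lines
    if rows and rows[0] == list(header):
        return False  # header already present, file left untouched
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True

# Usage (assumes iris.csv is in the current folder):
# add_header("iris.csv")
```

The function rewrites the file in place, so keep a copy of the original if you need it.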

Figure 1 : Adding headers to the data set
# Load data from CSV
data = h2o.import_file('iris.csv')

Read the iris.csv file and load the data as an H2O frame.

'''
Iris data set description
-------------------------

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   - Iris Setosa
   - Iris Versicolour
   - Iris Virginica

'''

The goal is to identify the class of each iris flower from its sepal length, sepal width, petal length and petal width.

# Input columns (features) used to train the model
training_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# Output column (target) the model learns to predict
response_column = 'class'

Define the input (feature) columns and the target column.

# Split data into training and test sets
train, test = data.split_frame(ratios=[0.8])

Split the data set into training and test sets. With ratios=[0.8], roughly 80% of the rows go to the training set and the rest to the test set. The test data lets you check how well the model performs on rows it did not see during training, which helps you detect overfitting.
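The split is random rather than an exact 80/20 cut, so the two frames will not always have the same sizes. The sketch below shows the general idea with plain Python: each row is sent to the training set with probability 0.8. It is only an illustration, not H2O's actual implementation, and the function name probabilistic_split is made up for this example.

```python
import random

def probabilistic_split(rows, ratio=0.8, seed=None):
    """Send each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < ratio:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(150))  # stand-in for the 150 rows of the iris data set
train, test = probabilistic_split(rows, ratio=0.8, seed=1)

# The sizes always add up to 150, but are only roughly 120 and 30
print(len(train), len(test))
```

Using the same seed gives the same split every time, which is handy when you want to compare runs.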

# Define model
model = H2ORandomForestEstimator(ntrees=50, max_depth=20, nfolds=10)

# Train model
model.train(x=training_columns, y=response_column, training_frame=train)

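Define the model with the required parameters and train it. Here ntrees=50 builds 50 trees, max_depth=20 limits how deep each tree can grow, and nfolds=10 enables 10-fold cross-validation on the training data.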

# Model performance
performance = model.model_performance(test_data=test)

print(performance)

Finally, it is time to see the performance of the model.

Figure 2: Test Results

You can see that the model identifies the class of every iris flower in the test set correctly, with no mispredictions. When you run this program yourself, the results can vary slightly, because the random forest algorithm builds its ensemble from randomly constructed trees. In addition, H2O splits the data randomly, so the rows that end up in the training and test frames can differ from run to run. Passing a fixed seed to both split_frame and H2ORandomForestEstimator makes runs more repeatable.
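To make "no misprediction" concrete, here is a small sketch of how misclassifications can be counted from actual and predicted labels, which is the idea behind a confusion matrix. The labels below are hypothetical and chosen only for illustration. They are not the results of this tutorial, and the helper names are made up for this example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Fraction of rows where the predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical labels for illustration only
actual    = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

counts = confusion_counts(actual, predicted)
print(counts[("virginica", "versicolor")])  # 1: one virginica predicted as versicolor
print(error_rate(actual, predicted))        # 0.2
```

A model with no mispredictions has an error rate of 0, and every count falls on a pair where the actual and predicted labels match.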
