Tuning a Random Forest Classifier

Thomas Plapinger
6 min read · Aug 12, 2017


One of the most useful models I have come across in my brief time as a Data Scientist is Random Forests. What are Random Forests, you ask? They are an ensemble method for classification and regression that works by building multiple decision trees and taking the mode of the predicted classes in classification and the mean prediction in regression. What makes them so great is that they correct the overfitting of a single decision tree by using Bagging, also known as Bootstrap Aggregating. With Bagging, the model repeatedly draws a random sample, with replacement, from your training data and fits a tree to it. This means that each tree within a Random Forest, while sharing the same original data source, is built on a randomly selected sample of that data, and the end result is an aggregate of those trees. You may be asking what the difference is, then, between Bagged Decision Trees and Random Forests. At each split in its trees a Random Forest considers only a random subset of the features, whereas a Bagged Decision Tree also uses Bootstrap Aggregating but considers the same full set of features for all of its trees.

In Python there are two Random Forest models, RandomForestClassifier() and RandomForestRegressor(). Both come from the sklearn.ensemble library. This article will focus on the classifier.

First, to make your life easier you should import the classifier. The import from the sklearn.ensemble library and the classifier without any parameters look like this:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
# later in your notebook it will be easier to simply refer to it as model, or whatever name you deem appropriate

While you could simply put that in and fit the model to your X, y variables using .fit(X, y), the classifier will perform much better if you use its many different parameters. It is important to know that most, if not all, of the parameters have default settings, but your model will likely benefit in some shape or form from tuning them.
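As a minimal sketch (X and y here are hypothetical placeholders for your own feature matrix and labels), instantiating the classifier with a few parameters set and fitting it looks something like this:

from sklearn.ensemble import RandomForestClassifier

# a minimal sketch; X (features) and y (labels) are assumed to already exist
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X, y)
predictions = model.predict(X)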

As the title of this post suggests, I will now go through each of these parameters, explaining what they are and what they do. For ease of searching for a precise parameter, the order in which I will present them is as follows:

Parameters:
n_estimators,
criterion,
max_features,
max_depth,
min_samples_split,
min_samples_leaf,
min_weight_fraction_leaf,
max_leaf_nodes,
min_impurity_decrease,
bootstrap,
oob_score,
n_jobs,
random_state,
verbose,
warm_start,
class_weight

PARAMETERS:

n_estimators-(integer)-Default=10

The number of trees you want to build within the Random Forest before aggregating the predictions. Higher is generally better, but it is important to know this is more computationally expensive and will make your code take longer to run. I suggest choosing n_estimators with your computer's processing speed (and your patience) in mind.
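One rough way to pick a value (a sketch with hypothetical X and y, using cross_val_score from sklearn.model_selection) is to score and time a few forest sizes and stop where the gain levels off:

import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# compare a few forest sizes; X and y are hypothetical and assumed to exist
for n in [10, 50, 100, 200]:
    start = time.time()
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(n, round(score, 4), round(time.time() - start, 2), "seconds")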

criterion-(string)-Default="gini"

Measures the quality of each split. It can either be “gini” or “entropy”. “gini” uses the Gini impurity while “entropy” makes the split based on the information gain.

max_features-(integer, float, string, or None)-Default="auto"

The maximum number of features considered when finding the best split. Raising it can improve the performance of the model, since each node of each tree now considers more options, but once again increasing the number of features will slow your training down. max_features is one of the trickier parameters because its meaning depends on the type you pass:

If it is an integer, that exact number of features is considered at each split. If it is a float, it is treated as a fraction, so int(max_features * n_features) features are considered at each split.

If it is set to auto or sqrt then it is set to the square root of the number of features (sqrt(n_features)).

If you set it to log2 it equals log2(n_features).

None, as you might expect, simply uses all of the features (n_features).
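To make those options concrete, here is a toy sketch (the feature count of 64 is made up) of how many features each setting works out to at every split:

import numpy as np

n_features = 64  # hypothetical number of columns in your feature matrix
print(int(0.5 * n_features))     # max_features=0.5 -> 32 features tried per split
print(int(np.sqrt(n_features)))  # max_features="sqrt" (or "auto") -> 8
print(int(np.log2(n_features)))  # max_features="log2" -> 6
print(n_features)                # max_features=None -> all 64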

max_depth-(integer or None)-Default=None

This selects how deep you want to grow your trees. Do you want to split once, twice, or, if you select None, keep going until all leaves are pure or contain fewer samples than min_samples_split? I suggest setting max_depth, because if you let the trees grow until the leaves are pure you risk overfitting your model.
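A quick way to see that risk for yourself (a sketch assuming a hypothetical train/test split, for example from train_test_split) is to compare train and test accuracy at a few depths; a large gap at the deeper settings is the overfitting showing up:

from sklearn.ensemble import RandomForestClassifier

# X_train, X_test, y_train, y_test are hypothetical (e.g. from train_test_split)
for depth in [2, 5, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))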

min_samples_split-(integer, float)-Default=2

Sets the minimum number of samples that must be present at a node in order for a split to occur. If it is a float, it is treated as a fraction, so ceil(min_samples_split * n_samples) samples are required.

min_samples_leaf-(integer, float)-Default=1

This parameter sets the minimum number of samples allowed in an end node of each decision tree. The end node is also known as a leaf.
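A small sketch of the two size constraints together (X and y are hypothetical): requiring at least 10 samples before a node may split and at least 5 samples in every leaf keeps the trees from chasing individual observations:

from sklearn.ensemble import RandomForestClassifier

# X and y are hypothetical; the 10 and 5 are illustrative values
model = RandomForestClassifier(min_samples_split=10, min_samples_leaf=5)
model.fit(X, y)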

min_weight_fraction_leaf-(float)-Default=0

This is quite similar to min_samples_leaf, but it uses a fraction of the total number of observations (weighted, if sample weights are provided) instead.

max_leaf_nodes-(integer, None)-Default=None

This parameter limits the number of leaf nodes. Trees are grown in a best-first fashion, where the best splits are chosen by relative reduction in impurity; if None, the number of leaf nodes is unlimited.

min_impurity_decrease-(float)-Default=0

“A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.” — From SkLearn Library

Note: the sklearn library favors this parameter over the deprecated min_impurity_split.
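To make the formula concrete, here is a toy calculation with made-up numbers: a node holding 40 of 100 total samples at impurity 0.5 splits into a left child of 30 samples at impurity 0.2 and a right child of 10 samples at impurity 0.1.

N, N_t = 100, 40            # total samples, samples at the current node
N_t_L, N_t_R = 30, 10       # samples in the left and right children
impurity = 0.5              # impurity of the current node
left_impurity, right_impurity = 0.2, 0.1

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)  # 0.13 -> the split happens only if min_impurity_decrease <= 0.13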

bootstrap-(boolean)-Default=True

Whether or not to bootstrap your samples when building trees. This is something you should rarely need to change from the default, as your model will generally perform better with bootstrapping.

oob_score-(boolean)-Default=False

This is a cross-validation method very similar to leave-one-out validation, where the estimated generalization performance of the model comes from data it was not trained on. However, oob_score is much faster, because it scores each observation using only the trees that did not use that observation for training, and then aggregates those scores.
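Using it is just a flag plus the fitted oob_score_ attribute; a minimal sketch with hypothetical X and y:

from sklearn.ensemble import RandomForestClassifier

# X and y are hypothetical; oob_score needs bootstrap=True (the default)
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)
print(model.oob_score_)  # out-of-bag estimate of the generalization accuracy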

n_jobs-(integer)-Default=1

This lets the computer know how many processors it is allowed to use. The default value of 1 means it can only use one processor. If you set it to -1, there is no restriction on how many processors the code can use, which will often lead to faster training.

random_state-(integer, RandomState instance, None)-Default=None

Since the bootstrapping generates random samples, results are otherwise hard to duplicate exactly. Setting this parameter (for example, to a fixed integer) makes it easy for others to replicate your results given the same training data and parameters.
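For example (a sketch with hypothetical X and y), two forests built with the same integer random_state on the same data produce identical predictions:

from sklearn.ensemble import RandomForestClassifier

# X and y are hypothetical; the same seed on the same data gives the same forest
model_a = RandomForestClassifier(random_state=42).fit(X, y)
model_b = RandomForestClassifier(random_state=42).fit(X, y)
print((model_a.predict(X) == model_b.predict(X)).all())  # True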

verbose-(integer)-Default=0

Verbose means that you are setting the logging output, which gives you constant updates about what the model is doing as it processes. This parameter sets the verbosity of the tree-building process. It is not always useful and may take up unnecessary space in your notebook.

warm_start-(boolean)-Default=False

At False it fits a whole new forest each time, whereas at True it reuses the solution of the previous fit and adds more estimators to it. It is mostly used with recursive feature selection: when you drop some features, other features gain in importance, and to "appreciate" that the trees must be refit. It is often used with backward elimination in regression models and not often in classification models.
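The usual pattern (a sketch with hypothetical X and y) is to fit once, raise n_estimators, and fit again, so the new trees are added to the existing forest rather than starting from scratch:

from sklearn.ensemble import RandomForestClassifier

# X and y are hypothetical
model = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=42)
model.fit(X, y)                # builds the first 50 trees

model.n_estimators = 100
model.fit(X, y)                # adds 50 more trees, reusing the previous 50
print(len(model.estimators_))  # 100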

class_weight-(dictionary, list of dictionaries, “balanced”)-Default=None

Weights the classes. If you do not set anything here it will assume that all classes have a weight of 1. If you have a multi-output problem, a list of dictionaries is used, in the same order as the columns of y.

When the “balanced” mode is used, the class weights are automatically adjusted to be “inversely proportional to class frequencies” in the data, using “n_samples / (n_classes * np.bincount(y))”.

The “balanced_subsample” mode is the same as “balanced”, except that the weights are computed from the bootstrap sample for every tree grown.
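As a sketch of the common ways to set it (the dictionary weights and the label array below are made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# explicit weights (the 1:5 ratio is made up): errors on class 1 count five times as much
model = RandomForestClassifier(class_weight={0: 1, 1: 5})

# or let sklearn derive the weights from the class frequencies
model = RandomForestClassifier(class_weight="balanced")

# what "balanced" computes, for a small hypothetical imbalanced label array
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(weights)  # [0.625 2.5] -> the rarer class gets the larger weight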

I hope you have found this article helpful. For further descriptions, examples, and next steps in tuning your Random Forest Classifier, I suggest the scikit-learn documentation page for RandomForestClassifier. If you want to see the parameters for the RandomForestRegressor model, its documentation page covers those as well.
