Random Forest Made Easy

WHAT ARE RANDOM FORESTS?

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. Random forest is a general-purpose machine learning technique.

  • It can predict a target of almost any kind: a category (classification) or a continuous variable (regression).
  • It can learn from columns of almost any kind: pixels, zip codes, revenues, etc. (i.e. both structured and unstructured data).
  • It does not generally overfit too badly, and it is very easy to stop it from overfitting.
  • You do not need a separate validation set in general. It can tell you how well it generalizes even if you only have one dataset.
  • It has few, if any, statistical assumptions. It does not assume that your data is normally distributed, that the relationship is linear, or that you have specified interactions.
  • It requires very little feature engineering. In many situations you do not have to take the log of the data or multiply interactions together.

PRINCIPAL OBJECTIVE OF RANDOM FOREST

The principal objective of Random Forest is:

  • We have to build multiple trees (10, 20 or more) while making sure they are as uncorrelated as possible.
  • Why uncorrelated? Because when models are uncorrelated they draw different insights from the data, and when we ensemble them we get a great model.
  • In short, we have to make 10 or more crappy models and get a great model by ensembling them (a small bagging sketch follows below).
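To make this concrete, here is a minimal, hedged sketch of the bagging idea with scikit-learn: each tree is fit on a bootstrap sample of the rows and only considers a random subset of features at each split, so the trees come out fairly uncorrelated, and their predictions are averaged. The synthetic dataset from make_classification is only a placeholder.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(10):
    # Bootstrap sample: draw rows with replacement so each tree sees different data.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Ensemble by averaging each tree's predicted class probabilities.
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("Ensemble training accuracy:", (ensemble_pred == y).mean())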

Random Forests in scikit-learn

scikit-learn is the most popular and important package for machine learning in Python. It is not the best at everything (e.g. XGBoost usually beats its gradient boosted trees), but it is pretty good at nearly everything.

sklearn.ensemble.RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
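As a quick, hedged usage sketch (the data below is synthetic and only for illustration; in practice you would pass your own training set):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Validation accuracy:", clf.score(X_valid, y_valid))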

There are many parameters, but we only care about a few of them.

  • n_estimators : integer, optional (default=10). The number of trees in the forest.
  • max_depth : integer or None, optional (default=None). The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split : int or float, optional (default=2). The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf : int or float, optional (default=1). The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • min_weight_fraction_leaf : float, optional (default=0.). The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
  • max_features : int, float, string or None, optional (default="auto"). The number of features to consider when looking for the best split:
  • If int, then consider max_features features at each split.
  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
  • If "auto", then max_features=sqrt(n_features).
  • If "sqrt", then max_features=sqrt(n_features) (same as "auto").
  • If "log2", then max_features=log2(n_features).
  • If None, then max_features=n_features.
  • max_leaf_nodes : int or None, optional (default=None). Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then there is an unlimited number of leaf nodes.
  • bootstrap : boolean, optional (default=True). Whether bootstrap samples are used when building trees.
  • oob_score : bool (default=False). Whether to use out-of-bag samples to estimate the generalization accuracy.
  • n_jobs : int or None, optional (default=None). The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors.
  • random_state : int, RandomState instance or None, optional (default=None). If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
  • verbose : int, optional (default=0). Controls the verbosity when fitting and predicting.
  • warm_start : bool, optional (default=False). When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
  • class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional (default=None). Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
  • Note that for multi-output (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
  • The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
  • The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown.
  • For multi-output, the weights of each column of y will be multiplied.
  • Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
Attributes:

  • estimators_ : list of DecisionTreeClassifier. The collection of fitted sub-estimators.
  • classes_ : array of shape = [n_classes] or a list of such arrays. The class labels (single output problem), or a list of arrays of class labels (multi-output problem).
  • n_classes_ : int or list. The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).
  • n_features_ : int. The number of features when fit is performed.
  • n_outputs_ : int. The number of outputs when fit is performed.
  • feature_importances_ : array of shape = [n_features]. The feature importances (the higher, the more important the feature).
  • oob_score_ : float. Score of the training dataset obtained using an out-of-bag estimate.
  • oob_decision_function_ : array of shape = [n_samples, n_classes]. Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN.
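Putting the handful of parameters we care about together with the fitted attributes, here is a hedged sketch (again on placeholder data; the particular values are arbitrary examples, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,        # more trees: more stable predictions, slower training
    max_features="sqrt",     # features considered at each split
    min_samples_leaf=3,      # smooths the individual trees
    oob_score=True,          # free validation estimate from out-of-bag rows
    n_jobs=-1,               # use all processors
    random_state=42,
)
clf.fit(X, y)

print("Number of fitted trees:", len(clf.estimators_))
print("OOB score:", clf.oob_score_)
print("Most important feature index:", clf.feature_importances_.argmax())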

oob_score

We know that when the validation score is much lower than the training score, we say that our model is overfitting. But that is not the whole story.

Such a gap may arise for two reasons:

  1. Our model is actually overfitting.
  2. Our training and validation sets cover different time periods (in time series data).

So we cannot say exactly which of the two is happening; instead we use an additional and more reliable check, the oob_score.

When we train our model, each tree does not use every row of the data (because of bootstrap sampling), so with oob_score we feed each tree's unused rows back to it as a validation set. This gives us the OOB score.

Then, if our oob_score is also deteriorating, we can confidently say that our model is overfitting.
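A minimal, hedged sketch of this check on placeholder data: compare the (optimistic) training score against the out-of-bag estimate.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

# The OOB score is computed only from rows each tree never saw during its
# bootstrap sample, so it behaves like a built-in validation score.
print("Training score:", clf.score(X, y))
print("OOB score:", clf.oob_score_)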

One thing to always remember: we have to build trees that are uncorrelated with each other, then ensemble them to get a great model out of them.

Real Life Analogy:

Imagine a guy named Andrew who wants to decide which places he should travel to during a one-year vacation trip. He asks people who know him for advice. First, he goes to a friend, who asks Andrew where he has traveled in the past and whether he liked it or not. Based on the answers, the friend will give Andrew some advice.

This is a typical decision tree algorithm approach. Andrew's friend created rules to guide his decision about what he should recommend, by using Andrew's answers.

Afterwards, Andrew starts asking more and more of his friends to advise him, and they again ask him different questions from which they can derive some recommendations. Finally, he chooses the places that were recommended to him the most, which is the typical Random Forest algorithm approach.

Feature Importance:

Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature for the prediction. Sklearn provides a great tool for this, which measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1.

If you don’t know how a decision tree works and if you don’t know what a leaf or node is, here is a good description from Wikipedia: In a decision tree each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). A node that has no children is a leaf.

By looking at the feature importance, you can decide which features you may want to drop because they contribute little or nothing to the prediction process. This is important, because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa.

During a supervised classification project with the famous Titanic dataset on Kaggle, I used a table and a visualization of this kind to show the importance of 13 features.
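Here is a hedged sketch of how such a table can be built; the feature names and the fitted model below are placeholders, not the actual Titanic project (plotting requires matplotlib):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with generic feature names.
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = (
    pd.DataFrame({"feature": feature_names, "importance": clf.feature_importances_})
    .sort_values("importance", ascending=False)
)
print(importances)
importances.plot.barh(x="feature", y="importance")  # simple visualization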

Difference between Decision Trees and Random Forests:

Like I already mentioned, Random Forest is a collection of Decision Trees, but there are some differences.

If you input a training dataset with features and labels into a decision tree, it will formulate some set of rules, which will be used to make the predictions.

For example, if you want to predict whether a person will click on an online advertisement, you could collect the ads the person clicked on in the past and some features that describe his decision. If you put the features and labels into a decision tree, it will generate some rules. Then you can predict whether the advertisement will be clicked or not. In comparison, the Random Forest algorithm randomly selects observations and features to build several decision trees and then averages the results.

Another difference is that "deep" decision trees might suffer from overfitting. Random Forest prevents overfitting most of the time by creating random subsets of the features and building smaller trees using these subsets. Afterwards, it combines the subtrees. Note that this doesn't work every time and that it also makes the computation slower, depending on how many trees your random forest builds.
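To see this difference in practice, here is a small hedged comparison on noisy synthetic data: a single unpruned decision tree against a forest.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so that memorizing the training set does not generalize.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The deep single tree fits the training data almost perfectly; the forest
# averages many decorrelated trees and usually holds up better on new data.
print("Tree   train/valid:", tree.score(X_train, y_train), tree.score(X_valid, y_valid))
print("Forest train/valid:", forest.score(X_train, y_train), forest.score(X_valid, y_valid))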

1. Increasing the Predictive Power

Firstly, there is the "n_estimators" hyperparameter, which is just the number of trees the algorithm builds before taking the majority vote or averaging the predictions. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.

Another important hyperparameter is "max_features", which is the maximum number of features Random Forest is allowed to try in an individual tree. Sklearn provides several options, described in their documentation.

The last important hyperparameter in this group is "min_samples_leaf". This determines, as its name already says, the minimum number of samples required to be at a leaf node.
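These three hyperparameters can be tuned together, for example with a simple grid search; the grid values below are arbitrary examples on placeholder data, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)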

2. Increasing the Model's Speed

The "n_jobs" hyperparameter tells the engine how many processors it is allowed to use. If it has a value of 1, it can only use one processor. A value of -1 means that there is no limit.

"random_state" makes the model's output replicable. The model will always produce the same results when it has a definite value of random_state and if it has been given the same hyperparameters and the same training data.

Lastly, there is the "oob_score" option (based on out-of-bag, or oob, sampling), which is a random forest cross-validation method. With this sampling, about one-third of the data is not used to train each tree and can be used to evaluate its performance. These samples are called the out-of-bag samples. It is very similar to the leave-one-out cross-validation method, but almost no additional computational burden goes along with it.
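A small hedged sketch of these settings together (timings depend entirely on the machine; the dataset is a synthetic placeholder):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

for n_jobs in (1, -1):
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs,
                                 oob_score=True, random_state=42)
    start = time.time()
    clf.fit(X, y)
    # Same random_state gives the same forest and the same OOB score;
    # only the wall-clock time changes with n_jobs.
    print(f"n_jobs={n_jobs}: {time.time() - start:.2f}s, oob_score={clf.oob_score_:.3f}")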

Advantages and Disadvantages:

Like I already mentioned, an advantage of random forest is that it can be used for both regression and classification tasks and that it’s easy to view the relative importance it assigns to the input features.

Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. The number of hyperparameters is also not that high, and they are straightforward to understand.

One of the big problems in machine learning is overfitting, but most of the time this won't happen that easily with a random forest classifier. That's because, if there are enough trees in the forest, the classifier won't overfit.

The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model. In most real-world applications the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred.

And of course Random Forest is a predictive modeling tool and not a descriptive tool. That means, if you are looking for a description of the relationships in your data, other approaches would be preferred.

Use Cases:

The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine and e-commerce. In banking it is used, for example, to detect customers who will use the bank's services more frequently than others and repay their debt on time. In this domain it is also used to detect fraudulent customers who want to scam the bank. In finance, it is used to determine a stock's future behaviour. In the healthcare domain it is used to identify the correct combination of components in a medicine and to analyze a patient's medical history to identify diseases. And lastly, in e-commerce random forest is used to determine whether a customer will actually like a product or not.