Ensemble Learning and Random Forests in R

Dr. GP Pulipaka
3 min read · Jun 7, 2016
Ensemble Learning

R supports ensemble learning. Ensemble learning combines multiple predictions into a single, stronger overall prediction, overcoming the assumptions and weaknesses of each individual method, such as nearest-neighbor models, logistic regression, Bayesian methods, classification decision trees, or discriminant analysis. For instance, the predictions of a random forest, a simple linear model, and a support vector machine can be combined to derive a stronger prediction outcome. As the diversity of the models increases, the performance of the ensemble improves significantly. Combining multiple models of a similar nature does not provide the best performance; it is the diversity of the models that produces the strongest possible prediction outcomes. Ensemble models built from combinations of multiple models generally surpass the prediction performance of any single model.
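As an illustration, here is a minimal sketch of majority-vote ensembling in R. It assumes the randomForest and e1071 packages are installed; the dataset, split, and model choices are purely illustrative.

```r
# A minimal sketch of majority-vote ensembling in R, assuming the
# randomForest and e1071 packages are installed; dataset, split, and
# model choices are illustrative.
library(randomForest)
library(e1071)

set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Three diverse base models trained on the same task
rf_fit  <- randomForest(Species ~ ., data = train)
svm_fit <- svm(Species ~ ., data = train)
nb_fit  <- naiveBayes(Species ~ ., data = train)

# Collect each model's class predictions on the test set
preds <- data.frame(
  rf  = predict(rf_fit, test),
  svm = predict(svm_fit, test),
  nb  = predict(nb_fit, test)
)

# Majority vote: the class predicted most often across models wins
vote <- apply(preds, 1, function(p) names(which.max(table(p))))
mean(vote == test$Species)  # accuracy of the combined prediction
```

Averaging class probabilities instead of voting on hard labels is a common variant; the voting version is shown here because it extends directly to the random forest discussion below.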

The random forest algorithm is considered one of the most efficient and best algorithms available for computing predictions. A random forest is a combination of a number of decision trees. A decision tree algorithm evaluates the candidate features at each node to determine which feature provides the best information gain, that is, which split most reduces the entropy and disorder in the dataset and increases the consistency of the resulting branches. This evaluation determines the first node, where the data is partitioned according to the threshold defined for the split. The process recurses through the branch nodes until it reaches a decision point, a leaf, where the entropy is minimal or zero; splits continue to occur as long as the entropy remains above zero. Training the model in parallel on multiple randomly generated samples of the dataset provides the best prediction performance for the task. The final prediction can be obtained by voting over the individual tree decisions, accepting the result that is repeated most often across the trees. For example, if 20 decision trees in a forest label a credit loan applicant as "good" and 25 decision trees label the applicant as "best", the final prediction is the label backed by the larger number of trees, "best" in this case.
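A minimal sketch of this voting behavior using R's randomForest package follows; the number of trees and the rows scored are illustrative. Passing predict.all = TRUE to predict() exposes the individual vote of every tree.

```r
# A minimal sketch of per-tree voting with R's randomForest package;
# ntree and the rows scored are illustrative.
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 50)

# predict.all = TRUE returns every tree's individual vote
out   <- predict(rf, iris[c(1, 51, 101), ], predict.all = TRUE)
votes <- out$individual  # one predicted label per tree, per row

# The final prediction is the label backed by the most trees
apply(votes, 1, function(v) names(which.max(table(v))))
out$aggregate            # randomForest's own majority-vote result
```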

Most machine learning models tend to introduce noise into the data modeling; random forests produce some of the least noise among machine learning models. Each time a split occurs, a random forest randomly chooses the subset of features considered for the partition, decorrelating the trees. Random forests are used for both classification and regression tasks. Classification tasks include predicting a patient's chronic disease development or cell disintegration, scoring credit applications, or modeling the psychological behavior and buying patterns of consumers toward a newly released product. Regression tasks can include predicting average weather temperatures and scores; in regression tasks, the outcomes of the trees are averaged to produce the prediction. The predictions can also be applied to speech analytics. Mathematically, a random forest can be represented as a collection of tree predictors {h(x, Θⱼ), j = 1, …}, where each Θⱼ is an independently drawn random parameter vector governing the j-th tree. R provides the randomForest package for implementing random forests; the Python programming language supports random forests as well, for example through scikit-learn. The disadvantage of random forests is that they require very deep debugging of the machine learning model to trace variations across the features of each tree and to evaluate each tree for insight into the factors that influenced a given prediction outcome.
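For the regression case, the sketch below shows the per-tree averaging and a first step toward the interpretability issue noted above. It assumes the randomForest package; the built-in airquality data stands in for a real regression task, so the details are illustrative.

```r
# A minimal sketch of random forest regression in R: per-tree outputs
# are averaged rather than voted.
library(randomForest)

aq <- na.omit(airquality)  # drop rows with missing values
set.seed(1)
rf_reg <- randomForest(Ozone ~ ., data = aq, ntree = 100)

out <- predict(rf_reg, aq[1:3, ], predict.all = TRUE)
rowMeans(out$individual)   # mean of the 100 per-tree predictions
out$aggregate              # matches the averaged result

# Variable importance gives a first, coarse view into which features
# drove the forest's predictions, easing the debugging burden noted above
importance(rf_reg)
```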

