Online Shoppers Purchasing Intention - Part 2 - "Random Forest"

Emine Kayalı
May 8, 2020


In this story, I will try to describe how I built a random forest and a decision tree model for my project. If you haven't read Part 1 yet, you can easily reach it by clicking here: Part 1.

— Random Forest —

Random Forest creates multiple decision trees and combines them to get a more accurate and stable prediction. Instead of searching for the most important feature when splitting a node, it searches for the best feature among a random subset of features.

*Features of Random Forests

  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It has methods for balancing error in class-imbalanced data sets.

In Part 1 we did some analysis; now we need to change our data distribution, because the True and False classes of the target are imbalanced.
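A minimal sketch of one way to do this, assuming the data lives in a pandas DataFrame called df with Revenue as the target column; downsampling the majority class with sklearn.utils.resample is my assumption here, not necessarily the method used in the project:

    import pandas as pd
    from sklearn.utils import resample

    majority = df[df["Revenue"] == False]
    minority = df[df["Revenue"] == True]

    # Downsample the majority class to the size of the minority class
    majority_down = resample(
        majority,
        replace=False,
        n_samples=len(minority),
        random_state=42,
    )
    df_balanced = pd.concat([majority_down, minority])

    print(df_balanced["Revenue"].value_counts())  # both classes now equal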

Now the True and False classes are balanced!

sklearn.model_selection.train_test_split

Split arrays or matrices into random train and test subsets.

Parameters:

test_size: If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

random_state: If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

stratify: If not None, data is split in a stratified fashion, using this as the class labels.
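Putting these parameters together, here is a minimal sketch, assuming the balanced DataFrame df_balanced from above and Revenue as the target (the variable names are my assumptions):

    from sklearn.model_selection import train_test_split

    X = df_balanced.drop("Revenue", axis=1)
    y = df_balanced["Revenue"]

    # Hold out 25% of the data, fix the seed for reproducibility, and
    # keep the True/False ratio identical in both splits via stratify.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )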

Out-of-bag (OOB) score: an estimate of our random forest's accuracy, computed on the samples that each tree did not see during bootstrap sampling. It is used to evaluate the model's performance without setting aside a separate validation set.
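A minimal sketch of enabling it in scikit-learn, assuming the X_train and y_train split above; the n_estimators value is a placeholder:

    from sklearn.ensemble import RandomForestClassifier

    # The out-of-bag samples of each bootstrap act as a built-in
    # validation set when oob_score=True.
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
    rf.fit(X_train, y_train)
    print(rf.oob_score_)  # accuracy estimated on the out-of-bag samples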

How can we raise the OOB score of our model?

**GridSearchCV:

Hyperparameter tuning means searching for the parameter values that give our model the best accuracy. GridSearchCV works by training our model several times on a grid of parameter values that we specify. In this way, we can test our model with every combination and determine the optimal values for the best accuracy scores. The terms below come up when tuning a random forest; a sketch of the search follows the list.

*Gini impurity measures the probability that a randomly chosen sample would be classified incorrectly if it were labeled at random according to the class distribution in the node.

  • Entropy is a measure of disorder or uncertainty; decision trees choose splits that reduce it.
  • min_samples_leaf is the minimum number of samples required to be at a leaf node.
  • max_features is the maximum number of features considered when looking for the best split.
  • n_jobs is the number of jobs to run in parallel; -1 means using all processors.
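Here is a sketch of such a grid search over the parameters above; the specific value ranges are illustrative assumptions, not the grid used in the original project:

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder value ranges for the parameters described above.
    param_grid = {
        "criterion": ["gini", "entropy"],
        "min_samples_leaf": [1, 3, 5],
        "max_features": ["sqrt", "log2"],
        "n_estimators": [100, 300],
    }

    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,       # 5-fold cross-validation for every combination
        n_jobs=-1,  # use all processors
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)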

CONFUSION MATRIX

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It makes it easy to visualize how an algorithm performs.

Now we can predict very well which customers do not generate revenue, but I want to know which customers do. That's why I need to check the recall and precision metrics.
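A short sketch of computing these with scikit-learn, assuming the fitted grid search from the previous step:

    from sklearn.metrics import confusion_matrix, classification_report

    y_pred = grid.predict(X_test)

    # Rows are the true classes, columns the predicted classes.
    print(confusion_matrix(y_test, y_pred))

    # Per-class precision and recall show how well the model finds the
    # customers who actually generate revenue (the True class).
    print(classification_report(y_test, y_pred))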

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

It is one of the most commonly used measures for evaluating the performance of machine learning algorithms, especially on imbalanced data sets. It tells us how well the model is predicting.
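For instance, precision = 0.8 and recall = 0.6 give F1 = 2 * (0.8 * 0.6) / (0.8 + 0.6) ≈ 0.69. In scikit-learn it is computed as follows, continuing with the predictions from above:

    from sklearn.metrics import f1_score

    # By default the score is reported for the positive class,
    # i.e. the customers who generate revenue (True).
    print(f1_score(y_test, y_pred))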

References:

scikit-learn documentation: https://scikit-learn.org

You can also check my other stories:

EDA — Is Turkey in The Earthquake Risk Zone?

CLTV — Customer LifeTime Value
