Data and AI Journey: Jupyter Notebook vs Dataiku DSS (4). Random Forest.

Vladimir Parkov
10 min read · May 25, 2023


Random Forest generated by kandinsky-2

In the last part, Part 3 (here), we tried to predict whether an employee would leave the company using Logistic Regression.

You can also check out Part 1 (here) and Part 2 (here) to get the full context of the story so far.

Remember how we improved the predictive power of the Logistic Regression model by changing its threshold? Let’s also calculate another essential and helpful evaluation metric, the ROC-AUC score, and illustrate how it looks for our Logistic Regression model:

The ROC-AUC (Receiver Operating Characteristic — Area Under the Curve) score is used to evaluate how well a classification model can distinguish between positive and negative classes.

It can be viewed as the probability that the model ranks a random positive example more highly than a random negative example.

It calculates the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds.

· TPR tells us how well the model identifies positive cases correctly.

· FPR tells us how often the model incorrectly identifies negative cases as positive.

A higher ROC-AUC score indicates better model performance. It is widely used because it handles imbalanced data, is threshold-independent, and facilitates model comparison.
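As an illustration, here is a minimal sketch of how the ROC-AUC score and ROC curve can be computed with scikit-learn. It assumes a fitted Logistic Regression model called log_clf and test data X_test, y_test; these names are mine, not necessarily those in the original notebook:

```python
# Sketch: ROC-AUC score and ROC curve for a fitted classifier (names illustrative)
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, RocCurveDisplay

# Predicted probabilities for the positive class ("left the company")
y_scores = log_clf.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, y_scores))

# The ROC curve plots the true positive rate against the false positive rate
# at every possible threshold
RocCurveDisplay.from_predictions(y_test, y_scores)
plt.show()
```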

As a reminder, let’s create an aggregate table with all the evaluation metrics achieved by the Logistic Regression model:
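One way to assemble such a table in Python is sketched below. It assumes class predictions lr_preds and predicted probabilities lr_scores from the Logistic Regression model; both names, and the exact set of metrics, are illustrative:

```python
# Sketch: gather the evaluation metrics for a model into one summary row
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def metrics_row(name, y_true, y_pred, y_score):
    """Return one row of evaluation metrics for a fitted model."""
    return {
        "model": name,
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

summary = pd.DataFrame([metrics_row("Logistic Regression", y_test, lr_preds, lr_scores)])
print(summary)
```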

Now we are ready to make predictions with the Random Forest algorithm!

Random Forest with Jupyter Notebook

We will start with the version of the dataset that includes outliers. Random Forest shouldn’t be significantly affected by these outliers, as it uses multiple decision trees and aggregates their predictions. Outliers usually have a limited impact because they tend to get averaged out and isolated in separate leaf nodes during the tree-building process.

As a result, we get a larger training and testing dataset of 11,991 rows. We create the data splits and define the initial hyperparameters for our Random Forest model:

The hyperparameters that we assigned for the Random Forest model are the following:

· max_depth: specifies the maximum depth of each decision tree in the random forest. It is given as a list [3, 5, None] where each value represents a different option for the maximum depth. ‘None’ means there is no maximum depth limit.

· max_features: the number of features (here, given as a fraction) to consider when looking for the best split in each decision tree. It is set to [1.0], which means all features will be considered.

· max_samples: the fraction of the training data drawn to build each decision tree. The values [0.7, 1.0] mean that either 70% or 100% of the samples will be used.

· min_samples_leaf: sets the minimum number of samples required at a leaf node. The values [1, 2, 3] indicate different options for this parameter.

· min_samples_split: represents the minimum number of samples required to split an internal node. The values [2, 3, 4] represent different options for this parameter.

· n_estimators: defines the number of decision trees in the random forest ensemble. The values [300, 500] indicate two different options for the number of estimators (decision trees).

· cv=4 means that the data will be divided into 4 equal parts (folds), and the model will be trained and evaluated 4 times, each time using a different fold as the validation set and the remaining folds as the training set.

The goal is to find the best combination of hyperparameters that maximizes the ROC-AUC score (refit=’roc_auc’).
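Putting this together, here is a rough sketch of the grid search in scikit-learn. The variable names, the scoring list and the training data X_train, y_train (from the split described just below) are my assumptions; the hyperparameter grid, cv=4 and refit="roc_auc" are as described above:

```python
# Sketch of the grid search described above (variable names illustrative)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=0)

cv_params = {
    "max_depth": [3, 5, None],
    "max_features": [1.0],
    "max_samples": [0.7, 1.0],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3, 4],
    "n_estimators": [300, 500],
}

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# 4-fold cross-validation; the best model is selected and refit on ROC-AUC
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit="roc_auc")
rf_cv.fit(X_train, y_train)
```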

Since one of the tasks is to determine the optimal hyperparameters of the model, we split our dataset into three parts (a code sketch follows the list):

· Test set: 20% of the original data

The remaining 80% is divided into two parts:

· Validation set: 25% of the remaining data, or 20% of the original data

· Train set: 75% of the remaining data, or 60% of the original data
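A minimal sketch of that 60/20/20 split with scikit-learn’s train_test_split; the variable names, stratification and random_state are my assumptions, not necessarily what the original notebook used:

```python
# Sketch of the 60/20/20 train/validation/test split (names illustrative)
from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the final test set
X_tr, X_test, y_tr, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then split the remaining 80% into train (75%) and validation (25%),
# i.e. 60% and 20% of the original data respectively
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0
)
```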

This is the first time we will have to wait while the model works through all the hyperparameter combinations on the training data. Random Forest is a relatively fast algorithm to train compared to complex deep learning models, and it benefits from its ability to parallelize training across multiple decision trees.

The training time for Random Forest depends on the number of trees in the forest and the complexity of the dataset.

Even this “fast” algorithm took over 14 minutes to train in this case! If you reduce the number of hyperparameters, especially the number of decision trees to, say, 100, you will train your model in a matter of seconds.

Training a Random Forest model can take a lot of time!

It is a good idea to save your model after training.

In Python, you can do it using Pickle — a module that conveniently saves objects to a file.

Pickle allows you to save the state of an object, including its attributes, to disk or send it over a network.

Pickle is commonly used for keeping trained machine learning models so that they can be loaded and reused later without retraining them.
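A minimal sketch of saving and reloading the fitted model with Pickle; the file name is just an example, and rf_cv is the fitted grid-search object from the sketch above:

```python
# Sketch: persist the fitted model to disk and load it back later without retraining
import pickle

with open("hr_rf_model.pickle", "wb") as f:
    pickle.dump(rf_cv, f)

# Later, reload it from disk
with open("hr_rf_model.pickle", "rb") as f:
    rf_cv = pickle.load(f)
```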

In my case, that model was saved to a .pickle file of 1.6 MB in size.

So, what was the best ROC-AUC score our Random Forest model achieved, and what were its optimal parameters?
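With scikit-learn’s GridSearchCV, both can be read directly from the fitted search object; a small sketch, continuing with the rf_cv name assumed above:

```python
# Inspect the best cross-validated ROC-AUC score and the winning hyperparameters
print("Best ROC-AUC:", rf_cv.best_score_)
print("Best parameters:", rf_cv.best_params_)
```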

The ROC-AUC score of 0.98 is significantly better compared to the ROC-AUC score of 0.81 for Logistic Regression.

Now let’s see how Random Forest performed on validation and testing sets and compare it also with the Logistic Regression metrics:

Random Forest performs consistently well on both the validation and testing sets! It is also vastly superior to Logistic Regression in its predictive power.

Let’s see the confusion matrix for the predictions made by Random Forest on the testing set:
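A sketch of how this confusion matrix can be produced, again assuming the fitted rf_cv search object and the test set from the sketches above:

```python
# Sketch: confusion matrix of the best model on the held-out test set
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = rf_cv.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Optional: plot it with readable class labels
ConfusionMatrixDisplay(cm, display_labels=["stayed", "left"]).plot()
```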

Out of 2,399 data points, Random Forest was wrong in only 48 cases!

Random Forest is astonishingly powerful: it is beautifully simple to understand and can effectively handle high-dimensional data with many features without feature scaling or selection.

Random Forest provides feature importance measures, making it possible to identify the most influential features in the prediction process. This helps in understanding the underlying data patterns and making informed decisions.

Let’s see which features of our dataset Random Forest identified as the most important:
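One way to extract and plot these importances in scikit-learn, reusing the names assumed above and assuming X_train is a pandas DataFrame with named columns:

```python
# Sketch: rank the features by Random Forest importance (names illustrative)
import pandas as pd

best_rf = rf_cv.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# Optional horizontal bar chart of the importances
importances.sort_values().plot(kind="barh")
```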

Compare it with the most important features of the Logistic Regression algorithm:

There are some differences in the most influential features identified by the two algorithms. Both select satisfaction level and tenure as important. Random Forest also considers the number of projects the employee works on, the last evaluation score, and the average monthly working hours important.

Logistic Regression considers work accidents, whether the salary is high, and promotions over the last 5 years to be significant.

Lastly, let’s see what type of decision tree our Random Forest algorithm has built to provide such reliable predictions:

Remember that our optimal tree has a depth of 5: we reach a leaf node from the root node in no more than 5 steps.

We can see each node’s decision criterion, Gini impurity, sample count, value and assigned class after each decision.
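If you want to reproduce this kind of plot, here is a minimal sketch using scikit-learn’s plot_tree, reusing the fitted grid-search object and column names assumed above:

```python
# Sketch: draw one tree from the fitted Random Forest ensemble
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(
    rf_cv.best_estimator_.estimators_[0],   # first tree in the ensemble
    feature_names=list(X_train.columns),
    class_names=["stayed", "left"],
    filled=True,
)
plt.show()
```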

For example, in the root node, we have the following parameters:

· Satisfaction level <= 0.475. If this is true for the employee in the testing data set, we move to the left branch and the next node; if not — to the right branch.

· Gini = 0.283. The Gini impurity value shows how “pure” the node is. A Gini impurity of 0 indicates a perfectly pure node, where all the samples in that node belong to the same class. Higher values (up to 0.5 for a two-class problem like ours) indicate a more impure node, where the samples are spread evenly across the classes (see the quick check after this list).

· Samples shows the number of data points in the node. There are 3,633 data points that we will check against the satisfaction level criterion.

· Value shows how the samples in the node are distributed across the classes. Here we have 4,178 samples of class “0” (stayed with the company) and 858 samples of class “1” (left the company).
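As a quick sanity check on the Gini figure, here is the standard two-class Gini formula applied to the root node’s class counts; this is a small illustrative snippet, not code from the original notebook:

```python
# Gini impurity of the root node from its class counts (4,178 stayed vs 858 left),
# using the two-class formula Gini = 1 - p0**2 - p1**2
total = 4178 + 858
p0, p1 = 4178 / total, 858 / total
gini = 1 - p0**2 - p1**2
print(round(gini, 3))  # 0.283, matching the value shown in the tree
```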

This path is repeated for each employee in our testing set to predict whether they will leave the company.

Random Forest with Dataiku DSS

As I mentioned before, with Dataiku DSS, you can literally train your model with a couple of clicks with minimal data cleaning.

Just open your dataset, click on the target column and choose “Create prediction Value”. Dataiku DSS will offer optimal algorithms based on your data!

In our case, it automatically identified Logistic Regression and Random Forest as the best prediction algorithms for this task!

Now let’s look at the modelling design for Random Forest that Dataiku DSS proposes to use by default.

Dataiku DSS proposes generating 100 decision trees, with a minimum of 1 sample per leaf node, and testing two tree depth options: 6 and 13.

Remember that in Python, the identified optimal parameters were 500 trees and a tree depth of 5.

Let’s train the model and see the results:

Random Forest is obviously better; no surprises here. It achieved a ROC-AUC score of over 98%, compared to the Logistic Regression ROC-AUC score of only 84%.

Random Forest with 100 trees in Dataiku DSS achieved the same 98% ROC-AUC level as the Random Forest trained in Python with 500 trees.

That’s why it took only 4 seconds to train the model with Dataiku DSS: by default, there are fewer hyperparameter combinations to go through. Training the model with the same parameters in Python took me 6 seconds.

Another interesting point is that the most important variables identified by Random Forest with Dataiku DSS slightly differ from those of Random Forest generated by Python.

While the most important feature is still satisfaction level, there are slight differences in the importance scores of other features compared to the Random Forest generated by Python.

Let’s look at the confusion matrix and see how well Random Forest did and compare it to the confusion matrix from Python’s Random Forest:

We got almost exactly the same results!

Lastly, let’s also look at one of the decision trees generated by Dataiku DSS. You can quickly go through any of the trees to understand their structure:

Conclusion: A data journey of a thousand business discoveries begins with a single click

Of course, I have barely scratched the surface, but I hope that through this series of articles you now have a better practical understanding of how both products work.

The choice of tools for your data and AI journeys is yours. But it is clear how convenient it is to use Dataiku DSS to get results fast!

I wish Dataiku would collaborate with Coursera to create an AI Democratization certificate for Business Users.

Dataiku Academy is an excellent platform for self-learning, but it pains me that such great content hasn’t reached the much wider audience that could learn about this great product.

The major strength of Dataiku DSS is that you always keep an end-to-end focus in your data and AI projects. It is much easier to track the flow of information, the insights and the time to value. For example, you can easily share and follow all the steps of your projects in a visual way. Just look at the project flow in Dataiku DSS:

This intuitive visual interface and level of interactivity are the strongest features of Dataiku DSS.

It is very easy to see what steps were taken to transform datasets, to quickly access the training results for your ML models and to fine-tune them. You can click on any step to get all the details or implement the needed changes. All the changes you make are immediately visible and interactive.

Python is now much more manageable, even for business users like myself, especially with the help of ChatGPT, which you can now embed right into the Jupyter Notebook interface. For example, you can use the ChatGPT — Jupyter — AI Assistant extension for Chrome, available here.

Thank you for being with me on this journey. If you have any comments or questions, feel free to contact me!
