Data and AI Journey: Jupyter Notebook vs Dataiku DSS (3). Logistic Regression.

Vladimir Parkov
9 min read · May 22, 2023


Logistic Regression generated by Fusion Brain

In the previous two parts (you can read these here and here), we explored and transformed our dataset of employees, some of whom left the company while others stayed. After removing duplicates and outliers and encoding the categorical variables, we have a dataset with 11,167 rows and 20 potential features + 1 outcome variable (“left”).

With Dataiku DSS, we don’t need to do anything with categorical variables: they will be encoded automatically during the model training phase.

One way to put our work with data in context is the pyramid of data science needs, beautifully captured by Monica Rogati back in 2017:

The Data Science Hierarchy of Needs by Monica Rogati

So far, we have moved from the bottom of the pyramid to the “Aggregate / Label” level. We will not reach the top of the pyramid in our Data and AI journey here. Logistic Regression and the tree-based models we will use are simple machine learning algorithms that will likely be enough to ensure high-quality predictions.

Deep Learning algorithms are employed for much more complex tasks and can learn feature representations directly from raw input data. They are often used for problems related to image and speech recognition, natural language processing, and autonomous driving.

While traditional machine learning algorithms can handle structured data, deep learning is particularly well-suited for unstructured data such as images, videos, and text.

Splitting the Dataset for Logistic Regression

In Python, we first need to isolate the outcome variable and then create a new variable for isolating the features we want to use to predict the outcome. Let’s use all 20 independent variables as predicting features for now.

Isolating target and predicting variables
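The isolation step can be sketched as follows (a minimal example with a made-up two-column DataFrame standing in for the real one with 20 features):

```python
# A minimal sketch of isolating the target and the features; `df` here is
# a made-up stand-in for the cleaned HR DataFrame with its "left" column.
import pandas as pd

df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.72, 0.11],
    "number_project": [2, 5, 3, 6],
    "left": [1, 0, 0, 1],
})

y = df["left"]               # the outcome variable
X = df.drop("left", axis=1)  # everything else becomes a predicting feature
```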

Now we can split the data into training and test sets. We will use 80% of the data for training and the remaining 20% to estimate how well the model performs predictions on new data.

Splitting data into training and testing sets
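With scikit-learn, the split is one call to `train_test_split` (sketched here on a small synthetic frame; the real `X` and `y` come from the isolation step above):

```python
# 80/20 train-test split with scikit-learn; X and y are small synthetic
# stand-ins for the real feature matrix and target.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
y = pd.Series([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed seed for reproducibility
)
```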

Our training set contains 8,933 rows, precisely 80% of the original dataset with 11,167 rows.

You can use any number for “random_state” and then reuse that number to get a reproducible split — useful, for example, when you need to rerun the model on another machine and replicate the split. Without specifying “random_state”, you will get a different split every time.

With Dataiku DSS, the split happens during the training stage, so no separate dataset preparation or wrangling is needed here. For the sake of comparison, though, let’s also split the dataset by the same 80/20 ratio.

You don’t need to know any syntax or write a single line of code to split the dataset. You just apply the visual “Split” recipe and specify how to perform the split.

Splitting data into training and testing sets with Dataiku DSS is very easy

Even if you put the same number you used for Python’s “random_state” into “Set Random Seed”, you will get different split results in Dataiku DSS.

Building Logistic Regression in Python

First, we create an instance of the logistic regression model and fit the model to the training data.

Our trained model will then predict the test data’s target variable, which is done with a couple of lines of code:

The “max_iter” parameter determines the maximum number of iterations for the model to converge if it doesn’t reach convergence earlier.
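The fit-and-predict step looks roughly like this (synthetic stand-ins below for the train/test arrays from the split):

```python
# Fitting Logistic Regression and predicting on held-out data; the arrays
# here are synthetic stand-ins for X_train, y_train, X_test from the split.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(20, 3))

# max_iter caps the number of solver iterations if convergence is slow
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```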

How well does Logistic Regression perform on this new data? Let’s look at the major performance metrics and also plot the confusion matrix for the model:
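One way to produce both with scikit-learn (a sketch with made-up labels; the real call uses `y_test` and the model’s predictions):

```python
# Performance metrics plus the confusion matrix; y_test and y_pred are
# tiny made-up arrays so the snippet runs on its own.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_test = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)  # rows = actual, columns = predicted
print(cm)
```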

Our test set contains 2,234 rows of data that the trained model has not seen before. 364 (16%) employees left the company, and 1,870 (84%) stayed with the company.

Let’s look at the performance metrics in more detail:

Precision measures, for each class, the share of the model’s predictions for that class that turned out to be correct:

- Model predicted 2,031 employees would stay with the company (predicted value for “left” = 0), and for 1,761 employees this was correct (predicted and actual value for “left” = 0) — 87%

- Model predicted 203 employees would leave the company (predicted value for “left” = 1), and for 94 employees this was correct (predicted and actual value for “left” = 1) — 46%

To sum up: When the model predicts that an employee will stay, it is very likely correct. When it predicts that an employee will leave, it is right less than half the time.

Recall measures, for each class, the share of actual cases of that class that the model correctly identified:

- Model predicted 1,761 employees would stay with the company (predicted and actual value for “left” = 0) out of 1,870 who actually stayed (actual value for “left” = 0) — 94%

- Model predicted 94 employees would leave the company out of 364 who actually left (actual value for “left” = 1) — 26%

To sum up: The model correctly identified the vast majority of employees who would stay, but caught only a small share of the employees who would leave the company.

F1-score is the harmonic mean of precision and recall.

Accuracy measures the share of correct classifications in the data set.

Out of 2,234 employees, the model correctly predicted what would happen for 1,855 employees (who left or stayed) — 83%.
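These percentages follow directly from the confusion-matrix counts quoted above:

```python
# Recomputing the metrics by hand from the counts in the text: 94 true
# positives out of 203 predicted leavers, 364 actual leavers, and 1,855
# correct predictions out of 2,234.
tp, predicted_pos, actual_pos = 94, 203, 364
correct, total = 1855, 2234

precision = tp / predicted_pos  # ~0.46
recall = tp / actual_pos        # ~0.26
accuracy = correct / total      # ~0.83
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.33
```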

Overall, this instance of the Logistic Regression model predicts poorly for people leaving the company, which may be due to class imbalance: the vast majority of people in the dataset stayed with the company.

Changing the threshold

By default, the threshold of logistic regression is set to 0.5.

If the predicted probability of belonging to the positive class is greater than or equal to 0.5, it is classified as positive (employee will leave the company); otherwise, it is classified as negative (employee will stay).

We can change the default threshold from 0.5 to, say, 0.4, and it will result in a higher number of people classified as those who will leave, but this will also increase the number of false positives (incorrectly classifying people loyal to the company as those who will leave).
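scikit-learn has no threshold parameter on `predict`, so the usual approach is to threshold the predicted probabilities yourself (a sketch on synthetic data):

```python
# Lowering the classification threshold from 0.5 to 0.4 by thresholding
# predict_proba output; the model and data are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)
X_test = rng.normal(size=(50, 3))

model = LogisticRegression(max_iter=500).fit(X_train, y_train)

proba_leave = model.predict_proba(X_test)[:, 1]  # P(left = 1)
y_pred_05 = (proba_leave >= 0.5).astype(int)     # default threshold
y_pred_04 = (proba_leave >= 0.4).astype(int)     # lowered threshold
```

Lowering the threshold can only move predictions toward the positive class, which is exactly the precision/recall tradeoff discussed above.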

We see a significant improvement in performance metrics for those who leave; accuracy and F1-score have also improved, but the number of false positives grew by 33%, from 109 to 146.

As always, there is a tradeoff between the metrics. If your business goal is to reduce attrition by focusing on those who leave and it is acceptable to get more false positives, then you can fine-tune your model by changing the threshold.

Reducing the number of features

Another issue may be the multicollinearity of independent features that we observed earlier. Let’s look at the most important features using Recursive Feature Elimination (RFE). RFE is an iterative feature selection technique that starts with all features and eliminates the least important one in each iteration:
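A minimal RFE sketch with scikit-learn (synthetic data below; the real run would use the training split and all 20 features):

```python
# RFE ranks features by repeatedly fitting the model and dropping the
# weakest feature; here only the first two synthetic features matter.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # outcome driven by features 0 and 1

rfe = RFE(LogisticRegression(max_iter=500), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # True for the features RFE kept
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```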

Let’s remove the unimportant features — most of the encoded categorical features, average monthly hours, and the data on time spent with the company (which shows a large degree of multicollinearity). Let’s retrain the model and see how it performs:

By limiting the number of features based on their importance and multicollinearity level, we improved prediction precision for those who will leave the company.

Logistic Regression with Dataiku DSS

Dataiku DSS is focused on getting you the business result as soon as possible, so to get predictions, you just need to open the dataset, click on the “left” column and choose “Create prediction model”. And that’s it!

With Dataiku DSS you create your ML models literally with a couple of clicks!

And if you choose the option of “AutoML”, you will let Dataiku create your models, and it will propose the optimal ML algorithms based on your data!

It will also automatically do the train-test split and encode all the categorical variables for your models!

Let’s review the modelling design proposed by Dataiku:

Dataiku DSS correctly identified the most suitable algorithms based on our dataset, Logistic Regression and Random Forest!

For now, let’s focus on Logistic Regression. Dataiku DSS offers the option to apply L1 and L2 regularization to optimize the model from the start.

As a refresher:

L1 Regularization (Lasso Regularization) encourages the model to set some coefficients to zero. This has the effect of performing feature selection by shrinking the less important features to zero, effectively removing them from the model.

L2 Regularization (Ridge Regularization) encourages the model to distribute the impact of each feature across all the coefficients. It helps in preventing overfitting by keeping the model weights small and well-distributed.
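In scikit-learn terms (a sketch, not Dataiku’s internal configuration), the two penalties look like this:

```python
# The two penalties in scikit-learn; C is the inverse regularization
# strength, so smaller C means stronger regularization. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# L1 (Lasso): can shrink coefficients exactly to zero (implicit feature
# selection); liblinear is one solver that supports the L1 penalty
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
lasso_model.fit(X, y)

# L2 (Ridge): shrinks all coefficients toward zero without removing them
ridge_model = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0)
ridge_model.fit(X, y)
```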

Dataiku DSS, by default, proposes to keep all the features

Let’s run the training with the parameters offered by Dataiku DSS and look at the results for Logistic Regression:

Let’s compare the results with the confusion matrix we got from Python. Due to rounding, Dataiku DSS used 2,246 data points for testing vs 2,234 data points with Python.

Notice that sklearn in Python plots the confusion matrix with the True Negative class first. However, every other source I have seen sets up the matrix with the True Positive class shown first. That approach seems better, and sklearn’s way of showing the confusion matrix is, well, confusing. Let’s invert the Python confusion matrix to aid side-by-side comparison.

Precision:

- Dataiku DSS predicted 1,697 people would stay with the company and was 96% correct (vs 87% with Python).

- Dataiku DSS predicted 549 people would leave the company and was 59% correct (vs 46% with Python).

Recall:

- Dataiku DSS predicted 1,637 employees would stay with the company out of 1,860 who actually stayed: 88% (vs 94% with Python)

- Dataiku DSS predicted 326 employees would leave the company out of 386 who actually left: 84% (vs 26% with Python)

F-1 Score is 70% vs 81% with Python. Accuracy is 87% vs 83% with Python.

Overall, we see more balanced predictions across classes. Dataiku optimized several hyperparameters by default during the initial training of the model. For example, Dataiku DSS automatically determined the threshold for Logistic Regression that optimizes the F1-score and applied L2 regularization, providing more robust predictions. In Python, you must fine-tune your model and optimize hyperparameters yourself through trial and error.

Dataiku DSS adjusted threshold to optimize for f1-score

Dataiku DSS also used 5-fold cross-validation and stratified sampling, which improve the reliability and robustness of the model:

5-fold cross-validation is a technique to test how well a model can predict new data.

It works like this:

  1. We split our data into 5 equal parts called “folds.”
  2. We train our model on 4 folds and test it on the remaining fold.
  3. We repeat this process 5 times, each time using a different fold as the test set.
  4. Finally, we average the results to see how well our model performs overall.

Stratified sampling is a technique used in data splitting to ensure that the distribution of classes or categories in the original dataset is preserved in the split subsets.
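The two techniques combined in scikit-learn (synthetic data; Dataiku’s internal setup may differ):

```python
# StratifiedKFold preserves the class ratio in every fold, and
# cross_val_score runs the 5 train/test rounds, returning one score per
# fold. The data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=500), X, y, cv=cv)
print(scores)         # five scores, one per fold
print(scores.mean())  # averaged performance estimate
```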

There are so many useful features of Dataiku DSS to fine-tune your model, improve its performance and, most importantly, explore the explainability of the algorithm that will immensely empower your data storytelling ability!

For now, we can explore what features Dataiku DSS determined as most important in our Logistic Regression:

The five most important features are similar to what our initial Logistic Regression modelling with Python showed (although the ranking differs): salary, work accidents, tenure, number of projects and satisfaction level.

In the next part we will finish our Data and AI journey in a Random Forest. As always, stay tuned!
