Data and AI Journey: Jupyter Notebook vs Dataiku DSS (2). Data Visualization and Feature Selection.

Vladimir Parkov
8 min read · May 17, 2023


Data Storytelling, Kandinsky Style. Generated with Fusion Brain.

In the first part (available here), we took a first look at our dataset and did some data cleaning by removing duplicates. We also identified several outliers in employees’ tenure that could impair the quality of our predictions.

Now we can use the power of Python and Dataiku DSS to understand the relationships between variables in the data.

To sum up, we have a target variable called “left” that we want to predict. “Left” is a Boolean variable with only two possible values: 0 means the employee is still working at the company, and 1 means the employee has left.

The data dictionary

Making sense of relationships between variables

Let’s start by creating a heatmap of correlations. This is easy to make with both platforms:

Correlation heatmap in Jupyter with Python
Correlation heatmap in Dataiku DSS
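For reference, here is a minimal sketch of the Jupyter version, assuming the cleaned data lives in a pandas DataFrame named df:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlations are only defined for numeric columns, so the two
# categorical columns ("salary" and "department") are excluded here.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()
```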

There are only 8 variables in this heatmap because there are two columns with categorical values: “salary” (high, low, medium) and “department” (accounting, HR, sales etc. — 10 departments in total). We will deal with this issue later.

  • The heatmap shows a limited positive correlation between the number of projects, monthly hours, and evaluation scores. It makes sense: you take on more projects, work longer hours, and usually get higher evaluation scores.
  • Interestingly, there is no significant negative correlation between the level of satisfaction and the number of working hours.
  • Employees leaving correlates negatively with their satisfaction level and positively with tenure: those who left tended to be less satisfied and to have spent more, not fewer, years with the company.

Data Visualizations with Python

You can start by making pairwise scatterplots in Python with a single command: sns.pairplot(df). I’ll show just one of these scatterplots: satisfaction level plotted against the latest evaluation score.

Scatterplot in Jupyter

On its own, this picture isn’t very helpful. But here the complexity of Python turns into a strength: if you put in the effort to learn its intricacies, you can build customized visualizations that surface valuable insights.

Let’s see what would happen if we highlight data points for those employees that left the company:
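One way to do this in seaborn, assuming the column names satisfaction_level and last_evaluation from the data dictionary:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Colour each point by the "left" flag so leavers stand out.
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="satisfaction_level",
    y="last_evaluation",
    hue="left",   # 0 = stayed, 1 = left
    alpha=0.4,    # transparency reveals dense clusters
)
plt.title("Satisfaction vs. latest evaluation score")
plt.show()
```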

Flexibility is the other side of complexity with Jupyter

This is truly an “a picture is worth a thousand words” moment!

We immediately see two distinct clusters of those who left:

  • People with less-than-average evaluation scores (0.445 to 0.575) and low satisfaction levels (0.35 to 0.46). No surprises here, though causality is unclear: does low satisfaction drive low scores, do low scores drive dissatisfaction, or does some external factor cause both?
  • People who got very high evaluation scores but have extremely low satisfaction levels. This is more interesting. In addition, we see only a few leavers with both very low satisfaction and very low evaluation scores. Probably, the decision to quit takes some time after receiving low evaluation scores.

We can isolate these groups of people and plot them against other variables. For example, tenure, monthly working hours or salary. Let’s quickly do this:
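Here is a rough sketch of isolating the first cluster in pandas, using the boundaries read off the scatterplot. The column names, including the dataset’s original misspelling average_montly_hours, are assumptions based on the data dictionary:

```python
# Filter leavers inside the low-score / low-satisfaction box.
cluster1 = df[
    df["left"].eq(1)
    & df["last_evaluation"].between(0.445, 0.575)
    & df["satisfaction_level"].between(0.35, 0.46)
]

print(len(cluster1))  # number of leavers in this cluster

# Compare the cluster's averages with the company-wide averages.
cols = ["time_spend_company", "average_montly_hours", "number_project"]
print(cluster1[cols].mean())
print(df[cols].mean())
```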

Cluster 1: Low satisfaction coupled with low evaluation scores

For the first cluster, 835 employees left, and they did so after three years with the company, while working fewer hours than average (144 vs 200 per month) and on fewer projects than average (2 vs 3.8).

These people found themselves in a job that wasn’t the right fit. It’s common to take a few years to find your groove in a company before feeling restless and looking for new opportunities; the data shows that, on average, these employees spent three years at the company before eventually leaving.

Cluster 2: Low satisfaction coupled with high evaluation scores

For the second cluster, 497 employees left, and they did so after four years with the company, while working more hours than average (277 vs 200 per month) and on more projects than average (6 vs 3.8).

These poor lads work tirelessly and achieve high evaluation scores, yet remain stuck in the same position year after year. This frustrating experience can lead to burnout and, ultimately, a decision to leave after four years: going without a promotion despite four years of hard work may well seal the decision.

The 3–4 year mark is crucial in an employee’s journey with a company. By monitoring the number of working hours and paying attention to warning signs, companies can identify potential issues and take steps to address them. One area worth investigating is the company’s promotion policies, specifically at the four-year mark.

You have the first significant insights you can share with business leaders to make data-driven decisions!

Python is the Swiss Army Knife for data.

But you must familiarize yourself with its tools and functions: read the documentation, explore Stack Overflow, or ask ChatGPT when you can describe what you are trying to achieve but struggle with the syntax.

For example, we can also create a customized boxplot and histogram to compare satisfaction levels and attrition across salary bands:
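A possible sketch of such a chart pair (the exact styling of the originals may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot: satisfaction per salary band, split by whether the employee left.
sns.boxplot(data=df, x="salary", y="satisfaction_level", hue="left",
            order=["low", "medium", "high"], ax=ax1)
ax1.set_title("Satisfaction level by salary")

# Histogram: head-count of leavers in each salary band.
sns.countplot(data=df[df["left"] == 1], x="salary",
              order=["low", "medium", "high"], ax=ax2)
ax2.set_title("Employees who left, by salary")

plt.tight_layout()
plt.show()
```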

For low and medium salaries, both the number of employees who left and their satisfaction levels look much the same. The trend changes with high salaries: fewer highly paid people left, but those who did had significantly lower satisfaction levels. Money can’t buy work satisfaction.

Data Visualizations with Dataiku DSS

In Dataiku DSS, the process of creating data visualizations is virtually effortless. Not a single line of code is needed to get the insights.

Just open the dataset, go to “Charts”, and use drag-and-drop to generate scatterplots for any variables.

Testing it with satisfaction level plotted against the latest evaluation score, we quickly get the same picture:

Then, by applying a filter, you can quickly show only the data points corresponding to employees who left the company:

Using filters again, we can quickly identify the cluster of 835 employees with low satisfaction coupled with low evaluation scores:

With Dataiku DSS, you can analyze data visually in seconds using pairwise visualizations and statistical tests.

Moreover, you can turn any visualization you have made into a dashboard and immediately start building your data storyline!

In Dataiku DSS, you can immediately start building your interactive dashboards

Choosing AI models

We want to predict whether an employee will leave the company. The outcome can be either 1 (employee left) or 0 (employee stayed with the company).

This prediction task is called binary classification. We will consider two families of machine learning models:

  1. Logistic regression
  2. Tree-based models (random forest, decision tree)

Let’s start with logistic regression. It is sensitive to outliers and assumes no multicollinearity among the independent variables (i.e., they shouldn’t be correlated with each other).

Feature Selection, Engineering and Extraction

We have three potential problems to solve before training the model:

  1. There is some correlation between independent variables: for example, between monthly working hours and the number of projects. We’ll explore this further.
  2. Our dataset has outliers in the “time_spend_company” — 824 rows identified previously.
  3. There are also two categorical variables (“salary” and “department”) that we will need to convert to numerical.

Identifying Multicollinearity

We can use the variance inflation factor (VIF) to measure how much an independent variable’s variance is inflated by its correlation with the other independent variables.

In Python, we can do it with a little bit of coding:
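A minimal sketch using statsmodels; whether the original analysis added an intercept term is an assumption here:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF is computed on the numeric predictors only; the target "left" is
# excluded. add_constant adds the intercept column statsmodels expects.
X = add_constant(df.select_dtypes(include="number").drop(columns=["left"]))

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)
```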

A rule of thumb for interpreting the variance inflation factor:

  • VIF = 1: not correlated
  • VIF between 1 and 5: moderately correlated
  • VIF greater than 5: highly correlated

We may have a problem here as most of the variables are correlated. For now, let’s keep all the variables and see whether the quality of logistic regression prediction will suffer.

Removing outliers

In Python, we create a new data frame that drops rows whose “time_spend_company” value lies more than 1.5 interquartile ranges beyond the quartiles. In practice, that removes people working less than 1.5 years (none) or more than 5.5 years (824 rows) with the company.
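A sketch of the standard 1.5 × IQR rule; the 1.5- and 5.5-year cut-offs follow from the quartiles of this dataset:

```python
# Quartiles and interquartile range for tenure.
q1 = df["time_spend_company"].quantile(0.25)
q3 = df["time_spend_company"].quantile(0.75)
iqr = q3 - q1

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (1.5 to 5.5 years here).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers = df[df["time_spend_company"].between(lower, upper)]

print(len(df), "->", len(df_no_outliers))
```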

As usual, removing outliers is way easier with Dataiku DSS. You just need to open the dataset, click on the “time_spend_company” column, and choose the criteria for outliers that need removal.

Removing these nasty outliers has never been easier!
The resulting outliers-free dataset

The dataset keeps shrinking! Out of the original 14,999 rows, we first removed 3,008 duplicates and now removed 824 outliers. 11,167 rows remain.

Encoding the categorical variables

In Python, one-hot encoding is done via the pandas pd.get_dummies() function. It converts categorical variables into binary features, with each possible category getting its own column of 0s and 1s.

One-hot encoding in Python
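Something along these lines, with df_no_outliers standing for the cleaned frame from the previous step:

```python
import pandas as pd

# Each category in "salary" and "department" becomes its own 0/1 column,
# e.g. salary_low, salary_medium, salary_high, department_sales, ...
df_encoded = pd.get_dummies(df_no_outliers,
                            columns=["salary", "department"],
                            dtype=int)
print(df_encoded.columns.tolist())
```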

As a result, the categorical column “salary” is replaced by three numerical columns (salary_high, salary_medium and salary_low), where each row holds either 0 or 1 to reflect the salary level an employee receives. Likewise, the “department” column is replaced by ten columns, one per department.

What about Dataiku?

Dataiku DSS takes care of the nitty-gritty details of model training, so you don’t have to. You don’t even need to encode text values explicitly!

The platform handles this under the hood, allowing you to focus on the bigger picture and derive insights from your data.

Now we are ready to build our first model! Follow me to the next part!
