Data and AI Journey: Jupyter Notebook vs Dataiku DSS (1).

Vladimir Parkov
8 min read · May 12, 2023

A nightingale against a backdrop of Jupiter generated with BlueWillow

The Call to Adventure: Google Advanced Data Analytics Professional Certificate

In a world where technology is rapidly advancing, and AI breakthroughs are constantly being made, data literacy skills are more critical than ever before. You don’t need to be a data expert, but you must understand how data is extracted, transformed, and analyzed to drive better business outcomes. That’s why I completed the Google Advanced Data Analytics Professional Certificate. It’s an excellent introductory course that teaches you how to use Python to conduct statistical analysis, build models for regression and classification, and even delve into more advanced machine learning models like random forest and XGBoost.

Most importantly, this course teaches you to do more than just use the tools for the sake of it. It teaches you to tell stories with data.

Telling stories with data is the key to unlocking the power of data.

If you’re interested in building your data literacy skills but don’t know where to start, I highly recommend the Google Advanced Data Analytics course on Coursera. Or, if you want to dip your toe in the water first, you can start with the Google Data Analytics course, which introduces you to SQL, R, and Tableau to help you build data storytelling skills.

I’m not a coding expert, but I do have a knack for making sense of things and seeing the big picture in business. I’m just like you, a regular non-technical user accustomed to working with spreadsheets and pivot tables. So if I can do it, I know you can too!

Data and AI for the Masses: the Rise of the Data-ocracy!

Democratising data and AI is not just a buzzword; it is the key to building a better world through better data-driven decisions. The road to unlocking the value from data is often bumpy, and many businesses struggle with unclear objectives and obscure project management processes.

A whopping 85% of AI projects fail due to this very reason.

The gap between data professionals and business decision-makers is still significant, leaving non-technical users feeling overwhelmed and frustrated when working with data programming languages like Python and SQL. That’s where data science platforms like Dataiku DSS come in.

I am a fan of Dataiku DSS. I used it for the last several months in my pet projects, wrangling data sets from Kaggle. I am amazed at how intuitive and easy it is to use this platform and build machine-learning models in minutes.

Dataiku DSS is a game-changer, providing an intuitive user interface and drag-and-drop features that make it possible for business users and subject matter experts to work with data without specialized technical knowledge.

If you want to try Dataiku DSS for yourself, the Dataiku Academy is a great place to start. You can learn about Data Analytics and Machine Learning in an easy-to-grasp and visual manner. Data science platforms like Dataiku DSS are critical for bringing data literacy skills into the world at scale.

The Data Journey: From Messy Data to Valuable Insights

I want to share my practical experience of embarking on a Data Journey.

The last course in the Google Advanced Data Analytics Certificate is a capstone project. You are given a CSV file containing data on 14,999 employees: employee-reported job satisfaction, average number of hours worked, tenure, salary, number of projects, and so on. Most importantly, it records whether or not the employee left the company. You can take a look at the dataset in more detail on Kaggle.

The goal of the capstone project is to build a model that predicts whether or not an employee will leave the company — a very clear and important business goal. If you can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving and reduce attrition.

So I decided to complete this capstone project twice. First, I used Python with Jupyter Notebook, as the Google Advanced Data Analytics course proposed.

Second, I took the road less travelled and tried out Dataiku DSS to complete this project again.

Let’s roll up our sleeves and dive in!

If You Gaze Into the Data, the Data Gazes Also Into You

After contextualizing the business scenario and understanding the business problem, you will start exploratory data analysis (EDA) to get a first look at the data.

First look at the data: Python with Jupyter
First look at the data with Dataiku DSS

In the capstone project, we need to predict whether the employee will likely leave the company based on several other variables. Some variables are more valuable than others; others may not be useful at all.

This is the stage that could already bring critical insights.

Do you have all the needed data? Are there missing pieces in your puzzle?

You need to focus on what variables you have but also identify which variables you might be missing. Are there ways to enrich your dataset with other sources of data? For example, the dataset has information about the department but no information about the job grade or role in that department. Predicting if an employee with a higher job grade quits is more valuable than predicting if an entry-level employee leaves.

One way I propose to frame the process is from the value chain perspective: are all the critical steps of a business process measured and captured by data? What are the gaps? Does the data tell the whole story end-to-end?

Don’t trust that your data is complete, unbiased, and reflects the actual state of things in the business. Consult subject matter experts. Toggle your critical analysis switch to the max.

Many companies now proudly say that they are “data-led”, but being led by bad data is the worst thing to happen to your company.

Now on to the data wrangling!

Data Exploration (initial data analysis and data cleaning)

First, you need to get the descriptive statistics of the variables: how many values there are, how these values are distributed, whether any values are missing, and so on.

In Python, you need to run several consecutive commands, such as hr.shape, hr.info(), and hr.describe(), to get snapshots of the statistics (where “hr” is the name of the original dataframe).
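As a minimal sketch of that first look, the snippet below runs those commands on a tiny stand-in dataframe. The column names other than time_spend_company, and all of the values, are made up for illustration and are not the capstone data.

```python
import pandas as pd

# Tiny stand-in for the capstone dataframe; columns and values are illustrative only.
hr = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, None],
    "number_project": [2, 5, 7, 5, 2],
    "time_spend_company": [3, 6, 4, 5, 3],
    "left": [1, 1, 1, 1, 0],
})

print(hr.shape)         # (rows, columns); shape is an attribute, so no parentheses
hr.info()               # column types and non-null counts
print(hr.describe())    # count, mean, std, min, quartiles, and max per numeric column
print(hr.isna().sum())  # number of missing values in each column
```

Note that hr.describe() needs the parentheses; without them, Jupyter shows the method object instead of the statistics.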

In Dataiku DSS, the process is much easier: you just go to the “Statistics” tab for the dataset, choose which variables and which statistics you need, and that’s it.

Descriptive statistics in Python with Jupyter
Descriptive statistics in Dataiku DSS

There are some significant differences between using Python with Jupyter and Dataiku DSS for data analysis:

Python with Jupyter:

  • You need to import the relevant packages before working with them: for example, numpy, which is used for arrays and mathematical operations in Python, or pandas, which is used for data manipulation and analysis.
  • You need to run separate commands to load and display your dataset and its transformations. To see the effect of your changes, you need to rerun the display command on the transformed dataset.
  • The learning curve for Python is not steep, but it does require an understanding of syntax, data structures, and relevant libraries.

Dataiku DSS:

  • You can easily create a new project, upload your dataset, and get a familiar spreadsheet view of your data.
  • Changes to the dataset are made through “visual recipes” added to the visual flow of the project. The changes are interactive: you can see how the transformed dataset looks as soon as you change it.
  • Dataiku DSS is intuitive and user-friendly, with a much gentler learning curve than Python with Jupyter.

Removing duplicates

Removing duplicates is easy with both platforms. In Jupyter, you drop the duplicates and save the resulting dataframe in a new variable by running the command hr_no_duplicates = hr.drop_duplicates().

In Dataiku DSS, you run the “Distinct” recipe that creates a new dataset with distinct values.

The results are the same: 3,008 duplicates are found, leaving 11,991 distinct rows.
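On the Python side, the duplicate count can be checked before dropping anything. Here is a small sketch on a made-up five-row dataframe (not the capstone data), assuming that a duplicate means a row identical to an earlier row in every column:

```python
import pandas as pd

# Made-up rows for illustration; rows 1 and 4 repeat the rows directly above them.
hr = pd.DataFrame({
    "time_spend_company": [3, 3, 6, 4, 4],
    "left": [1, 1, 0, 1, 1],
})

n_duplicates = hr.duplicated().sum()     # rows that repeat an earlier row
hr_no_duplicates = hr.drop_duplicates()  # keeps the first occurrence of each row

print(f"{n_duplicates} duplicates, {len(hr_no_duplicates)} distinct rows remaining")
```

The two numbers always add up to the original row count, which is a quick sanity check: 3,008 + 11,991 = 14,999 for the capstone dataset.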

Detecting outliers

Outliers in your data can be a real headache, but detecting them is crucial for accurate analysis. This is where Dataiku DSS stands out from the crowd, leaving Jupyter in the dust. While Python does have libraries for creating visualizations, the coding required can be extensive, and there are no pre-built templates to help you complete standard charts and graphs.

For instance, we want to detect outliers in the ‘time_spend_company’ column, which shows how long people have worked for the company.

In Python, this is where I struggled for the first time: I had to learn more about seaborn and find the right parameters to get a decent-looking boxplot. The plot isn’t interactive either, so you must write several more lines of code to compute the percentiles and interquartile range and count the outliers.
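For reference, the percentile and interquartile-range computation can be sketched roughly as follows. It assumes the common 1.5 × IQR rule, and the tenure values are invented so that the upper limit works out to 5.5; they are not the real capstone data.

```python
import pandas as pd

# Invented tenure values (in years), chosen for illustration only.
tenure = pd.Series(
    [2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6, 10],
    name="time_spend_company",
)

q1 = tenure.quantile(0.25)  # 25th percentile
q3 = tenure.quantile(0.75)  # 75th percentile
iqr = q3 - q1               # interquartile range

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = tenure[(tenure < lower_limit) | (tenure > upper_limit)]
print(f"limits: {lower_limit} to {upper_limit}, outliers: {len(outliers)}")
```

A static boxplot of the same column can then be drawn with seaborn’s boxplot function; its whiskers follow the same 1.5 × IQR convention by default.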

With Dataiku DSS, however, detecting outliers is a breeze. Just click on the column, select “Analyze,” and voila — you’ll have a gorgeous, interactive boxplot with all the relevant data. It’s that simple.

Creating even a simple boxplot in Python can be tedious
Dataiku DSS makes outlier detection super easy, barely an inconvenience!

We immediately see that data points exceeding 5.5 years are outliers, with 824 “outlying” employees.

Some ML algorithms are more sensitive to outliers than others. For example, a linear regression model can be pulled toward the outliers and then perform worse when making predictions on new data.

In this case, you either remove the outliers (and potentially lose precious insights) or preserve them by scaling the data. One type of scaling you can do is standardization, which transforms the values within each feature so that every feature has a mean of zero and a standard deviation of one.
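To make standardization concrete, here is a minimal sketch in plain pandas. The second column name and all the values are hypothetical, and using the population standard deviation (ddof=0) is my assumption; it is the same convention scikit-learn’s StandardScaler uses.

```python
import pandas as pd

# Hypothetical features for illustration; not the capstone data.
features = pd.DataFrame({
    "time_spend_company": [2, 3, 4, 5, 10],
    "average_monthly_hours": [150, 160, 200, 250, 310],
})

# z = (x - mean) / std, computed separately for each column.
# ddof=0 means the population standard deviation (an assumption, see above).
standardized = (features - features.mean()) / features.std(ddof=0)

print(standardized.mean().round(6))       # close to 0 for every column
print(standardized.std(ddof=0).round(6))  # close to 1 for every column
```

The outlying value of 10 years is still the largest value after scaling; standardization changes the units, not the ordering of the rows.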

Tree-based models are resilient to outliers as they continuously split the data into smaller regions, and outliers may fall into their own region, so the prediction quality won’t suffer.

For now, let’s keep the outliers in.

In the next part, we will continue our Data journey and start by looking into feature selection and engineering to decide what variables and in what form we will use for our machine learning models.
