Data Science Introduction For People Who Know Just Enough R or SQL To Get By

Are European Work Environments Really Women-friendly?

Can you guess what this spreadsheet is about?

Actually, this is data on billionaires all over the world downloaded from Forbes.com.

You may have a few questions about this data. For example,

1. Which country has the greatest number of billionaires?
2. How did they become billionaires?
3. Are more billionaires men or women?

Moreover, according to a number of articles I recently read in Japanese, European countries tend to have more women-friendly work environments. If this is true, is it also true that Europe has many female billionaires?

I worked at Exploratory in the past, which has a product called Exploratory Desktop. It provides an interactive and reproducible data wrangling and analysis experience powered by R. So why not use Exploratory to analyze this data and see whether the work environment in Europe is truly women-friendly? Later on, I’ll share my thoughts on making the world better for working women based on this analysis.

Target Audience: Anyone who wants to learn Data Science

Before I begin, I’d like to clarify who my target audience is. People who might benefit from this tutorial include:

- Data Scientists who know basic R or Python.
- Data Analysts who are frustrated with Excel or Tableau.
- Beginning developers who have completed basic SQL tutorials online.
- Beginners who want to learn R for their research or statistical needs but don’t know where to start.

Anyway, let’s get started!

Preparing Data

To demonstrate, I’m going to use data on billionaires from Forbes.com. I’ve shared it in EDF (Exploratory Data Format) here so that you can quickly import it with the ‘Import Exploratory Data’ option. If you are interested in the original data, you can get it from here as well.

Overview: Does Europe have many women billionaires?

Here are the steps I’m going to take to find an answer to my question.

STEP 1: Prepare the project 
STEP 2: Calculate the ratio of billionaires for each country
STEP 3: Remove outliers
STEP 4: Are European working environments really women-friendly?
STEP 5: One more thing
STEP 6: Sharing the chart in a reproducible and collaborative way
Finally: Making the World Better for Working Women

STEP 1: Prepare the Project

First of all, you can create projects from here.

You can import the data. The dataset is here.

After typing the data frame name, click the ‘Save’ button to import the data into R. The data will show up in Summary View like below.

Summary View

Can you see how different the data looks? Thanks to Summary View, we can get a quick overview of the data, something we can’t do in Excel just by importing it. For example, the summary for the citizenship column shows the number of billionaires for each country. As you can see, the United States has the most billionaires in the world. The selfmade column shows how they became billionaires. And from the gender column, you can see at a glance that there are more male billionaires than female billionaires.

However, we can’t analyze the data precisely the way it is now. Why not?

Let’s look at the year column and name column.

The year column shows data from 1996, 2000, and 2014. Abigail Johnson has “(3)” next to her name in the name column, which means she was counted three times: once each in 1996, 2000, and 2014. To fix this, we should remove the rows that are not from 2014.

How does this compare with Excel’s filter function? Exploratory lets us either choose commands from a menu or type SQL-like code. A grammar should let us express things the way we think about them, so why not use a more natural SQL-like grammar instead of the syntax of Excel or Tableau?

Moreover, you can select the command from the column header dropdown menu.

Let’s select ‘Filter’ from the column header dropdown menu, which will generate a command like the one below.

Let’s specify the year as 2014 and press the ‘Run’ button.

filter(year == "2014")

Now Summary View shows only the 2014 data, and each name appears once instead of three times.
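If you would like to see what this step does outside the UI, here is a minimal sketch in plain R with dplyr. The toy rows, the placeholder second person, and the numeric year column are my assumptions for illustration, not the real Forbes data.

```r
library(dplyr)

# Toy stand-in for the billionaires data; rows are made up for illustration.
billionaires <- tibble(
  name = c("Abigail Johnson", "Abigail Johnson", "Abigail Johnson", "Example Person"),
  year = c(1996, 2000, 2014, 2014)
)

# Before filtering, a person listed in several years is counted several times.
names_before <- billionaires %>% count(name)

# Keep only the 2014 rows, mirroring the filter step above.
billionaires_2014 <- billionaires %>% filter(year == 2014)
names_after <- billionaires_2014 %>% count(name)

print(names_before)
print(names_after)
```

Comparing the two printed tables shows the same change you see in Summary View: the count next to a repeated name drops to one.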

We can quickly visualize the data under Chart view.

Assign ‘citizenship’ to the X-axis and ‘gender’ to Color. Blue represents the number of women and orange represents the number of men.

Among the world’s billionaires, men overwhelmingly outnumber women. With just these steps, you can easily understand what the data is about.
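The numbers behind a chart like this can be sketched with a single dplyr count. This is only an illustration: the rows are invented, and I am assuming the gender column holds the values "male" and "female".

```r
library(dplyr)

# Toy 2014 data; the values are invented for illustration.
billionaires_2014 <- tibble(
  citizenship = c("United States", "United States", "United States", "Germany", "Germany"),
  gender = c("male", "male", "female", "male", "female")
)

# One row per country and gender; n is the height of each colored bar segment.
by_country_gender <- billionaires_2014 %>%
  count(citizenship, gender)

print(by_country_gender)
```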

Reflections

  1. Get an overview of the data easily and intuitively by just importing.
  2. Recognize the issues in the data.
  3. Clean up the data.

That’s it for this step! Not too bad, eh?

STEP 2: Calculate the ratio of billionaires for each country

However, to answer the question “Does Europe have many women billionaires?”, the absolute number of men or women is not enough. What we need is the proportion of women among each country’s billionaires. So let’s calculate the ratio of female billionaires for each country.

Click the gear icon next to ‘Y Axis’ to open the ‘Window (Table) Calculation’ dialog, then select ‘% of Total’ from the list.

Now we can visualize the rate.

You can see that Angola has a higher rate than the others. It seems that something is wrong.
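To see why a tiny country can dominate a percentage chart, here is a rough dplyr sketch of a per-country ‘% of Total’. I am not claiming this is how Exploratory computes it internally. The toy rows and the pct_of_total column name are my own assumptions.

```r
library(dplyr)

# Toy data: four billionaires in one country and a single one in another.
billionaires_2014 <- tibble(
  citizenship = c(rep("United States", 4), "Angola"),
  gender = c("male", "male", "male", "female", "female")
)

gender_share <- billionaires_2014 %>%
  count(citizenship, gender) %>%              # rows per country and gender
  group_by(citizenship) %>%
  mutate(pct_of_total = 100 * n / sum(n)) %>% # share within each country
  ungroup()

print(gender_share)
```

With only one row for Angola, that single person makes up 100% of the country’s total, which is exactly the kind of misleading spike we see in the chart.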

STEP 3: Remove outliers

Let’s use ‘Create Branch’ and count the billionaires in each country to investigate. If you come from a programming background, this ‘Branch’ is similar to a branch in a source control management system like Git. Please take a look at this post for details.

Then, we can add any data wrangling step that is specific to this branch and quickly investigate this situation.

We want to count the billionaires for each citizenship, including Angola, by using the ‘count’ command below. You can construct this command from the column header menu by clicking ‘Count’.

Here is the final command we want to run.

count(citizenship)

Let’s look at Angola to demystify that situation. Here’s how.

There is only one billionaire in Angola, which is not enough data to give a useful ratio. Let’s filter out countries with too few billionaires, using the count column n.

Here is the final command we want to run.

filter(n > 5)

Now the table includes only countries with more than five billionaires.
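Put together, the two commands in this branch look roughly like the sketch below. The toy counts are invented. The only thing I rely on is that count() adds a column named n, which is what the filter above refers to.

```r
library(dplyr)

# Toy 2014 data with invented counts per country.
billionaires_2014 <- tibble(
  citizenship = c(rep("United States", 8), rep("Germany", 6), rep("France", 5), "Angola")
)

large_countries <- billionaires_2014 %>%
  count(citizenship) %>%  # adds a column n with the number of rows per country
  filter(n > 5)           # strictly more than five, so a country with exactly five is dropped

print(large_countries)
```

Note that the comparison is strict: in this toy data, the country with exactly five rows is removed along with Angola.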

With the semi_join command, we can return all rows from the Main branch (the current data frame) that have matching values in the Remove_outliers branch (the target data frame), keeping only the columns from the current data frame.

Now the citizenship column in the Main branch includes only countries with more than five billionaires in total.
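Here is a small sketch of that semi_join in plain dplyr. The data frame names main and remove_outliers stand in for the two branches, and the rows are invented, so treat it as an illustration rather than the exact step Exploratory generates.

```r
library(dplyr)

# Stand-in for the Main branch (invented rows).
main <- tibble(
  name = paste("Person", 1:8),
  citizenship = c(rep("United States", 6), "Angola", "France"),
  gender = c("male", "male", "male", "male", "female", "female", "female", "male")
)

# Stand-in for the Remove_outliers branch: countries with more than five rows.
remove_outliers <- main %>%
  count(citizenship) %>%
  filter(n > 5)

# Keep rows of main whose citizenship appears in remove_outliers.
# Unlike inner_join, semi_join does not add the n column to the result.
main_filtered <- main %>%
  semi_join(remove_outliers, by = "citizenship")

print(main_filtered)
```

The result keeps one row per person, which is why the chart can still break the data down by gender afterwards.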

STEP 4: Are European Work Environments Women-friendly?

Let’s go to Chart view to understand the data more intuitively by visualization.

As you can see, the European countries with a high ratio of female billionaires include Switzerland, the Netherlands, Germany, France, and Denmark.

Do you remember me mentioning the ‘selfmade’ column that describes how a person became a billionaire? By using this column, we can differentiate between becoming a billionaire by inheritance and becoming a self-made billionaire. To do this, I want to return to the filter step at the beginning.

Uh-oh, when I go back to the filter step, the chart is different from the previous one. The chart now follows the first filter step, which has a blue border above, and at that step the semi_join command has not been applied yet, so the table still includes countries with five or fewer billionaires. The Pin button is here for exactly this problem.

When you click the Pin button, the last semi_join step turns blue, and the chart stays fixed to that blue step. Then, when you go back to the first filter step with the chart pinned, it looks like this:

Although we will change the first step now, the change will be automatically reflected in the chart because I used the Pin button to pin the chart to the blue step at the end. Now, let’s filter only billionaires by inheritance.

filter(year == "2014" & selfmade == "inherited")

As you can see, European countries like France, Germany, Spain, Sweden, and Switzerland have a large number of female billionaires by inheritance.

Also, let’s filter only by self-made billionaires.

filter(year == "2014" & selfmade == "self-made")

Wow! The totals for many countries decrease suddenly. This is interesting! Only the United States and Switzerland have a large number of female self-made billionaires as you can see.
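If you want to compare the two groups side by side instead of switching the filter back and forth, one possible alternative is to group by the selfmade column. This is a different technique from the two filters above, shown only as a sketch. The toy rows, the gender values, and the pct_female column name are my assumptions.

```r
library(dplyr)

# Toy 2014 data; all rows are invented for illustration.
billionaires_2014 <- tibble(
  citizenship = c("Germany", "Germany", "Germany", "Germany",
                  "United States", "United States", "United States", "United States"),
  gender = c("female", "female", "male", "male", "female", "male", "male", "male"),
  selfmade = c("inherited", "inherited", "inherited", "self-made",
               "self-made", "self-made", "self-made", "inherited")
)

# Share of women within each country and wealth origin.
female_share <- billionaires_2014 %>%
  group_by(citizenship, selfmade) %>%
  summarise(pct_female = 100 * mean(gender == "female"), .groups = "drop")

print(female_share)
```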

We’re Done!

I hope that was easy to follow! In conclusion, Europe does have many female billionaires compared to other continents. But the key insight is that many of the female billionaires in Europe inherited their wealth, and fewer are self-made. That’s why it may not necessarily be the case that European work environments are more women-friendly.

STEP 5: One more thing

This tutorial is almost done, but I want to show you one more thing that’s interesting from a technical standpoint. So far we have looked at the gender ratio. Why don’t we look at the rate per industry next? We can see this by changing only the column assigned to Color.

Let’s implement this!

Although this chart is pinned to the last blue step, Exploratory automatically recalculates and redraws the chart when you update the grouping step. This feature is built on top of our Exploratory DAG (directed acyclic graph) engine. Please take a look at this post for details.

STEP 6: Sharing the Chart in a Reproducible and Collaborative Way

Once you’ve got a chart you want to share with others, you can click on the ‘Share’ button.

Inside the dialog, you can enter the Title, Description, and Data Source information. You can also keep the ‘Share with data’ checkbox checked if you would like to. This will publish your chart together with the data preparation steps.

Dialog

Once you click the ‘Share’ button, you will see a URL being generated. You can simply click the ‘View Shared Chart’ link to open the published page.

Dialog

Your chart is now published at Exploratory Cloud.

Viz view

Now, from here, your audience can go to Data view to see not only the underlying data but also the steps it took to prepare the data.

Data view

Or, if your audience prefers, they can see the ‘easy-to-read’ dplyr chain of the commands in R script view.

R script view

The real power of ‘Sharing Chart with Data and Steps’ becomes obvious when your audience downloads and imports it into their Exploratory Desktop environment.

And you can share your interactive chart as well. Very simply, you just copy and paste the chart URL like below.

R script view

You can do the same thing in Slack, WordPress, Twitter, Facebook, and Tumblr, too. I have a question for you.

Can you imagine trying to recreate this process from the beginning by using only Excel?

Making the World Better for Working Women

How about Silicon Valley, not Europe? I’m here in Silicon Valley now, with its highly mobile human resources. It is famous for its merit system. However, if this is true, how can you explain the data below?

1. 85% of high-tech jobs in Bay Area tech companies are held by men.
2. 89% of Bay Area executives are men (National average is 84%).
3. 93% of Bay Area CEOs are men.
4. 96% of Bay Area VCs are men.

Data source

If you believe that the equation Silicon Valley = The perfect merit system holds true, we have no choice but to conclude that merit of men > merit of women from this data. But, in fact, that is definitely false. Silicon Valley is a male-dominated society.

#codedocumentary screening — why there are not many female programmers

Even in Silicon Valley, women have difficulty getting promoted or returning to their jobs after giving birth and raising their children, as you can see in the data.

There are many groups like She++ that work to increase the number of female engineers in Silicon Valley. I took part in their conference at Stanford. She++ thinks learning to code is practical because you can work remotely as a programmer too, while raising a kid.

It’s challenging to be a working woman in Japan, my home country, too. Japan should learn from She++.

Many people who believe in the perfect merit system are not aware of inequalities in society like the gender gap.

Do not forget that there is no such thing as a completely meritocratic society. There are many misconceptions all over the world. The most important thing is not fraud but authenticity. If you don’t believe me, I recommend you look at the data. I’m sure you can get a new way of thinking by looking at data. Then, I hope Exploratory will be your best friend! ;)

If you’re stuck

I’m always happy to hear your feedback through any of the following channels.

- Comment in the comment box at the very bottom of this page.
- Email me at hidetaka.koh@gmail.com.
- Tweet me at @SoccerKinki
- File an issue in this repo.

Your turn

You can sign up for access from this page. Let’s enjoy data analysis!

Thanks for reading. I’d appreciate a click on the “Recommend” button at the end if you liked this article.