Published in Geek Culture

How Much Do You Need to Know to Enter the Data Science Field?

Photo by Adli Wahid on Unsplash

Introduction

Over the past few years, the hype around data science has kept growing. Many people are trying to enter the field, men and women alike. However, data science is one of the technology fields where men usually dominate the workforce. If you are a woman, do you have similar opportunities to get into the data science field?

In addition, the field has various roles, e.g. Data Scientist, Data Analyst, Data Engineer, and Machine Learning (ML) Engineer. Have you wondered what skills are required to become one of them? And if you were to master most of the skills needed for all four, would you earn more?

To satisfy our curiosity, this article sets out to answer the following questions:

  • How promising is the data field for women compared to men?
  • What are the skills needed to become a data scientist, data engineer, data analyst, or ML engineer?
  • By becoming a Full Stack Data Professional, do you earn more?

Contents

After the introduction above, the remaining chapters are listed below.

  • Preparing Data
  • Question 1
  • Question 2
  • Question 3
  • Conclusions

Preparing Data

To answer the questions, we can use the latest survey results released by Kaggle, i.e. the “2020 Kaggle Machine Learning & Data Science Survey”. The survey sets out to be industry-wide and to present a truly comprehensive view of the state of data science and machine learning. You can find the dataset here or download it from its website here.

Besides, to understand the data better, it is recommended to read the supplementary data, which contains two PDF files explaining the questions asked of the respondents and the methodology used in the survey. From the methodology file, we can infer that empty values do not mean incomplete or corrupt data. Rather, they mean the respondents indeed could not answer the given questions.

If we peek at the dataset in a spreadsheet, we can see that its structure looks something like the picture below.

Question 1: “How promising is the data field for women compared to men?”

To answer this question, below are the aspects of interest I think we need:

  • Age
  • Education
  • Role
  • Experience
  • Salary

We will compare women and men on each of the aspects above.

Age

We will also calculate the ratio for women and for men separately. This ratio is the proportion of each group within a single gender’s data.
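As a rough illustration of this ratio, the sketch below computes the share of each age group within one gender's respondents. The sample rows and the function name `within_gender_ratio` are made up for illustration; they are not taken from the survey or from the original analysis code.

```python
from collections import Counter, defaultdict

# Hypothetical (gender, age group) pairs standing in for survey rows.
rows = [
    ("Woman", "18-21"), ("Woman", "18-21"), ("Woman", "25-29"), ("Woman", "30-34"),
    ("Man", "18-21"), ("Man", "25-29"), ("Man", "25-29"), ("Man", "30-34"),
    ("Man", "30-34"), ("Man", "30-34"), ("Man", "35-39"), ("Man", "35-39"),
]


def within_gender_ratio(rows):
    """Return {gender: {age_group: share of that gender's respondents}}."""
    counts = defaultdict(Counter)
    for gender, age in rows:
        counts[gender][age] += 1

    ratios = {}
    for gender, by_age in counts.items():
        total = sum(by_age.values())
        # Divide by the gender's own total, not by the overall total.
        ratios[gender] = {age: n / total for age, n in by_age.items()}
    return ratios


ratios = within_gender_ratio(rows)
for gender, by_age in ratios.items():
    print(gender, by_age)
```

In this toy sample there are fewer women than men in the 18-21 group by count only once the full data is larger, but the point of the ratio is visible already: the 18-21 share is larger among women than among men. With pandas, a similar table could likely be produced with `pd.crosstab(..., normalize="index")`.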

To make this finding clearer, it is better to draw the result as a pyramid plot that compares the data for women and men.

From the chart of counts above, we can see that, if we focus only on the counts, men dominate every age range. This is common in any field of information technology, including data science. However, if we look at the data from a different angle and calculate the ratio for each age group, we find something interesting, as shown by the second chart.

Those ratios show the proportion of each age group out of the total for a given gender. Even though the counts for women are lower than those for men, the proportions tell a different story. The highest proportions for women fall in the 18–21, 22–24, and 25–29 age groups, and all three exceed the corresponding proportions for men. From this finding, we could infer that women, young women in particular, are gradually becoming more interested in IT fields such as data science.

Education

We’ve found that, judging by age, young women are gradually becoming interested in the data science field. But does the same hold for education? We’re going to repeat what we did earlier with the age data.

Again, if we focus only on the counts, it’s obvious that men win the comparison. But the ratios tell a different story. For both women and men, a Master’s degree is the most common highest level of education.

To make this more interesting, let’s check whether the women whose highest degree is a Master’s are mostly young.
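One way to run this kind of check is to filter the respondents and count the age groups that remain. The records, the field names, and the helper `age_counts` below are hypothetical stand-ins, not the survey's actual columns or the original code.

```python
from collections import Counter

# Hypothetical respondent records; values are illustrative only.
respondents = [
    {"gender": "Woman", "education": "Master's degree", "age": "22-24"},
    {"gender": "Woman", "education": "Master's degree", "age": "25-29"},
    {"gender": "Woman", "education": "Master's degree", "age": "22-24"},
    {"gender": "Woman", "education": "Bachelor's degree", "age": "18-21"},
    {"gender": "Man", "education": "Master's degree", "age": "30-34"},
    {"gender": "Woman", "education": "Master's degree", "age": "40-44"},
]


def age_counts(respondents, gender, education):
    """Count age groups among respondents matching gender and education.

    The result is a list of (age_group, count) pairs, most common first.
    """
    matching = (
        r["age"]
        for r in respondents
        if r["gender"] == gender and r["education"] == education
    )
    return Counter(matching).most_common()


result = age_counts(respondents, "Woman", "Master's degree")
print(result)
```

The same helper can be reused for the later checks in this question by swapping the filter fields, for example role or experience instead of education.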

Great. Among women whose highest degree is a Master’s, the young age groups are the largest, as shown by the table above. This is further evidence that young women are progressively becoming more interested in diving into the field.

Role

Let’s do the same for the role data.

As usual, the count plot shows the same thing. The ratio plot, however, really strengthens our hypothesis that young women are gradually becoming interested in the data science field. It indicates that a huge number of aspiring data scientists are currently still students, who are usually young women. To check that these students are indeed young, let’s explore the data on women students further.

Excellent. This is what we expected. Women students are predominantly young, as shown by the table above. This is further evidence that young women are progressively becoming more interested in diving into the field.

Experience

Let’s do the same for the experience data. Here, we have two kinds of experience: coding experience and machine learning (ML) experience.

Coding experience
ML experience

For both coding and ML experience, as usual, men dominate every experience range. However, within the women’s data, aspiring female data scientists, i.e. women with little coding and ML experience, make up the largest share. Those with minimal experience are probably young. To check this hypothesis, let’s look at the ages within this group.

Age group for women with minimum coding experience
Age group for women with minimum ML experience

Excellent. This is what we expected. Women with minimal experience in both coding and machine learning are predominantly young, as shown by both tables above. This is further evidence that young women are progressively becoming more interested in diving into the field.

Salary

Let’s do the same for the salary data. This is the aspect we’ve been waiting for.

From both charts above, we can see that, again, men always dominate in terms of numbers. But if we focus on the ratio chart, a large proportion of the women’s data falls in the lowest salary range. This proportion probably consists of women with minimal experience, since less experienced employees usually earn less than more experienced ones. To check this hypothesis, let’s explore the respondents whose gender is woman and whose salary range is between 0 and 999 USD.

Coding experience group for women with salaries between 0 and 999 USD
ML experience group for women with salaries between 0 and 999 USD

Great! Our hypothesis is supported. This proportion is dominated by women with minimal experience in both coding and machine learning. Their experience ranges from under 1 year to below 3 years.

Question 2: “What are the skills needed to become a data scientist, data engineer, data analyst, or ML engineer?”

To answer this question, below are the columns of interest I think we need:

  • Q5 role
  • Q7 programming language (many columns)
  • Q8 programming language recommended
  • Q12 specialized hardware (many columns)
  • Q14 data viz library (many columns)
  • Q16 ML framework (many columns)
  • Q17 ML algorithm (many columns)
  • Q18 CV method (many columns)
  • Q19 NLP method (many columns)
  • Q23 important activity (many columns)
  • Q26 A/B cloud computing platform (many columns)
  • Q27 A/B cloud computing product (many columns)
  • Q28 A/B ML product (many columns)
  • Q29 A/B big data product (many columns)
  • Q30 big data product used most often
  • Q31 A/B BI tool (many columns)
  • Q32 BI tool used most often
  • Q33 A/B AutoML category (many columns)
  • Q34 A/B AutoML (many columns)
  • Q35 A/B ML experiment (many columns)
  • Q36 deploy (many columns)

Here, we should combine all skill columns into several skill groups. After looking through those skill columns thoroughly, I’ve found they can be classified into the following groups:

  • Main Activity: Q23
  • ML Deployment: Q36
  • Database & Big Data: Q29 and Q30
  • Data Visualization: Q14, Q31, and Q32
  • AI: Q18, Q19
  • ML Modeling: Q16, Q17, Q33, Q34
  • Programming: Q7 and Q8
  • Cloud: Q26, Q27, Q28
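The grouping above can be expressed as a simple mapping from skill group to question numbers. The sketch below then scores each role by the share of its respondents who gave at least one answer in a group. That scoring rule, the respondent layout, and the name `group_share_by_role` are assumptions made for illustration; the article does not state how the original scores were computed.

```python
from collections import defaultdict

# Mapping taken from the list above. "Main Activity" (Q23) is left out here
# because the article examines it separately from the common skills.
SKILL_GROUPS = {
    "ML Deployment": ["Q36"],
    "Database & Big Data": ["Q29", "Q30"],
    "Data Visualization": ["Q14", "Q31", "Q32"],
    "AI": ["Q18", "Q19"],
    "ML Modeling": ["Q16", "Q17", "Q33", "Q34"],
    "Programming": ["Q7", "Q8"],
    "Cloud": ["Q26", "Q27", "Q28"],
}

# Hypothetical respondents: "answers" maps a question to the options selected.
respondents = [
    {"role": "Data Analyst", "answers": {"Q7": ["Python"], "Q14": ["Matplotlib"]}},
    {"role": "Data Analyst", "answers": {"Q7": ["SQL"], "Q31": ["Tableau"]}},
    {"role": "Data Engineer", "answers": {"Q7": ["Python"], "Q26": ["Amazon Web Services (AWS)"]}},
    {"role": "Data Engineer", "answers": {"Q7": ["SQL"], "Q29": ["PostgreSQL"], "Q14": []}},
]


def group_share_by_role(respondents):
    """Return {role: {group: fraction of the role's respondents with >= 1 answer}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for r in respondents:
        totals[r["role"]] += 1
        for group, questions in SKILL_GROUPS.items():
            # A group counts once per respondent, however many options were picked.
            if any(r["answers"].get(q) for q in questions):
                hits[r["role"]][group] += 1
    return {
        role: {group: hits[role][group] / n for group in SKILL_GROUPS}
        for role, n in totals.items()
    }


shares = group_share_by_role(respondents)
for role, by_group in shares.items():
    print(role, by_group)
```

Each role ends up with one value per skill group, which is the shape of data a radar chart needs: one axis per group, one polygon per role.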

From the radar chart above, we can see that each role has different strengths. For example, ML Engineer is the most skilled in ML Deployment, AI, and ML Modeling; Data Engineer is strong in Cloud skills; and Data Analyst is the strongest in Data Visualization.

The unique one, I think, is the Data Scientist role. It is not the strongest in any of the common skills. Instead, it has almost the same strength in every aspect. Perhaps this supports the view that the Data Scientist role is a multidisciplinary job.

Another interesting finding is that all roles have the same share of Programming skills. We can infer that no matter which role in the data science field you’re interested in, programming is a must.

We are done with the common skills in data science roles. Now, let’s see what we find in the “Main Activity” group.

Unlike the chart of common skills earlier, the plot of common activities above shows more varied results. We can see that each role indeed has distinct activities, each with a different emphasis.

The Data Analyst, for example, as we would expect, spends most of their time analyzing and understanding data to influence product or business decisions. On this activity, the role stands far above the others. The Data Engineer, on the other hand, spends most of their time building and/or running the data infrastructure that the business uses for storing, analyzing, and operationalizing data. This is what we would expect from a data engineer.

The ML Engineer, as the name suggests, focuses more on ML activities, such as researching the state of the art in machine learning and building and/or running ML services. The Data Scientist role overlaps with the ML Engineer on most activities. What differentiates the two is the activity of analyzing and understanding data. The only activity in which the Data Scientist ranks first is building prototypes to explore applying machine learning to new areas.

Question 3: “By becoming a Full Stack Data Professional, do you earn more?”

Like Full Stack Developers, who can handle both the front end and the back end in web development, a data professional who does both the upstream and downstream tasks in the data field can be called a Full Stack Data Scientist/Professional. Like a Swiss Army knife, which packs many tools into a single knife, we expect a Full Stack Data Scientist to be a ‘one-man army’ who can do everything in the data field.

So, by mastering many things, a Full Stack Data Scientist should, I think, earn more than the other common roles. But is that so? Let’s find out.

To answer this question, below are the columns of interest I think we need:

  • Q5 role
  • Q7 programming language (many columns)
  • Q8 programming language recommended
  • Q10 hosted notebook (many columns)
  • Q12 specialized hardware (many columns)
  • Q14 data viz library (many columns)
  • Q16 ML framework (many columns)
  • Q17 ML algorithm (many columns)
  • Q18 CV method (many columns)
  • Q19 NLP method (many columns)
  • Q23 important activity (many columns)
  • Q24 salary
  • Q26 A/B cloud computing platform (many columns)
  • Q27 A/B cloud computing product (many columns)
  • Q28 A/B ML product (many columns)
  • Q29 A/B big data product (many columns)
  • Q30 big data product used most often
  • Q31 A/B BI tool (many columns)
  • Q32 BI tool used most often
  • Q33 A/B AutoML category (many columns)
  • Q34 A/B AutoML (many columns)
  • Q35 A/B ML experiment (many columns)
  • Q36 deploy (many columns)

We’d like a dataset with no redundant columns (but with no values removed) and with no NULL value in any row for any column of interest.
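A minimal sketch of this pre-processing step follows. It assumes that a multi-select question is spread over columns named like `Q7_Part_1`, `Q7_Part_2` and that an empty string marks a missing answer; both are assumptions about the file layout, and the helper names `collapse` and `fully_answered` are hypothetical.

```python
# Hypothetical rows; column names and values are illustrative only.
rows = [
    {"Q5": "Data Scientist", "Q7_Part_1": "Python", "Q7_Part_2": "",
     "Q14_Part_1": "Matplotlib", "Q24": "1,000-1,999"},
    {"Q5": "Data Analyst", "Q7_Part_1": "", "Q7_Part_2": "R",
     "Q14_Part_1": "", "Q24": "0-999"},
    {"Q5": "Software Engineer", "Q7_Part_1": "Python", "Q7_Part_2": "R",
     "Q14_Part_1": "Seaborn", "Q24": ""},
]
questions = ["Q5", "Q7", "Q14", "Q24"]


def collapse(row, questions):
    """Merge every column of a question into one list of non-empty answers.

    The redundant per-option columns disappear, but no answer is dropped.
    """
    merged = {}
    for q in questions:
        merged[q] = [
            value
            for col, value in row.items()
            # Match "Q7" itself or "Q7_..." but not, say, "Q70".
            if (col == q or col.startswith(q + "_")) and value
        ]
    return merged


def fully_answered(rows, questions):
    """Keep only respondents with at least one answer for every question."""
    collapsed = (collapse(row, questions) for row in rows)
    return [c for c in collapsed if all(c[q] for q in questions)]


kept = fully_answered(rows, questions)
print(kept)
```

Only the first sample respondent survives the filter: the second has no answer for Q14 and the third has no salary in Q24.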

After pre-processing the data, let’s see which role has the most skills.

So, the top 4 roles are Data Scientist, ML Engineer, Software Engineer, and Data Analyst. Let’s explore the salary for those four roles.

From the table above, it is risky to conclude that mastering most data skills leads to a higher salary. Only a very small proportion of people are lucky enough to earn more. To make this clearer, let’s look at the final mean for each role.

Average salary for most skilled roles

Great. We can see the final average for each role. All roles except Software Engineer make more than 40K USD a year on average if they master most skills in the data science field. For Software Engineer, I think it’s self-explanatory why this role earns the least here: these respondents work as software engineers, so the skills that would raise their pay are software engineering skills, not data science skills.

To understand whether the averages above are higher than those of common data professionals, we need to compare the two.
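Because the survey reports salary as a range rather than an exact figure, averaging requires turning each range into a number first. The sketch below uses the midpoint of each range and the lower bound for an open-ended top range. This conversion rule is an assumption for illustration and may differ from the one used for the tables in this article; the label formats and function names are hypothetical as well.

```python
from collections import defaultdict


def salary_midpoint(label):
    """Turn a range label such as '1,000-1,999' into its midpoint in USD.

    Assumption: an open-ended label such as '> $500,000' is represented
    by its lower bound.
    """
    cleaned = label.replace("$", "").replace(",", "").replace(" ", "")
    if cleaned.startswith(">"):
        return float(cleaned[1:])
    low, high = cleaned.split("-")
    return (float(low) + float(high)) / 2


def mean_salary_by_role(respondents):
    """Average the salary midpoints of (role, salary label) pairs per role."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for role, label in respondents:
        sums[role] += salary_midpoint(label)
        counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}


# Hypothetical (role, salary range) pairs.
sample = [
    ("Data Scientist", "40,000-49,999"),
    ("Data Scientist", "50,000-59,999"),
    ("Data Analyst", "$0-999"),
    ("Data Analyst", "1,000-1,999"),
]
means = mean_salary_by_role(sample)
print(means)
```

Running the same function once on the most skilled respondents and once on the remaining respondents gives two sets of per-role averages that can be compared side by side.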

Average salary for common skilled roles

Unexpectedly, the final averages for every common role are much lower than those for the roles with the most skills mastered. From this finding, we can conclude that mastering most skills in the data science field is worthwhile, as you will have a better chance of earning more than those with only general knowledge.

Conclusions

We have done many interesting things. We were interested in the data science field and explored the people in it: who they are, what they do, what they can do, what they earn, and so on. The 2020 Kaggle survey is the dataset we used for this exploration. You could also explore other Kaggle surveys, i.e. the 2019, 2018, and 2017 surveys, if you’d like to find comparisons or trends.

We focused our interest on three questions:

  • Women in the data science field
  • Skills needed
  • Full-stack data scientist

We have tried our best to explore the data to answer the questions. Many useful functions, dataframes, and plots were used to understand the data as well as possible. Finally, we’ve arrived at these conclusions.

So, for the 1st question, “How promising is the data field for women compared to men?”, our findings are as follows:

  • Still, like other IT fields, the data science field suffers from a lack of women enthusiasts. Finding men in any IT field is unremarkable, but the same cannot be said of women.
  • However, there is a bright spot. Young women are evidently becoming more interested in the data science field than older women.
  • This finding is supported by our exploration of the data on age, education, job role, experience, and salary.

For the next question, “What are the skills needed to become a data scientist, data engineer, data analyst, or ML engineer?”, our findings are the following:

  • We split the data into two parts: common skills and activities.
  • We used two radar charts to plot the common skills and common activities for the roles.
  • From the first chart, we’ve found that each role has different strengths.
  • However, unlike the other roles, Data Scientist shows a unique result. It is not the strongest in any of the common skills. Instead, it has almost the same strength in every aspect.
  • This could be evidence for the view that the Data Scientist role is a multidisciplinary job.
  • Besides, all roles have the same share of Programming skills. We can infer that no matter which role in the data science field you’re interested in, programming is a must.
  • From the second chart, it can be seen that each role has distinct activities, each with a different emphasis.
  • The ML Engineer and Data Scientist roles overlap on most activities, except for the activity of analyzing and understanding data.

For the last question, “By becoming a Full Stack Data Professional, do you earn more?”, below are our conclusions:

  • We filtered the dataset to contain only respondents with no NULL skills, that is, the most skilled respondents.
  • We’ve found that the top 4 roles with the most skills are Data Scientist, Machine Learning Engineer, Software Engineer, and Data Analyst in ascending order.
  • All roles, except Software Engineer, make more than 40K USD a year on average if they master most skills in the data science field.
  • When the average salaries of those top roles with the most skills mastered are compared with those of common data professionals, the common data professionals’ salaries are, unexpectedly, much lower.
  • It can be concluded that mastering most skills in the data science field is worthwhile, as you will have a better chance of earning more than those with only general knowledge.

Acknowledgments

Acknowledgment goes to Kaggle for providing the dataset. This article is one of the projects of the Data Scientist Nanodegree on Udacity. If you’d like to see the code I used to produce the results in this article, you can visit my repo on GitHub here. Feel free to explore it.

Thanks for reading!


A new tech publication by Start it up (https://medium.com/swlh).

Reza Dwi Utomo

An engineer specializing in the data-driven analysis | AI Enthusiast | Find me on linktr.ee/utomoreza
