What Are The Differences Between Data Scientists That Earn 500💲 And 225.000💲 Yearly?

This article covers the skills, tools, country characteristics, and company characteristics associated with high income in data science.

Hasan Basri Akçay
DataBulls
5 min read · Dec 27, 2021


Gender-Education-Job Title TreeMap Plot — image by author

Data science is becoming more popular every day. As its popularity grows, the income gap among data scientists is widening. So where does this gap come from? Let's take a closer look.

In this article, we look for the features most associated with high income in data science. The data comes from Kaggle's 2019 survey on data science and machine learning. The survey was live for three weeks in October and closed with 19,717 responses. The article has three parts: Data Cleaning, Data Analysis, and Results.

Data Cleaning

The objective here is to compare data scientists with high and low salaries. Therefore, we dropped the rows whose job title (Q5) is ‘Student’ or ‘Not employed’.

import pandas as pd

df = pd.read_csv("../input/kaggle-survey-2019/multiple_choice_responses.csv", low_memory = False)
df = df[~df['Q5'].isin(["Student", "Not employed"])]
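The filter above can be tried on a small stand-in table. This is a minimal sketch with made-up rows, assuming (as the later code suggests) that the first row of the file holds the question text rather than an answer. The names `toy`, `excluded`, and `employed` are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the survey file. Row 0 mimics the question-text row.
toy = pd.DataFrame({
    "Q5": ["Question text for Q5", "Data Scientist", "Student",
           "Not employed", "Software Engineer"],
    "Q10": ["Question text for Q10", "$0-999", None, None, "100,000-124,999"],
})

# isin() marks the rows to exclude; ~ inverts the mask to keep everything else.
excluded = ["Student", "Not employed"]
employed = toy[~toy["Q5"].isin(excluded)]

print(employed)
```

Note that the question-text row survives the filter, and the original index labels are kept, which is why later steps can still read row 0 and slice from label 1 onward.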

Next, we plot the distributions of six questions to get a visual overview of the data.

cols = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q10']
plot_distirubution(df, cols)
Distributions of Six Questions, Graph— image by author

Data Analysis

The salary answers in Q10 are stored as strings that describe salary ranges. Therefore, we first clean these strings and convert each range to a number (its midpoint), stored in compensation_num. Then we divide the compensation data into three groups and store the result in the ‘compensation_num_group’ column. This column represents three income levels: low, medium, and high.

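# Strip the currency symbol and thousands separators (literal matches, not regex).
df['compensation_num'] = df['Q10'].str.replace('$', '', regex=False)
df['compensation_num'] = df['compensation_num'].str.replace(',', '', regex=False)
# The open-ended top bracket has no upper bound; map it to a single value.
df['compensation_num'] = df['compensation_num'].str.replace('> 500000', '600000', regex=False)

# Row 0 holds the question text; map each column name to its question.
quenstion_dict = {}
for index, value in enumerate(df.loc[0, :]):
    quenstion_dict[df.columns[index]] = value

# Split each "low-high" range into its two bounds, skipping the question row.
df['low_compensation_num'] = df.loc[1:, 'compensation_num'].str.split('-').str[0]
df['high_compensation_num'] = df.loc[1:, 'compensation_num'].str.split('-').str[1]

# Missing bounds become -1 so that the columns can be cast to int.
df['low_compensation_num'] = df['low_compensation_num'].fillna(-1)
df['high_compensation_num'] = df['high_compensation_num'].fillna(-1)

df['low_compensation_num'] = df['low_compensation_num'].astype(int)
df['high_compensation_num'] = df['high_compensation_num'].astype(int)

# Use the midpoint of each range; rows with no answer end up as -1 and are dropped.
df['compensation_num'] = (df['low_compensation_num'] + df['high_compensation_num']) / 2
df = df[df['compensation_num'] != -1]

df = df.drop(columns=['low_compensation_num', 'high_compensation_num'])
# Quantile-based split into three income groups of roughly equal size.
df['compensation_num_group'] = pd.qcut(df['compensation_num'], 3, labels=["low", "medium", "high"])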
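To make the range-to-midpoint conversion and the three-way split concrete, here is a small worked sketch on made-up salary brackets written in the same textual format. It is a simplified variant of the steps above, not the notebook's exact code, and the names `raw`, `clean`, `bounds`, `midpoint`, and `group` are illustrative.

```python
import pandas as pd

# Made-up salary brackets in the same textual format as the Q10 answers.
raw = pd.Series(["$0-999", "1,000-1,999", "50,000-59,999",
                 "100,000-124,999", "200,000-249,999", "10,000-14,999"])

# Strip the currency symbol and thousands separators as literal characters.
clean = raw.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# Split "low-high" into two integer columns and take the midpoint.
bounds = clean.str.split("-", expand=True).astype(int)
bounds.columns = ["low", "high"]
midpoint = (bounds["low"] + bounds["high"]) / 2

# Quantile-based binning: three groups with roughly equal numbers of rows.
group = pd.qcut(midpoint, 3, labels=["low", "medium", "high"])

print(pd.DataFrame({"raw": raw, "midpoint": midpoint, "group": group}))
```

Because `pd.qcut` cuts at quantiles rather than at fixed salary values, the boundaries between low, medium, and high depend on the respondents in the data, not on absolute thresholds.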

After that, we calculated the differences between the high-income and low-income groups with the categorical_distribution_diff function and plotted the graphs in order of their categorical distribution score. The first graph shows the most important column, and the last graph shows the least important one.

questions = group_cols(df)

score_cols = find_distribution_diff(df, questions, 'compensation_num_group')

sns.set(font_scale=1.2)

plot_salary(df, score_cols, quenstion_dict, target='compensation_num_group', country='all')
Distributions of Incomes, Graph — image by author
plot_parallel_categories_salary(df, country='all')
Education-Gender-Experience Parallel Categories Plot — image by author
plot_treemap_salary(df, target='compensation_num_group', country='all')
Gender-Education-Job Title TreeMap Plot — image by author
plot_point_salary(df, target='compensation_num_group', country='all')
Age-Experience-Compensation Scatter Plot — image by author
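The scoring helpers used above are defined in the notebook and are not reproduced in this article. As an illustration of the general idea only, the sketch below ranks columns by how differently the high and low groups answer them, using the total variation distance between the two answer distributions. This metric is an assumption chosen for the example, not necessarily the one the notebook uses, and `distribution_gap` is a hypothetical function name.

```python
import pandas as pd

def distribution_gap(frame, column, target="compensation_num_group"):
    """Hypothetical helper: total variation distance between the answer
    shares of the 'high' and 'low' groups for one categorical column.
    0 means identical distributions, 1 means no overlap at all."""
    # Each column of `shares` sums to 1: answer shares within one income group.
    shares = pd.crosstab(frame[column], frame[target], normalize="columns")
    return 0.5 * (shares["high"] - shares["low"]).abs().sum()

# Made-up toy answers; not taken from the survey.
toy = pd.DataFrame({
    "compensation_num_group": ["low", "low", "low", "low",
                               "high", "high", "high", "high"],
    "country": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "gender": ["x", "y", "x", "y", "x", "y", "x", "y"],
})

scores = {col: distribution_gap(toy, col) for col in ["country", "gender"]}

# Columns with the largest gap come first, mirroring the plot order.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

In this toy table, the country answers differ between the two groups while the gender answers do not, so country ranks first.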

The Parallel Categories and TreeMap plots are interactive in the Kaggle Notebook, and not all plots are included in this article. You can see the full Python code and all plots here 👉 Kaggle Notebook.

Results

According to the order of the count plots, country is the most important feature for a high salary. The USA, Germany, and Canada are the most common high-paying countries for data scientists.

The second most important feature is experience. A data scientist who has been working for 5 years in data analysis and for 3–4 years in machine learning counts as experienced. Age and experience behave as the same feature, since both represent experience.

Spending on machine learning is the third most important feature. Someone who has spent over 1,000 dollars on machine learning can earn a higher salary than other data scientists. But the most significant thing in this plot is that many data scientists have high salaries without spending any money on machine learning, which shows the importance of free resources.

The fourth most important feature is the profile of the company that the data scientist works for. If the company has had a machine learning model in production for more than 2 years, has 20 people responsible for data science, and has more than 10,000 employees, it tends to pay a data scientist a high salary.

Job title and education are other important features. A data scientist who has a Product/Project Manager job title and a Doctoral or Master's degree earns a high salary. These features are tied to experience, because becoming a Product/Project Manager and earning a Doctoral or Master's degree both take time.

Programming languages and databases are also important features. The most common programming language is Python, but SQL is more strongly associated with a high salary, and databases in general are strong features for high compensation. Based on this result, we can say “All good data scientists have to know databases”.
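The distinction between the most common tool and the tool most associated with a high salary can be checked with two separate numbers per option: how often it is selected, and what share of the people who selected it fall in the high group. The sketch below uses made-up data and hypothetical column names (`lang_Python`, `lang_SQL`), and assumes a multiple-choice layout where a selected option is a non-null cell.

```python
import pandas as pd

# Made-up toy data. Column names are hypothetical; a selected option is a
# non-null cell and an unselected option is missing.
toy = pd.DataFrame({
    "compensation_num_group": ["low", "low", "medium", "high", "high", "low"],
    "lang_Python": ["Python", "Python", "Python", "Python", None, "Python"],
    "lang_SQL": [None, None, "SQL", "SQL", "SQL", None],
})

is_high = toy["compensation_num_group"] == "high"

summary = {}
for col in ["lang_Python", "lang_SQL"]:
    selected = toy[col].notna()
    summary[col] = {
        "usage_rate": selected.mean(),           # how common the option is
        "high_share": is_high[selected].mean(),  # share of its users in the high group
    }

print(pd.DataFrame(summary).T)
```

In this toy table, Python has the higher usage rate while SQL has the higher share of high earners among its users, which is the pattern described above. The numbers themselves are illustrative and say nothing about the survey.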

There are many online courses for data science. The most common online platform for data science courses is Coursera, while Fast.ai has the highest rate of high salaries.

Machine learning models are used for making decisions. The most important machine learning model is XGBoost, which is also the most common ML model on Kaggle.

Cloud computing and data analysis tools are also important features. AWS is the most important tool for data science.
