What Does the Kaggle Survey Tell Us about Data Science?

Using Graphext to understand user profiles

Kaggle recently published its second annual Machine Learning and Data Science Survey. About 24,000 users across the globe responded, sharing a wealth of information about their demographics, behaviour, and opinions. The survey gives a unique peek into the Machine Learning and Data Science industry.

By analyzing the survey, we want to explore what matters most to a specific group of people. But the challenge with any survey analysis has always been this: what is the best way to define “a group”? Should people be segmented by their demographic characteristics or by their behaviours? How can we investigate the data thoroughly enough to understand both macro trends (e.g., the global technology industry) and micro trends (e.g., female data scientists working in Brazil who predominantly use R)? In this article, we attempt to answer these questions by running the survey analysis in our software, Graphext.

The beauty of Graphext is that it lets users explore data in two ways: first, through unsupervised data segmentation; second, by answering specific questions.

Part 1 — Unsupervised Segmentation

After we uploaded the dataset into Graphext, the software automatically plotted a topology of the survey data (in this case, it looks like a huge hot air balloon) and segmented the responses into several clusters. Personally, I always find it helpful to start with a generalized profile of a user that takes many different characteristics into account. Once I can visualize who this user is and what they do day to day, their behaviour usually makes more sense to me.
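To give a feel for what "segmenting without labels" means, here is a minimal sketch in plain Python. It is not Graphext's actual algorithm: it simply links respondents who share most of their answers and treats each connected group as a cluster. The toy responses, the `similarity` and `segment` functions, and the 0.6 threshold are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy responses: (main language, main data type, role).
responses = [
    ("Python", "tabular", "data scientist"),
    ("Python", "tabular", "data scientist"),
    ("Python", "image", "researcher"),
    ("Python", "image", "researcher"),
    ("R", "numerical", "business analyst"),
    ("R", "numerical", "business analyst"),
]

def similarity(a, b):
    """Fraction of questions on which two respondents gave the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def segment(rows, threshold=0.6):
    """Link respondents whose similarity is at least `threshold`,
    then return the connected components as clusters of row indices."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if similarity(rows[i], rows[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(rows)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

clusters = segment(responses)
print(clusters)
```

On this toy data the three pairs of identical respondents end up in three separate clusters. A real survey with hundreds of questions needs a more robust similarity measure and clustering method, but the idea of grouping people by how alike their answers are stays the same.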

Looking at the clusters produced by unsupervised learning, a few interesting profiles jump out from the data.

Blue Cluster (20% of population): John is 38 years old and lives in San Francisco. He works at a technology company as a senior data scientist. He has a doctoral degree in computer science and has worked in the data science field for the past 6 years. His company is quite mature technologically, so he mostly works with big data. His work consists of building prototypes and then running ML models on AWS. He uses Python, especially the Scikit-learn library, as he mostly works with tabular, text and time series data.

Red Cluster (10% of population): Ming is a young researcher from Shanghai. He is 25 years old, and he is exploring different ML methods to advance his research, which focuses on image recognition. He predominantly uses Python with the TensorFlow and Keras libraries to process image data. As he doesn’t work with big datasets, he mostly uses hosted solutions, but occasionally he uses Alibaba Cloud. He has a software engineering degree and only 2–3 years of experience in ML. For him, ML is a black box: although he uses it, he doesn’t feel the need to explain the output to his audience.

Brown Cluster (8% of population): Karina is a business analyst from London. She is 27 years old and works for an insurance company. She has an undergraduate degree in statistics and, although she is quite interested in data science, she has never had any formal training in it. She has been doing her job for 4 years now, and her main goal is to develop analyses that influence decision makers in her company. She rarely uses ML, but when she does, she uses Random Forest and Caret libraries to process numerical data. For her, it is very important to explain the output of her model to her stakeholders and to avoid any biases.

As you can see, building a rich user profile requires many detailed characteristics. That is very hard to do manually when there are many data dimensions to consider. But with a tool like Graphext, you can get an unbiased result within minutes.

Part 2 — Answer Specific Questions

Once you have developed a general understanding of the survey population, you will often want to use the data to answer specific questions. In this case, we want to know how users in Europe differ from users in the US. To do so, we selected two sets of users: users from the top 5 European countries (UK, Germany, France, Spain, Italy) and users from the US. We simply grouped them to narrow down our analysis.
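Outside of Graphext, the same selection can be sketched in a few lines of Python. This is only an illustration: the `country` field, the short country labels and the sample respondents are assumptions, and the real survey export may spell country names differently.

```python
# Assumed country labels; adjust them to match the actual survey export.
EUROPE_TOP5 = {"United Kingdom", "Germany", "France", "Spain", "Italy"}
US = {"United States"}

def region_of(country):
    """Map a country to the comparison group it belongs to, or None."""
    if country in EUROPE_TOP5:
        return "Europe"
    if country in US:
        return "US"
    return None

# Hypothetical respondents, for illustration only.
respondents = [
    {"id": 1, "country": "Germany"},
    {"id": 2, "country": "United States"},
    {"id": 3, "country": "Brazil"},
    {"id": 4, "country": "Spain"},
]

groups = {"Europe": [], "US": []}
for respondent in respondents:
    region = region_of(respondent["country"])
    if region is not None:  # Respondents from other countries are left out.
        groups[region].append(respondent["id"])

print(groups)
```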

Graphext has a “compare” function that lets you compare two or more groups by automatically highlighting the variables that could statistically explain their differences. Here is what we found using this function (US in blue, Europe in orange):
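We do not describe here how the compare function works internally, but the general idea of ranking variables by how differently two groups answer them can be sketched with a Pearson chi-square statistic. Everything in the example below (the `chi_square` helper, the variable names and the toy answers) is a hypothetical illustration, not Graphext's implementation or real survey data.

```python
from collections import Counter

def chi_square(group_a, group_b):
    """Pearson chi-square statistic for one categorical variable,
    comparing the answer distribution of two non-empty groups."""
    counts_a, counts_b = Counter(group_a), Counter(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = n_a + n_b
    stat = 0.0
    for category in set(counts_a) | set(counts_b):
        category_total = counts_a[category] + counts_b[category]
        for observed, n in ((counts_a[category], n_a), (counts_b[category], n_b)):
            # Expected count if the answer were independent of the group.
            expected = n * category_total / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical toy answers, 10 respondents per group.
us = {
    "degree": ["Bachelor"] * 6 + ["Master"] * 4,
    "gender": ["Male"] * 8 + ["Female"] * 2,
}
europe = {
    "degree": ["Bachelor"] * 3 + ["Master"] * 7,
    "gender": ["Male"] * 8 + ["Female"] * 2,
}

scores = {variable: chi_square(us[variable], europe[variable]) for variable in us}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In this toy example `degree` ranks above `gender`, because the two groups answer the gender question identically. In a real analysis the statistic would be converted to a p-value that accounts for the number of answer categories, and corrected for the many variables being tested at once.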

To start with, the salary level of European users is much lower than that of their counterparts in the US. Looking at their backgrounds, the European users seem to be more technical: they are more likely to have Master's or Doctoral degrees in Computer Science, Mathematics or Physics, and they work as data scientists and research scientists. In the US, by contrast, users are more likely to have Bachelor's degrees in Engineering, Business or Life Sciences, and besides the typical data scientist role, they also work as data analysts and business analysts. The level of technical experience is also reflected in the type of tools they use.

As we continue to explore the differences, we see that European users work more with time series data, whereas US users work more with numerical data. Looking at the industries these users work in, the percentage of US users working in the Medical and Pharmaceutical industries is almost twice that of European users. In addition, the share of male users seems to be slightly higher in Europe than in the US.

Despite all the differences, their frustrations seem universal. They all spend too much time on data cleaning, they find it hard to explain black box models, hard to make work easy to reuse, and hard to make algorithms fair or unbiased. I guess these shared frustrations are what bond us in the DS and ML discipline and make us learn from each other.

I hope you enjoyed reading this article. If my analysis intrigues you, I encourage you to ask us for a demo and see for yourself what Graphext can do to explore data!