5 things I learned by analysing Kaggle’s 2020 Machine Learning and Data Science Survey

Huey Fern Tay
Published in Nerd For Tech, Nov 18, 2021

By Huey Fern Tay

With Greg Page

Every year since 2017, Kaggle has canvassed the opinions of its email subscribers, discussion forum members, and social media followers through a large-scale survey aimed at understanding aggregate trends in the world of data analysis and modeling. Survey respondents typically include students as well as business analysts, data scientists, data analysts, data engineers, and others. By analysing responses to questions about topics such as the machine learning tools that practitioners currently use, the tools they hope to become more familiar with, and their recommendations to aspiring data scientists, we can glean interesting insights about major trends in the data science community.

In the interest of brevity, this article examines just a few aspects of the platform’s 2020 survey.

1. Python reigns supreme regardless of occupation

Python was the runaway winner, with 15,530 respondents saying they used it on a regular basis. That figure represents 80% of the sample size for this question. The heat map below highlights Python’s ubiquity among all professions included in this survey, a prevalence which underscores the language’s versatility.

Above: Author’s image
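For readers who want to rebuild a table like the one behind the heat map, here is a minimal sketch in plain Python. The respondents are made up for illustration (they are not survey rows), and the `role` and `languages` fields are hypothetical names; the real survey file spreads each multi-select answer over several columns, so it would need a loading step first. Respondents who skipped the question are left out of the denominator, which is an assumption based on the "sample size for this question" wording.

```python
from collections import defaultdict

# Hypothetical toy respondents; NOT real survey rows.
respondents = [
    {"role": "Data Scientist", "languages": {"Python", "SQL"}},
    {"role": "Data Scientist", "languages": {"Python", "R"}},
    {"role": "Student", "languages": {"Python"}},
    {"role": "Student", "languages": {"R"}},
    {"role": "Data Engineer", "languages": {"Python", "SQL"}},
    {"role": "Data Engineer", "languages": set()},  # skipped the question
]


def language_share_by_role(rows):
    """Return {role: {language: share of that role's answerers using it}}."""
    answered = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        if not row["languages"]:
            continue  # exclude non-responses from the denominator
        answered[row["role"]] += 1
        for language in row["languages"]:
            counts[row["role"]][language] += 1
    return {
        role: {lang: n / answered[role] for lang, n in langs.items()}
        for role, langs in counts.items()
    }


shares = language_share_by_role(respondents)
for role, langs in sorted(shares.items()):
    print(role, {lang: round(share, 2) for lang, share in sorted(langs.items())})
```

Each inner dictionary is one row of the heat map: a profession, and the share of its respondents who regularly use each language.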

2. People were nearly 2x more likely to use Python + SQL regularly compared to Python + R

When we took a deeper look at the question posed above (‘Which programming languages do you use on a regular basis? Select all that apply’), it became clear that Python and SQL are a popular combination: 6,667 people said they often use both. While it is possible to run SQL queries in R, the R + SQL pairing was approximately half as frequent among survey respondents as was the Python + SQL pairing (22.7% vs 42.9%).

Above: Author’s image
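A pairing comparison of this kind can be sketched with a small helper that asks: of the people who use one language, what share also use another? The answers below are invented for illustration, and `pair_share` is a hypothetical helper name. The last lines check one figure from the text: dividing the 6,667 Python + SQL users by the 15,530 Python users gives 42.9%, which suggests (but does not confirm) that Python users are the denominator behind that percentage.

```python
def pair_share(rows, base, partner):
    """Share of respondents using `base` who also use `partner`."""
    base_users = [langs for langs in rows if base in langs]
    if not base_users:
        return 0.0
    both = sum(1 for langs in base_users if partner in langs)
    return both / len(base_users)


# Hypothetical toy answers to the multi-select question; NOT survey data.
answers = [
    {"Python", "SQL"},
    {"Python", "SQL", "R"},
    {"Python", "SQL"},
    {"Python"},
    {"R", "SQL"},
    {"SQL"},
]

python_sql = pair_share(answers, "Python", "SQL")  # 3 of the 4 Python users
python_r = pair_share(answers, "Python", "R")      # 1 of the 4 Python users
print(f"Python + SQL: {python_sql:.0%}, Python + R: {python_r:.0%}")

# Counts quoted in the article; the choice of denominator is an assumption.
survey_share = 6667 / 15530
print(f"Python users who also use SQL: {survey_share:.1%}")
```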

3. Learn Python

Given Python’s overwhelming popularity among the survey’s respondents, it is not surprising that a high proportion of them strongly recommend it to aspiring data scientists. Of the 17,821 people who answered this question, 14,241 (about 80%) made this recommendation.

Above: Author’s image

4. The past five years appear to have seen a big boost in machine learning practitioners

Even though machine learning research has been ongoing since the 1950s, a combination of factors has contributed to its recent popularity surge. For starters, computers are more affordable and powerful. Operations that once required specialized, expensive equipment can now be run from a personal laptop. Cloud computing infrastructure has become more advanced and democratised. Meanwhile, the sheer scale of big data delivered by the digital revolution has made it possible for businesses to ingest and analyse many more data points faster and cheaper than before.

Given the resurgence in machine learning, it is not surprising that just 12.67% of respondents said they do not use machine learning methods. The majority of respondents, students included, began using these methods within the past five years.

Many respondents with five or fewer years of coding experience use machine learning frameworks such as Scikit-learn and TensorFlow on a regular basis (6,746 of 10,250 Scikit-learn users and 4,484 of 6,934 TensorFlow users, respectively). The former, which contains a robust library of tools in Python, can be used to perform classification, regression, and clustering, among other tasks. The latter, an open-source library released by Google in 2015, is used by companies like Airbnb to build and train models to solve problems like image classification.
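To put the counts quoted above on a common scale, they can be turned into percentages. This short sketch uses only the four numbers from the paragraph; the dictionary layout and variable names are illustrative.

```python
# (users with five or fewer years of coding experience, all users of the framework)
# Counts are the ones quoted in the article.
framework_counts = {
    "Scikit-learn": (6746, 10250),
    "TensorFlow": (4484, 6934),
}

# Share of each framework's regular users who have coded for five years or less.
newer_coder_share = {
    name: newer / total for name, (newer, total) in framework_counts.items()
}

for name, share in newer_coder_share.items():
    print(f"{name}: {share:.1%} of regular users have coded for five years or less")
```

Both shares come out at roughly two thirds (about 65.8% for Scikit-learn and 64.7% for TensorFlow).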

Only 3% of respondents to this question said they do not use machine learning frameworks regularly.

Above: Author’s image

5. It’s hard to tell if automated or partially automated machine learning tools remain a niche toolkit

One challenging aspect of machine learning is determining the appropriate algorithm or tool to use, as each approach involves trade-offs. Finding a good fit is an iterative process that is resource-intensive and time-consuming. Automated and partially automated machine learning tools such as Auto-sklearn, Google Cloud AutoML, and Auto-Keras can reduce the time associated with model development and accelerate the creation of production-ready models, yet few respondents said they used them on a regular basis.

Above: Author’s image

It is worth noting that by this point in the survey (Question 34 asked about automated or partially automated ML tools), the proportion of non-responses was above 95%. Survey fatigue may have set in among many respondents.

Above: Author’s image
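One way to check for this kind of drop-off is to compute, for each question, the share of respondents who left every option blank. The sketch below uses invented rows and hypothetical column names (`q7_part_1`, `q34_part_1`, `q34_part_2`); the real survey file uses its own column naming, so treat this as an illustration of the idea rather than the exact method behind the chart.

```python
def non_response_rate(rows, columns):
    """Fraction of rows where every listed column is blank (None or '')."""
    if not rows:
        return 0.0
    blank = sum(1 for row in rows if not any(row.get(col) for col in columns))
    return blank / len(rows)


# Hypothetical toy rows; NOT real survey data.
rows = [
    {"q7_part_1": "Python", "q34_part_1": None, "q34_part_2": None},
    {"q7_part_1": "Python", "q34_part_1": "Auto-Keras", "q34_part_2": None},
    {"q7_part_1": None, "q34_part_1": None, "q34_part_2": None},
    {"q7_part_1": "Python", "q34_part_1": None, "q34_part_2": ""},
]

q7_rate = non_response_rate(rows, ["q7_part_1"])
q34_rate = non_response_rate(rows, ["q34_part_1", "q34_part_2"])
print(f"Q7 non-response: {q7_rate:.0%}, Q34 non-response: {q34_rate:.0%}")
```

Comparing this rate for an early question and a late one shows how much of a low usage figure may simply reflect people no longer answering.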

Thanks for reading.

Data source: Kaggle
