Data Science and ML Trends in 2020–2022: Portraits of the Industry and Rise of AutoML

Published in

SBC Group Blog

15 min read · Feb 1, 2021

Introduction

The key asset of every knowledge-intensive industry (Software Development, Data Science, and ML Engineering included) is people. The way they operate, learn, collaborate, and share knowledge is an essential part of whether projects and initiatives are set up for success.

On the other hand, the organizational environment is the industry glue that enables, or equally hinders, people's contributions to a certain extent.

Everything mentioned above will affect the shape of the Data Science and ML industry in 2021–2022.

The Data Science Fellow cohort photo (reused from here)

In this post, we are going to look at the ‘Portraits and Landscapes’ of Data Science and ML that will determine the industry trends in 2021–2022. The post covers

Human capital of the industry (demography of ML and Data Science professionals, their education level, job responsibilities, as well as the way they learn, collaborate, and share knowledge with each other)
Organizational environments that affect ML adoption in business and the public sector
Usage of automation tools (AutoML products, software to manage Data Science and ML experiments, etc.) by the industry professionals

The insights conveyed by this article are based on the analysis of the data collected in Kaggle’s survey of ‘State of Data Science and Machine Learning 2020’ (https://www.kaggle.com/c/kaggle-survey-2020).

Note: Kaggle (www.kaggle.com) is a global community of data scientists and machine learners from all over the world with a variety of skills and backgrounds. The community has around 3 million active members. Although it is not rigorously representative of the entire population of Data Science and ML professionals across the globe from the sociological perspective, it still constitutes a significant fraction of the practitioners and professionals in the field. Therefore, the results of the survey can reasonably indicate where the Data Science and AI/ML industry is likely to evolve in the next couple of years.

Who Are You, Mr. and Mrs. Data Scientists?

As the survey results show, the shape of the Data Science and ML industry was shifting in 2020. This shift is expected to continue through 2021–2022.

The industry is becoming larger and younger. The number of practitioners is growing, especially as more juniors (students or fresh graduates) take up Data Science and ML.

We also see that the percentage of women contributing to the industry grew in 2020 vs. 2018 and 2019, and this is likely to continue in 2021–2022.

At the same time, the survey shows that the average experience level of the participants is relatively low. The majority of the survey respondents fell into the following clusters

Professionals with less than 1 year of programming and less than 1 year of ML experience
Professionals with 1–2 years of programming and less than 1 year of ML experience
Professionals with 3–5 years of programming and less than 1 year of ML experience
Professionals with 1–2 years of programming and 1–2 years of ML experience
Professionals with 3–5 years of programming and 1–2 years of ML experience

This indicates that the industry is actively expanding. 2020–2022 are going to be years of further growth. Students and junior professionals will gain relevant experience over the next 5 years, and that will change the landscape of the Data Science and ML fields.
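Experience clusters like the ones listed above can be obtained by counting respondents per pair of experience buckets. The following is a minimal sketch, not the code from the article's repo: the column names `programming_years` and `ml_years` are hypothetical, and the rows are toy values rather than survey data.

```python
import pandas as pd

# Toy responses (NOT survey data); column names are hypothetical.
responses = pd.DataFrame({
    "programming_years": ["< 1", "1-2", "1-2", "3-5", "3-5", "< 1", "1-2"],
    "ml_years":          ["< 1", "< 1", "1-2", "< 1", "1-2", "< 1", "< 1"],
})

# Count respondents in every (programming, ML) experience pair,
# then rank the pairs so the largest clusters come first.
ranked = (
    responses.groupby(["programming_years", "ml_years"])
    .size()
    .sort_values(ascending=False)
)
print(ranked)
```

With the real survey file, the same grouping would be applied to the two experience questions, whatever their actual column codes are.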

Where Are You, Mr. and Mrs. Data Scientists?

In terms of geography, we see that

The largest group of survey respondents resides in India
The US holds second place in the country ranking
The next places in the top list are held by Japan, Brazil, UK, China, Russia, and Nigeria, respectively
Developed nations in the Old World (UK; EU nations like Germany, France, Spain, Italy; Turkey), Canada, and Australia are well behind the countries in the top list
At the same time, some of the developing nations (Indonesia, Mexico, Pakistan) and Taiwan are on a par with the developed Old World nations that are not in the top list
South Korea and China are surprisingly under-represented among the survey participants

Note: The under-representation of Data Scientists and ML professionals from China on Kaggle may reflect some sort of ‘isolationism’: such professionals tend to contribute to their local Data Science platforms.

We can also see clear ‘challenger’ nations in terms of the growing number of Data Science and ML professionals (Brazil, Nigeria). As the professionals there gain more experience (they are not yet as experienced as their counterparts in the US, the Old World, and India), these countries are likely to bring new AI and ML technology innovations to the global scene.

Age and Level of Education

We can see that the four biggest Kaggle population clusters in the space of education level and age are as follows

People aged 18–21 with a Bachelor’s degree (the biggest cluster)
People aged 25–29 with a Master’s degree
People aged 22–24 with a Bachelor’s degree
People aged 22–24 with a Master’s degree

On top of the cluster insight above, we can draw the following additional observations

People with a Bachelor’s degree predominate in the younger age groups
In the older age groups, the percentage of Master’s and Doctoral degrees grows
People with a Master’s degree predominate in the age groups of 25–29 and older
Starting from the 35–39 age group, the number of respondents with a Doctoral degree is only slightly lower than the number with a Master’s degree in the same age group

Gender and Level of Education

As we can see, the top clusters of survey respondents along the dimensions of gender and level of education are as follows

Males with Bachelor’s degree
Males with Master’s degree
Males with Doctoral degree
Females with Bachelor’s degree
Females with Master’s degree

Bachelor’s and Master’s degrees are the most common levels of education for every gender.

Level of Education and Programming Experience

As we can see, there are four top clusters among the survey respondents along the education and programming experience dimensions, as follows

People with Bachelor’s degree having 1–2 years of programming experience
People with Master’s degree having 3–5 years of programming experience
People with Bachelor’s degree having 3–5 years of programming experience
People with Master’s degree having 1–2 years of programming experience

For more experienced categories (5+ years of programming experience), the share of people with a Master’s degree exceeds the share of people with a Bachelor’s degree.

The percentage of people with a Doctoral degree is higher in the most experienced categories (10–20 years and 20+ years of programming experience).

Where Are the Industry Veterans?

The location of the industry veterans (those with 20+ years of programming experience and/or 10+ years of ML experience) does not fully correspond to the countries with the biggest number of professionals in the field.

As we can see, the biggest number of programming veterans among the survey respondents resides in the US. The rest of the countries are well below the bar set by the US.

However, several countries still stand out in terms of the number of programming veterans who participated in the survey. These are

Japan
UK
Brazil
India

As we see, India, although it has the largest population of Kagglers among the survey respondents, does not have a large pool of experts with 20+ years of programming experience compared with the rest of the leading countries.

The US also clearly leads the rest of the world in the number of survey participants with 10+ years of ML experience.

Organizational Environment and Enterprise-Scale Data Science/ML

The implementation of Data Science/ML-related functions in a modern organization is largely determined by (1) the organizational environment and (2) the job responsibilities of the professionals involved.

The survey highlighted essential insights on how the factors below affect corporate Data Science/ML processes. These are

Size of the employer organization
Size of the team working in Data Science-related areas
Adoption of ML in the organization
Individual job responsibilities of the survey participants
Spending on ML and Cloud Computing in the last 5 years

Organization Size, Data Science and ML Adoption

As we can see, the largest cluster of survey respondents works for small organizations (0–49 employees). Within that bucket, most of the organizations have either 1–2, 0, or 3–4 employees responsible for data science workloads.

Most such organizations are still exploring ML methods. However, some fraction of them

have recently started using ML methods, or
do not use ML currently

Small organizations with no workers dedicated to data science workloads mostly do not use ML in their daily operations (which is to be expected).

Interestingly, small subclusters of small organizations with 5+ employees dedicated to data science workloads report mature usage of ML solutions in production.

The second and third largest clusters of survey respondents are those who work for

organizations with 10000+ employees
organizations with 1000–9999 employees

In these clusters, quite a lot of respondents indicate that their organizations have well-established ML methods.

Overall, irrespective of the organization size, a Data Science team of 5+ workers seems to indicate a certain level of ML maturity within an organization. In every organization size group, respondents from companies with 5+ workers dedicated to Data Science activities reported either having recently started to use ML in production or having mature ML methods established.
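One way to check an observation like the one about teams of 5+ workers is a row-normalized cross-tabulation of team size against ML adoption. This is only a sketch under stated assumptions: the column names `ds_team_size` and `ml_adoption`, the answer labels, and the rows are all hypothetical toy values, not survey data.

```python
import pandas as pd

# Toy rows (NOT survey data); column names and labels are hypothetical.
responses = pd.DataFrame({
    "ds_team_size": ["0", "1-2", "1-2", "5+", "5+", "5+", "0", "5+"],
    "ml_adoption": [
        "not used", "exploring", "exploring", "in production",
        "in production", "exploring", "not used", "in production",
    ],
})

# Share of each adoption level within every team-size group (rows sum to 1).
adoption_share = pd.crosstab(
    responses["ds_team_size"], responses["ml_adoption"], normalize="index"
)
print(adoption_share.round(2))
```

Normalizing by row rather than looking at raw counts keeps the comparison fair when some team-size groups have many more respondents than others.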

ML Spending

Looking at the charts, we can see that the most popular answer for ML spending is USD 0, regardless of the organization size. This may point to inaccuracies in the respondents’ answers to the respective survey question (or they may have deliberately given inaccurate answers).

Still, considering the meaningful responses, it looks like

large organizations tend to spend more on ML (USD 100k+ in the last 5 years)
smaller organizations spent USD 100–999 or USD 1,000–9,999 on ML in the last 5 years

Will the AutoML Age Commence Soon?

In this section, we look at how Data Science automation tools are used in the industry. It presents findings on

the usage of automated machine learning tools
the usage of tools to help manage machine learning experiments

Using AutoML Tools: By-Purpose View

We find that

the vast majority of the survey participants do not use any AutoML tools in their daily routines
automated model selection tools are the most popular type of AutoML tools used by Kagglers at the moment
automation tools for selecting a neural network architecture are neither widely used nor well known
Data Scientists are the primary users of AutoML tools (where such tools are in use)
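Usage counts like these come from a multiple-choice question, where a respondent can pick several options. The sketch below assumes a layout with one column per answer option, holding the option text when it was selected and a missing value otherwise. That layout, the column names, and the rows are assumptions for illustration, not survey data.

```python
import pandas as pd

# Toy rows (NOT survey data). Assumed layout: one column per answer option,
# with the option text when selected and a missing value otherwise.
responses = pd.DataFrame({
    "automl_part_1": ["Automated model selection", None,
                      "Automated model selection", None],
    "automl_part_2": [None, None, "Automated hyperparameter tuning", None],
    "automl_none":   [None, "No / None", None, "No / None"],
})

# Reshape to one row per (respondent, option) cell, drop the unselected
# cells, and count how often each option was picked.
usage = (
    responses.melt(value_name="option")["option"]
    .dropna()
    .value_counts()
)
print(usage)
```

Because one respondent can select several options, the counts can add up to more than the number of respondents.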

Using AutoML Tools By Purpose and Size of Employer Organization

We find that

Both large and small organizations are equally represented in the cluster of organizations where AutoML tools are not used now
For each type of AutoML tool, the highest usage is observed in small organizations (which also correlates with the fact that most of the survey respondents work for such organizations)

Using AutoML Tools By Purpose and Size of Data Science Team in Organizations

We find that

Organizations with both large and small data science teams are represented in the cluster of organizations where AutoML tools are not used at the moment
Where AutoML tools are used, the highest usage is observed in organizations where the Data Science team has 1–2, 3–4, or 20+ workers

Utilizing AutoML Tools By Product and User Occupation

In addition to the insights above, we find that

For every AutoML product surveyed, professionals with 1–2, 3–5, 5–10, and 10–20 years of programming experience are equally represented among the users
Professionals with 20+ years of programming experience are less inclined to use AutoML products

Utilizing AutoML Tools By Product, Size of Organization, and DS Team Capacity

We find that

Small organizations (0–49 employees) are more inclined to experiment with Auto-Sklearn and Auto-Keras
For larger organizations, the biggest cluster is always the subset of organizations that do not use any AutoML tools, regardless of the size of their Data Science team

Utilization of Tools to Manage DS Experiments

We find that

Most of the survey respondents do not use any tools to help manage data science experiments (so the market for such tools is far from saturated)
TensorBoard is the leading tool in use at the moment
Data Scientists, Software Engineers, and Data Analysts are the top occupations of the users of such tools

Knowledge and Information Sharing Channels

Knowledge is power, as Sir Francis Bacon is credited with saying. Additionally, information has become both power and the ‘new oil’ of the modern post-industrial age.

It is therefore interesting to see how Kagglers work with information and how they share and obtain professional knowledge.

In this section, we review the preferences of the survey participants regarding

Platforms and tools to publicly share or deploy their data analysis or machine learning applications
Platforms to take online data science courses
Primary tools they use to analyze data
Favorite media sources that report on data science topics

Public Sharing Platform Preferences by Occupation

We find that

GitHub is the primary choice for Kagglers to share their work publicly
Kaggle itself and Colab take the next places in the top 3 platforms for publicly sharing work
Good sharing tools like Streamlit and Plotly Dash are under-utilized by the community
Quite a large fraction of the survey respondents indicated they do not share any work publicly
The percentage of Business Analysts and people with ‘Other’ occupation who do not share their work publicly is higher than for the rest of the occupations

Public Sharing Platform Preferences by Occupation and Programming Experience

In addition to the insights above, we find that

GitHub and Kaggle are the preferred sharing platforms for every occupation and programming experience group
The percentage of professionals not sharing their work publicly is higher among the more experienced groups (people with 10–20 and 20+ years of programming experience); they replied ‘None’ more frequently than members of the other experience groups

Kaggle vs. Colab Usage

After acquiring Kaggle, Google came to own two similar technologies that let data scientists work and share their results publicly in the cloud at a reasonable cost or even for free (Kaggle notebooks and Colab notebooks). Both are used by industry professionals now.

We can see that

Kaggle is used more actively by males in the majority of occupations, levels of education and programming experience levels
The most tangible difference is observed for male Data Scientists with Master’s and Bachelor’s degrees and between 1 and 10 years of programming experience
In a sense, Kaggle notebooks and Colab can be seen as competing for the same target audience

In light of the data-driven insights above, we can assume Google could benefit from repositioning Colab from the functional and marketing standpoints to attract more paying users.

Online Training Platform Preferences by Occupation

We find that

Coursera, Kaggle Learn Courses, and Udemy are the top 3 preferred learning platforms (with Coursera being far more popular than the other two)
University courses leading to a formal university degree are also quite popular (they take 4th place in the list)
edX is behind its primary e-learning platform rivals
Only a minor fraction of respondents indicate they have not begun or completed any data science courses (a healthy indicator that the data science community at Kaggle has good learning and self-learning attitudes)

Online Learning Platform Preferences by Occupation and Programming Experience

In addition to the insights above, we find that

Coursera is the top platform of choice for every occupation and programming experience group
Kaggle Learn Courses are more popular with junior professionals (with <1 or 1–2 years of professional experience)
For more senior professionals (with 3+ years of experience), Udemy is the second platform of choice (after Coursera), and they do not actively use Kaggle Learn Courses

Primary Data Analysis Tool Preferences by Occupation

We find that

Local IDEs are the primary choice of the majority of the survey respondents (which suggests quite good technical skills among them)
Basic stats software is the second preference of the respondents
Data Scientists do not often use basic stats software compared with the other analysis tool options investigated
The rest of the tool types are far behind local IDEs and basic stats software

Primary Data Analysis Tool Preferences by Occupation and Programming Experience

We find that

Technically savvy occupations are more inclined to use local IDEs as their primary data analysis tool, whereas less technical occupations (like Business Analysts) tend to use basic stats tools more
Professionals with little programming experience (<1 year or none) tend to choose basic stats software, whereas professionals with 2+ years of programming experience are comfortable using local IDEs for their data analysis needs

Favorite Media Sources by Occupation

We find that

Kaggle, as a source of valuable Data Science-related information, is the top choice for the majority of the survey respondents, and it strongly outperforms the other media sources investigated
YouTube content comes second
Popular Data Science blogs take third place in the ranking
Other media sources are well below the top three listed above

Favorite Media Sources by Occupation and Programming Experience

In addition to insights above, we find that

In almost every occupation, people with less than 10 years of programming experience are fans of Kaggle as their primary information source
People with 10+ years of experience tend to rank other information sources as their primary go-to medium, with blogs being one of the popular ones
Junior Business Analysts (with 2 or fewer years of programming experience) rank YouTube as their primary go-to medium, whereas more senior Business Analysts prefer blog posts and Kaggle
Statisticians with 20+ years of experience prefer journal papers/publications, while Statisticians with fewer years of programming experience turn to blogs and Kaggle as their go-to choices
Product/Project managers rank Kaggle as their go-to information source, regardless of their programming experience

Summary

In this post, we tried to draw the portrait of the Data Science and ML industry in the upcoming years (2021–2022). We can see a lot of interesting trends that will shape the industry landscape in the next two years.

The industry is proliferating, and its geography is expanding. The rise of interest in AutoML tools, along with a more pragmatic approach to them (AutoML as tools, not as a Holy Grail for tackling every ML problem without human expert involvement), is likely to speed up the evolution of the industry.

Overall, Data Science and ML will remain one of the sexiest occupation fields in the next two years, despite the voices of sceptics.

References

This is the second article in my series about the insights from the data collected in Kaggle’s survey of ‘State of Data Science and Machine Learning 2020’ (https://www.kaggle.com/c/kaggle-survey-2020).

The first article in the series, “Cloud Computing, Data Science and ML Trends in 2020–2022: The battle of giants”, can be found at https://medium.com/sbc-group-blog/cloud-computing-and-data-science-and-ml-trends-in-2020-2022-the-battle-of-giants-c2a174d3cd2b

Note: you can check the repo at https://github.com/gvyshnya/state-of-data-science-and-ml-2020 to see how every insight above was derived.

Written by George Vyshnya