Data Science and ML Trends in 2020–2022: Portraits of the Industry and the Rise of AutoML

George Vyshnya
Published in SBC Group Blog
15 min read · Feb 1, 2021
Knowledge is power — Francis Bacon

Introduction

The key asset of every knowledge-intensive industry (Software Development, Data Science, and ML Engineering included) is people. The way they operate, learn, collaborate, and share knowledge largely determines whether projects and initiatives are set up for success.

On the other hand, the organizational environment is the glue of the industry: it can enable or, equally, constrain what people are able to contribute.

Both factors will affect the shape of the Data Science and ML industry in 2021–2022.

The Data Science Fellow cohort photo (reused from here)

In this post, we are going to look at the ‘Portraits and Landscapes’ of Data Science and ML that are likely to shape industry trends in 2021–2022. It will cover:

  • Human capital of the industry (demography of ML and Data Science professionals, their education level, job responsibilities as well as the way they learn, collaborate, and share knowledge with each other)
  • Organizational environments that affect ML adoption in business and the public sector
  • Usage of automation tools (AutoML products, software to manage Data Science and ML experiments, etc.) by industry professionals

The insights conveyed by this article are based on the analysis of the data collected in Kaggle’s survey of ‘State of Data Science and Machine Learning 2020’ (https://www.kaggle.com/c/kaggle-survey-2020).

Note: Kaggle (www.kaggle.com) is a global community of data scientists and machine learners from all over the world with a variety of skills and backgrounds. The community has around 3 million active members. Although it is not a rigorously representative sample of all Data Science and ML professionals worldwide in the sociological sense, it still constitutes a significant fraction of the practitioners in the field. The survey results can therefore give a reasonable indication of where the Data Science and AI/ML industry is likely to head in the next couple of years.

Who Are You, Mr. and Mrs. Data Scientists?

As the survey results show, the shape of the Data Science and ML industry was shifting in 2020. This shift is expected to continue through 2021–2022.

The industry is becoming larger and younger. The number of practitioners is growing, especially as more juniors (such as students and fresh graduates) choose a path in Data Science and ML.

The percentage of women contributing to the industry also grew in 2020 compared with 2018 and 2019, and this trend is likely to continue in 2021–2022.

At the same time, the survey shows that the average experience level of the participants is relatively low. The majority of the respondents fell into the following clusters:

  • Professionals with less than 1 year of programming and less than 1 year of ML experience
  • Professionals with 1–2 years of programming and less than 1 year of ML experience
  • Professionals with 3–5 years of programming and less than 1 year of ML experience
  • Professionals with 1–2 years of programming and 1–2 years of ML experience
  • Professionals with 3–5 years of programming and 1–2 years of ML experience
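
Clusters like the ones above come from counting how often each combination of programming experience and ML experience occurs among the respondents. The following is a minimal sketch of that counting step, not the code used for this article: the `responses` list and the answer labels are invented for illustration, and the function name `top_clusters` is hypothetical. On the real survey export, a cross-tabulation such as `pandas.crosstab` would do the same job.

```python
from collections import Counter

# Hypothetical, hand-made answers: (programming experience, ML experience).
# These are NOT survey data; the real export has its own columns and labels.
responses = [
    ("< 1 year", "< 1 year"),
    ("1-2 years", "< 1 year"),
    ("1-2 years", "< 1 year"),
    ("3-5 years", "1-2 years"),
    ("1-2 years", "1-2 years"),
    ("< 1 year", "< 1 year"),
    ("1-2 years", "< 1 year"),
    ("3-5 years", "< 1 year"),
]


def top_clusters(pairs, n=3):
    """Return the n most frequent (programming, ML) experience combinations."""
    return Counter(pairs).most_common(n)


clusters = top_clusters(responses)
for (prog, ml), count in clusters:
    print(f"{prog} programming / {ml} ML: {count} respondents")
```

Each respondent contributes exactly one pair, so the counts add up to the number of respondents and the largest pairs are the 'clusters' discussed in the text.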

This indicates that the industry is actively expanding, and 2020–2022 are going to be years of further growth. Today’s students and junior professionals will gain the relevant experience over the next 5 years or so, and that will change the landscape of the Data Science and ML fields.

Where Are You, Mr. and Mrs. Data Scientists?

In terms of the geography, we see that

  • The largest group of survey respondents resides in India
  • The US holds second place in the country ranking
  • The next places in the top list are held by Japan, Brazil, UK, China, Russia, and Nigeria, respectively
  • Developed nations in the Old World (UK; EU nations like Germany, France, Spain, Italy; Turkey), Canada and Australia are well behind the countries at the top of the list
  • At the same time, some of the developing nations (Indonesia, Mexico, Pakistan) and Taiwan are on a par with the developed nations in the Old World that are not in the top rank list
  • South Korea and China are surprisingly under-represented among the survey participants

Note: The under-representation of Data Scientists and ML professionals from China on Kaggle may reflect a kind of ‘isolationism’: such professionals tend to contribute to their local Data Science platforms instead.

We can also see clear ‘challenger’ nations in terms of the growing number of Data Science and ML professionals (Brazil, Nigeria). The professionals there are not yet as experienced as their counterparts in the US, the Old World, and India. As they gain experience, these countries are likely to bring new AI and ML innovations to the global scene.

Age and Level of Education

The four biggest Kaggle population clusters by education level and age are as follows:

  • People aged 18–21 with a Bachelor’s degree (by far the biggest cluster)
  • People aged 25–29 with a Master’s degree
  • People aged 22–24 with a Bachelor’s degree
  • People aged 22–24 with a Master’s degree

In addition to the clusters above, we can draw the following insights:

  • People with a Bachelor’s degree predominate in the younger age groups
  • In the older age groups, the percentage of Master’s and Doctoral degrees grows
  • People with a Master’s degree predominate in the age groups of 25–29 and older
  • From the 35–39 age group onwards, the number of respondents with a Doctoral degree is only slightly lower than the number with a Master’s degree in the same age group
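
Statements such as 'the percentage of Master’s and Doctoral degrees grows in older age groups' compare shares within each age group rather than raw counts, because the age groups differ greatly in size. The sketch below shows one way to compute such within-group shares. The `respondents` list is invented for illustration and `degree_shares` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter, defaultdict

# Hypothetical (age group, degree) pairs, invented for illustration only.
respondents = [
    ("18-21", "Bachelor's"), ("18-21", "Bachelor's"), ("18-21", "Master's"),
    ("25-29", "Master's"), ("25-29", "Master's"), ("25-29", "Bachelor's"),
    ("25-29", "Doctoral"),
    ("35-39", "Master's"), ("35-39", "Doctoral"),
]


def degree_shares(pairs):
    """Map each age group to {degree: share of respondents in that age group}."""
    counts = defaultdict(Counter)
    for age, degree in pairs:
        counts[age][degree] += 1
    return {
        age: {deg: n / sum(c.values()) for deg, n in c.items()}
        for age, c in counts.items()
    }


shares = degree_shares(respondents)
for age, dist in shares.items():
    summary = ", ".join(f"{deg}: {share:.0%}" for deg, share in dist.items())
    print(f"{age}: {summary}")
```

Normalizing within each age group makes a small group (such as 35–39) directly comparable with a large one (such as 18–21).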

Gender and Level of Education

The top clusters of survey respondents by gender and level of education are as follows:

  • Males with Bachelor’s degree
  • Males with Master’s degree
  • Males with Doctoral degree
  • Females with Bachelor’s degree
  • Females with Master’s degree

Bachelor’s and Master’s degrees are the most common levels of education for every gender.

Level of Education and Programming Experience

The four top clusters of survey respondents by education and programming experience are as follows:

  • People with Bachelor’s degree having 1–2 years of programming experience
  • People with Master’s degree having 3–5 years of programming experience
  • People with Bachelor’s degree having 3–5 years of programming experience
  • People with Master’s degree having 1–2 years of programming experience

For the more experienced categories (5+ years of programming experience), the share of people with a Master’s degree exceeds the share with a Bachelor’s degree.

The percentage of people with a Doctoral degree is higher in the most experienced categories (programming experience of 10–20 years and 20+ years).

Where Are the Industry Veterans?

The location of the industry veterans (having 20+ years of programming experience and/or having 10+ years of ML experience) does not fully correspond to the countries with the biggest number of professionals in the field.

The largest number of programming veterans among the survey respondents resides in the US. All other countries are well below the bar set by the US.

However, several countries still stand out in terms of the number of programming veterans who participated in the survey. These are:

  • Japan
  • UK
  • Brazil
  • India

Although India has the largest population of Kagglers among the survey respondents, it does not have a comparably large pool of experts with 20+ years of programming experience relative to the other leading countries.

The US also clearly dominates the rest of the world in the number of survey participants with 10+ years of ML experience.

Organizational Environment and Enterprise-Scale Data Science/ML

How Data Science/ML functions are implemented in a modern organization is largely determined by (1) the organizational environment and (2) the job responsibilities of the professionals involved.

The survey offers essential insights into how the following factors affect corporate Data Science/ML processes:

  • Size of the employer organization
  • Size of the team working in Data Science-related areas
  • Adoption of ML in the organization
  • Individual job responsibilities of the survey participants
  • Spending on ML and Cloud Computing in the last 5 years

Organization Size, Data Science and ML Adoption

The largest cluster of survey respondents works for small organizations (0–49 employees). Within that bucket, most organizations have 1–2, 0, or 3–4 employees responsible for data science workloads.

Most of these organizations are still exploring ML methods. However, some of them

  • have recently started using ML methods, or
  • do not currently use ML

Small organizations with no workers dedicated to data science workloads mostly do not use ML in their daily operations (as one would expect).

Interestingly, a small subcluster of small organizations with 5+ employees dedicated to data science workloads reports mature use of ML solutions in production.

The second and third largest clusters of survey respondents are those who work for

  • organizations with 10000+ employees
  • organizations with 1000–9999 employees

In these clusters, quite a lot of respondents indicate that their organizations have well-established ML methods.

Overall, irrespective of the organization size, a Data Science team of 5+ workers seems to indicate a certain level of ML maturity within an organization. In every size group, respondents from organizations with 5+ workers dedicated to Data Science activities reported that their organization had either recently started using ML in production or had mature ML methods established.

ML Spending

The charts show that the most popular answer on ML spending is USD 0, regardless of the organization size. This may indicate inaccuracies in the respondents’ answers to the respective survey question (or some respondents may have deliberately given false answers for some reason).

Still, looking at the non-zero responses, it appears that

  • large organizations tend to spend more on ML (USD 100,000+ in the last 5 years)
  • smaller organizations spent USD 100–999 or USD 1,000–9,999 on ML in the last 5 years
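
Looking only at the 'meaningful' responses amounts to dropping the zero-spending answers first and then finding the most common remaining bucket for each organization size. A minimal sketch of that step is shown below. The `answers` list, the simplified bucket labels, and the helper `most_common_nonzero` are all assumptions made for illustration and do not reproduce the survey’s actual labels or data.

```python
from collections import Counter, defaultdict

# Hypothetical (organization size, ML spending bucket in USD) answers.
# Labels are simplified for the example and are not the survey's exact wording.
answers = [
    ("0-49", "0"),
    ("0-49", "0"),
    ("0-49", "0"),
    ("0-49", "100-999"),
    ("0-49", "100-999"),
    ("0-49", "1000-9999"),
    ("10000+", "0"),
    ("10000+", "0"),
    ("10000+", "0"),
    ("10000+", "100000+"),
    ("10000+", "100000+"),
    ("10000+", "10000-99999"),
]

ZERO_BUCKET = "0"


def most_common_nonzero(pairs):
    """For each organization size, return the most frequent non-zero bucket."""
    by_size = defaultdict(Counter)
    for size, bucket in pairs:
        if bucket != ZERO_BUCKET:  # drop the suspicious zero-spending answers
            by_size[size][bucket] += 1
    return {size: c.most_common(1)[0][0] for size, c in by_size.items()}


typical_spend = most_common_nonzero(answers)
print(typical_spend)
```

In this toy data, zero is the most common answer in both size groups, yet the filtered view still separates small and large organizations, which mirrors the reasoning in the text.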

Will the AutoML Age Commence Soon?

In this section, we look at how Data Science automation tools are used in the industry. It covers findings on

  • the usage of automated machine learning tools
  • the usage of tools to help manage machine learning experiments

Using AutoML Tools: By-Purpose View

We find that

  • the vast majority of the survey participants do not use any AutoML tools in their daily routines
  • automated model selection tools are the most popular type of AutoML tools used by Kagglers at the moment
  • automation tools for selecting a neural network architecture are neither widely used nor well known
  • Data Scientists are the primary users of Auto ML tools (if in use)
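
Questions about tool usage are multiple-choice, so a respondent can tick several options at once. The sketch below assumes that each option is stored in its own field, which stays empty when the option was not ticked. That layout, the field names, and the helper `option_counts` are assumptions for illustration, and the rows are invented. If the data were loaded into pandas, `DataFrame.notna().sum()` would give the same per-option counts.

```python
from collections import Counter

# Hypothetical rows: one key per answer option, None when it was not ticked.
rows = [
    {"model_selection": "Automated model selection", "architecture_search": None, "no_automl": None},
    {"model_selection": None, "architecture_search": None, "no_automl": "No / None"},
    {"model_selection": "Automated model selection",
     "architecture_search": "Automated model architecture searches", "no_automl": None},
    {"model_selection": None, "architecture_search": None, "no_automl": "No / None"},
    {"model_selection": None, "architecture_search": None, "no_automl": "No / None"},
]


def option_counts(records):
    """Count how many respondents ticked each option of a multi-select question."""
    counts = Counter()
    for record in records:
        for option, value in record.items():
            if value is not None:
                counts[option] += 1
    return counts


counts = option_counts(rows)
for option, n in counts.most_common():
    print(f"{option}: {n} of {len(rows)} respondents")
```

Because one respondent can tick several options, the per-option counts can add up to more than the number of respondents, so they are best read as 'share of respondents who use X'.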

Using AutoML Tools By Purpose and Size of Employer Organization

We find that

  • Large and small organizations are equally represented in the cluster of organizations where AutoML tools are not currently used
  • For each type of AutoML tool, most usage is observed in small organizations (which also reflects the fact that most of the survey respondents work for such organizations)

Using AutoML Tools By Purpose and Size of Data Science Team in Organizations

We find that

  • Organizations with both large and small data science teams are represented in the cluster of organizations where AutoML tools are not currently used
  • Where AutoML tools are used, most usage is observed in organizations whose Data Science team has 1–2, 3–4, or 20+ workers

Utilizing AutoML Tools By Product and User Occupation

In addition to the insights above, we find that

  • For every AutoML product surveyed, professionals with 1–2, 3–5, 5–10, and 10–20 years of programming experience are equally represented among the users
  • Professionals with 20+ years of programming experience are less inclined to use AutoML products

Utilizing AutoML Tools By Product, Size of Organization, and DS Team Capacity

We find that

  • Small organizations (0–49 employees) are more inclined to experiment with Auto-Sklearn and Auto-Keras
  • For larger organizations, the biggest cluster is always the subset of organizations that do not use any AutoML, regardless of the size of their Data Science team

Utilization of Tools to Manage DS Experiments

We find that

  • Most of the survey respondents do not use any tools to help manage data science experiments (so the market for such tools is far from saturated)
  • TensorBoard is the leading tool in use at the moment
  • Data Scientists, Software Engineers, and Data Analysts are the top occupations of the users of such tools

Knowledge and Information Sharing Channels

Sir Francis Bacon

Knowledge is power, as Sir Francis Bacon once said. In addition, information has become both power and the ‘new oil’ of the modern post-industrial age.

It is therefore particularly interesting to see how Kagglers work with information and how they share and obtain professional knowledge.

In this section, we are going to review the preferences of the survey participants regarding

  • Platforms and tools to publicly share or deploy their data analysis or machine learning applications
  • Platforms to take online data science courses
  • Primary tools they use to analyze data
  • Favorite media sources that report on data science topics

Public Sharing Platform Preferences by Occupation

We find that

  • GitHub is the primary choice for Kagglers to share their work publicly
  • Kaggle itself and Colab take the next places in the top 3 list of platforms for publicly sharing work
  • Good work-sharing tools like Streamlit and Plotly Dash are under-utilized by the community
  • Quite a large fraction of the survey respondents indicated they do not share any work publicly
  • The percentage of Business Analysts and people with an ‘Other’ occupation who do not share their work publicly is higher than in the rest of the occupations

Public Sharing Platform Preferences by Occupation and Programming Experience

In addition to the insights above, we find that

  • GitHub and Kaggle are the preferred sharing platforms for every occupation and programming experience group
  • The percentage of professionals not sharing their work publicly is higher among the more experienced groups (people with 10–20 and 20+ years of programming experience), who replied with ‘None’ more frequently than members of other experience groups

Kaggle vs. Colab Usage

After acquiring Kaggle, Google came to own two similar technologies that let data scientists work in the cloud and share their results publicly at a reasonable cost or even for free: Kaggle notebooks and Colab notebooks. Both are used by industry professionals now.

We can see that

  • Kaggle is used more actively by males in the majority of occupations, levels of education, and programming experience levels
  • The most noticeable difference is observed for male Data Scientists with a Master’s or Bachelor’s degree and between 1 and 10 years of programming experience
  • In a sense, Kaggle notebooks and Colab can be seen as competing for the same target audience

Given these data-driven insights, we can assume that Google could benefit from repositioning Colab, both functionally and in its marketing, to attract more paying users.

Online Training Platform Preferences by Occupation

We find that

  • Coursera, Kaggle Learn Courses, and Udemy are the top 3 preferred learning platforms (with Coursera being far more popular than the other two)
  • University courses leading to a formal university degree are also quite popular (they rank 4th in the list)
  • edX is behind its primary e-learning platform rivals
  • Only a minor fraction of respondents indicate they have not begun or completed any data science courses (a healthy sign that the data science community here at Kaggle has a good attitude to learning and self-learning)

Online Learning Platform Preferences by Occupation and Programming Experience

In addition to the insights above, we find that

  • Coursera is the top platform of choice for every occupation and programming experience group
  • Kaggle Learn Courses are more popular with junior professionals (those with <1 or 1–2 years of professional experience)
  • For more senior professionals (with 3+ years of experience), Udemy is the second platform of choice (after Coursera), and they do not actively use Kaggle Learn Courses

Primary Data Analysis Tool Preferences by Occupation

We find that

  • Local IDEs are the primary choice of the majority of the survey respondents (which points to quite good technical skills among those respondents)
  • Basic statistical software is the second preference of the respondents
  • Data Scientists do not often use basic statistical software, as opposed to the other analysis tool options investigated
  • The rest of the tool types lag far behind local IDEs and basic statistical software

Primary Data Analysis Tool Preferences by Occupation and Programming Experience

We can find that

  • Technically savvy occupations are more inclined to use local IDEs as their primary ‘data analysis’ tool, whereas less technical occupations (like Business Analysts) tend to use basic statistical tools more
  • Professionals with little programming experience (<1 year or none at all) tend to choose basic statistical software, whereas professionals with 2+ years of programming experience are comfortable using local IDEs for their data analysis needs

Favorite Media Sources by Occupation

We find that

  • Kaggle, as a source of valuable Data Science-related information, is the top choice of the majority of the survey respondents, and it strongly outperforms the other media sources investigated
  • YouTube content comes second
  • popular Data Science blogs take the third place in the rank
  • other media sources are well below the top three media listed above

Favorite Media Sources by Occupation and Programming Experience

In addition to insights above, we find that

  • In almost every occupation, people with less than 10 years of programming experience are fans of Kaggle as their primary information source
  • People with 10+ years of experience tend to rank other information sources as their primary go-to medium, with blogs being one of the popular ones
  • Junior Business Analysts (with 2 or fewer years of programming experience) rank YouTube as their primary go-to medium, whereas more senior Business Analysts prefer blog posts and Kaggle
  • Statisticians with 20+ years of experience prefer journal papers and publications, while Statisticians with fewer years of programming experience look to blogs and Kaggle as their go-to choices
  • Product/Project managers rank Kaggle as their go-to information source, regardless of their programming experience

Summary

In this post, we tried to draw a portrait of the Data Science and ML industry in the upcoming years (2021–2022). We can see a lot of interesting trends that will shape the industry landscape in the next two years.

The industry is proliferating, and its geography is expanding. The rise of interest in AutoML tools, along with a more pragmatic approach to them (AutoML as a tool, not as a Holy Grail that tackles every ML problem without a human expert involved), will certainly speed up the evolution of the industry.

Overall, Data Science and ML will remain one of the sexiest occupational fields in the next two years, despite the voices of sceptics.

References

This is the second article in my series about the insights from the data collected in Kaggle’s survey of ‘State of Data Science and Machine Learning 2020’ (https://www.kaggle.com/c/kaggle-survey-2020).

The first article in the series, “Cloud Computing, Data Science and ML Trends in 2020–2022: The battle of giants”, is available at https://medium.com/sbc-group-blog/cloud-computing-and-data-science-and-ml-trends-in-2020-2022-the-battle-of-giants-c2a174d3cd2b

Note: you can check the repo at https://github.com/gvyshnya/state-of-data-science-and-ml-2020 to see how every insight above was derived.

George Vyshnya
SBC Group Blog

Seasoned Data Scientist / Software Developer with blended experience in software development, IT, DevOps, PM and C-level roles. CTO at http://sbc-group.pl