Insights From Machine Learning and Data Science Survey 2020
Let’s learn about the current status of the data science field through data
Every year since 2017, Kaggle has conducted an industry-wide survey that presents a comprehensive view of the state of data science and machine learning. The 2020 survey was live for 3.5 weeks in October and received 20,036 responses from over 55 countries and diverse demographics. It covered a wide range of topics, from frequently used ML algorithms, frameworks, cloud platforms, and products to preferred programming languages and many others.
The age group of data science practitioners
Let’s start deriving insights from the survey with the age groups of the data science practitioners who responded.
The majority of data science practitioners are less than 30 years of age; this group constitutes more than 56% of the responses. The largest number of responses came from the [25–29] age group (20% of practitioners).
There are very few practitioners aged 70 or older (less than 0.5%). Very young data science aspirants are entering the field: the [18–21] age group constitutes more than 17% of the responses.
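As a rough illustration of how age-group shares like these can be derived, here is a minimal sketch using only the Python standard library. The `ages` list is hypothetical toy data and `percentage_shares` is a helper name made up for this example; neither comes from the survey or from the notebook linked at the end of this article.

```python
from collections import Counter


def percentage_shares(responses):
    """Return {category: percent of all responses}, rounded to 2 decimals."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {category: round(100 * n / total, 2) for category, n in counts.items()}


# Hypothetical toy answers to a single-choice age question (not real survey data).
ages = ["18-21", "22-24", "25-29", "25-29", "30-34",
        "18-21", "25-29", "40-44", "22-24", "25-29"]

shares = percentage_shares(ages)

# Share of respondents younger than 30: add up the three youngest groups.
under_30 = sum(shares[group] for group in ("18-21", "22-24", "25-29"))

print(shares)
print(f"Under 30: {under_30:.1f}%")
```

The same pattern applies to any single-choice question in the survey, such as job title.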
Job titles of data science practitioners
Students form the largest group of data science practitioners in the survey (about 27%), followed by Data Scientists (about 14%). A striking fact that can be observed from the above graph is that a significant number of aspirants are currently not employed (8.57%). These may be freshers or aspirants with very little experience in the field.
Very few practitioners hold the job titles of Statistician (1.5%) or DBA/Database Engineer (less than 1% of the responses).
Let’s ask some important questions from the perspective of data science methodology and answer them with the proper visualizations.
1. What is the coding experience of data science practitioners?
A significant number of aspirants are freshers (about 17%) having less than 1 year of coding experience.
More than 40% of data science practitioners have less than 3 years of coding experience. About 64% of practitioners have less than 5 years of coding experience.
At one end of the coding experience spectrum, about 7% of practitioners have more than 20 years of coding experience, while at the other end around 6% have never written code.
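Figures such as "less than 3 years" or "less than 5 years" are cumulative shares over ordered experience buckets. The sketch below shows one way to compute them. The bucket labels and counts are hypothetical, chosen only to make the arithmetic easy to follow, and counting the "never coded" bucket at the low end of the scale is an assumption of this sketch.

```python
from itertools import accumulate

# Ordered experience buckets with hypothetical respondent counts (illustrative only).
buckets = [
    ("never coded", 5),
    ("< 1 year", 20),
    ("1-2 years", 25),
    ("3-5 years", 20),
    ("5-10 years", 15),
    ("10-20 years", 10),
    ("20+ years", 5),
]

total = sum(count for _, count in buckets)

# Running total of respondents up to and including each bucket.
running = accumulate(count for _, count in buckets)

# Cumulative share: percent of respondents at or below each bucket.
cumulative = {
    label: round(100 * c / total, 1)
    for (label, _), c in zip(buckets, running)
}

for label, pct in cumulative.items():
    print(f"up to {label}: {pct}%")
```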
2. Which programming languages do data science practitioners use regularly?
The survey results show that Python is the language practitioners use most regularly, used by 34.13% of the respondents, followed by SQL, used by 16.56%.
Python and SQL dominate the preference scale, with a combined figure of more than 50% among those surveyed. R is 3rd in preference at 9.4%, followed by C++ at 8.41%.
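The language question lets each respondent pick several languages, so a percentage depends on the denominator: the total number of selections or the number of respondents. The article does not say which one its figures use, so the sketch below simply computes both on hypothetical toy answers to show how much they can differ.

```python
from collections import Counter

# Hypothetical multi-select answers: each respondent may list several languages.
answers = [
    ["Python", "SQL"],
    ["Python"],
    ["Python", "R", "SQL"],
    ["C++", "Python"],
    ["R"],
]

# Count every selection across all respondents.
counts = Counter(lang for picks in answers for lang in picks)
total_selections = sum(counts.values())

# Denominator 1: all selections (shares add up to roughly 100%).
share_of_selections = {
    lang: round(100 * n / total_selections, 1) for lang, n in counts.items()
}

# Denominator 2: number of respondents (shares can add up to more than 100%).
share_of_respondents = {
    lang: round(100 * n / len(answers), 1) for lang, n in counts.items()
}

print(share_of_selections)
print(share_of_respondents)
```

In this toy data Python accounts for 44.4% of all selections but is used by 80% of respondents, which is why it helps to state the denominator next to any figure from a multi-select question.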
3. Which integrated development environments (IDEs) do data science practitioners use regularly?
Of all the Integrated Development Environments (IDEs), Jupyter (JupyterLab, Jupyter Notebooks, etc.) is the most preferred among data science practitioners, with 27.46% of the respondents using it regularly, followed by VSCode with 14.39%.
PyCharm is used regularly by 12.49% of the practitioners followed by RStudio with 9.37%. MATLAB and Vim/Emacs are the least frequently used IDEs with less than 4% preference.
4. Approximately how many times have data science practitioners used a TPU (tensor processing unit)?
About 72% of data science practitioners have never used a TPU in their machine learning projects. This could be because TPUs entered the machine learning domain only recently and because this specialized hardware is used only in deep learning.
About 11.45% of practitioners have used a TPU only once.
Only 1.62% of practitioners have used TPUs more than 25 times.
5. Which data visualization libraries or tools do data science practitioners use regularly?
Matplotlib is the most popular visualization library, used by 34.04% of data science practitioners, followed by Seaborn, which is used by 24.59% of the respondents. Matplotlib and Seaborn dominate the preference scale, with a combined figure of about 59% among those surveyed.
Plotly and Ggplot are almost equally preferred, each by about 11.5% of respondents. Around 5% of practitioners do not use any visualization library.
6. Which machine learning frameworks do data science practitioners use regularly?
Among Python frameworks used by data science practitioners, Scikit-learn is the most preferred, used regularly by 26.41% of the respondents, followed by TensorFlow, which is used by 17.87% of the respondents surveyed. These two frameworks have a combined preference of more than 44% among the respondents.
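The combined figures quoted for the top two entries in the language, visualization, and framework questions are plain sums of the individual percentages. A short check of that arithmetic, using only the numbers stated in this article (the dictionary layout and key names are just a convenience for this sketch):

```python
# Percentages quoted in this article for the top two entries in three questions.
top_two = {
    "languages": {"Python": 34.13, "SQL": 16.56},
    "visualization": {"Matplotlib": 34.04, "Seaborn": 24.59},
    "ml_frameworks": {"Scikit-learn": 26.41, "TensorFlow": 17.87},
}

# Sum the two shares in each category, rounded to 2 decimals.
combined = {
    category: round(sum(shares.values()), 2)
    for category, shares in top_two.items()
}

for category, total in combined.items():
    print(f"{category}: {total:.2f}%")
```

The sums come to 50.69%, 58.63%, and 44.28%, matching the "more than 50%", "about 59%", and "more than 44%" statements in the text.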
Data science as a field is continuously evolving, with new tools entering it every now and then. If you are thinking of entering the field of data science, it's best to know about its current status.
The major insights that can be drawn from the above analysis are that very young aspirants are entering the field and that students form the largest group among them. Python is the most preferred programming language and Jupyter is the most preferred IDE of data science practitioners. Additionally, very few practitioners use specialized hardware such as TPUs in their machine learning projects.
Data science practitioners choose tools and frameworks based on the requirements of the task at hand.
You can find all the code in my GitHub Repo
Link to the Kaggle Notebook
LinkedIn profile: Ankit-Kumar-Saini