Welcome, 2022🎉. What Has Changed in Data Science in 2021?

Hasan Basri Akçay
Published in DataBulls
Jan 10, 2022 · 7 min read

A look at the leading data science tools, methods, and techniques: cloud computing products, automated ML tools, courses, IDEs, data products, data analysis tools, business intelligence tools, hardware, deployment, machine learning models (computer vision, natural language processing), visualization libraries, programming languages, and media sources for artificial intelligence.

Change and Chance (image by Lee Zanello)

Happy New Year, dear readers 🎉. I hope you have a good year. Now that 2021 is over, I looked at what changed in data science. This article covers those differences and what they may mean for AI in 2022.

The objective of this analysis is to explore how data science changed from one year to the next. To do that, we worked with two datasets, kaggle_survey_2020 and kaggle_survey_2021. The 2020 survey has 39+ questions and 20,036 responses; the 2021 survey has 42+ questions and 25,973 responses.

The article has five parts: Introduction, Data Preparation, Data Cleaning, Data Analysis, and Conclusion.

Introduction

In the introduction, we start by importing the libraries and the datasets. The shape of df20 is (20037, 355) and the shape of df21 is (25974, 369). Note: the first row of each dataset holds the question text, which is why each dataset has one more row than it has responses.
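As a rough illustration of that first-row layout, here is a minimal sketch that loads a tiny stand-in for a survey file with pandas. The two-response CSV text is made up for the example, and the variable names are assumptions rather than the notebook's actual code.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for a survey CSV file:
# a header row, a row holding the question text, then the responses.
csv_text = (
    "Q1,Q2\n"
    "What is your age (# years)?,What is your gender?\n"
    "22-24,Man\n"
    "18-21,Woman\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# The first data row holds the question text, not a response.
questions = df.iloc[0]
responses = df.iloc[1:].reset_index(drop=True)

print(df.shape)         # one row more than the number of responses
print(responses.shape)
```

Separating the question row early keeps it from being counted as an answer in later steps.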

Data Preparation

In this part, we created three functions that simplify the datasets.

In the datasets, some questions span more than one column, and the function group_cols groups the columns by question. For example, Q24 is a group on its own, while Q12_Part_1, Q12_Part_2, Q12_Part_3, and Q12_OTHER together form one group.
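One way such a grouping could work is sketched below. The function name comes from the article, but the body is an assumption: it simply groups column names by the text before the first underscore, which would merge sub-questions such as a hypothetical Q26_A and Q26_B into one group.

```python
from collections import defaultdict


def group_cols(columns):
    """Group column names by question prefix (assumed rule: text before the first '_')."""
    groups = defaultdict(list)
    for col in columns:
        question = col.split("_", 1)[0]
        groups[question].append(col)
    return dict(groups)


cols = ["Q24", "Q12_Part_1", "Q12_Part_2", "Q12_Part_3", "Q12_OTHER"]
groups = group_cols(cols)
print(groups)
```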

The function part_cols_convert converts a question that spans several columns into a single column. For instance, it converts Q12_Part_1, Q12_Part_2, Q12_Part_3, and Q12_OTHER into one Q12 column.
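A minimal sketch of this idea, assuming each part column holds either the selected answer or a missing value: the part columns are melted into long format, so every selected answer becomes one row of a single column. The signature and the sample data are illustrative assumptions, not the notebook's implementation.

```python
import pandas as pd


def part_cols_convert(df, question, part_cols):
    """Collapse the part columns of one question into a single Series of selected answers."""
    long = df[part_cols].melt(value_name=question, ignore_index=False)
    # Unselected choices are missing values, so dropping them keeps only real answers.
    return long[question].dropna()


# Hypothetical multi-select answers for three respondents.
df = pd.DataFrame({
    "Q12_Part_1": ["GPUs", None, "GPUs"],
    "Q12_Part_2": [None, "TPUs", "TPUs"],
    "Q12_OTHER": [None, None, None],
})

q12 = part_cols_convert(df, "Q12", list(df.columns))
print(q12.value_counts())
```

Because the original row index is kept, a respondent who selected two choices appears twice in the result, which is what a per-answer count needs.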

The last function, dict_preparation, matches the same question across 2020 and 2021. In the datasets, some questions have the same meaning but different wording, and we resolved those cases with manual correction. For example, Q12 is “Which types of specialized hardware do you use regularly? (Select all that apply) - Selected Choice - GPUs” in 2020 and “Which types of specialized hardware do you use on a regular basis? (Select all that apply) - Selected Choice - NVIDIA GPUs” in 2021.
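The matching step might look roughly like the sketch below: questions with identical text are matched automatically, and a hand-written dictionary covers the reworded ones. The signature, the `manual` parameter, and the column names in the example are assumptions made for illustration.

```python
def dict_preparation(questions_2020, questions_2021, manual=None):
    """Map 2021 column names to 2020 column names.

    Exact question-text matches are found automatically; `manual`
    (hypothetical parameter) supplies corrections for reworded questions.
    """
    text_to_col_2020 = {text: col for col, text in questions_2020.items()}
    mapping = {}
    for col_2021, text in questions_2021.items():
        if text in text_to_col_2020:
            mapping[col_2021] = text_to_col_2020[text]
    mapping.update(manual or {})
    return mapping


q20 = {
    "Q1": "What is your age (# years)?",
    "Q12_Part_1": "Which types of specialized hardware do you use regularly? "
                  "(Select all that apply) - Selected Choice - GPUs",
}
q21 = {
    "Q1": "What is your age (# years)?",
    "Q12_Part_1": "Which types of specialized hardware do you use on a regular basis? "
                  "(Select all that apply) - Selected Choice - NVIDIA GPUs",
}

automatic = dict_preparation(q20, q21)
corrected = dict_preparation(q20, q21, manual={"Q12_Part_1": "Q12_Part_1"})
print(automatic)
print(corrected)
```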

After all this preparation, we combined the 2020 and 2021 surveys with the function prepare_data.
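Combining the two years could be sketched as follows, assuming a mapping from 2021 column names to 2020 column names is already available. The function name is from the article; the body, the parameter names, and the sample columns are assumptions. A ‘years’ column is added so that the two surveys can be told apart later.

```python
import pandas as pd


def prepare_data(df20, df21, mapping_21_to_20):
    """Stack both surveys on their shared questions and tag each row with its year."""
    df20 = df20.assign(years=2020)
    df21 = df21.rename(columns=mapping_21_to_20).assign(years=2021)
    # Keep only the columns that exist in both years.
    shared = [c for c in df20.columns if c in df21.columns]
    return pd.concat([df20[shared], df21[shared]], ignore_index=True)


# Hypothetical one-response frames for each year.
df20 = pd.DataFrame({"Q1": ["22-24"], "Q5": ["Student"]})
df21 = pd.DataFrame({"Q1": ["18-21"], "Q5_new": ["Student"], "Q99": ["x"]})

combined = prepare_data(df20, df21, {"Q5_new": "Q5"})
print(combined)
```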

Data Cleaning

Data cleaning is one of the most important parts of data science, and like most datasets, this one needs it. From what I could see, in 2021 some answers were split, such as ‘Product/Project Manager’ into ‘Program/Project Manager’ and ‘Product Manager’, and some answers were corrected, such as PostgresSQL to PostgreSQL. In this step, we tried to match the answers that mean the same thing across the two years.
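A small sketch of this kind of answer matching with a replacement dictionary is shown below. The spelling fix follows the example above; folding the two 2021 job titles back into the 2020 label is an assumption about the direction of the merge, and the sample answers are made up.

```python
import pandas as pd

# Replacement rules: keys are the variants, values are the labels to keep.
answer_fixes = {
    "PostgresSQL": "PostgreSQL",
    # Assumed direction: map the split 2021 titles back to the 2020 label.
    "Program/Project Manager": "Product/Project Manager",
    "Product Manager": "Product/Project Manager",
}

answers = pd.Series(
    ["PostgresSQL", "MySQL", "PostgreSQL", "Product Manager", "Program/Project Manager"]
)
cleaned = answers.replace(answer_fixes)
print(cleaned.value_counts())
```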

Data Analysis

In this part, we plotted every question to compare the two years visually. We created two functions.

The function long_sentences_seperate handles visual formatting. For instance, if a question or an answer text is too long to plot, this function splits the text by inserting ‘\n’ into it.
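One simple way to do this is with the standard textwrap module, as in the sketch below. The function name is the article's; the `width` parameter and its default are assumptions.

```python
import textwrap


def long_sentences_seperate(text, width=40):
    """Insert newlines so that no line of the label is longer than `width` characters."""
    return "\n".join(textwrap.wrap(text, width=width))


question = "Which types of specialized hardware do you use on a regular basis?"
label = long_sentences_seperate(question, width=30)
print(label)
```

Breaking only at whitespace keeps words intact, so the wrapped text stays readable as an axis label or plot title.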

The barplot_all_cols function plots all of the selected columns, using the ‘years’ column for color.

DS_col = ['Q11', 'Q32', 'Q3', 'Q4', 'Q1', 'Q38', 'Q13', 'Q30', 'Q6', 'Q25', 'Q5', 'Q8']

barplot_all_cols(df_20_21_clean, question_mean_dict, DS_col)
barplot_all_cols(df_part_20_21_clean, question_mean_dict, df_part_20_21_clean.columns, figsize=(24, 192))
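The numbers behind one such year-colored bar plot can be sketched with a cross-tabulation of answers by year. This is not the barplot_all_cols implementation; the sample data is made up, and only the counting step is shown.

```python
import pandas as pd

# Hypothetical combined data: one answer column plus the 'years' column.
df = pd.DataFrame({
    "Q5": ["Student", "Data Scientist", "Student", "Student", "Data Scientist"],
    "years": [2020, 2020, 2021, 2021, 2021],
})

# Rows are answers, columns are years, values are response counts.
counts = pd.crosstab(df["Q5"], df["years"])
print(counts)

# With matplotlib installed, counts.plot.bar() would draw one bar per year for each answer.
```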

Conclusion

You can see the full Python code here 👉 Kaggle Notebook.

In 2021, usage of every cloud computing platform increased. The most popular platforms are AWS, GCP, and Microsoft Azure. Cloud computing will likely be used even more in 2022 ☁️.

Usage of business intelligence tools is also increasing. Microsoft Power BI rose from 462 to 790 and Tableau rose from 540 to 740. If this trend continues, Microsoft Power BI will be used more than Tableau in the future.
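To put these raw counts in context, the sketch below computes the relative growth of each tool and compares it with the growth in the number of survey responses (20,036 to 25,973). The helper name pct_change is an assumption; the counts are the ones quoted in this article.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


counts = {
    "Microsoft Power BI": (462, 790),
    "Tableau": (540, 740),
    "Survey responses": (20036, 25973),
}

growth = {name: round(pct_change(b, a), 1) for name, (b, a) in counts.items()}
for name, pct in growth.items():
    print(f"{name}: {pct}%")
```

Since the number of responses also grew, an answer count that rises more slowly than the survey itself is actually losing share.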

In 2020, the most common age group in data science was 22–24. Now it is 18–21. Welcome, young data scientists 👋. The 70+ group also grew, from 76 to 128, so the age range of data scientists is getting wider. On the other hand, the number of “I have never written code” answers decreased even though the 18–21 group grew in 2021. We can say that the age at which people start coding has dropped 🔥.

The number of data scientists increased all around the world. The largest increase was in China 🌍.

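For TPU usage, the answer “2–5 times” rose from 2012 to 3405, “6–25 times” rose from 424 to 947, and “25+ times” rose from 272 to 612. We can say that we will hear about TPUs much more in 2022. In the hardware question, GPU usage decreased from 8309 to 8035 while TPU usage increased from 960 to 3451. Part of the GPU decrease may come from the 2021 option being reworded from “GPUs” to “NVIDIA GPUs”. Even so, in 2022 TPUs could be used more than GPUs 📱.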

Non-graduates and bachelor’s degree holders increased, but professional doctorates decreased from 699 to 360, which is almost half. This may be an artifact of the Kaggle survey: perhaps data scientists with professional doctorates stopped using Kaggle 📚.

In general, usage of all data products increased, with the largest increase in MySQL. An earlier article (What Are The Differences Between Data Scientists That Earn 500💲 And 225.000💲 Yearly?) also argued that databases are very important for data scientists.

In general, all job titles grew, although Business Analyst and Statistician stayed roughly unchanged. There is also a new job title, Developer Relations/Advocacy 💼.

Among hosted notebook products, Binder/JupyterHub decreased from 2072 to 1770, while Kaggle Notebooks increased from 5991 to 9506 and Colab Notebooks increased from 6329 to 9792. The largest increase is in Google Cloud Notebooks (AI Platform / Vertex AI) 📓.

Usage of all data visualization libraries increased, and the largest increase is in Seaborn 📊, from 8821 to 12586. Usage of all IDEs also increased; the largest increase is in Visual Studio / Visual Studio Code, from 2445 to 14150, followed by Jupyter, from 11210 to 21720.

Among ML frameworks, TensorFlow, PyTorch, and XGBoost all increased, and the largest increases are in XGBoost (3935 to 5974), CatBoost (957 to 1512), and JAX (84 to 190). There are also two new options, Hugging Face and PyTorch Lightning. In computer vision, usage of all algorithm families increased, and the largest increases are in CNN (2003 to 2740) and GAN (1092 to 1492). In natural language processing (NLP), usage of all methods increased, and the largest increase is in BERT (1428 to 2351).

Another question asks which programming language respondents recommend. For this question, only Swift decreased, and the largest increase is in SQL. Among languages in regular use, Python 🐍 is the most popular, and the largest increase is in JavaScript, from 2995 to 4332.

In the question about the most important part of the work, the most common answer is analyzing and understanding data to influence product or business decisions, which rose from 6420 to 9107. 35 percent of data scientists gave this answer. The largest increase is in the “None of these activities” answer, which may point to a new role in data science that is not clearly defined yet.

Automated ML tools were mostly used for ML pipelines in 2021, and that will probably still be the case in 2022. The largest increases are in Databricks AutoML, from 948 to 1970, and Google Cloud AutoML, from 2839 to 5567. Google Cloud AutoML is also the most popular, and Amazon Sagemaker Autopilot and Azure Automated Machine Learning are new options among automated ML tools ⚙️.

In the question about courses, the largest increase is in Kaggle Learn Courses, but this figure is not reliable because the survey is run by Kaggle. Other notable platforms are cloud certification programs (AWS, Azure, GCP, etc.), which increased from 1076 to 1804, and LinkedIn Learning, which increased from 1617 to 2093. In general, spending on ML also increased 📚.

In the question about sharing or deployment, the most popular tool is GitHub, which increased from 3434 to 4586. The largest increases are in Streamlit (186 to 387), Kaggle (1878 to 3065), and Colab (1247 to 1848).

👋 Thanks for reading. If you enjoy my work, don’t forget to like it and follow me on Medium and LinkedIn. It will motivate me to offer more content to the Medium community! 😊

References:

[1]: Hasan Basri Akçay, 2021, https://www.kaggle.com/c/kaggle-survey-2021
[2]: Hasan Basri Akçay, 2020, https://www.kaggle.com/c/kaggle-survey-2020
[3]: Hasan Basri Akçay, 2021, https://www.kaggle.com/hasanbasriakcay
[4]: Hasan Basri Akçay, 2021, https://www.kaggle.com/hasanbasriakcay/what-has-changed-in-data-science/notebook
