Exploratory Data Analysis on Kaggle Machine Learning & Data Science Survey 2018

Simple but useful insights from the machine learning and data science survey conducted by Kaggle

Dec 15, 2018

Introduction:
This exploratory data analysis is based on data from the survey on machine learning and data science that Kaggle conducted in 2018. It is Kaggle’s second annual Machine Learning and Data Science Survey. The data set published on Kaggle contains 23859 responses from 147 countries and territories. I would like to thank Kaggle for this survey, which is extremely useful for understanding the data science and machine learning community and industry across the world.

More about the data can be learned from here. All the results shown in this article were computed on this data set.

Some important details about the data:

The data set contains 23859 rows and 395 columns.
The survey asked 50 questions in total.
The median response time for those who participated in the survey was 15–20 minutes.

Let’s start analyzing this data set about what Harvard Business Review called the “Sexiest job of the 21st century”.

First import the required libraries, then load the data set. I used a Jupyter notebook for all the analysis here.
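A minimal sketch of the loading step, assuming pandas and assuming the responses come as a CSV whose first data row repeats the full question wording under short column codes such as Q1, Q2, Q3. The helper name `load_responses` and the toy rows below are made up for illustration; they are not survey data.

```python
import io

import pandas as pd


def load_responses(source):
    """Read the survey CSV and split off the row that holds the question text.

    `source` can be a file path or any file-like object pandas can read.
    """
    # Read everything as text so mixed-type columns do not trigger type guessing.
    raw = pd.read_csv(source, dtype=str)
    # Assumption: the first data row contains the wording of each question.
    questions = raw.iloc[0]
    responses = raw.iloc[1:].reset_index(drop=True)
    return responses, questions


# Toy stand-in for the real file (hypothetical values).
toy_csv = io.StringIO(
    "Q1,Q2,Q3\n"
    "What is your gender?,What is your age?,Where do you live?\n"
    "Male,25-29,India\n"
    "Female,22-24,United States of America\n"
    "Male,30-34,India\n"
)

responses, questions = load_responses(toy_csv)
print(responses.shape)  # (3, 3)
print(questions["Q1"])  # What is your gender?
```

With the downloaded survey file, the same helper would be called with the path to that CSV instead of the in-memory toy text, and `responses.shape` is a quick way to check the row and column counts quoted above.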

More females should join the data science domain:
There may be many cases in medical science, health and other sectors across industries where female data scientists might be more suitable. So it’s our responsibility to spread that message so that more women join the data science domain in the near future.

The young generation is leading the data science industry:
People aged 20–35 are the primary human resources for this booming industry. There are also many experienced data scientists who constantly support this young generation in becoming more successful data scientists.
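Distributions for single-answer questions such as gender, age or country can be summarized as percentage shares. The sketch below assumes pandas and uses a made-up list of age brackets rather than the survey column; `answer_shares` is a hypothetical helper name.

```python
import pandas as pd


def answer_shares(answers: pd.Series) -> pd.Series:
    """Percentage of respondents per answer, largest share first.

    Missing answers are dropped before the shares are computed.
    """
    return (answers.value_counts(normalize=True) * 100).round(1)


# Toy answers (hypothetical, not survey data); None marks a skipped question.
toy_ages = pd.Series(["25-29", "22-24", "25-29", "30-34", "25-29", None, "22-24"])

shares = answer_shares(toy_ages)
print(shares)
```

On the real data, the same function would be applied to the column holding the question of interest, for example the age or country column.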

USA and India are the top two countries in this domain:
Respondents from 147 countries and territories took part in this machine learning and data science survey. Interestingly, the largest numbers of respondents are from the USA and India, followed by China.

Highest level of formal education:
Most respondents have a Master’s degree, followed by a Bachelor’s degree.

Engineers are dominating the industry:
The data science/machine learning industry and community is dominated by engineers, followed by mathematicians and statisticians.

Data Science/Machine Learning is a new and attractive field of research:
Most respondents have between 0 and 10 years of experience. So there are a lot of opportunities to explore in the field of data science and machine learning.

Important role at work:
The two most important roles at work are “Analyze and understand data to influence product or business decisions”, followed by “Build prototypes to explore applying machine learning to new areas”.

Primary tool used at work/school to analyze data:
Local or hosted development environments (RStudio, JupyterLab, etc.), followed by basic statistical software (Microsoft Excel, Google Sheets, etc.), are the most used tools within the data science community and industry.

Most used IDE at work/school in the last 5 years:
Jupyter Notebook and RStudio are the most used IDEs in the field of data science and machine learning. Notepad++ is also very popular as a text editor in the community.

Most used hosted notebook at work/school in the last 5 years:
Kaggle Kernels, followed by JupyterHub and Google Colab, are the most used hosted notebooks in the field of data science and machine learning. Many people also don’t use any hosted notebooks.

Most used cloud computing services at work/school in the last 5 years:
Amazon Web Services, followed by Google Cloud Platform and Microsoft Azure, are the most used cloud computing services in the field of data science and machine learning. Many people also don’t use any cloud computing services.

Most used programming language on a regular basis:
Python, followed by SQL and R, are the most used programming languages on a regular basis in the field of data science and machine learning.
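Questions where respondents can tick several options need a different count than single-answer questions. The sketch below assumes the common layout where each option gets its own column named like `Q16_Part_1`, `Q16_Part_2`, and a cell holds the option text when it was selected and is empty otherwise. The column names, the helper `multi_select_counts` and the toy rows are assumptions for illustration only.

```python
import pandas as pd


def multi_select_counts(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Count how many respondents selected each option of a multi-select question."""
    option_columns = [c for c in df.columns if c.startswith(prefix + "_Part_")]
    counts = {}
    for column in option_columns:
        selected = df[column].dropna()
        if selected.empty:
            continue  # nobody picked this option
        # Every non-empty cell in the column repeats the same option label.
        counts[selected.iloc[0]] = len(selected)
    return pd.Series(counts).sort_values(ascending=False)


# Toy responses (hypothetical): four respondents, three language options.
toy = pd.DataFrame(
    {
        "Q16_Part_1": ["Python", "Python", None, "Python"],
        "Q16_Part_2": ["R", None, None, None],
        "Q16_Part_3": [None, "SQL", None, "SQL"],
        "Q17": ["Python", "Python", "R", "SQL"],  # single-answer column, ignored
    }
)

language_counts = multi_select_counts(toy, "Q16")
print(language_counts)
```

The same helper would cover the other "most used" questions in this article (IDEs, hosted notebooks, cloud services, frameworks, visualization libraries) by changing the prefix.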

Most used specific programming language:
Python is the most used programming language in the field of data science and machine learning.

Most recommended programming language to an aspiring data scientist:
Python is the language most often recommended for an aspiring data scientist to learn first.

Most used machine learning frameworks in the past 5 years:
There are many popular machine learning frameworks. Of these, Scikit-Learn, TensorFlow and Keras are the most used, followed by randomForest, XGBoost and PyTorch.

Most used machine learning library:
Scikit-Learn, TensorFlow and Keras are the most used machine learning libraries.

Most used data visualization libraries/tools in the past 5 years:
Matplotlib, Seaborn and ggplot2 are the most used data visualization libraries/tools in the field of data science and machine learning.

Most used specific data visualization library/tool:
Matplotlib is the most used data visualization library/tool in the field of data science and machine learning.

Coding is important as well:
Coding is one of the most important factors in the field of data science and machine learning. Without coding, it is not easy to perform a high-level data science or machine learning task. Coding is also not that tough as far as machine learning and data science are concerned (though it certainly varies from person to person).

Apple CEO Tim Cook said “it is more important to learn how to code than it is to learn English as a second language”. So you can imagine the importance of coding.

Machine learning is now one of the hottest research areas:
Most respondents started using machine learning methods within the last 4–5 years. So people are gradually using more machine learning methods in their work.

Most used cloud computing products at work/school in the last 5 years:
Amazon Web Services Elastic Compute Cloud (EC2) and Google Compute Engine are the most used cloud computing products in the field of data science and machine learning, although many people have never used one.

Most used relational database products at work/school in the last 5 years:
MySQL is the most used relational database product in the field of data science and machine learning, followed by PostgreSQL, SQLite, Microsoft SQL Server and Oracle Database.

Most used big data and analytics products at work/school in the last 5 years:
There are several big data and analytics products available in the market, but only a few people use them. Of the available products, Google BigQuery is the most used in the field of data science and machine learning, followed by AWS Redshift, Databricks, AWS Elastic MapReduce and Teradata.

Data that you interact most often at work/school:
Numerical data, text data, categorical data, time series data and tabular data are the data types most frequently used by the data science and machine learning community.

Sources of public data sets:
Kaggle, Google and GitHub are the platforms most commonly used by the data science community to search for public data sets.

Time taken by different phases of a typical data science project:
Data cleaning and modeling take most of the time in a typical data science project.
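A rough sketch of how such a time breakdown could be computed, assuming each project phase has its own column holding the percentage of time a respondent reported for it. The column codes, the phase labels attached to them, the helper `mean_time_share` and the toy numbers are all assumptions for illustration, not values from the survey.

```python
import pandas as pd


def mean_time_share(df: pd.DataFrame, phase_columns: dict) -> pd.Series:
    """Average reported percentage of project time per phase, largest first.

    `phase_columns` maps a column name to a readable phase label.
    """
    # Values may be stored as text; anything non-numeric becomes NaN and is skipped.
    numeric = df[list(phase_columns)].apply(pd.to_numeric, errors="coerce")
    return numeric.mean().rename(index=phase_columns).sort_values(ascending=False)


# Hypothetical mapping from column codes to phases.
phase_columns = {
    "Q34_Part_1": "Gathering data",
    "Q34_Part_2": "Cleaning data",
    "Q34_Part_4": "Model building",
}

# Toy responses (hypothetical percentages stored as text, with gaps).
toy = pd.DataFrame(
    {
        "Q34_Part_1": ["20", "10", None],
        "Q34_Part_2": ["30", "40", "50"],
        "Q34_Part_4": ["25", "35", "n/a"],
    }
)

time_shares = mean_time_share(toy, phase_columns)
print(time_shares)
```

Averaging ignores respondents who skipped a phase, so the phase means need not add up to exactly 100.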

Different types of machine learning/data science training:
Most respondents use online courses to learn machine learning and data science. Many are also self-taught or learn through work.

Most popular online platforms to explore data science:
Coursera is the most popular online platform for exploring data science, followed by Udemy and DataCamp.

Online platform on which you have spent the most time:
Again, Coursera is the most popular.

Favorite media sources to explore data science:
Kaggle forums and Medium blog posts are the most popular media sources. So I would like to thank the Kaggle and Medium communities for their service to all data science enthusiasts.

Quality of online learning platforms as compared to the quality of education provided by traditional institutions:
Most respondents consider online learning platforms better than traditional institutions for learning data science and machine learning.

Quality of in-person boot-camps as compared to the quality of education provided by traditional institutions:
Most respondents consider in-person boot-camps better than traditional institutions for learning data science and machine learning.

Independent projects vs. academic achievements:
Independent projects are one of the most important ways to showcase your expertise. Academic achievements are also important, but in many cases they are not mandatory for showcasing your knowledge and skills.

More machine learning jobs in the future:
Machine learning is a buzzword of the 21st century. Based on this exploratory data analysis, it is clear that only a handful of groups are using machine learning in their business. If more industries start using machine learning in real-world production in the near future, more machine learning jobs will follow as industries gradually set up their machine learning infrastructure.

Written by Tinku Das