Where to start learning data science?
I gave two meetup presentations on this topic in the past few weeks. Both drew huge interest, with lots of people asking for the slides and further advice, so I thought the topic was worth a blog post.
The first question is: why would you want to become a data scientist?
Besides the widely repeated claim that this is the sexiest job of the twenty-first century, business demand in this field is clearly increasing. A McKinsey study predicts that “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Data science seems to be the perfect career choice: if you reach a medium level of expertise, you will get a job for sure.
The traditional way of learning a discipline is to study at a university. The bad news is that universities aren’t able to keep up with the fast growth in demand. Fewer than one-third of U.S. News & World Report’s Top 100 Global Universities offer degrees in data science, and the average cohort size for one of these data science programs is just 23 students.
The situation is even worse in Hungary. There is not a single university where you could earn a data science degree of any level. What could be the reason behind it? The lack of qualified teachers is partially caused by the lure of what tech firms offer. University professors and also talented students who keep up with the latest developments in data science and have applicable, up-to-date knowledge on the hottest techniques are continuously tempted by various industries with salaries and work environments that universities cannot compete with.
Every fast-evolving field, like coding or IT, faces similar issues. In Hungary, there is a severe shortage of primary school IT teachers with applicable expertise. And although most universities have IT and computer programming faculties, the knowledge their course materials aim to transfer to students is mostly obsolete. You can get the knowledge that was required several years ago, but not the skills you need on the job market today.
Let’s go back to data science. How high is the salary that is attractive enough to make the majority of the best students and professors leave universities? According to the O’Reilly survey, the median annual base salary in the US is $104,000. Multiplying this amount by the lower end of the predicted shortage, we get $104,000 × 140,000 = $14,560,000,000. This is how much the data science education industry is worth in the US alone. That is, this is the amount that companies would like to spend on hiring data scientists, but can’t because of the lack of experts. This 14.56 billion USD is more than 10% of Hungary’s GDP.
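The estimate above is easy to reproduce. The sketch below uses only the two figures quoted in this post (the O’Reilly median salary and the McKinsey shortage range); the variable names are mine, and treating every missing analyst as an unfilled data scientist position is the same simplification the text makes.

```python
# Figures quoted in the text; variable names are illustrative.
median_salary_usd = 104_000   # O'Reilly survey, median annual base salary (US)
shortage_low = 140_000        # McKinsey estimate, lower end
shortage_high = 190_000       # McKinsey estimate, upper end

# Salary budget that cannot be spent because the experts are missing.
unspent_low = median_salary_usd * shortage_low
unspent_high = median_salary_usd * shortage_high

print(f"${unspent_low:,}")    # $14,560,000,000
print(f"${unspent_high:,}")   # $19,760,000,000
```

With the upper end of the McKinsey range, the same calculation gives almost 20 billion USD.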
No wonder that smart people have already spotted the business value in this opportunity, and there are now several types of data science training offered by different sorts of organizations. Think on-site training, online training, retraining, and so on.
On-site training is usually held by experts who are actively working in business. This form of education provides more interaction, but the cost is usually high. On the other hand, there are several excellent online courses where you can get all the required knowledge for free or on a low budget. There are also less formal ways of keeping in touch with the latest news and the community: meetups and free or traditional conferences. Finally, the best way to test your skills is to participate in a data science competition, for example, at Kaggle or at DrivenData.
Balabit’s data science team has compiled a personal top list of the best online courses, which we highly recommend to anybody who wants to study data science. Maybe this collection will help reduce the shortage of data scientists.
R
If you are a novice at data analysis, statistics, and/or programming languages, then the best place to start is to learn R, which was designed explicitly for statistical computing. It is comfortable to use and has a huge community that answers questions on social media and implements every algorithm and method that is worth the effort. To support the learning of R, an awesome package called swirl was developed. The swirl webpage says: “The swirl R package makes it fun and easy to learn R programming and data science.” And I couldn’t agree more. Swirl provides an interactive way to learn R, with encouraging feedback after each line of code you write. It was addictive for me, and resulted in a flow experience that was measurable. (I was testing a heartbeat analysis tool at the time.)
If you want to take a further step and apply R for more complex analysis, a great collection guiding you through the main milestones of data science courses is available on Coursera: The Data Science Specialization. Video lectures, assignments, and forums provide you with strong and effective support to learn the course materials and become able to use the knowledge you acquire.
The icing on the R learning cake is the data.table package, which you will need if you work with larger datasets and need efficient data manipulation. A course prepared by the author of the package is available at DataCamp: Data Analysis in R, the data.table Way.
Python
If you would like to use a general programming language, you could start with Python. The basics of Python coding (no data analysis included) can be learned very quickly, within a few days on Codecademy.
Once you know the basic operations in Python, you can advance further and study NumPy and pandas, which are necessary if your purpose is to analyze data. A great, very logically structured course is available on Udemy: Data Analysis in Python with Pandas.
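To give a taste of what these two libraries add on top of plain Python, here is a minimal sketch. The toy data is invented for illustration and is not taken from any of the courses mentioned; it assumes numpy and pandas are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for this example.
courses = pd.DataFrame(
    {
        "language": ["R", "R", "Python", "Python", "Scala"],
        "hours": [10, 20, 8, 12, 30],
    }
)

# pandas: total and mean study hours per language.
summary = courses.groupby("language")["hours"].agg(["sum", "mean"])
print(summary)

# NumPy: the same column as a plain numeric array.
hours = courses["hours"].to_numpy()
overall_mean = float(np.mean(hours))
print(overall_mean)
```

The groupby call does in one line what would otherwise take an explicit loop and a dictionary, which is most of the reason these libraries are worth learning early.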
Scala
If you work with really large datasets that require distributed computation, then you might need Scala and/or Spark. Scala is a general-purpose language, particularly suitable if you need scalable software that makes use of concurrent and synchronous processing, parallel utilization of multiple cores, and distributed processing. Its functional nature makes it easier to write safe and performant multi-threaded code. You can learn the basics of Scala programming at Coursera: Functional Programming Principles in Scala. The logic of coding is very different from R or Python, so the exercises in the course are challenging, but at the same time they are fun to solve.
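Why does a functional style help with multi-threaded code? The idea does not depend on Scala, so here is a rough sketch in Python using only the standard library. The function and data are hypothetical; the point is that a pure function working on immutable input has no shared state for threads to corrupt, so no locks are needed.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(word: str) -> str:
    """Pure function: the result depends only on the argument."""
    return word.strip().lower()


# An immutable tuple: no thread can modify the input by accident.
words = ("  Scala", "PYTHON ", " r ")

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in input order, whichever thread finishes first.
    cleaned = tuple(pool.map(normalize, words))

print(cleaned)  # ('scala', 'python', 'r')
```

Scala encourages this style by default with immutable values and collections, whereas in Python it is a discipline you have to impose on yourself.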
Spark
Additionally, if you need distributed computation and you want to use R or Python, you could use Spark. Apache Spark is a large-scale data processing engine that you can use interactively from Scala, Python and R shells. edX provides a popular introductory course on using Spark: CS105x Introduction to Apache Spark.
Originally published at www.balabit.com on August 3, 2016 by Eszter Windhager-Pokol.