The term data science was coined and popularized a little more than half a decade ago. When I transitioned into the field from astronomy in 2013, there was a lot of confusion about what is required to be a Data Scientist. Data Science has evolved since then, and some themes have become persistent.
Product Data Science: This title varies across companies; in some places it is pure business analytics, while in others the role encompasses data modeling as well. The main skills for such a role are:
SQL: Being able to query data from tables is a primary requirement. There are many sites where you can learn more about SQL.
Apart from SQL, some basic knowledge of NoSQL databases such as MongoDB and DynamoDB can be useful, and dabbling with Spark and Hive also helps when working with big data. Taking database courses from MOOCs is worthwhile too. Beyond querying data, it is important to be able to slice and dice it, create charts, and present it so that business insights are easily gleaned.
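As a small taste of the query-then-summarize workflow, here is a sketch using Python's built-in sqlite3 module. The orders table, its columns, and the numbers are entirely made up for illustration; the GROUP BY is the kind of slice-and-dice aggregation described above.

```python
import sqlite3

# Hypothetical orders table, used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 20.0), (2, "east", 35.0), (1, "west", 15.0), (3, "east", 10.0)],
)

# Slice and dice: order count and revenue per region.
rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()
for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)
```

The same aggregation pattern carries over directly to production warehouses and to Hive or Spark SQL.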
This requires a good knowledge of statistics and an understanding of the business.
Excelling in the latter comes more with experience. Another important part of this role is building and evaluating A/B test (or multi-cell test) pipelines. Most businesses test new algorithms, designs, and product features on a regular basis, and it is the product data scientist's job to own this process. The product data science role has a lot of overlap with Business Intelligence.
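The statistical core of evaluating a simple A/B test is often a two-proportion z-test on conversion rates. Below is a minimal stdlib-only sketch; the conversion counts and sample sizes are invented for illustration, and real pipelines would add power analysis, multiple-comparison handling, and so on.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: control converts 200/5000, treatment 260/5000.
z, p = two_proportion_ztest(200, 5000, 260, 5000)
print(round(z, 2), round(p, 4))
```

In practice a library such as statsmodels would be used, but knowing what it computes under the hood is exactly the kind of understanding interviewers probe for.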
Knowing some Python/R is also necessary in this role.
Data Science for Algorithms: The title of this role varies a lot across companies. Some places hire only Machine Learning Engineers, typically people with strong computer science skills combined with Machine Learning knowledge. Other companies have the role of a Research Scientist in Machine Learning. The former gives more weight to CS algorithms than the latter.
For both roles it is very important to understand and be able to solve problems involving data structures, sorting, searching, recursion, dynamic programming, etc. really well. Some good resources for this are Cracking the Coding Interview, LeetCode, HackerRank, and similar sites. If you are transitioning to data science from a field like physics or astronomy, as I did, it makes sense to take an online course and watch videos of people solving specific problems to get the hang of it.
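As a flavor of what these interview problems look like, here is the classic two-sum exercise (a staple on the sites mentioned above), solved with a hash map to get linear time instead of the quadratic brute force:

```python
def two_sum(nums, target):
    """Return indices of two numbers summing to target, or None.

    A single pass with a hash map gives O(n) time instead of the
    O(n^2) brute force over all pairs.
    """
    seen = {}  # value -> index of where we saw it
    for i, x in enumerate(nums):
        if target - x in seen:
            return seen[target - x], i
        seen[x] = i
    return None

print(two_sum([2, 7, 11, 15], 9))  # (0, 1)
```

Being able to state the time/space trade-off (O(n) time for O(n) extra memory) is as important in the interview as the code itself.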
CS algorithms are just one part of the interview process; Data Science roles typically require a good understanding of probability and statistics.
Concepts related to probability distributions, conditional probability, and measures of central tendency are paramount to succeeding in an interview.
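Conditional probability questions in interviews often boil down to applying Bayes' rule carefully. A classic worked example, with made-up numbers chosen for illustration: a test with 99% sensitivity and 95% specificity for a condition with a 1% base rate.

```python
# Bayes' rule with hypothetical numbers: how likely is the condition
# given a positive test? All rates below are invented for illustration.
p_condition = 0.01
p_pos_given_condition = 0.99   # sensitivity
p_pos_given_healthy = 0.05     # 1 - specificity (false positive rate)

# Total probability of a positive test.
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# Bayes' rule: P(condition | positive).
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(round(p_condition_given_pos, 3))  # ~0.167
```

Despite the accurate-sounding test, a positive result implies only about a 1-in-6 chance of having the condition, because the base rate is so low; recognizing this is exactly what such questions test.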
Additionally, estimation and hypothesis testing are key to understanding why and how your models perform in different paradigms.
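One hypothesis-testing idea worth internalizing is the permutation test, since it makes minimal distributional assumptions. A minimal sketch with toy, made-up samples:

```python
import random

def permutation_test(a, b, n_iter=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Repeatedly shuffles the pooled data and counts how often a random
    split produces a mean difference at least as extreme as observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_iter  # approximate p-value

# Toy samples: b is clearly shifted upward relative to a.
a = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
b = [1.6, 1.8, 1.7, 1.5, 1.9, 1.6, 1.7, 1.8]
print(permutation_test(a, b))
```

The same resampling logic underlies the bootstrap confidence intervals often used to report A/B test results.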
The most important part of an algorithm-centric role is Machine Learning. The building blocks of Machine Learning are the basic algorithms: Logistic Regression, Naive Bayes, Trees, and Random Forests, to mention a few. For many of these algorithms feature engineering is very important and determines the performance of the model. While studying these algorithms it is important to go deep and understand how the inner math works for at least one of them.
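To make "understand the inner math of at least one algorithm" concrete, here is logistic regression fit from scratch with gradient descent on a tiny invented one-feature dataset, no ML library involved:

```python
import math

def train_logistic_regression(xs, ys, lr=0.1, epochs=500):
    """Fit weight w and bias b by gradient descent on the log loss.

    Single-feature case for clarity; the gradient of the log loss
    wrt the linear score is simply (p - y).
    """
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Toy 1-D data: class 1 tends to have larger x.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(xs, ys)

def predict(x):
    return 1 / (1 + math.exp(-(w * x + b)))

print(round(predict(0.5), 2), round(predict(4.0), 2))
```

Walking through the gradient derivation behind the `(p - y)` term is a common interview exercise, and seeing it in ten lines of code makes it stick.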
Currently Deep Learning has gained a lot of importance because of its ability to produce extremely high-quality results in fields such as computer vision, machine translation, and understanding human language. One advantage of deep learning is that the models learn useful representations from raw features, so the data scientist needs to worry far less about feature engineering. With the advent of open-source packages such as TensorFlow, PyTorch, and Keras, writing deep learning models has become quite easy.
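To see what those frameworks automate, here is a deliberately tiny one-hidden-layer network trained on XOR with plain SGD in pure Python. This is a toy sketch of forward and backward passes, not how you would build a real model with TensorFlow, PyTorch, or Keras:

```python
import math
import random

# Toy one-hidden-layer network (tanh hidden units, sigmoid output).
random.seed(0)
H = 4  # number of hidden units
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    """Return hidden activations and the sigmoid output probability."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    z = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return h, 1 / (1 + math.exp(-z))

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def log_loss():
    return -sum(y * math.log(forward(x)[1])
                + (1 - y) * math.log(1 - forward(x)[1])
                for x, y in data) / len(data)

loss_before = log_loss()
lr = 0.5
for _ in range(2000):
    for x, y in data:
        h, p = forward(x)
        d_z = p - y  # gradient of the log loss wrt the output pre-activation
        for j in range(H):
            d_h = d_z * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
            w2[j] -= lr * d_z * h[j]
            b1[j] -= lr * d_h
            for i in range(2):
                w1[j][i] -= lr * d_h * x[i]
        b2 -= lr * d_z
loss_after = log_loss()
print(round(loss_before, 3), "->", round(loss_after, 3))
```

In a framework the backward pass above is a single call to automatic differentiation, which is precisely why writing deep models has become so accessible.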
Evaluation Metrics: A very important part of doing machine learning is being able to evaluate models and interpret them correctly.
Traditionally precision, recall, accuracy, F1-score and ROC curves are computed to understand the performance of a model.
Every situation is unique: you may have a multi-label classifier where some types of misclassification are more acceptable than others. In that case you have to define your own metric that reflects your use case correctly. There is also the case of anomalous behavior when the data is very imbalanced; metrics should then be evaluated for each class separately to build a robust model. Applied Predictive Modeling has a few chapters dedicated to evaluation that are very useful.
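Computing the standard metrics from first principles is a good way to internalize what they measure. A minimal sketch with invented labels, chosen to be imbalanced so that raw accuracy would look deceptively good:

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 computed from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Imbalanced toy labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Here accuracy would be 80%, yet the model finds only half of the positives; that gap is exactly why per-class metrics matter on imbalanced data.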
Understanding Product: Product knowledge becomes quite necessary at a product-centric company. For instance, a huge number of companies work with two-sided marketplaces; when interviewing at these companies it is important to work through a few problems in this domain, which develops good common sense for such products. Other products face a particular type of customer (students, movie viewers, retail, etc.); in that case it helps to understand how the customer base thinks, so you can quickly apply modeling techniques to real-world problems.
Communication: The consumers of data science results are business, product, engineering, or other specialized teams. To be a successful data scientist it is important to present and communicate results in a way that is useful to those teams. Developing the soft skills to convey what is and is not possible with algorithms makes setting expectations easier.
Data Scientists wear many hats (engineer, business, product, …); this blog covers some of the basic skills necessary. There are many other parts of the role that people learn on the job. Often it is not possible for one person to master everything, and people choose to specialize in sub-fields (machine learning, analytics, databases, deep learning, etc.), which is also very valuable. Enjoying the work is important for success: finding a niche where there is demand as well as individual passion will make the process of transitioning into this new domain exciting.