7 steps to become a data scientist
Two months ago, I wrote an answer on Quora about becoming a data scientist at top tech companies. It has received more than 25,000 views and 200+ upvotes from various data scientists at Facebook, Quora, Apple, etc.
I decided to write an expanded version for those who are interested in data science. These were the 7 most common things I saw when I interviewed at big companies (Facebook, Intel, Square, eBay, etc.) for data-science-related positions.
Basic Programming Languages
You should know a statistical programming language, like R or Python, as well as its data-science-related packages. Some packages you should familiarize yourself with are pandas and NumPy for Python, and DataComputing and dplyr for R. You should also know a database querying language like SQL, and understand its more complicated usage (subqueries and joining multiple tables).
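To make "subqueries and joining multiple tables" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` and `orders` tables and their contents are made up purely for illustration.

```python
import sqlite3

# In-memory database with two small, made-up tables (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 50.0), (3, 2, 20.0), (4, 3, 10.0);
""")

# Join users to their orders, total the spend per user, and keep only users
# whose total is above the average order amount (computed by the subquery).
query = """
    SELECT u.name, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.id, u.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY total DESC
"""
big_spenders = conn.execute(query).fetchall()
conn.close()

print(big_spenders)
```

The same pattern (join, aggregate, filter against a subquery) comes up constantly in interview SQL questions, just with bigger tables.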
You should be able to explain key terms such as null hypothesis, p-value, maximum likelihood estimators, and confidence intervals. Statistics is an important aspect of data science. It is absolutely critical for crunching data and picking the most important figures out of a huge dataset: statistics is the act of condensing data in order to reveal the data’s core characteristics. This is critical both in the decision-making process and in designing effective experiments.
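If terms like p-value and confidence interval feel abstract, computing them once by hand helps. Below is a rough sketch of a two-sided z-test and a confidence interval for a sample mean. It assumes a normal approximation (for small samples a t-test would be more appropriate), and the function name `z_test_and_ci` and the sample values are my own inventions for this example.

```python
import math
from statistics import NormalDist, mean, stdev

def z_test_and_ci(sample, mu0, confidence=0.95):
    """Two-sided z-test of H0: true mean == mu0, plus a CI for the mean.

    Uses a normal approximation, which is an assumption of this sketch.
    """
    n = len(sample)
    sample_mean = mean(sample)
    std_error = stdev(sample) / math.sqrt(n)

    # How many standard errors the sample mean is from the null value.
    z = (sample_mean - mu0) / std_error
    # p-value: chance of a result at least this extreme if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (sample_mean - z_crit * std_error, sample_mean + z_crit * std_error)
    return p_value, interval

# Made-up measurements, only for illustration.
sample = [10, 12, 9, 11, 13, 10, 12, 11]
p_value, (low, high) = z_test_and_ci(sample, mu0=9)
print(f"p-value: {p_value:.4f}, 95% CI: ({low:.2f}, {high:.2f})")
```

A small p-value means the data would be surprising if the null hypothesis were true. The confidence interval gives the range of means that are consistent with the sample.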
You should be able to explain K-nearest neighbors, random forests, and ensemble methods. Many machine learning techniques and classifiers are typically implemented in R or Python. Knowing these algorithms shows employers that you have exposure to how data science is used in practice. Machine learning is used in our everyday lives: web page classification, Facebook ad rankings, Google search optimization, Airbnb search rankings, YouTube’s related videos, etc. Machine learning and artificial intelligence are growing fields, and it is critical to understand their algorithms.
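K-nearest neighbors is a good one to be able to explain, because the whole idea fits in a few lines: find the k closest labeled points and let them vote. Here is a bare-bones sketch in plain Python. The `knn_predict` helper and the toy points are hypothetical, and in practice you would likely reach for a library implementation instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Two made-up clusters, only for illustration.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, query=(2, 2)))
print(knn_predict(points, labels, query=(8.5, 8.5)))
```

Note that there is no training step at all. The cost is paid at prediction time, when every stored point gets compared to the query.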
You should be able to clean up data. This basically means understanding that “California” and “CA” are the same thing, or that a negative number cannot appear in a dataset that describes population. Data scientists are often handed raw data that have multiple types of redundancies and corruptions. Data wrangling is all about identifying such redundancies and corruptions and converting a raw dataset into a condensed version that is much easier to analyze.
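The two examples above (treating “California” and “CA” as the same thing, and rejecting negative populations) can be sketched in a few lines. This is only a toy: the `STATE_CODES` lookup, the `clean_population_rows` helper, and the numbers are all made up for illustration.

```python
# Hypothetical lookup from full state names to two-letter codes.
STATE_CODES = {"california": "CA", "new york": "NY", "texas": "TX"}

def clean_population_rows(rows):
    """Normalize state names, drop impossible values, and remove duplicates."""
    cleaned = {}
    for state, population in rows:
        name = state.strip()
        # Map full names to codes; anything else is assumed to already be a code.
        code = STATE_CODES.get(name.lower(), name.upper())
        # A population can never be negative, so treat such rows as corrupted.
        if population is None or population < 0:
            continue
        # Keep the first valid row per state and skip later duplicates.
        cleaned.setdefault(code, population)
    return cleaned

# Toy raw data with a duplicate, stray whitespace, and a corrupted value.
raw = [("California", 100), ("CA", 100), (" texas ", -5), ("ny", 200)]
print(clean_population_rows(raw))
```

Real datasets need more judgment (is a duplicate with a different value a correction or an error?), but the shape of the work is the same.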
A data scientist is useless on his or her own. They need to communicate their findings to Product Managers to make sure those data translate into real applications. Thus, familiarity with data visualization tools like ggplot2 and Shiny is very important (so you can SHOW data, not just talk about them). I personally recommend using Jupyter Notebook as a convenient way to analyze data and visualize them at any given point. Data visualization allows data scientists to communicate their findings to others who may not understand how to properly read data.
You should know algorithms and data structures, as they are often necessary for creating efficient machine learning algorithms. Know the use cases and run times of these data structures: queues, arrays, lists, stacks, trees, etc. Software engineering is a basic foundation of data science. Having a clear understanding of different data structures will give you a broader arsenal of efficient algorithms to use when analyzing data.
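As a quick illustration of why the choice of data structure matters, here is a small sketch that walks the same tree with a queue (breadth-first) and with a stack (depth-first). The tree and function names are made up for this example. In Python, `collections.deque` gives O(1) removal from the front, while `list.pop(0)` is O(n), which is exactly the kind of run-time detail interviewers like to ask about.

```python
from collections import deque

def bfs_order(tree, root):
    """Visit nodes level by level using a queue (first in, first out)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()  # O(1) on a deque; list.pop(0) would be O(n)
        order.append(node)
        queue.extend(tree.get(node, []))
    return order

def dfs_order(tree, root):
    """Visit nodes branch by branch using a stack (last in, first out)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()  # O(1) pop from the end of a list
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(tree.get(node, [])))
    return order

# A small made-up tree stored as an adjacency mapping: parent -> children.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(bfs_order(tree, "A"))
print(dfs_order(tree, "A"))
```

The two loops are nearly identical. Only the container changes, and with it the order in which the data gets explored.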
This one is definitely debatable, but those who understand the product are the ones who will know which metrics are the most important. There are tons of numbers one can A/B test, so a product-oriented data scientist will pick the right metrics to experiment with. Know what these terms mean: usability testing, wireframing, retention and conversion rates, traffic analysis, customer feedback, internal logs, A/B testing. Data science is only useful when the “data” is creating positive changes to the product itself, or to how the product is perceived and used.
In each field, I mentioned a few buzzwords you should know about. There are tons of websites you could use to learn them, so I recommend using these 7 branches as a roadmap to guide yourself when learning about data science.
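To connect A/B testing and conversion rates, here is a rough sketch of how you might check whether variant B really converts better than variant A. It uses a two-proportion z-test with a pooled standard error, which is one common choice and not the only one. The `ab_test` helper and the visitor counts are invented for illustration.

```python
import math
from statistics import NormalDist

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Compare two conversion rates with a two-sided two-proportion z-test."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Under H0 (no difference), both groups share one pooled conversion rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, p_value

# Made-up experiment: 1,000 visitors per variant.
rate_a, rate_b, p_value = ab_test(100, 1000, 150, 1000)
print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, p-value: {p_value:.4f}")
```

The statistics are the easy part. Deciding that conversion rate (and not, say, retention) is the metric worth testing is where product sense comes in.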
Welcome to Data Science!