photo credits: Burak Kebapci

Curriculum for Product Data Science

Emad Khazraee
Dec 22, 2021

The terms data science and data scientist currently come with different expectations depending on the context and the company. Since data science gained popularity a few years ago, the field has grown and matured, and you now see different flavors of it. In some companies, data scientists are mostly focused on product analytics and rarely do any machine learning work. In other companies, they do “full-stack data science,” meaning the end-to-end work (from ideation to execution), including deploying models into production. Some companies have created different flavors of data science, such as product analytics, algorithms, and AI, to differentiate among focus areas. All of these flavors touch four subject areas: statistics, machine learning, coding, and product sense. What differentiates them is the weight of each area in the day-to-day work of the scientist.

ML engineering, data science of algorithms, and AI put more weight on software engineering and machine learning, while product data science (data science of product analytics) puts more weight on applied statistics, experimentation, and product sense.

Since I made my own transition from academia to industry as a product data scientist, I have received many requests from fellow academics asking how to make such a transition and what they need to know to succeed. I put together a concise curriculum three years ago. Over the past three years, I have also coached a few people who successfully transitioned to different industry jobs in data science. I usually update the curriculum when I learn about new resources. Recently, I shared this curriculum internally with my colleagues at work and received very positive feedback. My colleagues encouraged me to publish it as a blog post for those interested in becoming a product data scientist. So, here you go!

Product Data Science

For product data science you deal with all four areas of data science: Applied Statistics, Applied Machine Learning, Coding, and Product Impact and Experimentation. This curriculum is designed based on the competencies required for the job while taking into account what you need to pass the interviews.

Applied Statistics

In this area, you need to demonstrate that you have a good grasp of basic statistics and probability concepts such as:

  • Distribution of random variables
  • p-value, confidence interval, standard error, power, Type I and II errors
  • Basic probability, conditional probability, and Bayes’ theorem
  • Randomized controlled trial (A/B testing)
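To make these concepts concrete, here is a minimal sketch of how a p-value, a standard error, and a confidence interval come together in a simple A/B test on conversion rates. It assumes a two-sided two-proportion z-test with a normal approximation, which is only one of several reasonable choices; the function name and the conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for the difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Under the null hypothesis (no difference) both groups share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval uses the unpooled standard error of the difference.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


# Hypothetical experiment: 200/2000 conversions in control, 250/2000 in treatment.
result = two_proportion_ztest(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(result)
```

With these made-up numbers the p-value falls below 0.05 and the 95% interval excludes zero, so you would reject the null hypothesis at that level. In an interview, be ready to explain what a Type I error would mean in this setting.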

You also need to understand how to use statistical tests and their underlying assumptions, such as:

  • t-test, ANOVA
  • Chi-Square test
  • Linear and logistic regression

The best resource for these concepts is OpenIntro Statistics, 4th Edition (freely available). It is an introductory book but does an excellent job of offering a deep conceptual understanding of basic statistics. The book also has links to short video lectures if you do not have time to read all the text. This book is also the basis for Duke’s Statistics with R on Coursera. If you want a math-heavy version, All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman is a good option, even if you read only the first few chapters.

Applied Machine Learning

Many product data scientists never use machine learning; therefore, some companies do not assess you in this area. In other companies, product data scientists may do machine learning work, but it is usually aimed at understanding the product and suggesting potential changes. These scientists usually do not put their models into production; if they do, they usually work with machine learning engineers to make sure the models meet system-level requirements such as latency.

To successfully pass the machine learning interview for product data science, you need to understand the foundations of supervised and unsupervised learning and their applications. The interview usually focuses on what I call shallow learning:

  • Linear models (Ridge, Lasso, PCR, PLS)
  • Tree-based models
  • Ensemble learners such as random forest and Gradient Boosted Trees

You also need to demonstrate a solid understanding of:

  • Feature engineering
  • Model evaluation: confusion matrix, split (hold-out) validation, cross-validation, leave-one-out; train, validate, and test processes
  • Evaluation measures: precision, recall, accuracy, F-Score, sensitivity, specificity, Cohen’s Kappa, ROC curves, AUC
  • Bias-Variance Tradeoff (see this and this) and diagnostics
  • Practical modeling issues: hyperparameter tuning (what are the hyperparameters for random forest, XGBoost, Lasso, and Ridge regression); imbalanced data issues; down-sampling, up-sampling, bootstrap, and hybrid methods
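Several of the evaluation measures listed above can be derived directly from the four cells of a binary confusion matrix. The sketch below is a plain-Python illustration; the function name and the counts are hypothetical, and the counts were chosen to mimic an imbalanced data set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common evaluation measures from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical imbalanced problem: 60 positives among 1,000 cases.
m = classification_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example accuracy is 0.97 while recall is only about 0.67, which is why accuracy alone is a poor measure for imbalanced data.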

One of the best resources for applied machine learning is still the Andrew Ng course (yes, I know it is relatively old and is based on MATLAB/Octave but it gives you a very solid foundation of ML). There are two books that also cover all of these topics in much more depth: Introduction to Statistical Learning (freely available), and Applied Predictive Modeling. Machine Learning Yearning by Andrew Ng provides practical advice on ML projects. If you want to expand your ML knowledge and learn about deep learning I recommend Deep Learning Specialization by Andrew Ng or Deep Learning with Python (or R) book.

Coding

For product data science, the coding is mostly focused on data wrangling and working with data sets or SQL databases. You need to be comfortable with ETL (extract, transform, and load) processes such as:

  • Importing data from raw formats (e.g., CSV), or a database
  • Cleaning data
  • Joining tables and filtering data
  • Filtering by date and time, and by other variables
  • Aggregating and summarizing data (group by and aggregate functions)
  • EDA and understanding overall data insights
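The steps above can be sketched end to end with nothing but the Python standard library: import raw CSV, clean it, load it into a SQL database, then join, filter by date, and aggregate. The table names, columns, and rows below are hypothetical and exist only for illustration; in practice you would more likely use pandas or your company's data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw exports; in practice these would be files or database tables.
USERS_CSV = """user_id,country
1,US
2,DE
3,US
"""
ORDERS_CSV = """order_id,user_id,order_date,amount
10,1,2021-11-03,20.0
11,1,2021-12-05,35.5
12,2,2021-12-10,
13,3,2021-12-20,14.5
14,3,2021-12-21,10.0
15,2,2021-12-28,50.0
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT)")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, user_id INTEGER, order_date TEXT, amount REAL)"
)

# Import: parse the CSV text and load it into the database.
users = [
    (int(r["user_id"]), r["country"])
    for r in csv.DictReader(io.StringIO(USERS_CSV))
]
conn.executemany("INSERT INTO users VALUES (?, ?)", users)

# Clean: skip orders whose amount is missing (order 12 here).
orders = [
    (int(r["order_id"]), int(r["user_id"]), r["order_date"], float(r["amount"]))
    for r in csv.DictReader(io.StringIO(ORDERS_CSV))
    if r["amount"].strip()
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)

# Join, filter to December 2021, then aggregate per country.
# ISO-formatted date strings compare correctly as plain text.
query = """
SELECT u.country, COUNT(*) AS n_orders, ROUND(AVG(o.amount), 2) AS avg_amount
FROM orders AS o
JOIN users AS u ON u.user_id = o.user_id
WHERE o.order_date >= '2021-12-01' AND o.order_date < '2022-01-01'
GROUP BY u.country
ORDER BY u.country
"""
summary = conn.execute(query).fetchall()
print(summary)
```

Dropping rows with missing values is only one cleaning choice; whether to drop, impute, or flag them depends on the question you are answering.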

There are many resources for R and Python that cover these topics. For those working with Python, Python for Data Analysis, 2nd Edition by Wes McKinney is a great resource.

To learn about SQL there are also many resources such as:

To get a sense of what is required in coding or data wrangling see the appendix below on the coding exercise.

Product Impact (Sense) and Experimentation

This area concerns how to make decisions about performance, growth, and improvement of a product. It is mostly focused on how we can collect data about user interaction with products, define metrics, develop hypotheses, and then design experiments and test hypotheses to improve the metrics through A/B testing. There are a few resources in this space such as:

There are also many blogs out there covering these topics.
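One part of designing an experiment is deciding how many users you need before you start. The sketch below uses the standard normal-approximation formula for comparing two proportions with equal group sizes. The function name, the baseline rate, and the minimum detectable effect are assumptions for illustration; real experimentation platforms may use different formulas or corrections.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect an absolute lift of `mde`
    over a `baseline` conversion rate with a two-sided test."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the Type I error rate
    z_beta = NormalDist().inv_cdf(power)           # controls the Type II error rate
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Hypothetical metric: 10% baseline conversion.
n_two_points = sample_size_per_group(baseline=0.10, mde=0.02)
n_one_point = sample_size_per_group(baseline=0.10, mde=0.01)
print(n_two_points, n_one_point)
```

Halving the detectable effect roughly quadruples the required sample. That is the practical trade-off between sensitivity and experiment duration that product data scientists are often asked to reason about.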

Final Word on Interviewing

That should cover all bases for you if you plan to move to product data science. You also need a lot of practice, including practice at interviewing. If you are new to interviewing in tech, you may feel discouraged after a couple of bad initial interviews. I am here to tell you that it is completely OK, and actually expected, to bomb the first couple of interviews. Do not get discouraged. Reflect on what went well and what did not go so well after each interview. Write down some notes and identify a set of action items that you can do to improve your performance and bridge the gaps in your knowledge. If you persist, I am sure you will succeed.

Good Luck!

Appendix: Coding Exercise

Here is a coding exercise courtesy of my esteemed colleague Ehsan Fakharizadi.

Download the following IMDB datasets from https://datasets.imdbws.com

See whether you can answer the following questions:

  • Which actors co-appeared in movies with Robert De Niro?
  • What is the average rating of movies of those actors?
  • In which movies did one of the actors pass away during filming?
  • Which movie genres have the most actors with an average age below 20?
  • Which actors have the most co-appearances in movies?
  • What is the average rating of movies by directors’ age?
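To give a feel for the kind of wrangling involved, here is a minimal sketch of the first question on a tiny toy data set. It assumes the real files are tab-separated and that the cast table and the names table share an `nconst` person identifier, with `tconst` as the title identifier. The rows, identifiers, and placeholder names below are made up, and only a subset of columns is shown.

```python
import csv
import io

# Toy stand-ins for the names and principals files (hypothetical rows and IDs).
NAMES_TSV = (
    "nconst\tprimaryName\n"
    "nm1\tRobert De Niro\n"
    "nm2\tActor A\n"
    "nm3\tActor B\n"
    "nm4\tDirector C\n"
)
PRINCIPALS_TSV = (
    "tconst\tnconst\tcategory\n"
    "tt1\tnm1\tactor\n"
    "tt1\tnm2\tactress\n"
    "tt1\tnm4\tdirector\n"
    "tt2\tnm3\tactor\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, treating quotes literally."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


names = {r["nconst"]: r["primaryName"] for r in read_tsv(NAMES_TSV)}

# Keep only acting credits, so directors and other crew are excluded.
cast = [r for r in read_tsv(PRINCIPALS_TSV) if r["category"] in {"actor", "actress"}]

# Step 1: find the target's ID(s). Step 2: find the titles they appear in.
# Step 3: collect everyone else credited as cast on those titles.
target_ids = {nid for nid, name in names.items() if name == "Robert De Niro"}
target_titles = {r["tconst"] for r in cast if r["nconst"] in target_ids}
co_actor_ids = {r["nconst"] for r in cast if r["tconst"] in target_titles} - target_ids
co_actors = sorted(names[nid] for nid in co_actor_ids)
print(co_actors)
```

On the real data you would also need to restrict the titles to movies, handle missing values, and deal with different people who share the same name. The sketch deliberately leaves those parts of the exercise to you.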

Emad Khazraee

Data Scientist, Sociotechnical Researcher, and Ex-Architect