Data Science Fundamentals for Product Managers

Mar 5, 2018

A few weeks back we organised an event around the theme of Data Science Fundamentals for Product Managers. Chinmaya Behera, who has worked as a Data Science Product Manager at Naukri.com and RedBus, spoke wonderfully well on this topic. We are capturing his thoughts in the blog below. We hope all product enthusiasts find this useful and that it can act as a great primer for product folks interested in the Data Science and Machine Learning domain. We would like to thank Chinmay for taking time out and capturing his thoughts in this blog.

Before we delve further, for the uninitiated: CohortPlus is the largest and most active community for Product enthusiasts and Product Management professionals. We have close to 9000 product enthusiasts who interact with each other through rich and useful discussions and help further each other's careers.

Link to register on our Android platform: https://goo.gl/7yz7E4

Link to register on our website: https://cohortplus.com/

Without further ado, here are the details of Data Science Fundamentals for Product Managers.

Data Science for Product Managers

Data science is becoming ubiquitous, with numerous products trying to leverage it in one form or another. Even though the field is still evolving, it is no longer a fad, and organizations have had multiple success stories around using some form of data science in their products. Product managers, by the nature of their role, have to be at the forefront of understanding new technologies and how these can benefit end users through the products they build or manage. In this post, I would like to share what product managers can do to get up to speed on data science and on some of the key machine learning algorithms that form the backbone of many compelling products nowadays.

Data Science — An Introduction

In simple terms, data science implies the use of data and technology to make better decisions.

Product managers should view data science as an approach that analyses large amounts of data, extracts patterns and insights from the data, and makes predictions to derive business value. One of the key components in the above figure would be ‘data products’ that work on huge volumes of raw data, learn or extract patterns from the data and deliver value to the users, thereby improving business metrics.

Many machine learning algorithms form the brains behind these data products. Machine learning algorithms can be simply defined as programs that learn from data without being explicitly programmed.

One way to understand machine learning would be to compare it with the traditional programming approach.

· In a traditional programming approach, one starts with data as inputs, writes a set of rules/logic as part of the program and gets the results as the output.

· The machine learning approach takes the historical data and the results as input and derives the logic or patterns between the data and the results. Subsequently, it generates a program that can be used with any future data to predict the result.
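To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the spam rule, the `num_links` feature and the six historical examples are all made up, and real machine learning libraries learn far richer patterns than a single threshold.

```python
# Toy contrast between a hand-written rule and a rule derived from data.
# The feature (number of links in a message) and all numbers are hypothetical.

def rule_based_is_spam(num_links):
    """Traditional programming: a person chooses the rule (more than 3 links)."""
    return num_links > 3


def learn_threshold(examples):
    """Machine learning in miniature: pick the threshold that misclassifies
    the fewest historical (num_links, is_spam) examples."""
    candidates = sorted({links for links, _ in examples})
    best_threshold, best_errors = candidates[0], len(examples) + 1
    for threshold in candidates:
        errors = sum((links > threshold) != is_spam for links, is_spam in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Historical data (inputs) together with the known results (labels).
history = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]

threshold = learn_threshold(history)


def learned_is_spam(num_links):
    """The 'generated program': same shape as the rule, but the number came from data."""
    return num_links > threshold


print("learned threshold:", threshold)
print("message with 4 links flagged as spam:", learned_is_spam(4))
```

The point of the sketch is the direction of the arrows: in the first function a person supplies the logic, while in the second the logic (here just one number) is derived from past inputs and results and then reused on new data.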

Common examples of machine learning at play would be –

· Recommendation Engines: These are algorithms that suggest items the users might be interested in, without the users explicitly searching for them. The suggestions could be in the form of products a customer might want or a movie they would like.

· Spam Filter (e.g. Gmail): An algorithm behind the scenes processes incoming mail and determines whether a message is junk or not.

· Object Detection (e.g. in autonomous cars): Machine learning algorithms are used to recognize traffic lights, other cars on the road, pedestrians, etc.

Rise of Data Science

Some of the popular machine learning algorithms behind the examples in the previous section have existed since the 1980s[1]; however, they have come into greater prominence only over the past few years. Let’s understand some of the key factors that have led to the resurgence of these algorithms –

Data Availability

There has been an explosion in the creation and the sources of data. Nowadays, a lot more data is being collected than ever before –

• Web and Browsing data

• GPS / Location data

• Images and Videos

• User Generated Content (UGC)

• Devices with sensors

• Emails

• Financial transactions

Efficiency of Algorithms

• The existing algorithms have become more effective with the data deluge

• Results have improved significantly with increase in training data

• It is now also possible to transfer learning[2] from one application to another

Diminishing Infrastructure Costs

· Availability of on-demand cloud-based services (e.g. Amazon Web Services, Google Cloud or Azure): the speed, availability and sheer scale of this infrastructure have enabled bolder algorithms to tackle more ambitious problems[3]

· In addition to the availability of scalable servers on the cloud, many data sets are now open-sourced by governments and companies around the world, improving the accessibility of data to feed into the algorithms

· Many of the popular machine learning algorithms have also been made available to the public (open source libraries/frameworks), leading to wider adoption by the developer community

Machine Learning Basics for PMs

Now that we understand why machine learning algorithms have come into prominence in the past few years, let's dive deeper into some of the popular types of machine learning algorithms.

While machine learning is a deeply technical field, many of the fundamentals required to leverage it to create business-impacting products or features have little to do with the complexity of the algorithms. As a product manager, one should:

· understand the data collected from various customer touchpoints and the sources of data pretty well

· develop an understanding of common machine learning problem types — regression and classification

· learn to tie the results from a machine learning model/algorithm to business metrics

· define test criteria (e.g. A/B testing) to evaluate the degree of success or failure of a machine learning model
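As an illustration of the last point, an A/B test between the current experience and a model-driven variant is often judged with a two-proportion z-test. The sketch below is one common way to do this, not the only one; the function name `ab_test_p_value` and the conversion numbers are hypothetical.

```python
from math import erf, sqrt


def ab_test_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value of a two-proportion z-test (A = control, B = variant)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled conversion rate under the assumption that A and B perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)


# Hypothetical experiment: 2.0% conversion without the model, 2.6% with it.
p_value = ab_test_p_value(200, 10_000, 260, 10_000)
print(f"p-value: {p_value:.4f}")
print("significant at the 5% level:", p_value < 0.05)
```

A small p-value suggests the difference in the business metric is unlikely to be noise. The significance level (5% here) and the sample size are product decisions that are best agreed with the data science team before the test starts.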

Types of Machine Learning Algorithms

Machine learning is simply a way of creating a program that does something (e.g. predict a value or classify an item into a category) without the programmer having to figure out how to do it. Typically, a lot of historical data is fed into these algorithms as inputs and the algorithms are robust enough to find complex correlations between these data.

There are many types of machine learning algorithms; however, I will mention only a few of the common ones in this post –

· Supervised Learning

The majority of practical machine learning applications use some form of supervised learning. In this type of learning, the algorithms learn from something known as a labeled data set. A labeled data set is a collection of historical data/records that comprises both the inputs and the corresponding output or target value achieved.

As an example, consider the following data set[4] obtained from a banking institution

- The data sets are usually in tables having data items (say, bank customers) in rows along with variables (e.g. age, job, education, money balance) in columns.

- Labeled data sets also have target variables (labels), the values to be predicted in future data.

- In the above data set, the target variable defines whether customers have subscribed for a term deposit after a call or not.

Once the data set is fed into the algorithm, it learns to classify the outcome of the input variables as yes or no (whether a customer has subscribed for a term deposit). This learning can then be applied to future data about a new customer to predict the outcome (yes/no).
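The idea of learning from labeled rows and then predicting for a new customer can be sketched with a tiny nearest-neighbour classifier. This is only one of many supervised algorithms, and the six rows below are made up for illustration; they are not taken from the bank data set cited above.

```python
from collections import Counter
from math import dist

# Hypothetical labeled data set: (age, balance in thousands) -> subscribed?
labeled_customers = [
    ((25, 1.0), "no"),
    ((30, 2.0), "no"),
    ((35, 1.5), "no"),
    ((50, 20.0), "yes"),
    ((55, 25.0), "yes"),
    ((60, 30.0), "yes"),
]


def predict(customer, training_data, k=3):
    """Predict a label by majority vote among the k most similar past customers."""
    nearest = sorted(training_data, key=lambda row: dist(row[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# New, unlabeled customers for whom we want a yes/no prediction.
print(predict((58, 27.0), labeled_customers))
print(predict((28, 1.2), labeled_customers))
```

The "learning" here is simply remembering past customers and their outcomes; other algorithms compress that history into rules or weights instead, but the inputs (labeled rows) and the output (a prediction for new rows) have the same shape.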

Two classic problems that are being addressed through supervised learning are –

- Regression

Predicting the numerical value of a thing is a regression problem. Example: predicting how much a house will cost in a particular area (based on historical trends and other factors).

- Classification

Figuring out what kind of thing something is, is a classification problem. Deciding whether a Gmail message is spam or not, and detecting faces in Facebook photos, are examples of this type of problem. Classification could be two-class (binary) classification (yes/no or spam/not spam) or multi-class classification.
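For the regression case, a minimal sketch is fitting a straight line through past house sales with ordinary least squares and using it to price a new house. The areas and prices below are hypothetical, and a real model would use many more inputs than area alone.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical sales: area in square metres, price in thousands.
areas = [50, 70, 90, 110]
prices = [155, 205, 275, 325]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 100 + intercept  # a 100 square metre house
print(f"price per extra square metre: {slope:.2f}")
print(f"predicted price for 100 square metres: {predicted_price:.1f}")
```

The output is a number on a continuous scale, which is what separates regression from classification, where the output is one of a fixed set of categories.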

· Unsupervised Learning

Unlike the case of supervised learning, the data sets for unsupervised learning do not have the target values and hence the data set is termed as unlabelled data. Here the algorithm tries to identify patterns in the data without the need to tag the data set with the desired outcome.

Some common problems that are being addressed through unsupervised learning are –

- Clustering

Grouping of items based on similar characteristics is the outcome of clustering. Grouping similar news items or similar customers based on their purchase behaviour would be practical examples of this approach.

- Association

Categorization of objects into buckets based on some relationship, so that the presence of one object in a bucket predicts the presence of another. A very common example would be the “people who bought XYZ also bought ABC” recommendation problem.

- Anomaly detection

Identification of unexpected patterns in data that need to be flagged and handled falls under the anomaly detection type of problem. Common examples where this is used would be fraud detection and health monitoring for complex systems (industrial machinery or network infrastructure).
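The three problem types above can each be sketched in a few lines. The functions below are deliberately simplified stand-ins (a one-dimensional two-cluster k-means, a co-occurrence count, and a standard-deviation rule), and all names and data are hypothetical; none of them uses a label or target value.

```python
from collections import Counter
from statistics import mean, pstdev


def two_clusters(values, iterations=10):
    """Clustering: split numbers into a low and a high group (1-D k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:  # all values identical, nothing to split
            break
        low, high = mean(group_low), mean(group_high)
    return group_low, group_high


def also_bought(baskets, item):
    """Association: count the items that appear in the same basket as `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in basket if other != item)
    return counts.most_common()


def flag_anomalies(values, threshold=2.0):
    """Anomaly detection: flag values far from the mean, measured in standard deviations."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]


monthly_spend = [20, 25, 30, 200, 220, 250]
baskets = [{"phone", "case"}, {"phone", "case", "charger"}, {"laptop", "mouse"}, {"phone", "case"}]
transaction_amounts = [100, 102, 98, 101, 99, 100, 103, 97, 500]

print(two_clusters(monthly_spend))
print(also_bought(baskets, "phone"))
print(flag_anomalies(transaction_amounts))
```

In each case the algorithm is handed only the raw observations and has to find the structure (groups, co-occurrences or outliers) by itself.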

Working with Data Science Teams

The data science team which deals with developing machine learning based products will be discovering and analysing data, defining features for the problem (feature engineering), selecting and optimizing algorithms and then putting machine learning into production for further testing.

As a product manager, one should have a good grasp of the machine learning model development process. A great resource to learn about the various stages is mentioned in the blog link associated with the figure below

Figure — Machine Learning Logical Flow[5]

There are a number of additional things that need to be taken care of while dealing with machine learning based products.

Building products with data requires a data strategy

• Most machine learning algorithms feed on a lot of data for training the models. Product managers should have a deep understanding of all the touchpoints for data generation, collection and its consumption in internal products.

• Another aspect to consider would be the use of data to improve the product algorithmically over time.

Machine learning model deployment

• PMs should work with the data science team in defining the features or inputs to the model during the feature engineering stage of a machine learning model development process.

• Product managers should understand how the model will work with real-time data. One should take into consideration any new APIs to be developed to interface with the machine learning model in the production environment.

• A critical aspect for product managers to consider is the frequency with which the machine learning model is retrained: whether it's daily, weekly or every X days needs to be well thought through. While a machine learning model will generally improve over time with the data it is trained on, there is a trade-off between the effort and infrastructure required to train the model and the performance gained from the amount of data (and how recent the data is) it has been trained on.

• The speed with which the machine learning output is generated matters, depending upon how the end user interacts with it. For example, an ML algorithm taking two seconds to generate recommendations might be more suitable than one taking twenty seconds on an e-commerce product description page.
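The latency point can be turned into a simple check that a PM and the engineering team agree on. The sketch below times a call against a budget; `recommend` is a hypothetical stand-in for a call to a deployed model, and the two-second budget simply mirrors the example above.

```python
import time


def timed_call(func, *args, budget_seconds=2.0):
    """Run func and return (result, elapsed_seconds, within_budget)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


def recommend(product_id):
    """Hypothetical stand-in for a request to a recommendation model in production."""
    catalogue = {"phone": ["case", "charger"], "laptop": ["mouse", "bag"]}
    return catalogue.get(product_id, [])


items, elapsed, within_budget = timed_call(recommend, "phone")
print(items, f"{elapsed:.6f}s", "within budget" if within_budget else "too slow")
```

In practice the budget would be measured over many requests (for example, a high percentile rather than a single call), but even a check this simple makes the latency requirement explicit.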

Evaluating machine learning models

• A product manager should act as an expert translator when it comes to data science projects and how they fit into business needs.

• PMs should also develop skills for translating machine learning metrics (e.g. accuracy, loss) into product metrics and vice versa

• Customer research needs to be done to assess what an acceptable accuracy is as well as what failure cases are expected versus which ones will not be tolerated.
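To see why accuracy alone rarely settles what is acceptable, it helps to split the errors into the two failure cases. The sketch below computes accuracy, precision and recall from a list of actual and predicted labels; the ten spam-filter outcomes are made up for illustration.

```python
def classification_metrics(actual, predicted, positive="spam"):
    """Summarise predictions, separating the two kinds of mistakes."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # spam caught
    fp = sum(a != positive and p == positive for a, p in pairs)  # good mail blocked
    fn = sum(a == positive and p != positive for a, p in pairs)  # spam let through
    tn = sum(a != positive and p != positive for a, p in pairs)  # good mail delivered
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Hypothetical outcomes for ten emails.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```

Two models with the same accuracy can differ a lot in which mistakes they make. For a spam filter, a blocked genuine email (false positive) may be far less tolerable to users than a spam message that slips through (false negative), and that judgement comes from customer research rather than from the metric itself.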

Getting Started with Machine Learning

While the field of machine learning has been evolving rapidly with newer algorithms developed and deployed faster than ever before, the fundamentals still remain the same. For a product-centric approach to understanding machine learning, I would recommend the following –

· Andrew Ng's Machine Learning course on Coursera is by far one of the best starting points for developing a deep understanding of machine learning. Though the course is highly technical, it will help you greatly for years to come.

· Develop a good understanding of basic machine learning models and their output (metrics, curves, distributions). Even though you might not need to learn the intricacies of how the algorithms work, you should develop a sense of how their performance metrics are measured and evaluated.

· Leverage Kaggle Competitions

- Kaggle is a platform for data science competitions where companies upload their datasets and the problems they are trying to solve

- Participants experiment with different techniques to produce the best models

- One can explore competitions on Kaggle to understand different use cases for machine learning, try and figure out the type of machine learning problem (regression / classification etc.)

- Download Kaggle data sets and understand the types of input features and target values
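A first pass over a downloaded data set can be as simple as listing the columns, checking which are numeric, and looking at the target. The sketch below does this with Python's standard `csv` module on a small in-memory sample; the column names and rows are hypothetical, and the rule "numeric target means regression" is only a rough heuristic (class labels are sometimes stored as numbers).

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """age,job,balance,subscribed
30,admin,1200,no
45,technician,300,yes
52,retired,5400,yes
"""


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe(csv_text, target):
    """Return (input feature names, column types, likely problem type)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    column_types = {
        name: "numeric" if all(is_number(row[name]) for row in rows) else "categorical"
        for name in rows[0]
    }
    features = [name for name in column_types if name != target]
    # Rough heuristic only: a numeric target suggests regression.
    problem = "regression" if column_types[target] == "numeric" else "classification"
    return features, column_types, problem


features, column_types, problem = describe(SAMPLE_CSV, target="subscribed")
print("input features:", features)
print("column types:", column_types)
print("likely problem type:", problem)
```

For a real file, the same function could be fed the file's contents read from disk; the useful habit is asking, for every data set, which column is the target and what kind of value it holds.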

Additional Reading

Creating products that use machine learning is an increasingly multi-disciplinary activity. While this article can be treated as a starting point for getting on board with machine learning, here are some more resources for related reading:

1. Kaggle Competitions

2. Machine Learning: An In-Depth Non-Technical Guide by Alex Castrounis

3. Machine Learning is Fun! by Adam Geitgey

4. The What, Why and How of Recommendation Systems by Matias Longo

5. Jason’s Machine Learning 101

[1] Aggarwal, Alok (2018, January 20). Genesis of AI: The First Hype Cycle. Retrieved from https://scryanalytics.ai/genesis-of-ai-the-first-hype-cycle/

[2] Ruder, Sebastian (2017). Transfer Learning - Machine Learning’s Next Frontier. Retrieved from http://ruder.io/transfer-learning/

[3] Hodjat, Babak (2015, March). The AI Resurgence: Why Now? Retrieved from https://www.wired.com/insights/2015/03/ai-resurgence-now/

[4] UCI Machine Learning Repository. Bank Marketing Data Set. Retrieved from http://archive.ics.uci.edu/ml/datasets/Bank+Marketing