Running a Data Driven Product Organization

Drew Dillon
Published in ProductMan
16 min read · Jul 17, 2018

Somewhere between launch and IPO, consumer tech companies become data juggernauts. These companies measure everything, turn those metrics into insights, turn those insights into development, test that development against their metrics, and turn those tests into machine learning tools that recommend and tailor the experience for users.

We see this phenomenon, but even today few Product people understand how to use data, much less how to drive this kind of change in our organizations. Even fewer b2b PMs have ever had exposure to user data outside of Marketing ad buys.

Let’s dig into this world of data, what it can do, and how to bring this discipline to your organization.

The Character of Data

When I say “data,” I mean data gathered in the process of a user using the app. This includes a stream of all events and all the metadata around those events.

When you hear about Facebook having a “shadow profile” of non-users, this is information they have purchased from another source. These sources are credit agencies, retailers, and other companies that measure interactions with their services. I’m not going to go further on this topic, but I think our field needs to do some ethical soul-searching.

Correlation and Causation

The first thing to know about data is that it already happened. Looking at data is looking at a history book.

Would World War I have happened if no one had assassinated Archduke Franz Ferdinand? We can theorize, but we have no idea.

Does the assassination of a world leader lead to a World War? With some tunnel vision, you could make that hypothesis. But even if that correlation had held 100% of the time in the past, you couldn’t know it would hold in the future.

Data, the history of your app and what people did there, can be correlated to certain actions, but can’t be deemed causal on its own. This is usually as close as PMs will get to real science: those correlations are hypotheses that must be validated through experimentation.

What Can Data Tell Us?

That’s the big limitation and most common misconception about data. But what can data really tell us?

  • What areas of the product are getting the most use and how
  • How we’re doing (key performance indicators, “KPI”)
  • The outcome of an experiment
  • Whether a feature “sucks”

The top three are pretty straightforward. The fourth is kind of magical. Like repeat business: if people try something once and never try it again, it probably sucks. If they keep using it and seem to engage more with the product afterward, it’s probably fine (gross over-simplification alert).

Collecting Data

Most software startups toss in Google Analytics and just have it silently puttering away in the background. GA is okay for broad trends, but it samples and over-counts by as much as 30% in my experience.

Segment is a data broker that you can hook into your frontend or backend to collect usage data. They’ll send that data off to other tools, but also let you store their data set in your own Redshift instance. This is a great way to bootstrap your event stream.

For the most precise numbers, you have to bake event tracking into your app’s backend. The backend is closest to the database, which is the ultimate source of truth for what happened in your application.
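To make that concrete, here is a minimal sketch of what baking event tracking into a backend might look like, using SQLite purely as a stand-in event store. The `track_event` helper and the schema are illustrative assumptions, not a prescription:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical event store; in production this might be Redshift or a queue.
conn = sqlite3.connect("events.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS events (
           user_id    TEXT,
           event      TEXT,
           properties TEXT,   -- JSON blob of metadata about the event
           ts         TEXT
       )"""
)

def track_event(user_id: str, event: str, **properties) -> None:
    """Record one user action with its metadata and a server-side timestamp."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (user_id, event, json.dumps(properties),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Called wherever the backend handles the action, e.g. after a successful signup.
track_event("user_42", "signed_up", plan="free", referrer="invite_email")
track_event("user_42", "sent_invite", channel="email")
```

The point isn’t the storage engine; it’s that every meaningful action writes one row of event name plus metadata to a store you control.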

What to Measure?

On a long enough time scale, the most important things to measure are those that make your company money. Facebook cares about users, because they want to serve ads to as many users as possible. Netflix cares about time spent watching content, because it means you won’t stop paying for the service.

At a b2b company with a direct sales team, this can be tricky. You can’t A/B test revenue: it’s too slow, and the sample sizes are always too small and biased. This is just an excuse, though; any PM should be able to come up with proxy metrics that they believe correlate to value.

Growth

Everyone cares about growth; these numbers tell you whether you can successfully get eyeballs. In consumer apps or “b2b2c” apps, growth is users. In a pure enterprise context, it’s customers. If you can’t grow, you’re done before you bother measuring anything else.

  • Users/customers — how many are coming in? where do they come from?
  • Retention — are people coming back 48 hours, one week, one month after joining? Different apps may care about different retention intervals.
  • Invites (k-factor) — how many invites are you getting out of each user? If you have multiple viral elements, which is most effective? (A rough k-factor calculation is sketched after this list.)
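Here is that rough k-factor calculation as a minimal pandas sketch; the invite log and the `converted` flag are hypothetical stand-ins for whatever your invite events actually look like:

```python
import pandas as pd

# Hypothetical event log: one row per invite, with whether it became a signup.
invites = pd.DataFrame({
    "inviter_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "converted":  [True, False, True, False, True, False],
})
active_users = 4  # users who could have invited someone this period

invites_per_user = len(invites) / active_users
conversion_rate = invites["converted"].mean()

# k-factor: new users generated per existing user. Above 1.0 is self-sustaining growth.
k_factor = invites_per_user * conversion_rate
print(f"invites/user={invites_per_user:.2f}, conversion={conversion_rate:.2f}, k={k_factor:.2f}")
```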

After that, I tend to break apps down into one of two categories: transactional or engagement. What an app cares about and how it measures it largely segments along those lines.

Transactional Metrics

examples: Expensify, Amazon, eBay Now

Transactional apps rely on funnel metrics. Every new feature should build the top of the funnel (sign-ups), speed people through the funnel, improve the chances of conversion, and/or inspire repeat business (retention).

  • Conversions — how many users are completing transactions in your app? How often are those users completing transactions?
  • Funnel — how many people reach each step of the conversion process? How did they get there? (concrete example: how well does Amazon search prepare users to purchase what they found?)

Engagement Metrics

examples: Facebook, Instagram, Twitter, *ahem* Yammer

Engagement apps live or die on users coming back and taking actions that engage other users. Being there adds value to the network, and a valuable network means you can sell ad space, sell the network to a company, etc.

The mechanics, and which actions matter most, vary wildly, but participation should ideally follow the 1% rule. Roughly:

  • 1% of users will create content
  • 9% will respond to the 1%, the classic example being editing or modifying that content
  • 90% will be total lurkers

If you’re building an app to engage, it’s best to figure out the most important actions your users can take. Typically, those are the actions that most correlate with retention, entice the 9% to respond, and give you triggers to engage users outside of your app.

It’s not hard to guess at the initial engagement metrics of Facebook:

  • Posting thread starters, photos
  • Replying, liking
  • Tagging photos

Core vs. Feature Metrics

Apps will eventually build to some mix of transactions and engagement, but the core experience has to prioritize one. Social games, for example, prioritize engagement, because engagement eventually correlates to invites and completion of many micro-transactions. That prioritization plus growth eventually define your “core” metrics.

Those core metrics, comprising everything you currently believe to lead to long term success, are therefore inviolate. A multivariate experiment that decreases k-factor for a social app is just suicidal in 99% of cases.

Too often PMs get tunnel vision, “This feature is important and good and usage of this feature is valuable unto itself.” They’ll want to measure feature use and downplay or skip out on measuring the impact to core metrics.

Feature metrics are great for that release, but must be validated against a consistent set of core metrics.

What Not to Measure

Analytics software providers will happily sell you tools that spit out a bunch of metrics that have nothing to do with long term success. Some examples:

  • Clicks — clicks aren’t always positive, sometimes clicking around signals user distress or confusion
  • Page view counts — this old chestnut led to all those listicles where you have to keep clicking to the next page. A painfully obvious perverse incentive.
  • Time on site — not terrible data in and of itself, but over-emphasized and also not objectively positive

It’s best to develop your core metrics before ever talking to a vendor for this reason.

Precision vs. Trends

focus more on the slope of the curve than the specific value of P

A note of caution here, particularly as you embark on the journey to becoming data driven: your numbers are probably wrong.

Don’t let this daunt you; a thousand users here or there isn’t going to sink the company. We’re not accountants.

A little wrongness is okay as long as your numbers are wrong in a consistent way. Most of what you care about are trends; as long as your wrong numbers still capture the curve, you can make the right decisions.

Insights

Once you have a data set you’re comfortable with, you can interrogate it. Ad hoc questions can empower other teams:

  • What behaviors correlate with paid upgrades?
  • Which customers haven’t logged in in a while?

They can also tell you what’s actually worth working on:

  • How many people view this page?
  • Should we build an Android app? (How many users are attempting to use the site with Android phones? A quick sketch of that question appears after this list.)
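As a sketch of how cheap these ad hoc questions become once you have an event stream, here’s the Android question answered in pandas; the pageview log and user agent strings are made up for illustration:

```python
import pandas as pd

# Hypothetical pageview log with raw user agent strings.
pageviews = pd.DataFrame({
    "user_id": ["u1", "u2", "u2", "u3", "u4"],
    "user_agent": [
        "Mozilla/5.0 (Linux; Android 8.0; Pixel 2)",
        "Mozilla/5.0 (Windows NT 10.0)",
        "Mozilla/5.0 (Linux; Android 7.1; SM-G950F)",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0)",
        "Mozilla/5.0 (Linux; Android 8.1; Nexus 5X)",
    ],
})

# Distinct users who hit the site from an Android device.
android_users = pageviews.loc[
    pageviews["user_agent"].str.contains("Android"), "user_id"
].nunique()
print(f"{android_users} of {pageviews['user_id'].nunique()} users visited from Android")
```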

Cohort Analysis

Ad hoc questions can also help you extend core metrics in a way that makes sense for your business. This is the much ballyhooed “aha moment.”

The technique for finding behaviors that lead to retention is referred to as a cohort analysis. Here’s a simple example:

Cohort analyses collapse groups that share some trait (the rows) and map them against a common time frame (the columns). The variable in this analysis is week joined, but cohorts can just as easily be feature usage, invite mechanism, middle name, whatever you hypothesize leads to higher rates of retention.
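Here is a minimal version of that kind of table built in pandas, using week joined as the cohort and weeks-since-joining as the time frame; the activity log and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical activity log: one row per (user, week they were active in).
activity = pd.DataFrame({
    "user_id":     ["a", "a", "a", "b", "b", "c", "c", "d"],
    "week_joined": [ 1,   1,   1,   1,   1,   2,   2,   2 ],
    "week_active": [ 1,   2,   3,   1,   2,   2,   3,   2 ],
})
activity["weeks_since_join"] = activity["week_active"] - activity["week_joined"]

# Rows: cohort (week joined). Columns: weeks since joining.
# Cells: share of the cohort still active that many weeks in.
cohort_size = activity.groupby("week_joined")["user_id"].nunique()
counts = (activity
          .groupby(["week_joined", "weeks_since_join"])["user_id"]
          .nunique()
          .unstack(fill_value=0))
retention = counts.div(cohort_size, axis=0)
print(retention)
```

Reading across a row tells you how quickly a cohort decays; comparing rows tells you whether whatever defines the cohort actually changes retention.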

Dashboards and Reporting

In order to have the whole company internalize the data that drives our business forward, I want it everywhere. I want our core metrics on screens around the office and every member of every team to be able to tell me what they all mean.

Vanity metrics

You have to be careful with reporting visualizations that make you feel good about your data. A cumulative number like “Total User Signups Over Time” is only ever going to go up. These “vanity metrics” may make you feel good, but they don’t tell you anything useful about the business.

There are only two honest visualizations: a line graph and a bar chart. Anything else is a sales pitch.

Funnel Visualization

Your product, whether social or transactional, is composed of funnels. Funnel visualization is thus an extension of reporting.

How many people, when confronted with a choice, do the thing you want them to? How many people fall out? It’s important not to obsess over funnel optimization, but a little hill climbing is healthy.
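A minimal funnel computation over an event stream might look like the sketch below; the step names and data are hypothetical, and the interesting output is the step-to-step conversion rate, which shows exactly where people fall out:

```python
import pandas as pd

# Hypothetical events recording which funnel steps each user reached.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "step":    ["search", "view_item", "purchase", "search", "view_item", "search"],
})

funnel_steps = ["search", "view_item", "purchase"]
reached = [events.loc[events["step"] == s, "user_id"].nunique() for s in funnel_steps]

for prev, curr, step in zip(reached, reached[1:], funnel_steps[1:]):
    print(f"{step}: {curr}/{prev} users ({curr / prev:.0%} of the previous step)")
# view_item: 2/3 users (67%), purchase: 1/2 users (50%) for this made-up data
```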

Multivariate Testing

The best metrics aren’t just a big dump of stats, and the best data people aren’t just statisticians; they’re economists. A site like Facebook isn’t just a bunch of people logging in and clicking on things, it’s an economy.

An island nation, if you will.

Okay, you’re the governor of Social Island and you’re facing a serious trade deficit. Your only hope is to open up your borders to tourism, to get people to stop spending money in other countries and spend it in yours.

How do you do that? How do you measure success of your programs?

Well, first you need to get some people coming. You launch a kickass PR campaign and get a few folks in the door. Cool, you’ve got some adventurous visitors, that’s a good start.

This is a modern age, though; direct marketing is only so helpful. You need these adventurous visitors to become advocates. You need them to bring their friends. So you want to track whether these people are telling their friends and how the next batch of tourists got there. Did they see their friends’ posts on Twitter, Instagram, Facebook, email? How do you encourage more of that kind of sharing?

Now, there are things to do on your island. You think they’re fun, but maybe there’s not enough to fill a week. Or maybe you’re not giving your visitors the feeling that they would miss something by not coming back.

So you introduce helicopter rides over Mt. Snowball.

Of course, you want to be scientific about this. If people come back for the helicopter, you want to make sure you’re capturing that. So you only hand out the flyers for the helicopter rides to 20% of visitors to Social Island. Do those people come back again soon? Do they even ride the helicopter or is it just an expensive hunk of junk sitting around that you have to fix all the time?

Regardless, the helicopter isn’t really the reason you want people to visit. You need cash. You need to understand the inflow of this cash. And you have to keep an eye on those people who got the helicopter flyers, make sure they’re at least not spending less cash on the day they’re going to the helicopter. Basically, if they’re on the helicopter, what aren’t they doing and is that other thing more important?

If people are coming back, telling their friends, spending more money, and riding the helicopter? It’s time to give those flyers to everyone.

So here’s your action plan, Governor. If you want to boost the economy on Social Island, you need:

  • People to show up
  • Repeat business
  • Referral business
  • For them to spend their money in the places that are most important for Social Island at any given time

And any new attraction you add needs to add to one of those without degrading the others (in as much as one is a priority for that phase of your tourism expansion).

What to Test

In a perfect world, you would test everything and every test would reach statistical significance. We don’t live in that world, though.

You can start with small samples, but understand that small samples are inherently biased. If you’re trying to make an inference about an app for billions, the thousands you’re testing on are unlikely to be representative of the whole.

Testing also has a cost: it impacts QA, code complexity, and the mental load on designers and PMs. A good heuristic for an early company is that a test should impact at least 30% of your user base until you’re beyond millions of daily active users.
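To get a feel for why small samples bite, a standard power calculation is a useful sanity check before launching a test. This sketch uses statsmodels to estimate how many users per variant you’d need to detect a lift from a 10% to an 11% conversion rate; the baseline, the lift, and the conventional 5% significance / 80% power settings are all assumptions you should adjust for your own product:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detecting a 10% -> 11% lift in a binary conversion rate.
effect = proportion_effectsize(0.10, 0.11)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_variant:,.0f} users needed in each variant")  # roughly 14-15 thousand
```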

How to Test

There are plenty of tools these days that can test simple variations. Changes in CSS (colors, fonts, etc.) or copy are exceedingly easy with these tools.

But most of the value will come from flow changes or larger rearchitecting of a given page. I have yet to find a tool that makes this easier without significantly complicating the process in some other way. By the time you build a flow change into Optimizely, you’ve done almost everything you needed to run the test without it. Just add treatment groups.

Frequentist Probability

A frequentist A/B test is what you would’ve learned in science class. There’s your product’s default state, the “control,” and the new thing(s) you want to build. You believe a variation of a certain type influences your core metrics, and you seek to understand how likely your data would be if the “null hypothesis,” which states that there’s no relationship between the two, were true.

Accordingly, the result of frequentist probability is both a scoring of the results and a rating of significance. The “p-value” is the probability of seeing data at least this extreme, given that the null hypothesis is true. This is referred to as the “significance” of your result.
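For a concrete, made-up example, scoring a two-variant test on a binary metric can be done with a two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical result: 600/10,000 treated users sent an invite vs 480/10,000 controls.
successes = [600, 480]
observations = [10_000, 10_000]

z_stat, p_value = proportions_ztest(successes, observations)
print(f"z={z_stat:.2f}, p={p_value:.4f}")
# Roughly z≈3.8, p≈0.0002 for this made-up data: comfortably below a pre-chosen 0.05 threshold.
```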

Most important in frequentist probability is pre-selecting the number of users you want to test with, your sample size. It’s counterintuitive, but p-values may rise and fall as the sample size grows. Running an experiment beyond its planned sample size and stopping when the numbers look good can make results appear significant when they shouldn’t. This is a form of “p-hacking,” which created a big scandal in the science community a few years ago.

The output of your test will look something like this:

Invites went up. Yay! Days engaged fell off a cliff. Boo! Test failed. Start over.

Bayesian Probability

Bayesian probability is newer in the field of digital products, mostly because of Moore’s Law and the continual reduction in compute costs. Bayesian probability doesn’t necessarily differ in experimental methodology, but in the calculation and analysis.

The big change at the start of a Bayesian multivariate test is the concept of a prior: earlier knowledge of, and expectations about, the outcome. Bayesians accept that batting averages will bear some relation to the previous 100 years of baseball.

On the backend, in lieu of a confidence interval, Bayesian probability gives us a “credible interval.” The point of the credible interval is to bound the expected results of the experiment out in the wild.
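Here is a minimal sketch of the Bayesian treatment of a single conversion rate, with all numbers hypothetical: a Beta prior encodes what you already believe, the observed data updates it, and the credible interval bounds where the true rate plausibly sits.

```python
from scipy.stats import beta

# Prior: conversion has historically hovered around 5%, encoded loosely as Beta(5, 95).
prior_a, prior_b = 5, 95

# Observed data from the treatment group (hypothetical).
conversions, users = 130, 2_000

# The posterior is also a Beta: add the observed successes and failures to the prior.
posterior = beta(prior_a + conversions, prior_b + users - conversions)

low, high = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for the conversion rate: {low:.3f} to {high:.3f}")
```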

Aggregates vs. Binaries

When testing, your results will fall into two broad buckets:

  • Aggregates — the number of times an action happened
  • Binaries — whether or not something happened

Neither measure is inherently good or bad, but binaries are a bit easier to reason about. The share of users in a treatment group who sent invites tells you more than one treatment group inviting 500 more people than the other, as a single user could have been responsible for 499 of those invites.
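That invite example in made-up numbers: the aggregate flatters one group, the binary tells the real story.

```python
# Hypothetical invite counts per user in each treatment group.
group_a = [499, 1, 0, 0, 0]   # one power user sent almost everything
group_b = [2, 1, 3, 0, 1]

# Aggregate: total invites sent. A looks ~70x better than B.
print(sum(group_a), sum(group_b))                     # 500 vs 7

# Binary: share of users who sent at least one invite. B is actually broader.
print(sum(n > 0 for n in group_a) / len(group_a),     # 0.4
      sum(n > 0 for n in group_b) / len(group_b))     # 0.8
```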

Overtesting

The point of testing isn’t to replace intuition. Google was famously trashed by a former designer for testing 41 shades of blue. At their scale, that test might’ve been totally reasonable.

But you’re not Google. Each option you add to your test will decrease the likelihood that your outcomes will reach statistical significance.

Local Maxima

I am going to make a bet with you. A million dollars says I can put a flag on the top of a taller object on planet Earth than you can.

Here are the ground rules:

  1. The competition will last approximately two years
  2. Once a flag is on the object, the other competitor can no longer put their flag there
  3. We both start in San Francisco with $10,000. We can borrow more, but then we don’t win as much from the bet.

What would be your strategy?

  • Would you race me to the highest building in San Francisco, the Transamerica Pyramid?
  • Would you hop a plane to Chicago to claim the Willis Tower?
  • You could go to New York for One World Trade, but you might start to worry about the budget and the ability to counter my follow-up move.

Two years is a long time to ride elevators.

All of these are kinda small potatoes, too. They’re certainly easy, but they’ll get pricey quick. I could stay in California and try to summit Mount Whitney. It’s 12,000 feet higher than any building on the planet. And then what do you do?

Do you just leverage yourself to the hilt and go for Everest?
I could certainly never counter it, but your odds are extremely long. If you fail, how do you determine failure? How do you identify it early enough that you can adjust, pick something smaller and still compete?

I get asked a lot whether understanding usage data curses Product Managers to forever test small optimizations.

from Local maxima and the perils of data-driven design

This journey is your product. You want to get as high as possible as fast as possible, but you also want to know when you’re wasting time and money on the wrong hill.

You do that with data. Hit base camp, then climb to Camp I; use data to prove that you’re making progress up that goddamn hill and then keep climbing.

If it looks like you’re leveling off, find another hill.

A Culture of Experimentation

It’s not easy for an organization used to command and control to transition to a culture of experimentation. It’s hard to admit that your ideas might be wrong, and much harder to convince curmudgeonly executives of the same.

It’s important, first, to understand cognitive bias. I have made research into bias a focus of my personal study, and it is truly a well of infinite depth. This Wikipedia page lists out known biases and bias families, and it just keeps growing.

the bias family tree, source

As you’ve likely noticed, I also teach in stories. One that stuck with me, though the source is long gone, is a study conducted at Google.

Google decided to check whether their very best PMs had better intuition than any other PMs in their organization. So they studied the experiments these PMs ran and whether they improved metrics, decreased them, or had no effect.

The result? Across the entire group, 1/3rd of the time the numbers went up, 1/3rd of the time they went down, and 1/3rd of the time, they were neutral.

The very best product managers at Google are wrong 67% of the time. It’s also important to note here that failure isn’t something to punish. Fear of failure means fear of risk, which leads to a stagnant product.

Lastly, take care of your engineers. They work hard to bring your idea to life. The success or failure of a multivariate test isn’t a judgement of their work; their success is in shipping, and the outcome of their hard work is either a new capability or a new insight.

Machine Learning

PMs should care about machine learning (ML) exactly as much as our users do, which is exactly as much as ML solves problems for them. ML, to that extent, is just another version of the “magical algorithm.”

Instead, think in terms of the major fields of machine learning and how they might fit into your toolkit.

  • Recommender Systems — recommender systems group like content, songs, items to purchase, etc. and develop cohorts of like users. If members of a cohort seem to like something that another member hasn’t tried, they recommend that content (a minimal sketch of this idea follows this list).
  • Natural Language Processing — most commonly known for bots and voice assistants. NLP is an aggregate of a number of different ML fields. NLP rolls up the technology stack of search, anti-spam, and neural networks.
  • Dimensionality Reduction — a little more esoteric, dimensionality reduction reduces variables in a large system. Optimizing for a given user cohort gets a lot easier when you can take a graph with 100 dimensions and reason about it in three.
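To make the recommender bullet concrete, here is a tiny user-based collaborative filtering sketch: find the most similar user by cosine similarity and recommend what they liked that you haven’t tried. The ratings matrix is invented, and real systems are far more sophisticated.

```python
import numpy as np

# Hypothetical user x item matrix: 1 means the user liked the item.
ratings = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def recommend(user, k=1):
    """Recommend items the most similar user liked but this user hasn't tried."""
    others = [o for o in range(len(ratings)) if o != user]
    sims = [cosine(ratings[user], ratings[o]) for o in others]
    neighbor = others[int(np.argmax(sims))]
    candidates = np.where((ratings[neighbor] == 1) & (ratings[user] == 0))[0]
    return candidates[:k]

print(recommend(0))  # user 1 is most like user 0 and also liked item 2 -> [2]
```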

Overfitting

a sample of an over-fit curve

One thing to look out for in ML is overfitting. Like our biased sample size above, a learning process that fits the data too closely may include or exclude some critical value.

It’s pretty easy to combat overfitting. In development, you hold back half of your data: have the algorithm learn on one half and attempt to predict the other half.

This is the substance of Kaggle competitions. I highly recommend their tutorials for understanding the data science toolkit.
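Here’s that holdout idea as a minimal scikit-learn sketch; the synthetic data and the choice of model are just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your labeled data.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Hold back half of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# A big gap between these two scores is the classic signature of overfitting.
print("train accuracy:", model.score(X_train, y_train))
print("holdout accuracy:", model.score(X_test, y_test))
```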

Data “Driven?”

You might, at this point, be thinking, “Drew, you talked a lot about intuition, choices, company goals, and how they’re enhanced and validated by data. But it doesn’t seem like data drove anything.” You’re right. Data driven is a bit of a misnomer. You’re the driver, data’s just the engine.

It’s more accurate to say that I believe in an organization that is data informed. But, honestly, arguments over semantics are most often initiated by people who want to go back to making gut decisions.

Carry the data torch proudly into dark places, don’t worry about what it’s called.
