Data Pitfalls: the Startup Edition

Monica Rogati
May 11, 2016

(A version of this article originally appeared in the Lean Analytics book, which I recommend based on your click history.)

Startups love data. Over the past few years, they’ve become better and better at collecting it at scale, analyzing it, and building data products. As data and its role grow and evolve, a few common pitfalls emerge:

1. Assuming the data is clean: Cleaning the data you capture is often most of the work, and the simple act of cleaning it up can often reveal important patterns. Is an instrumentation bug causing 30% of your numbers to be null? Do you really have that many users in the 90210 zip code, or born in 1900? Check your data at the door to be sure it’s valid and useful.
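
A quick check "at the door" could look something like the sketch below. The `audit_column` helper and the sample birth years are hypothetical, made up for illustration; the idea is simply to measure the null rate and see which values show up suspiciously often before trusting a column.

```python
from collections import Counter

def audit_column(values, suspicious=(), top_n=3):
    """Summarize the null rate and over-represented values in one column.

    `suspicious` lists values that often signal a form default or a bug
    (for example a birth year of 1900). Hypothetical helper, not a library API.
    """
    total = len(values)
    if total == 0:
        return {"null_rate": 0.0, "top_values": [], "suspicious_share": {}}
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "null_rate": (total - len(non_null)) / total,
        "top_values": counts.most_common(top_n),
        "suspicious_share": {s: counts[s] / total for s in suspicious},
    }

# Made-up sample: birth years collected from a signup form.
birth_years = [1985, 1900, None, 1992, 1900, 1900, None, 1978, 1900, None]
report = audit_column(birth_years, suspicious=(1900,))
print(report)
```

In this made-up sample, 3 of 10 values are null and 4 of 10 are 1900, which is exactly the kind of thing worth asking about before any analysis starts.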

2. Not normalizing: Is Chicago the most popular wedding destination? Technically, yes, if you simply count the number of people flying in for a wedding. But if your goal is to compile a list of “popular wedding destinations”, you need to normalize by the total number of people flying in; otherwise, your list will consist of airline hubs and big cities.
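
Here is a minimal sketch of the difference between ranking by raw count and ranking by share. The cities' numbers are invented for illustration, and `rank_wedding_destinations` and its `min_total` cutoff are hypothetical names, not anything from a real dataset.

```python
def rank_wedding_destinations(arrivals, min_total=1000):
    """Rank cities by the share of arrivals that are wedding travel.

    `arrivals` maps city -> (wedding travelers, all travelers).
    Cities with fewer than `min_total` arrivals are skipped, because a
    ratio computed on very little data is unstable.
    """
    rows = []
    for city, (wedding, total) in arrivals.items():
        if total < min_total:
            continue
        rows.append((city, wedding / total))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented numbers, for illustration only.
arrivals = {
    "Chicago": (9000, 3000000),
    "Las Vegas": (8000, 400000),
    "Maui": (3000, 60000),
}

by_count = sorted(arrivals, key=lambda city: arrivals[city][0], reverse=True)
by_share = [city for city, _ in rank_wedding_destinations(arrivals)]
print("By raw count:", by_count)
print("By share:    ", by_share)
```

With these invented numbers the raw count puts the big hub first, while the normalized ranking reverses the order.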

3. Excluding outliers: Those 21 people using your product more than a thousand times a day? Either they’re your biggest fans, or bots crawling your site for content. Whichever they are, ignoring them would be a mistake. (50 years ago pulsars were discovered by not ignoring noise in the data.)

4. Including outliers: While those 21 people using your product a thousand times a day are interesting from a qualitative perspective, because they can show you things you didn’t expect, they can be problematic when building models. You probably want to exclude them when building data products; otherwise, the “…you may also like…” feature on your site will have the same items everywhere. At LinkedIn, we called this the “Obama effect”.
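
One way to honor both pitfalls 3 and 4 is to split the heavy users off rather than drop them: one set goes to a human for a closer look, the other feeds the model. This is a rough sketch; `split_heavy_users`, the threshold of 1000, and the usage numbers are assumptions for illustration.

```python
def split_heavy_users(daily_actions, threshold=1000):
    """Separate accounts above a daily-action threshold from the rest.

    Returns (heavy, typical). The heavy set is kept for inspection
    (fans or bots?), while the typical set is what a model would train on.
    """
    heavy = {u: n for u, n in daily_actions.items() if n > threshold}
    typical = {u: n for u, n in daily_actions.items() if n <= threshold}
    return heavy, typical

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Made-up usage counts per user for one day.
daily_actions = {"u1": 12, "u2": 7, "u3": 5000, "u4": 9, "u5": 20}
heavy, typical = split_heavy_users(daily_actions)

print("Look at these by hand:", heavy)
print("Mean with outliers:   ", mean(daily_actions.values()))
print("Mean without outliers:", mean(typical.values()))
```

A single account moves the average from 12 to over 1000 here, which is the same mechanism that lets a handful of accounts dominate a recommendation feature.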

5. Ignoring seasonality: “Whoa, is ‘Intern’ the fastest growing job this year? Oh, wait, it’s June.” Failure to consider time of day, day of week, and monthly changes when looking at patterns leads to questionable decision-making.
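
A simple guard is to compare against the same period a year earlier instead of the previous month. The sketch below uses invented monthly counts of new “Intern” titles; `pct_change` and the data layout are assumptions for illustration.

```python
def pct_change(current, previous):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100

# Invented counts of new "Intern" job titles, keyed by (year, month).
intern_titles = {
    (2015, 5): 400, (2015, 6): 2000,
    (2016, 5): 420, (2016, 6): 2100,
}

# Month over month: June vs. May of the same year (dominated by seasonality).
month_over_month = pct_change(intern_titles[(2016, 6)], intern_titles[(2016, 5)])
# Year over year: June vs. the previous June (seasonality mostly cancels out).
year_over_year = pct_change(intern_titles[(2016, 6)], intern_titles[(2015, 6)])

print(f"Month over month: {month_over_month:+.0f}%")
print(f"Year over year:   {year_over_year:+.0f}%")
```

In this toy data the month-over-month jump looks spectacular, while the year-over-year change is modest; the second number is the one that says something about the trend.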

6. Ignoring size when reporting growth: (An investor favorite!) Context is critical. When you’ve just started, technically, your dad signing up just doubled your user base.
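
One low-tech fix is to never report a growth percentage without the base it was computed on. A minimal sketch, where `describe_growth` is a hypothetical helper and the user counts are made up:

```python
def describe_growth(previous, current):
    """Format growth as a percentage together with the absolute numbers."""
    added = current - previous
    if previous == 0:
        # A percentage is undefined when starting from zero.
        return f"{added:+d} users (from zero; percentage undefined)"
    pct = added / previous * 100
    return f"{pct:+.0f}% ({previous} -> {current}, {added:+d} users)"

print(describe_growth(1, 2))            # your dad signed up
print(describe_growth(100000, 110000))  # a much smaller percentage, a much bigger deal
```

Seeing "+100%" next to "+1 users" makes the context hard to miss.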

7. Data vomit: A dashboard isn’t much use if you don’t know where to look and what to do next.

8. Metrics that cry wolf: (A DevOps favorite!) You want to be responsive, so you set up alerts to let you know when something is awry in order to fix it quickly. But if your thresholds are too sensitive, they get “whiny”, and you’ll start to ignore them.
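
One common way to make an alert less whiny is to fire only when the threshold has been breached several times in a row. The sketch below assumes evenly spaced samples; `sustained_alerts`, the 500 ms threshold, and the latency values are all made up for illustration.

```python
def sustained_alerts(samples, threshold, min_consecutive=3):
    """Return the indices at which an alert fires.

    An alert fires once per episode, at the sample where the threshold
    has been exceeded `min_consecutive` times in a row.
    """
    fired = []
    streak = 0
    for i, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == min_consecutive:
                fired.append(i)
        else:
            streak = 0
    return fired

# Made-up latency samples in milliseconds.
latency_ms = [120, 510, 130, 140, 520, 530, 560, 600, 150, 505]

naive = [i for i, v in enumerate(latency_ms) if v > 500]
sustained = sustained_alerts(latency_ms, threshold=500)

print("Naive alert count:    ", len(naive))
print("Sustained alert count:", len(sustained))
```

The trade-off is a slower reaction to real incidents, so the value of `min_consecutive` is a judgment call rather than a fixed rule.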

9. The “Not Collected Here” syndrome: Mashing up your data with data from other sources can lead to valuable insights. Do your best customers come from zip codes with a high concentration of sushi restaurants? This might give you a few great ideas about what experiments to run next — or even influence your growth strategy. At Jawbone, we looked at how weather affects activity levels, and applied those lessons to making the product better.
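
A mash-up like the sushi example can start as a plain lookup join. In this sketch the customer records, the external restaurant counts, the `average_spend_by_group` helper, and the cutoff of 10 are all hypothetical; customers whose zip code is missing from the external source are kept in their own group rather than silently dropped.

```python
def average_spend_by_group(customers, sushi_per_zip, cutoff=10):
    """Average customer spend, grouped by sushi-restaurant density of the zip.

    `customers` is a list of (zip_code, spend) pairs from your own data.
    `sushi_per_zip` maps zip_code -> restaurant count from an outside source.
    """
    totals = {"high": [0.0, 0], "low": [0.0, 0], "unknown": [0.0, 0]}
    for zip_code, spend in customers:
        count = sushi_per_zip.get(zip_code)
        if count is None:
            group = "unknown"  # zip not covered by the external data
        elif count >= cutoff:
            group = "high"
        else:
            group = "low"
        totals[group][0] += spend
        totals[group][1] += 1
    return {g: (s / n if n else None) for g, (s, n) in totals.items()}

# Made-up data for illustration.
customers = [("94103", 120.0), ("94103", 80.0), ("73301", 30.0), ("00000", 50.0)]
sushi_per_zip = {"94103": 25, "73301": 2}

averages = average_spend_by_group(customers, sushi_per_zip)
print(averages)
```

A gap between the groups is a hypothesis, not a conclusion; it is a pointer to which experiment to run next.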

10. Focusing on noise: We’re hardwired (and then programmed) to see patterns where there are none. It helps to set aside the vanity metrics, step back and look at the bigger picture. After all, if you’re building something where there was nothing before, you’re making it infinitely better.

TL;DR: When it comes to data, it’s important to ask the right questions.

Monica Rogati

Data Science advisor. Turning data into products and stories. Former VP of Data @Jawbone & @LinkedIn data scientist. Equity partner @DCVC. CMU CS PhD.