Big Data Follies #0

Do you have the data you need?

Published in Big Data Follies, Oct 14, 2013

In my professional digs at Luminoso, I spend a lot of time trying to solve clients’ business challenges by leveraging their data. This series will chronicle lessons learned to help executives considering a big data project think about what their data can or cannot do for them. First up: do you have the data you need?

Say you want to predict which of your new products will be successful. At a minimum, you need a measurement of product success (sales, likes, impressions, etc.) and some pre-launch data about the products in question. Data on more products will improve your results, but without both of those pieces you can’t draw many conclusions at all. A surprising number of executives undertake data projects like this without the data in hand. A version of the information may be floating around somewhere in the organization, but whoever is actually doing the project may have to dig to find it. The effort and the accompanying delay get worse if the digger is a consultant who has to rely on internal champions to get the information. Either way, the project won’t even get started until all the relevant data makes its way into the correct hands.

A different form of the same problem is having data that’s like the data you need but isn’t quite on point. This hurts more in predictive work than in exploratory analysis. If you only have a proxy for the variable you want to predict (website visits when you need conversions) or data at too coarse a grain (sales aggregated by month rather than by day), you can probably do something, but you need to downgrade your expectations of success. You don’t have the data to answer the question you’re most interested in, so you’ll have to settle for answering a possibly less relevant question, and the answer to that question may not give you the business outcome you were hoping for. You may not be able to detect the phenomenon of interest at all, because those big spikes in website visits convert at a low rate, or because the sales spikes you want to predict with your monthly data are over within a week.

Some guidelines on having the right data:

Before and after

If you want to measure outcomes of some event (like a marketing campaign), you need data from both before and after the event in order to do a comparison.
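
For the technically inclined, here is a minimal sketch of a before-and-after comparison in Python. The function name `before_after_lift`, the sales figures, and the campaign date are all made up for illustration. It only compares averages on either side of the event, so it says nothing about whether the event caused the change.

```python
from datetime import date
from statistics import mean


def before_after_lift(daily_values, event_date):
    """Return (mean_before, mean_after, relative_lift) around event_date.

    daily_values maps a date to a metric value. Days on or after
    event_date count as "after". Raises ValueError if either side
    of the event has no data, since no comparison is possible then.
    Assumes the "before" average is not zero.
    """
    before = [v for d, v in daily_values.items() if d < event_date]
    after = [v for d, v in daily_values.items() if d >= event_date]
    if not before or not after:
        raise ValueError("need data from both before and after the event")
    mean_before = mean(before)
    mean_after = mean(after)
    return mean_before, mean_after, (mean_after - mean_before) / mean_before


# Hypothetical daily sales around a campaign that starts on Oct 4.
sales = {
    date(2013, 10, 1): 100,
    date(2013, 10, 2): 110,
    date(2013, 10, 3): 90,
    date(2013, 10, 4): 130,
    date(2013, 10, 5): 120,
}
campaign_start = date(2013, 10, 4)

print(before_after_lift(sales, campaign_start))
```

If the data only starts on the day the campaign launched, the function refuses to produce a number at all, which is the point of this guideline.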

Cause and effect

If you want to predict an outcome (like sales) from inputs (like marketing budgets), you need data on both the inputs and the outcomes.

Measure the right thing

If you want to predict sales and you have sales data, great. If you want to predict sales and you only have data on visits, not so great.

Measure often enough

If you want to understand phenomena that start and finish over hours, make sure you have hourly data. Ditto days, weeks, or seconds.
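
To see why granularity matters, here is a small Python sketch using invented numbers: a flat baseline of 100 sales per day across October and November, with a hypothetical five-day spike to 300 per day in mid-November. The same data is then rolled up by month, the way it might arrive from a reporting system.

```python
from collections import defaultdict
from datetime import date, timedelta

# Invented daily sales: Oct 1 through Nov 30, flat at 100 per day.
start = date(2013, 10, 1)
daily = {start + timedelta(days=i): 100 for i in range(61)}

# A hypothetical five-day spike to 300 per day, Nov 11 through Nov 15.
for i in range(5):
    daily[date(2013, 11, 11) + timedelta(days=i)] = 300

# Roll the same data up by month.
monthly_total = defaultdict(int)
days_in_data = defaultdict(int)
for day, value in daily.items():
    key = (day.year, day.month)
    monthly_total[key] += value
    days_in_data[key] += 1

# Average sales per day within each month.
monthly_avg = {k: monthly_total[k] / days_in_data[k] for k in monthly_total}

print("peak daily value:", max(daily.values()))
for key in sorted(monthly_avg):
    print(key, "total:", monthly_total[key], "avg per day:", round(monthly_avg[key], 1))
```

In the daily series the spike is a tripling. In the monthly rollup, November simply looks about a third better than October per day, and nothing in the monthly figures says whether that came from one intense week or a modest month-long improvement.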

Will it regress?

If you want to derive a relationship for prediction purposes, you need many examples of your phenomenon having happened in the past. It doesn’t have to be “big data,” but at least a few tens of examples would be in order.
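
A quick way to check this guideline together with “Cause and effect” is to count how many past examples have both the inputs and the outcome recorded. The Python sketch below does that. The helper `usable_examples`, the field names, the product records, and the threshold of 30 (one possible reading of “a few tens”) are all assumptions made for illustration.

```python
def usable_examples(records, input_keys, outcome_key):
    """Return the records that have a value for every input and the outcome."""
    needed = list(input_keys) + [outcome_key]
    return [r for r in records if all(r.get(k) is not None for k in needed)]


# Assumption: 30 stands in for "a few tens" of examples.
MIN_EXAMPLES = 30

# Hypothetical product launch records; None marks a missing value.
launches = [
    {"product": "A", "marketing_budget": 50_000, "sales": 1200},
    {"product": "B", "marketing_budget": 20_000, "sales": None},  # no outcome
    {"product": "C", "marketing_budget": None, "sales": 800},     # no input
    {"product": "D", "marketing_budget": 75_000, "sales": 3100},
]

usable = usable_examples(launches, ["marketing_budget"], "sales")
enough = len(usable) >= MIN_EXAMPLES

print(f"{len(usable)} usable of {len(launches)} records; enough to fit a model: {enough}")
```

Running a count like this before the project starts shows how much of the data on hand can actually be used, which is often less than the raw record count suggests.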

So there you go: the zeroth step of solving problems with data is to have data applicable to your problem. Next time, we’ll talk about Science: does your relationship really exist?