Are you doing data science, or #DataScience?

#DataScience is everywhere — it’s eating the world, promising to solve all of your business problems with a little data and a lot of technology. Unfortunately, it looks like actual data science is being lost somewhere in the hype.

I apologize in advance for picking on a good company (Google) and a pretty awesome product (TensorFlow). But I have to, because holy cow did they do something really, really bad in a marketing video. Worse, they pitched it as an example of the great things you can do with #MachineLearning (a wholly owned subsidiary of #DataScience).

Their analysis starts with a simple question:

If you knew what happened in the London markets, how accurately could you predict what will happen in New York? It turns out, this is a great scenario to be tackled by machine learning!
The premise for this problem is that by following the sun and using data from markets that close earlier, such as London that closes 4.5 hours ahead of New York, you could more accurately predict market behaviors 7 out of 10 times.

That’s quite a provocative claim! Did AI just uncover a massive market inefficiency to exploit? Is it time to let the robots take over trading? Let’s dig a little deeper.

We’ll start with the data. The two time series they use for this analysis are (a) the change in a UK stock index from one day to the next, and (b) the change in a US stock index from one day to the next.

Oh boy. I guess we don’t need to dig deeper after all.

Think about this for a second: both series measure close-to-close changes, so most of today’s US move overlaps in time with today’s UK move. By the time the UK market closes today and you have the data you need to make your prediction about the US market, can you still execute in the US market at yesterday’s closing price? If you can find someone who’ll let you bet on yesterday’s horse race, you definitely don’t need TensorFlow to tell you what to do!
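To put a number on the horse-race problem, here is a minimal sketch that lines up the two close-to-close windows on a common clock. The closing times (16:30 in London, 16:00 in New York), the sample date, and the helper names `close_to_close` and `overlap_hours` are my own assumptions for illustration; none of them come from the video.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

# Assumed regular closing times: 16:30 London, 16:00 New York.
LONDON = ZoneInfo("Europe/London")
NEW_YORK = ZoneInfo("America/New_York")


def close_to_close(day, hour, minute, tz):
    """Return (previous day's close, this day's close) as aware datetimes."""
    end = datetime(day.year, day.month, day.day, hour, minute, tzinfo=tz)
    return end - timedelta(days=1), end


def overlap_hours(window_a, window_b):
    """Hours during which two (start, end) windows overlap."""
    start = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return max((end - start).total_seconds(), 0.0) / 3600


day = date(2024, 1, 17)  # arbitrary midweek day, so weekends stay out of the picture
uk = close_to_close(day, 16, 30, LONDON)
us = close_to_close(day, 16, 0, NEW_YORK)

shared = overlap_hours(uk, us)
# Time between the UK close and the US close: the only part you could still trade.
still_tradable = (us[1] - uk[1]).total_seconds() / 3600

print(f"Hours shared by the two daily changes: {shared}")
print(f"Hours of US move left after the UK close: {still_tradable}")
```

Under these assumptions the two "daily changes" share 19.5 of their 24 hours, and only the last 4.5 hours of the US window lie in the future when the UK number becomes known. A model fed these series is mostly being asked to predict something that has already happened.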

You may think that I’m being too harsh on a silly bit of marketing, or that my criticism is irrelevant to the type of data your business analyzes. But consider: if it’s so easy to get the analysis wrong on something as straightforward as comparing two time series of financial data, and so hard for perfectly smart people to see the error (no Google employee involved in this marketing piece seems to have spotted the issue), how confident are you that the analysis driving your next marketing campaign is sound, or that your latest A/B test drew the right conclusion?

If your company is trying to make decisions supported by data, don’t assume it’s being done right just because the tools are shiny and the words are fancy! Take the time to verify that you are doing real data science, not #DataScience:

  • #DataScience promises amazing results. Real data science promises nothing.
  • #DataScience starts by formulating an answer. Real data science takes its time forming the question.
  • #DataScience plays with the data that’s handy. Real data science seeks out the data that’s needed.
  • #DataScience has a slick UI. Real data science is an awkward mess.
  • #DataScience adapts every problem to its framework. Real data science finds the right framework for every problem.
  • #DataScience has answers anytime you want them. Real data science has answers only when it finds them.
  • #DataScience has no memory beyond the last analysis. Real data science is accountable for the past.
  • #DataScience is a sprint. Real data science is a marathon.
  • #DataScience is flashy. Real data science is dull.

The more accessible and ubiquitous tools like TensorFlow become, the easier it is to produce #DataScience. And the more #DataScience is produced, the harder it becomes for the lay person to distinguish between good and bad data analysis.

The marketing behind these tools is designed to make you believe that all you need to do is line up a couple of time series, click a button and presto! Business magic. What they don’t tell you is that no algorithm, no matter how clever, can overcome a crippling data deficit. Returning to the Google video that motivated this whole rant, answering the question correctly would require intraday price data (which is hard to find and expensive to buy), sampled just after the UK close (which is tricky to flag because of non-synchronized daylight saving calendars, holidays, etc.), and adjusted for a variety of factors (currency movements, execution slippage, and so on).
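As a small taste of why "just after the UK close" is tricky to flag, the sketch below converts an assumed 16:30 London close into New York time on a few dates. The closing time, the sample dates, and the function name `uk_close_in_new_york` are illustrative assumptions, and the sketch ignores holidays and early closes entirely.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")
NEW_YORK = ZoneInfo("America/New_York")


def uk_close_in_new_york(day, hour=16, minute=30):
    """Express the (assumed) UK closing time on `day` in New York local time."""
    uk_close = datetime(day.year, day.month, day.day, hour, minute, tzinfo=LONDON)
    return uk_close.astimezone(NEW_YORK)


# Two ordinary dates, plus two that fall in the weeks when only one of the
# two countries has switched its clocks.
sample_days = (
    date(2024, 1, 17),
    date(2024, 3, 20),
    date(2024, 7, 17),
    date(2024, 10, 30),
)

for day in sample_days:
    local = uk_close_in_new_york(day)
    print(day.isoformat(), "UK close =", local.strftime("%H:%M"), "in New York")
```

For most of the year the UK close lands at 11:30 in New York, but in the gap weeks of March and late October it lands at 12:30. A pipeline that hardcodes a single timestamp would quietly sample the wrong US price on those days.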

That’s hard. Worse, it’s boring. Who wants to do something boring like write scripts to normalize data when it’s so much fun to play around with AI?

It has always been easy to mistake mathematical gyrations for thoughtful analysis. “Garbage in, garbage out” may be a cliché, but it’s a cliché for a reason. What’s changed recently is that the barriers to performing those gyrations have been lowered to the point where almost anyone can be a #DataScientist.

Don’t be fooled. Don’t let #DataScience kill real data science inside your organization.

Written by Peter Bonney.