How Google’s financial predictor predicts the PAST.

Yesterday, Google’s TensorFlow team published a nice article describing how you can build a good predictor of the US stock market: TensorFlow Machine Learning with Financial Data on Google Cloud Platform

In their own words, their solution will:

“Use TensorFlow to build, train and evaluate a number of models for predicting what will happen in financial markets”

They clearly put a lot of effort into this article: an interactive IPython notebook, two professionally edited videos, multiple pages on their website, and blog posts. And the public reaction was a unanimous “Wow!”: on Twitter, in the comments, in the media.

The sad news is that this model doesn’t predict the future of these markets. It predicts mostly the past, with no practical use for trading.

(just a side note: English is not my native language, sorry for the mistakes in this article)

Too good to be true!

Citing Google’s conclusion:

Finally, how did we do with the data analysis? We did well: over 70% accuracy in predicting the close of the S&P 500 is the highest we’ve seen achieved on this dataset, so with few steps and a few lines of code we’ve produced a full-on machine learning model. The reason for the relatively modest accuracy achieved is the dataset itself; there isn’t enough signal there to do significantly better. But 7 times out of 10, we were able to correctly determine if the S&P 500 index would close up or down on the day, and that’s objectively good.

Reading this, I immediately thought: there’s no way this can be true. If someone could predict the direction of the S&P market 7 times out of 10, they would be immensely rich in just a few months. You could literally make billions out of nowhere with this miraculous predictor.

So I opened their IPython notebook and (while holding a sick baby in my arms, standing at my computer to read the article while soothing the baby) just 10 minutes later I saw it: a huge red flag, a mistake you should never make when dealing with financial predictors.

It’s data leakage, sometimes called time travelling. There are many ways to leak data that your machine learning model shouldn’t know into the model. In this case, it was actually quite simple.

So where’s the problem?

Don’t worry: no programming skills, financial market knowledge or experience with deep learning libraries is necessary to understand what happened. The mistake is… well… embarrassingly simple and I was really surprised Google published that article at all.

They have a value called “snp_0”, which is basically the difference between today’s closing price of the S&P market and yesterday’s closing price.

And then they have input variables, such as “ftse_0”, which is the same thing for the London market: again, the difference between today’s closing price and yesterday’s closing price. The London market is slightly ahead in time: it closes 4.5 hours before the S&P market closes.

The TensorFlow model tries to predict just the direction of snp_0 (a positive number means the S&P market rose, a negative number means it declined) from inputs such as the “ftse_0” described above.
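To make these two definitions concrete, here is a minimal sketch in plain Python. It takes my simplified description at face value (a plain difference of closing prices; the notebook may scale or transform the values differently), and the prices and the helper names `daily_change` and `direction` are made up for illustration.

```python
# Hypothetical closing prices, invented for this illustration only.
snp_close = [2000.0, 2010.0, 2004.0, 2004.0, 2021.0]


def daily_change(closes):
    """Difference between each day's close and the previous day's close."""
    return [today - yesterday for yesterday, today in zip(closes, closes[1:])]


def direction(change):
    """1 if the market closed up on the day, 0 otherwise."""
    return 1 if change > 0 else 0


# One value per day, starting from the second day in the list.
snp_0 = daily_change(snp_close)

# This up/down label is what the model is asked to predict.
labels = [direction(change) for change in snp_0]

print(snp_0)
print(labels)
```

The same `daily_change` helper applied to London closing prices would give the “ftse_0” input.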

Got it?

I mean: do you already see that blunder?

If not, just reread the previous few paragraphs carefully and think about it. I’ll wait.

And for the people who need an explicit explanation:

  • you’ll have all the input data when the London market closes.
  • you put it into your shiny TensorFlow model and it predicts the direction of the S&P market
  • so now you want to place your market order (buy or short the stock market based on the model’s prediction)
  • you’re missing only one small detail: your time machine!
  • because you need to go 19.5 hours back in time to actually place your order (the snp_0 variable covers a 24-hour period, and without a time machine we’re already in the last 4.5 hours of that period)
  • (and no, placing the order now is not a good strategy. The model doesn’t predict what will happen in those last 4.5 hours. This information is actually not even in the training data.)
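The effect of the steps above can be reproduced with a toy simulation. This is only a sketch on synthetic data, not Google’s model and not real market prices. I assume each day has one shared “global” move that both markets feel plus some independent noise, and that days are independent of each other. The “predictor” is the crudest possible rule: guess that the S&P moves in the same direction as the London feature. The names `NOISE` and `sign_agreement` are mine.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_DAYS = 5000
NOISE = 0.5  # assumed size of each market's own noise relative to the shared move

# Each day has one shared move that both markets feel, plus local noise.
shared = [random.gauss(0, 1) for _ in range(N_DAYS)]
ftse_0 = [g + NOISE * random.gauss(0, 1) for g in shared]
snp_0 = [g + NOISE * random.gauss(0, 1) for g in shared]


def sign_agreement(feature, target):
    """Fraction of days where the feature and the target have the same sign."""
    hits = sum((f > 0) == (t > 0) for f, t in zip(feature, target))
    return hits / len(target)


# Leaky setup: today's ftse_0 "predicts" today's snp_0.
leaky_accuracy = sign_agreement(ftse_0, snp_0)

# Honest setup: pair today's snp_0 with yesterday's ftse_0, the latest value
# that is fully known before today's 24-hour S&P window starts.
honest_accuracy = sign_agreement(ftse_0[:-1], snp_0[1:])

# Share of the 24-hour snp_0 window that is already over when London closes.
overlap_share = (24 - 4.5) / 24

print(f"leaky accuracy:  {leaky_accuracy:.2f}")
print(f"honest accuracy: {honest_accuracy:.2f}")
print(f"window overlap:  {overlap_share:.4f}")
```

In this toy world the leaky setup scores far above a coin flip and the honest one does not, simply because the input and the target describe mostly the same hours. Real markets are messier, so treat the exact numbers as an artifact of my assumptions.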

The aftermath

I tweeted about the data leakage yesterday:

And notified the Google guys:

Then I waited almost a day to see whether they would find the error themselves and fix it (or retract the article altogether).

They didn’t, so I emailed them a specific explanation of the error. Half an hour later Corrie acknowledged the problem and promised they would address it over the weekend.

I hope they will fix the article thoroughly and the model will no longer leak any data through time. I’m also quite sceptical that the fixed model will have any predictive power at all.

Final words

TensorFlow is a really nice library/framework.

But your beautiful deep learning library will not save you if you prepare your data incorrectly.

Unless you have a time travel machine, obviously. In that case, go ahead, use Google’s financial predictor and prosper!