Can train/test split help standard econometrics?

Ilia Karmanov
Mar 23, 2018


Imagine we are presented with the following dataset:

Our dependent variable is the price of coffee (“coffee_p”), and we are trying to explain it with a dummy for some event (perhaps a period of firm collusion) and the price of oil (“oil_p”). Including oil prices may seem a bit strange, but perhaps not: maybe they proxy for the cost of raw materials, energy, transport, and so on.

Let’s run a linear regression:

import statsmodels.api as sm

y = samples_df['coffee_p'].values
X = samples_df[['event_dummy', 'oil_p']].values
mod = sm.OLS(y, X)  # note: sm.OLS does not add an intercept by default
res = mod.fit()
print(res.summary())

Our regression summary:

Great! The p-values suggest our results are strongly significant at the 0.001 level, and the high R-squared suggests our data explains much of the variation in the dependent variable.

But this is the true data-generating process:

import numpy as np
import pandas as pd

obs = 1000  # sample size is assumed here; the original post does not show it

samples_df = pd.DataFrame({
    # both "prices" are random walks: cumulative sums of uniform noise
    'coffee_p': np.cumsum(np.random.uniform(-1, 1, obs)),
    'oil_p': np.cumsum(np.random.uniform(-1, 1, obs)),
    # the "event" spans the middle half of the sample
    'event_dummy': np.concatenate(
        ([0] * int(obs * 1/4),
         [1] * int(obs * 1/2),
         [0] * int(obs * 1/4)), axis=0)})
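Plotting the two simulated series shows two drifting random walks (a minimal sketch, assuming matplotlib, which the original post does not include):

import matplotlib.pyplot as plt

samples_df[['coffee_p', 'oil_p']].plot()  # two drifting, non-stationary series
plt.show()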

The data is just noise…
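We can also see the problem in the residuals of the original fit; statsmodels exposes the Durbin-Watson statistic directly (a sketch, using the residuals res.resid from the regression above):

from statsmodels.stats.stattools import durbin_watson

durbin_watson(res.resid)  # values near 0 indicate strong positive serial correlation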

The Durbin-Watson statistic is nearly 0, which suggests very strong serial autocorrelation, violating the assumptions of our OLS regression. To fix this we can take first-differences:

# first-difference the prices; the dummy is simply realigned to match
samples_df_fd = pd.DataFrame({
    'coffee_p': samples_df[['coffee_p']].diff()[1:].values.squeeze(),
    'oil_p': samples_df[['oil_p']].diff()[1:].values.squeeze(),
    'event_dummy': samples_df[['event_dummy']][1:].values.squeeze()})

This time we can see our data is stationary: the differenced series fluctuate around zero rather than wandering off.
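We can also check stationarity formally (a minimal sketch using the Augmented Dickey-Fuller test from statsmodels; not part of the original post):

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(samples_df_fd['coffee_p'])
print(adf_stat, p_value)  # a small p-value rejects the unit root, supporting stationarity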

Re-running the regression now gives the expected result: the explanatory variables are just random noise and explain nothing:
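Concretely (a sketch mirroring the earlier fit, run on the differenced data):

y_fd = samples_df_fd['coffee_p'].values
X_fd = samples_df_fd[['event_dummy', 'oil_p']].values
res_fd = sm.OLS(y_fd, X_fd).fit()
print(res_fd.summary())  # expect insignificant coefficients and an R-squared near zero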

Instead of looking at the Durbin-Watson statistic to realise we had serial autocorrelation, could we have just done this?

Consider a standard machine-learning pipeline (without realising we need to make the series stationary):

from sklearn import linear_model

y = samples_df['coffee_p'].values
X = samples_df[['event_dummy', 'oil_p']].values
# chronological 70/30 split (no shuffling, since this is a time-series)
y_train, y_test = y[:int(obs*0.7)], y[int(obs*0.7):]
X_train, X_test = X[:int(obs*0.7)], X[int(obs*0.7):]
# run sklearn linear regression
reg = linear_model.LinearRegression()
reg.fit(X=X_train, y=y_train)
prediction = reg.predict(X=X_test)
# out-of-sample MSE? 210
np.sum((y_test - prediction)**2) / len(prediction)

If we plot our predictions on the test data, we can already see that the model cannot predict it, and we avoid the mistake we could have fallen into with the initial method:
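A minimal plotting sketch (assuming matplotlib; the original post shows only the resulting figure):

import matplotlib.pyplot as plt

plt.plot(y_test, label='actual coffee_p (test)')
plt.plot(prediction, label='predicted coffee_p')
plt.legend()
plt.show()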

Thanks to Riemer Faber who first introduced me to this.

Of course, there are counter-examples, such as this fable:

Once upon a time, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks in trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set — output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest. This did not ensure, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that would not generalize to any new problem. Wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees. They had used only 50 of each for the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.
It turned out that in the researchers’ dataset, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

So I’m curious: what do you think?
