Notes on the G-Research competition

In this post, I’ll explain my approach to the G-Research financial forecasting challenge, a Kaggle-style machine learning competition. I was busy with other projects during my free time and was only able to work on it for the last two weeks of the competition. Given the little time available, I chose to focus on feature engineering rather than model building. My submission ended up in 23rd place on the private leaderboard (out of about 400 entries). This was my first Kaggle-like competition, so feel free to comment if you think there are things I should have done differently.

The Jupyter notebooks for data exploration and feature engineering can be found in this GitHub repository.

Dataset

According to the challenge website, the response is “an element of the return of a financial time series”. This is probably the returns of an actively managed portfolio, adjusted for common risk factors.

The features contain integers referring to the day the return was achieved, the stock that was traded and the market in which it was traded. It is harder to guess what the other features are. According to the challenge website:

- x0…x3E: Predictors that relate to the observed behaviour of the instrument on the day in question. The features labelled ‘x3A’, ‘x3B’, etc. are strongly related and might be expected to be highly correlated.
- x4…x6: Predictors that describe the ‘typical’ behaviour that we would expect of the instrument on that day.

The loss function is a weighted squared error, with the weights being provided only in the train set.
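The exact normalisation used on the leaderboard isn’t spelled out here, but a plain weighted mean squared error can be computed as follows (a minimal sketch; the official metric may rescale this):

import numpy as np

def weighted_mse(y_true, y_pred, weights):
    # Weighted average of the squared errors; the leaderboard metric
    # is presumably a (possibly rescaled) variant of this.
    return np.average((y_true - y_pred) ** 2, weights=weights)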

Data exploration

Let’s dive into the dataset. Some general properties are as follows.

Train dataset:
Number of data points: 623817
Number of markets: 4
Number of days: 258
Number of stocks: 3022
Number of features: 11
Sparsity of the training data: 3.36374745623
Test dataset:
Number of samples: 640430
Number of markets: 4
Number of days: 258
Number of stocks: 3017
Number of features: 11
Sparsity of the test data: 3.37179421103

The sparsity is exp(log(#samples)/#features), i.e. #samples^(1/#features), and has the following interpretation. Suppose your samples fit in a hypercube of dimension #features, and that you cut the hypercube along each axis to get smaller hypercubes, each containing a single sample. The sparsity tells you how many subdivisions you need per axis. The lower it is, the harder you’ll be hit by the curse of dimensionality. It gives an idea of the density of the samples, adjusted for the number of features. Despite the large number of samples and the reasonable number of features, we only need to cut the hypercube between 3 and 4 times along each axis on average to isolate each sample. This suggests that we’ll need regularization. (This is of course a very rough analysis, ignoring any correlation between the features.)
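For concreteness, the sparsity figure quoted above is just this one-liner (here with the train set numbers):

import numpy as np

n_samples, n_features = 623817, 11
# exp(log(n)/d) is the same as n**(1/d): the number of cuts per axis needed,
# on average, to isolate each sample in a d-dimensional hypercube
sparsity = np.exp(np.log(n_samples) / n_features)
print(sparsity)  # ~3.36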

Inspection also shows that each stock belongs to a single market. Here is a count of the days and stocks per market in the train and test set:

Count train:
Market  Days  Stocks
1        249     452
2        251     546
3        248    1793
4        256     231

Count test:
Market  Days  Stocks
1        253     452
2        255     544
3        256    1790
4        255     231

Let us now plot the days for which we have data in the train set, in the test set and in either of them:

Days with available data (in yellow) in the train set, the test set and either of them. The data is arranged in rows of 7 days to allow the identification of the day of the week.

The weekends, when no trading takes place, are clearly recognizable, which allows us to map each day to a day of the week. A priori this is interesting, because financial time series often display seasonality; a posteriori, the corresponding features turned out not to be very predictive. We also notice that the split between the train and test sets is by day. This is very important, because we will have to split the validation set off the train set by day as well in order to predict the test score meaningfully.

A puzzling fact is that the train and test sets are temporally mixed. This is definitely not how one would design and test a model meant for production, where the past is known but the future is unknown. On the other hand, given the notorious non-stationarity of financial time series, maybe you would not be able to predict anything with a proper time split between the train and test set. This fact allows for some “cheating” in the feature engineering, in which we can construct predictive features on the test set that contain information about the future responses, weights and features.

Turning to stocks, we can explore how many days they are traded in the train and in the test set:

Train:
Day count:
count 3022.000000
mean 206.425215
std 71.953032
min 1.000000
25% 188.000000
50% 247.000000
75% 247.000000
max 256.000000
Test:
Day count:
count 3017.000000
mean 212.273782
std 74.141465
min 1.000000
25% 195.000000
50% 255.000000
75% 256.000000
max 256.000000

Let us now look at the weights:

mean         14.723355
std 24.390745
min 0.002797
25% 2.271488
50% 6.144597
75% 16.649687
max 694.001930

We see that the weights span a large range of scales, so we will have to use the log of the weight in feature engineering. Here are the same statistics for the response:

mean          0.000075
std 0.001047
min -0.071099
25% -0.000107
50% 0.000027
75% 0.000203
max 0.066617

The response seems distributed more or less symmetrically around zero.

Average response by market
Top: average response by day. Bottom: Compounded return, assuming the response is a return. That’s an impressive Sharpe ratio…

We now turn to the features. They look very much like the weight: positive, with even more extreme values. This calls for applying a log before using them. The problem is that some values are exactly zero, which puzzled me a lot. What quantity can you measure that gives positive values spanning five orders of magnitude, together with exactly vanishing values? So I decided to treat vanishing values as missing values, set them to the median, and take the log of all the features. After taking the log, you get nice Gaussian-looking distributions with a bit of skew, more or less all like the one below (see the notebook).

Distribution of log(x0) on the train set
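A minimal sketch of this preprocessing (zeros to median, then log), assuming the features sit in a pandas DataFrame with the column names used below:

import numpy as np
import pandas as pd

feature_cols = ['x0', 'x1', 'x2', 'x3A', 'x3B', 'x3C', 'x3D', 'x3E', 'x4', 'x5', 'x6']

def log_features(df, cols=feature_cols):
    # Treat exact zeros as missing, impute them with the column median,
    # then take the log (all remaining values are positive).
    out = df.copy()
    for col in cols:
        values = out[col].replace(0.0, np.nan)
        values = values.fillna(values.median())
        out[col] = np.log(values)
    return out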

A notable exception is x6, which takes the value 100 a disproportionate number of times (as well as other round values such as 200 or 150).

Another important fact is noticeable when plotting the features for a stock by day. Most features look like this:

Log(x2) on the train set, by day, for stock #467, the stock with the highest average weight. Missing values are interpolated.

They seem to be temporally uncorrelated. However, you get the following graph for log(x4):

Log(x4) on the train set, by day, for stock #467, the stock with the highest average weight. Missing values are interpolated.

log(x4) seems integrated, with strong temporal correlations for short time intervals. This will be important during feature engineering when we will have to fill missing values in moving averages. The weight and response did not show signs of temporal correlation (visually speaking; I did not properly test this).

Another surprise appears when boxplotting the (log) features in the train and test set:

log(x0), log(x5) and log(x3A…E) display some extreme values. There are various ways of dealing with such extreme values. I added binary features taking value 1 if the feature was below -40 and 0 otherwise. This is absolutely necessary if we plan to perform linear regression, for instance. As we will be using trees, which are not so sensitive to such extreme values, this is not so crucial, and indeed these extra features did not prove to be very predictive. One could alternatively have treated them as measurement errors and set them to the sample median. I didn’t have time to compare these strategies. It should also be emphasized that the number of outliers is rather low, so the way of dealing with them should not have a big influence on the result, provided one uses outlier-resilient methods such as trees.

The correlation heatmap shows that the features x3A…E are highly correlated. Again, this is not really a problem for trees.

Correlation heat map for the features (log) x0, x1, x2, x3A, x3B, x3C, x3D, x3E, x4, x5, x6.

I also checked that the boxplot and the correlations are stable across the time axis.

Feature engineering

As the data is time ordered, it is not convenient to have it in the form of a train set and a test set for feature engineering. I merged the two datasets into one and indexed it by market, stock and day.
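Roughly, the merge looks like this (file and column names are illustrative, not the exact ones from the competition):

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train['is_train'] = True   # keep track of which rows have a known response
test['is_train'] = False

full = pd.concat([train, test], sort=False)
full = full.set_index(['Market', 'Stock', 'Day']).sort_index()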

The merged dataset has missing values, for instance because some stocks are not traded on some days. As we will compute moving averages/medians, we need to fill this missing data somehow. There are essentially two ways to do that: fill along the time axis, or fill cross-sectionally. I adopted the second method, using the cross-sectional median, as we saw that some features have extreme values. I was also careful to compute the cross-sectional medians by market. It is quite possible that some of the markets are commodity/derivative/forex markets, which presumably behave quite differently from stocks, so it wouldn’t make sense to take medians across markets. The response and weight were filled similarly on the days belonging to the test set.

An exception to the filling rule above is x4. Recall that x4 is integrated, displaying strong short-term temporal correlations, so it wouldn’t make sense to fill it using a cross-sectional median. Rather, I filled it by interpolation or constant extrapolation along the time axis.
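A sketch of both filling strategies, assuming 'Market', 'Stock' and 'Day' are available as columns (reset the index if needed) and that the feature columns are named as above:

def fill_cross_sectional(df, cols):
    # Fill missing values with the median across the stocks of the
    # same market on the same day.
    for col in cols:
        med = df.groupby(['Market', 'Day'])[col].transform('median')
        df[col] = df[col].fillna(med)
    return df

def fill_temporal(df, col='x4'):
    # x4 is integrated, so fill it along the time axis instead:
    # interpolate between observed days, then extrapolate with the
    # nearest observed value at the edges.
    df[col] = (df.groupby(['Market', 'Stock'])[col]
                 .transform(lambda s: s.interpolate().ffill().bfill()))
    return df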

I replaced null or negative feature values in the same way and took the log of all the features. From now on, when I refer to feature x, I really mean log(x), and similarly for the weight. I then built the following features.

Moving averages of the weights and responses: I constructed moving averages of the response and weight over windows of 10, 20 and 40 days, both forward-looking and backward-looking (see the sketch after this list). Some remarks:

  • It is crucial to exclude the present value when taking these averages, otherwise we would be leaking the response into the features. If we were to use recurrent neural networks, we would also have to be more careful with such features for the same reason.
  • I used relatively large windows to make sure that each day in the test set had a reasonable number of original (as opposed to filled) responses and weights in its window.
  • This is where Kaggle-like competitions differ from real-world problems. If we really cared about building a model able to predict the future from the past, we should not include forward-looking moving averages.
  • It turns out that these were among the most predictive features. But I was very surprised not to see an obvious difference between the forward- and backward-looking moving averages. You’d expect it to be easier to predict the present knowing the future than knowing the past, but that didn’t seem to be the case.
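Here is a sketch of backward- and forward-looking rolling means that exclude the present value (column names are mine):

def past_and_future_means(df, col, window):
    # Backward- and forward-looking rolling means of `col`, per market/stock,
    # both excluding the current day's value.
    grouped = df.groupby(['Market', 'Stock'])[col]
    # shift(1) drops today's value before averaging the previous `window` days
    past = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    # reverse the series to get the forward-looking ("cheating") average
    future = grouped.transform(
        lambda s: s[::-1].shift(1).rolling(window, min_periods=1).mean()[::-1])
    df[col + '_past_' + str(window)] = past
    df[col + '_future_' + str(window)] = future
    return df

Swapping .mean() for .median() (and dropping the shift) gives the moving medians of the features described below.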

Difference of x4: The difference of x4 is dx4(t) = x4(t) - x4(t-1). This is a standard way of dealing with an integrated time series: differencing it yields a non-integrated one. As with the moving averages, I “cheated” and also included the forward-looking difference.
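In pandas this is a grouped diff; the forward-looking version just shifts the other way:

# Backward difference: x4(t) - x4(t-1), computed per market/stock
df['dx4'] = df.groupby(['Market', 'Stock'])['x4'].diff()
# Forward-looking difference: x4(t+1) - x4(t)
df['dx4_fwd'] = -df.groupby(['Market', 'Stock'])['x4'].diff(-1)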

Moving medians for all the features: I constructed the forward- and backward-looking moving medians with windows 3, 10 and 20 days for all the features. No need to exclude the present value in this case. I used medians instead of means because we saw that some features had rather extreme values to which means would be oversensitive.

Temporal robust z-scores: I computed a robust z-score for each stock and each feature along the time axis. Instead of the standard z-score (subtract the mean, divide by the standard deviation), I subtracted the median and divided by the interquartile range, again because of the extreme values of some features. Note again that if we really cared about predicting the future from the past, we would need to restrict these z-scores to past values only to avoid look-ahead bias.

Cross-sectional robust z-scores: Similarly, I computed a cross-sectional robust z-score for each market and for each day.
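Both z-scores follow the same (value - median) / IQR pattern, just grouped along different axes; a sketch:

def robust_zscore(series):
    # (value - median) / interquartile range, much less sensitive to
    # extreme values than (value - mean) / standard deviation
    med = series.median()
    iqr = series.quantile(0.75) - series.quantile(0.25)
    return (series - med) / iqr

# Temporal: per market/stock, along the time axis
df['x0_z_time'] = df.groupby(['Market', 'Stock'])['x0'].transform(robust_zscore)
# Cross-sectional: per market/day, across stocks
df['x0_z_cs'] = df.groupby(['Market', 'Day'])['x0'].transform(robust_zscore)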

Beta of the stocks: I assumed that the response was really a return and computed the beta of each stock, i.e. essentially the regression coefficient of its return on the cross-sectional market return. Now I realize I should have computed the beta market by market.
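A sketch of the beta computation under that assumption, with the market return proxied by the daily cross-sectional mean response, here taken per market as I should have done ('y' stands for the response column):

def stock_betas(df):
    # Beta of each stock: cov(stock return, market return) / var(market return),
    # with the market return proxied by the daily cross-sectional mean response.
    df = df.copy()
    df['mkt_ret'] = df.groupby(['Market', 'Day'])['y'].transform('mean')

    def beta(g):
        return g['y'].cov(g['mkt_ret']) / g['mkt_ret'].var()

    return df.groupby(['Market', 'Stock']).apply(beta).rename('beta')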

Days of the week: In the data exploration, we were able to assign a day of the week to each day. I one-hot encoded those. It is common for financial time series to display seasonality on various time scales, but these features did not turn out to be very predictive.

Extreme values: I also added a binary feature for each original feature, taking value 1 if the (log) feature is smaller than -40, and 0 otherwise. These features were useful for the simple linear regression baseline, not so much for the tree-based models.

I split off the validation set by randomly picking 20% of the days in the train set, being careful to respect the proportions of the days of the week, in case there is any weekly seasonal effect.
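A sketch of this day-level split, stratified by day of the week (again with illustrative column names):

from sklearn.model_selection import train_test_split

# one row per day, with its day of the week
days = train[['Day', 'day_of_week']].drop_duplicates()

train_days, val_days = train_test_split(
    days['Day'], test_size=0.2, stratify=days['day_of_week'], random_state=0)

train_split = train[train['Day'].isin(train_days)]
val_split = train[train['Day'].isin(val_days)]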

Model building

As already mentioned, this will be very basic. Better results can probably be obtained by stacking dozens of models.

The models were built using sklearn. Because the response is rather small, of the order of 10^-4, the models generally don’t train very well with default parameters. To address this, I simply multiplied the response by 1000, trained the models and divided all the predictions by 1000. Plain linear regression yielded a score of 0.440569, around rank 300 (out of about 400). I then moved on to gradient boosting and random forests, with random forests achieving a slightly better score. After some playing with the hyperparameters, I noticed that using deeper trees in the gradient boosting (with some regularization) gave better results on the train set (expected: the model is just overfitting), but, strangely enough, also on the validation set. Note that this is officially not what you’re supposed to do: boosting combines low-variance, high-bias weak learners, each new one trained to correct the errors made so far. Using strong learners in a boosting algorithm almost guarantees overfitting.

I ended up with a model of 2500 trees of depth 12 achieving 0.111474 on the train set and 0.325448 on the validation set. (For reference the winning entry achieved 0.282720 on the private test set.) The “deep” gradient boosting model was very badly overfitting the training set, as expected, yet it was performing markedly better than my other non-overfit models on the validation set.
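For reference, here is a sketch of that final setup in sklearn (the tree count and depth are as above; the remaining hyperparameters and variable names are placeholders):

from sklearn.ensemble import GradientBoostingRegressor

SCALE = 1000.0  # the raw response is tiny, so rescale it before fitting

model = GradientBoostingRegressor(
    n_estimators=2500,
    max_depth=12,
    learning_rate=0.01,   # assumed value for "some regularization"
    subsample=0.5,        # assumed row subsampling, also for regularization
)
# X_train, y_train, w_train, X_test stand for the engineered features,
# response, weights and test features
model.fit(X_train, y_train * SCALE, sample_weight=w_train)
predictions = model.predict(X_test) / SCALE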

The model ended up with 0.301747 on the public test set and 0.291949 on the private test set. Comparing my validation score and my public test score, I worried that I had just gotten lucky on the public test set when I submitted my final entry. It turned out not to be the case, as I jumped from 36th on the public leaderboard to 23rd on the private leaderboard.

It’s a good reminder that there is nothing wrong with overfitting, as long as you’re aware of it and don’t take the overfit performance as your expectation for out-of-sample performance.

Finally, here are other ideas I couldn’t try for lack of time:

  • Use unsupervised machine learning along the temporal or cross-sectional axis, to cluster similar days or stocks, and one-hot encode the cluster number to create new features.
  • Plot training and validation losses as functions of the hyperparameters to tune them properly.
  • Try RNNs, although returns series seem to be generally too noisy for deep neural networks.