# Application of Machine Learning Techniques to Trading

Auquan recently concluded another version of QuantQuest, and this time, we had a lot of people attempt Machine Learning with our problems. It was good learning for both us and them (hopefully!). This post is inspired by our observations of some common caveats and pitfalls during the competition when trying to apply ML techniques to trading problems.

If you haven’t read our previous posts, we recommend going through our guide on building automated systems and A Systematic Approach to Developing Trading Strategies before this post.

The final output of a trading strategy should answer the following questions:

• DIRECTION: identify if an asset is cheap/expensive/fair value
• ENTRY TRADE: if an asset is cheap/expensive, should you buy/sell it
• EXIT TRADE: if an asset is fairly priced and we hold a position in that asset (bought or sold it earlier), should you exit that position
• PRICE RANGE: which price (or range) to make this trade at
• QUANTITY: amount of capital to trade (for example, number of shares of a stock)

Machine Learning can be used to answer each of these questions, but for the rest of this post, we will focus on answering the first, Direction of trade.

### Strategy Approach

There are two broad approaches to building strategies: model-based and data mining. They are essentially opposite approaches. In model-based strategy building, we start with a model of a market inefficiency, construct a mathematical representation (e.g. of price or returns) and test its validity over the long term. This model is usually a simplified representation of the true, complex model, and its long-term significance and stability need to be verified. Common trend-following, mean-reversion and arbitrage strategies fall in this category.

In the data mining approach, on the other hand, we first look for price patterns and then attempt to fit an algorithm to them. What causes these patterns is not important, only that the patterns identified will continue to repeat in the future. This is a blind approach, and we need rigorous checks to distinguish real patterns from random ones. Trial-and-error TA, candle patterns and regression on a large number of features fall in this category.

Clearly, Machine Learning lends itself easily to the data mining approach. Let’s look into how we can use ML to create a trade signal by data mining.

You can follow along the steps in this model using this IPython notebook. The code samples use Auquan’s python based free and open source toolbox. You can install it via pip: `pip install -U auquan_toolbox`. We use scikit learn for ML models. Install it using `pip install -U scikit-learn`.

### Using ML to create a Trading Strategy Signal — Data Mining

Before we begin, a sample ML problem setup looks like below

We create features which could have some predictive power (X), a target variable that we’d like to predict(Y) and use historical data to train a ML model that can predict Y as close as possible to the actual value. Finally, we use this model to make predictions on new data where Y is unknown. This leads to our first step:

### Step 1: Set up your problem

What are you trying to predict? What is a good prediction? How do you evaluate predictions?

In our framework above, what is Y?

Are you predicting price at a future time, future return/PnL or a buy/sell signal, optimizing portfolio allocation, attempting efficient execution, etc.?
Let’s say we’re trying to predict price at the next time stamp. In that case, Y(t) = Price(t+1). Now we can complete our framework with historical data

Note Y(t) will only be known during a backtest, but when using our model live, we won’t know Price(t+1) at time t. We make a prediction Y(Predicted,t) using our model and compare it with actual value only at time t+1. This means you cannot use Y as a feature in your predictive model.
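To make the alignment concrete, here is a minimal sketch of building Y(t) = Price(t+1) with pandas. The price values and the column names `price` and `target` are made up for illustration; they are not part of the QuantQuest data.

```python
import pandas as pd

# Hypothetical price series; in practice this comes from your data source.
frame = pd.DataFrame({"price": [100.0, 101.0, 100.5, 102.0, 101.5]})

# Y(t) = Price(t+1): pull the next row's price back onto the current row.
frame["target"] = frame["price"].shift(-1)

# The last row has no known future price yet, so it cannot be used for training.
train = frame.dropna()
print(train)
```

The `target` column is only used as the label. It is never fed back in as a feature, because at time t it is not yet known.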

Once we know our target, Y, we can also decide how to evaluate our predictions. This is important to distinguish between different models we will try on our data. Choose a metric that is a good indicator of our model efficiency based on the problem we are solving. For example, if we are predicting price, we can use the Root Mean Square Error as a metric. Some common metrics(RMSE, logloss, variance score etc) are pre-coded in Auquan’s toolbox and available under features.
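As a reference point, RMSE is simple enough to write by hand. The sketch below is a plain-Python version for illustration, not the toolbox's implementation; the function name `rmse` and the sample values are assumptions.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(actual, predicted))  # one error of 2 over four points -> 1.0
```

Because the errors are squared before averaging, a single large miss dominates the score, which is usually what you want when large mispredictions are costly.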

For demonstration, we’re going to use a problem from QuantQuest(Problem 1). We are going to create a prediction model that predicts future expected value of basis, where:

basis = Price of Stock - Price of Future

basis(t) = S(t) - F(t)

Y(t) = future expected value of basis = Average(basis(t+1), basis(t+2), basis(t+3), basis(t+4), basis(t+5))
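The forward-looking average can be computed with a rolling mean followed by a negative shift. The sketch below checks this on a toy series; the helper name `forward_mean` and the numbers are illustrative assumptions.

```python
import pandas as pd

def forward_mean(series, period):
    # rolling(period).mean() at row t averages rows t-period+1 .. t.
    # Shifting by -period moves the average of rows t+1 .. t+period onto row t.
    return series.rolling(period).mean().shift(-period)

basis = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
target = forward_mean(basis, 5)
print(target)
# Row 0 holds mean(2, 3, 4, 5, 6) = 4.0 and row 1 holds mean(3, 4, 5, 6, 7) = 5.0.
# The last five rows are NaN because their future values are not available.
```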

Since this is a regression problem, we will evaluate the model on RMSE. We’ll also use Total Pnl as an evaluation criterion

Our Objective: Create a model so that predicted value is as close as possible to Y

### Step 2: Collect Reliable Data

Collect and clean data that helps you solve the problem at hand

You need to think about what data will have predictive power for the target variable Y? If we were predicting Price, you could use Stock Price Data, Stock Trade Volume Data, Fundamental Data, Price and Volume Data of Correlated stocks, an Overall Market indicator like Stock Index Level, Price of other correlated assets etc.

You will need to set up access to this data, make sure it is accurate and free of errors, and handle missing data (quite common). Also ensure your data is unbiased and adequately represents all market conditions (for example, an equal number of winning and losing scenarios) to avoid bias in your model. You may also need to clean your data for dividends, stock splits, rolls, etc.
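One common way to handle gaps in quote data is to carry the last known value forward. This is a sketch with made-up bid/ask values, not the cleaning the toolbox performs. Forward filling uses only past information, whereas back filling would copy future values into the past and leak information.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level quotes with a few missing values.
raw = pd.DataFrame(
    {"bid": [10.0, np.nan, 10.2, np.nan], "ask": [10.1, 10.2, np.nan, 10.4]},
    index=pd.date_range("2024-01-01 09:15", periods=4, freq="min"),
)

# Count the gaps before deciding how to treat them.
missing_before = raw.isna().sum()
print(missing_before)

# Carry the last observed quote forward; never fill from later rows.
clean = raw.ffill()
print(clean)
```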

If you’re using Auquan’s Toolbox, we provide access to free data from Google, Yahoo, NSE and Quandl. We also pre-clean the data for dividends, stock splits and rolls and load it in a format that rest of the toolbox understands.

For our demo problem, we are using the following data for a dummy stock ‘MQK’ at minute intervals for trading days over one month (~8000 data points): Stock Bid Price, Ask Price, Bid Volume, Ask Volume, Future Bid Price, Ask Price, Bid Volume, Ask Volume, Stock VWAP, Future VWAP. This data is already cleaned for dividends, splits and rolls.

```python
# Load the data
import numpy as np
import pandas as pd
from backtester.dataSource.quant_quest_data_source import QuantQuestDataSource

cachedFolderName = '/Users/chandinijain/Auquan/qq2solver-data/historicalData/'
dataSetId = 'trainingData1'
instrumentIds = ['MQK']
ds = QuantQuestDataSource(cachedFolderName=cachedFolderName,
                          dataSetId=dataSetId,
                          instrumentIds=instrumentIds)

def loadData(ds):
    book_data = ds.getBookDataByFeature()  # dictionary of dataframes, one per feature
    data = None
    for key in book_data.keys():
        if data is None:
            data = pd.DataFrame(np.nan, index=book_data[key].index, columns=[])
        data[key] = book_data[key]
    # Mid prices: the sum is parenthesized so that bid and ask are both halved
    data['Stock Price'] = (book_data['stockTopBidPrice'] +
                           book_data['stockTopAskPrice']) / 2.0
    data['Future Price'] = (book_data['futureTopBidPrice'] +
                            book_data['futureTopAskPrice']) / 2.0
    data['Y(Target)'] = book_data['basis'].shift(-5)
    del data['benchmark_score']
    del data['FairValue']
    return data

data = loadData(ds)
```

Auquan’s Toolbox has downloaded and loaded the data into a dictionary of dataframes for you. We now need to prepare the data in a format we like. The function `ds.getBookDataByFeature()` returns a dictionary of dataframes, one dataframe per feature. We create a new `data` dataframe for the stock with all the features.

### Step 3: Split Data

Create Training, Cross-Validation and Test Datasets from the data

This is an extremely important step! Before we proceed any further, we should split our data into training data to train your model and test data to evaluate model performance. Recommended split: 60–70% training and 30–40% test

Since training data is used to fit model parameters, your model will likely be overfit to training data, and training data metrics will be misleading about model performance. If you do not keep any separate test data and use all your data to train, you will not know how well or badly your model performs on new unseen data. This is one of the major reasons why well-trained ML models fail on live data: people train on all available data and get excited by training data metrics, but the model fails to make any meaningful predictions on live data that it wasn’t trained on.

There is a problem with this method. If we repeatedly train on training data, evaluate performance on test data and optimise our model till we are happy with performance we have implicitly made test data a part of training data. Eventually our model may perform well for this set of training and test data, but there is no guarantee that it will predict well on new data.

To solve for this we can create a separate validation data set. Now you can train on training data, evaluate performance on validation data, optimise till you are happy with performance, and finally test on test data. This way the test data stays untainted and we don’t use any information from test data to improve our model.

Remember once you do check performance on test data don’t go back and try to optimise your model further. If you find that your model does not give good results discard that model altogether and start fresh. Recommended split could be 60% training data, 20% validation data and 20% test data.
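A 60/20/20 split for time series should be chronological: train on the oldest data, validate on the next block and test on the most recent block, without shuffling. A minimal sketch follows; the function name and the percentage parameters are illustrative assumptions.

```python
def chronological_split(rows, train_pct=60, validation_pct=20):
    """Split an ordered sequence into train/validation/test without shuffling."""
    n = len(rows)
    train_end = n * train_pct // 100
    validation_end = n * (train_pct + validation_pct) // 100
    return rows[:train_end], rows[train_end:validation_end], rows[validation_end:]

# Ten observations in time order stand in for real data.
rows = list(range(10))
train, validation, test = chronological_split(rows)
print(train, validation, test)  # [0, 1, 2, 3, 4, 5] [6, 7] [8, 9]
```

The same slicing works on a pandas DataFrame sorted by time if you pass `df.iloc` style positional slices instead of a list.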

For our problem we have three datasets available. We will use the first as the training set, the second as the validation set and the third as our test set.

```python
# Training Data
dataSetId = 'trainingData1'
ds_training = QuantQuestDataSource(cachedFolderName=cachedFolderName,
                                   dataSetId=dataSetId,
                                   instrumentIds=instrumentIds)
training_data = loadData(ds_training)

# Validation Data
dataSetId = 'trainingData2'
ds_validation = QuantQuestDataSource(cachedFolderName=cachedFolderName,
                                     dataSetId=dataSetId,
                                     instrumentIds=instrumentIds)
validation_data = loadData(ds_validation)

# Test Data
dataSetId = 'trainingData3'
ds_test = QuantQuestDataSource(cachedFolderName=cachedFolderName,
                               dataSetId=dataSetId,
                               instrumentIds=instrumentIds)
out_of_sample_test_data = loadData(ds_test)
```

To each of these, we add the target variable Y, defined as average of next five values of basis

```python
def prepareData(data, period):
    # Average of the next `period` values of basis, aligned to the current row
    data['Y(Target)'] = data['basis'].rolling(period).mean().shift(-period)
    if 'FairValue' in data.columns:
        del data['FairValue']
    data.dropna(inplace=True)

period = 5
prepareData(training_data, period)
prepareData(validation_data, period)
prepareData(out_of_sample_test_data, period)
```

### Step 4: Feature Engineering

Analyze behavior of your data and Create features that have predictive power

Now comes the real engineering. The golden rule of feature selection is that the predictive power should come primarily from the features and not from the model. You will find that the choice of features has a far greater impact on performance than the choice of model. Some pointers for feature selection:

• Don’t randomly choose a very large set of features without exploring relationship with target variable
• Features with little or no relationship to the target variable will likely lead to overfitting
• Your features might be highly correlated with each other, in that case a fewer number of features will explain the target just as well
• I generally create a few features that make intuitive sense, look at correlation of target variable with those features, as well as their inter correlation to decide what to use
• You could also try ranking candidate features according to Maximal Information Coefficient (MIC), performing Principal Component Analysis(PCA) and other methods
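The pointer about inter-correlated features can be sketched as a simple greedy filter: keep a feature only if it is not highly correlated with any feature already kept. The helper name `drop_correlated`, the 0.8 threshold and the synthetic data below are illustrative assumptions, not a recommendation for a specific cutoff.

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.8):
    """Greedily keep columns whose absolute correlation with kept columns is <= threshold."""
    corr = features.corr().abs()
    keep = []
    for column in features.columns:
        if all(corr.loc[column, kept] <= threshold for kept in keep):
            keep.append(column)
    return features[keep]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# "a_scaled" is a linear function of "a", so it carries no extra information.
features = pd.DataFrame({"a": a, "a_scaled": 2.0 * a + 1.0, "b": b})

reduced = drop_correlated(features)
print(list(reduced.columns))
```

The result depends on column order, since earlier columns are kept in preference to later ones. Order your candidates by how much you trust them.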

#### Feature Transformation/Normalization:

ML models tend to perform well with normalization. However, normalization is tricky when working with time series data because future range of data is unknown. Your data could fall out of bounds of your normalization leading to model errors. Still you could try to enforce some degree of stationarity:

• Scaling: divide features by standard deviation or interquartile range
• Centering: subtract historical mean from current value
• Normalization: both of the above, (x - mean)/stdev over the lookback period
• Regular normalization: rescale data to the range -1 to +1 over the lookback period, i.e. compute (x - min)/(max - min), which lies between 0 and 1, and re-center it as 2(x - min)/(max - min) - 1

Note since we are using historical rolling mean, standard deviation, max or min over lookback period, the same normalized value of feature will mean different actual value at different times. For example, if the current value of feature is 5 with a rolling 30-period mean of 4.5, this will transform to 0.5 after centering. Later if the rolling 30-period mean changes to 3, a value of 3.5 will transform to 0.5. This may be a cause of errors in your model; hence normalization is tricky and you have to figure what actually improves performance of your model(if at all).
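The effect described above is easy to reproduce. The sketch below uses a 2-period lookback instead of 30 to keep the numbers small; the helper names and values are illustrative. Note that the rolling window here includes the current value.

```python
import pandas as pd

def rolling_center(series, lookback):
    """Subtract the rolling mean over the lookback window (current row included)."""
    return series - series.rolling(lookback).mean()

def rolling_zscore(series, lookback):
    """Center and scale by the rolling mean and standard deviation."""
    mean = series.rolling(lookback).mean()
    std = series.rolling(lookback).std()
    return (series - mean) / std

feature = pd.Series([4.0, 5.0, 2.5, 3.5])
centered = rolling_center(feature, 2)
print(centered)
# Row 1: value 5.0 with rolling mean 4.5 -> 0.5
# Row 3: value 3.5 with rolling mean 3.0 -> 0.5
```

Two quite different raw values map to the same centered value, which is exactly why the model sees them as identical inputs.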

If you are using our toolbox, it already comes with a set of pre coded features for you to explore.

For this first iteration in our problem, we create a large number of features, using a mix of parameters. Later we will see if we can reduce the number of features.

```python
def difference(dataDf, period):
    return dataDf.sub(dataDf.shift(period), fill_value=0)

def ewm(dataDf, halflife):
    return dataDf.ewm(halflife=halflife, ignore_na=False,
                      min_periods=0, adjust=True).mean()

def rsi(data, period):
    data_upside = data.sub(data.shift(1), fill_value=0)
    data_downside = data_upside.copy()
    data_downside[data_upside > 0] = 0
    data_upside[data_upside < 0] = 0
    avg_upside = data_upside.rolling(period).mean()
    avg_downside = - data_downside.rolling(period).mean()
    rsi_values = 100 - (100 * avg_downside / (avg_downside + avg_upside))
    rsi_values[avg_downside == 0] = 100
    rsi_values[(avg_downside == 0) & (avg_upside == 0)] = 0
    return rsi_values

def create_features(data):
    basis_X = pd.DataFrame(index=data.index, columns=[])

    basis_X['mom3'] = difference(data['basis'], 4)
    basis_X['mom5'] = difference(data['basis'], 6)
    basis_X['mom10'] = difference(data['basis'], 11)

    basis_X['rsi15'] = rsi(data['basis'], 15)
    basis_X['rsi10'] = rsi(data['basis'], 10)

    basis_X['emabasis3'] = ewm(data['basis'], 3)
    basis_X['emabasis5'] = ewm(data['basis'], 5)
    basis_X['emabasis7'] = ewm(data['basis'], 7)
    basis_X['emabasis10'] = ewm(data['basis'], 10)

    basis_X['basis'] = data['basis']
    basis_X['vwapbasis'] = data['stockVWAP'] - data['futureVWAP']

    # Multi-line expressions are wrapped in parentheses so they parse correctly
    basis_X['swidth'] = (data['stockTopAskPrice'] -
                         data['stockTopBidPrice'])
    basis_X['fwidth'] = (data['futureTopAskPrice'] -
                         data['futureTopBidPrice'])

    basis_X['btopask'] = (data['stockTopAskPrice'] -
                          data['futureTopAskPrice'])
    basis_X['btopbid'] = (data['stockTopBidPrice'] -
                          data['futureTopBidPrice'])
    basis_X['totalaskvol'] = (data['stockTotalAskVol'] -
                              data['futureTotalAskVol'])
    basis_X['totalbidvol'] = (data['stockTotalBidVol'] -
                              data['futureTotalBidVol'])

    basis_X['emabasisdi7'] = (basis_X['emabasis7'] -
                              basis_X['emabasis5'] +
                              basis_X['emabasis3'])

    basis_X = basis_X.fillna(0)

    basis_y = data['Y(Target)']
    basis_y.dropna(inplace=True)

    print("Any null data in y: %s, X: %s"
          % (basis_y.isnull().values.any(),
             basis_X.isnull().values.any()))
    print("Length y: %s, X: %s"
          % (len(basis_y.index), len(basis_X.index)))

    return basis_X, basis_y

basis_X_train, basis_y_train = create_features(training_data)
# Note: the "_test" variables below hold the validation set, not the final test set
basis_X_test, basis_y_test = create_features(validation_data)
```

### Step 5: Model Selection

Choose an appropriate statistical/ML model based on chosen problem

The choice of model will depend on the way the problem is framed. Are you solving a supervised learning problem (every point X in the feature matrix maps to a target variable Y) or an unsupervised one (there is no given mapping and the model tries to learn unknown patterns)? Are you solving a regression problem (predict the actual price at a future time) or a classification problem (predict only the direction of price, increase or decrease, at a future time)?
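The same price data can be framed either way. As a rough sketch with made-up prices, a regression target is the next-step price change, while a classification target is just its sign. The variable names are illustrative.

```python
import pandas as pd

prices = pd.Series([100.0, 101.0, 100.5, 100.5, 102.0])
future_price = prices.shift(-1)

# Drop the last row, whose future price is unknown, before building labels.
valid = future_price.notna()

# Regression framing: predict the size of the next move.
regression_target = (future_price - prices)[valid]

# Classification framing: predict only whether the next move is up (1) or not (0).
direction_target = (future_price[valid] > prices[valid]).astype(int)

print(regression_target.tolist())  # [1.0, -0.5, 0.0, 1.5]
print(direction_target.tolist())   # [1, 0, 0, 1]
```

Masking with `valid` matters here: comparing against a missing future price would silently produce a "not up" label for the last row.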

Some common supervised learning algorithms to get you started are:

I recommend starting with a simple model, for example linear or logistic regression, and building up to more sophisticated models from there if needed. I also recommend reading up on the math behind a model instead of blindly using it as a black box.

### Step 6: Train, Validate and Optimize (Repeat steps 4–6)

Now you’re ready to finally build your model. At this stage, you really just iterate over models and model parameters. Train your model on training data, measure its performance on validation data, and go back, optimize, re-train and evaluate again. If you’re unhappy with a model’s performance, try using a different model. You loop over this stage multiple times till you finally have a model that you’re happy with.

Only when you have a model whose performance you like, proceed to the next step.

```python
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression(basis_X_train, basis_y_train,
                      basis_X_test, basis_y_test):

    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(basis_X_train, basis_y_train)
    # Make predictions using the testing set
    basis_y_pred = regr.predict(basis_X_test)

    # The coefficients
    print('Coefficients: \n', regr.coef_)

    # The mean squared error
    print("Mean squared error: %.2f"
          % mean_squared_error(basis_y_test, basis_y_pred))

    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % r2_score(basis_y_test,
                                            basis_y_pred))

    # Plot outputs: actual values on the x axis, predictions on the y axis
    plt.scatter(basis_y_test, basis_y_pred, color='black')
    plt.plot(basis_y_test, basis_y_test, color='blue', linewidth=3)

    plt.xlabel('Y(actual)')
    plt.ylabel('Y(Predicted)')

    plt.show()

    return regr, basis_y_pred

_, basis_y_pred = linear_regression(basis_X_train, basis_y_train,
                                    basis_X_test, basis_y_test)
```
```
('Coefficients: \n', array([ -1.0929e+08, 4.1621e+07, 1.4755e+07, 5.6988e+06, -5.656e+01, -6.18e-04, -8.2541e-05,4.3606e-02, -3.0647e-02, 1.8826e+07, 8.3561e-02, 3.723e-03, -6.2637e-03, 1.8826e+07, 1.8826e+07, 6.4277e-02, 5.7254e-02, 3.3435e-03, 1.6376e-02, -7.3588e-03, -8.1531e-04, -3.9095e-02, 3.1418e-02, 3.3321e-03, -1.3262e-06, -1.3433e+07, 3.5821e+07, 2.6764e+07, -8.0394e+06, -2.2388e+06, -1.7096e+07]))
Mean squared error: 0.02
Variance score: 0.96
```

Look at the model coefficients. We can’t really compare them or tell which ones are important since they are all on different scales. Let’s try normalization to bring them to the same scale and also enforce some stationarity.

```python
def normalize(basis_X, basis_y, period):
    basis_X_norm = ((basis_X - basis_X.rolling(period).mean()) /
                    basis_X.rolling(period).std())
    basis_X_norm.dropna(inplace=True)
    basis_y_norm = ((basis_y - basis_X['basis'].rolling(period).mean()) /
                    basis_X['basis'].rolling(period).std())
    basis_y_norm = basis_y_norm[basis_X_norm.index]

    return basis_X_norm, basis_y_norm

norm_period = 375
basis_X_norm_test, basis_y_norm_test = normalize(basis_X_test, basis_y_test, norm_period)
basis_X_norm_train, basis_y_norm_train = normalize(basis_X_train, basis_y_train, norm_period)

regr_norm, basis_y_pred = linear_regression(basis_X_norm_train, basis_y_norm_train,
                                            basis_X_norm_test, basis_y_norm_test)

# Undo the normalization using the same lookback (norm_period) that was used to apply it
basis_y_pred = (basis_y_pred * basis_X_test['basis'].rolling(norm_period).std()[basis_y_norm_test.index]
                + basis_X_test['basis'].rolling(norm_period).mean()[basis_y_norm_test.index])
```
```
Mean squared error: 0.05
Variance score: 0.90
```

The model doesn’t improve on the previous model, but it’s not much worse either. And now we can actually compare coefficients to see which ones are actually important.

Let’s look at the coefficients

```python
for i in range(len(basis_X_train.columns)):
    print('%.4f, %s' % (regr_norm.coef_[i], basis_X_train.columns[i]))
```
19.8727, emabasis4
-9.2015, emabasis5
8.8981, emabasis7
-5.5692, emabasis10
-0.0036, rsi15
-0.0146, rsi10
0.0196, mom10
-0.0035, mom5
-7.9138, basis
0.0062, swidth
0.0117, fwidth
2.0311, btopbid
0.0611, bavgbid
0.0113, topbidvolratio
0.0231, totalbidvolratio

We can clearly see that some features have much larger coefficients than others, and probably have more predictive power.

Let’s also look at correlation between different features.

```python
import seaborn

c = basis_X_train.corr()
plt.figure(figsize=(10, 10))
# Mask out weakly correlated pairs so only |correlation| > 0.8 is shown
seaborn.heatmap(c, cmap='RdYlGn_r', mask=(np.abs(c) <= 0.8))
plt.show()
```

The areas of dark red indicate highly correlated variables. Let’s create/modify some features again and try to improve our model.

For example, I can easily discard features like emabasisdi7 that are just a linear combination of other features

```python
def create_features_again(data):
    basis_X = pd.DataFrame(index=data.index, columns=[])
    basis_X['mom10'] = difference(data['basis'], 11)

    basis_X['emabasis2'] = ewm(data['basis'], 2)
    basis_X['emabasis5'] = ewm(data['basis'], 5)
    basis_X['emabasis10'] = ewm(data['basis'], 10)

    basis_X['basis'] = data['basis']
    # Volume differences are scaled down to be comparable to the price features
    basis_X['totalaskvolratio'] = (data['stockTotalAskVol']
                                   - data['futureTotalAskVol']) / 100000
    basis_X['totalbidvolratio'] = (data['stockTotalBidVol']
                                   - data['futureTotalBidVol']) / 100000

    basis_X = basis_X.fillna(0)

    basis_y = data['Y(Target)']
    basis_y.dropna(inplace=True)

    return basis_X, basis_y

basis_X_test, basis_y_test = create_features_again(validation_data)
basis_X_train, basis_y_train = create_features_again(training_data)
_, basis_y_pred = linear_regression(basis_X_train, basis_y_train,
                                    basis_X_test, basis_y_test)
basis_y_regr = basis_y_pred.copy()
```
```
('Coefficients: ', array([ 0.03246139,0.49780982, -0.22367172,  0.20275786,  0.50758852,-0.21510795, 0.17153884]))
Mean squared error: 0.02
Variance score: 0.96
```

See, our model performance does not change, and we only need a few features to explain our target variable. I recommend playing with more features above, trying new combinations etc to see what can improve our model.

We can also try more sophisticated models to see if change of model may improve performance

#### K Nearest Neighbours

```python
from sklearn import neighbors

n_neighbors = 5
model = neighbors.KNeighborsRegressor(n_neighbors, weights='distance')
model.fit(basis_X_train, basis_y_train)
basis_y_pred = model.predict(basis_X_test)
basis_y_knn = basis_y_pred.copy()
```

#### SVR

```python
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1e3, gamma=0.1)
model.fit(basis_X_train, basis_y_train)
basis_y_pred = model.predict(basis_X_test)
basis_y_svr = basis_y_pred.copy()
```

#### Decision Trees

```python
from sklearn import ensemble

model = ensemble.ExtraTreesRegressor()
model.fit(basis_X_train, basis_y_train)
basis_y_pred = model.predict(basis_X_test)
basis_y_trees = basis_y_pred.copy()
```

### Step 7: Backtest on Test Data

Check performance on Real Out of Sample Data

This is the moment of truth. We run our final, optimized model from last step on that Test Data that we had kept aside at the start and did not touch yet.

This provides you with realistic expectation of how your model is expected to perform on new and unseen data when you start trading live. Hence, it is necessary to ensure you have a clean dataset that you haven’t used to train or validate your model.

If you don’t like the results of your backtest on test data, discard the model and start again. DO NOT go back and re-optimize your model, this will lead to over fitting! (Also recommend to create a new test data set, since this one is now tainted; in discarding a model, we implicitly know something about the dataset).

For backtesting, we use Auquan’s Toolbox

```python
import pandas as pd
import backtester
from backtester.features.feature import Feature
from backtester.trading_system import TradingSystem
from backtester.sample_scripts.fair_value_params import FairValueTradingParams

class Problem1Solver():

    def getTrainingDataSet(self):
        return "trainingData1"

    def getSymbolsToTrade(self):
        return ['MQK']

    def getCustomFeatures(self):
        # MyCustomFeature is not defined in this listing; it stands for your own Feature subclass
        return {'my_custom_feature': MyCustomFeature}

    def getFeatureConfigDicts(self):
        expma5dic = {'featureKey': 'emabasis5',
                     'featureId': 'exponential_moving_average',
                     'params': {'period': 5,
                                'featureName': 'basis'}}
        expma10dic = {'featureKey': 'emabasis10',
                      'featureId': 'exponential_moving_average',
                      'params': {'period': 10,
                                 'featureName': 'basis'}}
        # Key and period changed to match the 'emabasis2' lookup in getFairValue
        expma2dic = {'featureKey': 'emabasis2',
                     'featureId': 'exponential_moving_average',
                     'params': {'period': 2,
                                'featureName': 'basis'}}
        mom10dic = {'featureKey': 'mom10',
                    'featureId': 'difference',
                    'params': {'period': 11,
                               'featureName': 'basis'}}
        return [expma5dic, expma2dic, expma10dic, mom10dic]

    def getFairValue(self, updateNum, time, instrumentManager):
        # holder for all the instrument features
        lbInstF = instrumentManager.getlookbackInstrumentFeatures()
        mom10 = lbInstF.getFeatureDf('mom10').iloc[-1]
        emabasis2 = lbInstF.getFeatureDf('emabasis2').iloc[-1]
        emabasis5 = lbInstF.getFeatureDf('emabasis5').iloc[-1]
        emabasis10 = lbInstF.getFeatureDf('emabasis10').iloc[-1]
        basis = lbInstF.getFeatureDf('basis').iloc[-1]
        # Scaled by 100000 to match the features the coefficients were fitted on
        totalaskvol = (lbInstF.getFeatureDf('stockTotalAskVol').iloc[-1]
                       - lbInstF.getFeatureDf('futureTotalAskVol').iloc[-1]) / 100000
        totalbidvol = (lbInstF.getFeatureDf('stockTotalBidVol').iloc[-1]
                       - lbInstF.getFeatureDf('futureTotalBidVol').iloc[-1]) / 100000

        coeff = [0.03249183, 0.49675487, -0.22289464, 0.2025182,
                 0.5080227, -0.21557005, 0.17128488]
        # newdf is not defined in the original listing; an empty Series
        # keyed by instrument is assumed here
        newdf = pd.Series(dtype=float)
        newdf['MQK'] = (coeff[0] * mom10['MQK'] + coeff[1] * emabasis2['MQK'] +
                        coeff[2] * emabasis5['MQK'] + coeff[3] * emabasis10['MQK'] +
                        coeff[4] * basis['MQK'] + coeff[5] * totalaskvol['MQK'] +
                        coeff[6] * totalbidvol['MQK'])

        newdf.fillna(emabasis5, inplace=True)
        return newdf

problem1Solver = Problem1Solver()
tsParams = FairValueTradingParams(problem1Solver)
tradingSystem = TradingSystem(tsParams)
tradingSystem.startTrading(onlyAnalyze=False,
                           shouldPlot=True,
                           makeInstrumentCsvs=False)
```

### Step 8: Other ways to improve model

Rolling Validation, Ensemble Learning, Bagging, Boosting

Besides collecting more data, creating better features or trying more models, there are a few things you can try to train your model better.

1. Rolling Validation

Market conditions rarely stay the same. Say you have data for a year and you use Jan-August to train and Sep-Dec to test your model: you might end up training over a very specific set of market conditions. Maybe there was no market volatility in the first half of the year and some extreme news caused markets to move a lot in September. Your model will not have learned this pattern and will give you junk results.

It might be better to try a walk forward rolling validation — train over Jan-Feb, validate over March, re-train over Apr-May, validate over June and so on.
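The walk-forward scheme described above can be sketched as a small generator of index ranges. The function name and the block sizes are illustrative assumptions; with 12 monthly blocks, a 2-block training window and a 1-block validation window, it reproduces the Jan-Feb/March, Apr-May/June pattern.

```python
def walk_forward_splits(n_samples, train_size, validation_size):
    """Yield consecutive, non-overlapping (train, validation) index ranges."""
    start = 0
    while start + train_size + validation_size <= n_samples:
        train_idx = range(start, start + train_size)
        validation_idx = range(start + train_size,
                               start + train_size + validation_size)
        yield train_idx, validation_idx
        # Move past the validation block before re-training.
        start += train_size + validation_size

# Months 0..11 stand for Jan..Dec.
splits = [(list(tr), list(va)) for tr, va in walk_forward_splits(12, 2, 1)]
for train_idx, validation_idx in splits:
    print("train:", train_idx, "validate:", validation_idx)
```

If you prefer an expanding training window instead of disjoint blocks, scikit-learn's `TimeSeriesSplit` provides one out of the box.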

2. Ensemble Learning

Some models may work well in predicting certain scenarios and others in predicting other scenarios. Or a model may be extremely overfit in a certain scenario. One way of reducing both error and overfitting is to use an ensemble of different models. Your prediction is the average of the predictions made by many models, with errors from different models likely getting cancelled out or reduced. Some common ensemble methods are Bagging and Boosting.

To keep this post short, I will skip these methods, but you can read more about them here.
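To see why plain averaging helps, here is a toy sketch on synthetic data. It assumes four hypothetical models whose errors are independent, which is the best case; real models trained on the same data have correlated errors, so the gain will be smaller.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = np.sin(np.linspace(0, 6, 500))

# Four hypothetical models: each predicts the truth plus its own independent noise.
predictions = [truth + rng.normal(scale=0.5, size=truth.size) for _ in range(4)]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

individual = [mse(truth, p) for p in predictions]
ensemble = mse(truth, np.mean(predictions, axis=0))

print("individual MSEs:", [round(m, 3) for m in individual])
print("ensemble MSE:", round(ensemble, 3))
```

With independent errors, averaging four models cuts the error variance to roughly a quarter of a single model's.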

Let’s try an ensemble method for our problem

`basis_y_pred_ensemble = (basis_y_trees + basis_y_svr +                         basis_y_knn + basis_y_regr)/4`
Mean squared error: 0.02
Variance score: 0.95

All the code for the above steps is available in this IPython notebook. You can read more below:

### That was quite a lot of information. Let’s do a quick Recap:

• Collect reliable Data and clean Data
• Split Data into Training, Validation and Test sets
• Create Features and Analyze Behavior
• Choose an appropriate training model based on Behavior
• Use Training Data to train your model to make predictions
• Check performance on validation set and re-optimize
• Verify final performance on Test Set

Phew! But that’s not it. You only have a solid prediction model now. Remember what we actually wanted from our strategy? You still have to:

• Develop Signal to identify trade direction based on prediction model
• Develop Strategy to identify Entry/Exit Points
• Execution System to identify Sizing and Price

Important Note on Transaction Costs: Why are the next steps important? Your model tells you when your chosen asset is a buy or sell. It however doesn’t take into account fees/transaction costs/available trading volumes/stops etc. Transaction costs very often turn profitable trades into losers. For example, an asset with an expected \$0.05 increase in price is a buy, but if you have to pay \$0.10 to make this trade, you will end up with a net loss of -\$0.05. Our own great looking profit chart above actually looks like this after you account for broker commissions, exchange fees and spreads:

Transaction fees and spreads take up more than 90% of our Pnl! We will discuss these in detail in a follow-up post.
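The arithmetic in the example above is worth wiring into any signal as a simple filter. The sketch below is a deliberately simplified per-share view; the function names are illustrative, and a real cost model would also include spread, slippage and volume limits.

```python
def net_pnl_per_share(expected_move, cost_per_share):
    """Expected profit per share after paying the round-trip cost."""
    return expected_move - cost_per_share

def should_trade(expected_move, cost_per_share):
    """Trade only when the expected move, up or down, exceeds the cost."""
    return abs(expected_move) > cost_per_share

# The example from the text: a $0.05 expected gain against a $0.10 cost.
net = net_pnl_per_share(0.05, 0.10)
print(round(net, 2))             # -0.05
print(should_trade(0.05, 0.10))  # False
print(should_trade(0.25, 0.10))  # True
```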

Finally, let’s look at some common pitfalls.

### DO’s and DONT’s

• AVOID OVERFITTING AT ALL COSTS!
• Don’t retrain after every datapoint: This was a common mistake people made in QuantQuest. If your model needs re-training after every datapoint, it’s probably not a very good model. That said, it will need to be retrained periodically, just at a reasonable frequency (example retraining at the end of every week if making intraday predictions)
• Avoid biases, especially lookahead bias: This is another reason why models don’t work — Make sure you are not using any information from the future. Mostly this means, don’t use the target variable, Y as a feature in your model. This is available to you during a backtest but won’t be available when you run your model live, making your model useless.
• Be wary of data mining bias: Since we are trying a bunch of models on our data to see if anything fits, without an inherent reason behind why it fits, make sure you run rigorous tests to separate random patterns from real patterns which are likely to occur in the future. For example, what might seem like an upward trending pattern explained well by a linear regression may turn out to be a small part of a larger random walk!
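One cheap sanity check for lookahead bias is to compute a feature on the full history and again on a truncated history: if the values before the cutoff differ, the feature is reading the future. This is a sketch with hypothetical helper names and toy data, not a complete audit.

```python
import pandas as pd

def uses_future_data(feature_fn, series, cutoff):
    """Return True if feature values before `cutoff` change when later data is removed."""
    full = feature_fn(series).iloc[:cutoff]
    truncated = feature_fn(series.iloc[:cutoff])
    return not full.equals(truncated)

def safe_feature(series):
    # Rolling mean over the current and two previous rows: past data only.
    return series.rolling(3).mean()

def leaky_feature(series):
    # Next row's value pulled onto the current row: this is the future.
    return series.shift(-1)

basis = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 4.0, 7.0])
print(uses_future_data(safe_feature, basis, 5))   # False
print(uses_future_data(leaky_feature, basis, 5))  # True
```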

### Avoid Overfitting

This is so important, I feel the need to mention it again.

• Overfitting is the most dangerous pitfall of a trading strategy
• A complex algorithm may perform wonderfully on a backtest but fail miserably on new unseen data. Such an algorithm has not really uncovered any trend in the data and has no real predictive power; it is just fit very well to the data it has seen
• Keep your systems as simple as possible. If you find yourself needing a large number of complex features to explain your data, you are likely over fitting
• Divide your available data into training and test data and always validate performance on Real Out of Sample data before using your model to trade live.

Webinar Video: If you prefer listening to reading and would like to see a video version of this post, you can watch this webinar link instead.